Kinds of Query Parallelism

Inter-Query

Multiple SQL queries runs in parallel.

Intra-Query

Inter-Operator
- Pipeline
- Bushy (Tree): Sub-tree uses pipeline. Between sub-trees runs in parallel.

Pipeline	Bushy (Tree)

Intra-Operator
- Single query runs in parallel by partitioning.

Intra-Operator

Data Partition

Assume we have multiple machines, how can we partition data into these individual machines? Introduce several partitioning methods:

Range: Good for equijoins, range queries and group-by
Hash: Good for equijoins, group-by
Round-Robin: Good for spreading load

Parallel Scans

We can do parallel scans on multiple machines:

Simply scan in parallel and merge result as the ouput.
$σ_{p}$ : If we use range or hash partitioning, we can skip entire sites that have no tuples satisfying $p$ .
We can build indexes at each partition

Lookup by key

If data partitioned on function of key (range and hash for example), we can only lookup the relevant node.
Otherwise we have to broadcast lookup to all nodes.

Insert a key

Similarly, if data partitioned on function of key, we route the insert to relevant node.
Else, route insert to any node is ok.

Insert a unique key

Data partitioned, route the insert to the relevant node;
Else, broadcast lookup and collect results. If not exists, insert anywhere.

Hash Join

Naive Parallel Hash Join

Phase 1: shuffle each table across machines (using $h_{n}$ ).
Phase 2: receivers proceed with naive hashing in a pipeline as probe data streams in.

Grace Parallel Hash Join

Pass 1: parallel streaming
Pass 2: local Grace Hash Join per node
Every node waits for Pass 1 to end, and works at its top speed in Pass 2.

Sorting

Pass 0: shuffle data across machines
Pass 1-n: independently run as single-node sorting

Sort-Merge Join

Pass 0 to n-1: like parallel sorting above, but do it twice: once for each relation with same ranges
Pass n: merge join partitions locally on each node

Note: this picture is a 2-pass sort

Parallel Aggregates

Hierarchical aggregation

hcplantern's garden

Explorer

Recent Notes

9-设计模式

8-中间件

7-Spring

12-消息队列

11-设计题

Parallel Query