Skip to content

Commit

Permalink
docs: rfc of distributed planner (#1554)
Browse files Browse the repository at this point in the history
* docs: rfc of distributed planner

Signed-off-by: Ruihang Xia <[email protected]>

* Update docs/rfcs/2023-05-09-distributed-planner.md

Co-authored-by: LFC <[email protected]>

---------

Signed-off-by: Ruihang Xia <[email protected]>
Co-authored-by: dennis zhuang <[email protected]>
Co-authored-by: LFC <[email protected]>
  • Loading branch information
3 people authored May 11, 2023
1 parent 8fef32f commit 7a310cb
Showing 1 changed file with 137 additions and 0 deletions.
137 changes: 137 additions & 0 deletions docs/rfcs/2023-05-09-distributed-planner.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
---
Feature Name: distributed-planner
Tracking Issue: TBD
Date: 2023-05-09
Author: "Ruihang Xia <[email protected]>"
---

Distributed Planner
-------------------
# Summary
Enhance the logical planner with aware of distributed, multi-region table topology. To achieve "push computation down" execution rather than the current "pull data up" manner.

# Motivation
Query distributively can leverage the advantage of GreptimeDB's architecture to process large dataset that exceeds the capacity of a single node, or accelerate the query execution by executing it in parallel. This task includes two sub-tasks
- Be able to transform the plan that can push as much as possible computation down to data source.
- Be able to handle pipeline breaker (like `Join` or `Sort`) on multiple computation nodes.
This is a relatively complex topic. To keep this RFC concentrated I'll focus on the first one.

# Details
## Background: Partition and Region
GreptimeDB supports table partitioning, where the partition rule is set during table creation. Each partition can be further divided into one or more physical storage units known as "regions". Both partitions and regions are divided based on rows:
``` text
┌────────────────────────────────────┐
│ │
│ Table │
│ │
└─────┬────────────┬────────────┬────┘
│ │ │
│ │ │
┌─────▼────┐ ┌─────▼────┐ ┌─────▼────┐
│ Region 1 │ │ Region 2 │ │ Region 3 │
└──────────┘ └──────────┘ └──────────┘
Row 1~10 Row 11~20 Row 21~30
```
General speaking, region is the minimum element of data distribution, and we can also use it as the unit to distribute computation. This can greatly simplify the routing logic of this distributed planner, by always schedule the computation to the node that currently opening the corresponding region. And is also easy to scale more node for computing since GreptimeDB's data is persisted on shared storage backend like S3. But this is a bit beyond the scope of this specific topic.
## Background: Commutativity
Commutativity is an attribute that describes whether two operation can exchange their apply order: $P1(P2(R)) \Leftrightarrow P2(P1(R))$. If the equation keeps, we can transform one expression into another form without changing its result. This is useful on rewriting SQL expression, and is the theoretical basis of this RFC.

Take this SQL as an example

``` sql
SELECT a FROM t WHERE a > 10;
```

As we know projection and filter are commutative (todo: latex), it can be translated to the following two identical plan trees:

```text
┌─────────────┐ ┌─────────────┐
│Projection(a)│ │Filter(a>10) │
└──────▲──────┘ └──────▲──────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│Filter(a>10) │ │Projection(a)│
└──────▲──────┘ └──────▲──────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ TableScan │ │ TableScan │
└─────────────┘ └─────────────┘
```

## Merge Operation

This RFC proposes to add a new expression node `MergeScan` to merge result from several regions in the frontend. It wrap the abstraction of remote data and execution, and expose a `TableScan` interface to upper level.

``` text
┌───────┼───────┐
│ │ │
│ ┌──┴──┐ │
│ └──▲──┘ │
│ │ │
│ ┌──┴──┐ │
│ └──▲──┘ │ ┌─────────────────────────────┐
│ │ │ │ │
│ ┌────┴────┐ │ │ ┌──────────┐ ┌───┐ ┌───┐ │
│ │MergeScan◄──┼────┤ │ Region 1 │ │ │ .. │ │ │
│ └─────────┘ │ │ └──────────┘ └───┘ └───┘ │
│ │ │ │
└─Frontend──────┘ └─Remote-Sources──────────────┘
```
This merge operation simply chains all the the underlying remote data sources and return `RecordBatch`, just like a coalesce op. And each remote sources is a gRPC query to datanode via the substrait logical plan interface. The plan is transformed and divided from the original query that comes to frontend.

## Commutativity of MergeScan

Obviously, The position of `MergeScan` is the key of the distributed plan. The more closer to the underlying `TableScan`, the less computation is taken by datanodes. Thus the goal is to pull the `MergeScan` up as more as possible. The word "pull up" means exchange `MergeScan` with its parent node in the plan tree, which means we should check the commutativity between the existing expression nodes and the `MergeScan`. Here I classify all the possibility into five categories:

- Commutative: $P1(P2(R)) \Leftrightarrow P2(P1(R))$
- filter
- projection
- operations that match the partition key
- Partial Commutative: $P1(P2(R)) \Leftrightarrow P1(P2(P1(R)))$
- $min(R) \rightarrow min(MERGE(min(R)))$
- $max(R) \rightarrow max(MERGE(max(R)))$
- Conditional Commutative: $P1(P2(R)) \Leftrightarrow P3(P2(P1(R)))$
- $count(R) \rightarrow sum(count(R))$
- Transformed Commutative: $P1(P2(R)) \Leftrightarrow P1(P3(R)) \Leftrightarrow P3(P1(R))$
- $avg(R) \rightarrow sum(R)/count(R)$
- Non-commutative
- sort
- join
- percentile
## Steps to plan
After establishing the set of commutative relations for all expressions, we can begin transforming the logical plan. There are four steps:

- Add a merge node before table scan
- Evaluate commutativity in a bottom-up way, stop at the first non-commutative node
- Divide the TableScan to scan over partitions
- Execute

First insert the `MergeScan` on top of the bottom `TableScan` node. Then examine the commutativity start from the `MergeScan` node transform the plan tree based on the result. Stop this process on the first non-commutative node.
``` text
┌─────────────┐ ┌─────────────┐
│ Sort │ │ Sort │
└──────▲──────┘ └──────▲──────┘
│ │
┌─────────────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ Sort │ │Projection(a)│ │ MergeScan │
└──────▲──────┘ └──────▲──────┘ └──────▲──────┘
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│Projection(a)│ │ MergeScan │ │Projection(a)│
└──────▲──────┘ └──────▲──────┘ └──────▲──────┘
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ TableScan │ │ TableScan │ │ TableScan │
└─────────────┘ └─────────────┘ └─────────────┘
(a) (b) (c)
```
Then in the physical planning phase, convert the sub-tree below `MergeScan` into a remote query request and dispatch to all the regions. And let the `MergeScan` to receive the results and feed to it parent node.

To simplify the overall complexity, any error in the procedure will lead to a failure to the entire query, and cancel all other parts.
# Alternatives
## Spill
If only consider the ability of processing large dataset, we can enable DataFusion's spill ability to temporary persist intermediate data into disk, like the "swap" memory. But this will lead to a super slow performance and very large write amplification.
# Future Work
As described in the `Motivation` section we can further explore the distributed planner on the physical execution level, by introducing mechanism like Spark's shuffle to improve parallelism and reduce intermediate pipeline breaker's stage.

0 comments on commit 7a310cb

Please sign in to comment.