Support for GROUPING SETS/CUBE/ROLLUP #2716

Merged · 26 commits · Jun 13, 2022

Conversation

@thinkharderdev (Contributor) commented Jun 10, 2022:

Which issue does this PR close?

Closes #1327

TODO

  • Implement CUBE expansion
  • Implement ROLLUP expansion
  • Add SQL tests for CUBE/ROLLUP queries

Note that the SQL parser doesn't currently seem to handle GROUP BY GROUPING SETS ..., so we need to address that before we can test it explicitly.

Rationale for this change

This PR adds support for GROUPING SETS (and special cases CUBE/ROLLUP) in the physical planner and execution plan.

What changes are included in this PR?

There are three primary changes:

  1. AggregateExec now takes a Vec<Vec<(Arc<dyn PhysicalExpr>,String)>> to represent grouping sets; a normal GROUP BY is just a special case. We expect the grouping sets to be "aligned". For example, for a SQL clause like GROUP BY GROUPING SETS ((a),(b),(a,b)), AggregateExec assumes that the planner will expand that to the grouping set ((a,NULL),(NULL,b),(a,b)) (see the sketch below). We can't handle this in the execution plan because we don't have PartialEq for PhysicalExpr.
  2. In DefaultPhysicalPlanner handle expanding and aligning grouping sets. This includes expanding CUBE/ROLLUP expressions and merging and aligning GROUPING SET expressions.
  3. Handle grouping sets correctly in optimizers.

We also include serialization for grouping set expressions in datafusion-proto.
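
To make the expansion and alignment described in item 1 concrete, here is a minimal, self-contained sketch. Plain strings stand in for Arc<dyn PhysicalExpr>, and the helper names expand_cube and align are hypothetical, not functions from this PR:

```rust
// Illustrative only: strings stand in for Arc<dyn PhysicalExpr>.

/// Expand CUBE(a, b, ...) into all 2^n column subsets,
/// e.g. CUBE(a, b) -> [[], [a], [b], [a, b]].
fn expand_cube(cols: &[&str]) -> Vec<Vec<String>> {
    (0..(1usize << cols.len()))
        .map(|mask| {
            cols.iter()
                .enumerate()
                .filter(|&(i, _)| (mask & (1 << i)) != 0)
                .map(|(_, c)| c.to_string())
                .collect()
        })
        .collect()
}

/// Align each grouping set against the full column list by recording which
/// positions are absent (output NULL), mirroring how ((a),(b),(a,b)) is
/// expanded to ((a,NULL),(NULL,b),(a,b)).
fn align(all_cols: &[&str], sets: &[Vec<String>]) -> Vec<Vec<bool>> {
    sets.iter()
        .map(|set| {
            all_cols
                .iter()
                .map(|c| !set.iter().any(|s| s.as_str() == *c)) // true => NULL in this set
                .collect()
        })
        .collect()
}

fn main() {
    let sets = expand_cube(&["a", "b"]);
    let groups = align(&["a", "b"], &sets);
    // Prints [[true, true], [false, true], [true, false], [false, false]]
    println!("{:?}", groups);
}
```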

Are there any user-facing changes?

SQL statements with CUBE/ROLLUP should now be supported. GROUPING SETS should also be supported, but it seems like the SQL parser is not handling them correctly.

I don't think so.

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules labels Jun 10, 2022
@thinkharderdev thinkharderdev marked this pull request as draft June 10, 2022 12:20
@thinkharderdev (Contributor, Author) commented:

cc @alamb @tustvold @jimexist @yjshen @andygrove

The part of this that I am least confident about is whether I broke anything in any of the optimizers :). So if someone familiar with that code can review that part, I would be very grateful.

@alamb (Contributor) commented Jun 10, 2022:

Thanks @thinkharderdev -- I'll try and find some time to review this over the weekend.

@thinkharderdev thinkharderdev marked this pull request as ready for review June 11, 2022 12:12
@codecov-commenter commented Jun 11, 2022:

Codecov Report

Merging #2716 (a2cb52d) into master (080c324) will increase coverage by 0.13%.
The diff coverage is 93.96%.

@@            Coverage Diff             @@
##           master    #2716      +/-   ##
==========================================
+ Coverage   84.72%   84.86%   +0.13%     
==========================================
  Files         270      270              
  Lines       47254    47717     +463     
==========================================
+ Hits        40036    40495     +459     
- Misses       7218     7222       +4     
Impacted Files Coverage Δ
datafusion/expr/src/expr_fn.rs 88.23% <0.00%> (-3.23%) ⬇️
datafusion/expr/src/utils.rs 90.80% <71.42%> (-0.39%) ⬇️
datafusion/proto/src/from_proto.rs 34.64% <81.25%> (+0.85%) ⬆️
...atafusion/core/src/physical_plan/aggregates/mod.rs 91.16% <84.49%> (-3.44%) ⬇️
datafusion/core/src/physical_plan/planner.rs 80.83% <96.55%> (+2.61%) ⬆️
datafusion/core/tests/dataframe.rs 98.62% <97.18%> (-1.38%) ⬇️
datafusion/core/tests/sql/aggregates.rs 99.27% <98.24%> (-0.10%) ⬇️
...ore/src/physical_optimizer/aggregate_statistics.rs 100.00% <100.00%> (ø)
...afusion/core/src/physical_optimizer/repartition.rs 100.00% <100.00%> (ø)
...tafusion/core/src/physical_plan/aggregates/hash.rs 92.95% <100.00%> (+0.04%) ⬆️
... and 17 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@alamb (Contributor) left a comment:

Thank you @thinkharderdev and @Tomczik76 -- this is super cool. I haven't made it all the way through yet but what I have reviewed is 👌

I found the whitespace blind diff easier to review: https://github.com/apache/arrow-datafusion/pull/2716/files?w=1

cc @andygrove @liukun4515

&input.schema(),
&grouping_set.expr,
&aggr_expr,
grouping_set.groups.iter().flatten().any(|is_null| *is_null),

Contributor:

I wonder if extracting this code to a function such as GroupingSets::contains_null() might make the code easier to read. The same comment applies to other places where GroupingSets::groups is referenced as well.

Given the size of this PR already, definitely could be done as a follow on
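
A hedged sketch of that suggested extraction; the GroupingSets struct below is a stripped-down stand-in keeping only the groups null-mask from the snippet above, not the PR's actual type:

```rust
/// Stripped-down stand-in for the group-by description discussed here; only
/// the null-mask field from the snippet above is kept.
struct GroupingSets {
    /// One Vec<bool> per grouping set; true means that column is NULLed out.
    groups: Vec<Vec<bool>>,
}

impl GroupingSets {
    /// The extraction suggested above: true if any grouping set nulls out a
    /// column, i.e. this is a real GROUPING SETS / CUBE / ROLLUP.
    fn contains_null(&self) -> bool {
        self.groups.iter().flatten().any(|is_null| *is_null)
    }
}

fn main() {
    let gs = GroupingSets {
        groups: vec![vec![false, true], vec![false, false]],
    };
    assert!(gs.contains_null());
}
```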

@@ -204,6 +204,7 @@
/// DataFusion crate version
pub const DATAFUSION_VERSION: &str = env!("CARGO_PKG_VERSION");

extern crate core;

Contributor:

Why is this necessary?

Contributor (Author):

It's not :) Not sure where it came from but removed now.

@@ -110,12 +111,15 @@ impl GroupedHashAggregateStreamV2 {
// The expressions to evaluate the batch, one vec of expressions per aggregation.
// Assume create_schema() always put group columns in front of aggr columns, we set
// col_idx_base to group expression count.
let aggregate_expressions =
aggregates::aggregate_expressions(&aggr_expr, &mode, group_expr.len())?;
let aggregate_expressions = aggregates::aggregate_expressions(

Contributor:

FYI @yjshen -- it would be really nice to try and consolidate row_hash and hash -- filed #2723 to track 👍

GroupingSet::GroupingSets(groups) => {
let mut exprs: Vec<Expr> = vec![];
for exp in groups.iter().flatten() {
if !exprs.contains(exp) {

Contributor:

This is N^2 in the number of grouping sets -- probably not an issue, I just figured I would point it out

Contributor (Author):

Yeah, this is unfortunate
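
For reference, an order-preserving de-duplication that avoids the quadratic scan could look like the sketch below. It assumes the element type implements Eq + Hash, which may not hold for Expr, so treat it as illustrative only:

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Order-preserving de-duplication in expected O(n) time, assuming the
/// element type is hashable (which Expr may not be).
fn dedup_preserving_order<T: Clone + Eq + Hash>(items: impl IntoIterator<Item = T>) -> Vec<T> {
    let mut seen = HashSet::new();
    items
        .into_iter()
        // HashSet::insert returns false for items already seen, filtering them out.
        .filter(|item| seen.insert(item.clone()))
        .collect()
}

fn main() {
    let flat = vec!["a", "b", "a", "c", "b"];
    assert_eq!(dedup_preserving_order(flat), vec!["a", "b", "c"]);
}
```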

"| e | 4 | | -16064.57142857143 |",
"| e | 5 | -86 | 32514 |",
"| e | 5 | 64 | -26526 |",
"| e | 5 | | 2994 |",

Contributor:

👍

@@ -250,6 +280,29 @@ mod tests {
Ok(())
}

#[test]
fn single_distinct_and_grouping_set() -> Result<()> {

Contributor:

Given there is special handling for CUBE and ROLLUP in this pass, I suggest test coverage for those cases too

Contributor (Author):

Yeah, I think there is actually a bug in this. I'll work on a fix.

Contributor (Author):

Ok, this optimization is a bit more complicated for grouping sets. We need to create a separate alias for each group. For the moment I have just disabled the optimization for this case.

Contributor:

I think disabling the optimization for grouping sets is a wise idea.

@alamb (Contributor) left a comment:

I think it looks great to me. Thanks again!

I had a few minor comments (e.g. some left-over printlns), but all in all I think this one is good to go.

@@ -265,7 +265,7 @@ mod tests {

use crate::error::Result;
use crate::logical_plan::Operator;
use crate::physical_plan::aggregates::AggregateExec;
use crate::physical_plan::aggregates::{AggregateExec, PhysicalGroupBy};

Contributor:

Love the new name PhysicalGroupBy

@@ -65,13 +66,60 @@ pub enum AggregateMode {
FinalPartitioned,
}

/// Represents `GROUP BY` clause in the plan (including the more general GROUPING SET)

Contributor:

Thank you -- this is super helpful

@@ -117,14 +171,16 @@ impl AggregateExec {

/// Grouping expressions
pub fn group_expr(&self) -> &[(Arc<dyn PhysicalExpr>, String)] {
&self.group_expr
// TODO Is this right?

Contributor:

I don't think so -- this seems to be used by the "use statistics instead of aggregates" optimization

/Users/alamb/Software/arrow-datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
113:             && final_agg_exec.group_expr().is_empty()
121:                         && partial_agg_exec.group_expr().is_empty()
/Users/alamb/Software/arrow-datafusion/datafusion/core/src/physical_plan/aggregates/mod.rs
728:         let groups = partial_aggregate.group_expr().to_vec();

In general, it might make sense to disable / skip all such optimizations in the cases of grouping sets / cube / rollup -- that would be the conservative approach and avoid potential subtle wrong answer bugs. As the feature is used more and people have a need to optimize it more, we can revisit the optimizations and make sure they are relevant to grouping sets

Contributor (Author):

In this case it would still be correct, right? The aggregate stats are only used if there is no group by, which this would still represent correctly.

Contributor (Author):

Or maybe this should just return &PhysicalGroupBy instead? I could see how this could lead to issues elsewhere if it is used for optimizations.

Contributor:

Returning &PhysicalGroupBy sounds like a good future proof idea

@@ -62,9 +63,11 @@ fn optimize(plan: &LogicalPlan) -> Result<LogicalPlan> {
schema,
group_expr,
}) => {
if is_single_distinct_agg(plan) {
if is_single_distinct_agg(plan) && !contains_grouping_set(group_expr) {

Contributor:

I think this is a good idea -- to skip grouping sets in optimizations
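
For context, a simplified sketch of what a contains_grouping_set check can look like; the Expr and GroupingSet enums below are stand-ins for the real datafusion-expr types rather than the PR's exact code:

```rust
// Simplified stand-ins for the logical expression types; the real ones live
// in datafusion-expr, so this is only an illustration of the check.
#[derive(Debug)]
enum GroupingSet {
    Rollup(Vec<Expr>),
    Cube(Vec<Expr>),
    GroupingSets(Vec<Vec<Expr>>),
}

#[derive(Debug)]
enum Expr {
    Column(String),
    GroupingSet(GroupingSet),
}

/// True if any group expression is a GROUPING SETS / CUBE / ROLLUP, in which
/// case the single-distinct rewrite is skipped.
fn contains_grouping_set(group_expr: &[Expr]) -> bool {
    group_expr.iter().any(|e| matches!(e, Expr::GroupingSet(_)))
}

fn main() {
    let group_expr = vec![Expr::GroupingSet(GroupingSet::Cube(vec![
        Expr::Column("a".into()),
        Expr::Column("b".into()),
    ]))];
    assert!(contains_grouping_set(&group_expr));
}
```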

@@ -160,6 +166,7 @@ fn optimize_children(plan: &LogicalPlan) -> Result<LogicalPlan> {
}

fn is_single_distinct_agg(plan: &LogicalPlan) -> bool {
// false

Contributor:

Left over?

@@ -212,6 +224,9 @@ mod tests {
let optimized_plan = rule
.optimize(plan, &OptimizerConfig::new())
.expect("failed to optimize plan");

println!("{:?}", optimized_plan);

Contributor:

left over?

Comment on lines 571 to 572
let contains_dict = groups
.expr

Contributor:

I think it is a minor thing, but one might imagine keeping the fields of PhysicalGroupBy private and adding functions like fn expr() and fn is_empty() mostly as a way of additional documentation

@thinkharderdev (Contributor, Author) commented Jun 13, 2022:

Yeah, I think that is a good idea. Fixed.
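
A hedged sketch of that encapsulation, with plain strings standing in for the (Arc<dyn PhysicalExpr>, String) pairs; the field and method names follow this thread's discussion rather than the PR's exact definition:

```rust
/// Simplified stand-in for PhysicalGroupBy with private fields and the
/// accessor methods suggested above.
pub struct PhysicalGroupBy {
    expr: Vec<(String, String)>,      // distinct group expressions with output names
    null_expr: Vec<(String, String)>, // NULL placeholders for absent columns
    groups: Vec<Vec<bool>>,           // one null-mask per grouping set
}

impl PhysicalGroupBy {
    /// The distinct group expressions with their output names.
    pub fn expr(&self) -> &[(String, String)] {
        &self.expr
    }

    /// True when there is no grouping at all (a plain ungrouped aggregate).
    pub fn is_empty(&self) -> bool {
        self.expr.is_empty()
    }

    /// True when more than one grouping set is present, i.e. a real
    /// GROUPING SETS / CUBE / ROLLUP rather than a plain GROUP BY.
    pub fn is_grouping_set(&self) -> bool {
        self.groups.len() > 1
    }
}

fn main() {
    let group_by = PhysicalGroupBy {
        expr: vec![("c1".into(), "c1".into())],
        null_expr: vec![("NULL".into(), "c1".into())],
        groups: vec![vec![false], vec![true]],
    };
    assert!(!group_by.is_empty());
    assert!(group_by.is_grouping_set());
    assert_eq!(group_by.expr().len(), 1);
}
```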

Comment on lines 1113 to 1120
let mut group: Vec<bool> = Vec::with_capacity(expr_count);
for expr in all_exprs.iter() {
if expr_group.contains(expr) {
group.push(false);
} else {
group.push(true)
}
}

Contributor:

I don't think it matters, but you can probably express this in a functional style like:

Suggested change
let mut group: Vec<bool> = Vec::with_capacity(expr_count);
for expr in all_exprs.iter() {
if expr_group.contains(expr) {
group.push(false);
} else {
group.push(true)
}
}
let group: Vec<bool> = all_exprs
    .iter()
    .map(|expr| !expr_group.contains(expr))
    .collect();

Comment on lines 1624 to 1627
.aggregate(
vec![cube(vec![col("c1"), col("c2"), col("c3")])],
vec![sum(col("c2"))],
)?

Contributor:

I don't understand the need to create the aggregate on the logical plan (as the new cube expressions are planned below). Can you simply use the output of the project plan?

The same question applies to the other plans below

assert_batches_sorted_eq!(expected, &results);
Ok(())
}

Contributor:

I think the test coverage is quite good. Thank you

@alamb (Contributor) left a comment:

👍

@alamb (Contributor) commented Jun 13, 2022:

Looks great to me -- thanks again for all the work @thinkharderdev 🎉

@alamb alamb merged commit ca5339b into apache:master Jun 13, 2022

Labels: core (Core DataFusion crate), logical-expr (Logical plan and expressions), optimizer (Optimizer rules)
Linked issue: implement grouping sets, cubes, and rollups
4 participants