-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for GROUPING SETS/CUBE/ROLLUP #2716
Conversation
cc @alamb @tustvold @jimexist @yjshen @andygrove The part of this that I am least confident about is that I didn't break anything in any of the optimizers :). So if someone familiar with that code can review that part I would be very grateful. |
Thanks @thinkharderdev -- I'll try and find some time to review this over the weekend. |
dfbfee2
to
5eb1881
Compare
Codecov Report
@@ Coverage Diff @@
## master #2716 +/- ##
==========================================
+ Coverage 84.72% 84.86% +0.13%
==========================================
Files 270 270
Lines 47254 47717 +463
==========================================
+ Hits 40036 40495 +459
- Misses 7218 7222 +4
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @thinkharderdev and @Tomczik76 -- this is super cool. I haven't made it all the way through yet but what I have reviewed is 👌
I found the whitespace blind diff easier to review: https://github.com/apache/arrow-datafusion/pull/2716/files?w=1
&input.schema(), | ||
&grouping_set.expr, | ||
&aggr_expr, | ||
grouping_set.groups.iter().flatten().any(|is_null| *is_null), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if extracting this code to a function such as GroupingSets::contains_null()
might make the code easier to read. The same comment applies to other places where GroupingSets::groups
is referenced as well.
Given the size of this PR already, definitely could be done as a follow on
datafusion/core/src/lib.rs
Outdated
@@ -204,6 +204,7 @@ | |||
/// DataFusion crate version | |||
pub const DATAFUSION_VERSION: &str = env!("CARGO_PKG_VERSION"); | |||
|
|||
extern crate core; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not :) Not sure where it came from but removed now.
@@ -110,12 +111,15 @@ impl GroupedHashAggregateStreamV2 { | |||
// The expressions to evaluate the batch, one vec of expressions per aggregation. | |||
// Assume create_schema() always put group columns in front of aggr columns, we set | |||
// col_idx_base to group expression count. | |||
let aggregate_expressions = | |||
aggregates::aggregate_expressions(&aggr_expr, &mode, group_expr.len())?; | |||
let aggregate_expressions = aggregates::aggregate_expressions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GroupingSet::GroupingSets(groups) => { | ||
let mut exprs: Vec<Expr> = vec![]; | ||
for exp in groups.iter().flatten() { | ||
if !exprs.contains(exp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is N^2 in the number of grouping sets -- probably not an issue, I just figured I would point it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is unfortunate
"| e | 4 | | -16064.57142857143 |", | ||
"| e | 5 | -86 | 32514 |", | ||
"| e | 5 | 64 | -26526 |", | ||
"| e | 5 | | 2994 |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -250,6 +280,29 @@ mod tests { | |||
Ok(()) | |||
} | |||
|
|||
#[test] | |||
fn single_distinct_and_grouping_set() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given there is special handling for CUBE and ROLLUP in this pass, I suggest test coverage for those cases too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think there is actually a bug in this. I'll work on a fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this optimization is a bit more complicated for grouping sets. We need to create a separate alias for each group. For the moment I have just disabled the optimization for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think disabling the optimization for grouping sets is a wise idea.
Co-authored-by: Andrew Lamb <[email protected]>
…unit tests for single distinct queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks great to me. Thanks again!
I had a few minor comments (e.g. some left over printlns) but all in all I think this one is good to go
@@ -265,7 +265,7 @@ mod tests { | |||
|
|||
use crate::error::Result; | |||
use crate::logical_plan::Operator; | |||
use crate::physical_plan::aggregates::AggregateExec; | |||
use crate::physical_plan::aggregates::{AggregateExec, PhysicalGroupBy}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the new name PhysicalGroupBy
@@ -65,13 +66,60 @@ pub enum AggregateMode { | |||
FinalPartitioned, | |||
} | |||
|
|||
/// Represents `GROUP BY` clause in the plan (including the more general GROUPING SET) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you -- this is super helpful
@@ -117,14 +171,16 @@ impl AggregateExec { | |||
|
|||
/// Grouping expressions | |||
pub fn group_expr(&self) -> &[(Arc<dyn PhysicalExpr>, String)] { | |||
&self.group_expr | |||
// TODO Is this right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so -- this seems to be used by the "use statistics instead of aggregates" optimization
/Users/alamb/Software/arrow-datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
113: && final_agg_exec.group_expr().is_empty()
121: && partial_agg_exec.group_expr().is_empty()
/Users/alamb/Software/arrow-datafusion/datafusion/core/src/physical_plan/aggregates/mod.rs
728: let groups = partial_aggregate.group_expr().to_vec();
In general, it might make sense to disable / skip all such optimizations in the cases of grouping sets / cube / rollup -- that would be the conservative approach and avoid potential subtle wrong answer bugs. As the feature is used more and people have a need to optimize it more, we can revisit the optimizations and make sure they are relevant to grouping sets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case it would still be correct right? The aggregate stats are only used if there is no group by which this would still represent correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe this should just return &PhysicalGroupBy
instead? I could see how this could lead to issues elsewhere if it is used for optimizations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning &PhysicalGroupBy
sounds like a good future proof idea
@@ -62,9 +63,11 @@ fn optimize(plan: &LogicalPlan) -> Result<LogicalPlan> { | |||
schema, | |||
group_expr, | |||
}) => { | |||
if is_single_distinct_agg(plan) { | |||
if is_single_distinct_agg(plan) && !contains_grouping_set(group_expr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good idea -- to skip grouping sets in optimizations
@@ -160,6 +166,7 @@ fn optimize_children(plan: &LogicalPlan) -> Result<LogicalPlan> { | |||
} | |||
|
|||
fn is_single_distinct_agg(plan: &LogicalPlan) -> bool { | |||
// false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left over?
@@ -212,6 +224,9 @@ mod tests { | |||
let optimized_plan = rule | |||
.optimize(plan, &OptimizerConfig::new()) | |||
.expect("failed to optimize plan"); | |||
|
|||
println!("{:?}", optimized_plan); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left over?
let contains_dict = groups | ||
.expr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a minor thing, but one might imagine keeping the fields of PhysicalGroupBy
private and adding functions like fn expr()
and fn is_empty()
mostly as a way of additional documentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that is a good idea. Fixed.
let mut group: Vec<bool> = Vec::with_capacity(expr_count); | ||
for expr in all_exprs.iter() { | ||
if expr_group.contains(expr) { | ||
group.push(false); | ||
} else { | ||
group.push(true) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it matters, but you can probably express this in a functional style like:
let mut group: Vec<bool> = Vec::with_capacity(expr_count); | |
for expr in all_exprs.iter() { | |
if expr_group.contains(expr) { | |
group.push(false); | |
} else { | |
group.push(true) | |
} | |
} | |
let group: Vec<bool> = all_exprs.iter() | |
.map(expr_group.contains(expr)) | |
.collect(); |
.aggregate( | ||
vec![cube(vec![col("c1"), col("c2"), col("c3")])], | ||
vec![sum(col("c2"))], | ||
)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the need to creating the aggregate
on the logical plan (as then new cube expressions are planned below). Can you simply use the output of the project
plan?
The same question applies to the other plans below
assert_batches_sorted_eq!(expected, &results); | ||
Ok(()) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the test coverage is quite good. Thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -117,14 +171,16 @@ impl AggregateExec { | |||
|
|||
/// Grouping expressions | |||
pub fn group_expr(&self) -> &[(Arc<dyn PhysicalExpr>, String)] { | |||
&self.group_expr | |||
// TODO Is this right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning &PhysicalGroupBy
sounds like a good future proof idea
Looks great to me -- thanks again for all the work @thinkharderdev 🎉 |
Which issue does this PR close?
Closes #1327
TODO
Note that currently the sql parser doesn't seem to handle
GROUP BY GROUPING SETS ...
so we need to address that to test that explicitly.Rationale for this change
This PR adds support for GROUPING SETS (and special cases CUBE/ROLLUP) in the physical planner and execution plan.
What changes are included in this PR?
There are three primary changes:
AggregateExec
now takes aVec<Vec<(Arc<dyn PhysicalExpr>,String)>>
to represent grouping sets. A normalGROUP BY
is just a special case. We expect the grouping sets to be "aligned". For example, for a SQL clause likeGROUP BY GROUPING SETS ((a),(b),(a,b))
,AggregateExec
assumes that the planner will expand that to the grouping set((a,NULL),(NULL,b),(a,b))
. We can't handle this in the execution plan because we don't haveParialEq
forPhysicalExpr
.DefaultPhysicalPlanner
handle expanding and aligning grouping sets. This includes expanding CUBE/ROLLUP expressions and merging and aligning GROUPING SET expressions.Also we include serialization for grouping set expression in
datafusion-proto
Are there any user-facing changes?
SQL statements with CUBE/ROLLUP should now be supported. GROUPING SETS should also be supported but it seems like the sql parser is not handling them correctly.
I don't think so.