Remove CountWildcardRule in Analyzer and move the functionality in ExprPlanner, add plan_aggregate and plan_window to planner #14689
datafusion/expr/src/expr.rs
Outdated
if *distinct { "DISTINCT " } else { "" },
schema_name_from_exprs_comma_separated_without_space(args)?
)?;
// TODO: Make this customizable by adding `schema_name` for UDAF
see #14695
Thanks @jayzhan211 - this looks great to me.
I recommend merging #14695 first and then updating this PR to use that to show count(*).
I think that would make the diff much smaller and easier to understand 💯
"| 10 | 110 | 20 |",
"+--------------+--------------+----------+",
"+--------------+--------------+--------------+",
"| sum(test.c1) | sum(test.c2) | count_star() |",
It would be great if this could stay count(*).
Perhaps that is why you implemented AggregateUDFImpl::display_name 🤔
Actually #14695 is for others to keep count(*), given that I prefer count_star() :)
FWIW, DuckDB formats COUNT(*) as count_star and preserves COUNT(<expr>) where expr isn't a wildcard.
In this case, it looks like we don't preserve COUNT(1).
In this case, it looks like we don't preserve COUNT(1).
Yes, we don't. count(const_expr), count(), and count(*) are all the same thing, so I don't think we need to preserve them all.
Yes we don't. count(const_expr), count(), count(*) are all the same thing, I don't think we need to preserve them all
fine for me.
calling it just count() (or count(1)) would be shorter and cleaner
in particular, the count_star function should not exist as user-callable syntax (SELECT count_star() should fail with "function not found")
It would be great if this could stay count(*)
i agree ...
Perhaps that is why you implemented AggregateUDFImpl::display_name 🤔
... but not at any cost.
if displaying count(*) rather than (equivalent) count(1) requires ~200 lines of duplicated code, i'd stick with count(1)
datafusion/expr/src/planner.rs
Outdated

/// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
///
/// Returns origin expression arguments if not possible
- /// Returns origin expression arguments if not possible
+ /// Returns original expression arguments if not possible
@@ -2447,8 +2448,8 @@ async fn test_count_wildcard_on_sort() -> Result<()> {
    let df_results = ctx
        .table("t1")
        .await?
        .aggregate(vec![col("b")], vec![count(wildcard())])?
Should we also deprecate wildcard() and Expr::Wildcard (in a follow on PR?) 🤔
https://docs.rs/datafusion/latest/datafusion/logical_expr/fn.wildcard.html
https://docs.rs/datafusion/latest/datafusion/prelude/enum.Expr.html#variant.Wildcard
datafusion/expr/src/planner.rs
Outdated
        Ok(PlannerResult::Original(expr))
    }

    /// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
Should we make the doc here more general to window functions?
- /// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
+ /// Plans window functions, such as `COUNT(<expr>)`
datafusion/expr/src/planner.rs
Outdated
@@ -211,6 +214,23 @@ pub trait ExprPlanner: Debug + Send + Sync {
    fn plan_any(&self, expr: RawBinaryExpr) -> Result<PlannerResult<RawBinaryExpr>> {
        Ok(PlannerResult::Original(expr))
    }

    /// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
Same here regarding wording
- /// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
+ /// Plans aggregate functions, such as `COUNT(<expr>)`
/// Count(*), Count(), Count(1) are all equivalent expression
/// In DataFusion, we convert them to Count(1) expression
pub fn count_wildcard() -> Expr {
wildcard here is a syntactical remnant.
this is a function to count all rows, so call it that
- /// Count(*), Count(), Count(1) are all equivalent expression
- /// In DataFusion, we convert them to Count(1) expression
- pub fn count_wildcard() -> Expr {
+ /// Creates aggregation to count all rows
+ pub fn count_all() -> Expr {
    expr: RawAggregateExpr,
) -> Result<PlannerResult<RawAggregateExpr>> {
    if expr.func.name() == "count"
        && (expr.args.len() == 1 && matches!(expr.args[0], Expr::Wildcard { .. })
I hope we are able to remove Expr::Wildcard as a follow-up 🙏
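For readers following the hunk above, here is a minimal sketch of the kind of rewrite being reviewed: a planner hook that turns `count(*)` and `count()` into `count(1)` and hands every other aggregate back untouched. The `Expr`, `RawAggregateExpr`, and `PlannerResult` types below are simplified, hypothetical stand-ins written for this illustration; they are not DataFusion's real definitions, and the real hook is a trait method returning a `Result`.

```rust
// Hypothetical, simplified stand-ins for the types in the diff; not the real DataFusion API.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Wildcard,
    Literal(i64),
    Column(String),
}

#[derive(Debug, Clone, PartialEq)]
struct RawAggregateExpr {
    func_name: String,
    args: Vec<Expr>,
}

/// Either the planner rewrote the expression, or it returns it unchanged.
#[derive(Debug, PartialEq)]
enum PlannerResult<T> {
    Planned(T),
    Original(T),
}

/// Rewrites `count(*)` and `count()` to `count(1)`; anything else is returned as-is.
fn plan_aggregate(expr: RawAggregateExpr) -> PlannerResult<RawAggregateExpr> {
    let is_count_all = expr.func_name == "count"
        && matches!(expr.args.as_slice(), [] | [Expr::Wildcard]);

    if is_count_all {
        PlannerResult::Planned(RawAggregateExpr {
            func_name: expr.func_name,
            args: vec![Expr::Literal(1)],
        })
    } else {
        PlannerResult::Original(expr)
    }
}
```

Because the rewrite happens while the expression is planned, no later analyzer pass needs to know about wildcards, which is the motivation stated in the PR description.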
We need display name / schema name for WindowFunction as well
So close!
LGTM %
fn schema_name(&self, params: &AggregateFunctionParams) -> Result<String> {
    let AggregateFunctionParams {
        args,
        distinct,
        filter,
        order_by,
        null_treatment,
    } = params;

    let mut schema_name = String::new();

    if !args.is_empty() && args[0] == Expr::Literal(COUNT_STAR_EXPANSION) {
        schema_name.write_str("count(*)")?;
    } else {
        schema_name.write_fmt(format_args!(
            "{}({}{})",
            self.name(),
            if *distinct { "DISTINCT " } else { "" },
            schema_name_from_exprs(args)?
        ))?;
    }

    if let Some(null_treatment) = null_treatment {
        schema_name.write_fmt(format_args!(" {}", null_treatment))?;
    }

    if let Some(filter) = filter {
        schema_name.write_fmt(format_args!(" FILTER (WHERE {filter})"))?;
    };

    if let Some(order_by) = order_by {
        schema_name.write_fmt(format_args!(
            " ORDER BY [{}]",
            schema_name_from_sorts(order_by)?
        ))?;
    };

    Ok(schema_name)
}
This looks like a long copy of the default implementation.
Overall we have 4 methods copied, 177 lines in total, where all we need is to customize count(1) so that it is displayed as count(*). Not good for maintainability.
I wonder why the logic for formatting distinct, filter and order by is handed to the function itself, if these are attributes of the containing AggregateFunction. If we want to solve this, this could be a prep PR to avoid PR scope creep.
Otherwise it's better to leave count(1) as count(1), rather than copy so many lines, unless some other option exists.
Since they are part of the name, we must bring them all.
Didn't find any nice way to avoid the duplication.
What would happen if these name-generating functions were not overridden in the count aggregation?
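To make the duplication concern in this thread concrete, here is a rough sketch of one possible split: the function only decides how the call itself is rendered, and a shared helper appends the FILTER and ORDER BY modifiers. This is not how DataFusion is structured and not what the PR does. `Params`, `count_call_name`, and the two-argument `schema_name` are hypothetical names invented for the illustration, arguments are pre-rendered strings, null treatment is left out, and the `!distinct` guard is an added assumption.

```rust
use std::fmt::Write;

// Hypothetical, simplified parameters: arguments and modifiers are already rendered as strings.
struct Params {
    args: Vec<String>,
    distinct: bool,
    filter: Option<String>,
    order_by: Option<String>,
}

/// The only count-specific part: how the call itself is displayed.
/// Assumption for this sketch: `count(DISTINCT 1)` is not shown as `count(*)`.
fn count_call_name(params: &Params) -> String {
    let is_count_all =
        !params.distinct && params.args.len() == 1 && params.args[0] == "Int64(1)";
    if is_count_all {
        "count(*)".to_string()
    } else {
        let distinct = if params.distinct { "DISTINCT " } else { "" };
        format!("count({}{})", distinct, params.args.join(","))
    }
}

/// Shared by every aggregate: append the modifiers to whatever call name was produced.
fn schema_name(call_name: String, params: &Params) -> Result<String, std::fmt::Error> {
    let mut name = call_name;
    if let Some(filter) = &params.filter {
        write!(name, " FILTER (WHERE {})", filter)?;
    }
    if let Some(order_by) = &params.order_by {
        write!(name, " ORDER BY [{}]", order_by)?;
    }
    Ok(name)
}
```

With a split like this, overriding the display of `count(1)` would touch only the short call-name function, not the modifier formatting that is currently copied.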
if !args.is_empty() && args[0] == Expr::Literal(COUNT_STAR_EXPANSION) {
    schema_name.write_str("count(*)")?;
- count(1) and count(2) are the same thing, so what about checking for args[0] to be a non-null constant?
- count(1, a, b) is something else than count(1); this should check that args.len = 1
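The two checks suggested above can be sketched as a single slice pattern. This is an illustration only: the `Expr` enum below is a hypothetical, simplified type (a `None` literal models NULL), and `counts_all_rows` is a made-up helper name, not something from the PR.

```rust
// Hypothetical, simplified expression type; not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(Option<i64>), // `None` models a NULL literal
    Column(String),
}

/// True when the argument list means "count every row":
/// exactly one argument, and that argument is a non-null constant.
fn counts_all_rows(args: &[Expr]) -> bool {
    matches!(args, [Expr::Literal(Some(_))])
}
```

The slice pattern enforces the length check and the non-null check at once, so `count(1, a, b)` and `count(NULL)` both fall through to the regular display path.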
@@ -167,14 +170,14 @@ pub trait ExprPlanner: Debug + Send + Sync {

    /// Plan an extract expression, such as`EXTRACT(month FROM foo)`
    ///
-   /// Returns origin expression arguments if not possible
+   /// Returns original expression arguments if not possible
This is a good change. nit: could go in a separate PR to keep the PR size lower.
I think we can keep this
LGTM 👍👍
match args {
    [] => true, // count()
    // All const should be coerced to int64 or rejected by the signature
    [Expr::Literal(ScalarValue::Int64(_))] => true, // count(1)
We might need to consider select count(null::bigint)
count(null) is not wildcard
D create table t(a int);
D insert into t values (1);
D select count(null) from t;
┌─────────────┐
│ count(NULL) │
│ int64 │
├─────────────┤
│ 0 │
└─────────────┘
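The DuckDB result above follows from the usual SQL rule that `count(expr)` counts only rows where `expr` is non-null. A small model of that rule, with hypothetical helper names and `Option<i64>` standing in for a nullable value:

```rust
/// Counts the non-null values, which is what `count(expr)` does over the rows of a group.
fn count(values: &[Option<i64>]) -> usize {
    values.iter().filter(|v| v.is_some()).count()
}

/// Evaluates a constant argument once per row, then counts the non-null results.
fn count_constant(constant: Option<i64>, num_rows: usize) -> usize {
    let per_row: Vec<Option<i64>> = vec![constant; num_rows];
    count(&per_row)
}
```

Here `count_constant(Some(1), n)` gives `n`, matching `count(*)`, while `count_constant(None, n)` gives 0, which is why a NULL literal must not be treated as the wildcard case.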
pyarrow is failing across all the CI, so this is ready for review
Thanks ALL
CI broke after this PR. We have this extended-tests-failing-after-merge situation happening very frequently recently. Maybe the decision to defer extended tests to post-merge was a wrong one. Committers are now unaware that their PRs can break main, and they only get to know this after the fact. We should discuss how to mitigate this. If we don't see an obvious solution, it may be prudent to go back to the inefficient but safe run-everything-in-the-PR mode. @alamb and @jayzhan211, what do you think?
Making extended tests optional BUT easily visible and running them before merge (maybe GitHub supports such UI?) seems like a better approach. This way, for minor changes or cases where we're confident in the outcome, we can choose to skip the tests. We also need to add more tests to SQLLogicTest to improve coverage. This failure highlights the need for additional tests to address the gap. I think we can do both.
If this is possible, certainly. If not, we will need to fall back to the old run-everything mode until we figure out a way to implement something like this. Having broken main commits frequently is not a sustainable practice.
The downside is that the "sqllogictests" thing takes 2 hours to run (and it takes quite a while to run even locally)
Yeah, I agree. The upside of the current approach is that at least now we know there is an issue that was introduced.
@buraksenn has some version of it here, but it was not working. I'll see if I can get someone to help out to make it work.
Filed a ticket to track this issue. And @jayzhan211 has a proposed fix:
…prPlanner, add `plan_aggregate` and `plan_window` to planner (apache#14689) * count planner * window * update slt * remove rule * rm rule * doc * fix name * fix name * fix test * tpch test * fix avro * rename * switch to count(*) * use count(*) * rename * doc * rename window funciotn * fmt * rm print * upd logic * count null
Which issue does this PR close?
Analyzer within LogicalPlan building stage #14618
Rationale for this change
We can convert count(*) to count(1) in ExprPlanner.
What changes are included in this PR?
Use name count_star() for wildcard.
Are these changes tested?
Are there any user-facing changes?
Where count(wildcard()) is used in the DataFrame API, it needs to change to count_wildcard() for the same functionality.