Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove CountWildcardRule in Analyzer and move the functionality in ExprPlanner, add plan_aggregate and plan_window to planner #14689

Merged
merged 25 commits into from
Feb 21, 2025

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Feb 16, 2025

Which issue does this PR close?

Rationale for this change

We can convert count(*) to count(1) in ExprPlanner.

What changes are included in this PR?

Use name count_star() for wildcard.

Are these changes tested?

Are there any user-facing changes?

count(wildcard()) is used in dataframe API, they need to change to count_wildcard() for the same functionality.

@jayzhan211 jayzhan211 added the api change Changes the API exposed to users of the crate label Feb 16, 2025
@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) functions labels Feb 16, 2025
@jayzhan211 jayzhan211 marked this pull request as ready for review February 16, 2025 09:25
if *distinct { "DISTINCT " } else { "" },
schema_name_from_exprs_comma_separated_without_space(args)?
)?;
// TODO: Make this customizable by adding `schema_name` for UDAF
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #14695

@alamb alamb changed the title Remove CountWildcardRule in Analyzer and move the functionality in ExprPlanner Remove CountWildcardRule in Analyzer and move the functionality in ExprPlanner, add plan_aggregate and plan_window to planner Feb 16, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THanks @jayzhan211 - this looks great to me.

I recommend merging #14695 first and then updating this PR to use that to show count(*)

I think that would make the diff much smaller and easier to understand

💯

FYI @jonahgao and @findepi

"| 10 | 110 | 20 |",
"+--------------+--------------+----------+",
"+--------------+--------------+--------------+",
"| sum(test.c1) | sum(test.c2) | count_star() |",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if this could stay count(*)

Perhaps that is why you implemented AggregateUDFImpl::display_name 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually #14695 is for others to keep count(*) given that I prefer count_star() :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, DuckDB formats COUNT(*) as count_star and preserves COUNT(<expr>) where expr isn't a wildcard.

In this case, it looks like we don't preserve COUNT(1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, it looks like we don't preserve COUNT(1).

Yes we don't. count(const_expr), count(), count(*) are all the same thing, I don't think we need to preserve them all

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we don't. count(const_expr), count(), count(*) are all the same thing, I don't think we need to preserve them all

fine for me.

calling it just count() (or count(1)) would be shorter and cleaner

in particular, the count_star function should not exist as user-callable syntax, (SELECT count_star() should fail with "function not found")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if this could stay count(*)

i agree ...

Perhaps that is why you implemented AggregateUDFImpl::display_name 🤔

... but not at any cost.
if displaying count(*) rather than (equivalent) count(1) requires ~200 lines of duplicated code, i'd stick with count(1)


/// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
///
/// Returns origin expression arguments if not possible
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns origin expression arguments if not possible
/// Returns original expression arguments if not possible


/// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
///
/// Returns origin expression arguments if not possible
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns origin expression arguments if not possible
/// Returns original expression arguments if not possible

@@ -2447,8 +2448,8 @@ async fn test_count_wildcard_on_sort() -> Result<()> {
let df_results = ctx
.table("t1")
.await?
.aggregate(vec![col("b")], vec![count(wildcard())])?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb alamb mentioned this pull request Feb 16, 2025
Ok(PlannerResult::Original(expr))
}

/// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make the doc here more general to window functions?

Suggested change
/// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
/// Plans window functions, such as `COUNT(<expr>)`

@@ -211,6 +214,23 @@ pub trait ExprPlanner: Debug + Send + Sync {
fn plan_any(&self, expr: RawBinaryExpr) -> Result<PlannerResult<RawBinaryExpr>> {
Ok(PlannerResult::Original(expr))
}

/// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here regarding wording

Suggested change
/// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
/// Plans aggregate functions, such as `COUNT(<expr>)`

"| 10 | 110 | 20 |",
"+--------------+--------------+----------+",
"+--------------+--------------+--------------+",
"| sum(test.c1) | sum(test.c2) | count_star() |",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, DuckDB formats COUNT(*) as count_star and preserves COUNT(<expr>) where expr isn't a wildcard.

In this case, it looks like we don't preserve COUNT(1).

"| 10 | 110 | 20 |",
"+--------------+--------------+----------+",
"+--------------+--------------+--------------+",
"| sum(test.c1) | sum(test.c2) | count_star() |",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we don't. count(const_expr), count(), count(*) are all the same thing, I don't think we need to preserve them all

fine for me.

calling it just count() (or count(1)) would be shorter and cleaner

in particular, the count_star function should not exist as user-callable syntax, (SELECT count_star() should fail with "function not found")

Comment on lines 82 to 84
/// Count(*), Count(), Count(1) are all equivalent expression
/// In DataFusion, we convert them to Count(1) expression
pub fn count_wildcard() -> Expr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wildcard here is a syntactical remnant.

this is a function to call all rows, so call it like that

Suggested change
/// Count(*), Count(), Count(1) are all equivalent expression
/// In DataFusion, we convert them to Count(1) expression
pub fn count_wildcard() -> Expr {
/// Creates aggregation to count all rows
pub fn count_all() -> Expr {

expr: RawAggregateExpr,
) -> Result<PlannerResult<RawAggregateExpr>> {
if expr.func.name() == "count"
&& (expr.args.len() == 1 && matches!(expr.args[0], Expr::Wildcard { .. })
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope we are able to remove Expr::Wildcard as a follow-up 🙏

@jayzhan211 jayzhan211 marked this pull request as draft February 17, 2025 13:55
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Feb 18, 2025

We need display name / schema name for WindowFunction as well

#14750

@alamb
Copy link
Contributor

alamb commented Feb 19, 2025

We need display name / schema name for WindowFunction as well

#14750

So close!

@jayzhan211 jayzhan211 marked this pull request as ready for review February 19, 2025 13:33
Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM %

Comment on lines 151 to 189
fn schema_name(&self, params: &AggregateFunctionParams) -> Result<String> {
let AggregateFunctionParams {
args,
distinct,
filter,
order_by,
null_treatment,
} = params;

let mut schema_name = String::new();

if !args.is_empty() && args[0] == Expr::Literal(COUNT_STAR_EXPANSION) {
schema_name.write_str("count(*)")?;
} else {
schema_name.write_fmt(format_args!(
"{}({}{})",
self.name(),
if *distinct { "DISTINCT " } else { "" },
schema_name_from_exprs(args)?
))?;
}

if let Some(null_treatment) = null_treatment {
schema_name.write_fmt(format_args!(" {}", null_treatment))?;
}

if let Some(filter) = filter {
schema_name.write_fmt(format_args!(" FILTER (WHERE {filter})"))?;
};

if let Some(order_by) = order_by {
schema_name.write_fmt(format_args!(
" ORDER BY [{}]",
schema_name_from_sorts(order_by)?
))?;
};

Ok(schema_name)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a long copy of the default implementation.
Overall we have 4 methods copied, 177 lines overall, where all we need is customize that count(1) is displayed as count(*). Not good for maintainability.

I wonder why the logic for formatting distinct, filter and order by is handed to the function itself, if it's attribute of the containing AggregateFunction. If we want to solve this, this could be a prep PR to avoid PR scope screep.
Otherwise it's better to leave count(1) as count(1), rather than copy so many lines, unless some other option exits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is part of the name so we must bring them all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't find any nice way to avoid the duplication.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if these name-generating functions were not overridden in the count aggregation?

Comment on lines 162 to 163
if !args.is_empty() && args[0] == Expr::Literal(COUNT_STAR_EXPANSION) {
schema_name.write_str("count(*)")?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • count(1) and count(2) are the same thing, so what about checking for args[0] to be a non-null constant?

  • count(1, a, b) is something else than count(1); this should check that args.len = 1

@@ -167,14 +170,14 @@ pub trait ExprPlanner: Debug + Send + Sync {

/// Plan an extract expression, such as`EXTRACT(month FROM foo)`
///
/// Returns origin expression arguments if not possible
/// Returns original expression arguments if not possible
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good change. nit Could go in separate PR to keep PR size lower.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can keep this

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍👍

match args {
[] => true, // count()
// All const should be coerced to int64 or rejected by the signature
[Expr::Literal(ScalarValue::Int64(_))] => true, // count(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to consider select count(null::bigint)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count(null) is not wildcard

D create table t(a int);
D insert into t values (1);
D select count(null) from t;
┌─────────────┐
│ count(NULL) │
│    int64    │
├─────────────┤
│           0 │
└─────────────┘

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is_count_wildcard() should reject it.
It only has a problem in the display of logical plan.
CleanShot 2025-02-20 at 11 10 02@2x

@jayzhan211 jayzhan211 marked this pull request as draft February 20, 2025 10:53
@jayzhan211 jayzhan211 marked this pull request as ready for review February 20, 2025 12:53
@jayzhan211
Copy link
Contributor Author

pyarrow is failing across all the CI, so this is ready for review

@jayzhan211 jayzhan211 merged commit e03f9f6 into apache:main Feb 21, 2025
24 checks passed
@jayzhan211 jayzhan211 deleted the rm-count-wildcard-rule branch February 21, 2025 23:45
@jayzhan211
Copy link
Contributor Author

Thanks ALL

@ozankabak
Copy link
Contributor

ozankabak commented Feb 22, 2025

CI broke after this PR.

We have this extended-tests-failing-after-merge situation happening very frequently recently. Maybe the decision to defer extended tests to post-merge was a wrong one. Committers are now unaware that their PRs can break main, and they only get to know this after the fact.

We should discuss how to mitigate this. If we don't see an obvious solution, it may be prudent to go back to the inefficient but safe run-everything-in-the-PR mode. @alamb and @jayzhan211, what do you think?

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Feb 22, 2025

Making extended tests optional BUT easily visible and run it before merge (maybe github supports such UI?) seems like a better approach. This way, for minor changes or cases where we're confident in the outcome, we can choose to skip the tests.
#14319

We also need to add more tests to SQLLogicTest to improve coverage. This failure highlights the need for additional tests to address the gap.
#14824

I think we can do both

@ozankabak
Copy link
Contributor

Making extended tests optional BUT easily visible and run it before merge (maybe github supports such UI?) seems like a better approach.

If this is possible, certainly. If not, we will need to fall back to the old run-everything mode until we figure out a way to implement something like this. Having broken main commits frequently is not a sustainable practice.

@alamb
Copy link
Contributor

alamb commented Feb 22, 2025

Making extended tests optional BUT easily visible and run it before merge (maybe github supports such UI?) seems like a better approach.

If this is possible, certainly. If not, we will need to fall back to the old run-everything mode until we figure out a way to implement something like this.

The downside is that the "sqllogictests" thing takes 2 hours to run (and it takes quite a while to run even locally)

Having broken main commits frequently is not a sustainable practice.

Yeah, I agree

The upside of the current approach is that at least now we know there is an issue that was introduced.

I had hoped we would be able to run the extended suite on PRs by now

@buraksenn has some version of it here, but it was not working

I'll see if I can get someone to help out to make it work

@alamb
Copy link
Contributor

alamb commented Feb 24, 2025

ozankabak pushed a commit to synnada-ai/datafusion-upstream that referenced this pull request Feb 25, 2025
…prPlanner, add `plan_aggregate` and `plan_window` to planner (apache#14689)

* count planner

* window

* update slt

* remove rule

* rm rule

* doc

* fix name

* fix name

* fix test

* tpch test

* fix avro

* rename

* switch to count(*)

* use count(*)

* rename

* doc

* rename window funciotn

* fmt

* rm print

* upd logic

* count null
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate functions logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants