Remove CountWildcardRule in Analyzer and move the functionality in ExprPlanner, add `plan_aggregate` and `plan_window` to planner #14689

jayzhan211 · 2025-02-16T05:14:38Z

Which issue does this PR close?

Part of Integrate Analyzer within LogicalPlan building stage #14618

Rationale for this change

We can convert count(*) to count(1) in ExprPlanner.

What changes are included in this PR?

Use name count_star() for wildcard.

Are these changes tested?

Are there any user-facing changes?

Users who call count(wildcard()) in the DataFrame API need to switch to count_wildcard() for the same functionality.


jayzhan211 · 2025-02-16T09:26:37Z

datafusion/expr/src/expr.rs

-                    if *distinct { "DISTINCT " } else { "" },
-                    schema_name_from_exprs_comma_separated_without_space(args)?
-                )?;
+                // TODO: Make this customizable by adding `schema_name` for UDAF


alamb

Thanks @jayzhan211 - this looks great to me.

I recommend merging #14695 first and then updating this PR to use that to show count(*)

I think that would make the diff much smaller and easier to understand

💯

FYI @jonahgao and @findepi

alamb · 2025-02-16T13:54:07Z

datafusion/core/src/execution/context/csv.rs

-            "| 10           | 110          | 20       |",
-            "+--------------+--------------+----------+",
+            "+--------------+--------------+--------------+",
+            "| sum(test.c1) | sum(test.c2) | count_star() |",


It would be great if this could stay count(*)

Perhaps that is why you implemented AggregateUDFImpl::display_name 🤔

AggregateUDFImpl::schema_name and AggregateUDFImpl::display_name for customizable name #14695

Actually #14695 is for others to keep count(*) given that I prefer count_star() :)

FWIW, DuckDB formats COUNT(*) as count_star and preserves COUNT(<expr>) where expr isn't a wildcard.

In this case, it looks like we don't preserve COUNT(1).

> In this case, it looks like we don't preserve COUNT(1).

Yes we don't. count(const_expr), count(), count(*) are all the same thing, I don't think we need to preserve them all

> Yes we don't. count(const_expr), count(), count(*) are all the same thing, I don't think we need to preserve them all

fine for me.

calling it just count() (or count(1)) would be shorter and cleaner

in particular, the count_star function should not exist as user-callable syntax, (SELECT count_star() should fail with "function not found")

> It would be great if this could stay count(*)

i agree ...

> Perhaps that is why you implemented AggregateUDFImpl::display_name 🤔

... but not at any cost.
if displaying count(*) rather than (equivalent) count(1) requires ~200 lines of duplicated code, i'd stick with count(1)

alamb · 2025-02-16T13:54:49Z

datafusion/expr/src/planner.rs

+
+    /// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
+    ///
+    /// Returns origin expression arguments if not possible


Suggested change

-    /// Returns origin expression arguments if not possible
+    /// Returns original expression arguments if not possible

alamb · 2025-02-16T13:54:53Z

datafusion/expr/src/planner.rs

+
+    /// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
+    ///
+    /// Returns origin expression arguments if not possible


Suggested change

-    /// Returns origin expression arguments if not possible
+    /// Returns original expression arguments if not possible

alamb · 2025-02-16T13:57:58Z

datafusion/core/tests/dataframe/mod.rs

@@ -2447,8 +2448,8 @@ async fn test_count_wildcard_on_sort() -> Result<()> {
    let df_results = ctx
        .table("t1")
        .await?
-        .aggregate(vec![col("b")], vec![count(wildcard())])?


Should we also deprecate wildcard() and Expr::Wildcard (in a follow on PR?)

https://docs.rs/datafusion/latest/datafusion/logical_expr/fn.wildcard.html
https://docs.rs/datafusion/latest/datafusion/prelude/enum.Expr.html#variant.Wildcard

🤔

rkrishn7 · 2025-02-17T03:53:47Z

datafusion/expr/src/planner.rs

+        Ok(PlannerResult::Original(expr))
+    }
+
+    /// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`


Should we make the doc here more general to window functions?

Suggested change

-    /// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
+    /// Plans window functions, such as `COUNT(<expr>)`

rkrishn7 · 2025-02-17T03:54:45Z

datafusion/expr/src/planner.rs

@@ -211,6 +214,23 @@ pub trait ExprPlanner: Debug + Send + Sync {
    fn plan_any(&self, expr: RawBinaryExpr) -> Result<PlannerResult<RawBinaryExpr>> {
        Ok(PlannerResult::Original(expr))
    }
+
+    /// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`


Same here regarding wording

Suggested change

-    /// Plans Count(exprs), e.g., `COUNT(*) to Count(1)`
+    /// Plans aggregate functions, such as `COUNT(<expr>)`
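
To make the `Planned` / `Original` contract behind these doc comments concrete, here is a minimal, self-contained sketch. `Expr`, `RawAggregateExpr`, `PlannerResult`, and `CountPlanner` below are simplified stand-ins written for this illustration, not DataFusion's real definitions (for instance, this `Planned` variant carries the rewritten raw expression so the toy stays small). Treating an empty argument list like `count(*)` is an assumption based on the remark in this thread that `count()` and `count(*)` are equivalent.

```rust
// Toy stand-ins for the planner types; names mirror the PR, but the
// definitions are simplified assumptions, not the real DataFusion API.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Wildcard,
    Literal(i64),
    Column(String),
}

#[derive(Debug, Clone, PartialEq)]
struct RawAggregateExpr {
    func_name: String,
    args: Vec<Expr>,
}

#[derive(Debug, PartialEq)]
enum PlannerResult<T> {
    /// The planner rewrote the expression.
    Planned(T),
    /// The planner declined; the caller gets its input back unchanged.
    Original(T),
}

trait ExprPlanner {
    /// Default behaviour: decline and hand the original expression back.
    fn plan_aggregate(&self, expr: RawAggregateExpr) -> PlannerResult<RawAggregateExpr> {
        PlannerResult::Original(expr)
    }
}

/// Hypothetical planner that rewrites `count(*)` and `count()` to `count(1)`.
struct CountPlanner;

impl ExprPlanner for CountPlanner {
    fn plan_aggregate(&self, expr: RawAggregateExpr) -> PlannerResult<RawAggregateExpr> {
        let is_count_star = expr.func_name == "count"
            && (expr.args.is_empty() || matches!(expr.args.as_slice(), [Expr::Wildcard]));
        if is_count_star {
            // Replace the wildcard (or missing) argument with the literal 1.
            PlannerResult::Planned(RawAggregateExpr {
                func_name: expr.func_name,
                args: vec![Expr::Literal(1)],
            })
        } else {
            PlannerResult::Original(expr)
        }
    }
}
```

With this shape, `CountPlanner.plan_aggregate` on `count(*)` yields `Planned` with a single literal `1` argument, while `count(a)` comes back as `Original`, untouched.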

rkrishn7 · 2025-02-17T04:04:20Z

datafusion/core/src/execution/context/csv.rs

-            "| 10           | 110          | 20       |",
-            "+--------------+--------------+----------+",
+            "+--------------+--------------+--------------+",
+            "| sum(test.c1) | sum(test.c2) | count_star() |",


FWIW, DuckDB formats COUNT(*) as count_star and preserves COUNT(<expr>) where expr isn't a wildcard.

In this case, it looks like we don't preserve COUNT(1).

findepi · 2025-02-17T08:01:24Z

datafusion/core/src/execution/context/csv.rs

-            "| 10           | 110          | 20       |",
-            "+--------------+--------------+----------+",
+            "+--------------+--------------+--------------+",
+            "| sum(test.c1) | sum(test.c2) | count_star() |",


Yes we don't. count(const_expr), count(), count(*) are all the same thing, I don't think we need to preserve them all

fine for me.

calling it just count() (or count(1)) would be shorter and cleaner

in particular, the count_star function should not exist as user-callable syntax, (SELECT count_star() should fail with "function not found")

findepi · 2025-02-17T08:04:50Z

datafusion/functions-aggregate/src/count.rs

+/// Count(*), Count(), Count(1) are all equivalent expression
+/// In DataFusion, we convert them to Count(1) expression
+pub fn count_wildcard() -> Expr {


wildcard here is a syntactical remnant.

this is a function to count all rows, so call it like that

Suggested change

-/// Count(*), Count(), Count(1) are all equivalent expression
-/// In DataFusion, we convert them to Count(1) expression
-pub fn count_wildcard() -> Expr {
+/// Creates aggregation to count all rows
+pub fn count_all() -> Expr {

findepi · 2025-02-17T08:05:57Z

datafusion/functions-aggregate/src/planner.rs

+        expr: RawAggregateExpr,
+    ) -> Result<PlannerResult<RawAggregateExpr>> {
+        if expr.func.name() == "count"
+            && (expr.args.len() == 1 && matches!(expr.args[0], Expr::Wildcard { .. })


I hope we are able to remove Expr::Wildcard as a follow-up 🙏


jayzhan211 · 2025-02-18T01:21:41Z

We need display name / schema name for WindowFunction as well

#14750

alamb · 2025-02-19T12:23:03Z

> We need display name / schema name for WindowFunction as well
>
> #14750

So close!


findepi

LGTM %

findepi · 2025-02-19T14:08:36Z

datafusion/functions-aggregate/src/count.rs

+    fn schema_name(&self, params: &AggregateFunctionParams) -> Result<String> {
+        let AggregateFunctionParams {
+            args,
+            distinct,
+            filter,
+            order_by,
+            null_treatment,
+        } = params;
+
+        let mut schema_name = String::new();
+
+        if !args.is_empty() && args[0] == Expr::Literal(COUNT_STAR_EXPANSION) {
+            schema_name.write_str("count(*)")?;
+        } else {
+            schema_name.write_fmt(format_args!(
+                "{}({}{})",
+                self.name(),
+                if *distinct { "DISTINCT " } else { "" },
+                schema_name_from_exprs(args)?
+            ))?;
+        }
+
+        if let Some(null_treatment) = null_treatment {
+            schema_name.write_fmt(format_args!(" {}", null_treatment))?;
+        }
+
+        if let Some(filter) = filter {
+            schema_name.write_fmt(format_args!(" FILTER (WHERE {filter})"))?;
+        };
+
+        if let Some(order_by) = order_by {
+            schema_name.write_fmt(format_args!(
+                " ORDER BY [{}]",
+                schema_name_from_sorts(order_by)?
+            ))?;
+        };
+
+        Ok(schema_name)
+    }


This looks like a long copy of the default implementation.
Overall we have 4 methods copied, 177 lines in total, where all we need is to customize that count(1) is displayed as count(*). Not good for maintainability.

I wonder why the logic for formatting distinct, filter and order by is handed to the function itself, if these are attributes of the containing AggregateFunction. If we want to solve this, it could be a prep PR to avoid PR scope creep.
Otherwise it's better to leave count(1) as count(1), rather than copy so many lines, unless some other option exists.

Since they are part of the name, we must bring them all.

Didn't find any nice way to avoid the duplication.

What would happen if these name-generating functions were not overridden in the count aggregation?

findepi · 2025-02-19T14:10:32Z

datafusion/functions-aggregate/src/count.rs

+        if !args.is_empty() && args[0] == Expr::Literal(COUNT_STAR_EXPANSION) {
+            schema_name.write_str("count(*)")?;


count(1) and count(2) are the same thing, so what about checking for args[0] to be a non-null constant?

count(1, a, b) is something other than count(1); this should check that args.len() == 1
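
A small sketch of the length check being asked for here, using toy types rather than DataFusion's `Expr`. `Arg`, `arg_name`, and `count_schema_name` are hypothetical names for this illustration, and the value `1` for `COUNT_STAR_EXPANSION` as well as the `Int64(..)` rendering of literals are assumptions. It only covers the call part of the name, not the filter, order-by, or null-treatment suffixes.

```rust
// Simplified stand-in for an aggregate argument; not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Arg {
    Int64(Option<i64>),
    Column(String),
}

/// Assumed value that `count(*)` is expanded to in this sketch.
const COUNT_STAR_EXPANSION: Arg = Arg::Int64(Some(1));

/// Renders a single argument; the literal format is an assumption.
fn arg_name(arg: &Arg) -> String {
    match arg {
        Arg::Int64(Some(v)) => format!("Int64({})", v),
        Arg::Int64(None) => "Int64(NULL)".to_string(),
        Arg::Column(name) => name.clone(),
    }
}

/// Builds the call part of the schema name, e.g. `count(*)` or `count(a)`.
fn count_schema_name(args: &[Arg], distinct: bool) -> String {
    // Require exactly one argument so that `count(1, a)` is not shortened.
    if args.len() == 1 && args[0] == COUNT_STAR_EXPANSION {
        return "count(*)".to_string();
    }
    let rendered: Vec<String> = args.iter().map(arg_name).collect();
    format!(
        "count({}{})",
        if distinct { "DISTINCT " } else { "" },
        rendered.join(",")
    )
}
```

In this sketch only a lone expansion literal is displayed as `count(*)`; `count(1, a)` keeps its full argument list, which is the distinction the `!args.is_empty()` check in the diff would miss.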

findepi · 2025-02-19T14:15:32Z

datafusion/expr/src/planner.rs

@@ -167,14 +170,14 @@ pub trait ExprPlanner: Debug + Send + Sync {

    /// Plan an extract expression, such as`EXTRACT(month FROM foo)`
    ///
-    /// Returns origin expression arguments if not possible
+    /// Returns original expression arguments if not possible


This is a good change. Nit: it could go in a separate PR to keep the PR size lower.

I think we can keep this

jonahgao

LGTM 👍👍

jonahgao · 2025-02-20T02:42:52Z

datafusion/functions-aggregate/src/count.rs

+    match args {
+        [] => true, // count()
+        // All const should be coerced to int64 or rejected by the signature
+        [Expr::Literal(ScalarValue::Int64(_))] => true, // count(1)


We might need to consider select count(null::bigint)

count(null) is not wildcard

D create table t(a int);
D insert into t values (1);
D select count(null) from t;
┌─────────────┐
│ count(NULL) │
│    int64    │
├─────────────┤
│      0      │
└─────────────┘

So is_count_wildcard() should reject it.
It only has a problem in the display of logical plan.
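
For illustration, a null-rejecting version of the check could look like the sketch below. The `ScalarValue` and `Expr` enums are cut-down stand-ins defined here, and this `is_count_wildcard` is an assumption about the intended behaviour, not the code that was merged.

```rust
// Cut-down stand-ins for the real types, defined only for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(ScalarValue),
    Column(String),
}

/// Returns true when the arguments mean "count all rows".
fn is_count_wildcard(args: &[Expr]) -> bool {
    match args {
        // count()
        [] => true,
        // count(1), count(2), ...: a single non-null Int64 constant.
        // `Some(_)` excludes count(NULL), which counts zero rows.
        [Expr::Literal(ScalarValue::Int64(Some(_)))] => true,
        // Columns, other literals, and multiple arguments are ordinary counts.
        _ => false,
    }
}
```

Matching on `Int64(Some(_))` instead of `Int64(_)` is what keeps `count(null::bigint)` from being displayed as `count(*)`.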

jayzhan211 · 2025-02-20T12:54:16Z

pyarrow is failing across all the CI, so this is ready for review


jayzhan211 · 2025-02-21T23:46:09Z

Thanks ALL

ozankabak · 2025-02-22T06:12:42Z

CI broke after this PR.

We have this extended-tests-failing-after-merge situation happening very frequently recently. Maybe the decision to defer extended tests to post-merge was a wrong one. Committers are now unaware that their PRs can break main, and they only get to know this after the fact.

We should discuss how to mitigate this. If we don't see an obvious solution, it may be prudent to go back to the inefficient but safe run-everything-in-the-PR mode. @alamb and @jayzhan211, what do you think?

jayzhan211 · 2025-02-22T12:08:27Z

Making extended tests optional BUT easily visible and run it before merge (maybe github supports such UI?) seems like a better approach. This way, for minor changes or cases where we're confident in the outcome, we can choose to skip the tests.
#14319

We also need to add more tests to SQLLogicTest to improve coverage. This failure highlights the need for additional tests to address the gap.
#14824

I think we can do both

ozankabak · 2025-02-22T12:17:15Z

> Making extended tests optional BUT easily visible and run it before merge (maybe github supports such UI?) seems like a better approach.

If this is possible, certainly. If not, we will need to fall back to the old run-everything mode until we figure out a way to implement something like this. Having broken main commits frequently is not a sustainable practice.

alamb · 2025-02-22T12:36:44Z

> > Making extended tests optional BUT easily visible and run it before merge (maybe github supports such UI?) seems like a better approach.
>
> If this is possible, certainly. If not, we will need to fall back to the old run-everything mode until we figure out a way to implement something like this.

The downside is that the "sqllogictests" thing takes 2 hours to run (and it takes quite a while to run even locally)

> Having broken main commits frequently is not a sustainable practice.

Yeah, I agree

The upside of the current approach is that at least now we know there is an issue that was introduced.

I had hoped we would be able to run the extended suite on PRs by now

Add a way to trigger the extended test suite from a PR #14319

@buraksenn has some version of it here, but it was not working

add manual trigger for extended tests in pull requests #14331

I'll see if I can get someone to help out to make it work

alamb · 2025-02-24T14:20:53Z

Filed a ticket to track this issue

Extended sqllite tests are failing on main #14853

And @jayzhan211 has a proposed fix:

fix duplicated schema name error from count wildcard #14824

…prPlanner, add `plan_aggregate` and `plan_window` to planner (apache#14689) * count planner * window * update slt * remove rule * rm rule * doc * fix name * fix name * fix test * tpch test * fix avro * rename * switch to count(*) * use count(*) * rename * doc * rename window funciotn * fmt * rm print * upd logic * count null

jayzhan211 added 6 commits February 16, 2025 11:49

count planner

29e77c1

window

64a5694

update slt

350ce74

remove rule

9165168

rm rule

6ae8b44

doc

47bd66c

jayzhan211 added the api change Changes the API exposed to users of the crate label Feb 16, 2025

github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) functions labels Feb 16, 2025

jayzhan211 added 2 commits February 16, 2025 13:41

fix name

ece96ea

fix name

2740604

github-actions bot added the substrait label Feb 16, 2025

fix test

92362e6

jayzhan211 mentioned this pull request Feb 16, 2025

AggregateUDFImpl::schema_name and AggregateUDFImpl::display_name for customizable name #14695

Merged

jayzhan211 added 3 commits February 16, 2025 16:12

tpch test

7afc362

Merge branch 'main' of github.com:apache/datafusion into rm-count-wil…

4d38a31

…dcard-rule

fix avro

33c6e08

jayzhan211 marked this pull request as ready for review February 16, 2025 09:25

jayzhan211 commented Feb 16, 2025


alamb changed the title ~~Remove CountWildcardRule in Analyzer and move the functionality in ExprPlanner~~ Remove CountWildcardRule in Analyzer and move the functionality in ExprPlanner, add plan_aggregate and plan_window to planner Feb 16, 2025

alamb approved these changes Feb 16, 2025


alamb mentioned this pull request Feb 16, 2025

alamb's review queue #14698

Closed

rkrishn7 reviewed Feb 17, 2025


findepi approved these changes Feb 17, 2025


jayzhan211 marked this pull request as draft February 17, 2025 13:55

Merge branch 'main' of github.com:apache/datafusion into rm-count-wil…

4fd79fd

…dcard-rule

jayzhan211 added 4 commits February 19, 2025 20:44

Merge branch 'main' of github.com:apache/datafusion into rm-count-wil…

497d201

…dcard-rule

rename window funciotn

20664e4

fmt

5c2acd5

rm print

7379fad

jayzhan211 marked this pull request as ready for review February 19, 2025 13:33

jayzhan211 requested review from alamb, findepi and rkrishn7 February 19, 2025 13:33

findepi reviewed Feb 19, 2025


upd logic

7211656

jonahgao approved these changes Feb 20, 2025


jayzhan211 marked this pull request as draft February 20, 2025 10:53

count null

f31d574

jayzhan211 marked this pull request as ready for review February 20, 2025 12:53

Merge branch 'main' of github.com:apache/datafusion into rm-count-wil…

7f524a4

…dcard-rule

jayzhan211 merged commit e03f9f6 into apache:main Feb 21, 2025
24 checks passed

jayzhan211 deleted the rm-count-wildcard-rule branch February 21, 2025 23:45

This was referenced Feb 22, 2025

Add a way to trigger the extended test suite from a PR #14319

Open

Extended sqllite tests are failing on main #14853

Open

alamb mentioned this pull request Feb 24, 2025

Regression since 45.0.0: select count(), count(*) does not work #14855

Open

