Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add additional counting aggregations to AggBy #6358

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

lbooker42
Copy link
Contributor

New aggregations are:

  • AggCountNonNull()
  • AggCountNull()
  • AggCountNegative()
  • AggCountPositive()
  • AggCountZero()
  • AggCountNaN()
  • AggCountInfinite()
  • AggCountFinite()

@lbooker42 lbooker42 self-assigned this Nov 8, 2024
@lbooker42 lbooker42 added this to the 0.38.0 milestone Nov 8, 2024
Comment on lines +46 to +47
@Parameter
public abstract AggCountType countType();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll post the same sort of concerns here that I have for #6270; I think we would do better to express this in terms of a Filter. What if I want to count NULL or NAN or INFINITE? Or any combination?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea, would prefer to see this as a new feature suggestion.

These agg_by ops are companions to Numeric / Basic vector ops (and will be paralleled by update_by cum_count and rolling_count variants) and this filter idea isn't trivial to extend to all these routines.

@lbooker42
Copy link
Contributor Author

lbooker42 commented Nov 20, 2024

Note on extensive use of lambda and performance

I ran repeated tests on 4 different VM (arm64 on Mac M1) and verified that the lambda-heavy ops are not penalized and the lambda are (apparently) in-lined by the JIT.

The comparison is with AggSum that uses type-specialized operators but evaluates every value, AggCountNonNull that uses lambdas and evaluates every value, and AggCountAll (which aliases AggCount) that evaluates no values, simply returning the size of the bucket rowset.

image

The expected (and confirmed) result is that the new counting ops will fall between AggSum and AggCountAll/AggCount in performance.

Copy link
Contributor

@jmao-denver jmao-denver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python changes LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants