Write a blog post fast Vectorized grouping for high cardinality #6988

Closed
Tracked by #6889
alamb opened this issue Jul 16, 2023 · 4 comments · Fixed by apache/arrow-site#386

alamb commented Jul 16, 2023

The idea here is to write a blog post explaining / motivating the improvement in DataFusion grouping made in #6904

alamb changed the title from "Write a blog post about it" to "Write a blog post fast Vectorized grouping for high cardinality" on Jul 16, 2023
alamb self-assigned this on Jul 16, 2023
alamb added the devrel label on Jul 24, 2023

alamb commented Jul 24, 2023

I have drafted a blog post about this with @tustvold and @Dandandan. It will be published on the InfluxData blog first, and then I will propose reposting it on the Arrow blog site. I expect to have a draft up later this week.


alamb commented Aug 2, 2023

Here is a blog post we wrote about how to do high cardinality grouping really fast: https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/

I will propose a PR to cross-post the content to the arrow blog as well in the coming days


alamb commented Aug 5, 2023

PR on arrow-site ready: apache/arrow-site#386

alamb added a commit to apache/arrow-site that referenced this issue Aug 14, 2023
…ion 28.0.0 (#386)

Closes apache/datafusion#6988

**Note**: This describes work @tustvold, @Dandandan, and I did in
DataFusion 28.0.0. This content was originally published on the
[InfluxData
Blog](https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/),
but since it is generally applicable to Apache Arrow DataFusion I would
like to syndicate it here because:
1. This is a forum where the community can comment / keep it up to date
via PR
2. It is hosted on a platform with a different lifetime than a company
blog

This is the same model we followed with
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
which was also republished on the arrow blog after the InfluxData blog

It also gives me an example to use my original ASCII art diagrams :)

alamb commented Aug 14, 2023

It is now re-published on https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/
