From 18243e0ab8181bc79321cb5d02ae1baf694ebecc Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 7 Jan 2024 08:53:34 -0500 Subject: [PATCH 1/2] [Website]: DataFusion 26-34 blog --- _posts/2024-01-25-datafusion-34.0.0.md | 203 +++++++++++++++++++++++++ 1 file changed, 203 insertions(+) create mode 100644 _posts/2024-01-25-datafusion-34.0.0.md diff --git a/_posts/2024-01-25-datafusion-34.0.0.md b/_posts/2024-01-25-datafusion-34.0.0.md new file mode 100644 index 000000000000..f2a38f1f1172 --- /dev/null +++ b/_posts/2024-01-25-datafusion-34.0.0.md @@ -0,0 +1,203 @@ +--- +layout: post +title: "Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024" +date: "2023-06-24 00:00:00" +author: pmc +categories: [release] +--- + + + +## Introduction + +[Apache Arrow DataFusion] is an extensible query engine and database +toolkit, written in [Rust], that uses [Apache Arrow] as its in-memory +format. DataFusion is targeted primarily at developers creating other data +intensive analytics. + + +[apache arrow datafusion]: https://arrow.apache.org/datafusion/ +[apache arrow]: https://arrow.apache.org +[rust]: https://www.rust-lang.org/ + +We recently [released DataFusion 34.0.0]. This blog highlights some of the major +improvements since we released [DataFusion 26.0.0] -– spoiler alert it is a lot +– and a preview of where the community plans to head in the next 6 months. + +[Apache Arrow DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/. +[released DataFusion 34.0.0]: https://crates.io/crates/datafusion/34.0.0 + +This is also likely to be our last update blog post on the apache arrow site – +future updates will likely be on our own website as we are working to [graduate +to a top level project] (Apache Arrow DataFusion → Apache DataFusion) to help +focus governance and help the project grow. We plan to have our [first DataFusion in person meetup] in March 2024. + + +[graduate to a top level project]: https://github.com/apache/arrow-datafusion/discussions/6475 +[first DataFusion in person meetup]: https://github.com/apache/arrow-datafusion/discussions/8522 + +DataFusion is very much a community endeavor. Our core thesis is that as a +community we can build much more advanced technology than any individual or +company could do alone. In the last 6 months between `26.0.0` and `34.0.0`, +community growth has been strong. We have accepted and reviewed over a +thousand PRs from 124 different committers. XXX issues have been created and YYY issues +have been closed. + + + +The rest of this post highlights some of the improvements we have made +to DataFusion over the last 6 months and a preview of where we are +heading. You can see a list of all changes in the detailed +[CHANGELOG]. + +[CHANGELOG]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md + +# Improved Performance +Performance is a key feature of DataFusion and we have made major improvements since 25.0.0 + +(TODO get benchmark runs of TPCH and ClickBench between 25 and 34) + +Some key improvements we made: +* [2-3x Better aggregation performance with many distinct groups] +* Partially ordered grouping / streaming grouping (reduced memory) +* Joins (what to highlight??) +* TopK (ORDER BY LIMIT XXX) +* Specialized min(col) GROUP BY ORDER by min(col) LIMIT 1 type query (TODO link / better descrioption +* Improved join performance would maybe be another thing to highlight. +* Improved sort order awareness: TODO link, example showing that we avoid resorting + +[2-3x Better aggregation performance with many distinct groups]: https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/ + + +# New Features + +## DML / Insert / Creating Files + +DataFusion now supports writing, in parallel to Parquet, CSV, JSON (ARROW?) and +soon AVRO. This includes writing in parallel to individual and multiple files + +You can do this via `CREATE EXTERNAL TABLE`, for example: + +```sql +(TODO example) + + +``` + +As well as the COPY command (modeled after DuckDB’s copy (TODO duckdb copy blog link) + +```sql + +(TODO example) +``` + + +## Improved Struct/array support + +26.0.0 had basic support for structs and arrays, but 34.0.0 has much improved +support, and a full range of functions (TODO link to docs) for working with +structs and arrays. + +(@izveigor ❤️ and jayzhan, and wejun) – + +```sql +(TODO make a good example showing how arrays are used example) +``` + +# Other New Features: +* Group by spill to disk (TODO link) +* Join memory limiting (TODO find when this wa done) + + +# Easier to use DataFusion to build systems + +DataFusion’s primary design goal is as the basis used to create other systems +(TODO link to doc explaining that), though it is a reasonably good experience +out of the box as a dataframe library (TODO link to datafusion-python) and +command line SQL tool (todo link to datafusion-cli docs) + +Part of making it easier to use DataFusion is improving the documentation. To +that end we have created a new [Library Users Guide] to help people build the +applications on top of DataFusion. + +[Library Users Guide]:https://arrow.apache.org/datafusion/library-user-guide/index.html + +We have also added User Defined Window Functions (TODO link to documentation) , User Defined Table Functions (TODO link to doc) + + +### Growth of DataFusion in the Wild + +DataFusion has also appeared more in public conversations such as: +* Found out about many new projects that are being built with DataFusion such as lancedb, XX and YYY (TODO links)) +* A talk at [CommunityOverCode Asia 2023] titled [Apache Arrow Datafusion: Vectorized +Execution Framework For Maximum Performance] +* Blogs such as +[Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0]. + + +[Apache Arrow Datafusion: Vectorized Execution Framework For Maximum Performance]: https://www.youtube.com/watch?v=AJU9rdRNk9I +[CommunityOverCode Asia 2023]: https://www.bagevent.com/event/8432178 +[Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0]: https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/ + +We also submitted a paper(TODO LINK) to SIGMOD 2024 (TODO link), one of the +premiere database conferences, describing DataFusion in a technically formal +style, and makes the case that a modular query engine such as DataFusion does +not have to sacrifice performance. We hope this can help people who may be +considering DataFusion to decide if it is a good fit for their needs. + + +# DataFusion in 2024 + +This year, some major initiatives we aim to undertake are: + +1. Split apart the system so it is even more modular, including unifying how functions are defined to make it easier to mix and match. Improved APIs (trait based) APIs for defining extensions for Scalar, Aggregate, Window functions + +2. Better testing and CI infrastructure, including fuzz testing, and better functional and performance regression testing. +2. Performance: Improved planning performance (both for wide tables of 1000s of columns, as well as in general improving efficiency of the planning process) TODO find ticket. +3. Performance of performance on high cardinality grouping Also, our aggregation is still about 2x slower than state of the art (e.g. DuckDB) for “very high cardinality” aggregates, and we hope to improve there as well +4. Better statistics handling such as XXX, YYY with an eye towards more sophisticated analysis and cost models. +5. Further invest in advanced optimization techniques such as additional predicate pruning, bloom filtering, etc (TODO link). This largely takes the form of logic rules that can take arbitrary predicates (EXMAPLE) and prove they can not be true given information at hand, either statistics from parquet files, bloom filter,s or other sources). This is a key building block for both fast performance within datafusion as well as other systems that use datafusion expressions + + + + +# How to Get Involved + +If you are interested in contributing to DataFusion, we would love to have you +join us. You can try out DataFusion on some of your own data and projects and +let us know how it goes, contribute suggestions, documentation, bug reports, or +contribute a PR with documentation, tests or code. A list of open issues +suitable for beginners is [here]. + +Also looking for help to restart biweekly calls / meetings. Timezones are always a +challenge for such meetings, but we hope to have two calls that can work for +most attendees + +Check out our [Communication Doc] for more ways to engage with the +community. + +[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 +[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html From 78b6625695fbd3c9e7a22537e6fc9afb2a1b8476 Mon Sep 17 00:00:00 2001 From: Bruce Ritchie Date: Sun, 7 Jan 2024 11:10:05 -0500 Subject: [PATCH 2/2] Links and minor textual improvements --- _posts/2024-01-25-datafusion-34.0.0.md | 45 +++++++++++++++----------- 1 file changed, 27 insertions(+), 18 deletions(-) diff --git a/_posts/2024-01-25-datafusion-34.0.0.md b/_posts/2024-01-25-datafusion-34.0.0.md index f2a38f1f1172..ea2e45fce174 100644 --- a/_posts/2024-01-25-datafusion-34.0.0.md +++ b/_posts/2024-01-25-datafusion-34.0.0.md @@ -38,10 +38,10 @@ intensive analytics. [rust]: https://www.rust-lang.org/ We recently [released DataFusion 34.0.0]. This blog highlights some of the major -improvements since we released [DataFusion 26.0.0] -– spoiler alert it is a lot +improvements since we [released DataFusion 26.0.0] -– spoiler alert it is a lot – and a preview of where the community plans to head in the next 6 months. -[Apache Arrow DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/. +[released DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/. [released DataFusion 34.0.0]: https://crates.io/crates/datafusion/34.0.0 This is also likely to be our last update blog post on the apache arrow site – @@ -96,8 +96,8 @@ Some key improvements we made: ## DML / Insert / Creating Files -DataFusion now supports writing, in parallel to Parquet, CSV, JSON (ARROW?) and -soon AVRO. This includes writing in parallel to individual and multiple files +DataFusion now supports writing in parallel to Parquet, CSV, JSON (ARROW?) and +soon AVRO. This includes writing in parallel to individual and multiple files. You can do this via `CREATE EXTERNAL TABLE`, for example: @@ -118,8 +118,10 @@ As well as the COPY command (modeled after DuckDB’s copy (TODO duckdb copy blo ## Improved Struct/array support 26.0.0 had basic support for structs and arrays, but 34.0.0 has much improved -support, and a full range of functions (TODO link to docs) for working with -structs and arrays. +support and a full range of functions for working with [structs] and [arrays]. + +[structs]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#struct-functions +[arrays]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#array-functions (@izveigor ❤️ and jayzhan, and wejun) – @@ -128,16 +130,19 @@ structs and arrays. ``` # Other New Features: -* Group by spill to disk (TODO link) +* Group by spill to disk * Join memory limiting (TODO find when this wa done) +[Group by spill to disk]: https://github.com/apache/arrow-datafusion/pull/7400 # Easier to use DataFusion to build systems -DataFusion’s primary design goal is as the basis used to create other systems +DataFusion’s primary design goal is to be used as the basis to create other systems (TODO link to doc explaining that), though it is a reasonably good experience -out of the box as a dataframe library (TODO link to datafusion-python) and -command line SQL tool (todo link to datafusion-cli docs) +out of the box as a [dataframe library] and as a [command line SQL tool]. + +[dataframe library]: https://arrow.apache.org/datafusion-python/ +[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html Part of making it easier to use DataFusion is improving the documentation. To that end we have created a new [Library Users Guide] to help people build the @@ -145,7 +150,10 @@ applications on top of DataFusion. [Library Users Guide]:https://arrow.apache.org/datafusion/library-user-guide/index.html -We have also added User Defined Window Functions (TODO link to documentation) , User Defined Table Functions (TODO link to doc) +We have also added [User Defined Window Functions] and [User Defined Table Functions]. + +[User Defined Window Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-window-udf +[User Defined Table Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-user-defined-table-function ### Growth of DataFusion in the Wild @@ -171,22 +179,23 @@ considering DataFusion to decide if it is a good fit for their needs. # DataFusion in 2024 -This year, some major initiatives we aim to undertake are: +This year some major initiatives we aim to undertake are: -1. Split apart the system so it is even more modular, including unifying how functions are defined to make it easier to mix and match. Improved APIs (trait based) APIs for defining extensions for Scalar, Aggregate, Window functions +1. Split apart the system so it is even more modular, including unifying how functions are defined to make it easier to mix and match. Improved APIs (trait based) APIs for defining extensions for Scalar, Aggregate and Window functions. 2. Better testing and CI infrastructure, including fuzz testing, and better functional and performance regression testing. -2. Performance: Improved planning performance (both for wide tables of 1000s of columns, as well as in general improving efficiency of the planning process) TODO find ticket. -3. Performance of performance on high cardinality grouping Also, our aggregation is still about 2x slower than state of the art (e.g. DuckDB) for “very high cardinality” aggregates, and we hope to improve there as well +2. Performance: Improved planning performance (both for [wide tables of 1000s of columns], as well as in [general improving efficiency of the planning process]). +3. Performance of performance on high cardinality grouping Also, our aggregation is still about 2x slower than state of the art (e.g. DuckDB) for “very high cardinality” aggregates, and we hope to improve there as well. 4. Better statistics handling such as XXX, YYY with an eye towards more sophisticated analysis and cost models. -5. Further invest in advanced optimization techniques such as additional predicate pruning, bloom filtering, etc (TODO link). This largely takes the form of logic rules that can take arbitrary predicates (EXMAPLE) and prove they can not be true given information at hand, either statistics from parquet files, bloom filter,s or other sources). This is a key building block for both fast performance within datafusion as well as other systems that use datafusion expressions - +5. Further invest in advanced optimization techniques such as additional predicate pruning, bloom filtering, etc (TODO link). This largely takes the form of logic rules that can take arbitrary predicates (EXMAPLE) and prove they can not be true given information at hand, either statistics from parquet files, bloom filters or other sources. This is a key building block for both fast performance within datafusion as well as other systems that use datafusion expressions. +[wide tables of 1000s of columns]: https://github.com/apache/arrow-datafusion/issues/7698 +[general improving efficiency of the planning process]: https://github.com/apache/arrow-datafusion/issues/5637 # How to Get Involved -If you are interested in contributing to DataFusion, we would love to have you +If you are interested in contributing to DataFusion we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or contribute a PR with documentation, tests or code. A list of open issues