[Website]: DataFusion 26-34 blog post (#457)

Closes apache/datafusion#6780 This blog post describes DataFusion over the last 6 months, DataFusion 26 to 34. If anyone has time to pitch in and look up links or help with the language that would be most apprecaited --------- Co-authored-by: Bruce Ritchie <[email protected]> Co-authored-by: Andy Grove <[email protected]> Co-authored-by: Mustafa Akur <[email protected]>
apache · Jan 19, 2024 · b49af25 · b49af25
1 parent 91c782c
commit b49af25
Show file tree

Hide file tree

Showing 3 changed files with 369 additions and 0 deletions.
diff --git a/_posts/2024-01-19-datafusion-34.0.0.md b/_posts/2024-01-19-datafusion-34.0.0.md
@@ -0,0 +1,369 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024"
+date: "2024-01-19 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+We recently [released DataFusion 34.0.0]. This blog highlights some of the major
+improvements since we [released DataFusion 26.0.0] (spoiler alert there are many)
+and a preview of where the community plans to focus in the next 6 months.
+
+[released DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/.
+[released DataFusion 34.0.0]: https://crates.io/crates/datafusion/34.0.0
+
+[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that
+uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to
+create new, fast data centric systems such as databases, dataframe libraries,
+machine learning and streaming applications. While [DataFusion’s primary design
+goal] is to accelerate creating other data centric systems, it has a
+reasonable experience directly out of the box as a [dataframe library] and
+[command line SQL tool].
+
+[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals
+[dataframe library]: https://arrow.apache.org/datafusion-python/
+[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html
+
+
+[apache arrow datafusion]: https://arrow.apache.org/datafusion/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+
+This may also be our last update on the Apache Arrow Site. Future
+updates will likely be on the DataFusion website as we are working to [graduate
+to a top level project] (Apache Arrow DataFusion → Apache DataFusion!) which
+will help focus governance and project growth. Also exciting, our [first
+DataFusion in person meetup] is planned for March 2024.
+
+[graduate to a top level project]: https://github.com/apache/arrow-datafusion/discussions/6475
+[first DataFusion in person meetup]: https://github.com/apache/arrow-datafusion/discussions/8522
+
+DataFusion is very much a community endeavor. Our core thesis is that as a
+community we can build much more advanced technology than any of us as
+individuals or companies could alone. In the last 6 months between `26.0.0` and
+`34.0.0`, community growth has been strong. We accepted and reviewed over a
+thousand PRs from 124 different committers, created over 650 issues and closed 517
+of them.
+You can find a list of all changes in the detailed [CHANGELOG].
+
+<!--
+$ git log --pretty=oneline 26.0.0..34.0.0 . | wc -l
+     1009
+
+$ git shortlog -sn 26.0.0..34.0.0 . | wc -l
+      124
+
+https://crates.io/crates/datafusion/26.0.0
+DataFusion 26 released June 7, 2023
+
+https://crates.io/crates/datafusion/34.0.0
+DataFusion 34 released Dec 17, 2023
+
+Issues created in this time: 214 open, 437 closed
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-06-23..2023-12-17
+
+Issues closes: 517
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-06-23..2023-12-17+
+
+PRs merged in this time 908
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-06-23..2023-12-17
+-->
+
+
+
+[CHANGELOG]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md
+
+# Improved Performance 🚀 
+
+Performance is a key feature of DataFusion, DataFusion is 
+more than 2x faster on [ClickBench] compared to version `25.0.0`, as shown below:
+
+<!--
+  Scripts: https://github.com/alamb/datafusion-duckdb-benchmark/tree/datafusion-25-34
+  Spreadsheet: https://docs.google.com/spreadsheets/d/1FtI3652WIJMC5LmJbLfT3G06w0JQIxEPG4yfMafexh8/edit#gid=1879366976
+  Average runtime on 25.0.0: 7.2s (for the queries that actually ran)
+  Average runtime on 34.0.0: 3.6s (for the same queries that ran in 25.0.0)
+-->
+
+[ClickBench]: https://benchmark.clickhouse.com/
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/datafusion-34.0.0/compare-new.png" width="100%" class="img-responsive" alt="Fig 1: Adaptive Arrow schema architecture overview.">
+  <figcaption>
+    <b>Figure 1</b>: Performance improvement between <code>25.0.0</code> and <code>34.0.0</code> on ClickBench. 
+    Note that DataFusion <code>25.0.0</code>, could not run several queries due to 
+    unsupported SQL (Q9, Q11, Q12, Q14) or memory requirements (Q33).
+  </figcaption>
+</figure>
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/datafusion-34.0.0/compare.png" width="100%" class="img-responsive" alt="Fig 1: Adaptive Arrow schema architecture overview.">
+  <figcaption>
+    <b>Figure 2</b>: Total query runtime for DataFusion <code>34.0.0</code> and DataFusion <code>25.0.0</code>.
+  </figcaption>
+</figure>
+
+
+Here are some specific enhancements we have made to improve performance:
+* [2-3x better aggregation performance with many distinct groups]
+* Partially ordered grouping / streaming grouping
+* [Specialized operator for "TopK" `ORDER BY LIMIT XXX`] 
+* [Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]
+* [Improved join performance]
+* Eliminate redundant sorting with sort order aware optimizers
+
+[2-3x better aggregation performance with many distinct groups]: https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/
+[Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7192
+[Specialized operator for "TopK" `ORDER BY LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7721
+[Improved join performance]: https://github.com/apache/arrow-datafusion/pull/8126
+[Pushdown Filter Condition(s) into Cross join]: https://github.com/apache/arrow-datafusion/pull/8626
+# New Features ✨
+
+## DML / Insert / Creating Files
+
+DataFusion now supports writing data in parallel, to individual or multiple
+files, using `Parquet`, `CSV`, `JSON`, `ARROW` and user defined formats.
+[Benchmark results] show improvements up to 5x in some cases.
+
+[Benchmark results]: https://github.com/apache/arrow-datafusion/pull/7655
+
+Similarly to reading, data can now be written to any [`ObjectStore`]
+implementation, including AWS S3, Azure Blob Storage, GCP Cloud Storage, local
+files, and user defined implementations. While reading from [hive style
+partitioned tables] has long been supported, it is now possible to write to such
+tables as well.
+
+[hive style partitioned tables]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html#features
+
+[`ObjectStore`]: https://docs.rs/object_store/0.9.0/object_store/index.html
+
+For example, to write to a local file:
+
+```sql
+❯ CREATE EXTERNAL TABLE awesome_table(x INT) STORED AS PARQUET LOCATION '/tmp/my_awesome_table';
+0 rows in set. Query took 0.003 seconds.
+
+❯ INSERT INTO awesome_table SELECT x * 10 FROM my_source_table;
++-------+
+| count |
++-------+
+| 3     |
++-------+
+1 row in set. Query took 0.024 seconds.
+```
+
+[`CREATE EXTERNAL TABLE` statement]: https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#create-external-table
+
+You can also write to files with the [`COPY`], similarly to [DuckDB’s `COPY`]:
+
+[`COPY`]: https://arrow.apache.org/datafusion/user-guide/sql/dml.html#copy
+[DuckDB’s `COPY`]: https://duckdb.org/docs/sql/statements/copy.html
+
+```sql
+❯ COPY (SELECT x + 1 FROM my_source_table) TO '/tmp/output.json';
++-------+
+| count |
++-------+
+| 3     |
++-------+
+1 row in set. Query took 0.014 seconds.
+```
+
+```shell
+$ cat /tmp/output.json
+{"x":1}
+{"x":2}
+{"x":3}
+```
+
+## Improved `STRUCT` and `ARRAY` support
+
+DataFusion `34.0.0` has much improved `STRUCT` and `ARRAY`
+support, including a full range of [struct functions] and [array functions].
+
+[struct functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#struct-functions
+[array functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#array-functions
+
+<!--
+❯ create table my_table as values ([1,2,3]), ([2]), ([4,5]);
+--> 
+
+For example, you can now use `[]` syntax and `array_length` to access and inspect arrays:
+```sql
+❯ SELECT column1, 
+         column1[1] AS first_element, 
+         array_length(column1) AS len 
+  FROM my_table;
++-----------+---------------+-----+
+| column1   | first_element | len |
++-----------+---------------+-----+
+| [1, 2, 3] | 1             | 3   |
+| [2]       | 2             | 1   |
+| [4, 5]    | 4             | 2   |
++-----------+---------------+-----+
+```
+
+```sql
+❯ SELECT column1, column1['c0'] FROM  my_table;
++------------------+----------------------+
+| column1          | my_table.column1[c0] |
++------------------+----------------------+
+| {c0: foo, c1: 1} | foo                  |
+| {c0: bar, c1: 2} | bar                  |
++------------------+----------------------+
+2 rows in set. Query took 0.002 seconds.
+```
+
+## Other Features
+Other notable features include:
+* Support aggregating datasets that exceed memory size, with [group by spill to disk]
+* All operators now track and limit their memory consumption, including Joins
+
+[group by spill to disk]: https://github.com/apache/arrow-datafusion/pull/7400
+
+# Building Systems is Easier with DataFusion 🛠️
+
+## Documentation
+It is easier than ever to get started using DataFusion with the
+new [Library Users Guide] as well as significantly improved the [API documentation]. 
+
+[Library Users Guide]:https://arrow.apache.org/datafusion/library-user-guide/index.html
+[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html
+
+## User Defined Window and Table Functions
+In addition to DataFusion's [User Defined Scalar Functions], and [User Defined Aggregate Functions], DataFusion now supports [User Defined Window Functions] 
+ and [User Defined Table Functions].
+
+For example, [the `datafusion-cli`] implements a DuckDB style [`parquet_metadata`]
+function as a user defined table function ([source code here]): 
+
+[the `datafusion-cli`]: https://arrow.apache.org/datafusion/user-guide/cli.html
+[`parquet_metadata`]: https://arrow.apache.org/datafusion/user-guide/cli.html#supported-sql
+[source code here]: https://github.com/apache/arrow-datafusion/blob/3f219bc929cfd418b0e3d3501f8eba1d5a2c87ae/datafusion-cli/src/functions.rs#L222-L248
+
+```sql
+❯ SELECT 
+      path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size 
+FROM 
+      parquet_metadata('hits.parquet')
+WHERE path_in_schema = '"WatchID"' 
+LIMIT 3;
+
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
+| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
+| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+3 rows in set. Query took 0.053 seconds.
+```
+
+
+[User Defined Scalar Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-scalar-udf
+[User Defined Aggregate Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-an-aggregate-udf
+[User Defined Window Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-window-udf
+[User Defined Table Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-user-defined-table-function
+
+
+### Growth of DataFusion 📈
+DataFusion has been appearing more publically in the wild. For example
+* New projects built using DataFusion such as [lancedb], [GlareDB], [Arroyo], and [optd].
+* Public talks such as [Apache Arrow Datafusion: Vectorized
+  Execution Framework For Maximum Performance] in [CommunityOverCode Asia 2023] 
+* Blogs posts such as [Apache Arrow, Arrow/DataFusion, AI-native Data Infra],
+  [Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0], and 
+  [A Guide to User-Defined Functions in Apache Arrow DataFusion]
+
+[glaredb]: https://glaredb.com/
+[lancedb]: https://lancedb.com/
+[arroyo]: https://www.arroyo.dev/
+[optd]: https://github.com/cmu-db/optd
+
+[Apache Arrow Datafusion: Vectorized Execution Framework For Maximum Performance]: https://www.youtube.com/watch?v=AJU9rdRNk9I
+[CommunityOverCode Asia 2023]: https://www.bagevent.com/event/8432178
+[Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0]: https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/
+[Apache Arrow, Arrow/DataFusion, AI-native Data Infra]: https://www.synnada.ai/blog/apache-arrow-arrow-datafusion-ai-native-data-infra-an-interview-with-our-ceo-ozan
+[A Guide to User-Defined Functions in Apache Arrow DataFusion]: https://www.linkedin.com/pulse/guide-user-defined-functions-apache-arrow-datafusion-dade-aderemi/
+
+We have also [submitted a paper] to [SIGMOD 2024], one of the
+premiere database conferences, describing DataFusion in a technically formal
+style and making the case that it is possible to create a modular and extensive query engine 
+without sacrificing performance. We hope this paper helps people 
+evaluating DataFusion for their needs understand it better.
+
+[submitted a paper]: https://github.com/apache/arrow-datafusion/issues/6782
+[SIGMOD 2024]: https://2024.sigmod.org/
+
+# DataFusion in 2024 🥳
+
+Some major initiatives from contributors we know of this year are:
+
+1. *Modularity*: Make DataFusion even more modular, such as [unifying
+   built in and user functions], making it easier to customize 
+   DataFusion's behavior.
+
+2. *Community Growth*: Graduate to our own top level Apache project, and
+   subsequently add more committers and PMC members to keep pace with project
+   growth.
+
+5. *Use case white papers*: Write blog posts and videos explaining
+   how to use DataFusion for real-world use cases.
+
+3. *Testing*: Improve CI infrastructure and test coverage, more fuzz
+   testing, and better functional and performance regression testing.
+
+3. *Planning Time*: Reduce the time taken to plan queries, both [wide
+   tables of 1000s of columns], and in [general].
+
+4. *Aggregate Performance*: Improve the speed of [aggregating "high cardinality"] data
+   when there are many (e.g. millions) of distinct groups.
+
+5. *Statistics*: [Improved statistics handling] with an eye towards more
+   sophisticated expression analysis and cost models.
+
+[aggregating "high cardinality"]: https://github.com/apache/arrow-datafusion/issues/7000
+[wide tables of 1000s of columns]: https://github.com/apache/arrow-datafusion/issues/7698
+[general]: https://github.com/apache/arrow-datafusion/issues/5637
+[unifying built in and user functions]: https://github.com/apache/arrow-datafusion/issues/8045
+[Improved statistics handling]: https://github.com/apache/arrow-datafusion/issues/8227
+
+# How to Get Involved
+
+If you are interested in contributing to DataFusion we would love to have you
+join us. You can try out DataFusion on some of your own data and projects and
+let us know how it goes, contribute suggestions, documentation, bug reports, or
+a PR with documentation, tests or code. A list of open issues
+suitable for beginners is [here].
+
+As the community grows, we are also looking to restart biweekly calls /
+meetings. Timezones are always a challenge for such meetings, but we hope to
+have two calls that can work for most attendees. If you are interested
+in helping, or just want to say hi, please drop us a note via one of 
+the methods listed in our [Communication Doc].
+
+[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
+[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html
diff --git a/img/datafusion-34.0.0/compare-new.png b/img/datafusion-34.0.0/compare-new.png
diff --git a/img/datafusion-34.0.0/compare.png b/img/datafusion-34.0.0/compare.png