[Website]: DataFusion 26-34 blog post (#457)
Closes apache/datafusion#6780. This blog post describes DataFusion over the last 6 months, DataFusion 26 to 34. If anyone has time to pitch in and look up links or help with the language, that would be most appreciated. --------- Co-authored-by: Bruce Ritchie <[email protected]> Co-authored-by: Andy Grove <[email protected]> Co-authored-by: Mustafa Akur <[email protected]>
1 parent 91c782c, commit b49af25
Showing 3 changed files with 369 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,369 @@ | ||
--- | ||
layout: post | ||
title: "Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024" | ||
date: "2024-01-19 00:00:00" | ||
author: pmc | ||
categories: [release] | ||
--- | ||
|
||
<!-- | ||
{% comment %} | ||
Licensed to the Apache Software Foundation (ASF) under one or more | ||
contributor license agreements. See the NOTICE file distributed with | ||
this work for additional information regarding copyright ownership. | ||
The ASF licenses this file to you under the Apache License, Version 2.0 | ||
(the "License"); you may not use this file except in compliance with | ||
the License. You may obtain a copy of the License at | ||
http://www.apache.org/licenses/LICENSE-2.0 | ||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
{% endcomment %} | ||
--> | ||
|
||
## Introduction | ||
|
||
We recently [released DataFusion 34.0.0]. This blog post highlights some of the major | ||
improvements since we [released DataFusion 26.0.0] (spoiler alert: there are many) | ||
and previews where the community plans to focus in the next 6 months. | ||
|
||
[released DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/ | ||
[released DataFusion 34.0.0]: https://crates.io/crates/datafusion/34.0.0 | ||
|
||
[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that | ||
uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to | ||
create new, fast, data-centric systems such as databases, dataframe libraries, | ||
and machine learning and streaming applications. While [DataFusion’s primary design | ||
goal] is to accelerate the creation of other data-centric systems, it also provides a | ||
reasonable experience directly out of the box as a [dataframe library] and | ||
[command line SQL tool]. | ||
|
||
[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals | ||
[dataframe library]: https://arrow.apache.org/datafusion-python/ | ||
[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html | ||
|
||
|
||
[apache arrow datafusion]: https://arrow.apache.org/datafusion/ | ||
[apache arrow]: https://arrow.apache.org | ||
[rust]: https://www.rust-lang.org/ | ||
|
||
|
||
This may also be our last update on the Apache Arrow Site. Future | ||
updates will likely be on the DataFusion website as we are working to [graduate | ||
to a top level project] (Apache Arrow DataFusion → Apache DataFusion!) which | ||
will help focus governance and project growth. Also exciting, our [first | ||
DataFusion in person meetup] is planned for March 2024. | ||
|
||
[graduate to a top level project]: https://github.com/apache/arrow-datafusion/discussions/6475 | ||
[first DataFusion in person meetup]: https://github.com/apache/arrow-datafusion/discussions/8522 | ||
|
||
DataFusion is very much a community endeavor. Our core thesis is that as a | ||
community we can build much more advanced technology than any of us as | ||
individuals or companies could alone. In the last 6 months, between `26.0.0` and | ||
`34.0.0`, community growth has been strong. We reviewed and merged over a | ||
thousand commits from 124 different contributors, created over 650 issues, and | ||
closed 517 issues. | ||
You can find a list of all changes in the detailed [CHANGELOG]. | ||
|
||
<!-- | ||
$ git log --pretty=oneline 26.0.0..34.0.0 . | wc -l | ||
1009 | ||
$ git shortlog -sn 26.0.0..34.0.0 . | wc -l | ||
124 | ||
https://crates.io/crates/datafusion/26.0.0 | ||
DataFusion 26 released June 7, 2023 | ||
https://crates.io/crates/datafusion/34.0.0 | ||
DataFusion 34 released Dec 17, 2023 | ||
Issues created in this time: 214 open, 437 closed | ||
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-06-23..2023-12-17 | ||
Issues closed: 517 | ||
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-06-23..2023-12-17+ | ||
PRs merged in this time 908 | ||
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-06-23..2023-12-17 | ||
--> | ||
|
||
|
||
|
||
[CHANGELOG]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md | ||
|
||
# Improved Performance 🚀 | ||
|
||
Performance is a key feature of DataFusion. DataFusion `34.0.0` is | ||
more than 2x faster on [ClickBench] than version `25.0.0`, as shown below: | ||
|
||
<!-- | ||
Scripts: https://github.com/alamb/datafusion-duckdb-benchmark/tree/datafusion-25-34 | ||
Spreadsheet: https://docs.google.com/spreadsheets/d/1FtI3652WIJMC5LmJbLfT3G06w0JQIxEPG4yfMafexh8/edit#gid=1879366976 | ||
Average runtime on 25.0.0: 7.2s (for the queries that actually ran) | ||
Average runtime on 34.0.0: 3.6s (for the same queries that ran in 25.0.0) | ||
--> | ||
|
||
[ClickBench]: https://benchmark.clickhouse.com/ | ||
|
||
<figure style="text-align: center;"> | ||
<img src="{{ site.baseurl }}/img/datafusion-34.0.0/compare-new.png" width="100%" class="img-responsive" alt="Fig 1: Performance improvement between 25.0.0 and 34.0.0 on ClickBench."> | ||
<figcaption> | ||
<b>Figure 1</b>: Performance improvement between <code>25.0.0</code> and <code>34.0.0</code> on ClickBench. | ||
Note that DataFusion <code>25.0.0</code> could not run several queries due to | ||
unsupported SQL (Q9, Q11, Q12, Q14) or memory requirements (Q33). | ||
</figcaption> | ||
</figure> | ||
|
||
<figure style="text-align: center;"> | ||
<img src="{{ site.baseurl }}/img/datafusion-34.0.0/compare.png" width="100%" class="img-responsive" alt="Fig 2: Total query runtime for DataFusion 34.0.0 and DataFusion 25.0.0."> | ||
<figcaption> | ||
<b>Figure 2</b>: Total query runtime for DataFusion <code>34.0.0</code> and DataFusion <code>25.0.0</code>. | ||
</figcaption> | ||
</figure> | ||
|
||
|
||
Here are some specific enhancements we have made to improve performance: | ||
* [2-3x better aggregation performance with many distinct groups] | ||
* Partially ordered grouping / streaming grouping | ||
* [Specialized operator for "TopK" `ORDER BY LIMIT XXX`] | ||
* [Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`] | ||
* [Improved join performance] | ||
* Eliminate redundant sorting with sort order aware optimizers | ||
|
||
[2-3x better aggregation performance with many distinct groups]: https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/ | ||
[Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7192 | ||
[Specialized operator for "TopK" `ORDER BY LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7721 | ||
[Improved join performance]: https://github.com/apache/arrow-datafusion/pull/8126 | ||
[Pushdown Filter Condition(s) into Cross join]: https://github.com/apache/arrow-datafusion/pull/8626 | ||
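
As a rough illustration of why a specialized "TopK" operator helps, the sketch below answers an `ORDER BY x LIMIT k` style query with a bounded heap, so at most `k` values are retained at any time instead of sorting the whole input. It is written in Python rather than DataFusion's Rust, it is not DataFusion's actual implementation, and the function name `top_k` is hypothetical.

```python
import heapq


def top_k(values, k):
    """Return the k smallest values in ascending order (like ORDER BY x LIMIT k)."""
    if k <= 0:
        return []
    heap = []  # max-heap emulated by storing negated values; never grows beyond k
    for value in values:
        if len(heap) < k:
            heapq.heappush(heap, -value)
        elif value < -heap[0]:
            # The new value beats the largest value currently kept, so swap it in.
            heapq.heapreplace(heap, -value)
    return sorted(-v for v in heap)


print(top_k([5, 1, 9, 3, 7, 2], 3))
```

The input is scanned once and memory use is bounded by `k`, which is the property that makes this approach attractive when the limit is small relative to the input.
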
# New Features ✨ | ||
|
||
## DML / Insert / Creating Files | ||
|
||
DataFusion now supports writing data in parallel, to individual or multiple | ||
files, using `Parquet`, `CSV`, `JSON`, `ARROW` and user defined formats. | ||
[Benchmark results] show improvements up to 5x in some cases. | ||
|
||
[Benchmark results]: https://github.com/apache/arrow-datafusion/pull/7655 | ||
|
||
Similarly to reading, data can now be written to any [`ObjectStore`] | ||
implementation, including AWS S3, Azure Blob Storage, GCP Cloud Storage, local | ||
files, and user defined implementations. While reading from [hive style | ||
partitioned tables] has long been supported, it is now possible to write to such | ||
tables as well. | ||
|
||
[hive style partitioned tables]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html#features | ||
|
||
[`ObjectStore`]: https://docs.rs/object_store/0.9.0/object_store/index.html | ||
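
To make "hive style partitioned" concrete: in this layout each partition column value typically becomes a `column=value` directory, and the partition columns are carried by the path rather than stored in the data files. The Python sketch below only computes such a layout in memory. It is an illustration under that assumption, not DataFusion's writer, and the function name `hive_partition_dirs` and the example paths are hypothetical.

```python
from collections import defaultdict


def hive_partition_dirs(rows, partition_cols, base_dir):
    """Group rows (dicts) by the hive style directory they would be written to."""
    dirs = defaultdict(list)
    for row in rows:
        # One `col=value` path segment per partition column, in order.
        segments = [f"{col}={row[col]}" for col in partition_cols]
        path = "/".join([base_dir.rstrip("/")] + segments)
        # Partition values are encoded in the path, so drop them from the row data.
        data = {k: v for k, v in row.items() if k not in partition_cols}
        dirs[path].append(data)
    return dict(dirs)


rows = [{"year": 2023, "x": 1}, {"year": 2024, "x": 2}, {"year": 2023, "x": 3}]
for path, data in hive_partition_dirs(rows, ["year"], "/tmp/t").items():
    print(path, data)
```
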
|
||
For example, to write to a local file, define a table with the [`CREATE EXTERNAL TABLE` statement] and then `INSERT` into it: | ||
|
||
```sql | ||
❯ CREATE EXTERNAL TABLE awesome_table(x INT) STORED AS PARQUET LOCATION '/tmp/my_awesome_table'; | ||
0 rows in set. Query took 0.003 seconds. | ||
|
||
❯ INSERT INTO awesome_table SELECT x * 10 FROM my_source_table; | ||
+-------+ | ||
| count | | ||
+-------+ | ||
| 3 | | ||
+-------+ | ||
1 row in set. Query took 0.024 seconds. | ||
``` | ||
|
||
[`CREATE EXTERNAL TABLE` statement]: https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#create-external-table | ||
|
||
You can also write to files with the [`COPY`] command, similar to [DuckDB’s `COPY`]: | ||
|
||
[`COPY`]: https://arrow.apache.org/datafusion/user-guide/sql/dml.html#copy | ||
[DuckDB’s `COPY`]: https://duckdb.org/docs/sql/statements/copy.html | ||
|
||
```sql | ||
❯ COPY (SELECT x + 1 AS x FROM my_source_table) TO '/tmp/output.json'; | ||
+-------+ | ||
| count | | ||
+-------+ | ||
| 3 | | ||
+-------+ | ||
1 row in set. Query took 0.014 seconds. | ||
``` | ||
|
||
```shell | ||
$ cat /tmp/output.json | ||
{"x":1} | ||
{"x":2} | ||
{"x":3} | ||
``` | ||
|
||
## Improved `STRUCT` and `ARRAY` support | ||
|
||
DataFusion `34.0.0` has much improved `STRUCT` and `ARRAY` | ||
support, including a full range of [struct functions] and [array functions]. | ||
|
||
[struct functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#struct-functions | ||
[array functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#array-functions | ||
|
||
<!-- | ||
❯ create table my_table as values ([1,2,3]), ([2]), ([4,5]); | ||
--> | ||
|
||
For example, you can now use `[]` syntax and `array_length` to access and inspect arrays: | ||
```sql | ||
❯ SELECT column1, | ||
column1[1] AS first_element, | ||
array_length(column1) AS len | ||
FROM my_table; | ||
+-----------+---------------+-----+ | ||
| column1 | first_element | len | | ||
+-----------+---------------+-----+ | ||
| [1, 2, 3] | 1 | 3 | | ||
| [2] | 2 | 1 | | ||
| [4, 5] | 4 | 2 | | ||
+-----------+---------------+-----+ | ||
``` | ||
|
||
Similarly, you can access `STRUCT` fields by name using `[]` syntax: | ||
```sql | ||
❯ SELECT column1, column1['c0'] FROM my_table; | ||
+------------------+----------------------+ | ||
| column1 | my_table.column1[c0] | | ||
+------------------+----------------------+ | ||
| {c0: foo, c1: 1} | foo | | ||
| {c0: bar, c1: 2} | bar | | ||
+------------------+----------------------+ | ||
2 rows in set. Query took 0.002 seconds. | ||
``` | ||
|
||
## Other Features | ||
Other notable features include: | ||
* Support aggregating datasets that exceed memory size, with [group by spill to disk] | ||
* All operators now track and limit their memory consumption, including Joins | ||
|
||
[group by spill to disk]: https://github.com/apache/arrow-datafusion/pull/7400 | ||
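
The general idea behind aggregating more data than fits in memory can be sketched as follows: keep partial aggregates in memory up to a limit, write them out as sorted runs when the limit is reached, and merge the runs at the end. The Python code below is a conceptual sketch of that technique only. It is not DataFusion's implementation, and the function name `spilling_group_sum` and the `max_groups_in_memory` parameter are hypothetical.

```python
import heapq
import json
import tempfile
from itertools import groupby
from operator import itemgetter


def _read_run(f):
    """Yield (key, partial_sum) pairs from one spilled, key-sorted run."""
    for line in f:
        key, value = json.loads(line)
        yield key, value


def spilling_group_sum(rows, max_groups_in_memory):
    """SUM(value) GROUP BY key, spilling sorted partial sums to temp files."""
    spills = []
    partial = {}

    def spill():
        f = tempfile.TemporaryFile(mode="w+")
        for key in sorted(partial):
            f.write(json.dumps([key, partial[key]]) + "\n")
        f.seek(0)
        spills.append(f)
        partial.clear()

    for key, value in rows:
        # Spill before admitting a new group once the in-memory limit is reached.
        if key not in partial and len(partial) >= max_groups_in_memory:
            spill()
        partial[key] = partial.get(key, 0) + value
    if partial:
        spill()

    # Merge the sorted runs and combine partial sums that share a key.
    merged = heapq.merge(*(_read_run(f) for f in spills), key=itemgetter(0))
    result = [(k, sum(v for _, v in grp)) for k, grp in groupby(merged, key=itemgetter(0))]
    for f in spills:
        f.close()
    return result


print(spilling_group_sum([("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)], 2))
```

Because every run is sorted by key, the final merge only needs one row per run in memory at a time, which is what lets the aggregation finish even when the number of groups exceeds the in-memory limit.
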
|
||
# Building Systems is Easier with DataFusion 🛠️ | ||
|
||
## Documentation | ||
It is easier than ever to get started using DataFusion with the | ||
new [Library Users Guide] as well as the significantly improved [API documentation]. | ||
|
||
[Library Users Guide]: https://arrow.apache.org/datafusion/library-user-guide/index.html | ||
[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html | ||
|
||
## User Defined Window and Table Functions | ||
In addition to [User Defined Scalar Functions] and [User Defined Aggregate Functions], DataFusion now supports [User Defined Window Functions] | ||
and [User Defined Table Functions]. | ||
|
||
For example, [the `datafusion-cli`] implements a DuckDB style [`parquet_metadata`] | ||
function as a user defined table function ([source code here]): | ||
|
||
[the `datafusion-cli`]: https://arrow.apache.org/datafusion/user-guide/cli.html | ||
[`parquet_metadata`]: https://arrow.apache.org/datafusion/user-guide/cli.html#supported-sql | ||
[source code here]: https://github.com/apache/arrow-datafusion/blob/3f219bc929cfd418b0e3d3501f8eba1d5a2c87ae/datafusion-cli/src/functions.rs#L222-L248 | ||
|
||
```sql | ||
❯ SELECT | ||
path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size | ||
FROM | ||
parquet_metadata('hits.parquet') | ||
WHERE path_in_schema = '"WatchID"' | ||
LIMIT 3; | ||
|
||
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+ | ||
| path_in_schema | row_group_id | row_group_num_rows | stats_min | stats_max | total_compressed_size | | ||
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+ | ||
| "WatchID" | 0 | 450560 | 4611687214012840539 | 9223369186199968220 | 3883759 | | ||
| "WatchID" | 1 | 612174 | 4611689135232456464 | 9223371478009085789 | 5176803 | | ||
| "WatchID" | 2 | 344064 | 4611692774829951781 | 9223363791697310021 | 3031680 | | ||
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+ | ||
3 rows in set. Query took 0.053 seconds. | ||
``` | ||
|
||
|
||
[User Defined Scalar Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-scalar-udf | ||
[User Defined Aggregate Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-an-aggregate-udf | ||
[User Defined Window Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-window-udf | ||
[User Defined Table Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-user-defined-table-function | ||
|
||
|
||
## Growth of DataFusion 📈 | ||
DataFusion has been appearing more publicly in the wild. For example: | ||
* New projects built using DataFusion such as [lancedb], [GlareDB], [Arroyo], and [optd]. | ||
* Public talks such as [Apache Arrow Datafusion: Vectorized | ||
Execution Framework For Maximum Performance] at [CommunityOverCode Asia 2023] | ||
* Blog posts such as [Apache Arrow, Arrow/DataFusion, AI-native Data Infra], | ||
[Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0], and | ||
[A Guide to User-Defined Functions in Apache Arrow DataFusion] | ||
|
||
[glaredb]: https://glaredb.com/ | ||
[lancedb]: https://lancedb.com/ | ||
[arroyo]: https://www.arroyo.dev/ | ||
[optd]: https://github.com/cmu-db/optd | ||
|
||
[Apache Arrow Datafusion: Vectorized Execution Framework For Maximum Performance]: https://www.youtube.com/watch?v=AJU9rdRNk9I | ||
[CommunityOverCode Asia 2023]: https://www.bagevent.com/event/8432178 | ||
[Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0]: https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/ | ||
[Apache Arrow, Arrow/DataFusion, AI-native Data Infra]: https://www.synnada.ai/blog/apache-arrow-arrow-datafusion-ai-native-data-infra-an-interview-with-our-ceo-ozan | ||
[A Guide to User-Defined Functions in Apache Arrow DataFusion]: https://www.linkedin.com/pulse/guide-user-defined-functions-apache-arrow-datafusion-dade-aderemi/ | ||
|
||
We have also [submitted a paper] to [SIGMOD 2024], one of the | ||
premier database conferences, describing DataFusion in a technically formal | ||
style and making the case that it is possible to create a modular and extensible query engine | ||
without sacrificing performance. We hope this paper helps people | ||
evaluating DataFusion for their needs understand it better. | ||
|
||
[submitted a paper]: https://github.com/apache/arrow-datafusion/issues/6782 | ||
[SIGMOD 2024]: https://2024.sigmod.org/ | ||
|
||
# DataFusion in 2024 🥳 | ||
|
||
Some major initiatives from contributors we know of this year are: | ||
|
||
1. *Modularity*: Make DataFusion even more modular, such as [unifying | ||
built in and user functions], making it easier to customize | ||
DataFusion's behavior. | ||
|
||
2. *Community Growth*: Graduate to our own top level Apache project, and | ||
subsequently add more committers and PMC members to keep pace with project | ||
growth. | ||
|
||
3. *Use case white papers*: Write blog posts and videos explaining | ||
how to use DataFusion for real-world use cases. | ||
|
||
4. *Testing*: Improve CI infrastructure and test coverage, more fuzz | ||
testing, and better functional and performance regression testing. | ||
|
||
5. *Planning Time*: Reduce the time taken to plan queries, both for [wide | ||
tables of 1000s of columns], and in [general]. | ||
|
||
6. *Aggregate Performance*: Improve the speed of [aggregating "high cardinality"] data | ||
when there are many (e.g. millions of) distinct groups. | ||
|
||
7. *Statistics*: [Improved statistics handling] with an eye towards more | ||
sophisticated expression analysis and cost models. | ||
|
||
[aggregating "high cardinality"]: https://github.com/apache/arrow-datafusion/issues/7000 | ||
[wide tables of 1000s of columns]: https://github.com/apache/arrow-datafusion/issues/7698 | ||
[general]: https://github.com/apache/arrow-datafusion/issues/5637 | ||
[unifying built in and user functions]: https://github.com/apache/arrow-datafusion/issues/8045 | ||
[Improved statistics handling]: https://github.com/apache/arrow-datafusion/issues/8227 | ||
|
||
# How to Get Involved | ||
|
||
If you are interested in contributing to DataFusion we would love to have you | ||
join us. You can try out DataFusion on some of your own data and projects and | ||
let us know how it goes, contribute suggestions, documentation, bug reports, or | ||
a PR with documentation, tests or code. A list of open issues | ||
suitable for beginners is [here]. | ||
|
||
As the community grows, we are also looking to restart biweekly calls / | ||
meetings. Timezones are always a challenge for such meetings, but we hope to | ||
have two calls that can work for most attendees. If you are interested | ||
in helping, or just want to say hi, please drop us a note via one of | ||
the methods listed in our [Communication Doc]. | ||
|
||
[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 | ||
[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html |