Blog post for release 40.0.0 #6

Merged Jul 24, 2024 (32 commits; the diff below shows changes from 11 commits)

Commits
8e38da3
Blog post for release 40.0.0
alamb Jul 9, 2024
d2b5112
add some more contents / links
alamb Jul 9, 2024
e37f78b
more
alamb Jul 9, 2024
ed02820
Update _posts/2024-07-09-datafusion-40.0.0.md
alamb Jul 10, 2024
114f983
Update _posts/2024-07-09-datafusion-40.0.0.md
alamb Jul 10, 2024
5867e65
add details of parquet indexing
alamb Jul 10, 2024
9b2a29c
Merge branch 'alamb/40.0.0_blog' of github.com:alamb/datafusion-site …
alamb Jul 10, 2024
7c4a9e5
SQL unparser
alamb Jul 12, 2024
359c268
Update community growth section
alamb Jul 18, 2024
71757bd
Hone
alamb Jul 18, 2024
7feccdb
Add planning time improvements
alamb Jul 18, 2024
22ee0e7
Flesh out more
alamb Jul 19, 2024
bd30952
More
alamb Jul 19, 2024
7fd03c2
Apply suggestions from code review
alamb Jul 20, 2024
0fb4fd0
Reduce passive voice in introduction
alamb Jul 22, 2024
f88fbfc
Update sqlancer spelling, etc
alamb Jul 22, 2024
0db3d52
Merge branch 'alamb/40.0.0_blog' of github.com:alamb/datafusion-site …
alamb Jul 22, 2024
46ed5d7
improve performance section
alamb Jul 22, 2024
09e968c
hone
alamb Jul 22, 2024
4cee7b9
hone
alamb Jul 22, 2024
0163c36
complete first pass
alamb Jul 22, 2024
fc900b0
fix heading
alamb Jul 22, 2024
6d3678a
aspirationally set publish date to07-23
alamb Jul 22, 2024
9932ba8
add github names
alamb Jul 22, 2024
79e8f16
Add functon factory example link
alamb Jul 22, 2024
6c0223f
fix comma
alamb Jul 23, 2024
a711708
Apply suggestions from code review
alamb Jul 23, 2024
7b408c3
Merge branch 'alamb/40.0.0_blog' of github.com:alamb/datafusion-site …
alamb Jul 23, 2024
cf84326
wordsmith, fix heading sizes
alamb Jul 23, 2024
ec45edb
more wordsmithing
alamb Jul 23, 2024
f3b420e
Update date to publish date
alamb Jul 23, 2024
00ba35b
Include waynexia on the list of people joining the PMC
alamb Jul 24, 2024
328 changes: 328 additions & 0 deletions _posts/2024-07-09-datafusion-40.0.0.md
@@ -0,0 +1,328 @@
---
layout: post
title: "Apache DataFusion 40.0.0 Released"
date: "2024-07-09 00:00:00"
author: alamb
categories: [release]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

<!-- see https://github.com/apache/datafusion/issues/9602 for details -->

## Introduction

We recently [released DataFusion 40.0.0]. This blog post highlights some of the many
major improvements since we [released DataFusion 34.0.0]
and previews where the community plans to improve over the next 6 months.

[released DataFusion 34.0.0]: https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/
[released DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0

<!-- todo update this intro -->
[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that
uses [Apache Arrow] as its in-memory format. Developers use DataFusion to
create new, fast, data-centric systems such as databases, dataframe libraries,
and machine learning and streaming applications. While [DataFusion’s primary design
goal] is to accelerate the creation of other data-centric systems, it provides a
reasonable experience directly out of the box as a [dataframe library] and
[command line SQL tool].

> Review comment (Member): reminder about the todo (this paragraph isn't to be reviewed yet, right?)


[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals
[dataframe library]: https://arrow.apache.org/datafusion-python/
[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html


[apache arrow datafusion]: https://arrow.apache.org/datafusion/
[apache arrow]: https://arrow.apache.org
[rust]: https://www.rust-lang.org/

DataFusion's core thesis is that as a community together, we can build much more
advanced technology than any of us as individuals or companies could do alone.
Without DataFusion, highly performant vectorized query engines would remain
the domain of a few large companies and world-class research institutions.
With DataFusion, we can all build on top of a shared foundation, and focus on
what makes our projects unique.




# Community Growth 📈

In the last 6 months, between `34.0.0` and `40.0.0`, our community has continued to
grow in new and exciting ways.

1. DataFusion became a top level Apache Software Foundation project (read the
[press release] and [blog post]), and added several PMC members and new committers (TODO get list of them with links to mailing list announcement)
2. [DataFusion Comet] was [donated] and is nearing its first release.
3. In the [core DataFusion repo] alone, we reviewed and accepted almost 1500 PRs from 182 different
committers, created over 1000 issues and closed 781 of them 🚀. This is up from
1000 PRs from 124 committers with 650 issues created in our last post 🤯. You
can find a list of all changes in the detailed [CHANGELOG].
4. DataFusion meetups in multiple cities around the world: [Austin], [San Francisco], [Hangzhou], [New York], and [Belgrade].
5. Many new projects in the [datafusion-contrib] organization, including [Table Providers], [SQL Lancer], [Open Variant], [JSON], and [ORC].

[core DataFusion repo]: https://github.com/apache/arrow-datafusion
[CHANGELOG]: https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md
[press release]: https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion
[blog post]: https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/
[Austin]: https://github.com/apache/datafusion/discussions/8522
[San Francisco]: https://github.com/apache/datafusion/discussions/10800
[Hangzhou]: https://www.huodongxing.com/event/5761971909400?td=1965290734055
[New York]: https://github.com/apache/datafusion/discussions/11213
[Belgrade]: https://github.com/apache/datafusion/discussions/11431

[datafusion-contrib]: https://github.com/datafusion-contrib
[Table Providers]: https://github.com/datafusion-contrib/datafusion-table-providers
[SQL Lancer]: https://github.com/datafusion-contrib/datafusion-sqllancer
[Open Variant]: https://github.com/datafusion-contrib/datafusion-functions-variant
[JSON]: https://github.com/datafusion-contrib/datafusion-functions-json
[ORC]: https://github.com/datafusion-contrib/datafusion-orc

<!--
$ git log --pretty=oneline 34.0.0..40.0.0 . | wc -l
1453 (up from 1009)

$ git shortlog -sn 34.0.0..40.0.0 . | wc -l
182 (up from 124)


https://crates.io/crates/datafusion/34.0.0
DataFusion 34 released Dec 17, 2023

https://crates.io/crates/datafusion/40.0.0
DataFusion 40 released July 12, 2024

Issues created in this time: 321 open, 781 closed (up from 214 open, 437 closed)
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-12-17..2024-07-12

Issues closed: 911 (up from 517)
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-12-17..2024-07-12

PRs merged in this time 1490 (up from 908)
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-12-17..2024-07-12

-->


In addition, DataFusion has been appearing in more and more writing, both online and offline. Here are some highlights:

1. [Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine] was presented at [SIGMOD '24], one of the major database conferences
2. DataFusion was described as part of the trend to define "the POSIX of databases" in ["What Goes Around Comes Around... And Around...] from Andy Pavlo and Mike Stonebraker
3. ["Why you should keep an eye on Apache DataFusion and its community"]
4. [Apache DataFusion offline meetup in the Bay Area]


[DataFusion Comet]: https://datafusion.apache.org/comet/
[donated]: https://arrow.apache.org/blog/2024/03/06/comet-donation/
[SIGMOD '24]: https://2024.sigmod.org/


[Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine]: https://dl.acm.org/doi/10.1145/3626246.3653368
["What Goes Around Comes Around... And Around...]: https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf
["Why you should keep an eye on Apache DataFusion and its community"]: https://www.cpard.xyz/posts/datafusion/
[Apache DataFusion offline meetup in the Bay Area]: https://www.tisonkun.org/2024/07/15/datafusion-meetup-san-francisco/


# Improved Performance 🚀

Performance is a key feature of DataFusion, and the community continues to
invest in this area. One major area of improvement is the time it takes to
convert a SQL query into a plan that can be executed. The chart below shows
the improvement from the concerted effort of many contributors (TODO list contributors by name) over several
months (see [ticket] for more details).

Planning is almost 2x faster for TPC-DS and TPC-H queries, and over 10x faster for
some queries with many columns.

<img src="{{ site.baseurl }}/assets/datafusion-40.0.0/improved-planning-time.png" width="700">

[ticket]: https://github.com/apache/datafusion/issues/9637


Also, we implemented [specialization for single Utf8/LargeUtf8/Binary/LargeBinary]
group by columns, which resulted in a 40% performance improvement for some
benchmarks.

[specialization for single Utf8/LargeUtf8/Binary/LargeBinary]: https://github.com/apache/datafusion/pull/8827

We have more improvements planned (see below).


# Improved Quality

DataFusion continues to improve in overall quality. One of the most exciting
improvements is the addition of a new [SQLancer] based [DataFusion Fuzzing]
suite, thanks to [@2010YOUY01], that has already found several bugs, which have since been fixed.

[SQLancer]: https://github.com/apache/datafusion/issues/11030
[DataFusion Fuzzing]: https://github.com/datafusion-contrib/datafusion-sqllancer
[@2010YOUY01]: https://github.com/2010YOUY01



# New Features ✨

There are many new features in the last 6 months. Here are some of the highlights:

## SQL Features
* Support for unnest (TODO LINK)
* Support for recursive CTEs ([PR](https://github.com/apache/datafusion/pull/9619), [issue](https://github.com/apache/datafusion/issues/462))
* Support for `CREATE FUNCTION` (see below)
* New functions: TODO find list


# Building Systems is Easier with DataFusion 🛠️

* Faster and easier to use [TreeNode API] for traversing and manipulating plans and expressions.
* All scalar functions now use the same [Scalar User Defined Function API], making it easier to customize
DataFusion's behavior without sacrificing performance. See the [function API ticket] for more details.
* Unparser: convert plans and exprs back to SQL strings (see below)
* [WASM support]


[TreeNode API]: https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview
[Scalar User Defined Function API]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html
[function API ticket]: https://github.com/apache/arrow-datafusion/issues/8045
[WASM support]: https://github.com/apache/datafusion/discussions/9834

We are close to completing the same treatment for aggregates and then window functions.


## Documentation

We continue to improve the documentation to make it easier to get started using DataFusion with
the [Library Users Guide], [API documentation], and [Examples].

[Library Users Guide]: https://datafusion.apache.org/library-user-guide/index.html
[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html
[Examples]: https://github.com/apache/datafusion/tree/main/datafusion-examples

# SQL Unparser

The new SQL unparser converts logical plans and exprs back to SQL text
(see the [tracking issue](https://github.com/apache/datafusion/issues/9494)).

This can be useful for federation, and for building systems that generate SQL.
See the [`expr_to_sql` documentation](https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.expr_to_sql.html) for details.
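
To make the idea of unparsing concrete, here is a minimal, std-only sketch of walking an expression tree and emitting SQL text. Everything in it is an assumption made for illustration: the `Expr` enum and `toy_expr_to_sql` are hypothetical stand-ins, not DataFusion's `Expr` type or its unparser API, and real unparsing also has to handle dialects, operator precedence, and full plans.

```rust
// Toy model only: these types are NOT DataFusion's `Expr` or unparser API.
#[derive(Debug)]
enum Expr {
    /// A column reference, e.g. "a"
    Column(String),
    /// An integer literal
    Int(i64),
    /// A string literal
    Str(String),
    /// A binary operation such as `left < right`
    Binary {
        left: Box<Expr>,
        op: &'static str,
        right: Box<Expr>,
    },
}

/// Recursively converts the toy expression tree back into SQL text.
fn toy_expr_to_sql(expr: &Expr) -> String {
    match expr {
        // Quote identifiers and double any embedded double quotes
        Expr::Column(name) => format!("\"{}\"", name.replace('"', "\"\"")),
        Expr::Int(value) => value.to_string(),
        // Quote string literals and double any embedded single quotes
        Expr::Str(s) => format!("'{}'", s.replace('\'', "''")),
        // Always parenthesize so the output does not depend on precedence rules
        Expr::Binary { left, op, right } => {
            format!("({} {} {})", toy_expr_to_sql(left), op, toy_expr_to_sql(right))
        }
    }
}
```

A system that federates queries could build such a tree for the part of a query it wants to push down, then send the generated text to the remote database.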

TODO more doc ideas from https://github.com/apache/datafusion/pull/11395

## User Defined SQL Parsing Extensions

TODO get link



## Support for `CREATE FUNCTION`
Support for `CREATE FUNCTION` ([PR](https://github.com/apache/datafusion/pull/9333))
lets you build systems that support user defined functions.

Huge thanks to [@milenkovicm](https://github.com/milenkovicm)

```sql
CREATE FUNCTION my_func(DOUBLE, DOUBLE)
RETURNS DOUBLE
RETURN $1 + $2
```

And

```sql
CREATE FUNCTION iris(FLOAT[])
RETURNS FLOAT[]
LANGUAGE TORCH
AS 'models:/iris@champion'
```

```sql
CREATE FUNCTION func(FLOAT[])
RETURNS FLOAT[]
LANGUAGE WASM
AS 'func.wasm'
```

BTW it would be great if someone made a demo showing how to do this (see https://github.com/apache/datafusion/issues/9326 )
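
Until such a demo exists, the following std-only sketch shows one way a host system might turn a `CREATE FUNCTION` statement into something callable. It is a toy model under stated assumptions: `CreateFunction`, `FunctionFactory`, `AddOnlyFactory`, and `Registry` are hypothetical names defined here, not DataFusion's actual types, and the factory only understands the body `$1 + $2`.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins; not DataFusion's actual types.
struct CreateFunction {
    name: String,
    language: Option<String>, // e.g. Some("TORCH") or Some("WASM")
    body: String,             // the RETURN expression or the AS '...' string
}

type ScalarFn = Box<dyn Fn(&[f64]) -> f64>;

/// The host system decides how a statement becomes an executable function.
trait FunctionFactory {
    fn create(&self, stmt: &CreateFunction) -> Result<ScalarFn, String>;
}

/// A factory that only understands the SQL body `$1 + $2`.
struct AddOnlyFactory;

impl FunctionFactory for AddOnlyFactory {
    fn create(&self, stmt: &CreateFunction) -> Result<ScalarFn, String> {
        match (stmt.language.as_deref(), stmt.body.as_str()) {
            (None, "$1 + $2") => {
                // Assumes the caller passes at least two arguments
                let f: ScalarFn = Box::new(|args: &[f64]| args[0] + args[1]);
                Ok(f)
            }
            (Some(lang), _) => Err(format!("unsupported language: {lang}")),
            (None, body) => Err(format!("unsupported body: {body}")),
        }
    }
}

/// Holds the functions created so far, keyed by name.
struct Registry {
    functions: HashMap<String, ScalarFn>,
}

impl Registry {
    fn new() -> Self {
        Registry { functions: HashMap::new() }
    }

    /// Asks the factory to build the function, then registers it by name.
    fn execute_create(
        &mut self,
        factory: &dyn FunctionFactory,
        stmt: CreateFunction,
    ) -> Result<(), String> {
        let f = factory.create(&stmt)?;
        self.functions.insert(stmt.name, f);
        Ok(())
    }

    /// Calls a registered function, or returns None if it is unknown.
    fn call(&self, name: &str, args: &[f64]) -> Option<f64> {
        self.functions.get(name).map(|f| f(args))
    }
}
```

A factory for `LANGUAGE TORCH` or `LANGUAGE WASM` would follow the same shape, loading the model or module named in the `AS` clause instead of matching on a SQL expression.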


## Parquet indexing / low latency queries

We used the Parquet indexing feature (`ParquetAccessPlan`) to add [efficient support] for reading from DeltaLake tables and handling [deletion vectors]. I think it's a cool use case for the indexing feature and perhaps worth calling out as a real-world example.

[efficient support]: https://github.com/spiceai/spiceai/pull/1891
[deletion vectors]: https://docs.delta.io/latest/delta-deletion-vectors.html
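
As a rough illustration of how a deletion vector can drive such an access plan, the std-only sketch below decides, per row group, whether to scan everything, skip everything, or read only selected runs of rows. It is a simplified model under stated assumptions: the `RowGroupAccess` and `Run` types and the `plan_access` function are hypothetical and defined here, not DataFusion's `ParquetAccessPlan` API, and the deleted row indices are assumed to be sorted and unique.

```rust
// Simplified model; not DataFusion's `ParquetAccessPlan` API.
#[derive(Debug, PartialEq)]
enum Run {
    Select(usize), // read this many consecutive rows
    Skip(usize),   // skip this many consecutive rows
}

#[derive(Debug, PartialEq)]
enum RowGroupAccess {
    Skip,                // every row in the row group is deleted
    Scan,                // no row in the row group is deleted
    Selection(Vec<Run>), // read only the rows that survive
}

/// Builds one access decision per row group.
/// `deleted` holds file-level row indices and is assumed sorted and unique.
fn plan_access(row_group_sizes: &[usize], deleted: &[usize]) -> Vec<RowGroupAccess> {
    let mut plan = Vec::with_capacity(row_group_sizes.len());
    let mut start = 0; // file-level index of the first row in this row group
    for &size in row_group_sizes {
        let end = start + size;
        // Deleted rows that fall in this row group, as offsets within the group
        let local: Vec<usize> = deleted
            .iter()
            .copied()
            .filter(|&r| r >= start && r < end)
            .map(|r| r - start)
            .collect();
        let access = if local.is_empty() {
            RowGroupAccess::Scan
        } else if local.len() == size {
            RowGroupAccess::Skip
        } else {
            RowGroupAccess::Selection(runs(size, &local))
        };
        plan.push(access);
        start = end;
    }
    plan
}

/// Converts sorted deleted offsets into alternating select/skip runs.
fn runs(size: usize, deleted: &[usize]) -> Vec<Run> {
    let mut out = Vec::new();
    let mut next = 0; // first row not yet covered by a run
    let mut i = 0;
    while i < deleted.len() {
        let run_start = deleted[i];
        if run_start > next {
            out.push(Run::Select(run_start - next));
        }
        // Extend the skip run across consecutive deleted rows
        let mut run_end = run_start + 1;
        i += 1;
        while i < deleted.len() && deleted[i] == run_end {
            run_end += 1;
            i += 1;
        }
        out.push(Run::Skip(run_end - run_start));
        next = run_end;
    }
    if next < size {
        out.push(Run::Select(size - next));
    }
    out
}
```

The point of the design is that fully deleted row groups are never read or decoded, and partially deleted ones are read only where rows survive.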






# Looking Ahead: The Next Six Months

Discussion on https://github.com/apache/datafusion/issues/11442

Some major initiatives from contributors we know of this year are:

1. *Modularity*: Make DataFusion even more modular, such as [unifying
built in and user functions], making it easier to customize
DataFusion's behavior.

2. *Use case white papers*: Write blog posts and videos explaining
how to use DataFusion for real-world use cases.

3. *Testing*: Improve CI infrastructure and test coverage, more fuzz
testing, and better functional and performance regression testing.

4. *Planning Time*: Reduce the time taken to plan queries, both for [wide
tables of 1000s of columns], and in [general].

5. *Aggregate Performance*: Improve the speed of [aggregating "high cardinality"] data
when there are many (e.g. millions of) distinct groups.

6. *Statistics*: [Improved statistics handling] with an eye towards more
sophisticated expression analysis and cost models.

StringView (TODO get links)


[aggregating "high cardinality"]: https://github.com/apache/arrow-datafusion/issues/7000
[wide tables of 1000s of columns]: https://github.com/apache/arrow-datafusion/issues/7698
[general]: https://github.com/apache/arrow-datafusion/issues/5637
[Improved statistics handling]: https://github.com/apache/arrow-datafusion/issues/8227

# How to Get Involved

If you are interested in contributing to DataFusion we would love to have you
join us. You can try out DataFusion on some of your own data and projects and
let us know how it goes, contribute suggestions, documentation, bug reports, or
a PR with documentation, tests or code. A list of open issues
suitable for beginners is [here].

As the community grows, we are also looking to restart biweekly calls /
meetings. Timezones are always a challenge for such meetings, but we hope to
have two calls that can work for most attendees. If you are interested
in helping, or just want to say hi, please drop us a note via one of
the methods listed in our [Communication Doc].

[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html