Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blog post for release 40.0.0 #6

Merged
merged 32 commits into from
Jul 24, 2024
Merged
Changes from 3 commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
8e38da3
Blog post for release 40.0.0
alamb Jul 9, 2024
d2b5112
add some more contents / links
alamb Jul 9, 2024
e37f78b
more
alamb Jul 9, 2024
ed02820
Update _posts/2024-07-09-datafusion-40.0.0.md
alamb Jul 10, 2024
114f983
Update _posts/2024-07-09-datafusion-40.0.0.md
alamb Jul 10, 2024
5867e65
add details of parquet indexing
alamb Jul 10, 2024
9b2a29c
Merge branch 'alamb/40.0.0_blog' of github.com:alamb/datafusion-site …
alamb Jul 10, 2024
7c4a9e5
SQL unparser
alamb Jul 12, 2024
359c268
Update community growth section
alamb Jul 18, 2024
71757bd
Hone
alamb Jul 18, 2024
7feccdb
Add planning time improvements
alamb Jul 18, 2024
22ee0e7
Flesh out more
alamb Jul 19, 2024
bd30952
More
alamb Jul 19, 2024
7fd03c2
Apply suggestions from code review
alamb Jul 20, 2024
0fb4fd0
Reduce passive voice in introduction
alamb Jul 22, 2024
f88fbfc
Update sqlancer spelling, etc
alamb Jul 22, 2024
0db3d52
Merge branch 'alamb/40.0.0_blog' of github.com:alamb/datafusion-site …
alamb Jul 22, 2024
46ed5d7
improve performance section
alamb Jul 22, 2024
09e968c
hone
alamb Jul 22, 2024
4cee7b9
hone
alamb Jul 22, 2024
0163c36
complete first pass
alamb Jul 22, 2024
fc900b0
fix heading
alamb Jul 22, 2024
6d3678a
aspirationally set publish date to07-23
alamb Jul 22, 2024
9932ba8
add github names
alamb Jul 22, 2024
79e8f16
Add functon factory example link
alamb Jul 22, 2024
6c0223f
fix comma
alamb Jul 23, 2024
a711708
Apply suggestions from code review
alamb Jul 23, 2024
7b408c3
Merge branch 'alamb/40.0.0_blog' of github.com:alamb/datafusion-site …
alamb Jul 23, 2024
cf84326
wordsmith, fix heading sizes
alamb Jul 23, 2024
ec45edb
more wordsmithing
alamb Jul 23, 2024
f3b420e
Update date to publish date
alamb Jul 23, 2024
00ba35b
Include waynexia on the list of people joining the PMC
alamb Jul 24, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
248 changes: 248 additions & 0 deletions _posts/2024-07-09-datafusion-40.0.0.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,248 @@
---
layout: post
title: "Apache Arrow DataFusion 40.0.0 Released"
date: "2024-07-09 00:00:00"
author: alamb
categories: [release]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

<!-- see https://github.com/apache/datafusion/issues/9602 for details -->

## Introduction

We recently [released DataFusion 40.0.0]. This blog highlights some of the major
improvements since we [released DataFusion 34.0.0] (spoiler alert there are many)
and a preview of where the community plans to focus in the next 6 months.

[released DataFusion 34.0.0]: https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/
[released DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0

<!-- todo update this intro -->
[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that
uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to
create new, fast data centric systems such as databases, dataframe libraries,
machine learning and streaming applications. While [DataFusion’s primary design
goal] is to accelerate creating other data centric systems, it has a
reasonable experience directly out of the box as a [dataframe library] and
[command line SQL tool].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder about the todo (this paragraph isn't to be reviewed yet, right?)


[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals
[dataframe library]: https://arrow.apache.org/datafusion-python/
[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html


[apache arrow datafusion]: https://arrow.apache.org/datafusion/
[apache arrow]: https://arrow.apache.org
[rust]: https://www.rust-lang.org/


DataFusion is very much a community endeavor. Our core thesis is that as a
community we can build much more advanced technology than any of us as
individuals or companies could alone. In the last 6 months between `26.0.0` and
`34.0.0`, community growth has been strong. We accepted and reviewed over a
thousand PRs from 124 different committers, created over 650 issues and closed 517
of them.
You can find a list of all changes in the detailed [CHANGELOG].

<!--
$ git log --pretty=oneline 26.0.0..34.0.0 . | wc -l
1009

$ git shortlog -sn 26.0.0..34.0.0 . | wc -l
124

https://crates.io/crates/datafusion/26.0.0
DataFusion 26 released June 7, 2023

https://crates.io/crates/datafusion/34.0.0
DataFusion 34 released Dec 17, 2023

Issues created in this time: 214 open, 437 closed
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-06-23..2023-12-17

Issues closes: 517
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-06-23..2023-12-17+

PRs merged in this time 908
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-06-23..2023-12-17
-->



# Community Growth 📈

* DataFusion is now a new top level project (TODO LINK)
* DataFusion Comet has been donated and is nearing its first release (TODO LINK)
* We published a paper in SIGMOD [Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine](https://dl.acm.org/doi/10.1145/3626246.3653368)
* Many interesting posts: TODO find some
* DataFusion mentioned as a candidate to be the POSIX of databases in ["What Goes Around Comes Around... And Around...](https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf)
* ["Why you should keep an eye on Apache DataFusion and its community."](https://www.cpard.xyz/posts/datafusion/)


Meetups:
Austin https://github.com/apache/datafusion/discussions/8522
San Francisco
Hangzhou
NYC https://github.com/apache/datafusion/discussions/11213
Belgrade

[CHANGELOG]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md

# Improved Performance 🚀

Performance is a key feature of DataFusion, DataFusion is

TODO: Planning speed improvements:

Implement specialized group values for single Uft8/LargeUtf8/Binary/LargeBinary column #8827


https://github.com/apache/datafusion/pull/8827

## In progress
StringView (TODO get links)



# Improved quality

TODO mention the SQLancer


# New Features ✨

Improvements: unnest, null handling improvements for lead/lag

##

## TreeNode API


## Other Features

Recursive CTEs https://github.com/apache/datafusion/pull/9619 / https://github.com/apache/datafusion/issues/462

# Building Systems is Easier with DataFusion 🛠️

Large scale "extract scalar functions from the core": all scalar functions are now exactly the same API as built in https://github.com/apache/datafusion/issues/9285

We are close to completing the same treatment for aggregates and then window functions

SQL to String (both exprs and plans): https://github.com/apache/datafusion/issues/9494

WASM builds: https://github.com/apache/datafusion/discussions/9834

## Documentation
It is easier than ever to get started using DataFusion with the
new [Library Users Guide] as well as significantly improved the [API documentation].

[Library Users Guide]:https://arrow.apache.org/datafusion/library-user-guide/index.html
[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html

## User Defined SQL Extensiions

## Support for `CREATE FUNCTION`
https://github.com/apache/datafusion/pull/9333

Let's you build systems that support user defined functions

Huge thanks to [@milenkovicm](https://github.com/milenkovicm)

```sql
CREATE FUNCTION my_func(DOUBLE, DOUBLE)
RETURNS DOUBLE
RETURN $1 $3
"#;
```

And

```sql
CREATE FUNCTION iris(FLOAT[])
RETURNS FLOAT[]
LANGUAGE TORCH
AS 'models:/iris@champion'
```

```sql
CREATE FUNCTION func(FLOAT[])
RETURNS FLOAT[]
LANGUAGE WASM
AS 'func.wasm'
```

BTW it would be great if someone made a demo showing how to do this (see https://github.com/apache/datafusion/issues/9326 )


## Parquet indexing / low latency queries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used the Parquet indexing feature (ParquetAccessPlan) to add efficient support for reading from DeltaLake tables and handling deletion vectors. I think its a cool use-case for the indexing feature and perhaps worth calling out as an example of a real-world use-case.




# DataFusion in 2dn half 2024 🥳

Some major initiatives from contributors we know of this year are:

1. *Modularity*: Make DataFusion even more modular, such as [unifying
built in and user functions], making it easier to customize
DataFusion's behavior.

2. *Community Growth*: Graduate to our own top level Apache project, and
subsequently add more committers and PMC members to keep pace with project
growth.

5. *Use case white papers*: Write blog posts and videos explaining
how to use DataFusion for real-world use cases.

3. *Testing*: Improve CI infrastructure and test coverage, more fuzz
testing, and better functional and performance regression testing.

3. *Planning Time*: Reduce the time taken to plan queries, both [wide
tables of 1000s of columns], and in [general].

4. *Aggregate Performance*: Improve the speed of [aggregating "high cardinality"] data
when there are many (e.g. millions) of distinct groups.

5. *Statistics*: [Improved statistics handling] with an eye towards more
sophisticated expression analysis and cost models.

[aggregating "high cardinality"]: https://github.com/apache/arrow-datafusion/issues/7000
[wide tables of 1000s of columns]: https://github.com/apache/arrow-datafusion/issues/7698
[general]: https://github.com/apache/arrow-datafusion/issues/5637
[unifying built in and user functions]: https://github.com/apache/arrow-datafusion/issues/8045
[Improved statistics handling]: https://github.com/apache/arrow-datafusion/issues/8227

# How to Get Involved

If you are interested in contributing to DataFusion we would love to have you
join us. You can try out DataFusion on some of your own data and projects and
let us know how it goes, contribute suggestions, documentation, bug reports, or
a PR with documentation, tests or code. A list of open issues
suitable for beginners is [here].

As the community grows, we are also looking to restart biweekly calls /
meetings. Timezones are always a challenge for such meetings, but we hope to
have two calls that can work for most attendees. If you are interested
in helping, or just want to say hi, please drop us a note via one of
the methods listed in our [Communication Doc].

[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html