[Website]: DataFusion 26-34 blog post #457
BTW does anyone have an example of writing to remote object storage (e.g. `s3`) handy that they could share so I can include it here? @devinjdangelo do you have this set up?
@alamb We have an example of writing to S3 here: https://github.com/apache/arrow-datafusion/blob/main/datafusion-examples/examples/external_dependency/dataframe-to-s3.rs
`COPY table to 's3://my_bucket/my_prefix'` should work in datafusion-cli so long as the credentials are set up. I'll verify real quick...
OK, confirmed with a caveat. Using `COPY` directly to an object store from datafusion-cli does not work, but inserting into an external table does. We probably need to add special logic to datafusion-cli to make `COPY` to an object store work directly. That would be a neat feature to add.
For now this works:
I see the parquet file in my s3 bucket as well.
Thanks @devinjdangelo -- I'll file a ticket about this in DataFusion later today
#8907 filed to make the COPY example work...
Thank you 🙏
@jayzhan211 are there any other improvements you think we should call out about struct/array support over the last 6 months?
Nope. I think all of them are new features compared to 6 months ago.
@ozankabak are there any plans you and your team have that you want to share publicly?
We don't have anything to share just yet on features we will contribute in 2024 (but there will be many!). We will probably have something to publish in a month or two.
We plan to write show-and-tell blog posts and videos that explain how one can use DataFusion in real-world use cases. We will try to partner with members of the community to create toy examples relating to their use cases and come up with demo scripts that offer guidance to others on how they can use DataFusion in similar contexts.
Maybe it could be a good idea to mention these upcoming show-and-tells as a near-future community growth effort.
Yes, this is great. I will include it
in 95158bb
I think this plot is hard to read, mostly due to the wildly different numbers and because "execution time" often is an aggregate. For these kinds of plots, I see two possible ways to improve them:
This is a good call. I took a stab at creating this version. What do you think?
BTW source data: https://github.com/alamb/datafusion-duckdb-benchmark/tree/datafusion-25-34
Sheet: https://docs.google.com/spreadsheets/d/1FtI3652WIJMC5LmJbLfT3G06w0JQIxEPG4yfMafexh8/edit#gid=530035076
I used this new chart. Here is what the page looks like rendered now:
That's better. I think that the execution time (blue) shouldn't be a line plot because that type implies a connection between the neighboring points (like a time series where subsequent entries are indeed related). The linear interpolation between the measurements makes this even more misleading / "weird".
I couldn't figure out how to make Google Sheets do what I wanted, so I eventually just made two charts: one shows the overall magnitude and one shows the relative improvement.
I am sure we could do better if we spent more time on this, but I think it is good enough for now.
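For anyone who wants to reproduce the two-chart split outside Google Sheets, here is a minimal sketch of how the two series could be derived. The version labels and timings are hypothetical placeholders, not the benchmark results, and `relative_improvement` is an illustrative helper name rather than anything from the benchmark repository.

```python
def relative_improvement(times_by_version, baseline):
    """Return the percent reduction in execution time versus a baseline version.

    A positive value means the version is faster than the baseline.
    """
    base = times_by_version[baseline]
    return {
        version: (base - seconds) / base * 100.0
        for version, seconds in times_by_version.items()
    }


# Hypothetical total execution times in seconds (NOT measured data).
timings = {"25.0.0": 200.0, "30.0.0": 150.0, "34.0.0": 100.0}

# Series for the second chart: improvement relative to the oldest version.
improvement = relative_improvement(timings, baseline="25.0.0")

# Chart 1 would plot `timings` (overall magnitude);
# chart 2 would plot `improvement` (relative change), ideally as bars.
for version, seconds in timings.items():
    print(f"{version}: {seconds:.1f} s, {improvement[version]:.1f}% faster than baseline")
```

Plotting both series as bars rather than lines would avoid implying a connection between neighboring measurements, which was the concern raised above.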