Use arrow IPC Stream format for spill files #14868
Maybe it's better to add tests to cover your issue?
Good point, I will add them probably tomorrow 👍
Thank you for the nice work. I think it's ready to go after the regression test is added.
Rebased on main & test pushed 👍
}

#[test]
fn test_batch_spill_and_read_dictionary_arrays() -> Result<()> {
I confirmed that this test fails on main with the error in #4658
#14078 looks like a similar problem, but I suspect that the IPC Stream format is not much different in cost from the File format. At least with the test added here, if the file format is changed again in the future, dictionary arrays will not regress.
Thanks @davidhewitt, I think this PR is good. Please check clippy.
thanks @davidhewitt it is a really nice first contribution
Thanks!
Which issue does this PR close?
Rationale for this change
The IPC Stream format allows for dictionary replacement, unlike the IPC File format. As per #4658 (comment) the File format does not offer advantages for the spill use case.
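To illustrate the point about dictionary replacement, here is a rough toy model in plain Rust. It does not use the `arrow` crate, and every name in it (`Message`, `decode_stream`, `validate_for_file`, `spilled_messages`) is hypothetical and made up for this sketch. It only assumes what is stated above: a stream reader handles messages in order, so a dictionary can be replaced mid-stream, whereas the File format cannot represent a replaced dictionary.

```rust
use std::collections::{HashMap, HashSet};

/// Toy stand-ins for IPC messages (hypothetical, not the `arrow` crate API).
#[derive(Debug, Clone, PartialEq)]
pub enum Message {
    /// Defines, or replaces, the dictionary registered under `id`.
    Dictionary { id: i64, values: Vec<String> },
    /// A batch whose only column holds keys into dictionary `dict_id`.
    Batch { dict_id: i64, keys: Vec<usize> },
}

/// Stream-style decoding: messages are handled strictly in order, so a later
/// `Dictionary` message with the same id replaces the earlier one.
pub fn decode_stream(messages: &[Message]) -> Result<Vec<Vec<String>>, String> {
    let mut dictionaries: HashMap<i64, Vec<String>> = HashMap::new();
    let mut decoded = Vec::new();
    for message in messages {
        match message {
            Message::Dictionary { id, values } => {
                dictionaries.insert(*id, values.clone());
            }
            Message::Batch { dict_id, keys } => {
                let dict = dictionaries
                    .get(dict_id)
                    .ok_or_else(|| format!("no dictionary with id {}", dict_id))?;
                let mut column = Vec::with_capacity(keys.len());
                for &key in keys {
                    let value = dict
                        .get(key)
                        .ok_or_else(|| format!("key {} out of range", key))?;
                    column.push(value.clone());
                }
                decoded.push(column);
            }
        }
    }
    Ok(decoded)
}

/// File-style check: reject any sequence that defines the same dictionary id
/// twice, since a replaced dictionary cannot be represented there.
pub fn validate_for_file(messages: &[Message]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for message in messages {
        if let Message::Dictionary { id, .. } = message {
            if !seen.insert(*id) {
                return Err(format!("dictionary {} would be replaced", id));
            }
        }
    }
    Ok(())
}

/// Two spilled batches whose dictionary (id 0) changes between them.
pub fn spilled_messages() -> Vec<Message> {
    let strings = |xs: &[&str]| xs.iter().map(|s| s.to_string()).collect::<Vec<_>>();
    vec![
        Message::Dictionary { id: 0, values: strings(&["a", "b"]) },
        Message::Batch { dict_id: 0, keys: vec![0, 1, 0] },
        Message::Dictionary { id: 0, values: strings(&["x", "y"]) },
        Message::Batch { dict_id: 0, keys: vec![1, 0] },
    ]
}
```

In this model, `decode_stream(&spilled_messages())` decodes both batches against whichever dictionary was current when each batch was written, while `validate_for_file` rejects the same sequence. Spilling batches whose dictionaries differ hits the second case.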
What changes are included in this PR?
Spilled sorts now write the IPC Stream format instead of the IPC File format.
Are these changes tested?
Covered by the existing spill tests, which I adapted where necessary.
Are there any user-facing changes?
The code change is internal only. If for some reason users were inspecting the contents of spilled files without using the DataFusion APIs to read them, they will find that the format has changed.