Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use arrow IPC Stream format for spill files #14868

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

davidhewitt
Copy link

Which issue does this PR close?

Rationale for this change

The IPC Stream format allows for dictionary replacement, unlike the IPC File format. As per #4658 (comment) the File format does not offer advantages for the spill use case.

What changes are included in this PR?

Replaced the functionality in spilled sorts to write the IPC Stream format, instead of the IPC File format.

Are these changes tested?

Covered by existing tests of spill, I adapted these where necessary.

Are there any user-facing changes?

The code change is internal only; if for some reason users were inspecting the contents of spilled files without using the datafusion APIs to read them, they will find the format has changed.

@davidhewitt davidhewitt changed the title use arrow IPC Stream format for spill files Use arrow IPC Stream format for spill files Feb 25, 2025
@github-actions github-actions bot added the physical-expr Physical Expressions label Feb 25, 2025
@xudong963
Copy link
Member

Maybe it's better to add tests to cover your issue?

@davidhewitt
Copy link
Author

Good point, I will add them probably tomorrow 👍

@comphead
Copy link
Contributor

Wondering if this PR also addresses partially #14078

Another thing to mention we still lack some spilling test cases like #11541

Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the nice work. I think it's ready to go after the regression test is added.

@davidhewitt
Copy link
Author

Rebased on main & test pushed 👍

}

#[test]
fn test_batch_spill_and_read_dictionary_arrays() -> Result<()> {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed that this test fails on main with the error in #4658

@davidhewitt
Copy link
Author

#14078 looks like a similar problem but I suspect that the IPC Stream format is not much different in cost to the file format. At least with the test added here if the file format is changed again in the future, dictionary arrays will not regress.

@comphead
Copy link
Contributor

Thanks @davidhewitt I think this PR is good, please check the clippy

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @davidhewitt it is a really nice first contribution

@davidhewitt
Copy link
Author

Thanks! clippy should be happy now... 🤞

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ExternalSorter Fails to Spill Dictionaries
4 participants