Fix: External sort failing on StringView due to shared buffers #14823

Conversation
writer.write(&batch)?;
spill_writer = Some(writer);
self.in_mem_batches.push(batch);
self.spill().await?;
This refactor is not directly related to the PR; I did it along the way.
Do we need to keep the original logic:
- self.spill().await?;
- and then write the remaining batch to disk?

Because the self.in_mem_batches.push(batch) may cause an OOM.
in_mem_batches is a Vec<RecordBatch>, so it doesn't have any internal mechanism to update the reservation. Also, the new batch is already in memory, so there is no difference between spilling them together or not; I added a comment for this.

Besides, I think manually keeping the buffered batches in sync with the reservation is quite tricky: it makes the implementation hard to reason about and can cause bugs. Hopefully we can find a RAII way to improve it in the future.
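As an illustration of the RAII idea floated above, here is a minimal hypothetical sketch (the `TrackedBatches` name and structure are invented for this example and are not part of the PR): the reservation is adjusted at the only two points where batches enter or leave, so the two cannot drift out of sync.

```rust
use arrow::record_batch::RecordBatch;
use datafusion::common::Result;
use datafusion::execution::memory_pool::MemoryReservation;

/// Hypothetical RAII-style holder for buffered batches.
struct TrackedBatches {
    batches: Vec<RecordBatch>,
    reservation: MemoryReservation,
}

impl TrackedBatches {
    fn new(reservation: MemoryReservation) -> Self {
        Self {
            batches: Vec::new(),
            reservation,
        }
    }

    /// Grow the reservation before accepting the batch; fails with a
    /// resources-exhausted error if the memory pool cannot grant it.
    fn push(&mut self, batch: RecordBatch) -> Result<()> {
        self.reservation.try_grow(batch.get_array_memory_size())?;
        self.batches.push(batch);
        Ok(())
    }

    /// Hand all batches to the caller (e.g. the spill path) and release
    /// the corresponding reservation in a single place.
    fn take_all(&mut self) -> Vec<RecordBatch> {
        self.reservation.free();
        std::mem::take(&mut self.batches)
    }
}
```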
Thank you @2010YOUY01 for explaining; got it, I was wrong: we do not have an internal mechanism to update the reservation. A RAII way to improve it in the future is a great idea, maybe we can file a ticket for it!
@@ -1425,7 +1478,7 @@ mod tests {
     // Processing 840 KB of data using 400 KB of memory requires at least 2 spills
     // It will spill roughly 18000 rows and 800 KBytes.
     // We leave a little wiggle room for the actual numbers.
-    assert!((2..=10).contains(&spill_count));
+    assert!((12..=18).contains(&spill_count));
This is caused by the above refactor: the old implementation forgot to update the statistics, so we missed several counts.
Should we also update the comments?
updated in 92ca3b0
if let Some(string_view_array) =
    array.as_any().downcast_ref::<StringViewArray>()
{
    let new_array = string_view_array.gc();
Will string_view_array.gc() affect performance when it is called many times?
Updated. If I understand correctly, it should not affect performance too much, because we only retain the used buffer data:

before gc:
/// sorted_batch1 -> buffer1
///               -> buffer2
/// sorted_batch2 -> buffer1
///               -> buffer2

after gc:
/// sorted_batch1 -> new buffer (used data of buffer1, used data of buffer2)
/// sorted_batch2 -> new buffer (used data of buffer1, used data of buffer2)
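To make the effect concrete, here is a tiny standalone example (using the public arrow API, not code from this PR) showing that a slice of a StringViewArray still accounts for the full shared buffer until gc() compacts it:

```rust
use arrow::array::{Array, StringViewArray};

fn main() {
    // Strings longer than 12 bytes are stored out-of-line, so the views
    // point into shared data buffers.
    let array = StringViewArray::from_iter_values(
        (0..10_000).map(|i| format!("some long string payload number {i:06}")),
    );

    // A 16-row slice still references (and accounts for) the full buffer.
    let slice = array.slice(0, 16);
    println!("before gc: {} bytes", slice.get_buffer_memory_size());

    // gc() copies only the referenced bytes into a fresh, compact buffer.
    let compacted = slice.gc();
    println!("after gc:  {} bytes", compacted.get_buffer_memory_size());
}
```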
Good point, there is some inefficiency here, I filed apache/arrow-rs#7184. Once that's done on the arrow side, we can remove the copies here to speed it up.

IMO it won't cause a regression for DataFusion: the compaction is only done when a spill is triggered, and if we don't copy here it can cause larger inefficiency or fail some memory-limited sort queries.
FWIW I think this compaction logic probably makes sense as part of the IPCWriter, as opposed to a workaround here - apache/arrow-rs#7185
Looks good to me -- thank you @2010YOUY01 and @zhuqi-lucas
I agree this would be better in the arrow IPC writer, so leaving a link to that ticket in the comments would be good, I think.
Thank you all for the feedback. I have addressed the review comments (and also added a small further simplification for the refactor).
🚀
Which issue does this PR close?
Follow-up to #14644: this PR fixes an unsolved failing case for external sort, found via #12136 (comment).
Rationale for this change
Recap of the major steps for external sort (there are detailed docs in datafusion/physical-plan/src/sorts/sort.rs).
The problem is: if step 2 (sorting the in-memory batches) is done on batches with StringViewArray columns, the sorted array will only reorder the views (a string prefix plus a pointer to a range in the payload buffer, if the element is long); the underlying buffers won't be moved.

When reading back spilled batches from a single sorted run, they must be read one batch at a time, otherwise memory pressure won't be reduced. As a result, the spill writer has to write all referenced buffers for each batch, and many buffers get written multiple times. The size of the spilled data can explode, and some memory-limited sort queries can fail for this reason.
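The blow-up can be demonstrated with a self-contained sketch (illustrative code using the arrow crate, not taken from this PR; whether the IPC writer duplicates shared buffers depends on the arrow-rs version, which is what apache/arrow-rs#7185 is about): writing several slices that share one StringView data buffer serializes far more bytes than writing their gc()'d equivalents.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringViewArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

/// Serialize batches with the IPC stream writer and return the byte count.
fn ipc_size(batches: &[RecordBatch], schema: &Schema) -> usize {
    let mut buf = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&mut buf, schema).unwrap();
        for batch in batches {
            writer.write(batch).unwrap();
        }
        writer.finish().unwrap();
    }
    buf.len()
}

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "c",
        DataType::Utf8View,
        false,
    )]));

    // One large StringViewArray; every slice's views keep pointing into
    // the same shared data buffers.
    let big = StringViewArray::from_iter_values(
        (0..10_000).map(|i| format!("a fairly long string value {i:06}")),
    );

    // Ten batches that all share the big array's buffers.
    let shared: Vec<RecordBatch> = (0..10)
        .map(|i| {
            let part: ArrayRef = Arc::new(big.slice(i * 1_000, 1_000));
            RecordBatch::try_new(schema.clone(), vec![part]).unwrap()
        })
        .collect();

    // The same batches after gc(): each owns only the bytes it references.
    let compacted: Vec<RecordBatch> = shared
        .iter()
        .map(|b| {
            let col = b
                .column(0)
                .as_any()
                .downcast_ref::<StringViewArray>()
                .unwrap()
                .gc();
            RecordBatch::try_new(schema.clone(), vec![Arc::new(col) as ArrayRef])
                .unwrap()
        })
        .collect();

    println!("shared buffers: {} bytes", ipc_size(&shared, &schema));
    println!("after gc():     {} bytes", ipc_size(&compacted, &schema));
}
```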
Reproducer
Under benchmarks/, run:
Main branch result (20GB spilled)
Q10 iteration 0 took 12839.6 ms and returned 59986052 rows
Q10 avg time: 12839.60 ms
PR result (12GB spilled)
Q10 iteration 0 took 9424.3 ms and returned 59986052 rows
Q10 avg time: 9424.33 ms
What changes are included in this PR?
Before spilling, StringViewArray columns may have been permuted in the way described above, so this PR reorganizes each such array into sequential order before spilling; a sketch of the idea is shown below.
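A minimal sketch of that idea (the function name and structure are illustrative, not the exact code from this PR): before a batch is handed to the spill writer, compact every StringViewArray column with gc() so its views no longer reference shared buffers.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringViewArray};
use arrow::record_batch::RecordBatch;
use datafusion::common::Result;

/// Rewrite a batch so StringView columns own only the bytes they use.
fn compact_stringview_columns(batch: RecordBatch) -> Result<RecordBatch> {
    let columns: Vec<ArrayRef> = batch
        .columns()
        .iter()
        .map(|array| {
            if let Some(sv) = array.as_any().downcast_ref::<StringViewArray>() {
                // gc() copies only the referenced bytes into fresh buffers,
                // so the spill writer will not serialize unused data.
                Arc::new(sv.gc()) as ArrayRef
            } else {
                Arc::clone(array)
            }
        })
        .collect();
    Ok(RecordBatch::try_new(batch.schema(), columns)?)
}
```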
Are these changes tested?
Added one unit regression test. This test fails without the change.
Are there any user-facing changes?