
Fix: External sort failing on StringView due to shared buffers #14823

Merged: 6 commits into apache:main on Feb 26, 2025

Conversation

2010YOUY01 (Contributor):

Which issue does this PR close?

Follow-up to #14644: this PR fixes an unsolved failing case for external sort, found in #12136 (comment).

Rationale for this change

A recap of the major steps of external sort (there is detailed documentation in datafusion/physical-plan/src/sorts/sort.rs); a toy sketch follows the list:

  1. Inside each partition, input batches are buffered until the memory limit is reached.
  2. When the memory limit is hit, all buffered batches are sorted one by one and merged into one large sorted run (though physically it's chunked into many small batches).
  3. The sorted run is spilled to disk. After all input has been read, the spilled batches are read back and merged into the final result. Since only one batch per spilled sorted run is needed at a time to produce the final result, memory overhead is reduced in this step.
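To make the flow concrete, here is a toy, self-contained model of those three steps (my own illustration; DataFusion's actual implementation spills to disk and operates on Arrow RecordBatches):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy external sort: buffer until a "memory limit", sort and spill each
/// full buffer as one sorted run, then stream-merge the runs.
fn external_sort(input: Vec<Vec<i32>>, max_buffered: usize) -> Vec<i32> {
    let mut runs: Vec<Vec<i32>> = Vec::new();
    let mut buffered: Vec<i32> = Vec::new();

    for batch in input {
        buffered.extend(batch); // step 1: buffer incoming batches
        if buffered.len() >= max_buffered {
            buffered.sort_unstable(); // step 2: sort on memory pressure
            runs.push(std::mem::take(&mut buffered)); // step 3: "spill"
        }
    }
    if !buffered.is_empty() {
        buffered.sort_unstable();
        runs.push(std::mem::take(&mut buffered));
    }

    // Final merge: only the current head of each sorted run is needed at any
    // moment, which is why memory pressure stays low in this phase.
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, run_idx, 0usize)));
        }
    }
    let mut output = Vec::new();
    while let Some(Reverse((v, run_idx, pos))) = heap.pop() {
        output.push(v);
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    output
}
```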

The problem: if step 2 is performed on batches with StringViewArray columns, sorting only reorders the views (a string prefix, plus a pointer to a range in a payload buffer for long elements); the underlying buffers are not moved.
When reading back spilled batches from a single sorted run, they must be read one batch at a time, otherwise memory pressure is not reduced. As a result, the spill writer has to write all referenced buffers for every batch, so many buffers are written multiple times. The size of the spilled data can explode, and some memory-limited sort queries fail for this reason.
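Here is a minimal arrow-rs sketch of that buffer sharing (my own illustration, not code from this PR):

```rust
use arrow::array::{Array, StringViewArray, UInt32Array};
use arrow::compute::take;

fn main() {
    // Strings longer than 12 bytes keep only a prefix inline in the view;
    // the payload lives in data buffers shared by the whole array.
    let array = StringViewArray::from_iter_values([
        "a string that is much longer than twelve bytes",
        "another long string living in the same payload buffer",
    ]);

    // `take` (like sorting) only permutes the views...
    let indices = UInt32Array::from(vec![1u32, 0]);
    let permuted = take(&array, &indices, None).unwrap();
    let permuted = permuted.as_any().downcast_ref::<StringViewArray>().unwrap();

    // ...so the output still references all of the original payload buffers.
    // Every small batch sliced from a sorted run drags those buffers along,
    // which is what makes the spill files explode.
    assert_eq!(array.data_buffers().len(), permuted.data_buffers().len());
}
```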

Reproducer

Under benchmarks/, run

cargo run --profile release-nonlto --bin dfbench -- sort-tpch  -p '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10' -o '/tmp/main.json' -q 10 --iterations 1 --partitions 4 --memory-limit 5000M --debug

Main branch result (20GB spilled)

SortExec: expr=[l_orderkey@0 ASC NULLS LAST, l_suppkey@1 ASC NULLS LAST, l_linenumber@2 ASC NULLS LAST, l_comment@3 ASC NULLS LAST], preserve_partitioning=[true], metrics=[output_rows=59986052, elapsed_compute=12.965460651s, spill_count=24, spilled_bytes=19472334146, spilled_rows=51202072]

Q10 iteration 0 took 12839.6 ms and returned 59986052 rows
Q10 avg time: 12839.60 ms

PR result (12GB spilled)

SortExec: expr=[l_orderkey@0 ASC NULLS LAST, l_suppkey@1 ASC NULLS LAST, l_linenumber@2 ASC NULLS LAST, l_comment@3 ASC NULLS LAST], preserve_partitioning=[true], metrics=[output_rows=59986052, elapsed_compute=15.363745389s, spill_count=48, spilled_bytes=12600523712, spilled_rows=55216152]

Q10 iteration 0 took 9424.3 ms and returned 59986052 rows
Q10 avg time: 9424.33 ms

What changes are included in this PR?

As described above, sorting a StringViewArray only permutes its views, so this PR reorganizes each such array into sequential order (compacting it with gc()) before it is spilled; a conceptual sketch follows.
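Conceptually, the change looks something like this sketch (an illustration of the approach, not the exact PR code; the helper name is hypothetical):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringViewArray};
use arrow::record_batch::RecordBatch;

/// Before a batch is spilled, rewrite every StringViewArray column so its
/// views point into a freshly compacted buffer instead of the buffers
/// shared across the whole sorted run.
fn compact_string_views(batch: RecordBatch) -> RecordBatch {
    let columns: Vec<ArrayRef> = batch
        .columns()
        .iter()
        .map(|array| {
            if let Some(sv) = array.as_any().downcast_ref::<StringViewArray>() {
                // gc() copies only the referenced bytes into a new buffer,
                // so the spilled batch no longer references shared buffers.
                Arc::new(sv.gc()) as ArrayRef
            } else {
                Arc::clone(array)
            }
        })
        .collect();
    // The schema is unchanged, so this cannot fail.
    RecordBatch::try_new(batch.schema(), columns).expect("schema unchanged")
}
```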

Are these changes tested?

Added one unit regression test. This test fails without the change.

Are there any user-facing changes?

github-actions bot added the physical-expr (Physical Expressions) and core (Core DataFusion crate) labels on Feb 22, 2025
- writer.write(&batch)?;
- spill_writer = Some(writer);
+ self.in_mem_batches.push(batch);
+ self.spill().await?;
2010YOUY01 (Contributor, Author):

This refactor is not related to the PR; I just did it along the way.

Contributor:

Do we need to keep the original logic:

  1. self.spill().await?;

  2. and then write the remaining batch to disk?

Because self.in_mem_batches.push(batch) may cause an OOM?

2010YOUY01 (Contributor, Author):

in_mem_batches is a Vec<RecordBatch>, so it doesn't have any internal mechanism to update the reservation. Also, the new batch is already in memory, so there is no difference between spilling it together with the rest or separately. I added a comment for this.
Besides, I think manually keeping the buffered batches in sync with the reservation is quite tricky; it makes the implementation hard to reason about and can cause bugs. Hopefully we can find an RAII-style way to improve it in the future (a possible shape is sketched below).
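One possible RAII shape (a hypothetical sketch, not part of this PR): tie the buffer and its reservation together so they cannot drift apart:

```rust
use arrow::record_batch::RecordBatch;
use datafusion_common::Result;
use datafusion_execution::memory_pool::MemoryReservation;

/// Buffered batches whose memory reservation is kept in sync automatically.
struct ReservedBatches {
    batches: Vec<RecordBatch>,
    reservation: MemoryReservation,
}

impl ReservedBatches {
    fn new(reservation: MemoryReservation) -> Self {
        Self { batches: Vec::new(), reservation }
    }

    /// Growing the reservation and buffering the batch happen together,
    /// so callers cannot forget one of the two.
    fn push(&mut self, batch: RecordBatch) -> Result<()> {
        self.reservation.try_grow(batch.get_array_memory_size())?;
        self.batches.push(batch);
        Ok(())
    }
}

impl Drop for ReservedBatches {
    fn drop(&mut self) {
        // Release the reservation when the buffered batches go away.
        self.reservation.free();
    }
}
```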

Contributor:

Thank you @2010YOUY01 for explaining, got it; I was wrong, we do not have an internal mechanism to update the reservation. An RAII-style improvement in the future is a great idea, maybe we can file a ticket for it!


@@ -1425,7 +1478,7 @@ mod tests {
     // Processing 840 KB of data using 400 KB of memory requires at least 2 spills
     // It will spill roughly 18000 rows and 800 KBytes.
     // We leave a little wiggle room for the actual numbers.
-    assert!((2..=10).contains(&spill_count));
+    assert!((12..=18).contains(&spill_count));
2010YOUY01 (Contributor, Author):

This is caused by the above refactor: the old implementation forgot to update the statistics, so we missed several counts.

Contributor:

Should we also update the comments?

2010YOUY01 (Contributor, Author):

Updated in 92ca3b0.

if let Some(string_view_array) =
    array.as_any().downcast_ref::<StringViewArray>()
{
    // Compact the views into a fresh buffer before spilling
    let new_array = string_view_array.gc();
Contributor:

Will string_view_array.gc() affect performance when it is called many times?

Contributor:

Updated: if I understand correctly, it shouldn't affect performance too much, because we only retain the buffer data that is actually used?

before gc:

sorted_batch1 -> buffer1
              -> buffer2
sorted_batch2 -> buffer1
              -> buffer2

after gc:

sorted_batch1 -> new buffer (used data of buffer1, used data of buffer2)
sorted_batch2 -> new buffer (used data of buffer1, used data of buffer2)
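A runnable sketch of this effect (my own illustration): gc() on a small slice copies only the referenced bytes, while the un-compacted slice keeps the full shared buffers alive:

```rust
use arrow::array::{Array, StringViewArray};

fn main() {
    let long = "x".repeat(64); // > 12 bytes, so stored in a payload buffer
    let array = StringViewArray::from_iter_values((0..1000).map(|_| long.as_str()));

    // A 10-row slice still references the payload buffers of all 1000 rows...
    let sliced = array.slice(0, 10);
    // ...while gc() copies just the 10 referenced strings into a new buffer.
    let compacted = sliced.gc();

    println!(
        "before gc: {} bytes, after gc: {} bytes",
        sliced.get_buffer_memory_size(),
        compacted.get_buffer_memory_size()
    );
}
```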

2010YOUY01 (Contributor, Author) commented Feb 24, 2025:

Good point, there is some inefficiency here; I filed apache/arrow-rs#7184. Once that's done on the arrow side, we can remove the copies here to speed this up.
IMO it won't cause a regression for DataFusion: this is only done when a spill is triggered, and if we don't copy here it can cause larger inefficiency or fail some memory-limited sort queries.

tustvold (Contributor):

FWIW I think this compaction logic probably makes sense as part of the IPCWriter, as opposed to a workaround here - apache/arrow-rs#7185

alamb (Contributor) left a comment:

Looks good to me -- thank you @2010YOUY01 and @zhuqi-lucas

I agree this would be better in the arrow IPC writer, so leaving a link to that ticket in the code comments would be good, I think.


2010YOUY01 (Contributor, Author):

Thank you all for the feedback. I have addressed the review comments (and also added a small further simplification to the refactor).

alamb (Contributor) commented Feb 26, 2025:

🚀

alamb merged commit 99c811a into apache:main on Feb 26, 2025
24 checks passed