GH-45624: [C++][Parquet] BitReader optimize case for num_bits == 1 #45625

Closed
wants to merge 5 commits into from

Member

@mapleFU mapleFU commented Feb 25, 2025

Rationale for this change

Profiling parquet-column-reader-benchmark shows that detail::GetValue_ takes about half of the time. We can try to optimize this.

What changes are included in this PR?

Avoid calling detail::GetValue_ where possible.
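
To illustrate the idea, here is a minimal sketch of a dedicated path for num_bits == 1 next to a generic bit-assembling path. It is not the code from this PR or Arrow's BitReader: `TinyBitReader`, `GetBit`, and `GetValueGeneric` are hypothetical names, and the generic loop only stands in for detail::GetValue_.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified LSB-first bit reader (an assumption for
// illustration; not Arrow's BitReader).
class TinyBitReader {
 public:
  TinyBitReader(const uint8_t* data, int64_t num_bytes)
      : data_(data), num_bits_total_(num_bytes * 8) {}

  // Generic path: assemble `num_bits` (1..32) bits one by one, LSB first.
  // Stands in for the more expensive general-purpose routine.
  bool GetValueGeneric(int num_bits, uint32_t* out) {
    assert(num_bits >= 1 && num_bits <= 32);
    if (pos_ + num_bits > num_bits_total_) return false;
    uint32_t v = 0;
    for (int i = 0; i < num_bits; ++i) {
      const int64_t bit = pos_ + i;
      const uint32_t b = (data_[bit >> 3] >> (bit & 7)) & 1u;
      v |= b << i;
    }
    pos_ += num_bits;
    *out = v;
    return true;
  }

  // Fast path for num_bits == 1: one shift and one mask.
  bool GetBit(uint32_t* out) {
    if (pos_ >= num_bits_total_) return false;
    *out = (data_[pos_ >> 3] >> (pos_ & 7)) & 1u;
    ++pos_;
    return true;
  }

  // Dispatch: skip the generic routine when a single bit is requested.
  bool GetValue(int num_bits, uint32_t* out) {
    return num_bits == 1 ? GetBit(out) : GetValueGeneric(num_bits, out);
  }

 private:
  const uint8_t* data_;
  int64_t num_bits_total_;
  int64_t pos_ = 0;  // current position, in bits
};
```

Both paths return the same values; the sketch only shows where the per-value work could be reduced when the bit width is 1.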

Are these changes tested?

Covered by existing tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added the awaiting review Awaiting review label Feb 25, 2025
@mapleFU mapleFU changed the title GH-45624: [C++][Parquet] BitReader optimize case for num_bits == 1 GH-45624: [C++][Parquet] POC: BitReader optimize case for num_bits == 1 Feb 25, 2025

⚠️ GitHub issue #45624 has been automatically assigned in GitHub to PR creator.

@mapleFU mapleFU changed the title GH-45624: [C++][Parquet] POC: BitReader optimize case for num_bits == 1 GH-45624: [C++][Parquet] BitReader optimize case for num_bits == 1 Feb 26, 2025
@mapleFU mapleFU requested a review from pitrou February 26, 2025 06:53
Member Author

mapleFU commented Feb 26, 2025

On my macOS M1 Pro, release build with O2, LLVM 17:

Before:

--------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------
ColumnReaderSkipInt32/Repetition:0/BatchSize:1000                                     114159 ns       105934 ns         6386 bytes_per_second=45.0128G/s
ColumnReaderSkipInt32/Repetition:1/BatchSize:1000                                     671094 ns       671074 ns         1042 bytes_per_second=3.78579G/s
ColumnReaderSkipInt32/Repetition:2/BatchSize:1000                                    1558048 ns      1557483 ns          451 bytes_per_second=1.73942G/s
ColumnReaderReadBatchInt32/Repetition:0/BatchSize:1000                                109435 ns       107567 ns         7154 bytes_per_second=44.3293G/s
ColumnReaderReadBatchInt32/Repetition:1/BatchSize:1000                                678187 ns       678090 ns         1032 bytes_per_second=3.74662G/s
ColumnReaderReadBatchInt32/Repetition:2/BatchSize:1000                               1575079 ns      1573793 ns          445 bytes_per_second=1.7214G/s
RecordReaderSkipRecords/Repetition:0/BatchSize:1000                                   107927 ns       107921 ns         6836 bytes_per_second=44.184G/s items_per_second=11.8605G/s
RecordReaderSkipRecords/Repetition:1/BatchSize:1000                                   699862 ns       681619 ns         1043 bytes_per_second=3.72722G/s items_per_second=1.87788G/s
RecordReaderSkipRecords/Repetition:2/BatchSize:1000                                  1968489 ns      1967636 ns          360 bytes_per_second=1.37684G/s items_per_second=650.527M/s
RecordReaderReadRecords/Repetition:0/BatchSize:1000/ReadDense:1                       135719 ns       135545 ns         5505 bytes_per_second=35.1793G/s items_per_second=9.44337G/s
RecordReaderReadRecords/Repetition:0/BatchSize:1000/ReadDense:0                       134429 ns       134208 ns         5412 bytes_per_second=35.5296G/s items_per_second=9.53742G/s
RecordReaderReadRecords/Repetition:1/BatchSize:1000/ReadDense:1                       810494 ns       809804 ns          868 bytes_per_second=3.13723G/s items_per_second=1.58063G/s
RecordReaderReadRecords/Repetition:1/BatchSize:1000/ReadDense:0                      4488228 ns      4474854 ns          158 bytes_per_second=581.363M/s items_per_second=286.043M/s
RecordReaderReadRecords/Repetition:2/BatchSize:1000/ReadDense:1                      1989086 ns      1987642 ns          349 bytes_per_second=1.36298G/s items_per_second=643.979M/s
RecordReaderReadRecords/Repetition:2/BatchSize:1000/ReadDense:0                      6130641 ns      6125723 ns          112 bytes_per_second=452.867M/s items_per_second=208.955M/s
RecordReaderReadAndSkipRecords/Repetition:0/BatchSize:10/LevelsPerPage:80000         3163549 ns      3162216 ns          222 bytes_per_second=1.50792G/s items_per_second=404.779M/s
RecordReaderReadAndSkipRecords/Repetition:0/BatchSize:1000/LevelsPerPage:80000        123934 ns       123942 ns         5851 bytes_per_second=38.4727G/s items_per_second=10.3274G/s
RecordReaderReadAndSkipRecords/Repetition:0/BatchSize:10000/LevelsPerPage:1000000    1294597 ns      1294418 ns          543 bytes_per_second=46.0474G/s items_per_second=12.3608G/s
RecordReaderReadAndSkipRecords/Repetition:1/BatchSize:10/LevelsPerPage:80000        13249022 ns     13240245 ns           53 bytes_per_second=196.485M/s items_per_second=96.6749M/s
RecordReaderReadAndSkipRecords/Repetition:1/BatchSize:1000/LevelsPerPage:80000       2433174 ns      2431853 ns          278 bytes_per_second=1069.77M/s items_per_second=526.348M/s
RecordReaderReadAndSkipRecords/Repetition:1/BatchSize:10000/LevelsPerPage:1000000   28409928 ns     28392640 ns           25 bytes_per_second=1.11716G/s items_per_second=563.526M/s
RecordReaderReadAndSkipRecords/Repetition:2/BatchSize:10/LevelsPerPage:80000        16677514 ns     16630512 ns           43 bytes_per_second=166.81M/s items_per_second=76.967M/s
RecordReaderReadAndSkipRecords/Repetition:2/BatchSize:100/LevelsPerPage:80000        5556215 ns      5519492 ns          130 bytes_per_second=502.608M/s items_per_second=231.905M/s
RecordReaderReadAndSkipRecords/Repetition:2/BatchSize:10000/LevelsPerPage:1000000   44267583 ns     44072625 ns           16 bytes_per_second=785.973M/s items_per_second=363.037M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2897 ns         2899 ns       241548 bytes_per_second=5.20201G/s items_per_second=2.79281G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              8307 ns         8307 ns        84341 bytes_per_second=1.81527G/s items_per_second=974.563M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024           1033 ns         1035 ns       669901 bytes_per_second=14.5643G/s items_per_second=7.81916G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              2317 ns         2318 ns       302602 bytes_per_second=6.50544G/s items_per_second=3.49258G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2404 ns         2344 ns       301618 bytes_per_second=6.43295G/s items_per_second=3.45366G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              2075 ns         2076 ns       337047 bytes_per_second=7.26225G/s items_per_second=3.89889G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              7493 ns         7485 ns        92441 bytes_per_second=2.01477G/s items_per_second=1081.67M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1651 ns         1650 ns       423798 bytes_per_second=9.13662G/s items_per_second=4.90518G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1656 ns         1657 ns       425351 bytes_per_second=9.10184G/s items_per_second=4.88651G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       1655 ns         1657 ns       409220 bytes_per_second=9.10111G/s items_per_second=4.88612G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1612 ns         1614 ns       432136 bytes_per_second=9.34538G/s items_per_second=5.01726G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1750 ns         1754 ns       397256 bytes_per_second=8.59702G/s items_per_second=4.61549G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1716 ns         1716 ns       408788 bytes_per_second=8.78623G/s items_per_second=4.71707G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1751 ns         1752 ns       397994 bytes_per_second=8.60641G/s items_per_second=4.62053G/s

After:

ColumnReaderSkipInt32/Repetition:0/BatchSize:1000                                     104728 ns       104715 ns         6313 bytes_per_second=45.5369G/s
ColumnReaderSkipInt32/Repetition:1/BatchSize:1000                                     593590 ns       593340 ns         1180 bytes_per_second=4.28177G/s
ColumnReaderSkipInt32/Repetition:2/BatchSize:1000                                    1447860 ns      1442420 ns          488 bytes_per_second=1.87818G/s
ColumnReaderReadBatchInt32/Repetition:0/BatchSize:1000                                107405 ns       107358 ns         6358 bytes_per_second=44.4158G/s
ColumnReaderReadBatchInt32/Repetition:1/BatchSize:1000                                598314 ns       597914 ns         1160 bytes_per_second=4.24901G/s
ColumnReaderReadBatchInt32/Repetition:2/BatchSize:1000                               1450487 ns      1449603 ns          484 bytes_per_second=1.86887G/s
RecordReaderSkipRecords/Repetition:0/BatchSize:1000                                   107744 ns       107665 ns         6477 bytes_per_second=44.289G/s items_per_second=11.8887G/s
RecordReaderSkipRecords/Repetition:1/BatchSize:1000                                   599904 ns       599654 ns         1175 bytes_per_second=4.23668G/s items_per_second=2.13457G/s
RecordReaderSkipRecords/Repetition:2/BatchSize:1000                                  1873631 ns      1869307 ns          381 bytes_per_second=1.44927G/s items_per_second=684.746M/s
RecordReaderReadRecords/Repetition:0/BatchSize:1000/ReadDense:1                       129576 ns       129590 ns         5411 bytes_per_second=36.7957G/s items_per_second=9.87727G/s
RecordReaderReadRecords/Repetition:0/BatchSize:1000/ReadDense:0                       130494 ns       130503 ns         5519 bytes_per_second=36.5385G/s items_per_second=9.80822G/s
RecordReaderReadRecords/Repetition:1/BatchSize:1000/ReadDense:1                       745940 ns       733402 ns          980 bytes_per_second=3.46405G/s items_per_second=1.74529G/s
RecordReaderReadRecords/Repetition:1/BatchSize:1000/ReadDense:0                      4283998 ns      4260679 ns          165 bytes_per_second=610.587M/s items_per_second=300.422M/s
RecordReaderReadRecords/Repetition:2/BatchSize:1000/ReadDense:1                      1910741 ns      1904768 ns          362 bytes_per_second=1.42228G/s items_per_second=671.998M/s
RecordReaderReadRecords/Repetition:2/BatchSize:1000/ReadDense:0                      6738269 ns      6262120 ns          117 bytes_per_second=443.003M/s items_per_second=204.404M/s
RecordReaderReadAndSkipRecords/Repetition:0/BatchSize:10/LevelsPerPage:80000         3185470 ns      3172373 ns          220 bytes_per_second=1.50309G/s items_per_second=403.483M/s
RecordReaderReadAndSkipRecords/Repetition:0/BatchSize:1000/LevelsPerPage:80000        123844 ns       123821 ns         5571 bytes_per_second=38.5103G/s items_per_second=10.3375G/s
RecordReaderReadAndSkipRecords/Repetition:0/BatchSize:10000/LevelsPerPage:1000000    1418022 ns      1368611 ns          501 bytes_per_second=43.5512G/s items_per_second=11.6907G/s
RecordReaderReadAndSkipRecords/Repetition:1/BatchSize:10/LevelsPerPage:80000        13900133 ns     13384222 ns           54 bytes_per_second=194.372M/s items_per_second=95.635M/s
RecordReaderReadAndSkipRecords/Repetition:1/BatchSize:1000/LevelsPerPage:80000       2475976 ns      2456324 ns          284 bytes_per_second=1059.11M/s items_per_second=521.104M/s
RecordReaderReadAndSkipRecords/Repetition:1/BatchSize:10000/LevelsPerPage:1000000   29345005 ns     29020917 ns           24 bytes_per_second=1119.21M/s items_per_second=551.326M/s
RecordReaderReadAndSkipRecords/Repetition:2/BatchSize:10/LevelsPerPage:80000        17031921 ns     16925167 ns           42 bytes_per_second=163.906M/s items_per_second=75.627M/s
RecordReaderReadAndSkipRecords/Repetition:2/BatchSize:100/LevelsPerPage:80000        5533645 ns      5508984 ns          128 bytes_per_second=503.567M/s items_per_second=232.348M/s
RecordReaderReadAndSkipRecords/Repetition:2/BatchSize:10000/LevelsPerPage:1000000   45686734 ns     45607312 ns           16 bytes_per_second=759.525M/s items_per_second=350.821M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2733 ns         2611 ns       257929 bytes_per_second=5.7755G/s items_per_second=3.1007G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              8815 ns         8593 ns        80671 bytes_per_second=1.75493G/s items_per_second=942.172M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024           1067 ns         1066 ns       657154 bytes_per_second=14.1458G/s items_per_second=7.59446G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              2302 ns         2298 ns       305955 bytes_per_second=6.56181G/s items_per_second=3.52285G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2335 ns         2331 ns       298706 bytes_per_second=6.46837G/s items_per_second=3.47268G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              2148 ns         2143 ns       319148 bytes_per_second=7.03663G/s items_per_second=3.77776G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              7559 ns         7559 ns        92018 bytes_per_second=1.99507G/s items_per_second=1071.1M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1676 ns         1681 ns       418708 bytes_per_second=8.97093G/s items_per_second=4.81623G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1684 ns         1682 ns       420145 bytes_per_second=8.9672G/s items_per_second=4.81423G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       1642 ns         1644 ns       429106 bytes_per_second=9.1724G/s items_per_second=4.92439G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1614 ns         1617 ns       433235 bytes_per_second=9.32817G/s items_per_second=5.00803G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1711 ns         1711 ns       411779 bytes_per_second=8.81428G/s items_per_second=4.73213G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1679 ns         1680 ns       415300 bytes_per_second=8.97395G/s items_per_second=4.81785G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1707 ns         1709 ns       408807 bytes_per_second=8.8216G/s items_per_second=4.73606G/s
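
To compare rows between the two runs, the relative change in wall time can be computed as (before - after) / before. The helper below is a small sketch with a hypothetical name, `RelativeImprovementPct`; a positive result means the "After" run was faster.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: percentage by which `after_ns` improves on `before_ns`.
// Positive means faster after the change, negative means a slowdown.
double RelativeImprovementPct(double before_ns, double after_ns) {
  assert(before_ns > 0.0);
  return (before_ns - after_ns) / before_ns * 100.0;
}
```

With the Time column above, ColumnReaderReadBatchInt32/Repetition:1/BatchSize:1000 (678187 ns to 598314 ns) comes out at roughly 11.8%, and RecordReaderReadRecords/Repetition:1/BatchSize:1000/ReadDense:0 (4488228 ns to 4283998 ns) at roughly 4.6%. These are single local runs, so some noise should be expected.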

Member

pitrou commented Feb 26, 2025

@ursabot please benchmark


ursabot commented Feb 26, 2025

Benchmark runs are scheduled for commit 2f8f27d. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.


Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 2f8f27d.

There were 3 benchmark results indicating a performance regression:

The full Conbench report has more details.

Member

pitrou commented Feb 26, 2025

Well, the benchmark results don't look terrific. Only test-mac-arm shows an improvement, while on other machines there are no improvements and some regressions might be related.

Member

pitrou commented Feb 26, 2025

Also, the PR complicates already complicated code. I would only be in favor if there was a really sizable improvement with this.

Member Author

mapleFU commented Feb 26, 2025

ColumnReaderReadBatchInt32/Repetition:1/BatchSize:1000 gets a 10% improvement, and RecordReaderReadRecords/Repetition:1/BatchSize:1000/ReadDense:0 also gets a 10% improvement. However, it depends on the batch size: when the data is aligned and a full batch is read, this optimization might not help, because the read hits the SIMD unpack code rather than detail::GetValue_.
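
The dependence on alignment can be sketched as follows. This is a simplified model, not Arrow's GetBatch: `GetBatch1Bit`, `ReadBit`, and `BatchStats` are hypothetical names, and the plain per-byte loop stands in for the SIMD unpack path. Values before a byte boundary and leftover values go through the scalar path (the part a num_bits == 1 fast path would help), while whole bytes go through the bulk path.

```cpp
#include <cstdint>

// Counts how many values each path decoded (hypothetical, for illustration).
struct BatchStats {
  int64_t scalar = 0;  // values decoded one at a time
  int64_t bulk = 0;    // values decoded by the bulk (stand-in for SIMD) path
};

// Read the single bit at absolute bit offset `bit_pos`, LSB first.
inline uint8_t ReadBit(const uint8_t* data, int64_t bit_pos) {
  return static_cast<uint8_t>((data[bit_pos >> 3] >> (bit_pos & 7)) & 1u);
}

// Decode `n` 1-bit values starting at bit offset `bit_pos` into `out`.
BatchStats GetBatch1Bit(const uint8_t* data, int64_t bit_pos, int64_t n,
                        uint8_t* out) {
  BatchStats stats;
  int64_t i = 0;
  // Head: scalar reads until the position reaches a byte boundary.
  while (i < n && ((bit_pos + i) & 7) != 0) {
    out[i] = ReadBit(data, bit_pos + i);
    ++i;
    ++stats.scalar;
  }
  // Body: whole bytes, 8 values at a time.
  while (n - i >= 8) {
    const uint8_t byte = data[(bit_pos + i) >> 3];
    for (int k = 0; k < 8; ++k) {
      out[i + k] = static_cast<uint8_t>((byte >> k) & 1u);
    }
    i += 8;
    stats.bulk += 8;
  }
  // Tail: remaining values that do not fill a whole byte.
  while (i < n) {
    out[i] = ReadBit(data, bit_pos + i);
    ++i;
    ++stats.scalar;
  }
  return stats;
}
```

In this model, an aligned read of 16 values never touches the scalar path, whereas a read of 10 values starting at bit offset 3 is decoded entirely by it.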

Actually, I think GetBatch would improve for num_bits == 1 when the read batch is not aligned. Let me close this for now.

@mapleFU mapleFU closed this Feb 26, 2025