Why are shared histogram sizes RADIX * 2 in OneSweep? #13

kannaiah · 2025-01-08T00:08:45Z


In https://github.com/b0nes164/GPUSorting/blob/main/GPUSortingCUDA/Sort/OneSweep.cu#L49, why is each of the four shared histograms sized RADIX * 2?

    __shared__ uint32_t s_globalHistFirst[RADIX * 2];
    __shared__ uint32_t s_globalHistSec[RADIX * 2];
    __shared__ uint32_t s_globalHistThird[RADIX * 2];
    __shared__ uint32_t s_globalHistFourth[RADIX * 2];
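For a sense of scale, assume 8-bit digits, i.e. RADIX = 256 (the value is not shown in the snippet, so this is an assumption). The four `uint32_t` arrays declared above would then take:

```math
4 \times (2 \times 256) \times 4\ \text{B} = 8192\ \text{B} = 8\ \text{KiB}
```

That is well under the 48 KiB of static `__shared__` memory a CUDA block may declare, which fits the point about occupancy made in the answer. With 32-bit keys and 8-bit digits there would also be four digit places, which is presumably why there are four arrays (one per digit pass); the thread itself does not state this.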

Answered by b0nes164

Jan 14, 2025


b0nes164 · 2025-01-14T08:16:58Z · Maintainer

We keep two histograms in each RADIX * 2 array to reduce atomic conflicts while counting digits during histogramming. Twice the histograms means roughly half the conflicts, and most contemporary GPUs have ample shared memory, so the extra space does not cut into occupancy. Why not more than two histograms, then? Going past two requires a more complicated reduction across the histograms, which may slow down the kernel. So two is a sweet spot between conflict avoidance and low reduction cost.
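A minimal CUDA sketch of the two-copy idea for a single digit place may make the trade-off concrete. Everything here is hypothetical and chosen for illustration, not taken from the repo: the kernel and helper names, `RADIX = 256`, `BLOCK_DIM = 128`, and the rule that the first half of the block counts into copy 0 and the second half into copy 1. Both copies live back to back in one `RADIX * 2` shared array, and merging them costs one add per bin (`s_hist[i] + s_hist[i + RADIX]`). With K copies that would grow to K - 1 adds per bin plus zeroing K * RADIX counters, which is the reduction cost being weighed against fewer conflicts. Note that only threads in different halves are kept apart; threads in the same half that see the same digit still serialize on one counter.

```cuda
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Assumed parameters for this sketch (hypothetical, not read from the repo).
constexpr uint32_t RADIX = 256;       // 8-bit digits
constexpr uint32_t RADIX_MASK = RADIX - 1;
constexpr uint32_t BLOCK_DIM = 128;   // two halves of 64 threads

// Histograms the digit at bit offset `shift` of every key. Each block keeps
// two RADIX-bin copies in one RADIX * 2 shared array: the first half of the
// block counts into copy 0, the second half into copy 1, so the two halves
// never contend for the same shared-memory counter.
__global__ void twoCopyDigitHistogram(const uint32_t* keys, uint32_t n,
                                      uint32_t shift, uint32_t* globalHist) {
    __shared__ uint32_t s_hist[RADIX * 2];

    // Zero both copies.
    for (uint32_t i = threadIdx.x; i < RADIX * 2; i += blockDim.x)
        s_hist[i] = 0;
    __syncthreads();

    // Pick this thread's copy (assumes blockDim.x == BLOCK_DIM).
    uint32_t* s_myHist = &s_hist[(threadIdx.x / (BLOCK_DIM / 2)) * RADIX];

    // Grid-stride loop: count each key's digit into this thread's copy.
    for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&s_myHist[(keys[i] >> shift) & RADIX_MASK], 1u);
    __syncthreads();

    // Reduce: one add per bin merges the two copies, then flush to global.
    for (uint32_t i = threadIdx.x; i < RADIX; i += blockDim.x) {
        uint32_t count = s_hist[i] + s_hist[i + RADIX];
        if (count != 0) atomicAdd(&globalHist[i], count);
    }
}

// Aborts with a message if a CUDA runtime call failed.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Host wrapper: returns the RADIX-bin histogram of the digit at `shift`.
std::vector<uint32_t> digitHistogram(const std::vector<uint32_t>& keys,
                                     uint32_t shift) {
    assert(shift < 32 && "digit offset must lie inside a 32-bit key");
    std::vector<uint32_t> hist(RADIX, 0);
    const uint32_t n = static_cast<uint32_t>(keys.size());
    if (n == 0) return hist;

    uint32_t* d_keys = nullptr;
    uint32_t* d_hist = nullptr;
    check(cudaMalloc(&d_keys, n * sizeof(uint32_t)));
    check(cudaMalloc(&d_hist, RADIX * sizeof(uint32_t)));
    check(cudaMemcpy(d_keys, keys.data(), n * sizeof(uint32_t),
                     cudaMemcpyHostToDevice));
    check(cudaMemset(d_hist, 0, RADIX * sizeof(uint32_t)));

    const uint32_t blocks =
        std::min<uint32_t>((n + BLOCK_DIM - 1) / BLOCK_DIM, 256);
    twoCopyDigitHistogram<<<blocks, BLOCK_DIM>>>(d_keys, n, shift, d_hist);
    check(cudaGetLastError());
    // Blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(hist.data(), d_hist, RADIX * sizeof(uint32_t),
                     cudaMemcpyDeviceToHost));
    check(cudaFree(d_keys));
    check(cudaFree(d_hist));
    return hist;
}
```

Calling `digitHistogram` with `shift = 0, 8, 16, 24` would give the four digit histograms of a 32-bit key, one per sorting pass.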
