-
Notifications
You must be signed in to change notification settings - Fork 448
Wrap launch bounds #570
base: main
Are you sure you want to change the base?
Wrap launch bounds #570
Conversation
The performance results are interesting with the mix of better and worse. My interpretation of that is that some of the functions are not as well tuned as they could be. |
My interpretation is that presence of |
Testing revealed some issues of this approach. We can't simply remove #include <stdio.h>
#include <cub/cub.cuh>
using MaxPolicyT = cub::DispatchRadixSort<false, unsigned short, cub::NullType, unsigned int>::MaxPolicy;
int main() {
int sm_occupancy{};
if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&sm_occupancy,
cub::DeviceRadixSortDownsweepKernel<MaxPolicyT, false, true, unsigned short, cub::NullType, unsigned int>,
512,
0)) {
std::printf("error\n");
}
std::printf("%d\n", sm_occupancy);
} When In order to proceed with this PR, we have to retune entire CUB for every supported HW with and without |
As a shorter-term work around, we could clamp the threads-per-block at 128 when RDC is used. With the 255 registers per thread limit, then we are guaranteed that a CTA size of 128 will always fall within the register allocation requirements. This would obviously have performance implications, but I don't see what else we can do better any time soon.
I don't think this would work. You can only use |
@jrhemstad I agree with your point, thanks! I'll probably try to clamp the threads block size. |
This PR addresses the following issue by replacing
__launch_bounds__
usages withCUB_DETAIL_LAUNCH_BOUNDS
.CUB_DETAIL_LAUNCH_BOUNDS
leads to__launch_bounds__
usage only when RDC is not specified. Builds without RDC are not affected by this PR. For builds with RDC, the max performance differences are:Negative diff means speedup of the version without
__launch_bounds__
. Since the results are quite controversial, I wouldn't like to advertise the macro as our API. If absolutely needed, one might define:But for now it's an implementation detail that fixes compilation with RDC in some corner cases. Going forward, we might consider having tuning API that would control
__launch_bounds__
specification as well as pragma unroll usage. The default tuning would be a function of the input types.