I don't have a concrete high-priority issue that needs solving, but users may be surprised that avoiding the CUB segmented sum is much faster here.
The following CuPy code uses CUB by default on newer versions (set CUPY_ACCELERATORS=cub to make sure):
import cupy as cp
from cupyx.profiler import benchmark
x = cp.ones((1000000, 2))
# Sum over the last axis, which has only two elements:
benchmark(lambda: x.sum(-1), n_repeat=100)
# GPU time spent: 1927.055 us
# Do the sum manually:
benchmark(lambda: x[..., 0] + x[..., 1], n_repeat=100)
# GPU time spent: 56.361 us
That makes the CUB path roughly 34 times slower (1927 us vs. 56 us) than what would be close to optimal.
Now, as a NumPy dev, I accept that NumPy is also still bad at this: by about a factor of 10! CuPy without CUB was good at it, though.
But, maybe there is an easy win here that would remove the surprise of having to rewrite the code.
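Until such a fix exists, the manual rewrite above can be wrapped in a small helper. The following is only a sketch: `sum_short_last_axis` and its `max_unrolled` threshold are hypothetical names chosen for illustration, not part of CuPy or NumPy. It is written against NumPy so it runs without a GPU, and it relies only on basic slicing and `+`, which CuPy arrays also support. Note that for small integer dtypes the result dtype can differ from `x.sum(-1)`, which promotes to a wider integer.

```python
import numpy as np


def sum_short_last_axis(x, max_unrolled=8):
    """Sum over the last axis, using elementwise adds when that axis is short.

    Hypothetical helper: `max_unrolled` is an arbitrary illustrative cutoff,
    not a tuned value.
    """
    n = x.shape[-1]
    # For empty, single-element, or long axes, defer to the regular reduction.
    if n < 2 or n > max_unrolled:
        return x.sum(-1)
    # Unrolled form of x[..., 0] + x[..., 1] + ... + x[..., n - 1].
    total = x[..., 0] + x[..., 1]
    for i in range(2, n):
        total = total + x[..., i]
    return total


x = np.ones((1000, 2))
result = sum_short_last_axis(x)
```

Whether this is actually faster depends on the backend and the array shape, so it is worth benchmarking both paths on the target hardware before adopting it.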
The following PR partially addresses the issue. In the offline discussion, we concluded that providing an overload that takes a single segment size would be preferable. This overload would significantly reduce temporary storage size and improve performance.
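To illustrate why a single segment size helps, here is a minimal NumPy sketch of the two ways to describe the same segments. It is not CUB's API; the function names are hypothetical. The offsets-based form needs an extra array with one entry per segment, while the fixed-size form only needs one integer, so no offsets have to be built or stored.

```python
import numpy as np


def segmented_sum_offsets(flat, offsets):
    """Sum segments given explicit start offsets (one entry per segment)."""
    return np.add.reduceat(flat, offsets)


def segmented_sum_fixed(flat, segment_size):
    """Sum segments that all share the same length; no offsets array needed."""
    return flat.reshape(-1, segment_size).sum(axis=1)


flat = np.arange(12.0)
segment_size = 2
# The offsets array grows with the number of segments.
offsets = np.arange(0, flat.size, segment_size)

by_offsets = segmented_sum_offsets(flat, offsets)
by_fixed = segmented_sum_fixed(flat, segment_size)
```

For the `(1000000, 2)` array in the report, the offsets-based description would carry about a million offsets for segments of just two elements each.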