Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

__shfl_sync instructions may have wrong member mask #854

Open
jglaser opened this issue Dec 8, 2018 · 4 comments
Open

__shfl_sync instructions may have wrong member mask #854

jglaser opened this issue Dec 8, 2018 · 4 comments
Labels
cub For all items related to CUB

Comments

@jglaser
Copy link

jglaser commented Dec 8, 2018

When using WarpScanShfl from warp_scan_shfl.cuh inside a while() loop and in conjunction with a sub-warp LOGICAL_WARP_THREADS argument, i.e. LOGICAL_WARP_THREADS=2^n with n<5, I get lots of errors like these with cuda-memcheck --tool synccheck

========= Barrier error detected. Invalid arguments
=========     at 0x000000d0 in __cuda_sm70_shflsync_idx
=========     by thread (17,0,0) in block (204,0,0)
=========     Device Frame:__cuda_sm70_shflsync_idx (__cuda_sm70_shflsync_idx : 0xd0)
=========     Device Frame:/ccsopen/home/glaser/hoomd-blue/hoomd/extern/cub/cub/block/specializations/../../block/../util_ptx.cuh:358:void gpu_compute_nlist_binned_kernel<unsigned char=0, int=1, int=1>(unsigned int*, unsigned int*, double4*, unsigned int*, unsigned int const *, unsigned int const *, double4 const *, unsigned int const *, double const *, unsigned int, unsigned int const *, double4 const *, unsigned int const *, double4 const *, unsigned int const *, Index3D, Index2D, Index2D, BoxDim, double const *, double, unsigned int, double3, unsigned int, unsigned int, unsigned int) (void gpu_compute_nlist_binned_kernel<unsigned char=0, int=1, int=1>(unsigned int*, unsigned int*, double4*, unsigned int*, unsigned int const *, unsigned int const *, double4 const *, unsigned int const *, double const *, unsigned int, unsigned int const *, double4 const *, unsigned int const *, double4 const *, unsigned int const *, Index3D, Index2D, Index2D, BoxDim, double const *, double, unsigned int, double3, unsigned int, unsigned int, unsigned 

I believe the root cause is the following.

WarpScanShfl sets its member_mask for the shfl_sync to reflect the sub-warp membership. However, what happens if some threads exit early, the compiler may predicate off this initialization statement, leaving member_mask in an invalid state. Later, when the PTX assembly instruction shfl.sync.idx.b32 is hit, it is executed without predicate (such as @p) and thus with a wrong mask. Then cuda-memcheck finds that the executing warp lane executes an implicit syncwarp but without its mask bits set, and issues an error, as documented here:
https://docs.nvidia.com/cuda/cuda-memcheck/index.html#synccheck-demo-illegal-syncwarp

The safe solution would be to always use the full mask (0xffffffffu) to synchronize the entire warp. I realize this may not fully take advantage of Volta's independent thread scheduling. However, if that were the goal I think the CUB API would have to expose the member_mask somehow to allow the user to set it, so that it is possible to issue e.g. a ballot_sync outside CUB first, and then pass the member mask to CUB. As discussed here: https://devblogs.nvidia.com/using-cuda-warp-level-primitives/

I will submit a pull request shortly for this solution.

@alliepiper
Copy link
Collaborator

This had a PR (#155). It looks like it closed during the reorganizations.

We should resurrect the PR, rebase it on main, and take a look, this sounds important.

@alliepiper
Copy link
Collaborator

@jglaser I know this issue is several years old, but do you happen to have some code that reproduces this or a recollection of how this situation occurred for you?

I'll summarize your report to make sure I understand it:

  1. A partial warp enters WarpScanShfl::InclusiveScan using a LOGICAL_WARP_SIZE less than 32.
  2. The compiler implements this divergence by predicating the instructions from the warp scan implementation.
  3. WarpScanShuffle injects inline PTX that calls an unpredicated shfl.sync.up.b32.
  4. The lanes that /should/ be inactive execute this instruction with an uninitialized member_mask.

Basically, this: https://www.godbolt.org/z/T6jrx75M6

I'm trying to find a repro where the compiler would implement the branch using an execution mask instead of branching so I can test this, but it's not clear how this would happen.

@jrhemstad jrhemstad added the cub For all items related to CUB label Feb 22, 2023
@miscco
Copy link
Collaborator

miscco commented Feb 23, 2023

We would need to get a reproducer on this to keep this open. @jglaser could you provide a recent reproducer?

@jglaser
Copy link
Author

jglaser commented Feb 23, 2023

I am afraid that I am not actively working on the CUB code inside HOOMD-blue which would allow me to test the above behavior again. Feel free to close, if I remember correctly it may have been a false positive with cuda-memcheck. If not, can always reopen later.

@jarmak-nv jarmak-nv transferred this issue from NVIDIA/cub Nov 8, 2023
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 8, 2023
@miscco miscco removed their assignment Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cub For all items related to CUB
Projects
Status: Todo
Development

No branches or pull requests

4 participants