We have been using the bulk asynchronous reduction copy (`cp.reduce.async.bulk`). However, while using it, we probably noticed some race conditions when several CTAs reduce into the same destination. Now, I wrote "probably" because I am not sure if there are other bugs in our kernel. Hence, this topic is meant to ask whether the reduction is atomic. Here, "atomic" means that it prevents race conditions across CTAs, i.e., in the sense of an atomic read-modify-write. If the answer is no, can you suggest a way to synchronize them? Your help is deeply appreciated!
The following excerpt is from the [PTX ISA docs](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-reduce-async-bulk):
If you want to perform reductions atomically, you can use a semaphore-based technique like we do in the serial split-K kernels.
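For concreteness, here is a minimal, hypothetical sketch of that kind of semaphore-based serialization (the kernel name, the `semaphore` counter, and the `partials` layout are made up for illustration and are not the actual split-K code): each CTA spins on a global counter until it is its turn, performs its reduction with ordinary loads and stores, and then releases the counter for the next CTA.

```cuda
#include <cuda_runtime.h>

// Sketch: serialize CTAs with a global counter used as a semaphore.
// Each CTA reduces one slice of `partials` into `out`; the counter guarantees
// that only one CTA updates `out` at a time, so plain loads/stores are safe.
__global__ void serialized_reduce(float* out, const float* partials,
                                  int n, unsigned int* semaphore) {
    const unsigned int cta = blockIdx.x;

    // Acquire: one thread spins until the counter equals this CTA's index.
    if (threadIdx.x == 0) {
        while (atomicAdd(semaphore, 0u) != cta) { /* spin */ }
        __threadfence();  // make the previous CTA's stores visible to us
    }
    __syncthreads();      // the whole CTA waits for the acquisition

    // Critical section: accumulate this CTA's partial results into `out`.
    const float* my_partial = partials + static_cast<size_t>(cta) * n;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        out[i] += my_partial[i];
    }

    // Release: make our stores visible, then hand the turn to the next CTA.
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();
        atomicAdd(semaphore, 1u);
    }
}
```

The counter must be zero-initialized (e.g. with `cudaMemset`) before launch. In the real serial split-K kernels the semaphore is kept per output tile in a workspace buffer rather than being a single global counter, so only the CTAs contributing to the same tile are serialized.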