We have been using the bulk asynchronous reduction copy (`cp.reduce.async.bulk`). However, while using it, we probably noticed some race conditions when several CTAs reduce into the same destination. Now, I wrote "probably" because I am not sure if there are other bugs in our kernel. Hence, this topic is meant to ask whether the reduction is atomic. Here, "atomic" means that it prevents race conditions across CTAs, i.e., in the sense of an atomic read-modify-write. If the answer is no, can you suggest a way to synchronize them? Your help is deeply appreciated!
The following excerpt is from the [PTX ISA docs](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-reduce-async-bulk):
If you want to perform reductions atomically, you can use a semaphore-based technique like we do in the serial split-K kernels.
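For concreteness, here is a minimal, hypothetical sketch of that kind of semaphore-based serialization (the kernel name, the `semaphore` counter, and the `partials` layout are made up for illustration and are not the actual split-K code): each CTA spins on a global counter until it is its turn, performs its reduction with ordinary loads and stores, and then releases the counter for the next CTA.

```cuda
#include <cuda_runtime.h>

// Sketch: serialize CTAs with a global counter used as a semaphore.
// Each CTA reduces one slice of `partials` into `out`; the counter guarantees
// that only one CTA updates `out` at a time, so plain loads/stores are safe.
__global__ void serialized_reduce(float* out, const float* partials,
                                  int n, unsigned int* semaphore) {
    const unsigned int cta = blockIdx.x;

    // Acquire: one thread spins until the counter equals this CTA's index.
    if (threadIdx.x == 0) {
        while (atomicAdd(semaphore, 0u) != cta) { /* spin */ }
        __threadfence();  // make the previous CTA's stores visible to us
    }
    __syncthreads();      // the whole CTA waits for the acquisition

    // Critical section: accumulate this CTA's partial results into `out`.
    const float* my_partial = partials + static_cast<size_t>(cta) * n;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        out[i] += my_partial[i];
    }

    // Release: make our stores visible, then hand the turn to the next CTA.
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();
        atomicAdd(semaphore, 1u);
    }
}
```

The counter must be zero-initialized (e.g. with `cudaMemset`) before launch. In the real serial split-K kernels the semaphore is kept per output tile in a workspace buffer rather than being a single global counter, so only the CTAs contributing to the same tile are serialized.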