
Is TMA_REDUCE_ADD atomic? #1487

Answered by thakkarV
hyhieu asked this question in Q&A
Apr 16, 2024 · 1 comment · 1 reply


The following excerpt is from the [PTX ISA docs](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-reduce-async-bulk):

> Each reduction operation performed by the cp.reduce.async.bulk has individually .relaxed.gpu memory ordering semantics. The load operations in cp.reduce.async.bulk are treated as weak memory operation and the [complete-tx](https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier-complete-tx-operation) operation on the mbarrier has .release semantics at the .cluster scope as described in the [Memory Consistency Model](https://docs.nvidia.com/cuda/parallel-thread-e…
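To make the quoted semantics concrete, here is a CPU-side C++ model. It is a sketch under stated assumptions, not TMA or PTX code. Several producer threads add the same source tile into one destination, and every element update is its own relaxed atomic read-modify-write. That is our reading of "each reduction operation ... has individually .relaxed.gpu memory ordering semantics": the updates are per element, not one atomic operation over the whole tile. The names `model_reduce_add` and `run_concurrent_reductions` are hypothetical. `std::atomic` with `std::memory_order_relaxed` stands in for the `.relaxed.gpu` scope. Nothing here models the asynchronous copy, the weak shared-memory loads, or the mbarrier complete-tx.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical model of one bulk reduce-add: every element of `src` is added
// into `dst` with its own atomic read-modify-write. memory_order_relaxed
// stands in for the per-element .relaxed.gpu semantics quoted from the PTX
// docs: each update is indivisible, but it is not ordered with respect to
// other memory operations.
void model_reduce_add(std::vector<std::atomic<std::int32_t>>& dst,
                      const std::vector<std::int32_t>& src) {
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i].fetch_add(src[i], std::memory_order_relaxed);
    }
}

// Hypothetical driver: `producers` threads concurrently reduce the same
// source tile [1, 2, ..., n] into one zero-initialized destination. The
// caller snapshots the result only after joining every thread.
std::vector<std::int32_t> run_concurrent_reductions(int producers,
                                                    std::size_t n) {
    std::vector<std::atomic<std::int32_t>> dst(n);
    for (auto& elem : dst) elem.store(0, std::memory_order_relaxed);

    std::vector<std::int32_t> src(n);
    for (std::size_t i = 0; i < n; ++i) src[i] = static_cast<std::int32_t>(i + 1);

    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
        threads.emplace_back(model_reduce_add, std::ref(dst), std::cref(src));
    }
    // join() is the synchronization point. The relaxed RMWs alone do not
    // tell the caller when the totals are final.
    for (auto& t : threads) t.join();

    std::vector<std::int32_t> result(n);
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = dst[i].load(std::memory_order_relaxed);
    }
    return result;
}
```

For example, `run_concurrent_reductions(8, 1024)` should give `dst[i] == 8 * (i + 1)` for every element, because per-element atomicity rules out lost updates. Relaxed ordering says nothing about *when* another thread observes those updates relative to other memory operations. That is why, in this model, the reader waits on a separate synchronization step (`join`) before trusting the totals.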


Answer selected by hyhieu