
[QST]Atomic addition for cute::half_t #1847

Closed

jpli02 opened this issue Sep 29, 2024 · 1 comment

jpli02 commented Sep 29, 2024

Hi,

I am trying to use a cute tensor to store the sum of different row blocks of a tensor.
Different threads may write to the same location when performing the accumulated sum.

Given cute tensors A and B (both of cute::half_t type), I want to accumulate the sum into A.
As the following code shows, I want to sum into A; how can I avoid a race condition?

    #pragma unroll
    for (int idx = 0; idx < size(A); ++idx) {
        A(idx) += B(idx);
    }  
    A.data() = A.data() + (-kBlockN);

Is there an atomicAdd supported for cute::half_t? I cannot use the built-in atomicAdd directly, since it doesn't accept half_t.

Thank you so much!

thakkarV (Collaborator) commented Sep 30, 2024

From the PTX ISA documentation, https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-atom:

Atomic operation with vector type:

atom{.sem}{.scope}{.global}.add{.level::cache_hint}.vec_32_bit.f32                  d, [a], b{, cache-policy};
atom{.sem}{.scope}{.global}.op.noftz{.level::cache_hint}.vec_16_bit.half_word_type  d, [a], b{, cache-policy};
atom{.sem}{.scope}{.global}.op.noftz{.level::cache_hint}.vec_32_bit.packed_type     d, [a], b{, cache-policy};

.sem =               { .relaxed, .acquire, .release, .acq_rel };
.scope =             { .cta, .cluster, .gpu, .sys };
.op =                { .add, .min, .max };
.half_word_type =    { .f16, .bf16 };
.packed_type =       { .f16x2, .bf16x2 };
.vec_16_bit =        { .v2, .v4, .v8 }
.vec_32_bit =        { .v2, .v4 };
.level::cache_hint = { .L2::cache_hint }
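
In CUDA C++, these instructions are exposed through the half-precision atomicAdd overloads in cuda_fp16.h. Below is a minimal sketch (not from the original comment; atomic_add_half and atomic_add_half2 are hypothetical helper names, and the include path for cute::half_t is assumed from CUTLASS 3.x) that relies on cute::half_t being layout-compatible with __half:

    #include <cuda_fp16.h>                     // __half, __half2, half-precision atomicAdd overloads
    #include <cute/numeric/numeric_types.hpp>  // cute::half_t (assumed CUTLASS 3.x header path)

    // Hypothetical helper: assumes cute::half_t is layout-compatible with
    // __half (both are 16-bit fp16 storage), so the pointer can be
    // reinterpreted. The __half overload of atomicAdd requires compute
    // capability 7.0+ (sm_70 and newer).
    __device__ inline void atomic_add_half(cute::half_t* dst, cute::half_t val) {
        atomicAdd(reinterpret_cast<__half*>(dst),
                  reinterpret_cast<__half const&>(val));
    }

    // Hypothetical packed variant: adds two adjacent fp16 values with a single
    // atomic, matching the .f16x2 packed_type above. The __half2 overload of
    // atomicAdd requires compute capability 6.0+, and dst must point at a
    // 4-byte-aligned pair of elements.
    __device__ inline void atomic_add_half2(cute::half_t* dst,
                                            cute::half_t v0, cute::half_t v1) {
        __half2 packed = __halves2half2(reinterpret_cast<__half const&>(v0),
                                        reinterpret_cast<__half const&>(v1));
        atomicAdd(reinterpret_cast<__half2*>(dst), packed);
    }

The accumulation loop from the question could then be written as (again a sketch under the same assumptions):

    // Assumes A is a gmem/smem-backed tensor, so &A(idx) yields a real address.
    #pragma unroll
    for (int idx = 0; idx < size(A); ++idx) {
        atomic_add_half(&A(idx), B(idx));
    }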
