Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Directed rounding #2576

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from
Draft

Conversation

orkolorko
Copy link

@orkolorko orkolorko commented Dec 3, 2024

In this pull request we add call to the intrinsics for directed add, sub, mul, div and fma and a small tutorial on how to expose intrinsics. The ptx code is as expected.

@maleadt
Copy link
Member

maleadt commented Dec 3, 2024

I think it would be better to use RoundingMode arguments instead of different functions. Base does this as well: https://github.com/JuliaLang/julia/blob/2590e675885b97579a7531c343a546f6f5bbcbe5/base/rounding.jl#L469-L472.

@orkolorko
Copy link
Author

What do you think if we define aliases with RoundingMode? Something like:
@device_override Base.fma(x, y, z, ::RoundingMode{:ToZero}) = CUDA.fma_rn(x, y, z).

I tried to pass a vector of RoundingMode to a CUDA kernel, but it did not work.
Another thing is that at the moment, I'm not confident that NVPTX is calling the right intrisics, I'm investigating the numerical behavior. Can you point me in the direction of where to check this?

@maleadt
Copy link
Member

maleadt commented Dec 3, 2024

What do you think if we define aliases with RoundingMode? Something like:
@device_override Base.fma(x, y, z, ::RoundingMode{:ToZero}) = CUDA.fma_rn(x, y, z).

Base.fma doesn't typically take this argument, so adding this definition wouldn't make your code generic. Might be something valuable to add to Base though.

Why wouldn't you unify the CUDA.fma definitions using a RoundingMode arg?

I'm not confident that NVPTX is calling the right intrisics, I'm investigating the numerical behavior. Can you point me in the direction of where to check this?

You can always inspect the SASS code using @device_code_sass.

@orkolorko
Copy link
Author

orkolorko commented Dec 8, 2024

I implemented the calls using a RoundingMode arg; I kept the function names and aliased them, but if you prefer I can remove the function names. I was able to expose the MMA interface, and tested against the test kernel in
#1426

function kernel_wmma_f64_lowlevel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m8n8k4_global_stride_f64(pointer(a_dev), 8)
    b_frag = WMMA.llvm_wmma_load_b_col_m8n8k4_global_stride_f64(pointer(b_dev), 4)
    c_frag = WMMA.llvm_wmma_load_c_col_m8n8k4_global_stride_f64(pointer(c_dev), 8)

    #d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag)
    #d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag, RoundToZero)
    #d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag, RoundUp)
    d_frag = WMMA.llvm_wmma_mma_col_col_m8n8k4_f64(a_frag, b_frag, c_frag, RoundDown)
    
    WMMA.llvm_wmma_store_d_col_m8n8k4_global_stride_f64(pointer(d_dev), d_frag, 8)
    return nothing
end

function call_kernel()
    m = n = 8
    k = 4
    dtype_a = dtype_b = Float64
    dtype_c = dtype_d = Float64

    d_a = CUDA.rand(dtype_a, m, k)
    d_b = CUDA.rand(dtype_b, k, n)
    d_c = CUDA.rand(dtype_c, m, n)
    d_d = CUDA.zeros(dtype_d, m, n)

    CUDA.@sync @cuda kernel_wmma_f64_lowlevel(d_a, d_b, d_c, d_d)
    return nothing
end

Everything seems to work fine! I also checked the numerical results for the operations and everything is fine.
As a reference it may be worth recording here that there is an error in the PTX documentation, i.e., there is a clash on the fragment size between here which states correctly that the accumulator fragments are A vector expression containing two .f64 elements from the matrix C. while here says it is a single element (wrong!!!).

@orkolorko orkolorko mentioned this pull request Dec 8, 2024
Copy link
Member

@maleadt maleadt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Keeping the _rn etc names seems fine to me seeing how CUDA defines them as well.

docs/src/tutorials/exposing_new_intrinsics.jl Outdated Show resolved Hide resolved
docs/src/tutorials/exposing_new_intrinsics.jl Outdated Show resolved Hide resolved
Comment on lines 47 to 101
# The binary operations as add, sub, mul, div have been implemented through a macro

function test_add!(out, x, y)
I = threadIdx().x
if I == 1
out[I] = CUDA.add(x, y, RoundNearest)
elseif I == 2
out[I] = CUDA.add(x, y, RoundToZero)
elseif I == 3
out[I] = CUDA.add(x, y, RoundUp)
elseif I == 4
out[I] = CUDA.add(x, y, RoundDown)
end
return
end

out_d = CuArray(zeros(4))
@cuda threads = 4 test_add!(out_d, 1.0, 2^(-54))
out_h = Array(out_d)

function test_sub!(out, x, y)
I = threadIdx().x
if I == 1
out[I] = CUDA.sub(x, y, RoundNearest)
elseif I == 2
out[I] = CUDA.sub(x, y, RoundToZero)
elseif I == 3
out[I] = CUDA.sub(x, y, RoundUp)
elseif I == 4
out[I] = CUDA.sub(x, y, RoundDown)
end
return
end

out_d = CuArray(zeros(4))
@cuda threads = 4 test_sub!(out_d, 1.0, 2^(-53))
out_h = Array(out_d)

function test_mul!(out, x, y)
I = threadIdx().x
if I == 1
out[I] = CUDA.mul(x, y, RoundNearest)
elseif I == 2
out[I] = CUDA.mul(x, y, RoundToZero)
elseif I == 3
out[I] = CUDA.mul(x, y, RoundUp)
elseif I == 4
out[I] = CUDA.mul(x, y, RoundDown)
end
return
end

out_d = CuArray(zeros(4))
@cuda threads = 4 test_mul!(out_d, 1.0 - 2^(-52), 1.0 + 2^(-52))
out_h = Array(out_d)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this part is still relevant to the 'defining an intrinsic' tutorial?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left only one example

Comment on lines -393 to -394


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change.

Comment on lines 18 to 19
"f32" => Float32,
"f64" => Float64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated changes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added intrinsics calls for WMMA with directed rounding modes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you keep that to a separate PR? We also currently don't support Float64 WMMA, see #1426.

@maleadt maleadt force-pushed the master branch 4 times, most recently from 5d585c4 to c850163 Compare December 20, 2024 08:18
@maleadt maleadt added enhancement New feature or request needs tests Tests are requested. cuda kernels Stuff about writing CUDA kernels. labels Dec 20, 2024
@maleadt maleadt marked this pull request as draft January 8, 2025 10:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda kernels Stuff about writing CUDA kernels. enhancement New feature or request needs tests Tests are requested.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants