Describe the bug
The CUDA code generated for the attached SDFG cannot be compiled:
    .dacecache/calculate_nabla2_for_w_gpu/src/cuda/calculate_nabla2_for_w_gpu_cuda.cu(95): error: too many arguments for class template "dace::SharedToGlobal1D"
    91                  dace::wcr_fixed<dace::ReductionType::Sum, double>::reduce_atomic(__var_174, *(&__var_228));
    92              }
    93          }
    94      }
    95      dace::SharedToGlobal1D<double, 4, 1, 1, 1, 1, true>(__var_174, 1, __var_230);
    96
    97  }
The problem disappears if I enable the SharedToGlobal1D function template in copy.cuh, which is currently commented out:
    /*
    template <typename T, int BLOCK_WIDTH, int BLOCK_HEIGHT, int BLOCK_DEPTH,
              int COPY_XLEN, int DST_XSTRIDE, bool ASYNC>
    static DACE_DFI void SharedToGlobal1D(
        const T *smem, int src_xstride, T *ptr)
    {
        GlobalToShared3D<T, BLOCK_WIDTH, BLOCK_HEIGHT, BLOCK_DEPTH, 1, 1, COPY_XLEN,
                         1, 1, DST_XSTRIDE, ASYNC>(
            smem, 1, 1, src_xstride, ptr);
    }
    */
So it seems to me that the lowering to CUDA code does not make use of the right template construct.
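To illustrate the kind of mismatch the compiler is reporting, here is a minimal standalone C++ sketch. It is not the DaCe code: `SharedToGlobal1DClass` and `SharedToGlobal1DFunc` are hypothetical names, and the six-parameter class template is only an assumption about what the compiler may have found under the name `SharedToGlobal1D`. The seven-parameter function template mirrors the parameter list of the commented-out definition quoted above.

```cpp
namespace sketch {

// ASSUMPTION: a class template with SIX template parameters (the destination
// stride is a run-time argument). The real declaration in copy.cuh may differ.
template <typename T, int BLOCK_WIDTH, int BLOCK_HEIGHT, int BLOCK_DEPTH,
          int COPY_XLEN, bool ASYNC>
struct SharedToGlobal1DClass {
    // Copy COPY_XLEN elements using run-time source and destination strides.
    static void Copy(const T *smem, int src_xstride, T *ptr, int dst_xstride) {
        for (int i = 0; i < COPY_XLEN; ++i)
            ptr[i * dst_xstride] = smem[i * src_xstride];
    }
};

// ASSUMPTION: a function template with SEVEN template parameters, following
// the parameter list of the commented-out SharedToGlobal1D in the report.
template <typename T, int BLOCK_WIDTH, int BLOCK_HEIGHT, int BLOCK_DEPTH,
          int COPY_XLEN, int DST_XSTRIDE, bool ASYNC>
void SharedToGlobal1DFunc(const T *smem, int src_xstride, T *ptr) {
    // Copy COPY_XLEN elements; the destination stride is a compile-time value.
    for (int i = 0; i < COPY_XLEN; ++i)
        ptr[i * DST_XSTRIDE] = smem[i * src_xstride];
}

// Supplying seven arguments to the six-parameter class template is rejected
// at compile time (the exact wording of the diagnostic depends on the compiler):
//
//   SharedToGlobal1DClass<double, 4, 1, 1, 1, 1, true>::Copy(src, 1, dst, 1);
//
// The same seven arguments match the function template, so this compiles.
inline double copy_one(double value) {
    double src[1] = {value};
    double dst[1] = {0.0};
    SharedToGlobal1DFunc<double, 4, 1, 1, 1, 1, true>(src, 1, dst);
    return dst[0];
}

}  // namespace sketch
```

Calling `sketch::copy_one(3.5)` from a `main` function exercises the seven-argument form. Uncommenting the `SharedToGlobal1DClass` line shows the arity error in isolation, which is consistent with the generated code at line 95 expecting a function-style template that is not currently enabled.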
To Reproduce
Please load the SDFG using the following program:
    import dace
    import os

    run_on_gpu = True
    sdfg_name = "calculate_nabla2_for_w_gpu.sdfg"
    path = os.path.join(os.getcwd(), sdfg_name)
    sdfg = dace.SDFG.from_file(path)

    if run_on_gpu:
        device = dace.DeviceType.GPU
        sdfg._name = f"{sdfg.name}_gpu"
        for _, _, array in sdfg.arrays_recursive():
            if not array.transient:
                array.storage = dace.dtypes.StorageType.GPU_Global
    else:
        device = dace.DeviceType.CPU

    sdfg.compile(validate=True)
sdfg.zip
Fix for CUDA codegen (#1442)
7c06755
This PR addresses #1388: fix python codegen and `SharedToGlobal1D` template to generate correct code for write without reduction.
edopao