Fix for CUDA codegen #1442
The argument to std::ifloor should be a double; otherwise the result is invalid on the GPU target.
Use new template for dace::SharedToGlobal1D
After the uplift to DaCe v0.15, an SDFG that previously worked started to show compilation errors. The SDFG validated both before and after the simplify pass, but it did not compile for CPU; when the simplify pass was skipped, compilation succeeded. The problem has been narrowed down to the scalar-to-symbol promotion, which moves a data access to an inter-state edge. Code generation for that data access needs the symbols that define the array strides. The method _used_symbols_internal therefore needs to be updated to account for data containers, including symbolic shape and strides. This commit contains a unit test to reproduce the issue and verify the proposed fix.
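To illustrate the idea in the commit message above, here is a minimal standalone sketch. It is not DaCe's `_used_symbols_internal`; the function names `free_names` and `used_symbols` and the dictionary layout for array descriptors are assumptions made purely for illustration. It shows why a data access on an inter-state edge should also contribute the symbols in that container's shape and strides.

```python
import ast


def free_names(expr: str) -> set:
    """Collect every identifier that appears in a Python-style expression."""
    tree = ast.parse(expr, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def used_symbols(edge_condition: str, arrays: dict) -> set:
    """Toy model: symbols needed to generate code for an inter-state edge.

    `arrays` maps a container name to a descriptor with symbolic
    "shape" and "strides" given as lists of expression strings
    (a hypothetical layout, not DaCe's data descriptor API).
    """
    symbols = set()
    for name in free_names(edge_condition):
        if name in arrays:
            # The name is a data container, not a symbol. Indexing it in
            # generated code requires its shape and stride symbols.
            desc = arrays[name]
            for dim_expr in desc["shape"] + desc["strides"]:
                symbols |= free_names(dim_expr)
        else:
            # A plain symbol used directly in the condition.
            symbols.add(name)
    return symbols


# Hypothetical container with a symbolic shape N and a symbolic stride S.
arrays = {"A": {"shape": ["N"], "strides": ["S"]}}
needed = used_symbols("A[i] > 0", arrays)
```

In this toy model, counting only the names that occur literally in the condition would miss `N` and `S`, which is the kind of omission the commit describes.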
Keep new logic, fix cuda codegen for 1D shared-to-global
Thank you for the review, I will address your comments in a new commit.
6d75df6 to 12b3cdd
@tbennun Test added, please re-review.
Thank you. Only minor comments remain
dace/codegen/targets/cuda.py
Outdated
@@ -1132,10 +1132,22 @@ def _emit_copy(self, state_id, src_node, src_storage, dst_node, dst_storage, dst
 func=funcname,
 type=dst_node.desc(sdfg).dtype.ctype,
 bdims=', '.join(_topy(self._block_dims)),
-is_async='true' if state_dfg.out_degree(dst_node) > 0 else 'true',
+is_async='true' if state_dfg.out_degree(dst_node) > 0 else 'false',
It should be the other way around (if there is a dependent read after it in the same state, sync).
Correct. I did not pay enough attention to is_async before. Besides correcting the value of this argument, I have also moved the synchronization point in the template function after the thread-level copy (see my last commit on copy.cuh).
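For readers following the exchange, the rule the reviewer states (synchronize if there is a dependent read in the same state) can be sketched as a small helper. The name `is_async_flag` is hypothetical and this is not the code that was merged; it only shows the intended direction of the condition, assuming the flag is emitted as a C++ boolean literal as in the diff above.

```python
def is_async_flag(out_degree: int) -> str:
    """Return the C++ literal for the is_async argument (hypothetical helper).

    out_degree is the number of outgoing edges of the destination node
    in the current state, i.e. how many dependent reads follow the copy.
    """
    # A dependent read in the same state needs the copied data to be
    # complete, so the copy must synchronize (is_async = false).
    # With no dependent read, the copy may stay asynchronous.
    return 'false' if out_degree > 0 else 'true'
```

Compared with the outdated diff, the two branches are swapped: a positive out-degree yields 'false'.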
@tbennun Thank you for the review. As I commented above, I have done one additional change related to |
Gentle reminder for @tbennun: please check my last comment and whether you can approve this PR.
This PR addresses #1388: fix python codegen and SharedToGlobal1D template to generate correct code for write without reduction.