Questions about GemmWithReduceK #723
-
What do you mean? We don't output partial sums unless split-K is used.
Do you plan to use TF32 Tensor Cores?
-
Only split-K does that. If you don't use split-K, we don't output partial sums.
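To make "partial sum" concrete, here is a minimal host-side C++ sketch of split-K, assuming row-major `float` matrices. The K dimension is cut into `splits` slices, each slice produces its own M x N partial result, and a second pass adds the partials together. The names `splitk_partials` and `reduce_partials` are hypothetical, not CUTLASS APIs; in a parallel split-K GEMM the partials would typically live in a device workspace and be reduced by a separate kernel, which this sketch only mimics on the CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: compute one M x N partial sum per K-slice.
// A is M x K, B is K x N, both row-major.
std::vector<std::vector<float>> splitk_partials(const std::vector<float>& A,
                                                const std::vector<float>& B,
                                                int M, int N, int K, int splits) {
  assert(A.size() == static_cast<std::size_t>(M) * K);
  assert(B.size() == static_cast<std::size_t>(K) * N);
  assert(splits > 0);
  std::vector<std::vector<float>> partials(
      splits, std::vector<float>(static_cast<std::size_t>(M) * N, 0.0f));
  const int slice = (K + splits - 1) / splits;  // ceil(K / splits)
  for (int s = 0; s < splits; ++s) {
    const int k_begin = s * slice;
    const int k_end = std::min(K, k_begin + slice);  // empty slice if k_begin >= K
    for (int m = 0; m < M; ++m) {
      for (int n = 0; n < N; ++n) {
        float acc = 0.0f;
        for (int k = k_begin; k < k_end; ++k) acc += A[m * K + k] * B[k * N + n];
        partials[s][m * N + n] = acc;  // partial sum over this K-slice only
      }
    }
  }
  return partials;
}

// Second pass: element-wise sum of all partial results gives the final D.
std::vector<float> reduce_partials(const std::vector<std::vector<float>>& partials) {
  std::vector<float> D(partials.at(0).size(), 0.0f);
  for (const auto& p : partials)
    for (std::size_t i = 0; i < D.size(); ++i) D[i] += p[i];
  return D;
}
```

Without split-K (`splits == 1`), there is a single "partial" that already equals D, so nothing extra needs to be written out.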
Yes, that is the right place. Suppose you want to reduce the A operand; your CUDA code is
If you want to reduce the B operand, your CUDA code is
You can try writing the above code in inline PTX to get better performance. The mainloop of the TF32 GEMM is already very busy, so fusing the reduction into it may cause a large performance drop; you need to benchmark it.
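When adapting the fused reduction to TF32, it helps to have a CPU reference to check results against. The sketch below is a hedged host-side reference, not the CUDA snippet from this reply and not CUTLASS code; the function name `gemm_with_reduce_k_ref` is hypothetical, and row-major `float` inputs are assumed. Reducing A along K yields a length-M vector (one sum per row of A); reducing B along K yields a length-N vector (one sum per column of B).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: D = A * B, plus the K-reductions of both operands.
// A is M x K, B is K x N, both row-major.
//   reduce_a[m] = sum over k of A[m][k]   (length M)
//   reduce_b[n] = sum over k of B[k][n]   (length N)
std::vector<float> gemm_with_reduce_k_ref(const std::vector<float>& A,
                                          const std::vector<float>& B,
                                          int M, int N, int K,
                                          std::vector<float>& reduce_a,
                                          std::vector<float>& reduce_b) {
  assert(A.size() == static_cast<std::size_t>(M) * K);
  assert(B.size() == static_cast<std::size_t>(K) * N);
  std::vector<float> D(static_cast<std::size_t>(M) * N, 0.0f);
  reduce_a.assign(M, 0.0f);
  reduce_b.assign(N, 0.0f);
  // The outer k loop mirrors the GEMM mainloop: the operand values read at
  // each k step feed both the multiply-accumulate and the running reduction.
  for (int k = 0; k < K; ++k) {
    for (int m = 0; m < M; ++m) reduce_a[m] += A[m * K + k];
    for (int n = 0; n < N; ++n) reduce_b[n] += B[k * N + n];
    for (int m = 0; m < M; ++m)
      for (int n = 0; n < N; ++n) D[m * N + n] += A[m * K + k] * B[k * N + n];
  }
  return D;
}
```

A TF32 kernel rounds its inputs to TF32 before the MMA, so compare against this FP32 reference with a tolerance rather than exact equality; how close the reduction vector gets also depends on whether the fused code sums the original FP32 values or the TF32-converted ones.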
-
@Enter-tainer I also needed to extend example 32 to TF32 and came up with this snippet, which worked for me. I was planning to upstream it, but I probably won't have time to open a PR before I go on holiday leave. For now, I'll just leave this code snippet here (feel free to turn it into a PR).