[FEA] transpose in epilogue/prologue #1780
Comments
Are you using the 2.x or 3.x API? In 3.x you should just be able to set your epilogue stride to whatever you want and it should just work.
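To illustrate why a configurable output stride makes a separate transpose unnecessary, here is a minimal sketch in plain C++. It is not CUTLASS code: `store_with_stride` and its parameters are hypothetical names chosen for illustration, and the loop only mimics what a strided store does conceptually. Element (i, j) is written at offset `i * stride_m + j * stride_n`, so stride (N, 1) gives a row-major result and stride (1, M) gives a column-major one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not a CUTLASS API: stores an M x N accumulator tile
// (held row-major in `acc`) through an arbitrary (stride_m, stride_n) pair.
void store_with_stride(const std::vector<float>& acc,
                       std::size_t M, std::size_t N,
                       std::vector<float>& out,
                       std::size_t stride_m, std::size_t stride_n) {
    assert(acc.size() == M * N);
    assert(out.size() >= M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            // The output layout is decided purely by the two strides.
            out[i * stride_m + j * stride_n] = acc[i * N + j];
        }
    }
}
```

Calling it with strides `(N, 1)` reproduces the input, while `(1, M)` yields the same matrix laid out column-major, i.e. the transpose is absorbed into the store.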
Thanks for your suggestion. I looked for epilogue APIs with "stride" in https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/epilogue.h, but I did not find any parameters related to stride. Could you please share more detailed hints on how to set the epilogue stride? In epilogue.h, the MMA results are first stored to shared memory and then loaded into registers before being stored to global memory. Modifying the iterators used in that file would be a huge amount of work, and I do not think that is the right thing to do.
I'm confused. The file you point to is not a 3.x API citizen. @hwu36 can help.
What are your data type and hardware? If fp16 or bf16, A can be any layout on Ampere.
My data type is fp16, and the hardware is A100-80G. I want to scatter the output on the column, as described in this issue. After that, I need to feed the output to PyTorch, where the tensor is assumed to be row-major. If I add a transpose kernel to convert from column-major to row-major before feeding it into PyTorch, the overhead will hurt the e2e performance of the model. Is there any way to add a transpose in the epilogue, so that I can scatter on the column and then transpose the tensor to row-major order in a single kernel?
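One possible reading of this request, sketched in plain C++ rather than CUTLASS: the accumulator is column-major, each column j is scattered to column `scatter_idx[j]`, and the result is written row-major in the same pass. The function name, `scatter_idx`, and `ld_out` are hypothetical and only illustrate the index arithmetic a fused epilogue would need.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fused epilogue, not CUTLASS code: reads a column-major M x N
// accumulator, scatters column j to column scatter_idx[j], and writes the
// result row-major with leading dimension ld_out, all in one pass.
void scatter_columns_row_major(const std::vector<float>& acc_col_major,
                               std::size_t M, std::size_t N,
                               const std::vector<std::size_t>& scatter_idx,
                               std::vector<float>& out, std::size_t ld_out) {
    assert(acc_col_major.size() == M * N);
    assert(scatter_idx.size() == N);
    assert(out.size() >= M * ld_out);
    for (std::size_t j = 0; j < N; ++j) {
        assert(scatter_idx[j] < ld_out);
        for (std::size_t i = 0; i < M; ++i) {
            // Element (i, j) of the column-major source lives at i + j * M.
            out[i * ld_out + scatter_idx[j]] = acc_col_major[i + j * M];
        }
    }
}
```

Because the destination offset is computed per element, the scatter and the layout change cost one store each and no intermediate buffer; on a GPU the access pattern would still need tuning for coalescing.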
This issue has been labeled |
I'm now using CUTLASS in my project. I found that some cases put constraints on the layout, such as requiring input matrix A and output matrix C to be row-major. These assumptions make it hard to use the CUTLASS GEMM kernel directly in my project without extra transpose kernels. If CUTLASS could provide examples showing how to fuse the transpose into the epilogue and "prologue" phases, the overhead of the transpose kernels would be eliminated.
I ran some tests to measure the overhead of a transpose kernel on an MxN matrix on A100-80G:
m = 8, n = 4096, latency = 8.40 us
m = 8, n = 1024, latency = 7.03 us
m = 8, n = 14336, latency = 7.12 us
m = 8, n = 4096, latency = 6.98 us
m = 32, n = 4096, latency = 6.97 us
m = 32, n = 1024, latency = 6.98 us
m = 32, n = 14336, latency = 8.26 us
m = 32, n = 4096, latency = 8.35 us
m = 256, n = 4096, latency = 15.65 us
m = 256, n = 1024, latency = 7.16 us
m = 256, n = 14336, latency = 46.49 us
m = 256, n = 4096, latency = 15.83 us
m = 512, n = 4096, latency = 28.82 us
m = 512, n = 1024, latency = 10.31 us
m = 512, n = 14336, latency = 92.73 us
m = 512, n = 4096, latency = 28.88 us
m = 4096, n = 8, latency = 7.07 us
m = 1024, n = 8, latency = 8.18 us
m = 14336, n = 8, latency = 7.52 us
m = 4096, n = 8, latency = 8.34 us
m = 4096, n = 32, latency = 8.35 us
m = 1024, n = 32, latency = 7.83 us
m = 14336, n = 32, latency = 9.65 us
m = 4096, n = 32, latency = 8.34 us
m = 4096, n = 256, latency = 15.90 us
m = 1024, n = 256, latency = 8.18 us
m = 14336, n = 256, latency = 46.24 us
m = 4096, n = 256, latency = 15.90 us
m = 4096, n = 512, latency = 26.69 us
m = 1024, n = 512, latency = 10.29 us
m = 14336, n = 512, latency = 88.09 us
m = 4096, n = 512, latency = 26.65 us
It can be seen that when the matrix is large, e.g. one dimension is 14336 and the other is 256 or 512, the transpose kernel takes roughly 46 to 93 us, which hurts the e2e performance of a model.
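To put the latencies above in context, they can be converted to an effective memory bandwidth. The sketch below assumes an out-of-place transpose that reads and writes every element once, and assumes 2-byte fp16 elements; neither assumption is stated with the measurements, and `transpose_bandwidth_gbps` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s of an out-of-place transpose of an m x n matrix,
// counting one read and one write of every element.
// elem_bytes = 2 assumes fp16, which is an assumption about the benchmark.
double transpose_bandwidth_gbps(std::size_t m, std::size_t n,
                                double latency_us,
                                std::size_t elem_bytes = 2) {
    assert(latency_us > 0.0);
    const double bytes_moved = 2.0 * static_cast<double>(m) *
                               static_cast<double>(n) *
                               static_cast<double>(elem_bytes);
    return bytes_moved / (latency_us * 1e-6) / 1e9;
}
```

Under these assumptions, m = 512, n = 14336 at 92.73 us works out to about 317 GB/s, while m = 8, n = 4096 at 8.40 us is about 16 GB/s. The roughly 7 us floor for the small shapes suggests they are dominated by fixed overhead rather than data movement, so the cost only scales with size for the larger matrices.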