[FEA] transpose in epilogue/prologue #1780

Open
xiaonans opened this issue Sep 4, 2024 · 7 comments

xiaonans commented Sep 4, 2024

I'm using CUTLASS in my project. I found that some kernels constrain the layouts, e.g. input matrix A and output matrix C must be row-major. These assumptions make it hard to drop a CUTLASS GEMM kernel directly into my project without adding separate transpose kernels. If CUTLASS could provide examples showing how to fuse the transpose into the epilogue and "prologue" phases, the overhead of the standalone transpose kernels would be eliminated.

I ran some tests to measure the overhead of a standalone transpose kernel for an MxN matrix on an A100-80G:
| m | n | latency (us) |
|---:|---:|---:|
| 8 | 4096 | 8.40 |
| 8 | 1024 | 7.03 |
| 8 | 14336 | 7.12 |
| 8 | 4096 | 6.98 |
| 32 | 4096 | 6.97 |
| 32 | 1024 | 6.98 |
| 32 | 14336 | 8.26 |
| 32 | 4096 | 8.35 |
| 256 | 4096 | 15.65 |
| 256 | 1024 | 7.16 |
| 256 | 14336 | 46.49 |
| 256 | 4096 | 15.83 |
| 512 | 4096 | 28.82 |
| 512 | 1024 | 10.31 |
| 512 | 14336 | 92.73 |
| 512 | 4096 | 28.88 |

| m | n | latency (us) |
|---:|---:|---:|
| 4096 | 8 | 7.07 |
| 1024 | 8 | 8.18 |
| 14336 | 8 | 7.52 |
| 4096 | 8 | 8.34 |
| 4096 | 32 | 8.35 |
| 1024 | 32 | 7.83 |
| 14336 | 32 | 9.65 |
| 4096 | 32 | 8.34 |
| 4096 | 256 | 15.90 |
| 1024 | 256 | 8.18 |
| 14336 | 256 | 46.24 |
| 4096 | 256 | 15.90 |
| 4096 | 512 | 26.69 |
| 1024 | 512 | 10.29 |
| 14336 | 512 | 88.09 |
| 4096 | 512 | 26.65 |

As the numbers show, when m or n is large (e.g. 14336), the transpose kernel noticeably hurts the end-to-end performance of a model.
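
For reference, numbers like these could come from a harness along the lines of the following sketch (a naive out-of-place fp16 transpose plus cudaEvent timing); this is an assumed reconstruction, not the original test code:

```cpp
#include <cuda_fp16.h>

// Naive out-of-place transpose: reads an (m x n) row-major matrix and writes
// its (n x m) row-major transpose. Purely illustrative; the actual kernel used
// for the measurements above was not posted.
__global__ void transpose_naive(const __half* in, __half* out, int m, int n) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if (row < m && col < n) {
    out[col * m + row] = in[row * n + col];
  }
}

// Average latency in microseconds over `iters` launches, timed with CUDA events.
float time_transpose_us(const __half* in, __half* out, int m, int n, int iters = 100) {
  dim3 block(32, 8);
  dim3 grid((n + block.x - 1) / block.x, (m + block.y - 1) / block.y);
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) {
    transpose_naive<<<grid, block>>>(in, out, m, n);
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms * 1000.0f / iters;
}
```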

xiaonans added the "? - Needs Triage" and "feature request" labels on Sep 4, 2024
thakkarV (Collaborator) commented Sep 4, 2024

Are you using the 2.x or 3.x API? In 3.x you should be able to set your epilogue stride to whatever you want and it should just work.
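
For illustration, a minimal CuTe sketch of that idea: in the 3.x API the output layout is described by a runtime stride, so a "transposed" D is just a column-major stride rather than a different kernel. The layouts below are illustrative only, not a full collective/builder setup:

```cpp
#include <cute/tensor.hpp>

using namespace cute;

int main() {
  int M = 512, N = 14336;
  // Row-major M x N output: elements contiguous along N, stride (N, 1).
  auto d_rowmajor = make_layout(make_shape(M, N), make_stride(N, Int<1>{}));
  // Column-major M x N output, i.e. the transposed memory order: stride (1, M).
  auto d_colmajor = make_layout(make_shape(M, N), make_stride(Int<1>{}, M));
  // A 3.x epilogue writes D through whatever stride is passed in its arguments,
  // so choosing the column-major stride yields transposed output without a
  // separate transpose kernel.
  print(d_rowmajor); print("\n");
  print(d_colmajor); print("\n");
}
```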

xiaonans (Author) commented Sep 6, 2024

> Are you using the 2.x or 3.x API? In 3.x you should be able to set your epilogue stride to whatever you want and it should just work.

Thanks for your suggestion.
I'm using CUTLASS 3.x.

I looked for epilogue APIs involving a "stride" in https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/epilogue.h, but I did not find any stride-related parameters. Would you please share more detailed hints on how to set the epilogue stride?

In that epilogue.h file, the MMA results are first stored to shared memory and then loaded back into registers before being written to global memory. Modifying the iterators used there would be a huge amount of work, and I do not think it is the right approach.

thakkarV (Collaborator) commented Sep 6, 2024

I'm confused. The file you are pointing to is not a 3.x API citizen. @hwu36 can help maybe?

hwu36 (Collaborator) commented Sep 6, 2024

What are your data type and hardware? If fp16 or bf16, A can be any layout on Ampere.
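
For illustration, that layout freedom is just a template argument in the 2.x device API. A minimal sketch with a column-major (i.e. pre-transposed) A on SM80, all other knobs left at their defaults; this is an assumed example, not a kernel from this thread:

```cpp
#include <cutlass/gemm/device/gemm.h>

// fp16 tensor-core GEMM on Ampere with column-major A. Threadblock shape,
// stages, etc. fall back to CUTLASS's defaults for this configuration.
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::ColumnMajor,  // A
    cutlass::half_t, cutlass::layout::RowMajor,     // B
    cutlass::half_t, cutlass::layout::RowMajor,     // C/D
    float,                                          // accumulator
    cutlass::arch::OpClassTensorOp,                 // tensor cores
    cutlass::arch::Sm80>;                           // Ampere
```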

xiaonans (Author) commented Sep 9, 2024

> What are your data type and hardware? If fp16 or bf16, A can be any layout on Ampere.

My data type is fp16, and the hardware is an A100-80G.

I want to scatter the output along the columns, as described in this issue. After that, I need to feed the output to PyTorch, where the tensor is assumed to be row-major.

If I add a transpose kernel to convert the output from column-major to row-major before feeding it into PyTorch, the overhead hurts the end-to-end performance of the model.

Is there any method to add a transpose in the epilogue, so that I can scatter on the columns and then convert the tensor to row-major order within a single kernel?
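
For reference, one well-known workaround (not suggested in this thread) relies on the identity D^T = (A*B)^T = B^T * A^T: running the GEMM with the operands swapped and every layout flipped produces, bit for bit, the buffer you would get by transposing D, so it can be handed to PyTorch as a row-major tensor. A tiny host-side check of the identity, purely illustrative:

```cpp
#include <cassert>
#include <vector>

int main() {
  const int M = 2, K = 3, N = 4;
  std::vector<float> A(M * K), B(K * N), D(M * N), Dt(N * M);
  for (int i = 0; i < M * K; ++i) A[i] = float(i + 1);
  for (int i = 0; i < K * N; ++i) B[i] = float(2 * i - 1);

  // D = A * B, everything row-major: D is M x N.
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
      D[m * N + n] = acc;
    }

  // Swapped problem Dt = B^T * A^T, everything row-major: Dt is N x M.
  for (int n = 0; n < N; ++n)
    for (int m = 0; m < M; ++m) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) acc += B[k * N + n] * A[m * K + k];
      Dt[n * M + m] = acc;
    }

  // Row-major Dt (N x M) occupies the same bytes as column-major D (M x N).
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      assert(Dt[n * M + m] == D[m * N + n]);
  return 0;
}
```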

github-actions bot commented Oct 9, 2024

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions bot commented Jan 7, 2025

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
