[FEA] transpose in epilogue/prologue #1780

Open
xiaonans opened this issue Sep 4, 2024 · 7 comments

xiaonans commented Sep 4, 2024

I'm using CUTLASS in my project. I found that some kernels constrain the layouts, e.g. input matrix A and output matrix C must be row-major. These assumptions make it hard to drop a CUTLASS GEMM kernel directly into my project without adding separate transpose kernels. If CUTLASS could provide examples showing how to fuse the transpose into the epilogue and "prologue" phases, the overhead of the standalone transpose kernels would be eliminated.

I ran some tests to measure the overhead of a standalone transpose kernel for an MxN matrix on an A100-80G:
| m | n | latency (us) |
|---:|---:|---:|
| 8 | 4096 | 8.40 |
| 8 | 1024 | 7.03 |
| 8 | 14336 | 7.12 |
| 8 | 4096 | 6.98 |
| 32 | 4096 | 6.97 |
| 32 | 1024 | 6.98 |
| 32 | 14336 | 8.26 |
| 32 | 4096 | 8.35 |
| 256 | 4096 | 15.65 |
| 256 | 1024 | 7.16 |
| 256 | 14336 | 46.49 |
| 256 | 4096 | 15.83 |
| 512 | 4096 | 28.82 |
| 512 | 1024 | 10.31 |
| 512 | 14336 | 92.73 |
| 512 | 4096 | 28.88 |

| m | n | latency (us) |
|---:|---:|---:|
| 4096 | 8 | 7.07 |
| 1024 | 8 | 8.18 |
| 14336 | 8 | 7.52 |
| 4096 | 8 | 8.34 |
| 4096 | 32 | 8.35 |
| 1024 | 32 | 7.83 |
| 14336 | 32 | 9.65 |
| 4096 | 32 | 8.34 |
| 4096 | 256 | 15.90 |
| 1024 | 256 | 8.18 |
| 14336 | 256 | 46.24 |
| 4096 | 256 | 15.90 |
| 4096 | 512 | 26.69 |
| 1024 | 512 | 10.29 |
| 14336 | 512 | 88.09 |
| 4096 | 512 | 26.65 |

As the numbers show, when m or n is large (e.g. 14336), the transpose kernel noticeably hurts the end-to-end performance of a model.
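
For reference, numbers like these could come from a harness along the lines of the following sketch (a naive out-of-place fp16 transpose plus cudaEvent timing); this is an assumed reconstruction, not the original test code:

```cpp
#include <cuda_fp16.h>

// Naive out-of-place transpose: reads an (m x n) row-major matrix and writes
// its (n x m) row-major transpose. Purely illustrative; the actual kernel used
// for the measurements above was not posted.
__global__ void transpose_naive(const __half* in, __half* out, int m, int n) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if (row < m && col < n) {
    out[col * m + row] = in[row * n + col];
  }
}

// Average latency in microseconds over `iters` launches, timed with CUDA events.
float time_transpose_us(const __half* in, __half* out, int m, int n, int iters = 100) {
  dim3 block(32, 8);
  dim3 grid((n + block.x - 1) / block.x, (m + block.y - 1) / block.y);
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) {
    transpose_naive<<<grid, block>>>(in, out, m, n);
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms * 1000.0f / iters;
}
```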

xiaonans added the "? - Needs Triage" and "feature request" labels on Sep 4, 2024
thakkarV (Collaborator) commented Sep 4, 2024

Are you using the 2.x or 3.x API? In 3.x you should be able to set your epilogue stride to whatever you want and it should just work.
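
For illustration, a minimal CuTe sketch of that idea: in the 3.x API the output layout is described by a runtime stride, so a "transposed" D is just a column-major stride rather than a different kernel. The layouts below are illustrative only, not a full collective/builder setup:

```cpp
#include <cute/tensor.hpp>

using namespace cute;

int main() {
  int M = 512, N = 14336;
  // Row-major M x N output: elements contiguous along N, stride (N, 1).
  auto d_rowmajor = make_layout(make_shape(M, N), make_stride(N, Int<1>{}));
  // Column-major M x N output, i.e. the transposed memory order: stride (1, M).
  auto d_colmajor = make_layout(make_shape(M, N), make_stride(Int<1>{}, M));
  // A 3.x epilogue writes D through whatever stride is passed in its arguments,
  // so choosing the column-major stride yields transposed output without a
  // separate transpose kernel.
  print(d_rowmajor); print("\n");
  print(d_colmajor); print("\n");
}
```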

xiaonans (Author) commented Sep 6, 2024

> Are you using the 2.x or 3.x API? In 3.x you should be able to set your epilogue stride to whatever you want and it should just work.

Thanks for your suggestion.
I'm using CUTLASS 3.x.

I looked for epilogue APIs involving a "stride" in https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/epilogue.h, but I did not find any stride-related parameters. Would you please share more detailed hints on how to set the epilogue stride?

In that epilogue.h file, the MMA results are first stored to shared memory and then loaded back into registers before being written to global memory. Modifying the iterators used there would be a huge amount of work, and I do not think it is the right approach.

thakkarV (Collaborator) commented Sep 6, 2024

I'm confused. The file you are pointing to is not a 3.x API citizen. @hwu36 can help maybe?

hwu36 (Collaborator) commented Sep 6, 2024

What are your data type and hardware? If fp16 or bf16, A can be any layout on Ampere.
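
For illustration, that layout freedom is just a template argument in the 2.x device API. A minimal sketch with a column-major (i.e. pre-transposed) A on SM80, all other knobs left at their defaults; this is an assumed example, not a kernel from this thread:

```cpp
#include <cutlass/gemm/device/gemm.h>

// fp16 tensor-core GEMM on Ampere with column-major A. Threadblock shape,
// stages, etc. fall back to CUTLASS's defaults for this configuration.
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::ColumnMajor,  // A
    cutlass::half_t, cutlass::layout::RowMajor,     // B
    cutlass::half_t, cutlass::layout::RowMajor,     // C/D
    float,                                          // accumulator
    cutlass::arch::OpClassTensorOp,                 // tensor cores
    cutlass::arch::Sm80>;                           // Ampere
```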

xiaonans (Author) commented Sep 9, 2024

> What are your data type and hardware? If fp16 or bf16, A can be any layout on Ampere.

My data type is fp16, and the hardware is an A100-80G.

I want to scatter the output along the columns, as described in this issue. After that, I need to feed the output to PyTorch, where the tensor is assumed to be row-major.

If I add a transpose kernel to convert the output from column-major to row-major before feeding it into PyTorch, the overhead hurts the end-to-end performance of the model.

Is there any method to add a transpose in the epilogue, so that I can scatter on the columns and then convert the tensor to row-major order within a single kernel?
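
For reference, one well-known workaround (not suggested in this thread) relies on the identity D^T = (A*B)^T = B^T * A^T: running the GEMM with the operands swapped and every layout flipped produces, bit for bit, the buffer you would get by transposing D, so it can be handed to PyTorch as a row-major tensor. A tiny host-side check of the identity, purely illustrative:

```cpp
#include <cassert>
#include <vector>

int main() {
  const int M = 2, K = 3, N = 4;
  std::vector<float> A(M * K), B(K * N), D(M * N), Dt(N * M);
  for (int i = 0; i < M * K; ++i) A[i] = float(i + 1);
  for (int i = 0; i < K * N; ++i) B[i] = float(2 * i - 1);

  // D = A * B, everything row-major: D is M x N.
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
      D[m * N + n] = acc;
    }

  // Swapped problem Dt = B^T * A^T, everything row-major: Dt is N x M.
  for (int n = 0; n < N; ++n)
    for (int m = 0; m < M; ++m) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) acc += B[k * N + n] * A[m * K + k];
      Dt[n * M + m] = acc;
    }

  // Row-major Dt (N x M) occupies the same bytes as column-major D (M x N).
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      assert(Dt[n * M + m] == D[m * N + n]);
  return 0;
}
```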

github-actions bot commented Oct 9, 2024

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions bot commented Jan 7, 2025

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
