[QST] Where is FlashAttention-2 CUTLASS kernel #1838
Comments
xFormers is different from FA2 and FA3. FA2 and FA3 are downstream of CUTLASS in @tridao's FA repo itself: https://github.com/Dao-AILab/flash-attention. Both FA2 and FA3 are written using CUTLASS.
Thank you for the reply.
FA2 should work quite well on all Sm8x GPUs, which includes the RTX 3000 and RTX 4000 series. I suspect it works well on Jetson Orin too, since that is Sm8x as well. YMMV, so you should benchmark to confirm. If it is not near peak utilization, it should be quite easy to tune. Although for inference, I suspect you want flash decode instead?
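For reference, a minimal benchmarking sketch (not from this thread) using the flash-attn Python package with PyTorch; the shapes and iteration counts are hypothetical and should be adjusted to your workload:

```python
import torch
from flash_attn import flash_attn_func

# Hypothetical shapes: batch 4, seqlen 2048, 16 heads, head dim 64.
# flash_attn_func expects (batch, seqlen, nheads, headdim) in fp16/bf16 on CUDA.
q = torch.randn(4, 2048, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Warm up, then time with CUDA events to avoid counting launch overhead.
for _ in range(10):
    flash_attn_func(q, k, v, causal=True)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    flash_attn_func(q, k, v, causal=True)
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end) / 100

# Rough forward attention FLOP count: 4 * batch * heads * seqlen^2 * headdim,
# halved for the causal mask. Compare the result against your GPU's fp16 peak.
flops = 4 * 4 * 16 * 2048**2 * 64 * 0.5
print(f"{ms:.3f} ms/iter, ~{flops / (ms * 1e-3) / 1e12:.1f} TFLOP/s")
```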
Thank you for the reply.
Hello, I am studying the fused_multi_head_attention example in CUTLASS.
The CUTLASS 3.5.1 README.md says the FlashAttention-2 kernel is in CUTLASS.
But the fused_multi_head_attention example is based on Meta's xFormers.
I cannot find the FlashAttention-2 CUTLASS kernels.
Are fused_multi_head_attention and FlashAttention the same?