Releases: DefTruth/CUDA-Learn-Notes
v2.4.15 HGEMM Up to 115 TFLOPS
What's Changed
Full Changelog: v2.4.13...v2.4.15
v2.4.13 HGEMM Up to 113 TFLOPS
What's Changed
- [Mat][Trans] Add f32/f32x4 row/col first kernel by @bear-zd in #89
- [Docs][Contribute] Add How to contribute Notes by @DefTruth in #90
- [HGEMM] optimize SMEM padding, up to 113 TFLOPS by @DefTruth in #92 (see sketch below)
- [Mat][Trans] Add f32x4_shared/bcf row/col first kernel. by @bear-zd in #91
- [Docs] rename mat_transpose -> mat-transpose by @DefTruth in #93
- [HGEMM] Add GeForce RTX 3080 Laptop benchmark by @DefTruth in #94
- [HGEMM] update HGEMM benchmark option by @DefTruth in #95
- [HGEMM] Refactor HGEMM WMMA 16x16x16 kernels by @DefTruth in #96
- [HGEMM] Update HGEMM WMMA Benchmark by @DefTruth in #97
Full Changelog: v2.4.12...v2.4.13
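
Both the bank-conflict-free (bcf) transpose kernels in #91 and the SMEM padding optimization in #92 rest on the same trick: pad each shared-memory row so that column-wise accesses spread across the 32 banks instead of serializing on one. A minimal sketch on the classic f32 transpose case; the 32x32 tile and kernel name are illustrative assumptions, not this release's actual kernels:

```cuda
#include <cuda_runtime.h>

// Launch with grid (n/32, n/32) and block dim3(32, 32); n divisible by 32.
__global__ void transpose_padded_sketch(const float* __restrict__ in,
                                        float* __restrict__ out, int n) {
  // The +1 pad makes the row stride 33 floats, so a warp reading a column
  // (fixed threadIdx.y, threadIdx.x = 0..31) hits 32 distinct banks.
  __shared__ float tile[32][32 + 1];
  int x = blockIdx.x * 32 + threadIdx.x;
  int y = blockIdx.y * 32 + threadIdx.y;
  tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced global load
  __syncthreads();
  x = blockIdx.y * 32 + threadIdx.x;                // swap block coordinates
  y = blockIdx.x * 32 + threadIdx.y;
  out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free column read
}
```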
v2.4.12 SGEMM TF32 Swizzle
What's Changed
- [SGEMM] SGEMM TF32 Thread Block Swizzle by @DefTruth in #84 (see sketch below)
- [HGEMM] mma4x4_warp4x4_stages with swizzle by @DefTruth in #86
- [SWISH] support Swish F32/F16 kernel by @wangzijian1010 in #85
- [SGEMM] Update SGEMM TF32 Benchmark by @DefTruth in #87
New Contributors
- @wangzijian1010 made their first contribution in #85
Full Changelog: v2.4.11...v2.4.12
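
The thread block swizzle from #84 remaps the linear block index so that blocks scheduled together compute nearby output tiles and reuse each other's B tiles through L2. A sketch of one common grouped ordering; GROUP_M and the exact mapping are assumptions, not necessarily the repo's scheme:

```cuda
// Walk the C tile grid GROUP_M tile-rows at a time: blocks with consecutive
// ids land in the same row group and share columns of B in L2.
__device__ __forceinline__ void swizzle_tile(int grid_m, int grid_n,
                                             int* bm, int* bn) {
  const int GROUP_M = 8;                              // assumed group height
  int bid     = blockIdx.y * gridDim.x + blockIdx.x;  // linear block id
  int per_grp = GROUP_M * grid_n;                     // blocks per group
  int group   = bid / per_grp;
  int first_m = group * GROUP_M;
  int size_m  = min(grid_m - first_m, GROUP_M);       // last group may be short
  *bm = first_m + (bid % per_grp) % size_m;           // swizzled tile row
  *bn = (bid % per_grp) / size_m;                     // swizzled tile column
}
```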
v2.4.11 HGEMM Block Swizzle
v2.4.10 SGEMM TF32 Stage 2/3
What's Changed
- [HGEMM] HGEMM WMMA Stage mma4x2+warp4x4 by @DefTruth in #76
- [SGEMM] Add SGEMM WMMA TF32 Stage2/3 by @DefTruth in #77
- [SGEMM] Add cuBLAS SGEMM F32/TF32 baseline by @DefTruth in #78
- [SGEMM] Add Kernel cudaFuncSetAttribute hint by @DefTruth in #79
- [RoPE] Add minimal RoPE f32/f32x4 pack impl by @bear-zd in #80 (see sketch below)
Full Changelog: v2.4.9...v2.4.10
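
The minimal RoPE from #80 boils down to rotating each adjacent pair of features by a position-dependent angle. A sketch of the f32 version under the usual assumptions (base 10000, even dim, flat seq_len x dim layout); the f32x4 variant packs four floats per load:

```cuda
#include <math.h>

// One thread per (position, pair); launch with >= seq_len * dim / 2 threads.
__global__ void rope_f32_sketch(const float* __restrict__ x,
                                float* __restrict__ y,
                                int seq_len, int dim) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int half_dim = dim / 2;
  if (idx >= seq_len * half_dim) return;
  int pos = idx / half_dim;
  int i   = idx % half_dim;
  // theta_i = pos * 10000^(-2i/dim), the standard RoPE frequency schedule.
  float theta = pos * powf(10000.0f, -2.0f * i / dim);
  float c = cosf(theta), s = sinf(theta);
  float x0 = x[pos * dim + 2 * i];
  float x1 = x[pos * dim + 2 * i + 1];
  y[pos * dim + 2 * i]     = x0 * c - x1 * s;  // 2D rotation of the pair
  y[pos * dim + 2 * i + 1] = x0 * s + x1 * c;
}
```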
v2.4.9 HGEMM WMMA Stage
What's Changed
- [HGEMM] Add HGEMM WMMA Double Buffers by @DefTruth in #69
- [Embedding] Add embedding kernel f32/x4/x4_pack, f16/x8/x8_pack by @bear-zd in #68
- [HGEMM] Add HGEMM mma4x2, warp2x4x2 kernel by @DefTruth in #70
- [HGEMM] HGEMM WMMA with Reg double buffers by @DefTruth in #71
- [HGEMM] Add HGEMM WMMA Stage 3/4 Kernel by @DefTruth in #74
- [Softmax] Add online softmax f32x4 pack kernel by @bear-zd in #73 (see sketch below)
- [HGEMM][Bugfix] fix HGEMM Stage cp.async error by @DefTruth in #75
Full Changelog: v2.4.8...v2.4.9
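
Online softmax (#73) fuses the usual max and sum passes: keep a running max m and a sum d that is rescaled by exp(m_old - m_new) whenever the max grows. A sketch with one thread per row for clarity; the release's kernel additionally packs f32x4 and reduces across threads (assumptions here):

```cuda
#include <math.h>
#include <float.h>

__global__ void online_softmax_sketch(const float* __restrict__ x,
                                      float* __restrict__ y,
                                      int rows, int cols) {
  int r = blockIdx.x * blockDim.x + threadIdx.x;
  if (r >= rows) return;
  float m = -FLT_MAX, d = 0.0f;
  for (int c = 0; c < cols; ++c) {              // single pass: max + sum together
    float v = x[r * cols + c];
    float m_new = fmaxf(m, v);
    d = d * expf(m - m_new) + expf(v - m_new);  // rescale old sum if max grew
    m = m_new;
  }
  for (int c = 0; c < cols; ++c) {
    y[r * cols + c] = expf(x[r * cols + c] - m) / d;
  }
}
```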
v2.4.8 HGEMM WMMA Part-1
What's Changed
- [GELU] Add f32/x4, f16/x2/x8/x8pack kernel. by @bear-zd in #66
- [HGEMM] HGEMM Tensor Cores Support Part-1 by @DefTruth in #67 (see sketch below)
Full Changelog: v2.4.7...v2.4.8
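
Part-1 of Tensor Cores support (#67) targets the WMMA API, where one warp cooperatively multiplies 16x16x16 half tiles. A minimal single-tile sketch (sm_70+); the row-major layouts and half accumulator are illustrative choices:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes C(16x16) = A(16x16) * B(16x16); launch with <<<1, 32>>>.
__global__ void wmma_m16n16k16_sketch(const half* a, const half* b, half* c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> fc;
  wmma::fill_fragment(fc, __float2half(0.0f));
  wmma::load_matrix_sync(fa, a, 16);   // leading dimension = 16
  wmma::load_matrix_sync(fb, b, 16);
  wmma::mma_sync(fc, fa, fb, fc);      // fc += fa * fb on Tensor Cores
  wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```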
v2.4.7 SGEMM Copy Async
What's Changed
- [SGEMM][Async] Add naive copy async SGEMM by @DefTruth in #64 (see sketch below)
- [SGEMM][Async] Add K16 + Copy Async Kernel by @DefTruth in #65
Full Changelog: v2.4.6...v2.4.7
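
The copy-async kernels use Ampere's cp.async, which copies global memory straight into shared memory without staging through registers, letting loads overlap compute. A minimal inline-PTX sketch (sm_80+); the helper names and 16-byte .cg variant are illustrative:

```cuda
#include <cuda_runtime.h>

__device__ __forceinline__ void cp_async_16B(void* smem_dst, const void* gmem_src) {
  unsigned dst = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
               :: "r"(dst), "l"(gmem_src));
}

// Launch with one block of 128 threads; copies 128 float4 (2 KiB).
__global__ void copy_async_sketch(const float4* __restrict__ g,
                                  float4* __restrict__ out) {
  __shared__ float4 s[128];
  cp_async_16B(&s[threadIdx.x], &g[threadIdx.x]);
  asm volatile("cp.async.commit_group;\n");  // close the batch of copies
  asm volatile("cp.async.wait_group 0;\n");  // wait for all batches to land
  __syncthreads();
  out[threadIdx.x] = s[threadIdx.x];
}
```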
v2.4.6 HGEMM Copy Async
v2.4.5 HGEMM Double Buffers
What's Changed
- [FlashAttention] Refactor FlashAttention PyTorch bindings by @DefTruth in #55
- [SGEMM] test bank conflicts free with smem offset by @DefTruth in #56
- [HGEMM] HGEMM kernel with double buffers by @DefTruth in #57 (see sketch below)
- [Docs] Add docs for HGEMM/SGEMM double buffers by @DefTruth in #58
- [HGEMM] Add PyTorch HGEMM profile by @DefTruth in #59
Full Changelog: v2.4.4...v2.4.5
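
Double buffering (#57) alternates two shared-memory tiles so the global load of tile k+1 overlaps the math on tile k, hiding memory latency behind compute. A structural sketch; the tile size and the elided FMA work are assumptions:

```cuda
#include <cuda_fp16.h>

#define TILE 1024  // elements per K-tile, illustrative

// `a` holds k_tiles consecutive tiles; launch with one block, e.g. 256 threads.
__global__ void double_buffer_sketch(const half* __restrict__ a,
                                     int k_tiles, half* __restrict__ out) {
  __shared__ half s[2][TILE];
  int tid = threadIdx.x;
  for (int i = tid; i < TILE; i += blockDim.x) s[0][i] = a[i];  // stage tile 0
  __syncthreads();
  for (int k = 0; k < k_tiles; ++k) {
    int cur = k & 1, nxt = cur ^ 1;
    if (k + 1 < k_tiles) {  // prefetch tile k+1 into the buffer not in use
      const half* next = a + (size_t)(k + 1) * TILE;
      for (int i = tid; i < TILE; i += blockDim.x) s[nxt][i] = next[i];
    }
    // ... FMA work on s[cur] would go here ...
    __syncthreads();  // prefetch and compute both done before buffers swap
  }
  if (tid == 0) out[0] = s[0][0];  // keep the stores observable
}
```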