Releases: DefTruth/CUDA-Learn-Notes

v2.4.15 HGEMM Up to 115 TFLOPS (L20)

21 Oct 12:55 · a2934b9

What's Changed

  • [HGEMM] Add MMA 16816 swizzle, Up to 115 TFLOPS by @DefTruth in #98
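The "MMA 16816" in #98 is the PTX `mma.sync.aligned.m16n8k16` Tensor Core shape; the swizzle is a permuted shared-memory layout that lets `ldmatrix` feed it without bank conflicts. A minimal, hedged sketch of the instruction itself (wrapper name is an assumption; register counts follow the PTX ISA for f16 in/out; this is not the repo's kernel):

```cuda
#include <cstdint>

// One m16n8k16 Tensor Core issue: A = 4x .b32 (8 halves),
// B = 2x .b32 (4 halves), C/D = 2x .b32 (4 halves, f16 accumulate).
__device__ __forceinline__ void hmma_m16n8k16(uint32_t d[2],
                                              const uint32_t a[4],
                                              const uint32_t b[2],
                                              const uint32_t c[2]) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
      : "=r"(d[0]), "=r"(d[1])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]),
        "r"(c[0]), "r"(c[1]));
}
```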

Full Changelog: v2.4.13...v2.4.15

v2.4.13 HGEMM Up to 113 TFLOPS (L20)

21 Oct 01:56 · 0aeb450

What's Changed

  • [Mat][Trans] Add f32/f32x4 row/col first kernel by @bear-zd in #89
  • [Docs][Contribute] Add How to contribute Notes by @DefTruth in #90
  • [HGEMM] optimize SMEM padding, up to 113 TFLOPS by @DefTruth in #92 (see the padding sketch after this list)
  • [Mat][Trans] Add f32x4_shared/bcf row/col first kernel. by @bear-zd in #91
  • [Docs] rename mat_transpose -> mat-transpose by @DefTruth in #93
  • [HGEMM] Add GeForce RTX 3080 Laptop benchmark by @DefTruth in #94
  • [HGEMM] update HGEMM benchmark option by @DefTruth in #95
  • [HGEMM] Refactor HGEMM WMMA 161616 kernels by @DefTruth in #96
  • [HGEMM] Update HGEMM WMMA Benchmark by @DefTruth in #97
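For the SMEM padding entry (#92), the trick is the classic one: widen each shared-memory row by a few elements so column-strided accesses spread across banks instead of serializing on one. A hedged illustration with made-up tile sizes, not the values the repo tuned:

```cuda
#include <cuda_fp16.h>

constexpr int BM  = 128;  // tile rows
constexpr int BK  = 16;   // tile depth (halfs per row)
constexpr int PAD = 8;    // extra halfs per row; breaks the bank stride

__global__ void kernel_with_padded_smem(/* ... */) {
  // Without +PAD, rows are 32 bytes apart, so walking a column of s_a
  // hits the same shared-memory banks on every row (a bank conflict).
  __shared__ half s_a[BM][BK + PAD];
  (void)s_a;  // body elided; only the layout matters here
}
```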

Full Changelog: v2.4.12...v2.4.13

v2.4.12 SGEMM TF32 Swizzle

17 Oct 02:24 · 8c6922b

What's Changed

New Contributors

Full Changelog: v2.4.11...v2.4.12

v2.4.11 HGEMM Block Swizzle

16 Oct 03:04 · bc3d78e

What's Changed

Full Changelog: v2.4.10...v2.4.11
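The release title names block swizzling. As a generic sketch (not necessarily the repo's exact mapping), the idea is to remap the linear block index so blocks that reuse the same A/B tiles are resident together and hit in L2; GROUP is a tunable:

```cuda
// Map a linear block id onto (tile_m, tile_n) in grouped order:
// walk GROUP tile-rows of C before advancing along N.
__device__ void swizzled_tile_coords(int& tile_m, int& tile_n,
                                     int grid_m, int grid_n) {
  const int GROUP = 16;                           // assumed group height
  int bid        = blockIdx.y * grid_n + blockIdx.x;
  int group_size = GROUP * grid_n;
  int first_m    = (bid / group_size) * GROUP;
  int rows       = min(GROUP, grid_m - first_m);  // last group may be short
  int w          = bid % group_size;
  tile_m = first_m + w % rows;                    // walk M within the group
  tile_n = w / rows;                              // then advance N
}
```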

v2.4.10 SGEMM TF32 Stage 2/3

15 Oct 02:04 · 2906e78

What's Changed

  • [HGEMM] HGEMM WMMA Stage mma4x2+warp4x4 by @DefTruth in #76
  • [SGEMM] Add SGEMM WMMA TF32 Stage2/3 by @DefTruth in #77
  • [SGEMM] Add cuBLAS SGEMM F32/TF32 baseline by @DefTruth in #78
  • [SGEMM] Add Kernel cudaFuncSetAttribute hint by @DefTruth in #79 (see the launch sketch after this list)
  • [RoPE] Add minimal RoPE f32/f32x4 pack impl by @bear-zd in #80
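The cudaFuncSetAttribute hint in #79 matters because multi-stage kernels can need more dynamic shared memory than the default 48 KB limit, which must be raised explicitly before launch. A sketch of the usual pattern (kernel name and sizes are placeholders, not the repo's):

```cuda
__global__ void sgemm_tf32_stage3(const float* A, const float* B, float* C,
                                  int M, int N, int K) { /* ... */ }

void launch(dim3 grid, dim3 block, size_t smem_bytes,
            const float* A, const float* B, float* C, int M, int N, int K) {
  // Opt in to up to smem_bytes of dynamic shared memory (may exceed 48 KB).
  cudaFuncSetAttribute(sgemm_tf32_stage3,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       static_cast<int>(smem_bytes));
  sgemm_tf32_stage3<<<grid, block, smem_bytes>>>(A, B, C, M, N, K);
}
```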

Full Changelog: v2.4.9...v2.4.10

v2.4.9 HGEMM WMMA Stage

13 Oct 09:15 · 3acd5e2

What's Changed

  • [HGEMM] Add HGEMM WMMA Double Buffers by @DefTruth in #69
  • [Embedding] Add embedding kernel f32/x4/x4_pack, f16/x8/x8_pack by @bear-zd in #68 (see the lookup sketch after this list)
  • [HGEMM] Add HGEMM mma4x2, warp2x4x2 kernel by @DefTruth in #70
  • [HGEMM] HGEMM WMMA with Reg double buffers by @DefTruth in #71
  • [HGEMM] Add HGEMM WMMA Stage 3/4 Kernel by @DefTruth in #74
  • [Softmax] Add online softmax f32x4 pack kernel by @bear-zd in #73
  • [HGEMM][Bugfix] fix HGEMM Stage cp.async error by @DefTruth in #75
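For the packed embedding entry (#68), a hedged sketch of what an f32x4 lookup typically looks like (kernel name and launch shape are assumptions): each thread moves four floats of one table row with a single float4 load/store.

```cuda
// One block per looked-up token; assumes emb_dim % 4 == 0 and
// 16-byte-aligned rows so the float4 reinterpret is legal.
__global__ void embedding_f32x4(const int* ids, const float* table,
                                float* out, int emb_dim) {
  int row  = blockIdx.x;    // which token
  int col4 = threadIdx.x;   // which 4-float chunk of the row
  const float4* src =
      reinterpret_cast<const float4*>(table + (size_t)ids[row] * emb_dim);
  float4* dst = reinterpret_cast<float4*>(out + (size_t)row * emb_dim);
  if (col4 < emb_dim / 4) dst[col4] = src[col4];
}
```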

Full Changelog: v2.4.8...v2.4.9

v2.4.8 HGEMM WMMA Part-1

11 Oct 11:05 · 5aef1b1

What's Changed

  • [GELU] Add f32/x4, f16/x2/x8/x8pack kernel. by @bear-zd in #66
  • [HGEMM] HGEMM Tensor Cores Support Part-1 by @DefTruth in #67
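Part-1 of Tensor Core support builds on the nvcuda::wmma API. A minimal, hedged single-warp version of the pattern (the repo's kernels tile and buffer this far more aggressively): launch with blockDim = 32 and a grid of N/16 x M/16 tiles; assumes row-major A/B/C with dimensions divisible by 16.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp accumulates one 16x16 C tile over the K dimension.
__global__ void hgemm_wmma_naive(const half* A, const half* B, half* C,
                                 int M, int N, int K) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> fc;
  wmma::fill_fragment(fc, __float2half(0.0f));
  int tm = blockIdx.y * 16, tn = blockIdx.x * 16;
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(fa, A + tm * K + k, K);  // A tile, leading dim K
    wmma::load_matrix_sync(fb, B + k * N + tn, N);  // B tile, leading dim N
    wmma::mma_sync(fc, fa, fb, fc);
  }
  wmma::store_matrix_sync(C + tm * N + tn, fc, N, wmma::mem_row_major);
}
```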

Full Changelog: v2.4.7...v2.4.8

v2.4.7 SGEMM Copy Async

10 Oct 06:16 · 3b56750

What's Changed

  • [SGEMM][Async] Add naive copy async SGEMM by @DefTruth in #64
  • [SGEMM][Async] Add K16 + Copy Async Kernel by @DefTruth in #65
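Both entries rest on cp.async (Ampere+), which issues a global-to-shared copy asynchronously and fences it with a commit/wait pair before the data is read. Hedged wrappers (assumptions, not the repo's exact helpers):

```cuda
#include <cstdint>

// Issue a 16-byte async copy from global to shared memory.
__device__ __forceinline__ void cp_async_16B(void* smem_dst,
                                             const void* gmem_src) {
  uint32_t dst = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
               :: "r"(dst), "l"(gmem_src));
}

// Commit all outstanding copies and wait for them to land.
__device__ __forceinline__ void cp_async_wait_all() {
  asm volatile("cp.async.commit_group;\n" ::);
  asm volatile("cp.async.wait_group 0;\n" ::);
}
```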

Full Changelog: v2.4.6...v2.4.7

v2.4.6 HGEMM Copy Async

08 Oct 03:48 · bbec7b5

What's Changed

  • [Softmax] Add online softmax according to the NVIDIA paper by @bear-zd in #60 (see the recurrence sketch after this list)
  • [HGEMM][Async] support K16/32 pack+cp.async+dbuf by @DefTruth in #62
  • [Softmax][Bugfix] fixed softmax compile error by @DefTruth in #63
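The paper behind #60 is Milakov & Gimelshein, "Online normalizer calculation for softmax": one pass keeps a running max m and a running sum d of exp(x - m), rescaling d whenever m grows. A hedged scalar restatement (the repo's kernels parallelize this with warp reductions and packed loads):

```cuda
// Single-pass normalizer: after the loop, d == sum_i exp(x[i] - m)
// with m == max_i x[i], so no separate max pass is needed.
__device__ void online_softmax_row(const float* x, float* y, int n) {
  float m = -INFINITY, d = 0.0f;
  for (int i = 0; i < n; ++i) {
    float m_new = fmaxf(m, x[i]);
    d = d * __expf(m - m_new) + __expf(x[i] - m_new);  // rescale old sum
    m = m_new;
  }
  for (int i = 0; i < n; ++i) y[i] = __expf(x[i] - m) / d;
}
```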

New Contributors

  • @bear-zd made their first contribution in #60

Full Changelog: v2.4.5...v2.4.6

v2.4.5 HGEMM Double Buffers

30 Sep 07:47 · 3f5ace3

What's Changed

  • [FlashAttention] Refactor FlashAttention PyTorch bindings by @DefTruth in #55
  • [SGEMM] test bank conflicts free with smem offset by @DefTruth in #56
  • [HGEMM] HGEMM kernel with double buffers by @DefTruth in #57 (see the skeleton after this list)
  • [Docs] Add docs for HGEMM/SGEMM double buffers by @DefTruth in #58
  • [HGEMM] Add PyTorch HGEMM profile by @DefTruth in #59
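The double-buffers entry (#57) alternates two shared-memory tiles so the next K-tile's loads overlap the current tile's math. A hedged skeleton with simplified shapes and a scalar inner product (launch with blockDim = (32, 32); assumes M, N, K are multiples of the tile sizes; none of this is the repo's exact code):

```cuda
#include <cuda_fp16.h>

constexpr int DB_BM = 32, DB_BN = 32, DB_BK = 16;

__global__ void hgemm_double_buffered(const half* A, const half* B, half* C,
                                      int M, int N, int K) {
  __shared__ half s_a[2][DB_BM][DB_BK], s_b[2][DB_BK][DB_BN];
  int tx = threadIdx.x, ty = threadIdx.y;
  int row = blockIdx.y * DB_BM + ty, col = blockIdx.x * DB_BN + tx;
  float acc = 0.0f;
  int cur = 0;
  // Preload the first K-tile into buffer 0.
  if (tx < DB_BK) s_a[0][ty][tx] = A[row * K + tx];
  if (ty < DB_BK) s_b[0][ty][tx] = B[ty * N + col];
  __syncthreads();
  for (int k = DB_BK; k <= K; k += DB_BK) {
    int nxt = cur ^ 1;
    if (k < K) {  // issue next tile's loads before computing on `cur`
      if (tx < DB_BK) s_a[nxt][ty][tx] = A[row * K + k + tx];
      if (ty < DB_BK) s_b[nxt][ty][tx] = B[(k + ty) * N + col];
    }
    #pragma unroll
    for (int i = 0; i < DB_BK; ++i)  // consume the current buffer
      acc += __half2float(s_a[cur][ty][i]) * __half2float(s_b[cur][i][tx]);
    __syncthreads();  // `nxt` fully written before it becomes `cur`
    cur = nxt;
  }
  C[row * N + col] = __float2half(acc);
}
```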

Full Changelog: v2.4.4...v2.4.5