v2.3 Refactor 6/N

Released by @DefTruth on 17 Sep 07:57 · f9001b9

What's Changed

  • [Refactor][6/N] CUDA Learn Notes refactor Part-6 by @DefTruth in #17
  • [Refactor][5/N] CUDA Learn Notes refactor Part-5 by @DefTruth in #18
  • [LayerNorm][Half] support fp16x8 packed LayerNorm by @DefTruth in #19
  • [Reduce][Half] add HALF2 & BFLOAT2 macro by @DefTruth in #21
  • [RMSNorm][Half] support fp16x8 packed RMSNorm by @DefTruth in #22
  • [Bugfix][Kernel] fixed block-count calculation errors in some kernels by @DefTruth in #23
  • [Elementwise][Half] support fp16x8 packed Elementwise by @DefTruth in #24 (see the fp16x8 packing sketch after this list)
  • [Elementwise][Half] support fp16x8 packed Elementwise by @DefTruth in #25
  • [RELU][Half] support fp16x8 RELU kernel by @DefTruth in #26
  • [RMSNorm] support f16x8_f32 RMSNorm by @DefTruth in #28
  • [RMSNorm][Kernel] Add FLOAT2/HALF2_VARIANCE macro by @DefTruth in #29
  • [LayerNorm][Kernel] Add HALF2 SUM/SUB/VAR macro by @DefTruth in #30 (see the HALF2 helper sketch after this list)
  • [HGEMM] Add sliced_k & t_8x8_sliced_k_f16x4 by @DefTruth in #31
  • [HGEMV][Half] support hgemv k32/k128/f16 by @DefTruth in #32 (see the warp-per-row HGEMV sketch after this list)
  • [FlashAttention] Refactor flash_attn_1_fwd_f32 kernel by @DefTruth in #33
  • Bump up to v2.3 by @DefTruth in #34
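
The fp16x8 "packed" kernels above (#19, #22, #24, #25, #26) share one pattern: each thread moves 8 halves per 128-bit memory transaction and computes with half2 intrinsics. Below is a minimal sketch of that pattern, not the repo's actual kernel; the name, the N % 8 == 0 assumption, and 16-byte pointer alignment are all illustrative assumptions.

```cuda
#include <cuda_fp16.h>

// Each thread owns 8 contiguous halves: one float4 (128-bit) load per input,
// four half2 adds, one float4 store. Assumes N % 8 == 0 and 16-byte-aligned
// pointers; a real kernel would also need a tail path for ragged N.
__global__ void elementwise_add_f16x8(const half *a, const half *b, half *c, int N) {
  int idx = 8 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (idx < N) {
    float4 ra = *reinterpret_cast<const float4 *>(a + idx);
    float4 rb = *reinterpret_cast<const float4 *>(b + idx);
    float4 rc;
    half2 *ha = reinterpret_cast<half2 *>(&ra);
    half2 *hb = reinterpret_cast<half2 *>(&rb);
    half2 *hc = reinterpret_cast<half2 *>(&rc);
    #pragma unroll
    for (int i = 0; i < 4; ++i) hc[i] = __hadd2(ha[i], hb[i]);  // two fp16 adds per op
    *reinterpret_cast<float4 *>(c + idx) = rc;
  }
}
```

Note the launch math: the grid must cover N/8 threads, not N. Forgetting the pack width when computing block counts is presumably the class of bug #23 fixes.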
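
The HALF2/FLOAT2 helper macros (#21, #29, #30) fold both fp16 lanes of a half2 into an fp32 accumulator before the reduction, so LayerNorm/RMSNorm statistics stay in float even though the data is half; this is presumably also the reading of the f16x8_f32 naming in #28 (fp16 loads and stores, fp32 accumulation). An illustrative sketch follows; these exact names and definitions are assumptions, not the repo's code.

```cuda
#include <cuda_fp16.h>

// Sum of both fp16 lanes of a half2, widened to fp32 before adding.
#define HALF2_SUM(v) (__half2float(__low2half(v)) + __half2float(__high2half(v)))

// Sum of squared deviations of both lanes from an fp32 mean mu.
#define HALF2_VAR(v, mu)                                                         \
  ((__half2float(__low2half(v)) - (mu)) * (__half2float(__low2half(v)) - (mu)) + \
   (__half2float(__high2half(v)) - (mu)) * (__half2float(__high2half(v)) - (mu)))

// Butterfly reduction of the fp32 partials across a warp.
__device__ __forceinline__ float warp_reduce_sum_f32(float val) {
  #pragma unroll
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_xor_sync(0xffffffff, val, offset);
  return val;
}
```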
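
For the k32 HGEMV in #32, a natural mapping is one warp per output row: with K = 32, each lane multiplies one element of the row by the matching x entry, then a shuffle reduction produces y[row]. A sketch under an assumed row-major layout and illustrative names, not the repo's actual kernel:

```cuda
#include <cuda_fp16.h>

// y = A * x with A of shape (M, K), K == 32. One warp per row; accumulate in
// fp32 for accuracy, convert back to half on the single write per row.
__global__ void hgemv_k32(const half *A, const half *x, half *y, int M, int K) {
  int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;  // warp id -> row
  int lane = threadIdx.x % 32;
  if (row < M) {
    float acc = (lane < K) ? __half2float(A[row * K + lane]) * __half2float(x[lane]) : 0.0f;
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1)
      acc += __shfl_xor_sync(0xffffffff, acc, offset);
    if (lane == 0) y[row] = __float2half(acc);
  }
}
```

A k128 variant would follow the same shape, with each lane accumulating four elements (or one fp16x8 pack, as above) before the identical warp reduction.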

Full Changelog: v2.2...v2.3