Releases: ashvardanian/less_slow.cpp
Less Slow v0.6: Thrust → CUDA → PTX → SASS 🏋️‍♂️🏋️‍♀️
It's almost impossible to imagine modern High-Performance Computing without GPUs. Yet, there are surprisingly few "full stack" demos out there for folks wanting to build intuition around CUDA C++, PTX Intermediate Representations, SASS Assembly, and higher-level libraries like Thrust, CUB, or the various cuBLAS flavors. This new release of Less Slow covers all of those! 🥳
Tensor Cores
The main highlight is an in-depth look at Tensor Core designs, from their extensive type system to the complexity of tile shapes, both notoriously under-documented and confusing areas. These capabilities differ across Volta, Turing, Ampere, Ada, and Hopper GPUs, mapping to different PTX intrinsics (like `wmma`, binary `bmma`, or warp-group `wgmma`) and culminating in yet another shape at the SASS level, with multiple `HMMA.884.F32.F32.STEPx` instructions for each `wmma.mma.sync.aligned.row.col.m16n16k16.f32.f32` intrinsic on Volta. And if you believe that instruction is long... be warned 😅
```cpp
__global__ void tops_f16f16_sm70tc_16x16x16_1024unroll_cuda_kernel() {
    using namespace nvcuda;
    // The fragments are deliberately left uninitialized: the kernel measures
    // raw MMA throughput, not numerical results.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
    for (int i = 0; i < 1024; ++i)
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
}
```
```sh
$ cuobjdump -sass less_slow_from_cu.cubin | grep -i mma
# e.g. HMMA.884.F32.F32.STEP2 ...
```
The `884` in the name indicates the 8×8×4 tile shape actually executed by the hardware on Volta.
PTX vs SASS
I've also hand-written PTX kernels that may look like this:
```ptx
.visible .entry tops_f16f16_sm70tc_16x16x16_1024loop_ptx_kernel()
{
    // ...
loop_start:
    // A single `wmma` instruction
    wmma.mma.sync.aligned.row.col.m16n16k16.f16.f16
        { %f0, %f1, %f2, %f3 },   // output accumulators
        { %f4, ... },             // A
        { %f12, ... },            // B
        { %f0, %f1, %f2, %f3 };   // input accumulators
    // ...
    bra loop_start;
}
```
Using the provided scripts, you can see for yourself just how different manually written vs. machine-generated PTX can be and how to invoke kernels directly from C++ in various ways — whether through the CUDA Runtime API or the CUDA Driver API — loading and JIT-compiling bits of PTX on the fly!
```cpp
#include <cuda.h> // CUDA Driver API

cuInit(0);
CUdevice dev;
cuDeviceGet(&dev, 0);
CUcontext ctx;
cuCtxCreate(&ctx, 0, dev);

// Load the PTX module, JIT-compiling it for the current device,
// and locate the kernel symbol by name:
CUmodule mod;
cuModuleLoad(&mod, "less_slow.ptx");
CUfunction fun;
cuModuleGetFunction(&fun, mod, "tops_f16f16_sm70tc_16x16x16_1024loop_ptx_kernel");

void *args[] = { /* kernel parameters here */ };
cuLaunchKernel(fun,
               1, 1, 1,   // gridDim
               256, 1, 1, // blockDim
               0, nullptr, args, nullptr);
cuCtxSynchronize();

cuModuleUnload(mod);
cuCtxDestroy(ctx);
```
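For contrast, here is a minimal sketch of launching the precompiled flavor of the same benchmark through the CUDA Runtime API, assuming the kernel was compiled into the binary by NVCC, so no module loading or context juggling is needed:

```cpp
#include <cuda_runtime.h>

// Declaration of the kernel defined earlier in this release's CUDA sources:
__global__ void tops_f16f16_sm70tc_16x16x16_1024unroll_cuda_kernel();

void launch_with_runtime_api() {
    // One block of 256 threads, mirroring the Driver API example above:
    tops_f16f16_sm70tc_16x16x16_1024unroll_cuda_kernel<<<1, 256>>>();
    cudaDeviceSynchronize(); // block the host until the kernel finishes
}
```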
cuBLAS in Practice
I've also included theoretical throughput benchmarks alongside real matrix multiplications via cuBLAS, in case you want to compare actual performance to the raw theoretical numbers. One important observation is the lack of dedicated GEMM entry points for low-precision numeric types:
```cpp
if constexpr (std::is_same_v<scalar_type_, float>) {
    scalar_type_ alpha = 1, beta = 0;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                a.begin(), lda, b.begin(), ldb, &beta, c.begin(), ldc);
} else if constexpr (std::is_same_v<scalar_type_, double>) {
    scalar_type_ alpha = 1, beta = 0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                a.begin(), lda, b.begin(), ldb, &beta, c.begin(), ldc);
} else if constexpr (std::is_same_v<scalar_type_, __half>) {
    scalar_type_ alpha = 1, beta = 0;
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                a.begin(), lda, b.begin(), ldb, &beta, c.begin(), ldc);
} else if constexpr (std::is_same_v<scalar_type_, int8_t>) {
    int32_t alpha_int = 1, beta_int = 0;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha_int,
                 a.begin(), CUDA_R_8I, lda, b.begin(), CUDA_R_8I, ldb, &beta_int,
                 c.begin(), CUDA_R_32I, ldc, CUDA_R_32I, CUBLAS_GEMM_DEFAULT);
}
```
Even integer kernels have a different signature, requiring 32-bit `alpha` and `beta` scalars and explicit `cudaDataType_t` descriptors for every operand, routed through the generic `cublasGemmEx` entry point.
Beyond Linear Algebra
Since GPUs obviously go beyond linear algebra, Thrust and CUB are perfect for exploring other domains in heterogeneous computing. I’ve added snippets that mostly revolve around sorting algorithms, showcasing the differences in memory management between Thrust and CUB and explaining why CUB calls often come in pairs, like:
```cpp
size_t temp_size = 0;
void *d_temp = nullptr;
// First call: passing `nullptr` for temporary storage only queries its required size...
cub::DeviceRadixSort::SortKeys(nullptr, temp_size, d_in_keys, d_out_keys, count);
cudaMalloc(&d_temp, temp_size);
// ...and the second call performs the actual sort:
cub::DeviceRadixSort::SortKeys(d_temp, temp_size, d_in_keys, d_out_keys, count);
```
This was also a good place to show how Thrust and CUB operations can be scheduled together on the same asynchronous streams and profiled with GPU time instead of CPU time to avoid unnecessary blocking ⏲️
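Here is a minimal sketch of that pattern, assuming the keys already live in device memory: Thrust work is scheduled on an explicit stream, and CUDA events recorded on the same stream measure GPU time rather than host wall-clock time.

```cpp
#include <cuda_runtime.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

void sort_and_time(unsigned *d_keys, size_t count) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    // Thrust work scheduled on our stream, no implicit sync with the host:
    thrust::sort(thrust::cuda::par.on(stream), d_keys, d_keys + count);
    cudaEventRecord(stop, stream);

    cudaEventSynchronize(stop); // wait only for this stream's work
    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop); // GPU time, not CPU time

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
}
```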
Enjoy exploring, and happy GPU hacking! I’ll keep adding to this project (and other related ones) as we go along!
Changelog
- Add: Binary BMMA kernels for GPU (6a609a0)
- Add: Tensor Core intrinsic benchmarks (1bdb5df)
- Add: cuBLAS benchmarks (2f791fe)
- Add: Precompiled CUDA C++ kernels (c1a6f3e)
- Add: Using CUDA Driver API to JIT `.ptx` (82cb684)
- Add: PTX and `.cuh` kernels (824e473)
- Add: Sorting with `thrust` and `cub` (df3b2c1)
- Add: Thrust, CUB, CUDA sorting (551402d)
- Add: Thrust, CUB, CUDA sorting (8481114)
- Make: Drop OpenBLAS (3c92c36)
- Fix: Use `f16` MMA (141d285)
- Fix: Lower PTX version for JIT (eff3854)
- Fix: Working PTX kernel (514db0f)
- Docs: Introduce Warp-Group-MMA on Hopper (400f294)
- Make: Build CUDA for multiple platforms (3283ab0)
- Fix: Avoid optimizing-out SASS code (986b8bc)
- Fix: Compiling `cuBLAS` calls (312409a)
- Make: Don't compile PTX (53202e6)
- Make: Silence NVCC warnings (a6cdc74)
- Fix: NVCC compilation issues (494e705)
- Make: Upgrade `fmt` for NVCC builds (88277bf)
- Fix: Ranges require `constexpr` on NVCC (c1d7b2f)
- Make: Switch to CUDA Toolkit for GPU libs (2589a40)
- Make: Options for CUDA & TBB in CMake (4d03c08)
v0.5.4: Supporting MSVC on Windows 🪟
The `less_slow.cpp` project now supports Microsoft Visual C++ (MSVC), thanks to the extensive list of patches suggested by @RazielXYZ 👏
Key updates include switching to OpenBLAS via `FetchContent` for comparable linear algebra performance across platforms, enabling OpenMP on MSVC for parallelism in Eigen-based computations, and revising OpenMP loop indices to use `int64_t`, as MSVC requires signed types for parallel loops. The detection of physical cores on high-core-count Windows systems has also been improved by calling `GetActiveProcessorCount(ALL_PROCESSOR_GROUPS)` and refining the physical-core detection logic. Furthermore, the integration addresses missing functionality, such as the lack of `__builtin_popcountll` on MSVC, with a manual fallback for `is_power_of_two`.
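Such a fallback can be as simple as the following sketch (not necessarily the exact code in the repository), using the classic bit-clearing loop:

```cpp
#include <cstdint>

#if defined(_MSC_VER)
// MSVC has no __builtin_popcountll; count set bits manually.
inline int popcount64(std::uint64_t x) {
    int count = 0;
    for (; x; x &= x - 1) ++count; // clear the lowest set bit each iteration
    return count;
}
#else
inline int popcount64(std::uint64_t x) { return __builtin_popcountll(x); }
#endif

inline bool is_power_of_two(std::uint64_t x) { return popcount64(x) == 1; }
```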
Additional findings include MSVC-specific behaviors, such as how linking AVX-512 code significantly slows down builds and how assembly-based benchmarks require further investigation for proper MSVC integration. Interestingly, heavily templated libraries, like Ranges-v3 or CTRE (Compile-Time Regular Expressions), show much worse performance on MSVC than with GCC and Clang.
Release v0.5.3
Release v0.5.2
Release v0.5.1
Release v0.5.0
Release v0.4.2
v0.4: Allocators and Sparse Graphs 🕸️
This release explores high-performance graph implementations optimized for different access patterns, focusing on real-world use cases like recommendation systems and social networks. Various STL and Abseil-based containers are used to build the underlying sparse graph structures.
It shows:
- How to use STL's polymorphic allocators? (see the sketch after this list)
- How to construct a hybrid nested container that propagates stateful allocators to the inner structures?
- Up to a 300x performance difference between good implementations in different workload patterns!
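As promised, a minimal sketch of the nested-container idea, assuming `std::pmr` containers: `std::pmr::polymorphic_allocator` performs uses-allocator construction, so inner maps created inside the outer map inherit the same arena automatically.

```cpp
#include <memory_resource>
#include <unordered_map>

// Outer map from vertex to its adjacency map; both levels draw from one arena.
using adjacency_t = std::pmr::unordered_map<int, float>;
using graph_t = std::pmr::unordered_map<int, adjacency_t>;

int main() {
    std::pmr::monotonic_buffer_resource arena(1 << 20); // 1 MiB upfront
    graph_t graph(&arena);
    // `polymorphic_allocator` propagates `&arena` to the inner `adjacency_t`
    // when `operator[]` default-constructs it:
    graph[1][2] = 0.5f;
    return 0;
}
```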
Among other neat tricks, it shows (sketched below):
- How can the three-way comparison operator be used with `std::tie`?
- What's the difference between `std::weak_ordering` and the strong one?
- Where can the `[[no_unique_address]]` attribute be used?
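Minimal sketches of those tricks, with hypothetical type and field names. As a one-line answer to the ordering question: `std::strong_ordering` promises equal values are indistinguishable, while `std::weak_ordering` only groups them into equivalence classes, like case-insensitive strings.

```cpp
#include <compare>
#include <tuple>

struct edge_t {
    int from, to;
    float weight;
    // Lexicographic ordering over (from, to) via `std::tie`; two ints compare
    // with `std::strong_ordering`, while including the float `weight` would
    // degrade the result to `std::partial_ordering` because of NaNs.
    auto operator<=>(edge_t const &other) const noexcept {
        return std::tie(from, to) <=> std::tie(other.from, other.to);
    }
};

struct empty_logger_t {}; // a stateless policy type

struct traced_edge_t {
    // [[no_unique_address]] lets the empty member occupy no extra bytes:
    [[no_unique_address]] empty_logger_t logger;
    edge_t edge;
};
```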
Implementation
It extends the Graph API to the following calls, with a usage sketch after the list:
- `upsert_edge(from, to, weight)`: Inserts or updates an existing edge between two vertices.
- `get_edge(from, to)`: Retrieves the `std::optional` weight of the edge between two vertices.
- `remove_edge(from, to)`: If present, removes the edge between two vertices.
- `for_edges(from, visitor)`: Applies a callback to all edges starting from a vertex.
- `size()`: Returns the graph's number of vertices and edges.
- `reserve(capacity)`: Reserves memory for the given number of vertices.
- `compact()`: Compacts the memory layout of the graph, preparing for read-intensive workloads.
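A toy illustration of the core calls, backed by a flat `std::map`; this is a hypothetical stand-in, not one of the repository's actual container choices, and it omits `size`, `reserve`, and `compact`:

```cpp
#include <map>
#include <optional>
#include <utility>

struct toy_graph {
    std::map<std::pair<int, int>, float> edges_;

    void upsert_edge(int from, int to, float weight) { edges_[{from, to}] = weight; }
    std::optional<float> get_edge(int from, int to) const {
        auto it = edges_.find({from, to});
        if (it == edges_.end()) return std::nullopt;
        return it->second;
    }
    void remove_edge(int from, int to) { edges_.erase({from, to}); }
    template <typename visitor_t> void for_edges(int from, visitor_t &&visit) const {
        for (auto const &[key, weight] : edges_)
            if (key.first == from) visit(key.first, key.second, weight);
    }
};
```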
Results
On Intel Sapphire Rapids CPUs in AWS `c7i` instances:
```
------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU    Iterations
------------------------------------------------------------------------------------------
graph_make<std::unordered_maps>/min_time:10.000     57329425 ns     57327619 ns        245
graph_make<std::map>/min_time:10.000               109704078 ns    109697310 ns        100
graph_make<absl::flat_set>/min_time:10.000          80598043 ns     80595813 ns        174
graph_rank<std::unordered_maps>/min_time:10.000     35763632 ns     35762406 ns        392
graph_rank<std::map>/min_time:10.000                51658552 ns     51657290 ns        271
graph_rank<absl::flat_set>/min_time:10.000            236938 ns       236933 ns      59137
```
On AWS Graviton 4 CPUs in AWS `r8g` instances:
```
------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU    Iterations
------------------------------------------------------------------------------------------
graph_make<std::unordered_maps>/min_time:10.000    163664945 ns    163660572 ns         86
graph_make<std::map>/min_time:10.000               382543113 ns    382534380 ns         45
graph_make<absl::flat_set>/min_time:10.000         213277341 ns    213272284 ns         64
graph_rank<std::unordered_maps>/min_time:10.000     59579530 ns     59578435 ns        240
graph_rank<std::map>/min_time:10.000                69177429 ns     69175965 ns        191
graph_rank<absl::flat_set>/min_time:10.000            186428 ns       186430 ns      74929
```
v0.4: Pointer tagging 🏷️
This release provides a prototype of a memory-tagging allocator arena, documenting the complexity of using memory tagging techniques even on Linux:
- Intel's Linear Address Masking
- AMD's Upper Address Ignore
- ARM's Top Byte Ignore
- ARM's Memory Tagging Extension
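All four techniques share one idea: the upper bits of a 64-bit pointer can carry metadata. A minimal sketch, assuming the hardware (or a manual mask) ignores the top byte, as with ARM's Top Byte Ignore:

```cpp
#include <cstdint>

// Pack a small tag into the top byte of a 64-bit pointer.
inline void *tag_pointer(void *ptr, std::uint8_t tag) {
    auto raw = reinterpret_cast<std::uintptr_t>(ptr);
    return reinterpret_cast<void *>((raw & 0x00FF'FFFF'FFFF'FFFFull) |
                                    (static_cast<std::uintptr_t>(tag) << 56));
}

inline std::uint8_t pointer_tag(void *ptr) {
    return static_cast<std::uint8_t>(reinterpret_cast<std::uintptr_t>(ptr) >> 56);
}

// Without hardware support the tag must be stripped before dereferencing:
inline void *untag_pointer(void *ptr) {
    auto raw = reinterpret_cast<std::uintptr_t>(ptr);
    return reinterpret_cast<void *>(raw & 0x00FF'FFFF'FFFF'FFFFull);
}
```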
v0.3: Gather 🔄 Scatter
This release introduces benchmarks for the rarely used gather & scatter SIMD instructions, which can accelerate lookups by ~30% on current x86 and Arm machines (see the sketch after the list):
- Serial
- AVX-512 for x86
- SVE for Arm
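A minimal sketch of the gather side, assuming AVX-512F and 32-bit indices; the serial loop is the baseline it competes against:

```cpp
#include <immintrin.h>

// Gather data[indices[i]] for 16 lanes at once with a single AVX-512 instruction.
inline __m512 gather_16floats(float const *data, int const *indices) {
    __m512i idx = _mm512_loadu_si512(indices);            // 16 x 32-bit indices
    return _mm512_i32gather_ps(idx, data, sizeof(float)); // scale = 4 bytes
}

// Serial equivalent for comparison:
inline void gather_serial(float const *data, int const *indices, float *out, int n) {
    for (int i = 0; i < n; ++i) out[i] = data[indices[i]];
}
```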