Personal project to understand HIP's WMMA intrinsics

AJcodes/hip_wmma_samples

Understanding HIP and WMMA Intrinsics

This project is a personal exploration of HIP programming and the RDNA3 Wave Matrix Multiply-Accumulate (WMMA) intrinsic. The primary goal was to deepen my understanding of the WMMA intrinsic and extend the fixed-size example provided in the GPUOpen tutorial to support arbitrary matrix dimensions. While this project is primarily for personal learning, it may also serve as a helpful reference for others interested in exploring the WMMA intrinsic.

Note: The WMMA intrinsic is specific to RDNA3 GPUs for now, so running this project requires an RDNA3-compatible GPU. Testing on RDNA4 hardware may be added once it becomes available. For production-grade GPU matrix multiplication, use rocWMMA instead, which provides a robust and optimized abstraction over the WMMA functionality.

Objective

This project aims to:

  1. Provide a simple example of HIP programming and WMMA usage for GPU-accelerated computation.
  2. Extend beyond the fixed-size example in the GPUOpen tutorial by supporting arbitrary matrix dimensions (M, N, K).
  3. Enhance understanding of the WMMA intrinsic's mechanics, especially around data loading and storing.
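Supporting arbitrary dimensions mostly comes down to covering the matrix with 16x16 tiles and guarding the edges. The following is a minimal host-side sketch of that idea in plain C++, not the code from this repository: the names kTile, tiles_for, load_or_zero and store_if_in_bounds are hypothetical, and float stands in for the half-precision type a real HGEMM kernel would use. Out-of-range elements are read as zero so they add nothing to the accumulated product, and out-of-range results are simply not written.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed tile edge: the fixed-size example works on 16x16 blocks.
constexpr int kTile = 16;

// Number of tiles needed to cover `dim` elements, rounding up so that a
// partial tile at the edge is still scheduled.
constexpr int tiles_for(int dim) { return (dim + kTile - 1) / kTile; }

// Guarded load from a row-major rows x cols matrix. Elements outside the
// matrix are read as zero, which leaves the dot products unchanged.
float load_or_zero(const std::vector<float>& m, int rows, int cols,
                   int row, int col) {
    assert(m.size() == static_cast<std::size_t>(rows) * cols);
    const bool inside = row >= 0 && col >= 0 && row < rows && col < cols;
    return inside ? m[static_cast<std::size_t>(row) * cols + col] : 0.0f;
}

// Guarded store: results that fall in the padded region are dropped.
void store_if_in_bounds(std::vector<float>& m, int rows, int cols,
                        int row, int col, float value) {
    assert(m.size() == static_cast<std::size_t>(rows) * cols);
    if (row >= 0 && col >= 0 && row < rows && col < cols) {
        m[static_cast<std::size_t>(row) * cols + col] = value;
    }
}
```

With this scheme a 100x100 output needs tiles_for(100) tiles per dimension, and the last tile in each row and column is only partially filled with real data. On the GPU the same bounds checks would sit around the per-lane loads and stores that feed the WMMA fragments.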

Features

  • Matrix Multiplication with WMMA Intrinsic: Demonstrates how to use the HIP WMMA intrinsic across several kernel implementations.
  • Support for Arbitrary Sizes: Goes beyond the fixed-size (16x16) example, allowing users to experiment with any matrix dimensions.
  • Shared Memory and WMMA Comparison: Runs both a shared memory kernel and a WMMA kernel for performance comparison.
  • Verification Mode: Compares GPU results with CPU reference computations to ensure correctness.
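A verification mode of this kind can be sketched as a straightforward CPU GEMM plus a tolerance check. This is an illustration under stated assumptions rather than the repository's actual verification code: gemm_reference and all_close are hypothetical names, matrices are assumed row-major, and float is used in place of half precision. The mixed relative and absolute tolerance matters because half-precision GPU results will not match a CPU reference bit for bit.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference C = A * B on the CPU. A is M x K, B is K x N, C is M x N,
// all stored row-major.
std::vector<float> gemm_reference(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  int M, int N, int K) {
    assert(a.size() == static_cast<std::size_t>(M) * K);
    assert(b.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> c(static_cast<std::size_t>(M) * N, 0.0f);
    for (int i = 0; i < M; ++i) {
        for (int k = 0; k < K; ++k) {
            const float a_ik = a[static_cast<std::size_t>(i) * K + k];
            for (int j = 0; j < N; ++j) {
                c[static_cast<std::size_t>(i) * N + j] +=
                    a_ik * b[static_cast<std::size_t>(k) * N + j];
            }
        }
    }
    return c;
}

// True when every element of `got` is within abs_tol + rel_tol * |want|
// of the corresponding reference value.
bool all_close(const std::vector<float>& got, const std::vector<float>& want,
               float rel_tol, float abs_tol) {
    if (got.size() != want.size()) return false;
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float limit = abs_tol + rel_tol * std::fabs(want[i]);
        if (std::fabs(got[i] - want[i]) > limit) return false;
    }
    return true;
}
```

In use, the GPU output would be copied back to the host, converted to float, and passed as got, with the result of gemm_reference as want. Suitable tolerances depend on K and on the value range of the inputs, so they are best chosen empirically.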

Future Plans

If time permits, the following enhancements are planned:

  1. Further optimisations of the WMMA HGEMM kernel: Explore additional techniques to improve its performance.
  2. Explore WMMA in Other Kernel Types: Investigate the use of WMMA intrinsics in other GPU workloads beyond matrix multiplication.
  3. Test on RDNA4: Extend the implementation to test and validate the WMMA intrinsic on future RDNA4 hardware.

How to Build and Run

Prerequisites

  • AMD ROCm installed with HIP support.
  • CMake version 3.10 or higher.
  • An AMD RDNA3 GPU.

Steps

  1. Clone the repository and navigate to its root directory:
    git clone https://github.com/AJcodes/hip_wmma_samples.git
    cd hip_wmma_samples
  2. Build the project:
    mkdir build
    cd build
    cmake ..
    make
  3. Run the executable:
    ./hgemm

Usage

The program outputs performance metrics for all kernels, such as:

GEMM Kernel Type: XXX
----------------------------------------------------------------------------------
Kernel execution time for sizes (128, 128, 128): X.XX ms
...
Kernel execution time for sizes (1024, 1024, 1024): X.XX ms
...
Kernel execution time for sizes (4096, 4096, 4096): X.XX ms
----------------------------------------------------------------------------------

You can modify the matrix dimensions by changing M, N, and K variables in main.cpp.
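The reported execution times are easier to compare across sizes once they are converted to throughput. A GEMM of size (M, N, K) performs one multiply and one add per (m, n, k) triple, so roughly 2*M*N*K floating-point operations. The helper below is a small sketch of that conversion. gemm_gflops is a hypothetical name and is not part of this project's output.

```cpp
#include <cassert>

// Convert a measured kernel time in milliseconds into GFLOP/s for an
// M x N x K GEMM, counting 2 * M * N * K floating-point operations.
double gemm_gflops(long long M, long long N, long long K, double time_ms) {
    assert(time_ms > 0.0);
    const double flops = 2.0 * static_cast<double>(M) *
                         static_cast<double>(N) * static_cast<double>(K);
    const double seconds = time_ms * 1e-3;
    return flops / seconds / 1e9;
}
```

For example, a (1024, 1024, 1024) run that takes 1 ms corresponds to about 2147 GFLOP/s.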

Key Insights

  • This project emphasizes understanding the mechanics of HIP programming and RDNA3 WMMA intrinsics, particularly how to handle data loading and storage effectively.
  • It is intended as a learning tool and not as an optimized implementation for production use.

Acknowledgments

This project was inspired by:
