Understanding HIP and WMMA Intrinsics

This project is a personal exploration of HIP programming and the RDNA3 Wave Matrix Multiply-Accumulate (WMMA) intrinsic. The primary goal was to deepen my understanding of the WMMA intrinsic and extend the fixed-size example provided in the GPUOpen tutorial to support arbitrary matrix dimensions. While this project is primarily for personal learning, it may also serve as a helpful reference for others interested in exploring the WMMA intrinsic.

Note: The WMMA intrinsic is specific to RDNA3 GPUs for now, so running this project requires an RDNA3-compatible GPU. Testing this implementation on RDNA4 hardware may follow once that hardware becomes available. For production-grade GPU matrix multiplication, use rocWMMA, which provides a robust and optimized abstraction over the WMMA functionality.

Objective

This project aims to:

Provide a simple example of HIP programming and WMMA usage for GPU-accelerated computation.
Extend beyond the fixed-size example in the GPUOpen tutorial by supporting arbitrary matrix dimensions (`M, N, K`).
Enhance understanding of the WMMA intrinsic's mechanics, especially around data loading and storing.
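
To make the "arbitrary dimensions" goal concrete, here is a minimal host-side sketch of how a problem of size `M, N, K` maps onto 16x16x16 WMMA tiles. The helper names (`tiles_for`, `padded`, `tile_grid`) are hypothetical and not taken from this repository, and the sketch does not say whether the kernels pad their inputs or use bounds checks; it only shows the ceiling-division arithmetic that either approach needs.

```cpp
#include <cassert>
#include <cstddef>

// The RDNA3 WMMA intrinsic works on 16x16x16 tiles.
constexpr std::size_t kTile = 16;

// Number of tiles needed to cover n elements (ceiling division).
constexpr std::size_t tiles_for(std::size_t n) { return (n + kTile - 1) / kTile; }

// Smallest multiple of the tile size that is >= n.
// Useful if the matrices are zero-padded up to a tile boundary.
constexpr std::size_t padded(std::size_t n) { return tiles_for(n) * kTile; }

struct TileGrid {
    std::size_t rows;     // tiles along M
    std::size_t cols;     // tiles along N
    std::size_t k_steps;  // tiles along K (accumulation steps per output tile)
};

// Hypothetical helper: tile counts for an M x N x K problem.
inline TileGrid tile_grid(std::size_t M, std::size_t N, std::size_t K) {
    assert(M > 0 && N > 0 && K > 0);
    return {tiles_for(M), tiles_for(N), tiles_for(K)};
}
```

For example, a 100 x 50 x 33 problem needs a 7 x 4 grid of output tiles, each accumulated over 3 steps along K, and the last tile in every dimension is only partially filled.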

Features

Matrix Multiplication with WMMA Intrinsic: Demonstrates how to use the RDNA3 WMMA intrinsic from HIP, with several kernel implementations.
Support for Arbitrary Sizes: Goes beyond the fixed-size (16x16) example, allowing users to experiment with any matrix dimensions.
Shared Memory and WMMA Comparison: Runs both a shared memory kernel and a WMMA kernel for performance comparison.
Verification Mode: Compares GPU results with CPU reference computations to ensure correctness.
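
As an illustration of the verification idea, the sketch below computes a row-major reference GEMM on the CPU and compares it against another result with a relative tolerance. It is not the code in this repository: the function names (`cpu_gemm`, `all_close`), the use of `float` on the host, and the tolerance scheme are all assumptions chosen for the example. A tolerance is needed because half-precision GPU results will not match a CPU result bit for bit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major reference GEMM: C = A * B, with A (M x K), B (K x N), C (M x N).
std::vector<float> cpu_gemm(const std::vector<float>& A, const std::vector<float>& B,
                            std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
    return C;
}

// Returns true when every element of `got` is within rel_tol of `want`.
// The error is scaled by max(1, |want|) so values near zero are not over-penalised.
bool all_close(const std::vector<float>& got, const std::vector<float>& want, float rel_tol) {
    if (got.size() != want.size()) return false;
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float scale = std::max(1.0f, std::fabs(want[i]));
        if (std::fabs(got[i] - want[i]) > rel_tol * scale) return false;
    }
    return true;
}
```

In a real check, `got` would hold the GPU output converted to `float` after copying it back to the host.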

Future Plans

If time permits, the following enhancements are planned:

Further optimisations of the WMMA HGEMM kernel: Explore additional techniques to improve its performance.
Explore WMMA in Other Kernel Types: Investigate the use of WMMA intrinsics in other GPU workloads beyond matrix multiplication.
Test on RDNA4: Extend the implementation to test and validate the WMMA intrinsic on future RDNA4 hardware.

How to Build and Run

Prerequisites

AMD ROCm installed with HIP support.
CMake version 3.10 or higher.
AMD RDNA3 GPU.

Steps

Clone the repository and navigate to its root directory:

git clone https://github.com/AJcodes/hip_wmma_samples.git
cd hip_wmma_samples

Build the project:
```
mkdir build
cd build
cmake ..
make
```
Run the executable:
```
./hgemm
```

Usage

The program outputs performance metrics for all kernels, such as:

GEMM Kernel Type: XXX
----------------------------------------------------------------------------------
Kernel execution time for sizes (128, 128, 128): X.XX ms
...
Kernel execution time for sizes (1024, 1024, 1024): X.XX ms
...
Kernel execution time for sizes (4096, 4096, 4096): X.XX ms
----------------------------------------------------------------------------------

You can modify the matrix dimensions by changing the `M`, `N`, and `K` variables in main.cpp.
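
The reported times are easier to compare across sizes when converted to throughput. A GEMM of size `M, N, K` performs `2 * M * N * K` floating-point operations (one multiply and one add per term), so a rough TFLOPS figure follows directly from the time in milliseconds. The helpers below are a hypothetical sketch and are not part of the program's output.

```cpp
#include <cassert>
#include <cstddef>

// Floating-point operations in an M x N x K GEMM: one multiply and one add per term.
inline double gemm_flops(std::size_t M, std::size_t N, std::size_t K) {
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) * static_cast<double>(K);
}

// Throughput in TFLOPS for a kernel that took `ms` milliseconds.
inline double gemm_tflops(std::size_t M, std::size_t N, std::size_t K, double ms) {
    assert(ms > 0.0);
    const double seconds = ms * 1e-3;
    return gemm_flops(M, N, K) / seconds / 1e12;
}
```

For instance, a 4096 x 4096 x 4096 run that takes 10 ms works out to roughly 13.7 TFLOPS.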

Key Insights

This project emphasizes understanding the mechanics of HIP programming and RDNA3 WMMA intrinsics, particularly how to handle data loading and storage effectively.
It is intended as a learning tool and not as an optimized implementation for production use.
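
The storing side can be checked on the CPU without a GPU. The sketch below emulates the output mapping as I understand it from the GPUOpen tutorial for the wave32 f16 variant: each of the 32 lanes ends up with 8 useful results, and lane `lane` writes result `ele` to row `ele * 2 + lane / 16`, column `lane % 16` of the 16x16 tile. Treat that mapping as an assumption to verify against the tutorial and ISA guide; the names `c_store_coord` and `store_coverage` are hypothetical. The point is to confirm that the 32 x 8 writes cover all 256 tile elements exactly once.

```cpp
#include <array>
#include <cassert>

constexpr int kWmmaDim = 16;   // WMMA tile is 16x16
constexpr int kWaveSize = 32;  // wave32 mode
constexpr int kResultsPerLane = kWmmaDim * kWmmaDim / kWaveSize;  // 8

struct Coord {
    int row;
    int col;
};

// Assumed mapping: where lane `lane` stores its `ele`-th result (0..7).
// Lanes 0-15 write even rows, lanes 16-31 write odd rows.
constexpr Coord c_store_coord(int lane, int ele) {
    return {ele * 2 + lane / kWmmaDim, lane % kWmmaDim};
}

// Counts how many times each tile element is written by one full wave.
inline std::array<int, kWmmaDim * kWmmaDim> store_coverage() {
    std::array<int, kWmmaDim * kWmmaDim> hits{};
    for (int lane = 0; lane < kWaveSize; ++lane) {
        for (int ele = 0; ele < kResultsPerLane; ++ele) {
            const Coord c = c_store_coord(lane, ele);
            assert(c.row >= 0 && c.row < kWmmaDim && c.col >= 0 && c.col < kWmmaDim);
            ++hits[c.row * kWmmaDim + c.col];
        }
    }
    return hits;
}
```

If every entry returned by `store_coverage()` equals 1, the mapping writes each output element once with no gaps or overlaps, which is a useful sanity check before extending the indexing to larger matrices.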

Acknowledgments

This project was inspired by:

The GPUOpen RDNA3 WMMA Tutorial.
The excellent work on rocWMMA, which should be used for real-world applications involving WMMA.
