[MFMA] Refactor dot pipeline to reduce code duplication #400

binarman · 2023-11-14T13:59:57Z

This PR:

- simplifies the data types generated by shared->mfma dot operand layout conversions: values are no longer packed into int32 or int64
- reduces code duplication between the fast and normal paths
- reduces code duplication between operand A and operand B

This PR generalizes the LLVM values generated by ttg->llvm operand loading: the shared-to-mfma conversion now produces an array of repNxrepK vectors of matrix elements.
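One way to picture the resulting container, as a minimal host-side sketch and not the PR's actual LLVM-emitting code: a flat array holding one vector of matrix elements per (non-K repetition, K repetition) pair. The names `OperandValues`, `numRepNonK`, `numRepK`, and `kWidth`, as well as the row-major index `numRepK * i + j`, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One vector of kWidth matrix elements, kept in the operand's own element
// type (float here) instead of being packed into int32/int64.
using ElemVec = std::vector<float>;

// Hypothetical model: a flat array of numRepNonK x numRepK element vectors.
struct OperandValues {
  std::size_t numRepNonK;
  std::size_t numRepK;
  std::vector<ElemVec> vecs;

  OperandValues(std::size_t repNonK, std::size_t repK, std::size_t kWidth)
      : numRepNonK(repNonK), numRepK(repK),
        vecs(repNonK * repK, ElemVec(kWidth, 0.0f)) {}

  // Assumed row-major order: the K repetition index varies fastest.
  ElemVec &at(std::size_t i, std::size_t j) {
    assert(i < numRepNonK && j < numRepK);
    return vecs[numRepK * i + j];
  }
};
```

With `OperandValues v(2, 3, 4)`, the container holds six vectors of four elements each, and `v.at(1, 2)` refers to the last of them.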

zhanglx13 · 2023-12-05T16:51:14Z

lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMFMA.cpp

+  if (nonKIdx == 1)
+    waveId = udiv(waveId, i32_val(wpt[0]));
+  return urem(urem(waveId, i32_val(wpt[nonKIdx])),
+              i32_val(tensorSizeNonK / elemPerInstrNonK));


I'm confused about this part.
Say we have warpsPerCTA={2,2}, waveId=1, then we are talking about the top left wave in the workgroup, right?
If so, we have

for opA, i.e. nonKIdx=0, spatialWarpId = (waveId % wpt[0]) % (M / 32) = 1

for opB, i.e. nonKIdx=1, spatialWarpId = ((waveId/wpt[0])%wpt[1]) % (N / 8) = 0

But shouldn't wave1 have index 1 for opB and index 0 for opA?

binarman · 2023-12-10

> But shouldn't wave1 have index 1 for opB and index 0 for opA?

It was originally implemented like this. I am not sure whether either orientation has an advantage; I did not try the other layout.

If you think transposed wave indexing could have advantages, or is preferable for style reasons, we can swap it and see what happens.

zhanglx13 · 2023-12-11

So it's assumed that the 2x2 layout is

wave0 wave2
wave1 wave3

These two assumptions are not about style, but about correctness. Maybe this is why some gemm tests failed in #402, in which the mfma layout is used directly for global store.

binarman · 2023-12-11

You are probably right. I looked into the failures in #402 a few weeks ago and found that the current mfma layout is not compatible with the global store implementation.
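The index arithmetic discussed in this thread can be checked with plain integers. The sketch below is a scalar model of the reviewed snippet, with `udiv`/`urem` replaced by `/` and `%`; it is not the LLVM-emitting code itself. The helper `waveAt` and the concrete sizes used in the note after the block (a 64x64 tile with a 32-wide instruction tile) are assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>

// Scalar model of the reviewed snippet: udiv/urem become / and %.
unsigned spatialWaveId(unsigned waveId, const std::array<unsigned, 2> &wpt,
                       unsigned nonKIdx, unsigned tensorSizeNonK,
                       unsigned elemPerInstrNonK) {
  assert(nonKIdx < 2);
  assert(elemPerInstrNonK != 0 && tensorSizeNonK >= elemPerInstrNonK);
  if (nonKIdx == 1)
    waveId /= wpt[0];
  return (waveId % wpt[nonKIdx]) % (tensorSizeNonK / elemPerInstrNonK);
}

// Hypothetical helper: returns the wave placed at (row, col) of the wave
// grid, where the row is the operand A index (nonKIdx = 0) and the column
// is the operand B index (nonKIdx = 1). Returns the wave count if no wave
// maps to that position.
unsigned waveAt(unsigned row, unsigned col, const std::array<unsigned, 2> &wpt,
                unsigned sizeM, unsigned sizeN, unsigned instrSize) {
  const unsigned numWaves = wpt[0] * wpt[1];
  for (unsigned w = 0; w < numWaves; ++w) {
    if (spatialWaveId(w, wpt, 0, sizeM, instrSize) == row &&
        spatialWaveId(w, wpt, 1, sizeN, instrSize) == col)
      return w;
  }
  return numWaves;
}
```

With `wpt = {2, 2}` and the assumed sizes, wave 1 gets index 1 for opA and index 0 for opB, and `waveAt` reproduces the arrangement from the thread: wave0 and wave2 in the first row, wave1 and wave3 in the second.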

zhanglx13 · 2023-12-05T19:34:15Z

lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/MFMA.cpp

+        auto rawElems = elems[n1 * i + j];
+        Value convertedElems;
+        if (type.isF32()) {
+          convertedElems = extract_element(type, rawElems, i32_val(0));


Why do we need to dereference rawElems here for i32 input?

binarman · 2023-12-08

This part is related to the processing of the A/B dot operands.
This code is an adapter from the generic vec<base_type x kwidth> format to the format that the rocdl intrinsic expects.

Previously this transformation was done in the type converter (see TypeConverter.cpp below), but I feel it is better to keep it closer to the MFMA-emitting code, so I simplified the type converter and moved this code here.

A and B can be one of a variety of types: fp32, fp16, bf16, int8, fp8*.
But the rocdl mfma intrinsics take some of these types in a packed integer format.

For example:

The fp32xfp32 -> fp32 intrinsic takes plain scalar fp32 values as its A/B arguments.

The fp8xfp8 -> fp32 and int8xint8 -> int32 intrinsics take several values in the form of a packed int32 or int64 (depending on the kwidth of the operation).
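To make the "packed integer" format concrete, here is a small host-side sketch of what packing kwidth 8-bit elements into a single integer looks like. It is an illustration only: the function name `packBytes` is hypothetical, and placing element 0 in the least significant byte is an assumption (it matches what a bitcast of a byte vector would produce on a little-endian target).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack 8-bit elements into one unsigned integer of type Packed.
// Assumed order: element 0 lands in the least significant byte.
template <typename Packed>
Packed packBytes(const std::vector<std::int8_t> &elems) {
  // kwidth = 4 fills a 32-bit word, kwidth = 8 fills a 64-bit word.
  assert(elems.size() == sizeof(Packed));
  Packed out = 0;
  for (std::size_t i = 0; i < elems.size(); ++i) {
    // Go through uint8_t first so negative values are not sign-extended.
    const Packed byte = static_cast<std::uint8_t>(elems[i]);
    out |= byte << (8 * i);
  }
  return out;
}
```

For instance, `packBytes<std::uint32_t>({1, 2, 3, 4})` yields `0x04030201`. The fp32 case needs no packing at all: the operand vector holds a single element, which is extracted and passed as a scalar.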

zhanglx13

LGTM.

There might be a disagreement about the wave layout that causes some of the gemm test failures in #402. We'll fix that later.

alefimov-amd assigned binarman Nov 14, 2023

binarman added 3 commits November 16, 2023 15:11

[MFMA] Rework type conversion for MFMA operands

65ad7ad

This PR generalizes the LLVM values generated by ttg->llvm operand loading: the shared-to-mfma conversion now produces an array of repNxrepK vectors of matrix elements.

unify normal and fast path

11fe6e0

Unify Operands A and B

327d9aa

binarman force-pushed the rework_mfma branch from 5ef7bbb to 327d9aa on November 16, 2023 14:11

scxiao and others added 2 commits November 26, 2023 23:28

Merge branch 'triton-mlir' into rework_mfma

abc78c3

Merge branch 'triton-mlir' into rework_mfma

0cac210

zhanglx13 reviewed Dec 5, 2023

zhanglx13 added 2 commits December 5, 2023 14:18

clean up getMFMARep()

5cd9156

Merge branch 'triton-mlir' into rework_mfma

0d333b8

zhanglx13 approved these changes Dec 11, 2023

Merge remote-tracking branch 'rocm/triton-mlir' into rework_mfma

c60f283

alefimov-amd merged commit f2afd65 into ROCm:triton-mlir Dec 13, 2023
1 check passed

binarman commented Nov 14, 2023

zhanglx13 Dec 5, 2023

binarman Dec 10, 2023

zhanglx13 Dec 11, 2023

binarman Dec 11, 2023

zhanglx13 Dec 5, 2023

binarman Dec 8, 2023

zhanglx13 left a comment
