Questions about GemmWithReduceK #723

Enter-tainer · 2022-12-06T09:15:37Z

Enter-tainer
Dec 6, 2022

The current implementation of gemm+reduce_k first writes the D matrix to global memory, then loads D back from global memory to perform the reduction. Is there a trick to avoid this extra read from global memory? I think the extra read makes the kernel inefficient.
Currently it only supports fp16 and bf16. How can I add fp32 support? I think I need to add something in include/cutlass/gemm/warp/default_mma_with_reduction_tensor_op.h, but I'm not sure how to do it.

hwu36 · 2022-12-06T16:38:43Z

hwu36
Dec 6, 2022
Maintainer

first write the D matrix to the global memory. Then load D from global memory and perform the reduction.

What do you mean? We don't do it unless we use splitK.

How to add fp32 support for it?

Do you plan to use tf32 tensor core?

1 reply

Enter-tainer Dec 6, 2022
Author

We don't do it unless we use splitK.

Thanks for your suggestion!

I'm not sure if I understand the code correctly or not. Let me first explain my understanding of the code. The gemm+reduce_k kernel will first do some partial reduction while performing the threadblock scope mma. And the result of the partial reduction would be stored in gemm_k_accumulators. After finishing the epilogue, the kernel will sum gemm_k_accumulators up and get the final reduction result.

I thought we had an extra read in EpilogueGemmKReduction when I posted this because I found load_global in its code. But now I find that it only happens when LoadForSerialSplitK is true. (or is it?) Therefore, the "extra read" won't exist if we are not using split-k.

Do you plan to use tf32 tensor core?

Yes. And I just found that maybe I should also add code here to perform partial reduction while doing mma.
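To make the understanding above concrete, here is a small host-side model of the split-K case. It is only a sketch: `reduce_k_split` is a hypothetical helper, not a CUTLASS API, and it assumes a row-major M x K operand. Each slice reduces its own K range into a partial vector, and a final pass sums the partial vectors, which is the step where previously written partial results have to be read back.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only (hypothetical helper, not CUTLASS code).
// `a` is a row-major M x K matrix. Each of `slices` split-K slices reduces
// its own range of K; the partial vectors are then summed in a final pass.
std::vector<float> reduce_k_split(const std::vector<float>& a,
                                  std::size_t m, std::size_t k,
                                  std::size_t slices) {
  assert(slices > 0 && a.size() == m * k);
  const std::size_t chunk = (k + slices - 1) / slices;  // K elements per slice

  // One partial reduction vector per slice.
  std::vector<std::vector<float>> partial(slices, std::vector<float>(m, 0.0f));
  for (std::size_t s = 0; s < slices; ++s) {
    const std::size_t begin = std::min(s * chunk, k);
    const std::size_t end = std::min(begin + chunk, k);
    for (std::size_t i = 0; i < m; ++i)
      for (std::size_t kk = begin; kk < end; ++kk)
        partial[s][i] += a[i * k + kk];
  }

  // Final pass: sum the partial vectors into one result of length M.
  std::vector<float> out(m, 0.0f);
  for (std::size_t s = 0; s < slices; ++s)
    for (std::size_t i = 0; i < m; ++i)
      out[i] += partial[s][i];
  return out;
}
```

With `slices == 1` there is a single partial vector and the final pass has nothing to combine, which matches the idea that the extra read only matters for split-K.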

hwu36 · 2022-12-06T18:46:24Z

hwu36
Dec 6, 2022
Maintainer

The gemm+reduce_k kernel will first do some partial reduction while performing the threadblock scope mma. And the result of the partial reduction would be stored in gemm_k_accumulators. After finishing the epilogue, the kernel will sum gemm_k_accumulators up and get the final reduction result.

Only split k will do it. If you don't use splitk, we don't output partial sum.

And I just found that maybe I should also add code here to perform partial reduction while doing mma.

Yes, that is the right place. Suppose you want to reduce A operand, your cuda code is

            gemm_k_reduction[m * 2] += float(A[m * 4]);
            gemm_k_reduction[m * 2] += float(A[m * 4 + 2]);
  
            gemm_k_reduction[m * 2 + 1] += float(A[m * 4 + 1]);
            gemm_k_reduction[m * 2 + 1] += float(A[m * 4 + 3]);

If you reduce for B, your cuda code is

            gemm_k_reduction[n_serpentine] += float(B[n_serpentine * 2]);
            gemm_k_reduction[n_serpentine] += float(B[n_serpentine * 2 + 1]);

You can try writing the above code in inline PTX to get better performance. The mainloop of the tf32 GEMM is already very busy, so fusing the reduction into it may cost a lot of performance; you need to benchmark it.
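When benchmarking a fused version it helps to have a plain reference to check the output vector against. The sketch below is host-side C++ only, not CUTLASS code. It assumes row-major storage and that reducing A along K yields one value per row (length M) while reducing B along K yields one value per column (length N). The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference for the "reduce A" case: `a` is row-major M x K,
// the result holds the sum over K for each of the M rows.
std::vector<float> reduce_a_along_k(const std::vector<float>& a,
                                    std::size_t m, std::size_t k) {
  assert(a.size() == m * k);
  std::vector<float> out(m, 0.0f);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t kk = 0; kk < k; ++kk)
      out[i] += a[i * k + kk];
  return out;
}

// Reference for the "reduce B" case: `b` is row-major K x N,
// the result holds the sum over K for each of the N columns.
std::vector<float> reduce_b_along_k(const std::vector<float>& b,
                                    std::size_t k, std::size_t n) {
  assert(b.size() == k * n);
  std::vector<float> out(n, 0.0f);
  for (std::size_t kk = 0; kk < k; ++kk)
    for (std::size_t j = 0; j < n; ++j)
      out[j] += b[kk * n + j];
  return out;
}
```

Since the tensor core path rounds inputs to tf32 and sums in a different order, a comparison against this reference would normally use a tolerance rather than exact equality.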

2 replies

Enter-tainer Dec 7, 2022
Author

Thank you for your kind reply! I will give the tf32 version a try.

18321961708 Oct 11, 2024

Hello! I also have questions about this example (23). I want to fuse a GEMM with a reduction of the product, not of A, and I notice that the gemmWithKReduction kernel only supports reducing A or B. If I want to reduce the A*B product, should I consider using the split-K kernel? I am not sure whether my understanding is right. I also cannot run example 23 with split-K correctly (it fails the reference check). Would you please help me with this example? Thanks a lot!

rkindi · 2022-12-11T03:10:15Z

rkindi
Dec 11, 2022

@Enter-tainer I also needed to extend example 23 to tf32 and came up with this snippet, which worked for me. I was planning on upstreaming it but probably won't have time to make a PR before I go on holiday leave. For now, I'll just leave the code here (feel free to turn it into a PR).

#include <iostream>
#include <fstream>
#include <sstream>
#include <cstdlib>
#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/default_gemm_with_k_reduction.h"
#include "cutlass/reduction/device/reduce_split_k.h"
#include "cutlass/reduction/kernel/reduce_split_k.h"
#include "cutlass/reduction/thread/reduction_operators.h"
#include "cutlass/matrix_coord.h"
// For default_mma_with_reduction_tensor_op.h fp32 specialization.
#include "cutlass/gemm/warp/mma_with_reduction_tensor_op.h"
// For mma_with_reduction_tensor_op.h fp32 specialization.
#include "cutlass/array.h"
#include "cutlass/platform/platform.h"
#include "cutlass/numeric_conversion.h"
#include "cutlass/numeric_types.h"
#include "cutlass/matrix_shape.h"
#include "cutlass/arch/memory_sm75.h"
#include "cutlass/arch/mma_sm75.h"
#include "cutlass/arch/mma_sm80.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/gemm/warp/mma.h"
#include "cutlass/gemm/warp/mma_tensor_op_policy.h"
#include "cutlass/gemm/warp/mma_tensor_op.h"
#include "cutlass/gemm/warp/mma_tensor_op_tile_iterator.h"
#include "cutlass/gemm/warp/mma_tensor_op_tile_iterator_sm80.h"

#define CUTLASS_CHECK(status)                                                                    \
  {                                                                                              \
    cutlass::Status error = status;                                                              \
    if (error != cutlass::Status::kSuccess) {                                                    \
      std::cerr << "Got cutlass error: " << cutlassGetStatusString(error) << " at: " << __LINE__ \
                << std::endl;                                                                    \
      exit(EXIT_FAILURE);                                                                        \
    }                                                                                            \
  }


// Add a partial specialization for fp32/tf32 since default_mma_with_reduction_tensor_op.h does not support it.
// TODO: Upstream to cutlass.

namespace cutlass {
namespace gemm {
namespace warp {

// Partial specialization for float32.
template <
    /// Size of the Gemm problem - concept: gemm::GemmShape<>
    typename WarpShape_,
    /// Shape of one matrix production operation (concept: GemmShape)
    typename InstructionShape_,
    /// Layout of A matrix (concept: MatrixLayout)
    typename LayoutA,
    /// Layout of B matrix (concept: MatrixLayout)
    typename LayoutB,
    /// Layout of C matrix (concept: MatrixLayout)
    typename LayoutC,
    /// Reduce operand A or B along K dimension
    bool ReduceKForA_,
    /// Number of partitions along K dimension
    int PartitionsK,
    /// Store the accumulators in row major or column major.  Row major is used
    /// when output layout is interleaved.
    bool AccumulatorsInRowMajor>
struct DefaultMmaWithReductionTensorOp<
    WarpShape_,
    InstructionShape_,
    float,
    LayoutA,
    float,
    LayoutB,
    float,
    LayoutC,
    arch::OpMultiplyAdd,
    ReduceKForA_,
    PartitionsK,
    AccumulatorsInRowMajor
> {
  using Policy = cutlass::gemm::warp::MmaTensorOpPolicy<
      cutlass::arch::Mma<InstructionShape_, 32, cutlass::tfloat32_t,
                         cutlass::layout::RowMajor, cutlass::tfloat32_t,
                         cutlass::layout::ColumnMajor, float,
                         cutlass::layout::RowMajor, arch::OpMultiplyAdd>,
      cutlass::MatrixShape<1, 1> >;

  // Define the warp-level tensor op
  using Type = cutlass::gemm::warp::MmaWithReductionTensorOp<
      WarpShape_, float, LayoutA, float, LayoutB, float, LayoutC,
      Policy, ReduceKForA_, PartitionsK, AccumulatorsInRowMajor>;
};

} // namespace warp
} // namespace gemm
} // namespace cutlass

// Add a partial specialization for fp32/tf32 since mma_with_reduction_tensor_op.h does not support it.
// TODO: Upstream to cutlass.

namespace cutlass {
namespace gemm {
namespace warp {

/// Structure to compute the matrix product targeting Tensor Cores, with a fused
/// K-dimension reduction of one operand.
template <
  /// Size of the Gemm problem - concept: gemm::GemmShape<>
  typename Shape_,
  /// Layout of A matrix (concept: MatrixLayout)
  typename LayoutA_,
  /// Layout of B matrix (concept: MatrixLayout)
  typename LayoutB_,
  /// Layout of C matrix (concept: MatrixLayout)
  typename LayoutC_,
  /// Policy describing warp-level MmaTensorOp (concept: MmaTensorOp policy)
  typename Policy_,
  /// Reduce operand A or B along K dimension
  bool ReduceKForA_,
  /// Number of partitions along K dimension
  int PartitionsK_,
  /// Store the accumulators in row major or column major.  Row major is used
  /// when output layout is interleaved.
  bool AccumulatorsInRowMajor
>
class MmaWithReductionTensorOp<
    Shape_,
    float,
    LayoutA_,
    float,
    LayoutB_,
    float,
    LayoutC_,
    Policy_,
    ReduceKForA_,
    PartitionsK_,
    AccumulatorsInRowMajor
> {
public:
  /// Shape of warp-level matrix operation (concept: GemmShape)
  using Shape = Shape_;

  /// Data type of multiplicand A
  using ElementA = float;

  /// Layout of multiplicand A
  using LayoutA = LayoutA_;

  /// Data type of multiplicand B
  using ElementB = float;

  /// Layout of multiplicand B
  using LayoutB = LayoutB_;

  /// Data type of accumulator matrix C
  using ElementC = float;

  /// Layout of accumulator matrix C
  using LayoutC = LayoutC_;

  /// Policy describing the warp-level tensor op (concept: MmaTensorOp policy)
  using Policy = Policy_;

  /// Underlying matrix multiply operator (concept: arch::Mma)
  using ArchMmaOperator = typename Policy::Operator;

  /// Indicates math operator
  using MathOperator = typename ArchMmaOperator::Operator;

  /// Architecture tag from underlying instruction
  using ArchTag = typename ArchMmaOperator::ArchTag;

  /// Indicates class of matrix operator
  using OperatorClass = arch::OpClassTensorOp;

  /// Shape of underlying instruction
  using InstructionShape = typename ArchMmaOperator::Shape;

  /// Complex transform on A operand
  static ComplexTransform const kTransformA = ComplexTransform::kNone;

  /// Complex transform on B operand
  static ComplexTransform const kTransformB = ComplexTransform::kNone;

  /// Number of threads participating in warp-level matrix product
  static int const kThreadCount = 32;

  /// Number of partitions along K dimension
  static int const kPartitionsK = PartitionsK_;

  static bool const kReduceKForA = ReduceKForA_;

  static_assert(platform::is_same<ElementA, float>::value,
                "ElementA needs to be fp32.");

  static_assert(platform::is_same<ElementB, float>::value,
                "ElementB needs to be fp32.");

  static_assert(platform::is_same<InstructionShape,
                                  cutlass::gemm::GemmShape<16, 8, 8>>::value,
                "Only supports 16x8x8 tensor core instruction.");

  static_assert(!AccumulatorsInRowMajor,
                "Only calls tensor core instructions in column major.");

public:

  /// Iterates over the A operand in memory
  using IteratorA = MmaTensorOpMultiplicandTileIterator<
     MatrixShape<Shape::kM, Shape::kK>, Operand::kA, ElementA, LayoutA,
     MatrixShape<ArchMmaOperator::Shape::kM, ArchMmaOperator::Shape::kK>,
     Policy::OpDelta::kRow, kThreadCount, kPartitionsK>;

  /// Storage for A tile
  using FragmentA = typename IteratorA::Fragment;

  /// Storage for transformed A tile
  using TransformedFragmentA =
      Array<typename ArchMmaOperator::ElementA, FragmentA::kElements>;

  /// Iterates over the B operand in memory
  using IteratorB = MmaTensorOpMultiplicandTileIterator<
      MatrixShape<Shape::kK, Shape::kN>, Operand::kB, ElementB, LayoutB,
      MatrixShape<ArchMmaOperator::Shape::kK, ArchMmaOperator::Shape::kN>,
      Policy::OpDelta::kRow, kThreadCount, kPartitionsK>;

  /// Storage for B tile
  using FragmentB = typename IteratorB::Fragment;

  /// Storage for transformed B tile
  using TransformedFragmentB =
      Array<typename ArchMmaOperator::ElementB, FragmentB::kElements>;

  /// Iterates over the C operand in memory
  using IteratorC = MmaTensorOpAccumulatorTileIterator<
     MatrixShape<Shape::kM, Shape::kN>, ElementC, LayoutC,
     typename ArchMmaOperator::Shape, typename Policy::OpDelta>;

  /// Storage for C tile
  using FragmentC = typename IteratorC::Fragment;

  /// Number of mma operations performed
  using MmaIterations = MatrixShape<
    (Shape::kM + ArchMmaOperator::Shape::kM - 1) / ArchMmaOperator::Shape::kM,
    (Shape::kN + ArchMmaOperator::Shape::kN - 1) / ArchMmaOperator::Shape::kN
  >;

  using FragmentReduction = Array<ElementC, kReduceKForA ? (Shape::kM / 8) : (Shape::kN / 8)>;

public:

  /// Underlying matrix multiply operator (concept: arch::Mma)
  ArchMmaOperator mma;

public:

  //
  // Methods
  //

  /// Ctor
  CUTLASS_DEVICE
  MmaWithReductionTensorOp() {}

  /// Performs a warp-level matrix multiply-accumulate operation
  CUTLASS_DEVICE
  void operator()(
    FragmentC &D,
    TransformedFragmentA const &A,
    TransformedFragmentB const &B,
    FragmentC const &C,
    FragmentReduction &gemm_k_reduction
  ) const {

    using MmaOperandA = typename ArchMmaOperator::FragmentA;
    using MmaOperandB = typename ArchMmaOperator::FragmentB;
    using MmaOperandC = typename ArchMmaOperator::FragmentC;

    static_assert(platform::is_same<typename MmaOperandB::Element, cutlass::tfloat32_t>::value, "mmaoperand b wrong type");

    D = C;

    MmaOperandA const *ptr_A = reinterpret_cast<MmaOperandA const *>(&A);
    MmaOperandB const *ptr_B = reinterpret_cast<MmaOperandB const *>(&B);
    MmaOperandC *ptr_D = reinterpret_cast<MmaOperandC *>(&D);

    #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800)
      assert(0);
    #elif defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
      // Serpentine visitation order maximizing reuse of Ra
      CUTLASS_PRAGMA_UNROLL
      for (int m = 0; m < MmaIterations::kRow; ++m) {

        CUTLASS_PRAGMA_UNROLL
        for (int n = 0; n < MmaIterations::kColumn; ++n) {

          int n_serpentine = ((m % 2) ? (MmaIterations::kColumn - 1 - n) : n);

          mma(ptr_D[m + n_serpentine * MmaIterations::kRow],
              ptr_A[m],
              ptr_B[n_serpentine],
              ptr_D[m + n_serpentine * MmaIterations::kRow]);

          if (!kReduceKForA && m == 0) {
            float const B_operand_1 = float(B[n_serpentine * 2]);
            float const B_operand_2 = float(B[n_serpentine * 2 + 1]);

            asm volatile(
              "{\n\t"
              " add.f32 %0, %1, %0;\n\t"
              " add.f32 %0, %2, %0;\n\t"
              "}\n\t"
              : "+f"(gemm_k_reduction[n_serpentine])
              : "f"(B_operand_1), "f"(B_operand_2));
          }

          if (kReduceKForA && (n == 0)) {
            float const A_operand_1 = float(A[m * 4]);
            float const A_operand_2 = float(A[m * 4 + 1]);
            float const A_operand_3 = float(A[m * 4 + 2]);
            float const A_operand_4 = float(A[m * 4 + 3]);

            asm volatile(
              "{\n\t"
                " add.f32 %0, %2, %0;\n\t"
                " add.f32 %1, %3, %1;\n\t"
                " add.f32 %0, %4, %0;\n\t"
                " add.f32 %1, %5, %1;\n\t"
              "}\n\t"
              : "+f"(gemm_k_reduction[m * 2]), "+f"(gemm_k_reduction[m * 2 + 1])
              : "f"(A_operand_1), "f"(A_operand_2),"f"(A_operand_3), "f"(A_operand_4));
          }
        }
      }
    #else
      assert(0);
    #endif
  }

  /// Transform the mma operands to the required types
  CUTLASS_DEVICE
  void transform(TransformedFragmentA &dst_A, TransformedFragmentB &dst_B,
                 FragmentA const &A, FragmentB const &B) const {

    //
    // Define conversions from source type to instruction type
    //
    FloatRoundStyle const kRoundA =
        PreferredRoundingMode<typename ArchMmaOperator::ElementA,
                              ElementA>::kRound;
    FloatRoundStyle const kRoundB =
        PreferredRoundingMode<typename ArchMmaOperator::ElementB,
                              ElementB>::kRound;
    #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800)
      detail::ConvertAndPack<typename ArchMmaOperator::ElementA, ElementA,
                            FragmentA::kElements, kRoundA>
          convert_A;
      NumericArrayConverter<typename ArchMmaOperator::ElementB, ElementB,
                            FragmentB::kElements / 2, kRoundB>
          convert_B;
      Array<ElementB, FragmentB::kElements / 2> const *ptr_B =
          reinterpret_cast<Array<ElementB, FragmentB::kElements / 2> const *>(&B);
      Array<typename ArchMmaOperator::ElementB, FragmentB::kElements / 2> *
          ptr_dst_B = reinterpret_cast<Array<typename ArchMmaOperator::ElementB,
                                             FragmentB::kElements / 2> *>(&dst_B);

      dst_A = convert_A(A);

      ptr_dst_B[0] = convert_B(ptr_B[0]);
      ptr_dst_B[1] = convert_B(ptr_B[1]);

    #elif defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
      detail::ConvertAndPack<typename ArchMmaOperator::ElementA, ElementA,
                            FragmentA::kElements / 2, kRoundA>
          convert_A;
      NumericArrayConverter<typename ArchMmaOperator::ElementB, ElementB,
                            FragmentB::kElements, kRoundB>
          convert_B;
      Array<ElementA, FragmentA::kElements / 2> const *ptr_A =
          reinterpret_cast<Array<ElementA, FragmentA::kElements / 2> const *>(&A);
      Array<typename ArchMmaOperator::ElementA, FragmentA::kElements / 2> *
          ptr_dst_A = reinterpret_cast<Array<typename ArchMmaOperator::ElementA,
                                             FragmentA::kElements / 2> *>(&dst_A);

      dst_B = convert_B(B);

      ptr_dst_A[0] = convert_A(ptr_A[0]);
      ptr_dst_A[1] = convert_A(ptr_A[1]);
    #else
      assert(0);
    #endif
  }
};

} // namespace warp
} // namespace gemm
} // namespace cutlass

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

// The code section below describes datatype for input, output tensors and computation between
// elements
using ElementAccumulator = float;                  // Data type of accumulator
using ElementComputeEpilogue = ElementAccumulator; // Data type of epilogue computation
using ElementInputA = float;         // Data type of elements in input tensor
using ElementInputB = float;         // Data type of elements in input tensor
using ElementOutput = float;         // Data type of elements in output tensor

using LayoutInputA = cutlass::layout::ColumnMajor;
using LayoutInputB = cutlass::layout::RowMajor;
using LayoutOutput = cutlass::layout::RowMajor;

// Layout of the output vector
using LayoutGemmKReduction = cutlass::layout::PitchLinear;

// This code section describes whether you want to use tensor cores or regular SIMT cores on GPU SM
using MMAOp = cutlass::arch::OpClassTensorOp;

// This code section describes CUDA SM architecture number
using SmArch = cutlass::arch::Sm80;

// This code section describes the size of MMA op
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 8>;    // TensorCore instruction shape

// This code section describes how threadblocks are scheduled on GPU
using SwizzleThreadBlock = cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>;

// Reduce A or B operand along the K dimension
constexpr bool ReduceKForA = true;

// This code section describes the epilogue part of the kernel, we use default value
using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
    ElementOutput,                                        // Data type of output matrix.
    128 / cutlass::sizeof_bits<ElementOutput>::value,     // The number of elements per vectorized.
                                                          // memory access. This becomes the vector width of
                                                          // math instructions in the epilogue too.
    ElementAccumulator,                                   // Data type of accumulator
    ElementComputeEpilogue>;

using GemmKernel = typename cutlass::gemm::kernel::DefaultGemmWithKReduction<
  ElementInputA,
  LayoutInputA,
  cutlass::ComplexTransform::kNone,
  4, // alignment_a
  ElementInputB,
  LayoutInputB,
  cutlass::ComplexTransform::kNone,
  4, // alignment_b
  ElementOutput,
  LayoutOutput,
  ElementAccumulator,
  MMAOp,
  ReduceKForA,
  SmArch,
  cutlass::gemm::GemmShape<${block_shape}>, // ThreadblockShape,
  cutlass::gemm::GemmShape<${warp_shape}>, // WarpShape,
  InstructionShape,
  EpilogueOp,
  SwizzleThreadBlock,
  ${stages}, // NumStages,
  cutlass::arch::OpMultiplyAdd
>::GemmKernel;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;

void my_cu_fn(
    int M,
    int K,
    int N,
    void* Z,
    void* T,
    void* A,
    void* B,
    int split_k,
    uint8_t* workspace,
    const cudaStream_t stream
) {
    using ElementComputeEpilogue = ElementAccumulator;
    typename Gemm::Arguments arguments(
        cutlass::gemm::GemmUniversalMode::kGemm,
        {N, M, K},
        split_k,
        {ElementComputeEpilogue(1), ElementComputeEpilogue(0)},
        static_cast<void*>(B),
        static_cast<void*>(A),
        static_cast<void*>(Z),
        static_cast<void*>(Z),
        static_cast<void*>(T),
        N * K,
        M * K,
        M * N,
        M * N,
        M,
        N,
        M,
        N,
        N,
        0
    );
    Gemm gemm_op;
    auto status = gemm_op.can_implement(arguments);
    ...
}
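
For anyone reading the warp-level loop in the snippet above, the following host-side emulation may help. It is only a sketch of the loop structure, with `rows` and `cols` standing in for `MmaIterations::kRow` and `MmaIterations::kColumn`, and the helper names are made up. It lists the serpentine visitation order and counts how often the `m == 0` and `n == 0` guards fire, to show that each operand fragment is added to the reduction once per call rather than once per mma.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Visitation order of (m, n_serpentine) tiles, mirroring the nested loops.
std::vector<std::pair<int, int>> serpentine_order(int rows, int cols) {
  assert(rows >= 0 && cols >= 0);
  std::vector<std::pair<int, int>> order;
  for (int m = 0; m < rows; ++m) {
    for (int n = 0; n < cols; ++n) {
      // Odd rows walk the columns backwards so neighbouring tiles are reused.
      int n_serpentine = (m % 2) ? (cols - 1 - n) : n;
      order.emplace_back(m, n_serpentine);
    }
  }
  return order;
}

// How many times each reduction guard fires per A row and per B column.
struct GuardCounts {
  std::vector<int> a_rows;  // guard `n == 0`, used when reducing A
  std::vector<int> b_cols;  // guard `m == 0`, used when reducing B
};

GuardCounts count_guards(int rows, int cols) {
  assert(rows >= 0 && cols >= 0);
  GuardCounts counts{std::vector<int>(rows, 0), std::vector<int>(cols, 0)};
  for (int m = 0; m < rows; ++m) {
    for (int n = 0; n < cols; ++n) {
      int n_serpentine = (m % 2) ? (cols - 1 - n) : n;
      if (m == 0) ++counts.b_cols[n_serpentine];
      if (n == 0) ++counts.a_rows[m];
    }
  }
  return counts;
}
```

For a 2 x 3 iteration space the order is (0,0), (0,1), (0,2), (1,2), (1,1), (1,0), and every entry of both count vectors is 1.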

0 replies
