-
I'm using the CUTLASS 2.5 profiler on a V100. For a GEMM case, I generate all kernels with the following command line: ./tools/profiler/cutlass_profiler --kernels=*s*gemm*_f16*align8 --m=3456 --n=4096 --k=4096 The instruction shape confuses me. In the log files, inst_m/n/k is 16x16x4, but when I check the generated CUDA file for the kernel cutlass_tensorop_s884gemm_f16_128x256_32x2_nn_align8, the instruction shape is 8x8x4:
# logs files
Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:column --C=f32:column --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=256 \
--cta_k=32 --stages=2 --warps_m=2 --warps_n=4 --warps_k=1 --inst_m=16 --inst_n=16 --inst_k=4 --min_cc=70 \
--max_cc=75
# cuda files
using cutlass_tensorop_s884gemm_f16_128x256_32x2_nn_align8_base =
  typename cutlass::gemm::kernel::DefaultGemmUniversal<
    cutlass::half_t, cutlass::layout::RowMajor, cutlass::ComplexTransform::kNone, 8,  // transposed B operand
    cutlass::half_t, cutlass::layout::RowMajor, cutlass::ComplexTransform::kNone, 8,  // transposed A operand
    float, cutlass::layout::RowMajor,
    float,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm70,
    cutlass::gemm::GemmShape<128, 256, 32>,
    cutlass::gemm::GemmShape<64, 64, 32>,
    cutlass::gemm::GemmShape<8, 8, 4>,
    cutlass::epilogue::thread::LinearCombination<
      float,
      4,
      float,
      float
    >,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
    2,
    cutlass::arch::OpMultiplyAdd
  >::GemmKernel;
Answered by hwu36, Dec 12, 2022
-
It is a bit confusing. The Volta tensor core instruction is different from the Turing/Ampere ones. We use four 8x8x4 Volta tensor core instructions together, which looks like a single 16x16x4 operation, so the profiler reports the combined 16x16x4 shape while the generated kernel uses the hardware GemmShape<8, 8, 4>.
Answer selected by Adnios