-
I'm using the CUTLASS 2.5 profiler on a V100. For a GEMM case, I generate all kernels with the following command line: ./tools/profiler/cutlass_profiler --kernels=*s*gemm*_f16*align8 --m=3456 --n=4096 --k=4096 The instruction shape confuses me. In the log files, inst_m/n/k is 16x16x4, but when I check the generated CUDA file for the kernel cutlass_tensorop_s884gemm_f16_128x256_32x2_nn_align8, the instruction shape is 8x8x4:
# logs files
Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:column --C=f32:column --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=256 \
--cta_k=32 --stages=2 --warps_m=2 --warps_n=4 --warps_k=1 --inst_m=16 --inst_n=16 --inst_k=4 --min_cc=70 \
--max_cc=75
# cuda files
using cutlass_tensorop_s884gemm_f16_128x256_32x2_nn_align8_base =
  typename cutlass::gemm::kernel::DefaultGemmUniversal<
    cutlass::half_t, cutlass::layout::RowMajor, cutlass::ComplexTransform::kNone, 8,  // transposed B operand
    cutlass::half_t, cutlass::layout::RowMajor, cutlass::ComplexTransform::kNone, 8,  // transposed A operand
    float, cutlass::layout::RowMajor,
    float,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm70,
    cutlass::gemm::GemmShape<128, 256, 32>,
    cutlass::gemm::GemmShape<64, 64, 32>,
    cutlass::gemm::GemmShape<8, 8, 4>,
    cutlass::epilogue::thread::LinearCombination<
      float,
      4,
      float,
      float
    >,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
    2,
    cutlass::arch::OpMultiplyAdd
  >::GemmKernel;
Answered by hwu36, Dec 12, 2022
-
It is a bit confusing. The Volta tensor core instruction is different from the Turing/Ampere ones. We use four 8x8x4 Volta tensor core instructions together, which looks like a single 16x16x4 operation, so the profiler reports the combined 16x16x4 shape while the generated kernel uses the hardware GemmShape<8, 8, 4>.
Answer selected by Adnios