[QST] Is there any INT8 GEMM with INT8 alpha and beta? #1157

Closed
jhss opened this issue Oct 25, 2023 · 8 comments

jhss commented Oct 25, 2023

What is your question?

using Gemm = cutlass::gemm::device::Gemm<
      int8_t, cutlass::layout::RowMajor, int8_t, cutlass::layout::ColumnMajor,
      ElementOutput, cutlass::layout::RowMajor, ElementAccumulator,
      cutlass::arch::OpClassTensorOp, cutlass::arch::Sm75>;

cutlass::gemm::GemmCoord problem_size(M, N, K);

cutlass::TensorRef<ElementInputA, LayoutInputA> input_ref(input.data_ptr<int8_t>(), LayoutInputA::packed(input_size));
cutlass::TensorRef<ElementInputB, LayoutInputB> weight_ref(weight.data_ptr<int8_t>(), LayoutInputB::packed(weight_size));
cutlass::TensorRef<ElementOutput, LayoutOutput> out_ref(out.data_ptr<int8_t>(), LayoutOutput::packed(output_size));

typename Gemm::Arguments arguments{
      problem_size, // <- problem size of matrix multiplication
      input_ref,    // <- reference to matrix A on device
      weight_ref,   // <- reference to matrix B on device
      out_ref,      // <- reference to matrix C on device
      out_ref,      // <- reference to matrix D on device
      {alpha, beta}, 1};

In the code above, if I set alpha and beta to INT8, I get a warning about a narrowing conversion from int to float.

Do alpha and beta have to be float? I want to set them to INT8 to increase inference speed.
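
For context, the warning described above is what C++ list initialization produces when a non-constant integer is placed in braces where a float is expected. The following is a minimal sketch, assuming a hypothetical `EpilogueParams` struct with float members that stands in for the epilogue's parameter type (it is not the CUTLASS definition, and `make_params` is an illustrative helper):

```cpp
#include <cstdint>

// Hypothetical stand-in for an epilogue parameter struct whose compute
// type is float. This is NOT the CUTLASS definition, only an illustration.
struct EpilogueParams {
    float alpha;
    float beta;
};

// Builds the parameters from integer scales. Writing `{alpha, beta}` with
// the int variables directly would be a narrowing conversion (int -> float)
// inside braces, which compilers diagnose. The explicit casts make the
// conversion intentional and silence the diagnostic.
constexpr EpilogueParams make_params(int alpha, int beta) {
    return EpilogueParams{static_cast<float>(alpha), static_cast<float>(beta)};
}
```

The cast only removes the diagnostic; the values are still stored and used as float afterwards.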

Collaborator

hwu36 commented Oct 25, 2023

Just using 8-bit alpha/beta is not going to make a performance difference.

Collaborator

mnicely commented Dec 5, 2023

@jhss is your question resolved?

Author

jhss commented Dec 8, 2023

I want to know why 8-bit int alpha/beta doesn't affect performance.

Collaborator

thakkarV commented Dec 8, 2023

Because shrinking a single load per tile from 4 bytes to 1 byte does not change the perf at all. Changing the fp32 multiplication to int8 will also not move the needle much in the grand scheme of things. Which problem size are you most interested in?
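
To put rough numbers on the load-size argument, here is a back-of-the-envelope sketch. The 128x128 tile shape is an assumption chosen for illustration (the thread does not state one), K = 768 comes from the problem size mentioned later in the thread, and the helper names are hypothetical:

```cpp
#include <cstdint>

// Bytes of A and B operand data that one threadblock tile reads for an
// int8 GEMM: (tile_m + tile_n) * k elements at 1 byte each.
constexpr std::int64_t operand_bytes_per_tile(std::int64_t tile_m,
                                              std::int64_t tile_n,
                                              std::int64_t k) {
    return (tile_m + tile_n) * k * static_cast<std::int64_t>(sizeof(std::int8_t));
}

// Bytes saved per tile by storing `num_scalars` scalars (alpha and/or beta)
// as int8 instead of fp32. Assumes sizeof(float) == 4.
constexpr std::int64_t scalar_bytes_saved(std::int64_t num_scalars) {
    return num_scalars *
           static_cast<std::int64_t>(sizeof(float) - sizeof(std::int8_t));
}
```

Under these assumptions, `operand_bytes_per_tile(128, 128, 768)` is 196,608 bytes, against 3 bytes saved per scalar, so the saving is a vanishing fraction of the tile's memory traffic.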

Author

jhss commented Dec 8, 2023

Thank you for answering.

I'm looking at the SmoothQuant repository, which uses matrix multiplications of size about (batch x 2048 x 768) * (768 x 768) in one layer. They perform the matmul with torch_int, which uses cutlass::epilogue::thread::linear_combination at the end of the matrix multiplication. With it, the INT32 accumulator is converted to float32 as follows:

weight_scale (alpha) * accumulator + bias_scale (beta) * bias

Suppose the accumulator shape is (batch x 2048 x 768); then the weight_scale shape is (2048,), and it is broadcast and multiplied with the accumulator. My thinking is that the (1, 2048, 1) * (batch x 2048 x 768) multiplication would become faster if the type of weight_scale were changed from fp32 to int8.
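
The computation described above can be written out as a plain CPU reference. This is a sketch under stated assumptions, not the torch_int or CUTLASS implementation: it follows the per-row scale layout described in this comment, assumes an fp32 bias with one value per column, handles a single batch element, and uses hypothetical function names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One output element: alpha * accumulator + beta * bias, computed in fp32.
constexpr float epilogue_element(std::int32_t acc, float alpha, float beta, float bias) {
    return alpha * static_cast<float>(acc) + beta * bias;
}

// Applies the epilogue to a row-major (rows x cols) int32 accumulator,
// using one scale per row (as described above) and one bias value per
// column (an assumption made for this sketch).
inline std::vector<float> apply_epilogue(const std::vector<std::int32_t>& acc,
                                         const std::vector<float>& row_scale,
                                         const std::vector<float>& bias,
                                         float beta,
                                         std::size_t rows,
                                         std::size_t cols) {
    std::vector<float> out(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            out[r * cols + c] =
                epilogue_element(acc[r * cols + c], row_scale[r], beta, bias[c]);
        }
    }
    return out;
}
```

Each scale is read once per row and reused across all columns, so the storage type of the scale affects only a small number of loads, while the number of multiplies stays at rows x cols either way.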

Collaborator

thakkarV commented Dec 8, 2023

Although I doubt it will help, you can certainly try int8 alpha/beta in this case. To do so, modify the epilogue thread functor's ElementCompute type and then use that functor to construct your epilogue.
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/thread/linear_combination.h#L68
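
To illustrate what changing the compute type means, here is a self-contained stand-in. `LinearCombinationSketch` is a hypothetical struct written for this example; it only mimics the idea of a functor templated on an ElementCompute type and does not use or reproduce the CUTLASS header linked above.

```cpp
#include <cstdint>

// Hypothetical stand-in for an epilogue functor parameterized on the type
// used for alpha/beta and for the intermediate math (ElementCompute).
template <typename ElementOutput, typename ElementAccumulator, typename ElementCompute>
struct LinearCombinationSketch {
    ElementCompute alpha;
    ElementCompute beta;

    // Computes alpha * acc + beta * source after converting both inputs
    // to ElementCompute, then converts the result to ElementOutput.
    constexpr ElementOutput operator()(ElementAccumulator acc, ElementOutput source) const {
        return static_cast<ElementOutput>(alpha * static_cast<ElementCompute>(acc) +
                                          beta * static_cast<ElementCompute>(source));
    }
};

// Same functor with fp32 and with int8 as the compute type.
using FloatEpilogue = LinearCombinationSketch<float, std::int32_t, float>;
using Int8Epilogue  = LinearCombinationSketch<float, std::int32_t, std::int8_t>;
```

One thing to check before trying this: with an int8 compute type, alpha and beta can only hold integers in [-128, 127], so a fractional scale such as 0.5 cannot be represented, and in this sketch the accumulator is also converted to int8 before the multiply.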

Collaborator

mnicely commented Jan 2, 2024

@jhss is your question resolved?

github-actions bot commented Feb 1, 2024

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
