[QST] Performance Issue of doing GEMM on A100 using CuTe #1858
Your rewrite of ... Beyond that, it appears that you're only using row-major/col-major smem and SIMT smem->rmem copies; I did need to modify the ...
The next step would be to use LDSM for the smem->rmem load (this smem layout is already designed for the LDSM pattern, and the SIMT copy is what is slowing it down now), which I've left notes on; we could look at CUTLASS's SM80 collective to see how that's done. That should achieve speed-of-light, peak performance on A100. Here's my diff/patch for the half_t configuration.
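For concreteness, here is a minimal sketch (assumed names and a U32x4 atom; this is not the attached diff itself) of what swapping the SIMT smem->rmem copy for an LDSM-backed TiledCopy looks like, following the same pattern as CUTLASS's SM80 collective:

#include <cute/tensor.hpp>

using namespace cute;

// Inside the mainloop, build the smem->rmem copy from an LDSM atom and the
// TiledMMA, then copy into the MMA fragments via retile_D. All parameter
// names here are assumptions for illustration.
template <class TiledMma, class SmemTensorA, class RmemTensorA>
__device__ void load_A_with_ldsm(TiledMma const& tiled_mma,
                                 SmemTensorA const& sA,   // (BLK_M, BLK_K, PIPE) in smem
                                 RmemTensorA&       tCrA, // (MMA, MMA_M, MMA_K) in rmem
                                 int k_pipe, int thread_idx)
{
  Copy_Atom<SM75_U32x4_LDSM_N, half_t> s2r_atom_A;          // ldmatrix.x4
  auto s2r_copy_A = make_tiled_copy_A(s2r_atom_A, tiled_mma);
  auto s2r_thr_A  = s2r_copy_A.get_slice(thread_idx);
  auto tXsA = s2r_thr_A.partition_S(sA);                    // (CPY, CPY_M, CPY_K, PIPE)
  auto tXrA = s2r_thr_A.retile_D(tCrA);                     // (CPY, CPY_M, CPY_K)
  copy(s2r_copy_A, tXsA(_,_,_,k_pipe), tXrA);               // ldmatrix instead of SIMT loads
}

An equivalent make_tiled_copy_B / retile_D pair covers the B operand.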
I updated my project these days. I have included a ...
The extra tiling of ... The best performance I got on A100 with the FP16 datatype after autotuning is ... I am running out of ideas for further optimization at this point. If anyone can take a look at my code and figure out what the next optimization could be, I would greatly appreciate it!!
I recall something about ... Regardless, to give you an example of the same SM80 kernel using LDSM, I've attached my CuTe example for ...
Thanks a lot for your quick response!! I noticed a class called ... What is it
doing? Regarding those LDSM copy atoms, I think their naming convention follows ..., as I noticed the definitions of those layouts in copy_traits_sm75.hpp. And I am confused about the meaning of ...
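For what it's worth, I read the U32xN / U16xN prefix as the number of 32-bit or 16-bit values each thread receives per ldmatrix instruction, and the _N / _T suffix as the non-transposed vs. transposed (.trans) form. A small way to inspect the source/destination layouts is to print the atoms directly (assuming print() accepts a Copy_Atom, as in recent CuTe):

#include <cute/tensor.hpp>
#include <cstdio>

using namespace cute;

int main() {
  // Compare the thread/value layouts of the x1 and x4 non-transposed
  // ldmatrix atoms declared in copy_traits_sm75.hpp.
  print(Copy_Atom<SM75_U32x1_LDSM_N, half_t>{}); printf("\n");
  print(Copy_Atom<SM75_U32x4_LDSM_N, half_t>{}); printf("\n");
}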
The example you provided currently reaches 150 TFLOPS on A100, while cuBLAS gets 300 TFLOPS. Do you think it can be further improved?
I found this high-performance implementation of GEMM using CuTe. The author wrote a series of tutorials on CuTe (in Chinese) on Zhihu, and in one of them claims that this implementation reaches cuBLAS-level performance on an RTX 3090. I am not sure whether the same holds on A100, as I currently don't have access to one. I think this tutorial series is a very good complement to the official CuTe documentation.
I tried that before; it looks like there are compilation issues because its interface no longer matches the latest CUTLASS/CuTe.
@ccecka's code actually works in exactly the same way as that code base. I believe all of the optimizations in that code base have been adopted into @ccecka's code, except for the epilogue that stores the computed result back to global memory.
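For reference, a minimal sketch (assumed names, mirroring the sgemm_sm80 example rather than that code base's smem-staged epilogue) of a direct epilogue that writes the accumulator straight back to global memory:

#include <cute/tensor.hpp>

using namespace cute;

// Each thread writes its accumulator fragment back to its partition of global
// C with axpby. The epilogue optimization mentioned above would instead stage
// the result through shared memory to get coalesced, vectorized global stores.
template <class ThrMma, class AccTensor, class GmemTensorC, class Alpha, class Beta>
__device__ void direct_epilogue(ThrMma const& thr_mma, AccTensor const& tCrC,
                                GmemTensorC& gC, Alpha alpha, Beta beta)
{
  auto tCgC = thr_mma.partition_C(gC);  // (MMA, MMA_M, MMA_N) view of this thread's C
  axpby(alpha, tCrC, beta, tCgC);       // gC = alpha * acc + beta * gC
}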
I am currently encountering a problem understanding the ... To understand the code, I wrote the following example to play with:

#include <cute/tensor.hpp>
#include <cstdio>
#include <iostream>

using namespace cute;

int main() {
half_t* A = new half_t[256*32];
Copy_Atom<SM75_U32x2_LDSM_N, half_t> s2r_atom_a;
TiledMMA mmaC = make_tiled_mma(SM80_16x8x8_F16F16F16F16_TN{},
Layout<Shape<_2,_2>>{}, // 2x2x1 MMA Atoms
Tile<_32,_32,_16>{}); // 32x32x16 Tiled MMA for LDSM
// print_latex(mmaC);
Tensor sA = make_tensor(A, make_layout(make_shape(_64{}, _32{}, _2{})));
Tensor sB = make_tensor(A, make_layout(make_shape(_64{}, _32{}, _2{})));
ThrMMA thr_mma = mmaC.get_slice(0);
Tensor tCrA = thr_mma.partition_fragment_A(sA(_,_,0)); // (MMA,MMA_M,MMA_K)
Tensor tCrB = thr_mma.partition_fragment_B(sB(_,_,0)); // (MMA,MMA_N,MMA_K)
TiledCopy s2r_copy_a = make_tiled_copy_A(s2r_atom_a, mmaC);
ThrCopy s2r_thr_copy_a = s2r_copy_a.get_slice(0);
Tensor tXsA = s2r_thr_copy_a.partition_S(sA); // (CPY,MMA_M,MMA_K,PIPE)
Tensor tXrA = s2r_thr_copy_a.retile_D(tCrA); // (CPY,MMA_M,MMA_K)
printf("tCrA: "); print(tCrA); printf("\n");
printf("tXsA: "); print(tXsA); printf("\n");
printf("tXrA: "); print(tXrA); printf("\n");
printf("\n");
TiledCopy s2r_copy_b = make_tiled_copy_B(s2r_atom_a, mmaC);
ThrCopy s2r_thr_copy_b = s2r_copy_b.get_slice(0);
Tensor tXsB = s2r_thr_copy_b.partition_S(sB); // (CPY,MMA_N,MMA_K,PIPE)
Tensor tXrB = s2r_thr_copy_b.retile_D(tCrB); // (CPY,MMA_N,MMA_K)
printf("tCrB: "); print(tCrB); printf("\n");
printf("tXsB: "); print(tXsB); printf("\n");
printf("tXrB: "); print(tXrB); printf("\n");
}

The MMA atom I used there is a ...
I see, you're concerned about the MMA_K mode. Try something like this instead:

TiledMMA mmaC = make_tiled_mma(SM80_16x8x8_F16F16F16F16_TN{},
               Layout<Shape<_2,_2>>{}, // 2x2x1 MMA Atoms
               Tile<_32,_32>{});       // 32x32x8 Tiled MMA for LDSM

You should be able to expand the M- and N-size of the TiledMMA to accommodate the LDSMs. This will give you finer granularity for interleaving in the mainloop as well. Sorry for the confusion.
Thanks! That solved the problem. I found that CuTe would not stop me from creating the following:

#include <cute/tensor.hpp>

int main() {
using namespace cute;
half_t* A = new half_t[256*32*3];
Copy_Atom<SM75_U32x1_LDSM_N, half_t> s2r_atom_a;
Copy_Atom<AutoVectorizingCopy, half_t> s2r_atom_b;
auto bM = _256{};
auto bN = _256{};
auto bK = _16{};
TiledMMA mmaC = make_tiled_mma(SM80_16x8x8_F16F16F16F16_TN{},
Layout<Shape<_4,_2>>{},
Tile<_32, _32>{}); // 32x32x16 Tiled MMA for LDSM
auto tile_size_m = mmaC.tile_size_mnk<0>();
auto tile_size_n = mmaC.tile_size_mnk<1>();
auto tile_size_k = mmaC.tile_size_mnk<2>();
Tensor sA = make_tensor(A, Layout<Shape <_64,_32, _2>,
Stride<_32, _1,_2048>>{});
ThrMMA thr_mma = mmaC.get_slice(0);
Tensor tCrA = thr_mma.partition_fragment_A(sA(_,_,0)); // (MMA,MMA_M,MMA_K)
print(tCrA);print("\n");
}

CuTe seems to be quite happy about it, while the value of ...
Very much intentional and often used this way. We iterate over the rest modes of the partitioned tensors too.
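For concreteness, a sketch (assumed names, mirroring the sgemm_sm80 mainloop) of what iterating over the MMA_K rest mode looks like:

#include <cute/tensor.hpp>

using namespace cute;

// The partitioned fragments keep their rest modes (MMA_M, MMA_N, MMA_K); the
// mainloop steps through the MMA_K mode and lets gemm() cover MMA_M x MMA_N.
template <class TiledMma, class TA, class TB, class TC>
__device__ void gemm_over_rest_modes(TiledMma const& tiled_mma,
                                     TA const& tCrA,  // (MMA, MMA_M, MMA_K)
                                     TB const& tCrB,  // (MMA, MMA_N, MMA_K)
                                     TC&       tCrC)  // (MMA, MMA_M, MMA_N)
{
  CUTE_UNROLL
  for (int k_block = 0; k_block < size<2>(tCrA); ++k_block) {
    gemm(tiled_mma, tCrA(_,_,k_block), tCrB(_,_,k_block), tCrC);
  }
}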
In this case, how will the MMA be performed? The size of ...
Correct, this is a bug and there should be additional static assertions on construction to prevent incompatible parameters (or override in the case of unnecessary identity Permutations here), like you mention.
Hi, I've just created a small project (link to the project) by modifying the sgemm_sm80 example. What I was doing was trying to make use of the tensor cores for the computation. Unfortunately, when testing on A100, the performance never seems able to reach the peak. Below are the best results I've got from the autotuning process; the performance reached is always a bit less than half of A100's theoretical peak. Any comments on how I can make this better?
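For anyone landing here, a minimal sketch (assuming half_t operands; this is not the linked project's exact configuration) of the tensor-core substitution being described, i.e. replacing the SIMT FMA TiledMMA of sgemm_sm80 with an SM80 MMA atom:

#include <cute/tensor.hpp>
#include <cstdio>

using namespace cute;

int main() {
  // A 2x2 arrangement of the 16x8x16 half-precision HMMA atom gives a
  // 32x16x16 MNK tile computed by 128 threads.
  TiledMMA mma = make_tiled_mma(SM80_16x8x16_F16F16F16F16_TN{},
                                Layout<Shape<_2,_2>>{});
  print(mma); printf("\n");
  // print_latex(mma);  // uncomment to render the thread/value assignment
}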