Replies: 3 comments
Thank you very much for the answer.
If I understand correctly, the rid/cid loops iterate over the 16x8 fragments in two directions, and the idx loop iterates over the two 8x8 fragments within each one.
I see that the fragment of the smem iterator groups multiple 8x8 fragments into one store. Does that mean the granularity of smem access is the same-colored 16x8 fragment?
Also, is the layout different for the Volta tensor core and the SIMT core?
… On Jul 1, 2022, at 6:43 PM, Jin Wang ***@***.***> wrote:
The idx loop is used to iterate over the elements generated by a single thread in a warp fragment. This is determined by the fragment layout of a specific tensor core instruction. Here is an example of a 32x64 accumulator computed by the FP16 tensor core 16x8x16 instruction.
<https://user-images.githubusercontent.com/11183505/176981938-58be9e4a-157e-4325-8fa0-212153a7cd3c.png>
The idx loop is used to iterate over a single 8x8 fragment inside a warp fragment and store the data generated by each thread to shared memory.
It is a bit more complicated than that. Here are the values for rid/cid/idx iterations for the above example with FP16 tensor core 16x8x16 instructions.
Thread block accumulator shape: 32x64
Note that the […] Consequently, when the 16x8 fragments are stored in shared memory by each thread, the data layout will be row-major.
Hello,
I am looking at the fused GEMM example and I am confused about how the accumulator fragment is stored into shared memory.
I know that FragmentIteratorTensorOp loads from the accumulator and TileIteratorTensorOp stores it to shared memory. What confuses me is the code in the EpilogueSmemAccumulator class. What does the idx for-loop mean? Is the fragment of smem storage larger than that of the registers?
Also, it looks like FragmentIteratorTensorOp and TileIteratorTensorOp take no argument about the accumulator layout. How do they work together?
I think all my questions are about how the epilogues deal with the accumulator in general, such as handling the non-contiguous layout among threads.
I would appreciate any help and insight very much.