How to understand EpilogueSmemAccumulator? #551

cctry · 2022-07-02T00:46:38Z


Hello,
I am looking at the fused GEMM example and I am confused about storing the accumulator fragment into shared memory.

I know that FragmentIteratorTensorOp loads from the accumulator and TileIteratorTensorOp stores the result to shared memory.

What confuses me is the following code in the EpilogueSmemAccumulator class.

    CUTLASS_PRAGMA_UNROLL
    for (int rid = 0; rid < AccumulatorFragmentIterator::TileIterations::kRow; ++rid) {
    
      CUTLASS_PRAGMA_UNROLL
      for (int cid = 0; cid < AccumulatorFragmentIterator::TileIterations::kColumn; ++cid) {
  
        using AccumulatorAccessType = typename OutputOp::FragmentAccumulator;
        using FragmentSmemAccessType = typename OutputOp::FragmentOutput;
  
        FragmentSmemAccessType * smem_frag_ptr =  
          reinterpret_cast<FragmentSmemAccessType *>(&tb_frag_smem);
  
        CUTLASS_PRAGMA_UNROLL
        for (int idx = 0; idx < AccumulatorFragmentIterator::kIterationsPerTile; ++idx) {
          frag_iterator_accum.load(tb_frag_accum);
          ++frag_iterator_accum;
  
          AccumulatorAccessType const * accumulator_frag_ptr = 
            reinterpret_cast<AccumulatorAccessType const *>(&tb_frag_accum);
          const int kOutputIterations = FragmentAccumulator::kElements / OutputOp::kCount;
  
          CUTLASS_PRAGMA_UNROLL
          for (int it = 0; it < kOutputIterations; it++) {
            smem_frag_ptr[idx * kOutputIterations + it] = output_op(accumulator_frag_ptr[it]);
          }
        }
  
        smem_iterator.store(tb_frag_smem);
        ++smem_iterator;

      }
    }

What does the idx for-loop mean? Is the shared-memory fragment larger than the register fragment?
Also, it looks like FragmentIteratorTensorOp and TileIteratorTensorOp take no argument describing the accumulator layout. How do they work together?
I think all my questions come down to how epilogues handle accumulators in general, such as the non-contiguous layout across threads.
I would greatly appreciate any help or insight.

jwang323 · 2022-07-02T01:43:28Z


The idx loop iterates over the elements that a single thread produces within a warp fragment. Which elements those are is determined by the fragment layout of the specific tensor core instruction. Here is an example of a 32x64 accumulator computed by the FP16 tensor core 16x8x16 instruction.

[Figure: https://user-images.githubusercontent.com/11183505/176981938-58be9e4a-157e-4325-8fa0-212153a7cd3c.png]

The idx loop iterates over a single 8x8 fragment inside a warp fragment and stores the data generated by each thread to shared memory.
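To make "the elements generated by a single thread" concrete, here is a minimal sketch of the per-thread accumulator layout for one 16x8 MMA output tile, following the m16n8 accumulator fragment layout described in the PTX ISA. The helper name `accumulator_coord` and the `Coord` struct are made up for illustration and are not part of CUTLASS.

```cpp
#include <cassert>

// Position of one accumulator element inside a 16x8 MMA output tile.
struct Coord {
  int row;
  int col;
};

// Hypothetical helper (not a CUTLASS function): maps a lane id (0..31) and a
// per-thread register index (0..3) to the element that register holds.
constexpr Coord accumulator_coord(int lane, int reg) {
  return Coord{
      lane / 4 + 8 * (reg / 2),   // regs 0-1: upper 8x8 half, regs 2-3: lower 8x8 half
      (lane % 4) * 2 + (reg % 2)  // each lane owns two adjacent columns
  };
}

// Thread T0 holds (0,0) and (0,1) in the upper half, (8,0) and (8,1) in the lower half.
static_assert(accumulator_coord(0, 0).row == 0 && accumulator_coord(0, 0).col == 0, "");
static_assert(accumulator_coord(0, 1).row == 0 && accumulator_coord(0, 1).col == 1, "");
static_assert(accumulator_coord(0, 2).row == 8 && accumulator_coord(0, 2).col == 0, "");
static_assert(accumulator_coord(0, 3).row == 8 && accumulator_coord(0, 3).col == 1, "");
```

So each thread's four values in a 16x8 tile are split between two 8x8 halves, which is why a per-thread fragment is not contiguous in the output matrix.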


cctry (Author) · 2022-07-03T04:22:09Z

Thank you very much for the answer. If I understand correctly, the rid/cid loops iterate over the 16x8 fragments in two directions, and the idx loop iterates over the two 8x8 fragments within each one. I see that the smem iterator's fragment groups multiple 8x8 fragments into a single store. Does that mean the granularity of a smem access is one same-colored 16x8 fragment? Also, is the layout different for Volta tensor cores and SIMT cores?


jwang323 · 2022-07-05T20:59:42Z


It is a bit more complicated than that. Here are the rid/cid/idx iteration counts for the example above with FP16 tensor core 16x8x16 instructions.

Thread block accumulator shape: 32x64
Warp accumulator shape: 32x16
Smem iterator (smem_iterator) tile: 8x16 (2 consecutive 8x8 tiles from the same row, e.g. an orange 8x8 tile + a pink 8x8 tile)
rid iteration count: 4 (each iteration covers one 8x16 tile)
cid iteration count: 1 (only one iteration is needed in the column dimension)
idx iteration count: 1 (only one iteration is needed to load the 8x16 tile)
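The counts above can be reproduced with simple compile-time arithmetic. This is only a sketch: `Shape` is a hypothetical stand-in (not `cutlass::MatrixShape`), and it assumes the rid/cid counts are just the warp accumulator shape divided by the smem tile shape, which matches the numbers quoted here but may not be how CUTLASS derives them internally.

```cpp
#include <cassert>

// Hypothetical stand-in for a compile-time 2D shape.
template <int Rows, int Columns>
struct Shape {
  static constexpr int kRow = Rows;
  static constexpr int kColumn = Columns;
};

using WarpAccumulatorShape = Shape<32, 16>;  // accumulator tile owned by one warp
using SmemTileShape = Shape<8, 16>;          // tile written by one smem_iterator.store()

// Assumption: iteration counts are the ratio of the two shapes.
constexpr int kRidIterations = WarpAccumulatorShape::kRow / SmemTileShape::kRow;
constexpr int kCidIterations = WarpAccumulatorShape::kColumn / SmemTileShape::kColumn;

// Elements each of the 32 lanes contributes to one 8x16 tile.
constexpr int kElementsPerThreadPerTile =
    SmemTileShape::kRow * SmemTileShape::kColumn / 32;

static_assert(kRidIterations == 4, "four 8x16 tiles stacked along the rows");
static_assert(kCidIterations == 1, "one tile spans all 16 columns");
static_assert(kElementsPerThreadPerTile == 4, "four elements per thread per tile");
```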

Note that frag_iterator_accum loads one 8x16 fragment per idx iteration, but the four elements each thread needs come from two non-consecutive pairs of registers. For example, thread T0 needs two elements from an orange 8x8 fragment followed by two elements from the pink 8x8 fragment in the same row, whereas consecutive registers hold the four elements from the orange 16x8 fragment followed by the four elements from the blue 16x8 fragment.

Consequently, when each thread stores the 16x8 fragments to shared memory, the resulting data layout is row-major.
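The non-consecutive register access can be sketched as an index calculation. This is an illustration under stated assumptions, not the CUTLASS implementation: it assumes the 32x16 warp tile is covered by 2x2 MMA tiles of 16x8, that each thread keeps 4 registers per MMA tile, and that MMA tiles are laid out in registers with the row index varying fastest. The function name `gather_registers` is hypothetical.

```cpp
#include <array>
#include <cassert>

// Assumed warp-level arrangement (illustrative only).
constexpr int kMmaRows = 2;     // 16x8 MMA tiles along the rows of the 32x16 warp tile
constexpr int kMmaColumns = 2;  // 16x8 MMA tiles along the columns
constexpr int kRegsPerMma = 4;  // accumulator registers per thread per MMA tile

// Register indices one thread reads to build its part of the 8x16 tile
// handled by iteration `rid` (0..3).
inline std::array<int, 4> gather_registers(int rid) {
  int mma_m = rid / 2;  // which 16-row MMA tile the 8-row slice falls in
  int half = rid % 2;   // upper (0) or lower (1) 8x8 half of that tile
  std::array<int, 4> regs{};
  for (int mma_n = 0; mma_n < kMmaColumns; ++mma_n) {
    // Assumption: MMA tiles are ordered with the row index fastest.
    int base = (mma_m + mma_n * kMmaRows) * kRegsPerMma + half * 2;
    regs[mma_n * 2 + 0] = base;
    regs[mma_n * 2 + 1] = base + 1;
  }
  return regs;
}
```

Under these assumptions, `gather_registers(0)` yields registers 0, 1, 8, 9: two values from the left 8x8 tile and two from the 8x8 tile to its right in the same rows, skipping the registers in between.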
