Please help me to understand how TiledCopy works.
In the example at cutlass/examples/cute/tutorial/tiled_copy.cu (line 217 in cc3c29a):
I see that we declare a thr_layout in the form of {32, 8} and a vec_layout in the form of {4, 1}, while the block_shape (line 182) is declared as {128, 64}. I take this to mean that each thread block is responsible for moving a {128, 64} tile of data. However, we only have 32*8 = 256 threads, each processing 4 elements, so it seems that a single copy operation can move only 1024 elements at a time, and some looping must be performed to move the entire {128, 64} block.
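To make the arithmetic above concrete, here is a minimal sketch in plain C++ of how I am counting. The function names are hypothetical and are not part of CuTe. The sketch only restates my assumption that one copy moves (number of threads) x (elements per thread) elements, and it says nothing about how copy(tiled_copy, ...) actually iterates.

```cpp
// Hypothetical helpers that restate the counting in the question.
// None of these are CuTe functions.

// Elements moved by one copy, assuming every thread moves its full vector.
constexpr int elements_per_copy(int thr_m, int thr_n, int vec_m, int vec_n) {
    return (thr_m * thr_n) * (vec_m * vec_n);
}

// Total elements in a 2-D block of shape {m, n}.
constexpr int elements_per_block(int m, int n) {
    return m * n;
}

// Number of copies needed to cover the block (rounded up).
constexpr int passes_needed(int block_elems, int copy_elems) {
    return (block_elems + copy_elems - 1) / copy_elems;
}

// thr_layout {32, 8}, vec_layout {4, 1}, block_shape {128, 64}
static_assert(elements_per_copy(32, 8, 4, 1) == 1024, "256 threads x 4 elements");
static_assert(elements_per_block(128, 64) == 8192, "128 x 64 block");
```

By this count, the {128, 64} block would need 8192 / 1024 = 8 passes, which is why I expected to see an explicit loop.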
But why does the code only call copy(tiled_copy, thr_tile_S, fragment) once? Does this mean that copy(tiled_copy, ...) internally copies the data in a loop? If so, could you explain how this loop is implemented, and how its number of iterations is determined?
Additionally, since there are 256 threads in my thread block, could you tell me why thr_layout is defined as {32, 8} instead of other forms like {8, 32}, {256, 1}, etc.? How do these different forms affect tiled_copy?
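To show what I mean by "different forms", here is a small plain C++ sketch comparing the three thread layouts. It rests on an assumption I have not confirmed: that the region covered by one copy is the elementwise product of the thr_layout shape and the vec_layout shape. The Extent type and both functions are hypothetical and are not CuTe types or calls.

```cpp
// Hypothetical 2-D extent {m, n}; not a CuTe type.
struct Extent {
    int m;
    int n;
};

// Assumed footprint of one copy: thread-layout shape times vector-layout
// shape, mode by mode. This is my guess, not documented CuTe behaviour.
constexpr Extent footprint(Extent thr, Extent vec) {
    return Extent{thr.m * vec.m, thr.n * vec.n};
}

// True when the footprint evenly divides the block in both modes.
constexpr bool tiles_evenly(Extent block, Extent fp) {
    return block.m % fp.m == 0 && block.n % fp.n == 0;
}

// With vec_layout {4, 1}:
//   thr_layout {32, 8}  -> footprint {128, 8}
//   thr_layout {8, 32}  -> footprint {32, 32}
//   thr_layout {256, 1} -> footprint {1024, 1}
static_assert(footprint(Extent{32, 8}, Extent{4, 1}).m == 128, "{32, 8} spans 128 rows");
static_assert(footprint(Extent{32, 8}, Extent{4, 1}).n == 8, "{32, 8} spans 8 columns");
```

Under that assumption, {32, 8} and {8, 32} both divide the {128, 64} block evenly, while {256, 1} gives a {1024, 1} footprint that is taller than the block. I do not know how tiled_copy handles that last case, which is part of what I am asking.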