What is your question?
In sample code 46, it seems that cp.async is used to copy data from device (global) memory to shared memory.
Each thread calculates the dst address in the outer loop and the src address in the inner loop, then issues the async copy requests.
The inner loop is always fully unrolled. The outer loop is also fully unrolled for kFixedStrideDilation, while for kOptimized the outer loop is not unrolled.
Is the above understanding correct?
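To make the description concrete, here is a rough host-side C++ model of the loop structure as I understand it. It is only a sketch under assumptions: `CopyRequest`, `issue_copies`, and all of the parameters are hypothetical names rather than CUTLASS identifiers, the `-1` sentinel stands in for a request that is predicated off because the source row is out of bounds, and the real kernel issues cp.async on the device instead of recording offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record of one copy request (not a CUTLASS type).
struct CopyRequest {
    std::size_t dst_offset;  // offset into the shared-memory tile
    long src_offset;         // offset into global memory, or -1 if out of bounds
};

// Models the requests a single thread would issue for a tile of
// rows x vecs_per_row vector accesses. All parameters are illustrative.
std::vector<CopyRequest> issue_copies(int rows, int vecs_per_row,
                                      int row_start, int src_rows,
                                      int src_pitch, int smem_pitch) {
    assert(rows >= 0 && vecs_per_row >= 0);
    std::vector<CopyRequest> reqs;
    // Outer loop: the dst (shared-memory) row address is computed once per row.
    for (int r = 0; r < rows; ++r) {
        std::size_t dst_row = static_cast<std::size_t>(r) * smem_pitch;
        int src_r = row_start + r;
        // Assumed out-of-bounds predicate on the source row.
        bool row_valid = src_r >= 0 && src_r < src_rows;
        // Inner loop: the src (global-memory) address is computed per access.
        for (int v = 0; v < vecs_per_row; ++v) {
            long src = row_valid ? static_cast<long>(src_r) * src_pitch + v : -1;
            // Stands in for issuing one cp.async request.
            reqs.push_back({dst_row + v, src});
        }
    }
    return reqs;
}
```

In this model the address arithmetic and the bounds check are done by every thread for every request, which is the per-thread overhead the question is about.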
For the Hopper architecture, using TMA (i.e. cp.async.bulk) should give better performance than cp.async, and TMA could completely replace cp.async for depthwise conv.
Will TMA be used for depthwise conv in the future? @Ethan-Yan27
TMA can generate addresses and handle OOB (out-of-bounds) accesses, so it is a perfect solution for reducing these two overheads. I strongly hope it will be enabled for depthwise conv.