[QST] depthwise conv operand copy #1253

Closed
yupatrick22 opened this issue Dec 7, 2023 · 3 comments

Comments

@yupatrick22

yupatrick22 commented Dec 7, 2023

What is your question?
In sample code 46, it seems that cp.async is used to copy data from device memory to shared memory.
[image: Untitled]
Each thread calculates the dst address in the outer loop and the src address in the inner loop, then issues the async copy requests.
The inner loop is always fully unrolled. The outer loop is also fully unrolled for kFixedStrideDilation, while for kOptimized the outer loop is not unrolled.

Is the above understanding correct?
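
The loop structure described above can be modeled on the host as a rough sketch. This is not the CUTLASS iterator code: `issue_copy`, `load_tile`, and the row-major layout are hypothetical names and assumptions chosen for illustration, and a plain element copy stands in for a cp.async request.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for one cp.async request: copies a single element
// from "global" memory to "shared" memory. Device code would issue an
// asynchronous copy here instead of a plain store.
inline void issue_copy(float* smem_dst, const float* gmem_src) {
    *smem_dst = *gmem_src;
}

// Outer loop: computes the destination (shared memory) base address.
// Inner loop: computes the source (global memory) address and issues the copy.
// Both trip counts are template parameters, so a compiler is free to unroll
// both loops completely.
template <int kOuter, int kInner>
void load_tile(float* smem, const float* gmem, std::size_t gmem_row_stride) {
    assert(smem != nullptr && gmem != nullptr);
    for (int o = 0; o < kOuter; ++o) {
        float* dst_row = smem + o * kInner;  // dst address, outer loop
        for (int i = 0; i < kInner; ++i) {
            const float* src = gmem + o * gmem_row_stride + i;  // src address, inner loop
            issue_copy(dst_row + i, src);
        }
    }
}
```

Calling `load_tile<3, 4>(smem, gmem, 8)` would copy a 3x4 tile out of a source whose rows are 8 elements apart, packing it densely into the destination buffer.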

On the Hopper architecture, using TMA (i.e. cp.async.bulk) will give better performance than cp.async, and TMA can completely replace cp.async for depthwise conv.

Will TMA be used for depthwise conv in the future?
@Ethan-Yan27

@hwu36
Collaborator

hwu36 commented Dec 7, 2023

@Ethan-Yan27

@Ethan-Yan27
Collaborator

Yes. For kOptimized, the number of load iterations is a runtime value, so the compiler generates a runtime branch.
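
The difference can be sketched in host C++ as follows. The names `load_fixed`, `load_runtime`, and `kInnerCopies` are made up for this illustration and do not come from CUTLASS. The first variant has a compile-time outer trip count, while the second receives it as a function argument and therefore keeps a loop with a runtime exit condition.

```cpp
#include <cassert>

// Inner trip count: a compile-time constant in both variants.
constexpr int kInnerCopies = 4;

// One inner-loop pass: issues kInnerCopies copies and returns how many were issued.
inline int copy_group(float* dst, const float* src) {
    for (int i = 0; i < kInnerCopies; ++i) {
        dst[i] = src[i];
    }
    return kInnerCopies;
}

// Outer trip count known at compile time: the loop can be fully unrolled.
template <int kOuterIterations>
int load_fixed(float* dst, const float* src) {
    int issued = 0;
    for (int o = 0; o < kOuterIterations; ++o) {
        issued += copy_group(dst + o * kInnerCopies, src + o * kInnerCopies);
    }
    return issued;
}

// Outer trip count known only at run time: the loop keeps a runtime branch.
int load_runtime(float* dst, const float* src, int outer_iterations) {
    assert(outer_iterations >= 0);
    int issued = 0;
    for (int o = 0; o < outer_iterations; ++o) {
        issued += copy_group(dst + o * kInnerCopies, src + o * kInnerCopies);
    }
    return issued;
}
```

Both variants copy the same data for the same trip count. Only the point at which the trip count becomes known differs, and that is what decides whether full unrolling is possible.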

Regarding the TMA feature, as far as I know, there are no plans yet. If you are interested, feel free to contribute code and compare performance.

@yupatrick22
Author

TMA can generate addresses and handle OOB accesses, so it is a perfect solution for reducing these two overheads. I really hope it can be enabled for depthwise conv.
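
To make the two overheads concrete, here is a minimal host-side sketch of the per-element work a thread does in software when it loads a row with padding. The function `load_row_with_halo` and its parameters are hypothetical, and zero-fill for out-of-bounds elements is an assumption that models convolution padding.

```cpp
#include <cassert>

// Loads `count` elements starting at coordinate `start` from a row of `width`
// elements. Coordinates outside [0, width) are out of bounds and are zero-filled.
void load_row_with_halo(float* dst, const float* src, int width, int start, int count) {
    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        int x = start + i;                               // overhead 1: address generation
        bool in_bounds = (x >= 0) && (x < width);        // overhead 2: OOB predicate
        dst[i] = in_bounds ? src[x] : 0.0f;              // guarded copy or zero-fill
    }
}
```

In this model every element costs one coordinate computation and one bounds check, which is the work the comment above hopes a hardware copy engine would take over.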
