[QST] depthwise conv operand copy #1253

yupatrick22 · 2023-12-07T05:01:52Z

What is your question?
In sample code 46, it seems that cp.async is used to copy data from device memory to shared memory.

Each thread calculates the dst address in the outer loop and the src address in the inner loop, then issues the async copy requests.
The inner loop is always fully unrolled. The outer loop is also fully unrolled for kFixedStrideDilation, but not for kOptimized.

Is the above understanding correct?

On the Hopper architecture, using TMA (i.e. cp.async.bulk) should give better performance than cp.async, and TMA could completely replace cp.async for depthwise conv.

Will TMA be used for depthwise conv in the future?
@Ethan-Yan27

hwu36 · 2023-12-07T05:04:48Z

@Ethan-Yan27

Ethan-Yan27 · 2023-12-07T10:44:53Z

Yes. For kOptimized, the number of load iterations is a runtime value, so the compiler cannot fully unroll the outer loop and generates a runtime branch instead.

Regarding the TMA feature, as far as I know, there are no plans yet. If you are interested, feel free to contribute code and compare performance.

yupatrick22 · 2023-12-07T13:08:34Z

TMA can generate addresses and handle out-of-bounds (OOB) accesses, so it is a perfect solution for reducing these two overheads. I really hope it can be enabled in depthwise conv.
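
To make the two overheads concrete, here is a minimal host-side sketch of the per-element work a software copy loop performs: computing the source coordinate and testing it against the tensor bounds before each load. The function name, the `TileStats` struct, and the single-channel H x W layout are assumptions for illustration, not CUTLASS code. The poster's point is that TMA would take over both the address computation and the bounds check.

```cpp
#include <vector>

// Hypothetical counters, only so the sketch has something to report.
struct TileStats {
    int in_bounds;
    int zero_filled;
};

// Loads a tile_h x tile_w tile whose top-left corner sits at (h0, w0) in an
// H x W image. Coordinates outside the image are filled with zero, which is
// how padding is typically realized in a convolution operand load.
TileStats load_tile_zero_fill(const std::vector<float>& image, int H, int W,
                              int h0, int w0, int tile_h, int tile_w,
                              std::vector<float>& tile) {
    tile.assign(static_cast<std::size_t>(tile_h) * tile_w, 0.0f);
    TileStats stats{0, 0};
    for (int r = 0; r < tile_h; ++r) {
        for (int c = 0; c < tile_w; ++c) {
            // Overhead 1: address generation for every element.
            const int h = h0 + r;
            const int w = w0 + c;
            // Overhead 2: out-of-bounds predicate for every element.
            const bool ok = (h >= 0 && h < H && w >= 0 && w < W);
            if (ok) {
                tile[r * tile_w + c] = image[h * W + w];
                ++stats.in_bounds;
            } else {
                ++stats.zero_filled;  // destination already holds zero
            }
        }
    }
    return stats;
}
```

For example, a 3 x 3 tile anchored at (-1, -1) on a 4 x 4 image touches only four valid pixels; the remaining five destinations are zero-filled.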

yupatrick22 added the "? - Needs Triage" and "question" labels Dec 7, 2023

yupatrick22 closed this as completed Dec 7, 2023
