
UNet Shallow host read-back performance is slow #12837

Closed
Tracked by #12857
esmalTT opened this issue Sep 18, 2024 · 6 comments
Comments

@esmalTT
Contributor

esmalTT commented Sep 18, 2024

Summary

Based on what we are seeing in the Tracy profile, UNet spends a long time in the host read-back that happens when we call synchronize-device. The read-back memcpy could be improved: it looks like we issue reads/writes per core, which is not ideal since we are generating one command per core.
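To illustrate why one read command per core can dominate, here is a rough cost model. This is purely illustrative and not the actual tt-metal dispatch code; the per-command overhead and core count are made-up assumptions.

```python
def readback_time_us(total_bytes, n_commands,
                     per_cmd_overhead_us=5.0, bw_bytes_per_s=2e9):
    """Estimate read-back time as fixed per-command overhead plus transfer time.

    per_cmd_overhead_us and bw_bytes_per_s are illustrative assumptions,
    not measured dispatch numbers.
    """
    transfer_us = total_bytes / bw_bytes_per_s * 1e6
    return n_commands * per_cmd_overhead_us + transfer_us

# Hypothetical sharded output read back with one command per core
# (64 cores assumed) vs. a single batched read of the same bytes:
per_core_us = readback_time_us(21_626_880, n_commands=64)
batched_us = readback_time_us(21_626_880, n_commands=1)
```

Under these assumptions the batched read saves only the fixed command overhead; the model mainly shows that per-command overhead scales with core count while transfer time does not.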

@tt-aho
Contributor

tt-aho commented Sep 18, 2024

Looking at the trace, we estimate that dispatch takes ~2 ms to finish issuing writes to the hugepage.

The reader thread is currently the bottleneck on UNet E2E perf. There are two potential issues: (1) the reader thread loops through commands, and for sharded tensors we generate one read-back command per core; (2) our read-back logic itself could be suboptimal. We need to check the expected bandwidth of read-back from the hugepage; if we are below it, then that is something we need to optimize. Otherwise, it may simply be that the output is too large, in which case the fix is to reduce the padding.

(Tracy trace screenshots attached in the original comment)

@tt-aho
Contributor

tt-aho commented Sep 19, 2024

Readback perf from hugepage to host buffer is ~2GB/s.

@pgkeller fyi this is the FD readback performance issue.

UNet folks are also pursuing another change to improve readback by reducing the amount of data needing to be read #12705. We can determine how urgent improving FD readback is by estimating if the smaller amount of data at 2GB/s is enough to reduce the host bottleneck.

@tt-aho
Contributor

tt-aho commented Sep 20, 2024

Currently the output is 21,626,880 B, which at 2 GB/s takes ~10-11 ms to read. If we were able to remove all padding, we would only need to read 675,840 B, which at 2 GB/s would take ~0.3 ms. I think that should be sufficient to remove the host bottleneck, so improving read-back from the hugepage may not be as high priority.
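A quick sanity check of the arithmetic above, taking 2 GB/s as 2e9 bytes/s:

```python
BW = 2e9  # ~2 GB/s hugepage-to-host read-back, from the measurement above

padded_bytes = 21_626_880
unpadded_bytes = 675_840

padded_ms = padded_bytes / BW * 1e3      # ~10.8 ms, matching the ~10-11 ms estimate
unpadded_ms = unpadded_bytes / BW * 1e3  # ~0.34 ms, matching the ~0.3 ms estimate
ratio = padded_bytes / unpadded_bytes    # padding inflates the read-back 32x
```

Notably, the padded output is exactly 32x the unpadded size, so removing padding shrinks the read by the same factor.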

@esmalTT
Contributor Author

esmalTT commented Sep 20, 2024

> Currently output is 21,626,880B, and at 2GB/s takes ~10-11ms to read. If we were able to remove all padding, then we would only need to read 675,840B, and at 2GB/s would take ~0.3ms, which I think should be sufficient to remove the host bottleneck, so improving readback from hugepage may not be as high priority.

@tt-aho I agree - this is lower priority than #12896, #12705. Once the padding is removed, we can re-assess the overhead and maybe address this. I'll make this P1 for now.

@esmalTT esmalTT added P1 and removed P0 labels Sep 20, 2024
@pgkeller
Contributor

what's the status on this? do we have more optimization to do here?

@esmalTT
Contributor Author

esmalTT commented Jan 30, 2025

> what's the status on this? do we have more optimization to do here?

@pgkeller I think we can close this. Nigel’s recent changes show good enough R/W speeds to meet our 2000 fps goal. See here: #12961 (comment)
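As a sanity check on the 2000 fps goal, assuming the ~2 GB/s read-back rate and the unpadded 675,840 B output from earlier in the thread (whether the goal budgets the whole frame to read-back is my assumption):

```python
FPS_GOAL = 2000
frame_budget_us = 1e6 / FPS_GOAL           # 500 us per frame at 2000 fps

BW = 2e9                                   # ~2 GB/s hugepage read-back (measured above)
unpadded_bytes = 675_840
readback_us = unpadded_bytes / BW * 1e6    # ~338 us

fits_in_budget = readback_us < frame_budget_us
```

Under these assumptions the unpadded read-back fits inside the per-frame budget, consistent with closing the issue.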

@esmalTT esmalTT closed this as completed Jan 30, 2025