
UNet Shallow host read-back performance is slow #12837

Closed
Tracked by #12857
esmalTT opened this issue Sep 18, 2024 · 6 comments
Comments

@esmalTT
Contributor

esmalTT commented Sep 18, 2024

Summary

Based on what we are seeing in the Tracy profile, UNet spends a long time in the host read-back that happens when we call synchronize-device. The read-back memcpy could be improved: it looks like we issue reads/writes per core, which is not ideal since we are generating one command per core.
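To illustrate why one read command per core can dominate, here is a rough cost model. This is purely illustrative and not the actual tt-metal dispatch code; the per-command overhead and core count are made-up assumptions.

```python
def readback_time_us(total_bytes, n_commands,
                     per_cmd_overhead_us=5.0, bw_bytes_per_s=2e9):
    """Estimate read-back time as fixed per-command overhead plus transfer time.

    per_cmd_overhead_us and bw_bytes_per_s are illustrative assumptions,
    not measured dispatch numbers.
    """
    transfer_us = total_bytes / bw_bytes_per_s * 1e6
    return n_commands * per_cmd_overhead_us + transfer_us

# Hypothetical sharded output read back with one command per core
# (64 cores assumed) vs. a single batched read of the same bytes:
per_core_us = readback_time_us(21_626_880, n_commands=64)
batched_us = readback_time_us(21_626_880, n_commands=1)
```

Under these assumptions the batched read saves only the fixed command overhead; the model mainly shows that per-command overhead scales with core count while transfer time does not.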

@tt-aho
Contributor

tt-aho commented Sep 18, 2024

Looking at the trace, we estimate that dispatch takes ~2 ms to finish issuing writes to the hugepage.

The reader thread is currently the bottleneck on UNet E2E perf. There are two potential issues: (1) the reader thread loops through commands, and for sharded tensors we generate one read-back command per core; (2) our read-back logic itself could be suboptimal. We need to check the expected bandwidth of read-back from the hugepage; if we are below it, then that is something we need to optimize. Otherwise, it may simply be that the output is too large, in which case the fix is to reduce the padding.

(Tracy trace screenshots attached in the original comment)

@tt-aho
Contributor

tt-aho commented Sep 19, 2024

Readback perf from hugepage to host buffer is ~2GB/s.

@pgkeller fyi this is the FD readback performance issue.

UNet folks are also pursuing another change to improve readback by reducing the amount of data needing to be read #12705. We can determine how urgent improving FD readback is by estimating if the smaller amount of data at 2GB/s is enough to reduce the host bottleneck.

@tt-aho
Contributor

tt-aho commented Sep 20, 2024

Currently the output is 21,626,880 B, which at 2 GB/s takes ~10-11 ms to read. If we were able to remove all padding, we would only need to read 675,840 B, which at 2 GB/s would take ~0.3 ms. I think that should be sufficient to remove the host bottleneck, so improving read-back from the hugepage may not be as high priority.
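A quick sanity check of the arithmetic above, taking 2 GB/s as 2e9 bytes/s:

```python
BW = 2e9  # ~2 GB/s hugepage-to-host read-back, from the measurement above

padded_bytes = 21_626_880
unpadded_bytes = 675_840

padded_ms = padded_bytes / BW * 1e3      # ~10.8 ms, matching the ~10-11 ms estimate
unpadded_ms = unpadded_bytes / BW * 1e3  # ~0.34 ms, matching the ~0.3 ms estimate
ratio = padded_bytes / unpadded_bytes    # padding inflates the read-back 32x
```

Notably, the padded output is exactly 32x the unpadded size, so removing padding shrinks the read by the same factor.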

@esmalTT
Contributor Author

esmalTT commented Sep 20, 2024

> Currently output is 21,626,880B, and at 2GB/s takes ~10-11ms to read. If we were able to remove all padding, then we would only need to read 675,840B, and at 2GB/s would take ~0.3ms, which I think should be sufficient to remove the host bottleneck, so improving readback from hugepage may not be as high priority.

@tt-aho I agree - this is lower priority than #12896, #12705. Once the padding is removed, we can re-assess the overhead and maybe address this. I'll make this P1 for now.

@esmalTT esmalTT added P1 and removed P0 labels Sep 20, 2024
@pgkeller
Contributor

what's the status on this? do we have more optimization to do here?

@esmalTT
Contributor Author

esmalTT commented Jan 30, 2025

> what's the status on this? do we have more optimization to do here?

@pgkeller I think we can close this. Nigel’s recent changes show good enough R/W speeds to meet our 2000 fps goal. See here: #12961 (comment)
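As a sanity check on the 2000 fps goal, assuming the ~2 GB/s read-back rate and the unpadded 675,840 B output from earlier in the thread (whether the goal budgets the whole frame to read-back is my assumption):

```python
FPS_GOAL = 2000
frame_budget_us = 1e6 / FPS_GOAL           # 500 us per frame at 2000 fps

BW = 2e9                                   # ~2 GB/s hugepage read-back (measured above)
unpadded_bytes = 675_840
readback_us = unpadded_bytes / BW * 1e6    # ~338 us

fits_in_budget = readback_us < frame_budget_us
```

Under these assumptions the unpadded read-back fits inside the per-frame budget, consistent with closing the issue.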

@esmalTT esmalTT closed this as completed Jan 30, 2025