#14406: Add perf test for reduce scatter #14838

Merged (3 commits) on Nov 8, 2024

@Aswinmcw Aswinmcw commented Nov 7, 2024

Ticket

#14406

What's changed

Adds perf tests for reduce scatter on T3k (ring and line) and N300 (ring and line).

[Image: Screenshot 2024-11-08 at 11 33 08 AM]

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • New/Existing tests provide coverage for changes

@SeanNijjar
Contributor

Looks good overall, but two things look potentially off here. I'm not sure which data format is being used for each test (can you please add that?), and I think the op BW equation is bugged. I'll revisit that today and get back to you with the right equation, because op BW should always be >= link BW and right now it is much lower.

@SeanNijjar
Contributor

I double-checked the equation in the issue for ring reduce scatter op bandwidth, and it was incorrect. I corrected it to input_tensor_volume / longest_device_fw_time for the "per chip" op bandwidth. Total op BW is a little squishy for the full cluster. That may eventually end up being a more useful measurement, but the per-chip number is useful to track right now (perhaps a later iteration of this work can also express cluster-level op BW).
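As a rough illustration of the corrected per-chip formula, here is a minimal Python sketch. It is not the code from this PR: the function name `per_chip_op_bandwidth` is hypothetical, and it assumes that the tensor volume is given in bytes, that firmware times are in nanoseconds (so bytes/ns reads directly as GB/s), and that `longest_device_fw_time` means the maximum over the per-device measurements.

```python
from typing import Sequence


def per_chip_op_bandwidth(input_tensor_volume_bytes: int,
                          device_fw_times_ns: Sequence[float]) -> float:
    """Per-chip op bandwidth in GB/s (bytes per nanosecond).

    Assumption: longest_device_fw_time is the max of the per-device
    firmware times; the units (bytes, ns) are also an assumption.
    """
    if not device_fw_times_ns:
        raise ValueError("need at least one device firmware time")
    longest_device_fw_time = max(device_fw_times_ns)
    if longest_device_fw_time <= 0:
        raise ValueError("firmware time must be positive")
    # input_tensor_volume / longest_device_fw_time
    return input_tensor_volume_bytes / longest_device_fw_time


# Example with made-up numbers: 1,000,000 bytes, slowest device takes 500 ns.
print(per_chip_op_bandwidth(1_000_000, [400.0, 500.0]))
```

Using the slowest device in the denominator means one lagging chip lowers the reported number, which is the conservative choice for a per-chip metric.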

Contributor

@SeanNijjar SeanNijjar left a comment


Approving now, pending updates to the BW calculations:

For ring reduce scatter, op BW should match line reduce scatter: input_tensor_volume / longest_device_fw_time

Also, I realized the reduce scatter eth BW is incorrect. Since we send/receive the full input tensor volume through each chip's Ethernet, we should just compute input_tensor_volume / longest_erisc_fw_time
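The Ethernet bandwidth formula above can be sketched the same way. This is a hedged example, not the PR's implementation: `tensor_volume_bytes` and `eth_bandwidth` are hypothetical helper names, the volume is assumed to be the element count times a fixed element size in bytes (which is why the data format of each test matters), and times are assumed to be in nanoseconds.

```python
import math
from typing import Sequence


def tensor_volume_bytes(shape: Sequence[int], element_size_bytes: int) -> int:
    """Input tensor volume in bytes, assuming a fixed-size element format."""
    return math.prod(shape) * element_size_bytes


def eth_bandwidth(input_tensor_volume_bytes: int,
                  longest_erisc_fw_time_ns: float) -> float:
    """Ethernet bandwidth in GB/s (bytes per nanosecond).

    Uses the full input tensor volume, since the full volume is sent and
    received through each chip's Ethernet.
    """
    if longest_erisc_fw_time_ns <= 0:
        raise ValueError("firmware time must be positive")
    # input_tensor_volume / longest_erisc_fw_time
    return input_tensor_volume_bytes / longest_erisc_fw_time_ns


# Example with made-up numbers: a (1, 1, 32, 4096) tensor of 2-byte elements.
volume = tensor_volume_bytes((1, 1, 32, 4096), 2)
print(volume, eth_bandwidth(volume, 65536.0))
```

Keeping the volume computation separate from the bandwidth division makes it easy to report the data format next to each measurement.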

@Aswinmcw Aswinmcw force-pushed the Aswinmcw/ccl_reduce_scatter_perf branch from a263609 to c43d160 on November 8, 2024 06:06
@Aswinmcw Aswinmcw force-pushed the Aswinmcw/ccl_reduce_scatter_perf branch from c43d160 to fefe768 on November 8, 2024 06:07
@Aswinmcw Aswinmcw marked this pull request as ready for review November 8, 2024 07:14
@Aswinmcw Aswinmcw requested a review from jvegaTT as a code owner November 8, 2024 07:14
@Aswinmcw Aswinmcw merged commit bdf1f06 into main Nov 8, 2024
134 of 137 checks passed
@Aswinmcw Aswinmcw deleted the Aswinmcw/ccl_reduce_scatter_perf branch November 8, 2024 07:14
ct-clmsn pushed a commit to ct-clmsn/tt-metal that referenced this pull request Nov 12, 2024
* tenstorrent#14406: Add perf test for reduce  scatter

* tenstorrent#14406: Add perf test for N300 reduce  scatter

* tenstorrent#14406: Fix BW computation