#2065: Updated all reduce code to handle 0 or 1 mesh cluster axis #2215
Conversation
@tapspatel Can you wait for #2149 to be merged? I had to fix some of the all_reduce import code and it is conflicting now. I believe you can apply this change on top of that PR.
Thumbs up for runtime
Comment inline; otherwise the runtime changes look good.
Couple of comments inline.
Force-pushed from 3002451 to a846738
Great follow-up! A few comments, mostly nits, but a few might be bugs, so make sure to check them before merging.
Force-pushed from a846738 to 26fddac
Clang-Tidy found issue(s) with the introduced code (1/1).
Force-pushed from 26fddac to ac11675
Perhaps the commit title should become the commit message and the title could be something a bit more concise?
I think the TTNN interfaces need to change; some other comments are inline, but otherwise this looks good!
```cpp
auto firstElementIt = replicaGroups.begin();
auto secondElementIt = firstElementIt + 1;
clusterAxis = (((*firstElementIt) + 1) == *secondElementIt);
```
Doesn't this assume that device IDs are consecutive and in a particular topology?
From docs: https://docs.jax.dev/en/latest/_autosummary/jax.make_mesh.html
Essentially, JAX determines the mesh device ordering internally from the TPU structure and optimizes it for the mesh shape. When simulating on CPUs, however, it orders devices consecutively. To simplify things, we assume the mesh provided by tt-xla is monotonically increasing, because the compiler doesn't yet have a way to propagate an efficient mesh mapping based on the hardware structure (though this can be done).
I can add this assumption as a comment for now.
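For reference, a minimal sketch of what documenting that assumption could look like, assuming a row-major 2D mesh with device IDs monotonically increasing from 0 to N-1; the helper name inferClusterAxis and its std::vector signature are hypothetical, not the actual tt-mlir code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper illustrating the assumption discussed above.
// Assumption: the 2D mesh is laid out row-major with device IDs
// monotonically increasing from 0 to N-1 (what tt-xla currently provides).
// A replica group whose first two IDs are consecutive therefore spans the
// fastest-varying mesh dimension (cluster axis 1); otherwise it spans axis 0.
uint32_t inferClusterAxis(const std::vector<int64_t> &replicaGroup) {
  assert(replicaGroup.size() >= 2 && "replica group needs at least two devices");
  assert(replicaGroup[0] < replicaGroup[1] && "device IDs assumed ascending");
  return (replicaGroup[0] + 1 == replicaGroup[1]) ? 1 : 0;
}
```

For example, on a 2x4 mesh with row-major IDs 0-7, the row group {0, 1, 2, 3} maps to cluster axis 1, while the column group {0, 4} maps to cluster axis 0.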
As for topologies: per a conversation with Wooseok, we are only supporting 2D grid topologies in the compiler.
I have observed ascending device IDs from 0 to N-1 so far as well, so I guess it's fine. In the future, we may leverage the device IDs to best distribute data for our hardware config, but we haven't gotten there yet.
Clang-Tidy found issue(s) with the introduced code (1/1).
Looks good, thank you!
Force-pushed from ac11675 to d5d9ebd
Can you address some of my comments/requests?
Force-pushed from d5d9ebd to 959ef8e
Clang-Tidy found issue(s) with the introduced code (1/1).
Force-pushed from 959ef8e to ca39af3
@wooseokTT let me know if the responses to all the comments are sufficient.
Force-pushed from ca39af3 to 0cec5f6
Force-pushed from 0cec5f6 to f02b2fa
#2065: Updated all reduce code to handle 0 or 1 cluster axis and cleaned up dialect representations of all reduce in ttir and ttnn. Updated algorithms for calculating gather and scatter dimensions. Migrated all workaround code into the TTNN workaround pass so that we don't clog up the ttir or ttnn definitions of all_reduce.