[DRAFT] [WIP] Path exploration for TT-NN x TT-Mesh Integration #18067
base: main
Conversation
@@ -614,7 +606,7 @@ Tensor to_host_mesh_tensor(const Tensor& tensor, bool blocking) {

    mesh_cq.enqueue_read_shards(shard_data_transfers, mesh_buffer, /*blocking=*/true);

-   MultiDeviceHostStorage host_storage(storage.strategy, std::move(buffers), std::move(specs));
+   MultiDeviceHostStorage host_storage(AllGatherTensor{}, std::move(buffers), std::move(specs));
Fyi, on another branch I am attempting to get rid of "strategy". I think we only need a shape to track which devices the tensor was uploaded to. It can be a full mesh shape or a submesh.
that would be great!
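A rough sketch of the direction suggested above: record only the (sub)mesh shape the tensor was uploaded to instead of a "strategy". The `MeshShape`, `OwnedBuffer`, and `TensorSpec` definitions below are stand-ins so the sketch compiles on its own, and the constructor and member names are hypothetical, not the actual tt-metal API.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Stand-ins so the sketch is self-contained; the branch would use the
// existing tt-metal types instead.
struct OwnedBuffer {};
struct TensorSpec {};

// A (sub)mesh shape is enough to record which devices the tensor was
// uploaded to: either the full mesh shape or a submesh shape.
struct MeshShape {
    uint32_t num_rows = 0;
    uint32_t num_cols = 0;
};

class MultiDeviceHostStorage {
public:
    MultiDeviceHostStorage(MeshShape distribution_shape,
                           std::vector<OwnedBuffer> buffers,
                           std::vector<TensorSpec> specs)
        : distribution_shape_(distribution_shape),
          buffers_(std::move(buffers)),
          specs_(std::move(specs)) {}

    const MeshShape& distribution_shape() const { return distribution_shape_; }

private:
    MeshShape distribution_shape_;  // replaces the per-tensor "strategy"
    std::vector<OwnedBuffer> buffers_;
    std::vector<TensorSpec> specs_;
};
```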
    std::vector<T> host_buffer;
-   const auto& shard_tensor_spec = storage.specs.at(id);
+   const auto& shard_tensor_spec = tensor.get_tensor_spec();
We will need the per-shard information, won't we? What is your plan here?
Being pragmatic for now: I'll add support for it if it's really needed. Let's see how far I can get first.
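If per-shard specs do turn out to be needed, one hypothetical way to reintroduce them is a small fallback helper. This is only an illustration, not part of the branch; it assumes the `storage.specs` container and `Tensor::get_tensor_spec()` accessor seen in the diff above.

```cpp
#include <cstddef>

// Illustrative helper (not in the branch): prefer a per-shard spec when the
// storage still carries one, otherwise fall back to the tensor-level spec.
const TensorSpec& get_shard_spec(const MultiDeviceHostStorage& storage,
                                 const Tensor& tensor,
                                 std::size_t shard_id) {
    if (shard_id < storage.specs.size()) {
        // Per-shard information, e.g. for uneven sharding across devices.
        return storage.specs.at(shard_id);
    }
    // Uniform shards: the tensor-level spec applies to every shard.
    return tensor.get_tensor_spec();
}
```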
Force-pushed from 7441d1b to 560ad32
@@ -108,6 +80,7 @@ class Tensor {
    explicit Tensor(
        uint32_t num_buffers, std::optional<DistributedTensorConfig> distributed_tensor_config = std::nullopt);
    explicit Tensor(const std::vector<IDevice*>& workers);
can be removed?
Force-pushed from 71a3899 to bae493e
This reverts commit bae493e.
### Ticket
#18360

### Problem description
Recently we disabled async mode for single devices by ignoring the enable_async call for them, assuming multi-device customers call MeshDevice::enable_async. However, in some places, including our tests, we iterate over each individual device in the mesh and call enable_async on it, and that call is now ignored.

### What's changed
Make a single call to MeshDevice::enable_async instead of iterating over individual devices and calling Device::enable_async for each of them (see the sketch after this description).

### Checklist
- [ ] [All post commit CI passes](https://github.com/tenstorrent/tt-metal/actions/runs/13553947437)
- [x] [T3K demo tests CI passes](https://github.com/tenstorrent/tt-metal/actions/runs/13553950838)
- [x] New/Existing tests provide coverage for changes

(cherry picked from commit 69a36b8)
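A minimal before/after sketch of the change described above. It assumes a `mesh_device` handle and the MeshDevice::enable_async / Device::enable_async calls named in the description; the per-device loop is illustrative of the pattern being replaced.

```cpp
// Before (illustrative): enabling async mode per device in the mesh.
// Device::enable_async is now ignored for single devices, so this is a no-op.
// for (auto* device : mesh_device->get_devices()) {
//     device->enable_async(true);
// }

// After: a single call on the mesh device itself.
mesh_device->enable_async(true);
```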
lots of fun TMP
…-with-mesh-rebase
### Ticket
Link to Github Issue

### Problem description
Provide context for the problem.

### What's changed

### Checklist