
[DRAFT] [WIP] Path exploration for TT-NN x TT-Mesh Integration #18067

Draft: wants to merge 33 commits into main from jchu/ttnn-integration-with-mesh
Conversation

cfjchu (Contributor) commented Feb 20, 2025

Ticket

Link to Github Issue

Problem description

Provide context for the problem.

What's changed

  • TT-Mesh integration through TT-NN
    • Removal of the multi-threaded async implementation in TT-NN for multi-device
    • Removal of multi-threading metadata state in Tensor

Checklist

@@ -614,7 +606,7 @@ Tensor to_host_mesh_tensor(const Tensor& tensor, bool blocking) {

mesh_cq.enqueue_read_shards(shard_data_transfers, mesh_buffer, /*blocking=*/true);

- MultiDeviceHostStorage host_storage(storage.strategy, std::move(buffers), std::move(specs));
+ MultiDeviceHostStorage host_storage(AllGatherTensor{}, std::move(buffers), std::move(specs));
Contributor:

FYI, on another branch I am attempting to get rid of "strategy". I think we only need a shape to track which devices the tensor was uploaded to. It can be a full mesh shape or a submesh.

Contributor (author):

that would be great!

std::vector<T> host_buffer;
- const auto& shard_tensor_spec = storage.specs.at(id);
+ const auto& shard_tensor_spec = tensor.get_tensor_spec();
Contributor:

We will need the per-shard information, won't we? What is your plan here?

Contributor (author):

Proceeding pragmatically here; I'll add support for it if it's really needed. For now, let's see how far I can get.

@cfjchu cfjchu force-pushed the jchu/ttnn-integration-with-mesh branch 4 times, most recently from 7441d1b to 560ad32 Compare February 21, 2025 05:40
@@ -108,6 +80,7 @@ class Tensor {
explicit Tensor(
uint32_t num_buffers, std::optional<DistributedTensorConfig> distributed_tensor_config = std::nullopt);
explicit Tensor(const std::vector<IDevice*>& workers);
Member:

can be removed?

@cfjchu cfjchu force-pushed the jchu/ttnn-integration-with-mesh branch from 71a3899 to bae493e Compare February 22, 2025 00:35
cfjchu and others added 13 commits February 22, 2025 09:37
### Ticket
#18360

### Problem description
Recently we disabled async mode for single devices by ignoring the
enable_async call for them, assuming multi-device customers call
MeshDevice::enable_async. However, in some places, including our tests,
we actually iterate over each individual device in the mesh and call
enable_async on it, and that call is ignored.

### What's changed
Make a single call to MeshDevice::enable_async instead of iterating over
individual devices and calling Device::enable_async on each one.

### Checklist
- [ ] [All post commit CI
passes](https://github.com/tenstorrent/tt-metal/actions/runs/13553947437)
- [x] [T3K demo tests CI
passes](https://github.com/tenstorrent/tt-metal/actions/runs/13553950838)
- [x] New/Existing tests provide coverage for changes

(cherry picked from commit 69a36b8)