Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ON HOLD] #14375: Optimize usage of unordered_map in tt::Cluster #14446

Closed
wants to merge 5 commits into from

Conversation

sminakov-tt
Copy link
Contributor

@sminakov-tt sminakov-tt commented Oct 29, 2024

Ticket

#14375

Problem description

Cluster::get_driver consumes a lot of resources according to the profiler. Primarily there are two reasons: it's being called too often, and it uses an std::unordered_map where an std::vector would work

What's changed

Describe the approach used to solve the problem.
Summarize the changes made and its impact.

Checklist

@davorchap
Copy link
Collaborator

@sminakov-tt is the IndexMap here similar to the SlotMap introduced by @eyonland and @patrickroberts ?
Just want to make sure we don't duplicate if possible

fyi @pgkeller

Copy link
Member

@ayerofieiev-tt ayerofieiev-tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sminakov-tt do you have a rough idea on how many elements are usually there in these containers?

github-actions[bot]

This comment was marked as outdated.

github-actions[bot]

This comment was marked as outdated.

Copy link
Contributor

@pgkeller pgkeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great to move away from map
what does indexmap buy us vs just using a vector?
do we really need optionals and has_value? when will correct code not have a value when accessed?

@blozano-tt
Copy link
Contributor

Sorry for the clang-tidy noise on this PR.
This PR has a stale .clang-tidy file which will be fixed once this branch is rebased.
I enabled clang-tidy today to run one simple check on using namespace;

@sminakov-tt
Copy link
Contributor Author

@blozano-tt I actually quite liked the clang-tidy comments, I think they are reasonable, so I fixed all of them.

@sminakov-tt
Copy link
Contributor Author

@davorchap From what I can see, SlotMap is very different in both purpose and implementation compared to IndexMap. IndexMap is just an api wrapper over std::vector<std::optional> to mimic std::map api.

@sminakov-tt
Copy link
Contributor Author

@ayerofieiev-tt I currently believe that each index is roughly in range [0; numberOfChipsInTheCluster)

@sminakov-tt
Copy link
Contributor Author

@pgkeller
IndexMap is just a simple vector with optionals. The reason for optionals is that currently there are a lot of checks that the requested element is actually present either using .at() method or even more explicit ones like:

    if (this->sdesc_per_chip_.find(chip) == this->sdesc_per_chip_.end()) {
        TT_THROW(
            "Cannot access soc descriptor for {} before device driver is initialized! Call "
            "initialize_device_driver({}) first",
            chip,
            chip);
    }

And there are also some iterations, which are supposed to only iterate over existing elements, for example:

for (const auto &[chip, mmio_device_id] : this->device_to_mmio_device_)

This is why I think we need to have the optionals there, and IndexMap is just a simple wrapper to preserve map-like api.

@blozano-tt
Copy link
Contributor

@davorchap From what I can see, SlotMap is very different in both purpose and implementation compared to IndexMap. IndexMap is just an api wrapper over std::vectorstd::optional to mimic std::map api.

@sminakov-tt awesome! I'm glad you liked them. I'm happy to enable more checks in the main line incrementally.

The mistake I made last time, was enable everything at once.

@sminakov-tt sminakov-tt marked this pull request as ready for review October 31, 2024 00:02
@@ -0,0 +1,121 @@
// SPDX-FileCopyrightText: © 2024 Tenstorrent AI ULC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this file go in tt_metal/tt_stl?

Copy link
Contributor

@abhullar-tt abhullar-tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aliuTT could you provide feedback on the ethernet metadata structures

@@ -264,28 +265,28 @@ class Cluster {

// There is one device driver per PCIe card. This map points id of the MMIO device points to the associated device
// driver
std::unordered_map<chip_id_t, std::unique_ptr<tt_device>> mmio_device_id_to_driver_;
tt::IndexMap<std::unique_ptr<tt_device>> mmio_device_id_to_driver_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #13949 will be merged soon and updates this so that there is one device driver for all devices detectd in a cluster (this will not need to be an IndexMap at that point)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds amazing!


// Need to hold reference to cluster descriptor to detect total number of devices available in cluster
// UMD static APIs `detect_available_device_ids` and `detect_number_of_chips` only returns number of MMIO mapped
// devices
std::string cluster_desc_path_;
std::unique_ptr<tt_ClusterDescriptor> cluster_desc_;
// There is an entry for every device that can be targeted (MMIO and remote)
std::unordered_map<chip_id_t, metal_SocDescriptor> sdesc_per_chip_;
tt::IndexMap<metal_SocDescriptor> sdesc_per_chip_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there will always be one soc desc per chip so this can just be a vector sized to total number of chips in a cluster

// Save mapping of device id to associated MMIO device id for fast lookup
std::unordered_map<chip_id_t, chip_id_t> device_to_mmio_device_;
tt::IndexMap<chip_id_t> device_to_mmio_device_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, there is an entry for each device so we don't necessarily need the IndexMap here

@@ -296,10 +297,10 @@ class Cluster {
// 2 -> 1
// 1 -> 0
// 3 -> 1
std::unordered_map<chip_id_t, uint16_t> device_to_host_mem_channel_;
tt::IndexMap<uint16_t> device_to_host_mem_channel_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is also an entry for each device id in here

@sminakov-tt
Copy link
Contributor Author

Based on the review from @abhullar-tt, I will wait on #13949, and then replace IndexMap with a simple std::vector in the mentioned places.

I would really appreciate if we can get similar feedback on devices_grouped_by_assoc_mmio_device_, tunnels_from_mmio_device, device_eth_routing_info_, and ethernet_sockets_

It would be really nice to just have std::vector everywhere, so we can kill IndexMap. The less code the better 🎉


// Collections of devices that are grouped based on the associated MMIO device. MMIO device is included in the
// grouping
std::unordered_map<chip_id_t, std::set<chip_id_t>> devices_grouped_by_assoc_mmio_device_;
tt::IndexMap<std::set<chip_id_t>> devices_grouped_by_assoc_mmio_device_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there will be one entry per mmio device so it can be a vector of this size

@pgkeller
Copy link
Contributor

@pgkeller IndexMap is just a simple vector with optionals. The reason for optionals is that currently there are a lot of checks that the requested element is actually present either using .at() method or even more explicit ones like:

    if (this->sdesc_per_chip_.find(chip) == this->sdesc_per_chip_.end()) {
        TT_THROW(
            "Cannot access soc descriptor for {} before device driver is initialized! Call "
            "initialize_device_driver({}) first",
            chip,
            chip);
    }

And there are also some iterations, which are supposed to only iterate over existing elements, for example:

for (const auto &[chip, mmio_device_id] : this->device_to_mmio_device_)

This is why I think we need to have the optionals there, and IndexMap is just a simple wrapper to preserve map-like api.

I guess I'm ok w/ this. From 10K feet it feels like we're using IndexMap to preserve the map-like API when using a map was bad design to begin with. So, is this a bandaid to easy transition or is there value to how these are being used?

@sminakov-tt sminakov-tt changed the title #14375: Optimize usage of unordered_map in tt::Cluster [ON HOLD] #14375: Optimize usage of unordered_map in tt::Cluster Nov 1, 2024
@sminakov-tt sminakov-tt closed this Feb 7, 2025
@sminakov-tt sminakov-tt deleted the sminakov/cluster-map branch February 7, 2025 01:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants