#0: DRAFT Add MeshProgram class #13701

Draft · wants to merge 2 commits into main

Conversation

tt-asaigal (Contributor):

This is a draft to get additional feedback.

  • Includes APIs to set MeshProgram configuration across the entire MeshDevice or per device in the Mesh. APIs are analogous to the Program config APIs
  • Basic getter APIs to return individual programs and state across the MeshProgram
  • Relies on the distributed_impl_ and distribute_to_mesh_device_impl_ functions
    • distributed_impl_ serves as the MeshProgram entry point on the Controller or Executor to process attributes of this data structure on host(s)
    • This function is currently implemented as a simple loop, but it can be swapped out for a set of RPC calls on the Controller and asynchronous calls on the Executor
    • distribute_to_mesh_device_impl_ serves as the interface between the MeshProgram on host and the MeshDevice. Currently used in EnqueueMeshProgram and implemented using a simple loop. Can be used to interface with TT-Fabric, once the infra is available
  • Design aspects to consider as we go along:
    • Does a MeshProgram span Controllers, or is it limited to a Controller connected to a single cluster? APIs and hierarchy may be easier with a MeshProgram <--> Controller <--> MeshDevice mapping
    • For programs we want to broadcast, the host does not need to perform repeated work across the entire device mesh (it currently does). It likely makes sense to have a bcast setting in the MeshProgram class, to ensure that the host sets configuration data once, with fast dispatch/fabric performing the bcast
    • Potential hierarchy:
      • The Controller populates the MeshProgram with kernels, sems, CBs and RTAs (individual programs, single-program bcast or multi-program bcast). Population can be done with a reimplementation of distributed_impl_ on the Controller
      • The MeshProgram is sent to Executors through a virtual CQ (RPC call + cq_id). This is done through a specialized distribute_to_mesh_device_impl_ on the Controller
      • Executors get a MeshProgram, which can be scattered or broadcast through the specified CQ using Fast Dispatch and Fabric. Assembling FD/Fabric commands can be host intensive, so it makes sense to distribute these steps across Executors. Executors get the MeshProgram to the MeshDevice through a specialized distribute_to_mesh_device_impl_
  • For generic entry points to mutate Mesh data structures and send them to the MeshDevice, we need generic distribute* functions that can accept any data type and perform generic processing (assemble FD commands, make RPC calls, send programs or data to the MeshDevice, etc.); a rough sketch follows below
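
A minimal sketch of what such a generic distribute entry point could look like, assuming a plain container of per-device items. distribute and MeshItem are illustrative names, not part of this PR; the point is only that call sites depend on the signature, so the loop body can later become RPC or async fan-out without touching callers:

#include <cstdio>
#include <vector>

// Stand-in for any per-device Mesh data structure (MeshProgram entry, buffer, ...).
struct MeshItem { int id; };

// Generic distribute helper: applies fn to every per-device item. Today this is
// a plain host-side loop; the body could later issue RPC calls on the Controller
// or asynchronous calls on Executors behind the same interface.
template <typename Item, typename Fn>
void distribute(std::vector<Item>& items, Fn&& fn) {
    for (auto& item : items) {
        fn(item);
    }
}

int main() {
    std::vector<MeshItem> per_device = {{0}, {1}, {2}, {3}};
    distribute(per_device, [](MeshItem& m) {
        std::printf("processing item for device %d\n", m.id);
    });
}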

Comment on lines +19 to +73
template<typename T>
T distributed_impl_(const std::function<T(Program&)>& callable) {
    if constexpr (std::is_same<T, void>::value) {
        // Void callables: apply to every program in the mesh.
        for (std::size_t program_idx = 0; program_idx < this->programs.size(); program_idx++) {
            callable(*this->programs.at(program_idx));
        }
    } else {
        // Value-returning callables: apply to all but the last program, then
        // return the result of the final call.
        for (std::size_t program_idx = 0; program_idx < this->programs.size() - 1; program_idx++) {
            callable(*this->programs.at(program_idx));
        }
        return callable(*this->programs.at(this->programs.size() - 1));
    }
}

template<typename T>
std::vector<T> distributed_impl_(const std::variant<std::function<T(Program&)>, std::function<T(Program&, Device*)>>& callable, std::shared_ptr<MeshDevice> mesh_device = nullptr) const {
    std::vector<T> rval = {};
    std::vector<Device*> devices = {};
    if (mesh_device != nullptr) {
        devices = mesh_device->get_devices();
        TT_ASSERT(devices.size() == this->programs.size(),
            "MeshProgram created for {} devices cannot be mapped to a MeshDevice with {} devices",
            this->programs.size(), devices.size());
        TT_ASSERT(std::holds_alternative<std::function<T(Program&, Device*)>>(callable));
        auto f = std::get<std::function<T(Program&, Device*)>>(callable);
        // Pair each program with its device and collect the results.
        for (std::size_t program_idx = 0; program_idx < devices.size(); program_idx++) {
            rval.push_back(f(*this->programs.at(program_idx), devices.at(program_idx)));
        }
    } else {
        TT_ASSERT(std::holds_alternative<std::function<T(Program&)>>(callable));
        auto f = std::get<std::function<T(Program&)>>(callable);
        // Collect a result from every program in the mesh.
        for (std::size_t program_idx = 0; program_idx < this->programs.size(); program_idx++) {
            rval.push_back(f(*this->programs.at(program_idx)));
        }
    }
    return rval;
}

template<typename T>
T distribute_to_mesh_device_impl_(const std::function<T(Program&, Device*)>& callable, std::shared_ptr<MeshDevice>& mesh_device) {
    auto devices = mesh_device->get_devices();
    TT_ASSERT(devices.size() == this->programs.size(),
        "MeshProgram created for {} devices cannot be mapped to a MeshDevice with {} devices",
        this->programs.size(), devices.size());
    if constexpr (std::is_same<T, void>::value) {
        for (std::size_t program_idx = 0; program_idx < devices.size(); program_idx++) {
            callable(*this->programs.at(program_idx), devices.at(program_idx));
        }
    } else {
        for (std::size_t program_idx = 0; program_idx < devices.size() - 1; program_idx++) {
            callable(*this->programs.at(program_idx), devices.at(program_idx));
        }
        return callable(*this->programs.at(devices.size() - 1), devices.at(devices.size() - 1));
    }
}
Contributor:
this should definitely not be part of any interface
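
One way to satisfy this, sketched under assumptions: keep the distribution machinery behind a pimpl so it never appears in the public header. Program below is a stand-in type and the Impl layout is hypothetical, not the PR's actual code:

#include <functional>
#include <memory>
#include <vector>

struct Program {};  // stand-in for the real tt-metal Program

class MeshProgram {
public:
    MeshProgram() : impl_(std::make_unique<Impl>()) {}
    // Public configuration APIs would live here; no distribute helpers exposed.
private:
    struct Impl {
        std::vector<std::unique_ptr<Program>> programs;
        // distributed_impl_ lives here, invisible to consumers of the header.
        void distributed_impl_(const std::function<void(Program&)>& callable) {
            for (auto& program : programs) callable(*program);
        }
    };
    std::unique_ptr<Impl> impl_;
};

int main() { MeshProgram mesh_program; }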

Comment on lines +79 to +133
uint32_t CreateSemaphore(
MeshProgram& mesh_program,
const std::variant<CoreRange, CoreRangeSet> &core_spec,
uint32_t initial_value,
CoreType core_type = CoreType::WORKER);

uint32_t CreateSemaphore(
MeshProgram& mesh_program,
const std::variant<CoreRange, CoreRangeSet> &core_spec,
uint32_t initial_value,
CoreType core_type,
chip_id_t device_id);

CBHandle CreateCircularBuffer(
MeshProgram& mesh_program,
const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec,
const CircularBufferConfig &config);

CBHandle CreateCircularBuffer(
MeshProgram& mesh_program,
const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec,
const CircularBufferConfig &config,
chip_id_t device_id);

void SetRuntimeArgs(
MeshProgram& mesh_program,
KernelHandle kernel,
const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec,
const std::vector<uint32_t> &runtime_args);

void SetRuntimeArgs(
MeshProgram& mesh_program,
KernelHandle kernel,
const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec,
const std::vector<uint32_t> &runtime_args,
chip_id_t device_id);

KernelHandle CreateKernel(
MeshProgram& mesh_program,
const std::string &file_name,
const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec,
const std::variant<DataMovementConfig, ComputeConfig, EthernetConfig> &config);

KernelHandle CreateKernel(
MeshProgram& mesh_program,
const std::string &file_name,
const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec,
const std::variant<DataMovementConfig, ComputeConfig, EthernetConfig> &config,
chip_id_t device_id);

void EnqueueMeshProgram(
uint8_t cq_id, MeshProgram& mesh_program, std::shared_ptr<MeshDevice> mesh_device, bool blocking);

void Finish(std::shared_ptr<MeshDevice> mesh_device, uint8_t cq_id);
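
A hedged usage sketch of the declarations above, end to end, assuming a MeshDevice handle mesh_device is already open. MeshProgram construction is not part of this diff, so the constructor call, num_devices() accessor, kernel path, core coordinates, and device id below are all illustrative assumptions:

// Hypothetical: obtain a MeshProgram sized to the mesh (constructor not in this diff).
MeshProgram mesh_program(mesh_device->num_devices());

// Broadcast-style configuration: no device_id, so it applies across the whole mesh.
KernelHandle kernel = CreateKernel(
    mesh_program,
    "kernels/eltwise_add.cpp",                      // placeholder path
    CoreRange(CoreCoord{0, 0}, CoreCoord{3, 3}),
    DataMovementConfig{});

uint32_t sem = CreateSemaphore(
    mesh_program, CoreRange(CoreCoord{0, 0}, CoreCoord{3, 3}), /*initial_value=*/0);

// Per-device override: the device_id overload targets a single chip in the mesh.
SetRuntimeArgs(mesh_program, kernel, CoreCoord{0, 0}, {64, 128}, /*device_id=*/2);

EnqueueMeshProgram(/*cq_id=*/0, mesh_program, mesh_device, /*blocking=*/false);
Finish(mesh_device, /*cq_id=*/0);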

Contributor:

We should try to stay as true to the existing APIs as possible for now, since there will be some refactor work.

KernelHandle CreateKernel(
MeshProgram& mesh_program,
const std::string &file_name,
const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec,
Contributor:

It feels like we shouldn't be programming at both the MeshProgram level and the Program level, or we'll end up feeding two sets of inputs at different abstraction levels.

}

template<typename T>
std::vector<T> distributed_impl_(const std::variant<std::function<T(Program&)>, std::function<T(Program&, Device*)>>& callable, std::shared_ptr<MeshDevice> mesh_device = nullptr) const {
Contributor:

> Includes APIs to set MeshProgram configuration across entire MeshDevice or per device in the Mesh. APIs are analogous to Program config APIs
> Basic getter APIs to return individual programs and state across MeshProgram
> Relies on distribute_impl_ and distribute_to_mesh_device_ functions
> distribute_impl_ serves as the MeshProgram entry point on the Controller or Executor to process attributes of this data structure on host(s)
> This function is currently implemented as a simple loop, but it can be swapped out for a set of RPC calls on the Controller and asynchronous calls on the executor
> distribute_to_mesh_device_impl_ serves as the interface between the MeshProgram on host and the MeshDevice.

The header interface should remain the same as we swap out the changes in the impl
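
A small sketch of that separation with stand-in types: the declaration is the stable header contract, and the body can move from a host-side loop to RPC fan-out without touching callers. All names here are illustrative, not the PR's code:

#include <functional>
#include <vector>

struct Program {};  // stand-in for the real type

// Header-visible declaration: this is the contract that stays fixed.
void distribute(std::vector<Program>& programs, const std::function<void(Program&)>& callable);

// Implementation today: a simple host-side loop.
void distribute(std::vector<Program>& programs, const std::function<void(Program&)>& callable) {
    for (auto& program : programs) {
        callable(program);
    }
}

// A later implementation could keep the exact same signature but fan the work
// out to Executors over RPC and wait on the replies; callers never change.

int main() {
    std::vector<Program> programs(4);
    distribute(programs, [](Program&) { /* configure one per-device program */ });
}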
