Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Multidimensional device mesh" #3857

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

wujingyue
Copy link
Collaborator

Reverts #3821 due to #3856

@wujingyue wujingyue requested review from cowanmeg and naoyam February 8, 2025 19:18
Copy link

github-actions bot commented Feb 8, 2025

Description

  • Reverted multidimensional device mesh support

  • Removed multidimensional mesh checks and logic

  • Simplified device mesh handling to 1D

  • Updated tests and documentation accordingly


Changes walkthrough 📝

Relevant files
Bug fix
9 files
executor.cpp
Removed device index handling                                                       
+1/-3     
lower.cpp
Removed multidimensional mesh checks                                         
+19/-51 
device_mesh.cpp
Simplified device mesh to 1D                                                         
+2/-118 
utils.cpp
Updated max device ID calculation                                               
+3/-1     
type.cpp
Removed multidimensional parallel types                                   
+1/-6     
test_sharding.cpp
Removed multidimensional mesh tests                                           
+0/-27   
lower.h
Removed device index parameter                                                     
+2/-2     
device_mesh.h
Simplified device mesh to 1D                                                         
+9/-50   
type.h
Removed multidimensional parallel types                                   
+2/-6     

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
⚡ Recommended focus areas for review

Possible Issue

The removal of my_device_idx from the lower function calls might lead to incorrect behavior if the device index is needed for proper communication lowering.

std::vector<Expr*> HostIrLower::lower(Expr* c) {
  FusionGuard fg(c->fusion());

  if (c->isOneOf<MatmulOp, LinearOp>()) {
    return lowerToCollectiveBasedPipelinedGemmComm(c);
Possible Issue

The removal of multi-dimensional support in DeviceMesh might break existing code that relies on this functionality.

#include <numeric>
#include <unordered_set>

// for operator<<(std::ostream&, const std::vector<T>&)
#include <c10/util/Logging.h>

#include <type.h>

namespace nvfuser {

DeviceMesh::DeviceMesh(std::vector<DeviceIdxType> devices) {
  setDevices(std::move(devices));
}

DeviceMesh::DeviceMesh(std::initializer_list<DeviceIdxType> devices) {
  setDevices(std::vector<DeviceIdxType>(devices));
}

void DeviceMesh::setDevices(std::vector<DeviceIdxType> devices) {
  vector_ = std::move(devices);

  std::unordered_set<DeviceIdxType> unique_devices(
      vector_.begin(), vector_.end());
  NVF_ERROR(
      unique_devices.size() == vector_.size(),
      "Device mesh has duplicates: ",
      vector_);
}

/*static*/ DeviceMesh DeviceMesh::createForNumDevices(
    const int64_t num_devices) {
Missing Tests

The tests for multi-dimensional device mesh have been removed. It is important to add tests for the new single-dimensional DeviceMesh implementation to ensure correctness.

  KernelExecutor ke;
  ke.compile(fusion.get(), {a_tensor});
  auto outputs = ke.run({a_tensor});
  testValidate(fusion.get(), outputs, {a_tensor}, __LINE__, __FILE__);
}

INSTANTIATE_TEST_SUITE_P(
    ,
    ShardingTest,
    testing::Bool(),

@wujingyue
Copy link
Collaborator Author

!test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant