
Fix RemoteMixtureOfExperts and RemoteSwitchMixtureOfExperts backward() on GPU #626

Open
wants to merge 1 commit into master

Conversation

Vectorrent (Contributor)

If you try to use RemoteMixtureOfExperts or RemoteSwitchMixtureOfExperts during training on GPU, you will get errors like this:

  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 306, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 599, in wrapper
    outputs = fn(ctx, *args)
              ^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/client/moe.py", line 312, in backward
    inputs_per_expert = zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/client/moe.py", line 312, in <genexpr>
    inputs_per_expert = zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))
                              ~~~~~~^^^^^^^^^^
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

This does not happen when training on CPU; the error only occurs on GPU, and as far as I can tell there is no way to work around it without fixing the underlying Hivemind code. I tried torch.cuda.set_device(), and I tried moving the input tensors to the CPU myself, but none of that works: backward() indexes its hard-coded CPU copies of the inputs (flat_inputs_cpu) with an index tensor (alive_ii) that is still on the GPU, and the two devices conflict.
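
For reference, the device mismatch is easy to reproduce in plain PyTorch (on a machine with a CUDA device): indexing a CPU tensor with a CUDA index tensor raises exactly this RuntimeError, and moving the index to the CPU first avoids it. The snippet below is only an illustration of the failure mode, not the actual patch; flat_inputs_cpu and alive_ii are the names from the traceback, and the shapes are made up:

    import torch

    # Stand-ins for hivemind's CPU copies of the inputs and the index tensor
    # of responding experts; the shapes here are arbitrary.
    flat_inputs_cpu = [torch.randn(4, 8)]           # CPU tensors
    alive_ii = torch.tensor([0, 2], device="cuda")  # CUDA indices

    try:
        # Same pattern as moe.py line 312: a CPU tensor indexed by CUDA indices.
        list(zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu)))
    except RuntimeError as err:
        print(err)  # indices should be either on cpu or on the same device ...

    # Moving the index tensor to the CPU before indexing works as expected.
    alive_ii = alive_ii.cpu()
    inputs_per_expert = zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))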

This PR should fix the problem.
