
RemoteMixtureOfExperts and RemoteSwitchMixtureOfExperts cannot be used on GPU #3

Closed
Vectorrent opened this issue Sep 6, 2024 · 2 comments
Labels: bug, help wanted

@Vectorrent (Contributor)

If you try to use these classes in a differentiable manner (i.e. with backward passes during training) on a GPU, they will both fail with errors like the following:

  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 306, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 599, in wrapper
    outputs = fn(ctx, *args)
              ^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/client/moe.py", line 312, in backward
    inputs_per_expert = zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/client/moe.py", line 312, in <genexpr>
    inputs_per_expert = zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))
                              ~~~~~~^^^^^^^^^^
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

If you trace the code to flat_inputs_cpu within Hivemind, you will see that CPU-only operations are used in many places. In the traceback above, the tensors in flat_inputs_cpu live on the CPU, while the index tensor alive_ii is still on the GPU, so the indexing at moe.py line 312 fails with a device mismatch. This explains why these layer types work on CPU, but fail on GPU.
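The failure is easy to reproduce outside of Hivemind. Here is a minimal sketch (the tensor shapes and index values are arbitrary, chosen only for illustration) showing that indexing a CPU tensor with a CUDA index tensor raises the same RuntimeError:

```python
import torch

if torch.cuda.is_available():
    # A CPU tensor, analogous to an entry of flat_inputs_cpu in moe.py
    flat_input_cpu = torch.randn(4, 8)

    # An index tensor left on the GPU, analogous to alive_ii
    alive_ii = torch.tensor([0, 2], device="cuda")

    # Raises: RuntimeError: indices should be either on cpu or on the same
    # device as the indexed tensor (cpu)
    inputs_per_expert = flat_input_cpu[alive_ii].split(1, dim=0)
```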

I don't think there's a great solution to this, short of refactoring the Hivemind code or replacing these layers entirely.
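That said, the narrow fix looks simple: move the indices onto the device of the tensor they index before splitting. A self-contained sketch of that pattern (the helper name is mine, not Hivemind's, and this is a guess at what a patch would do, not the actual change):

```python
import torch

def index_on_matching_device(tensor: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
    """Index `tensor` with `indices`, moving the indices to `tensor`'s device first.

    Mirrors the change that would be needed around moe.py line 312, where
    flat_inputs_cpu lives on the CPU but alive_ii can be a CUDA tensor.
    """
    return tensor[indices.to(tensor.device)]

# With the repro above, this succeeds where plain indexing fails:
# index_on_matching_device(flat_input_cpu, alive_ii)
```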

Vectorrent added the bug and help wanted labels on Sep 6, 2024
@Vectorrent (Contributor, Author)

I submitted a PR to Hivemind, though I may not use this code in the long run. These CPU operations are very slow, and using the GPU does not appear to speed them up at all. That may not matter, though, since remote peers are going to be slow no matter what we do.
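To illustrate why the GPU doesn't help here, a rough timing sketch (sizes and iteration count are arbitrary) of the host-device round trip these layers impose on every pass:

```python
import time
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        # The pattern these layers force on us: copy to CPU, operate there,
        # copy back. The transfers dominate, so a faster GPU does not help.
        y = x.cpu()
        y = y * 2.0
        _ = y.to("cuda")
    torch.cuda.synchronize()
    print(f"100 round trips: {time.perf_counter() - start:.3f}s")
```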

@Vectorrent (Contributor, Author)

This is probably not going to be an issue for us. We're not going to use Hivemind's MoE implementation; we're implementing our own.
