If you try to use these classes in a differentiable manner (i.e., with backward passes during training) on a GPU, they will both fail with errors like the following:
File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 306, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 599, in wrapper
outputs = fn(ctx, *args)
^^^^^^^^^^^^^^
File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/client/moe.py", line 312, in backward
inputs_per_expert = zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/client/moe.py", line 312, in <genexpr>
inputs_per_expert = zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))
~~~~~~^^^^^^^^^^
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
If you trace the code to flat_inputs_cpu within Hivemind, you will see that CPU-only operations are used in many places. This explains why these layer types work on CPU but fail on GPU.
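The failure reduces to indexing a CPU tensor with a CUDA index tensor. Here is a minimal sketch, assuming (as the traceback suggests) that the flattened inputs stay on CPU while the alive_ii index tensor ends up on the GPU:

```python
import torch

# Minimal reproduction of the device mismatch (assumes a CUDA device is available).
flat_input_cpu = torch.randn(8, 16)                # Hivemind keeps flat_inputs_cpu on the CPU
alive_ii = torch.tensor([0, 2, 5], device="cuda")  # indices of surviving experts land on the GPU

try:
    _ = flat_input_cpu[alive_ii]  # raises the RuntimeError shown above
except RuntimeError as e:
    print(e)

# One possible workaround (a sketch, not the upstream fix): move the indices
# onto the same device as the tensor being indexed before splitting per expert.
inputs_per_expert = flat_input_cpu[alive_ii.cpu()].split(1, dim=0)
```

Even with a workaround like this, the per-expert split itself still runs on CPU, which is why moving to GPU does not make the operation any faster.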
I don't think there's a great solution to this, short of refactoring the Hivemind code, or replacing these layers entirely.
I submitted a PR to Hivemind, though I may not use this code in the long run. These CPU operations are very slow, and using a GPU does not appear to speed them up in any way. This may not matter, though, since remote peers are going to be slow no matter what we do.