If you try to use these classes in a differentiable manner (i.e., with backward passes during training) on a GPU, they will both fail with errors like the following:
File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 306, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 599, in wrapper
outputs = fn(ctx, *args)
^^^^^^^^^^^^^^
File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/client/moe.py", line 312, in backward
inputs_per_expert = zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/client/moe.py", line 312, in <genexpr>
inputs_per_expert = zip(*(tensor[alive_ii].split(1, dim=0) for tensor in flat_inputs_cpu))
~~~~~~^^^^^^^^^^
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
If you trace the code to flat_inputs_cpu within Hivemind, you will see that CPU-only operations are used in many places. This explains why these layer types work on CPU but fail on GPU.
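The failure reduces to indexing a CPU tensor with a CUDA index tensor. Here is a minimal sketch, assuming (as the traceback suggests) that the flattened inputs stay on CPU while the alive_ii index tensor ends up on the GPU:

```python
import torch

# Minimal reproduction of the device mismatch (assumes a CUDA device is available).
flat_input_cpu = torch.randn(8, 16)                # Hivemind keeps flat_inputs_cpu on the CPU
alive_ii = torch.tensor([0, 2, 5], device="cuda")  # indices of surviving experts land on the GPU

try:
    _ = flat_input_cpu[alive_ii]  # raises the RuntimeError shown above
except RuntimeError as e:
    print(e)

# One possible workaround (a sketch, not the upstream fix): move the indices
# onto the same device as the tensor being indexed before splitting per expert.
inputs_per_expert = flat_input_cpu[alive_ii.cpu()].split(1, dim=0)
```

Even with a workaround like this, the per-expert split itself still runs on CPU, which is why moving to GPU does not make the operation any faster.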
I don't think there's a great solution to this, short of refactoring the Hivemind code, or replacing these layers entirely.
I submitted a PR to Hivemind, though I may not use this code in the long run. These CPU operations are very slow, and using a GPU does not appear to speed them up in any way. This may not matter, though, since remote peers are going to be slow no matter what we do.