
Naive Parallelism Multi-GPU example script fails #993

Closed
johncookds opened this issue Nov 14, 2023 · 1 comment

johncookds commented Nov 14, 2023

Hi, I tried to adapt this example script to use device_map='auto' in order to support a larger model:
https://github.com/huggingface/trl/blob/main/examples/scripts/ppo_multi_adapter.py
Unfortunately, running llama-13b with naive pipeline parallelism across 4 A100s, it fails at the compute_reward_score step (the generate step does work). I believe this is because set_adapter isn't correctly loading the reward adapter weights onto the GPU. Any help/comments would be greatly appreciated.
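
For context, here is a minimal sketch of the adapted loading code, assuming the example script's setup (the model name, adapter path, and LoRA settings below are placeholders):

    from peft import LoraConfig
    from trl import AutoModelForCausalLMWithValueHead

    # Placeholder LoRA settings, shaped like the example script's config.
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"
    )

    # The only change from the example script: device_map="auto" spreads the
    # base model's layers across all visible GPUs (naive pipeline parallelism).
    model = AutoModelForCausalLMWithValueHead.from_pretrained(
        "huggyllama/llama-13b",               # placeholder model name
        device_map="auto",                    # was a single-device map
        peft_config=lora_config,
        reward_adapter="path/to/rm_adapter",  # placeholder reward adapter path
        load_in_8bit=True,
    )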

The error:

    raw_rewards = ppo_trainer.model.compute_reward_score(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/trl/models/modeling_base.py", line 498, in compute_reward_score
    base_model_output = self.pretrained_model(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/peft/peft_model.py", line 977, in forward
    return self.base_model(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/peft/tuners/tuners_utils.py", line 106, in forward
    return self.model.forward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/llama/modeling_llama.py", line 820, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/llama/modeling_llama.py", line 708, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/peft/tuners/lora/bnb.py", line 290, in forward
    output = lora_B(lora_A(dropout(x)))
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
johncookds (Author) commented

I was able to solve this by looping through the named parameters and moving the reward-model adapter weights to the correct GPU. If you inspect the devices of all the parameters, the default adapter and the base model layers are on the right devices, but the reward-model adapter parameters sit on the CPU, and set_adapter doesn't move them.
The code I run before the train loop is:

    # Activate the reward-model adapter so its parameters show up in
    # named_parameters(), then move any parameter stranded on the CPU to the
    # device of the preceding parameter (the adapter weights sit next to
    # their layer, which device_map="auto" already placed on a GPU).
    ppo_trainer.model.pretrained_model.set_adapter(ppo_trainer.model.rm_adapter_name)
    ppo_trainer.model.pretrained_model.eval()
    previous_device = None
    for name, param in ppo_trainer.model.pretrained_model.named_parameters():
        if str(param.device) == "cpu":
            # Assumes the first parameter is never on the CPU, so
            # previous_device is already a real device here.
            param.data = param.data.to(previous_device)
        previous_device = param.device
    # Switch back to the policy adapter and restore training mode.
    ppo_trainer.model.pretrained_model.set_adapter("default")
    ppo_trainer.model.pretrained_model.train()
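
A quick sanity check after the loop (a hypothetical assertion, not part of the fix itself) confirms that no parameter was left behind:

    # Hypothetical check: every parameter should now be off the CPU.
    cpu_params = [
        name
        for name, param in ppo_trainer.model.pretrained_model.named_parameters()
        if param.device.type == "cpu"
    ]
    assert not cpu_params, f"Parameters still on CPU: {cpu_params}"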
