PPO on multi-GPU but get Error: Expected all tensors to be on the same device #809
@younesbelkada
Hi @Ricardokevins
Hi, thank you for your reply! @younesbelkada Is it an issue with the transformers version? I use transformers==4.33.1
Hmm, this is strange. Loading a model with
Hi, thank you for your help. I checked the settings and set the
But I still receive the error:
I checked, and the error is caused by the LlamaRMSNorm module
I dove into the code and found that the model's 7th layer is on GPU 0, although according to the device_map it should be placed on GPU 1. @younesbelkada
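For anyone debugging a similar placement mismatch, here is a minimal sketch (not from this thread; the checkpoint name and layer index are placeholders) for comparing accelerate's planned device map against the devices the parameters actually ended up on:

```python
# Minimal sketch: verify where accelerate placed each submodule when loading
# with device_map="auto". Checkpoint name and layer index are placeholders.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",          # hypothetical checkpoint
    device_map="auto",
    torch_dtype=torch.float16,
)

# hf_device_map is the module -> device assignment accelerate computed.
for name, device in model.hf_device_map.items():
    print(f"{name} -> {device}")

# Cross-check the devices the parameters actually live on, e.g. for layer 7.
for name, param in model.named_parameters():
    if ".layers.7." in name:
        print(name, param.device)
```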
I have a similar "Expected all tensors to be on the same device" issue when I run the example in https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/rl_training.py. Later I found out that using the example command is fine, but when I enable DeepSpeed ZeRO-2, I get that error.
Any updates here?
Remove
Wow, amazing! I noticed that in the original code, the Accelerator was mainly used to provide the current_device. Have you replaced all the settings for current_device with 'auto'?
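For context, a rough sketch of the two loading strategies being compared here; the checkpoint name is a placeholder, and variant A follows the pattern used in the stack_llama example (simplified: 8-bit loading and LoRA omitted):

```python
# Sketch of the two device_map strategies discussed above; the checkpoint name
# is a placeholder, not the model from this thread. Use one variant or the other.
from accelerate import Accelerator
from trl import AutoModelForCausalLMWithValueHead

model_name = "huggyllama/llama-7b"  # hypothetical

# Variant A (pattern from the stack_llama example): one full model copy per
# process, pinned to that process's GPU. This is what DDP / DeepSpeed ZeRO expect.
current_device = Accelerator().local_process_index
model_per_process = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    device_map={"": current_device},
)

# Variant B: let accelerate shard a single copy across all visible GPUs
# (naive model parallelism). Run this as a single process; do not combine it
# with DeepSpeed ZeRO, which manages parameter placement itself.
model_sharded = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    device_map="auto",
)
```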
Thank you for your previous suggestion for resolving the issue. I implemented the changes based on your advice, but unfortunately the problem still persists. If it's convenient for you, could you please share the code that you successfully ran, so that I can reference it for further troubleshooting? Thank you for your help.
@Ricardokevins have you solved the issue? I have the same problem
Hi @Ricardokevins @imgremlin, could you paste a reproducing code snippet for me to check?
No, I can't run the model correctly
Sure
Here is my script:
The code looks fine to me. Do you have a minimal runnable snippet?
This code works well with:
But doesn't work with:
Hello! I have the same problem. Have you solved this issue?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
I am suffering from the same issue mentioned above. Does anybody have a solution to this?
I am training alpaca-7B on 4 * A100 80G
I am using the DeepSpeed ZeRO-2 YAML file provided in the repository as the configuration file. Even when I set the model's device_map to None and the batch size to 2 (with mini-batch size 1), I still run out of GPU memory.
Therefore, I tried setting device_map to 'auto' so that accelerate can shard the model across different GPUs. However, when the code reaches ppo_trainer.generate, the aforementioned error occurs. Could you please advise on how to resolve this?
The command I use:
The model loading code:
The PPOConfig:
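For illustration only, a minimal per-process PPO sketch along the lines described above — this is not the original command, model-loading code, or config from this issue; it assumes the pre-0.12 TRL PPOTrainer API, and every name, path, and hyperparameter below is a placeholder:

```python
# Hypothetical sketch of a per-process PPO setup (not the code from this issue).
# All names and values are placeholders.
import torch
from accelerate import Accelerator
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "huggyllama/llama-7b"              # placeholder checkpoint
current_device = Accelerator().local_process_index

config = PPOConfig(
    model_name=model_name,
    batch_size=1,                               # one query per step in this toy example
    mini_batch_size=1,
    learning_rate=1.41e-5,
)

# One full copy of the policy per process; memory is then managed by DeepSpeed
# ZeRO, instead of sharding the model across GPUs with device_map="auto".
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    device_map={"": current_device},
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

generation_kwargs = {
    "max_new_tokens": 32,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}

# Queries must live on the same device as the model's input embeddings,
# otherwise generate() raises "Expected all tensors to be on the same device".
query_tensors = [
    tokenizer("Hello, how are you?", return_tensors="pt").input_ids[0].to(current_device)
]
response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)
rewards = [torch.tensor(1.0, device=current_device)]   # dummy reward
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```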