
Inference very slow since some of the params are offloaded to CPU after fine-tuning Nemotron-70B #73

pulkitmehtaworkmetacube opened this issue Nov 18, 2024 · 1 comment

@pulkitmehtaworkmetacube

We did the following:

  1. Took the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF base model and fine-tuned it on our custom dataset for a classification task. Training completed in about 6 hours and produced adapter weights.

  2. We are now trying to run inference on our test set by first loading the base model and then the adapter weights using PEFT.

We have 2 A100 80 GB GPUs. After step 1, utilization is around 67 GB on each GPU, but after loading the adapter one of the GPUs hits the 80 GB mark and we get the message: "Some parameters are on the meta device because they were offloaded to the cpu."
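
For reference, the loading path looks roughly like the sketch below (a minimal sketch only; the adapter path and the per-GPU max_memory caps are illustrative, not our exact values):

```python
# Minimal sketch of the loading path described above; the adapter path and
# per-GPU memory caps are illustrative, not our exact values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
adapter_path = "./nemotron-70b-classification-adapter"  # hypothetical path

tokenizer = AutoTokenizer.from_pretrained(base_id)

# device_map="auto" shards the 70B weights across both A100s; capping
# max_memory leaves headroom for the adapter and activations, otherwise
# accelerate offloads the overflow to CPU (the "meta device" warning above).
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "75GiB", 1: "75GiB"},
)

# Attach the LoRA adapter produced by fine-tuning.
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()
```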

We also tried loading the base model in 8-bit, but then we get the error:

TypeError: Input tensors need to be on the same GPU, but found the following tensor and device combinations:
[(torch.Size([170, 8192]), device(type='cuda', index=0)), (torch.Size([8192, 8192]), device(type='cuda', index=1)), (torch.Size([170, 8192]), device(type='cuda', index=0))]
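
The 8-bit attempt was along these lines (again a sketch; we assume bitsandbytes quantization via BitsAndBytesConfig, and device_map="auto" splits the layers across both GPUs, which is where the device mismatch shows up):

```python
# Sketch of the 8-bit attempt, assuming bitsandbytes quantization via
# BitsAndBytesConfig. With device_map="auto" the layers end up split across
# cuda:0 and cuda:1, matching the devices in the error above.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    quantization_config=bnb_config,
    device_map="auto",
)

# Adapter path is illustrative.
model = PeftModel.from_pretrained(model, "./nemotron-70b-classification-adapter")
```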

Any suggestions or leads would be highly appreciated.

@pulkitmehtaworkmetacube (Author)

Please review this and provide help.
