
Inference very slow since some of the params are offloaded to CPU after fine-tuning Nemotron-70B #73

pulkitmehtaworkmetacube opened this issue Nov 18, 2024 · 1 comment

@pulkitmehtaworkmetacube

We did the following:

  1. Took the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF base model and fine-tuned it on our custom dataset for a classification task. Training completed in about 6 hours and produced adapter weights.

  2. We are now trying to run inference on our test set by first loading the base model and then the adapter weights using PEFT.

We have 2 A100 80 GB GPUs. After step 1, utilization is around 67 GB on each GPU, but after loading the adapter one of the GPUs hits the 80 GB mark and we get the message: "Some parameters are on the meta device because they were offloaded to the cpu."
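
For reference, the loading path looks roughly like the sketch below (a minimal sketch only; the adapter path and the per-GPU max_memory caps are illustrative, not our exact values):

```python
# Minimal sketch of the loading path described above; the adapter path and
# per-GPU memory caps are illustrative, not our exact values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
adapter_path = "./nemotron-70b-classification-adapter"  # hypothetical path

tokenizer = AutoTokenizer.from_pretrained(base_id)

# device_map="auto" shards the 70B weights across both A100s; capping
# max_memory leaves headroom for the adapter and activations, otherwise
# accelerate offloads the overflow to CPU (the "meta device" warning above).
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "75GiB", 1: "75GiB"},
)

# Attach the LoRA adapter produced by fine-tuning.
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()
```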

We also tried loading the base model in 8-bit, but then we get the error:

TypeError: Input tensors need to be on the same GPU, but found the following tensor and device combinations:
[(torch.Size([170, 8192]), device(type='cuda', index=0)), (torch.Size([8192, 8192]), device(type='cuda', index=1)), (torch.Size([170, 8192]), device(type='cuda', index=0))]
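
The 8-bit attempt was along these lines (again a sketch; we assume bitsandbytes quantization via BitsAndBytesConfig, and device_map="auto" splits the layers across both GPUs, which is where the device mismatch shows up):

```python
# Sketch of the 8-bit attempt, assuming bitsandbytes quantization via
# BitsAndBytesConfig. With device_map="auto" the layers end up split across
# cuda:0 and cuda:1, matching the devices in the error above.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    quantization_config=bnb_config,
    device_map="auto",
)

# Adapter path is illustrative.
model = PeftModel.from_pretrained(model, "./nemotron-70b-classification-adapter")
```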

Any suggestions or leads would be highly appreciated.

@pulkitmehtaworkmetacube (Author)

Please review this and provide help.
