-
I see that the wiki says this:
To me that seems to say that only one GPU (presumably the one at position 0) will be doing the inference/generation, and the second GPU will just be towed along as extra VRAM. Is that correct? Any dual-NVIDIA users here able to comment?
-
Each GPU works with whatever weights have been offloaded to it. So GPU1 will do the matmul with the weights offloaded to GPU1, and likewise for GPU2.
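A toy sketch of what this means (this is not koboldcpp's actual code, just an illustration): each "device" holds only the layers offloaded to it and multiplies the activations through those layers alone, and the activations hop to the next device for the remaining layers. The layer split (4 layers vs. 2) is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 4
# Six hypothetical transformer-ish layers, reduced to plain matrices
layers = [rng.standard_normal((hidden, hidden)) for _ in range(6)]

# Hypothetical split: first 4 layers live on GPU0, last 2 on GPU1
gpu0_layers, gpu1_layers = layers[:4], layers[4:]

def run_device(x, device_layers):
    # A device only ever touches weights resident in its own VRAM
    for w in device_layers:
        x = x @ w
    return x

x0 = rng.standard_normal(hidden)
# GPU0 runs its layers, then the activations move to GPU1
split_result = run_device(run_device(x0, gpu0_layers), gpu1_layers)
# Same math as running every layer on a single device
single_result = run_device(x0, layers)
assert np.allclose(split_result, single_result)
```

The point is that the work is genuinely shared: the second GPU is not just spare VRAM, it executes the matmuls for its own slice of the model, with only the activations crossing between cards.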
-
Really appreciate the clarity! Thanks so much for responding so quickly, and thanks so much for providing us with this incredible piece of technology.
Yes, using a slower GPU may actually result in a lower average speed. This can be partially mitigated by setting the "main GPU" (the device number passed in with `--usecublas`), which will be used to store the KV cache, and then manually setting `--tensor_split` to allocate layers to the secondary GPU. The best approach is still trial and error.
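A sketch of what such an invocation might look like, assuming a hypothetical model path and a 70/30 split in favor of the faster card (device 0); the exact ratios and layer count are things you would tune by trial and error as described above:

```shell
# Hypothetical example: keep the KV cache on GPU 0 (the faster card)
# and split the model's layers roughly 70/30 between the two GPUs.
python koboldcpp.py --model model.gguf \
  --usecublas normal 0 \
  --tensor_split 7 3 \
  --gpulayers 99
```

If generation is slower than expected, shifting more of the `--tensor_split` ratio toward the faster GPU is usually the first thing to try.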