-
I see that the wiki says this:
To me that seems to say that only one GPU (presumably the one at position 0) will be doing the inference/generation, and the second GPU will just be towed along as extra VRAM. Is that correct? Any dual-NVIDIA users here able to comment?
-
Each GPU works with whatever weights have been offloaded to it. So GPU1 will do the matmul with the weights offloaded to GPU1, and likewise for GPU2.
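A toy sketch of what this means (this is not koboldcpp's actual code, just an illustration): each "device" holds only the layers offloaded to it and multiplies the activations through those layers alone, and the activations hop to the next device for the remaining layers. The layer split (4 layers vs. 2) is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 4
# Six hypothetical transformer-ish layers, reduced to plain matrices
layers = [rng.standard_normal((hidden, hidden)) for _ in range(6)]

# Hypothetical split: first 4 layers live on GPU0, last 2 on GPU1
gpu0_layers, gpu1_layers = layers[:4], layers[4:]

def run_device(x, device_layers):
    # A device only ever touches weights resident in its own VRAM
    for w in device_layers:
        x = x @ w
    return x

x0 = rng.standard_normal(hidden)
# GPU0 runs its layers, then the activations move to GPU1
split_result = run_device(run_device(x0, gpu0_layers), gpu1_layers)
# Same math as running every layer on a single device
single_result = run_device(x0, layers)
assert np.allclose(split_result, single_result)
```

The point is that the work is genuinely shared: the second GPU is not just spare VRAM, it executes the matmuls for its own slice of the model, with only the activations crossing between cards.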
-
Really appreciate the clarity! Thanks so much for responding so quickly, and thanks so much for providing us with this incredible piece of technology.
Yes, using a slower GPU may actually result in a lower average speed. This can be partially mitigated by setting the "main GPU" (the device number passed in with `--usecublas`), which will be used to store the KV cache, and then manually setting `--tensor_split` to allocate layers to the secondary GPU. The best approach is still trial and error.
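A sketch of what such an invocation might look like, assuming a hypothetical model path and a 70/30 split in favor of the faster card (device 0); the exact ratios and layer count are things you would tune by trial and error as described above:

```shell
# Hypothetical example: keep the KV cache on GPU 0 (the faster card)
# and split the model's layers roughly 70/30 between the two GPUs.
python koboldcpp.py --model model.gguf \
  --usecublas normal 0 \
  --tensor_split 7 3 \
  --gpulayers 99
```

If generation is slower than expected, shifting more of the `--tensor_split` ratio toward the faster GPU is usually the first thing to try.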