When using dual Nvidia GPUs, do both GPUs participate in inference/generation? #427

Closed Answered by LostRuins
wh33t asked this question in Q&A
Yes, using a slower GPU may actually result in a lower average speed. This can be partially mitigated by choosing the "main GPU" (the GPU number passed in with --usecublas, which is the GPU used to store the KV cache) and then manually setting --tensor_split to control how many layers are allocated to the secondary GPU. The best approach is still trial and error.
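
To make the trial and error a bit more systematic, here is a rough Python sketch that only builds argument lists to try one after another. It is not taken from the project itself: `build_gpu_args` and `candidate_splits` are hypothetical helper names, and the assumption that --tensor_split takes space-separated ratio values (one per GPU) should be checked against the `--help` output of your build before relying on it.

```python
def build_gpu_args(main_gpu, split):
    """Return the GPU-related arguments for one trial run.

    main_gpu: index of the GPU passed to --usecublas (holds the KV cache).
    split:    one ratio per GPU; ASSUMED to be passed to --tensor_split
              as space-separated values.
    """
    if not 0 <= main_gpu < len(split):
        raise ValueError("main_gpu must be a valid index into split")
    if any(s < 0 for s in split) or sum(split) <= 0:
        raise ValueError("split ratios must be non-negative and not all zero")
    return ["--usecublas", str(main_gpu),
            "--tensor_split", *[str(s) for s in split]]


def candidate_splits(total=4):
    """List two-GPU splits from 'everything on GPU 0' to 'everything on GPU 1'."""
    return [(total - k, k) for k in range(total + 1)]


if __name__ == "__main__":
    # Print one argument line per candidate; benchmark each one by hand.
    for split in candidate_splits():
        print(" ".join(build_gpu_args(0, split)))
```

Run each printed combination with the same prompt and compare generation speed. If the slower card drags the average down, shifting the ratio toward the faster GPU is the first thing to try.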

Answer selected by wh33t