Why llamafile is slower with GPU than CPU (on Windows) #629
BradHutchings started this conversation in General
From my search on the subject, someone is going to need this answer. 🤣
Background
When you run a llamafile.exe that does not have GPU support statically linked in for Windows, the executable will try to find a dynamic library in the user's `$env:USERPROFILE\.llamafile` directory. If it finds one, it will use it. If it doesn't find one, it can build one, provided you have Visual Studio (`cl.exe`) and CUDA installed. Your `.args` need to contain a `-ngl` parameter, or you need to specify one on the command line of the tool, for the executable to go looking for a GPU.
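For example, you can check whether a GPU library has already been cached and then request GPU offload explicitly. A minimal PowerShell sketch, assuming the model filename is a placeholder; `-ngl` is the offload flag mentioned above, and 999 just means "offload as many layers as possible":

```powershell
# Check whether a GPU support library has already been built/cached
# in the directory the executable searches.
Get-ChildItem $env:USERPROFILE\.llamafile

# Run the llamafile with GPU offload requested. -ngl sets the number of
# layers to offload; 999 asks for as many as the model has.
.\your-model.llamafile.exe -ngl 999
```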
Why llamafile is slower with GPU than CPU on Windows
Your GPU card has RAM dedicated to it. Windows has a thing called "shared GPU memory". This shared memory is in your main RAM. Windows (I'm assuming) shuffles data to the GPU as it's needed. This picture from Task Manager shows my situation running a 9GB model with 4GB of dedicated RAM on my GPU:
So the model doesn't fit on the GPU. And that is why inference runs slower with the GPU enabled than just running on the CPU.
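One hedged workaround that follows from this: offload only as many layers as fit in dedicated VRAM instead of the whole model. Assuming roughly even layer sizes and, for illustration only, a ~9 GB model with ~32 layers, each layer is on the order of 9 GB / 32 ≈ 280 MB, so roughly 4 GB / 280 MB ≈ 14 layers might stay in dedicated memory:

```powershell
# Offload only part of the model so the offloaded layers stay in the GPU's
# dedicated VRAM instead of spilling into Windows "shared GPU memory".
# 14 is an illustrative value for a ~9 GB model on a 4 GB card; tune it
# while watching dedicated vs. shared GPU memory in Task Manager.
.\your-model.llamafile.exe -ngl 14
```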
I look forward to everyone's clarifications and corrections. 🤣