CAUTION: Support for GPU acceleration is preliminary. There are known issues.

Generally, all backends supported by GGML are available, with a focus on the backends below.
| Backend | Target devices |
|---------|----------------|
| CUDA    | Nvidia GPU     |
| RPC     | Any            |
| Vulkan  | GPU            |
To build with Vulkan:

```sh
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
```

To build with CUDA:

```sh
cmake -B build -DGGML_CUDA=1
cmake --build build --config Release
```
For more information, please check out Build llama.cpp locally.
Use `-ngl` (`--n_gpu_layers`) to specify the number of layers to offload to the GPU. We call everything before the first layer the "Prolog", and everything after the last layer the "Epilog". "Prolog" and "Epilog" are treated as special layers, and they can also be configured from `-ngl` by including `prolog` and `epilog` respectively.
Suppose there is a model with 10 hidden layers:

- `-ngl 5`: put the first 5 layers on the GPU;
- `-ngl 100`: put all layers on the GPU;
- `-ngl 5,prolog`: put the first 5 layers and the "Prolog" layer on the GPU;
- `-ngl 100,prolog,epilog`: put all layers, the "Prolog" layer, and the "Epilog" layer on the GPU;
- `-ngl all`: equivalent to `-ngl 99999,prolog,epilog`.
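As a minimal sketch of how this looks on the command line (the binary name `main` and the `-m` model flag are assumptions here; adjust them to your build):

```sh
# Offload the first 5 hidden layers plus the "Prolog" to the GPU
./main -m model.bin -ngl 5,prolog
```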
The full format of `-ngl` is `-ngl [id:]layer_specs[;id:layer_specs]...`, where `id` is the GPU device ID (`0` if omitted). `layer_specs` can be a positive integer, `prolog`, `epilog`, a combination of these, or just `all`.
Use `--show_devices` to check all available devices.
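For example, here is a hedged sketch of splitting a model across two GPUs, with device IDs taken from `--show_devices` (the binary and model file names are placeholders):

```sh
# List the available compute devices and their IDs
./main --show_devices

# Device 0 takes the "Prolog" plus the first 5 layers;
# device 1 takes the remaining layers plus the "Epilog".
# Quote the argument so the shell does not split it at ';'.
./main -m model.bin -ngl "0:5,prolog;1:100,epilog"
```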
Known issues:

- Custom operators (`ggml::map_custom...`): if the hidden layers of a model use custom operators, GPU acceleration is unavailable.
- Models with `tie_word_embeddings = true`: ensure the "Prolog" and "Epilog" layers are on the same device.
- Other issues: if a model has 10 hidden layers and `-ngl 10` does not work, try `-ngl all`, `-ngl 10,epilog`, or `-ngl 9`.
- Having trouble with the Python binding on Windows with CUDA? Copy these DLLs to the `bindings` folder: `cublas64_12.dll`, `cudart64_12.dll`, `cublasLt64_12.dll`.
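A minimal sketch of that copy step, assuming the CUDA Toolkit installer has set the `CUDA_PATH` environment variable and you are in a Unix-like shell such as Git Bash (paths are assumptions; adjust for your installation):

```sh
# Copy the CUDA runtime DLLs next to the Python binding
cp "$CUDA_PATH/bin/cublas64_12.dll"   bindings/
cp "$CUDA_PATH/bin/cudart64_12.dll"   bindings/
cp "$CUDA_PATH/bin/cublasLt64_12.dll" bindings/
```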