In mistral.rs, device mapping is managed automatically to be as performant and easy to use as possible. Automatic device mapping is enabled by default in the CLI/server and Python API, and it makes no changes when the model fits entirely on the GPU.
Automatic device mapping works by prioritizing loading models into GPU memory, with any remaining parts loaded into CPU memory. Model architectures such as vision models, which greatly benefit from GPU acceleration, also automatically prioritize keeping those components on the GPU.
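For example, no extra flags are needed to use automatic device mapping; the following invocation (using the same model as the manual examples below) maps the model automatically:

```
cargo run --release --features cuda -- -i plain -m gradientai/Llama-3-8B-Instruct-262k -a llama
```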
If you want to manually device map the model (not recommended), please continue reading.
Note: Manual device mapping is deprecated in favor of automatic device mapping due to the possibility of user error.
There are 2 ways to do device mapping:
- Specify the number of layers to put on the GPU - this uses the GPU with ordinal 0.
- Specify the ordinals and number of layers - this allows for cross-GPU device mapping.
The format for the ordinals and number of layers is `ORD:NUM;...`, where `ORD` is the unique ordinal and `NUM` is the number of layers for that GPU. This may be repeated as many times as necessary.
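For example, `"0:16;1:16"` places 16 layers on GPU 0 and 16 layers on GPU 1. A hypothetical three-GPU mapping such as `"0:8;1:8;2:16"` would place 8 layers on each of GPUs 0 and 1 and 16 layers on GPU 2.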
Note: We refer to GPU layers as "device layers" throughout mistral.rs.
To map 16 layers onto each of GPUs 0 and 1:

```
cargo run --release --features cuda -- -n "0:16;1:16" -i plain -m gradientai/Llama-3-8B-Instruct-262k -a llama
```
Note: In the Python API, the `"0:16;1:16"` string is passed as the list `["0:16", "1:16"]`.
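As a minimal sketch of the Python side, assuming the `Runner` constructor accepts this list through a `num_device_layers` parameter (the parameter name and the surrounding arguments are illustrative, not a verbatim API reference):

```python
from mistralrs import Runner, Which, Architecture

# Illustrative sketch: num_device_layers and the Which.Plain fields are
# assumptions based on the "ORD:NUM" list format described above.
runner = Runner(
    which=Which.Plain(
        model_id="gradientai/Llama-3-8B-Instruct-262k",
        arch=Architecture.Llama,
    ),
    # Equivalent to the CLI's -n "0:16;1:16"
    num_device_layers=["0:16", "1:16"],
)
```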
To instead place 16 layers on GPU 0 (the default ordinal):

```
cargo run --release --features cuda -- -n 16 -i plain -m gradientai/Llama-3-8B-Instruct-262k -a llama
```
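Any layers beyond the specified count are placed in CPU memory, mirroring the fallback behavior of automatic device mapping described above.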