Support BF16 kvcache, rope and attentions for inference of GGUF/GGML models #1009
Merged
Conversation
Code Metrics Report
```
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Python                 63         2706         2338           71          297
 Shell                   1           57           22           18           17
 Plain Text              3         3723            0         2413         1310
 TOML                   18          605          539            2           64
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               43         3328            0         2523          805
 |- BASH                 6          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                12          406          344            0           62
 |- TOML                 2           75           63            0           12
 (Total)                           4043          626         2523          894
-------------------------------------------------------------------------------
 Rust                  296        89473        80286         1861         7326
 |- Markdown           143         1581           25         1436          120
 (Total)                          91054        80311         3297         7446
===============================================================================
 Total                 445       100094        83358         6900         9836
===============================================================================
```
EricLBuehler
requested changes
Dec 27, 2024
Thanks for the PR! I made a few comments.
…the model cannot terminate itself when running a GGUF file)
EricLBuehler
approved these changes
Dec 30, 2024
Thank you!
The current approach for inference of GGUF/GGML models requires the input data type to be F32, due to limitations in the kernel implementations for quantized matmul, which require an F32 input and quantized weights. As a side effect, the attention mechanisms (including SDPA, paged attention, and flash attention) and the KV cache must also be F32.

This PR addresses the issue by allowing the data type to be specified for GGUF/GGML models and casting the F32 results of quantized matmuls to BF16. The cast carries through to the subsequent attention operations, rotary embeddings, and KV cache storage. As a result, KV cache usage is reduced by half, and inference performance improves because matmuls and attention run in BF16.
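To illustrate the idea, here is a minimal sketch using the candle API that mistral.rs builds on. The helper name is hypothetical and the sketch assumes (as described above) that the quantized kernel takes an F32 activation and produces an F32 result; it is not the PR's actual implementation.

```rust
use candle_core::{DType, Result, Tensor};
use candle_core::quantized::QMatMul;

// Hypothetical helper: run a quantized matmul and cast its F32 result to
// BF16 so that downstream attention, rotary embeddings, and the KV cache
// can stay in BF16.
fn qmatmul_to_bf16(weight: &QMatMul, x_bf16: &Tensor) -> Result<Tensor> {
    // Up-cast the BF16 activation to the F32 input the quantized kernel expects.
    let x_f32 = x_bf16.to_dtype(DType::F32)?;
    // The quantized matmul produces an F32 result...
    let y_f32 = weight.forward(&x_f32)?;
    // ...which is immediately cast down to BF16 for the rest of the layer.
    y_f32.to_dtype(DType::BF16)
}
```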
Example
On Apple Silicon (Metal), run the following example with throughput logging enabled, paged attention settings (limiting the KV cache to a maximum of 4 GB), and the data type set to BF16 (for the KV cache, rotary embeddings, and attention):

You may also run without paged attention:
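To make the "KV cache usage is reduced by half" claim concrete, here is a small back-of-the-envelope sketch. The model shape numbers are hypothetical (a Llama-8B-like configuration), not taken from the PR.

```rust
// Bytes of KV cache per token: K and V are each [num_kv_heads, head_dim]
// per layer, stored at dtype_bytes per element.
fn kv_cache_bytes_per_token(layers: usize, kv_heads: usize, head_dim: usize, dtype_bytes: usize) -> usize {
    2 * layers * kv_heads * head_dim * dtype_bytes
}

fn main() {
    // Hypothetical shapes for illustration only.
    let (layers, kv_heads, head_dim) = (32, 8, 128);
    let f32_per_tok = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 4);
    let bf16_per_tok = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 2);
    // With a 4 GB cap on the paged KV cache, BF16 fits roughly twice as many
    // tokens as F32.
    let cap: usize = 4 * 1024 * 1024 * 1024;
    println!("F32: {} tokens, BF16: {} tokens", cap / f32_per_tok, cap / bf16_per_tok);
}
```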
Notes
The data type can be set to f32 or bf16. f16 is not recommended because its lower precision can adversely affect the accuracy of quantized inference.
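For illustration only, selecting the dtype might look like the sketch below; the function name, accepted strings, and error message are assumptions, not the PR's actual CLI handling.

```rust
use candle_core::DType;

// Hypothetical dtype selection for GGUF/GGML inference.
fn select_dtype(s: &str) -> Result<DType, String> {
    match s.to_ascii_lowercase().as_str() {
        "f32" => Ok(DType::F32),
        "bf16" => Ok(DType::BF16),
        // Accepted here for completeness, but discouraged per the note above.
        "f16" => Ok(DType::F16),
        other => Err(format!("unsupported dtype for GGUF/GGML inference: {other}")),
    }
}
```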