
Support BF16 kvcache, rope and attentions for inference of GGUF/GGML models #1009

Merged: 5 commits into EricLBuehler:master on Dec 30, 2024

Conversation

guoqingbao (Contributor) commented:

The current approach for inference of GGUF/GGML models requires the input data type to be F32, because the kernels for quantized matmul expect an F32 activation and quantized weights. A side effect of this limitation is that the attention mechanisms (SDPA, paged attention, and flash attention) and the KV cache must also run in F32.

This PR addresses the issue by allowing the data type to be specified for GGUF/GGML models and casting the F32 output of each quantized matmul to BF16. The cast propagates to the subsequent attention operations, rotary embeddings, and KV-cache storage. As a result, KV-cache usage is halved, and inference performance improves because attention and the surrounding matmuls run in BF16.
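
To make the flow concrete, here is a minimal sketch of the pattern, assuming a candle-style tensor API (mistral.rs builds on candle); quantized_matmul_f32 is a hypothetical stand-in for the real quantized-matmul kernel call, not the PR's actual code:

use candle_core::{DType, Result, Tensor};

// Hypothetical stand-in for the quantized-matmul kernel, which
// requires an F32 activation and produces an F32 result.
fn quantized_matmul_f32(input_f32: &Tensor) -> Result<Tensor> {
    // Kernel call elided; identity keeps the sketch self-contained.
    Ok(input_f32.clone())
}

// Sketch of the pattern this PR introduces: keep the quantized matmul
// in F32 (as the kernels require), then cast its output once so that
// rotary embeddings, attention (SDPA/paged/flash), and KV-cache writes
// all run in the requested dtype (e.g. BF16), halving KV-cache memory.
fn project_and_cast(hidden: &Tensor, target_dtype: DType) -> Result<Tensor> {
    let x_f32 = hidden.to_dtype(DType::F32)?;
    let y_f32 = quantized_matmul_f32(&x_f32)?;
    y_f32.to_dtype(target_dtype)
}

Casting once at the matmul boundary keeps the quantized kernels unchanged while everything downstream benefits from the narrower type.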

Example

On Apple Silicon (Metal), build and run the following example with throughput logging enabled, paged attention configured to cap the KV cache at 4 GB, and the data type set to BF16 (used for the KV cache, rotary embeddings, and attention):

cargo build --release --features metal
./target/release/mistralrs-server -i --throughput --paged-attn --pa-gpu-mem 4096 gguf --dtype bf16 -m /Users/Downloads/ -f Phi-3.5-mini-instruct-Q4_K_M.gguf

You may also run without paged attention:

./target/release/mistralrs-server -i --throughput gguf --dtype bf16 -m /Users/Downloads/ -f Phi-3.5-mini-instruct-Q4_K_M.gguf

Notes

  • For quantized GGUF/GGML models, you may specify the data type as either f32 or bf16 (a hedged sketch of this dtype mapping follows this list).
  • The use of f16 is not recommended: its lower precision can degrade the accuracy of quantized inference.
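
As a rough illustration of the note above, here is a hypothetical helper that maps a dtype string onto a candle DType; this is not the PR's actual flag parser, and the warning path is an assumption made for the sketch:

use candle_core::DType;

// Hypothetical helper (not the PR's actual parser): map a --dtype
// string onto a candle DType for quantized GGUF/GGML inference.
fn parse_gguf_dtype(s: &str) -> Result<DType, String> {
    match s.to_ascii_lowercase().as_str() {
        "f32" => Ok(DType::F32),
        "bf16" => Ok(DType::BF16),
        // f16 is accepted here but discouraged: its lower precision
        // can degrade the accuracy of quantized inference.
        "f16" => {
            eprintln!("warning: f16 is not recommended for quantized inference; prefer bf16");
            Ok(DType::F16)
        }
        other => Err(format!("unsupported dtype: {other}")),
    }
}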

github-actions bot commented Dec 27, 2024

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Python                 63         2706         2338           71          297
 Shell                   1           57           22           18           17
 Plain Text              3         3723            0         2413         1310
 TOML                   18          605          539            2           64
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               43         3328            0         2523          805
 |- BASH                 6          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                12          406          344            0           62
 |- TOML                 2           75           63            0           12
 (Total)                           4043          626         2523          894
-------------------------------------------------------------------------------
 Rust                  296        89473        80286         1861         7326
 |- Markdown           143         1581           25         1436          120
 (Total)                          91054        80311         3297         7446
===============================================================================
 Total                 445       100094        83358         6900         9836
===============================================================================

@EricLBuehler (Owner) left a comment:

Thanks for the PR! I made a few comments.

mistralrs-core/src/model_loader.rs (outdated, resolved)
mistralrs-core/src/toml_selector.rs (outdated, resolved)
mistralrs-core/src/pipeline/paths.rs (outdated, resolved)
mistralrs-server/src/main.rs (resolved)
@EricLBuehler (Owner) left a comment:

Thank you!

@EricLBuehler merged commit 1880c0b into EricLBuehler:master on Dec 30, 2024.
12 checks passed