Support BF16 kvcache, rope and attentions for inference of GGUF/GGML models #1009
Merged
Conversation
Code Metrics Report
```
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Python                 63         2706         2338           71          297
 Shell                   1           57           22           18           17
 Plain Text              3         3723            0         2413         1310
 TOML                   18          605          539            2           64
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               43         3328            0         2523          805
 |- BASH                 6          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                12          406          344            0           62
 |- TOML                 2           75           63            0           12
 (Total)                           4043          626         2523          894
-------------------------------------------------------------------------------
 Rust                  296        89473        80286         1861         7326
 |- Markdown           143         1581           25         1436          120
 (Total)                          91054        80311         3297         7446
===============================================================================
 Total                 445       100094        83358         6900         9836
===============================================================================
```
EricLBuehler
requested changes
Dec 27, 2024
Thanks for the PR! I made a few comments.
…the model cannot terminate itself when running a GGUF file)
EricLBuehler
approved these changes
Dec 30, 2024
Thank you!
The current approach for inference of GGUF/GGML models requires the input data type to be F32, due to limitations in the kernel implementations for quantized matmul, which require an F32 input and quantized weights. As a side effect, the attention mechanisms (including SDPA, paged attention, and flash attention) and the KV cache must also be F32.

This PR addresses the issue by allowing the data type to be specified for GGUF/GGML models and casting the F32 results of quantized matmuls to BF16. The cast carries through to the subsequent attention operations, rotary embeddings, and KV cache storage. As a result, KV cache usage is reduced by half, and inference performance improves because matmuls and attention run in BF16.
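To illustrate the idea, here is a minimal sketch using the candle API that mistral.rs builds on. The helper name is hypothetical and the sketch assumes (as described above) that the quantized kernel takes an F32 activation and produces an F32 result; it is not the PR's actual implementation.

```rust
use candle_core::{DType, Result, Tensor};
use candle_core::quantized::QMatMul;

// Hypothetical helper: run a quantized matmul and cast its F32 result to
// BF16 so that downstream attention, rotary embeddings, and the KV cache
// can stay in BF16.
fn qmatmul_to_bf16(weight: &QMatMul, x_bf16: &Tensor) -> Result<Tensor> {
    // Up-cast the BF16 activation to the F32 input the quantized kernel expects.
    let x_f32 = x_bf16.to_dtype(DType::F32)?;
    // The quantized matmul produces an F32 result...
    let y_f32 = weight.forward(&x_f32)?;
    // ...which is immediately cast down to BF16 for the rest of the layer.
    y_f32.to_dtype(DType::BF16)
}
```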
Example
On Apple Silicon (Metal), run the following example with throughput logging enabled, paged attention settings (limiting the KV cache to a maximum of 4 GB), and the data type set to BF16 (for the KV cache, rotary embeddings, and attention):

You may also run without paged attention:
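To make the "KV cache usage is reduced by half" claim concrete, here is a small back-of-the-envelope sketch. The model shape numbers are hypothetical (a Llama-8B-like configuration), not taken from the PR.

```rust
// Bytes of KV cache per token: K and V are each [num_kv_heads, head_dim]
// per layer, stored at dtype_bytes per element.
fn kv_cache_bytes_per_token(layers: usize, kv_heads: usize, head_dim: usize, dtype_bytes: usize) -> usize {
    2 * layers * kv_heads * head_dim * dtype_bytes
}

fn main() {
    // Hypothetical shapes for illustration only.
    let (layers, kv_heads, head_dim) = (32, 8, 128);
    let f32_per_tok = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 4);
    let bf16_per_tok = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 2);
    // With a 4 GB cap on the paged KV cache, BF16 fits roughly twice as many
    // tokens as F32.
    let cap: usize = 4 * 1024 * 1024 * 1024;
    println!("F32: {} tokens, BF16: {} tokens", cap / f32_per_tok, cap / bf16_per_tok);
}
```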
Notes
The data type can be set to f32 or bf16. f16 is not recommended because its lower precision can adversely affect the accuracy of quantized inference.
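For illustration only, selecting the dtype might look like the sketch below; the function name, accepted strings, and error message are assumptions, not the PR's actual CLI handling.

```rust
use candle_core::DType;

// Hypothetical dtype selection for GGUF/GGML inference.
fn select_dtype(s: &str) -> Result<DType, String> {
    match s.to_ascii_lowercase().as_str() {
        "f32" => Ok(DType::F32),
        "bf16" => Ok(DType::BF16),
        // Accepted here for completeness, but discouraged per the note above.
        "f16" => Ok(DType::F16),
        other => Err(format!("unsupported dtype for GGUF/GGML inference: {other}")),
    }
}
```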