KV Cache Quantization #971
Hi @dinerburger! After some recent work on the KV cache, I think we now have the infrastructure for this! I'll take a look again and will probably merge some initial support.
Obviously there are a number of ways to implement KV cache quant, but I'd be interested in knowing which implementation you're considering.
I'm considering two options. The 8-bit cache using FP8 might be easier to implement.
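For anyone curious what "an 8-bit cache using FP8" would actually store, here's a toy sketch of the E4M3 value grid (a decoder plus a brute-force nearest-code encoder). Rust has no native FP8 type, so this is purely illustrative; it is not how mistral.rs or a real kernel would do the conversion, and the function names are made up.

```rust
// Decode one E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
// Exponent field 0 is subnormal; exponent 15 with mantissa 7 is NaN (E4M3 has
// no infinities), and the largest finite value is 448.
fn e4m3_to_f32(b: u8) -> f32 {
    let sign = if b & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((b >> 3) & 0x0f) as i32;
    let man = (b & 0x07) as i32;
    if exp == 15 && man == 7 {
        f32::NAN
    } else if exp == 0 {
        sign * (man as f32 / 8.0) * (-6.0f32).exp2() // subnormal range
    } else {
        sign * (1.0 + man as f32 / 8.0) * ((exp - 7) as f32).exp2()
    }
}

// Encode by brute force: pick the finite E4M3 code whose value is closest to x.
// Real kernels use bit manipulation; this only shows the value grid that a
// KV cache entry would be snapped onto.
fn f32_to_e4m3(x: f32) -> u8 {
    (0u8..=255)
        .filter(|&b| !e4m3_to_f32(b).is_nan())
        .min_by(|&a, &b| {
            (e4m3_to_f32(a) - x)
                .abs()
                .partial_cmp(&(e4m3_to_f32(b) - x).abs())
                .unwrap()
        })
        .unwrap()
}

fn main() {
    for v in [0.0371f32, 1.5, 300.0, -0.2] {
        let q = f32_to_e4m3(v);
        println!("{v} -> 0x{q:02x} -> {}", e4m3_to_f32(q));
    }
}
```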
@sammcj I saw your recent PR merge to ollama supporting KV cache quantization - congrats! Which approach did you take (did you do anything special to quantize the K/V blocks)?
Perfect, yeah I was gonna recommend the Hadamard transform approach. It's easy and effective. I followed that PR pretty closely; @sammcj piggy-backed on llama.cpp's implementation, utilizing either
Thanks @EricLBuehler! It was simple compared to the efforts you'll be putting in, I'm sure, as llama.cpp does the heavy lifting of performing the quantisation. The changes to Ollama were mainly around parameterising the Ollama components to make use of it, some memory management for their layer estimation/placement, and a lot of, shall we say, 'soft skills' to get it across the line 😅 You can see the initial changes (bundled with FA support) in llama.cpp here: ggerganov/llama.cpp#7527

While 4-bit works well for Exllamav2's KV cache, the quantisation that works well with llama.cpp/GGUF is Q8_0, which is approximately 8.5 bpw. I've published an F16 vs Q8_0 KV perplexity measurement here (I might add Q4_0 and another dataset variant as well in the next day or two).

Forgive my ignorance here - when you say int4/int8, are you talking about quantising down to 4/8-bit integers, or simply rounding to them? I ask because int4/int8 models tend to be quite a bit lower quality than their quantised counterparts such as Q4_K_M/Q8_0.
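To make the ~8.5 bpw figure concrete: a Q8_0 block is 32 values sharing one scale, and llama.cpp stores that scale as an f16, so a block costs 32 × 8 + 16 = 272 bits, i.e. 8.5 bits per value. The sketch below is illustrative only; the names are not llama.cpp's or mistral.rs's, and the scale is kept as f32 just to avoid a half-precision dependency.

```rust
/// Illustrative Q8_0-style block: 32 values share one scale.
/// llama.cpp keeps the scale in f16 (16 bits), so a block is
/// 32 * 8 + 16 = 272 bits for 32 values, i.e. 8.5 bits per value.
const BLOCK: usize = 32;

struct BlockQ8 {
    d: f32,          // per-block scale (f16 in llama.cpp)
    qs: [i8; BLOCK], // quantized values
}

fn quantize_block(x: &[f32; BLOCK]) -> BlockQ8 {
    // Absmax scaling: the largest magnitude in the block maps to 127.
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let d = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    let mut qs = [0i8; BLOCK];
    for (q, v) in qs.iter_mut().zip(x.iter()) {
        *q = (v / d).round().clamp(-127.0, 127.0) as i8;
    }
    BlockQ8 { d, qs }
}

fn dequantize_block(b: &BlockQ8) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, &q) in out.iter_mut().zip(b.qs.iter()) {
        *o = q as f32 * b.d;
    }
    out
}

fn main() {
    let bits_per_value = (BLOCK * 8 + 16) as f32 / BLOCK as f32;
    println!("Q8_0 storage cost: {bits_per_value} bits per value"); // 8.5

    // Round-trip a toy block to see the quantization error.
    let mut x = [0.0f32; BLOCK];
    for (i, v) in x.iter_mut().enumerate() {
        *v = (i as f32 * 0.37).sin();
    }
    let y = dequantize_block(&quantize_block(&x));
    let max_err = x
        .iter()
        .zip(y.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!("max round-trip error: {max_err:.5}");
}
```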
@sammcj @dinerburger sorry for the late reply! I've begun work in #988.
I'll be using Q4_K_M and Q8_0 plus a Hadamard transform in the kernel for better distributions, not int4/int8.
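For anyone unfamiliar with the trick: a Hadamard transform is an orthogonal rotation, so you can rotate the K/V vectors, quantize in the rotated space where outliers are spread across all coordinates, and rotate back after dequantization. The toy Rust sketch below only illustrates that idea with plain absmax int8; it makes no claim about the actual kernel in #988 (names, sizes, and the quantization step are all made up for illustration).

```rust
// In-place fast Walsh–Hadamard transform (length must be a power of two).
// With the 1/sqrt(n) normalization the transform is orthonormal, so applying
// it twice recovers the original vector.
fn fwht(v: &mut [f32]) {
    let n = v.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let norm = (n as f32).sqrt();
    for x in v.iter_mut() {
        *x /= norm;
    }
}

// Simple absmax int8 quantization: the largest magnitude maps to 127.
fn quantize_absmax_i8(v: &[f32]) -> (f32, Vec<i8>) {
    let amax = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    let q = v
        .iter()
        .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (scale, q)
}

fn main() {
    // A vector with one large outlier among small values: quantizing it
    // directly spends the 8-bit range on the outlier and crushes the small
    // entries. Rotating first spreads the outlier's energy across all
    // coordinates, so the same 8-bit grid covers everything more evenly.
    let mut original = vec![0.01f32; 16];
    original[3] = 4.0;

    // Direct quantization.
    let (s, q) = quantize_absmax_i8(&original);
    let direct: Vec<f32> = q.iter().map(|&x| x as f32 * s).collect();

    // Rotate -> quantize -> dequantize -> rotate back.
    let mut rotated = original.clone();
    fwht(&mut rotated);
    let (s, q) = quantize_absmax_i8(&rotated);
    let mut via_hadamard: Vec<f32> = q.iter().map(|&x| x as f32 * s).collect();
    fwht(&mut via_hadamard); // orthonormal transform is its own inverse

    let err = |a: &[f32]| -> f32 {
        a.iter().zip(&original).map(|(x, y)| (x - y).abs()).sum()
    };
    println!("abs error, direct:   {:.4}", err(&direct));
    println!("abs error, hadamard: {:.4}", err(&via_hadamard));
}
```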
Sounds like an interesting idea. I'm curious if we can do something similar after the initial KV cache quantization support is merged.
Yeah, you can see the supported quant types here: https://github.com/ggerganov/llama.cpp/blob/26a8406ba9198eb6fdd8329fa717555b4f77f05f/common/common.cpp#L1018. One note, however, if you want to experiment: compile llama.cpp with
Both exllamav2 and llama.cpp support a quantized KV cache, which allows pretty large context lengths on consumer hardware. It would be a great addition to mistral.rs; I've been very interested in trying it, but I'm limited to 24 GB of VRAM, which forces me, for example, to send the KV cache to system RAM instead of keeping it on the card (only possible with llama.cpp, to my knowledge).