There has been increased granularity added to the context sizes, with options for 3072 and 6144 added as well. For context size, the problem is that not all buffers scale linearly. Right now I have to manually test each combination of buffer sizes and context lengths for multiple model sizes to make sure it doesn't run out of memory even at max context, so having this value be fully user-customizable would make it much harder to test completely. Even llama.cpp upstream uses fixed buffer sizes for their caches for known models; only recently have they started exploring varying them for long contexts.
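To make the scaling point concrete, here is a rough back-of-the-envelope sketch in Python (not koboldcpp's actual allocator code). The f16 KV cache is the part that grows linearly with context, using the usual 2 × n_layer × n_ctx × n_embd × 2 bytes arithmetic and the published LLaMA-1 model shapes; the per-model scratch/compute buffers are the fixed sizes that have to be re-validated at the largest context they might be paired with.

```python
# Back-of-the-envelope sketch (not koboldcpp's actual allocator): the f16 KV
# cache scales linearly with context; the per-model scratch buffers do not
# follow a simple formula like this and must be hand-tuned and re-tested.
# Model shapes below are the published LLaMA-1 configs.

MODELS = {  # name: (n_layer, n_embd)
    "7B":  (32, 4096),
    "13B": (40, 5120),
    "33B": (60, 6656),
}

def kv_cache_bytes(n_layer: int, n_embd: int, n_ctx: int, bytes_per_elem: int = 2) -> int:
    """K and V, one f16 value per layer, per position, per embedding dimension."""
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem

for name, (n_layer, n_embd) in MODELS.items():
    for n_ctx in (2048, 3072, 4096, 6144, 8192):
        mib = kv_cache_bytes(n_layer, n_embd, n_ctx) / 2**20
        print(f"{name} @ n_ctx={n_ctx:<5d} -> KV cache ~{mib:6.0f} MiB")
```

The scratch buffers are exactly the part that doesn't reduce to a formula like this, which is the testing burden described above.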
Now that we have more llama_v1 models with 4k context and beyond thanks to linear/NTK interpolations, and with the 2x larger possibilities offered by llama_v2 and its base context of 4k, it would be great, if possible, to be able to specify the context length manually or in finer increments, so that the scratch buffer and KV buffer allocation sizes can be matched to one's hardware and the longest possible context can be used.
For example, I can only run a context of 4096 with my GeForce 3090 on a 33b model in Q3_K_M with all 63 layers in VRAM. I could probably run a context of 4352 on the same model, and 4864-5120 in Q3_K_S.
Without trying to technically compare two different quantization systems, Oobabooga offers such easy context-size customization in steps of 256 for GPTQ models on Exllama, for example, directly influencing how much VRAM the loaded model and its buffers take. Would such context-size stepping of 256, or even 128, be possible on llamacpp/koboldcpp?
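For what it's worth, a 256-step option could amount to little more than snapping the requested value to a multiple of 256 before the buffers are sized. This is purely a hypothetical sketch (the bounds and the helper are made up, not koboldcpp's actual launcher logic):

```python
# Hypothetical sketch of a 256-step context option: snap whatever the user
# asks for to a multiple of 256 before sizing the buffers. Not actual
# koboldcpp behaviour; the bounds are made up.
STEP = 256
MIN_CTX, MAX_CTX = 2048, 16384

def snap_context(requested: int) -> int:
    """Clamp the requested context length and round it down to a multiple of STEP."""
    clamped = max(MIN_CTX, min(MAX_CTX, requested))
    return (clamped // STEP) * STEP

print(snap_context(4352))  # 4352 (already a multiple of 256)
print(snap_context(5000))  # 4864
```

Using the same KV-cache arithmetic as the sketch above, each 256-token step on a 33B-shaped model costs roughly 2 × 60 × 256 × 6656 × 2 bytes ≈ 390 MiB at f16, so a step or two either way can decide whether a given card still fits.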
As an aside, a bit of a daydream, and maybe this is more about llamacpp than koboldcpp, and more about my ignorance of the technical side than anything else: is it remotely possible for the KV buffer's size to be scaled based on the model's quantization type, and not only on its base parameter count (13b, 33b, etc.) and context size, in order to have more reasonable KV buffer sizes and thus use higher context sizes for the same amount of VRAM?
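For illustration only: as far as I understand it, the KV cache is stored as f16 regardless of how the weights are quantized, so Q3_K_M vs Q3_K_S changes the weight footprint but not the cache. The saving being daydreamed about would have to come from storing the cache entries themselves in a smaller type; the hypothetical comparison below just scales the same arithmetic by bytes per element (the 1-byte column is an assumption, not an existing option):

```python
# Rough illustration only: the KV cache is normally f16 regardless of weight
# quantization. A smaller cache element type (hypothetical 1-byte entries
# here) is what would actually shrink it.
N_LAYER, N_EMBD = 60, 6656  # LLaMA-1 33B shape

def kv_cache_mib(n_ctx: int, bytes_per_elem: float) -> float:
    return 2 * N_LAYER * n_ctx * N_EMBD * bytes_per_elem / 2**20

for n_ctx in (4096, 6144, 8192):
    print(f"n_ctx={n_ctx}: f16 ~{kv_cache_mib(n_ctx, 2):.0f} MiB, "
          f"8-bit ~{kv_cache_mib(n_ctx, 1):.0f} MiB")
```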