There has been increased granularity added to the context sizes, with options for 3072 and 6144 added as well. For context size, the problem is that not all buffers scale linearly. Right now I have to manually test each combination of buffer sizes and context lengths for multiple model sizes to make sure it doesn't run out of memory even at max context, so having this value be fully user-customizable would make it much harder to test completely. Even llama.cpp upstream uses fixed buffer sizes for their caches for known models; only recently have they started exploring varying them for long contexts.
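To make the scaling point concrete, here is a rough back-of-the-envelope sketch in Python (not koboldcpp's actual allocator code). The f16 KV cache is the part that grows linearly with context, using the usual 2 × n_layer × n_ctx × n_embd × 2 bytes arithmetic and the published LLaMA-1 model shapes; the per-model scratch/compute buffers are the fixed sizes that have to be re-validated at the largest context they might be paired with.

```python
# Back-of-the-envelope sketch (not koboldcpp's actual allocator): the f16 KV
# cache scales linearly with context; the per-model scratch buffers do not
# follow a simple formula like this and must be hand-tuned and re-tested.
# Model shapes below are the published LLaMA-1 configs.

MODELS = {  # name: (n_layer, n_embd)
    "7B":  (32, 4096),
    "13B": (40, 5120),
    "33B": (60, 6656),
}

def kv_cache_bytes(n_layer: int, n_embd: int, n_ctx: int, bytes_per_elem: int = 2) -> int:
    """K and V, one f16 value per layer, per position, per embedding dimension."""
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem

for name, (n_layer, n_embd) in MODELS.items():
    for n_ctx in (2048, 3072, 4096, 6144, 8192):
        mib = kv_cache_bytes(n_layer, n_embd, n_ctx) / 2**20
        print(f"{name} @ n_ctx={n_ctx:<5d} -> KV cache ~{mib:6.0f} MiB")
```

The scratch buffers are exactly the part that doesn't reduce to a formula like this, which is the testing burden described above.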
Now that we have more llama_v1 models with 4k context and beyond thanks to linear/NTK interpolations, and with the 2x larger possibilities offered by llama_v2 and its base context of 4k, it would be great, if possible, to be able to specify the context length manually or in finer increments, so that the scratch buffer and KV buffer allocation sizes can be matched to one's hardware and the longest possible context can be used.
For example, I can only run a context of 4096 with my GeForce 3090 on a 33b model in Q3_K_M with all 63 layers in VRAM. I could probably run a context of 4352 on the same model, and 4864-5120 in Q3_K_S.
Without trying to technically compare two different quantization systems, Oobabooga offers such easy context-size customization in steps of 256 for GPTQ models on Exllama, for example, directly influencing how much VRAM the loaded model and its buffers take. Would such context-size stepping of 256, or even 128, be possible on llamacpp/koboldcpp?
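For what it's worth, a 256-step option could amount to little more than snapping the requested value to a multiple of 256 before the buffers are sized. This is purely a hypothetical sketch (the bounds and the helper are made up, not koboldcpp's actual launcher logic):

```python
# Hypothetical sketch of a 256-step context option: snap whatever the user
# asks for to a multiple of 256 before sizing the buffers. Not actual
# koboldcpp behaviour; the bounds are made up.
STEP = 256
MIN_CTX, MAX_CTX = 2048, 16384

def snap_context(requested: int) -> int:
    """Clamp the requested context length and round it down to a multiple of STEP."""
    clamped = max(MIN_CTX, min(MAX_CTX, requested))
    return (clamped // STEP) * STEP

print(snap_context(4352))  # 4352 (already a multiple of 256)
print(snap_context(5000))  # 4864
```

Using the same KV-cache arithmetic as the sketch above, each 256-token step on a 33B-shaped model costs roughly 2 × 60 × 256 × 6656 × 2 bytes ≈ 390 MiB at f16, so a step or two either way can decide whether a given card still fits.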
As an aside, a bit of a daydream, and maybe this is more about llamacpp than koboldcpp, and more about my ignorance of the technical side than anything else: is it remotely possible for the KV buffer's size to be scaled based on the model's quantization type, and not only on its base parameter count (13b, 33b, etc.) and context size, in order to have more reasonable KV buffer sizes and thus use higher context sizes for the same amount of VRAM?
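For illustration only: as far as I understand it, the KV cache is stored as f16 regardless of how the weights are quantized, so Q3_K_M vs Q3_K_S changes the weight footprint but not the cache. The saving being daydreamed about would have to come from storing the cache entries themselves in a smaller type; the hypothetical comparison below just scales the same arithmetic by bytes per element (the 1-byte column is an assumption, not an existing option):

```python
# Rough illustration only: the KV cache is normally f16 regardless of weight
# quantization. A smaller cache element type (hypothetical 1-byte entries
# here) is what would actually shrink it.
N_LAYER, N_EMBD = 60, 6656  # LLaMA-1 33B shape

def kv_cache_mib(n_ctx: int, bytes_per_elem: float) -> float:
    return 2 * N_LAYER * n_ctx * N_EMBD * bytes_per_elem / 2**20

for n_ctx in (4096, 6144, 8192):
    print(f"n_ctx={n_ctx}: f16 ~{kv_cache_mib(n_ctx, 2):.0f} MiB, "
          f"8-bit ~{kv_cache_mib(n_ctx, 1):.0f} MiB")
```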