Advanced inference engine features #245
-
@kir-gadjello thanks for this. I really appreciate it.
mistral.rs supports regex- and yacc-based grammars, so a big part of this can be implemented "in userland". I was hoping to build something that translates a JSON schema into a yacc grammar automatically; it's not too different from how outlines does it (see the sketch below).

You mention both different quants and paged attention. Should mistral.rs add paging, we'd need to change all attention kernels to be aware of the memory paging. An option would be to copy the paging implementation from sglang or vllm, but then we would be limited to using kernels from either project, because the kernels must be aware of the KV paging.
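For a flavor of what such a translation looks like, here is a minimal sketch in Rust, assuming the `serde_json` crate. The function name and the tiny schema subset it supports are hypothetical; this mirrors the spirit of outlines' schema-to-pattern approach rather than mistral.rs's actual grammar API:

```rust
// Hypothetical sketch: compile a tiny JSON-schema subset into a regex
// that constrains generated text to schema-conforming JSON.
fn schema_to_regex(schema: &serde_json::Value) -> String {
    match schema["type"].as_str() {
        Some("integer") => r"-?\d+".to_string(),
        Some("string") => r#""[^"]*""#.to_string(),
        Some("object") => {
            let props = schema["properties"].as_object().unwrap();
            // serde_json's default map is sorted by key; a real converter
            // would handle arbitrary key order and optional properties.
            let fields: Vec<String> = props
                .iter()
                .map(|(k, v)| format!(r#""{}"\s*:\s*{}"#, k, schema_to_regex(v)))
                .collect();
            format!(r"\{{\s*{}\s*\}}", fields.join(r"\s*,\s*"))
        }
        _ => r".*".to_string(), // unsupported types left unconstrained
    }
}

fn main() {
    let schema = serde_json::json!({
        "type": "object",
        "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "string" }
        }
    });
    // Prints a regex that only matches objects with an integer "age"
    // and a string "name".
    println!("{}", schema_to_regex(&schema));
}
```

A full converter would additionally handle arrays, enums, nested optionality, and recursion, at which point emitting a yacc/EBNF grammar instead of a single regex becomes necessary.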
-
@kir-gadjello, thank you for listing these out! I have gone through each point and responded below.
@lucasavila00 can you please elaborate on how different quants and paged attention would interact?
-
Paged attention and radix attention work by storing the KV cache of each token in a pool of slots; call it the KV array. At inference time, the model receives a list of indexes into that array as its KV cache, not tensors holding the data itself, and the attention kernels dynamically follow those indexes to fetch entries of the KV array during the attention calculation. A contiguous tensor holding the whole KV cache is never created. So if we take the AWQ kernels from https://github.com/casper-hansen/AutoAWQ_kernels, which operate on contiguous tensors, we can't use them with paged attention without rewriting the attention part to work with the index indirection.
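As a concrete illustration, here is a minimal Rust sketch of that indirection. All names and types here are hypothetical, not mistral.rs's or vLLM's actual data structures:

```rust
const BLOCK_SIZE: usize = 16; // tokens per KV block
const HEAD_DIM: usize = 4;    // tiny head dim, for illustration only

/// Global pool of KV blocks shared by all sequences ("the KV array").
/// Keys only, for brevity; values work the same way.
struct KvPool {
    blocks: Vec<[[f32; HEAD_DIM]; BLOCK_SIZE]>,
}

/// Per-sequence block table: logical token position -> physical block.
struct Sequence {
    block_table: Vec<usize>,
    len: usize,
}

impl Sequence {
    /// Fetch the key vector for logical token `pos` through the indirection.
    /// A contiguous-tensor kernel (e.g. the AutoAWQ attention kernels)
    /// cannot do this: it expects all keys laid out back-to-back in memory.
    fn key_at<'a>(&self, pool: &'a KvPool, pos: usize) -> &'a [f32; HEAD_DIM] {
        let block = self.block_table[pos / BLOCK_SIZE];
        &pool.blocks[block][pos % BLOCK_SIZE]
    }
}

fn main() {
    let pool = KvPool {
        blocks: vec![[[0.0; HEAD_DIM]; BLOCK_SIZE]; 8],
    };
    // This sequence's 32 tokens live in non-adjacent physical blocks 5 and 2.
    let seq = Sequence { block_table: vec![5, 2], len: 32 };
    // Attention must go through key_at for every one of seq.len tokens;
    // it never sees a single contiguous [len, HEAD_DIM] tensor.
    let k = seq.key_at(&pool, 20); // token 20 -> block_table[1] = 2, slot 4
    assert_eq!(k.len(), HEAD_DIM);
    let _ = seq.len;
}
```

The point is that every key/value access goes through the block table, and that lookup is exactly what a contiguous-tensor kernel would have to be rewritten to support.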
-
@kir-gadjello @lucasavila00 we now have PagedAttention!
-
Advanced inference engine features I think many will find desirable:
Research Territory:
Thanks for hearing me out!