Advanced inference engine features #245
-
@kir-gadjello thanks for this. I really appreciate it.
mistral.rs supports regex- and yacc-based grammars, so a big part of this can be implemented "in userland". I was hoping to build something that translates a JSON schema into a yacc grammar automatically; it's not too different from how outlines does it (see the sketch below).

You mention both different quants and paged attention. Should mistral.rs add paging, we'd need to change all attention kernels to be aware of the memory paging. An option would be to copy the paging implementation from sglang or vllm, but then we would be limited to using kernels from either project, because the kernels must be aware of the KV paging.
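For a flavor of what such a translation looks like, here is a minimal sketch in Rust, assuming the `serde_json` crate. The function name and the tiny schema subset it supports are hypothetical; this mirrors the spirit of outlines' schema-to-pattern approach rather than mistral.rs's actual grammar API:

```rust
// Hypothetical sketch: compile a tiny JSON-schema subset into a regex
// that constrains generated text to schema-conforming JSON.
fn schema_to_regex(schema: &serde_json::Value) -> String {
    match schema["type"].as_str() {
        Some("integer") => r"-?\d+".to_string(),
        Some("string") => r#""[^"]*""#.to_string(),
        Some("object") => {
            let props = schema["properties"].as_object().unwrap();
            // serde_json's default map is sorted by key; a real converter
            // would handle arbitrary key order and optional properties.
            let fields: Vec<String> = props
                .iter()
                .map(|(k, v)| format!(r#""{}"\s*:\s*{}"#, k, schema_to_regex(v)))
                .collect();
            format!(r"\{{\s*{}\s*\}}", fields.join(r"\s*,\s*"))
        }
        _ => r".*".to_string(), // unsupported types left unconstrained
    }
}

fn main() {
    let schema = serde_json::json!({
        "type": "object",
        "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "string" }
        }
    });
    // Prints a regex that only matches objects with an integer "age"
    // and a string "name".
    println!("{}", schema_to_regex(&schema));
}
```

A full converter would additionally handle arrays, enums, nested optionality, and recursion, at which point emitting a yacc/EBNF grammar instead of a single regex becomes necessary.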
-
@kir-gadjello, thank you for listing these out! I have gone through each point and responded below.
@lucasavila00 can you please elaborate on how different quants and paged attention would interact?
-
Paged attention and radix attention work by storing the KV cache of each token in a pool of slots; call it the KV array. At inference time, the model receives a list of indexes into that array as its KV cache, not tensors holding the data itself, and the attention kernels dynamically follow those indexes to fetch entries of the KV array during the attention calculation. A contiguous tensor holding the whole KV cache is never created. So if we take the AWQ kernels from https://github.com/casper-hansen/AutoAWQ_kernels, which operate on contiguous tensors, we can't use them with paged attention without rewriting the attention part to work with the index indirection.
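As a concrete illustration, here is a minimal Rust sketch of that indirection. All names and types here are hypothetical, not mistral.rs's or vLLM's actual data structures:

```rust
const BLOCK_SIZE: usize = 16; // tokens per KV block
const HEAD_DIM: usize = 4;    // tiny head dim, for illustration only

/// Global pool of KV blocks shared by all sequences ("the KV array").
/// Keys only, for brevity; values work the same way.
struct KvPool {
    blocks: Vec<[[f32; HEAD_DIM]; BLOCK_SIZE]>,
}

/// Per-sequence block table: logical token position -> physical block.
struct Sequence {
    block_table: Vec<usize>,
    len: usize,
}

impl Sequence {
    /// Fetch the key vector for logical token `pos` through the indirection.
    /// A contiguous-tensor kernel (e.g. the AutoAWQ attention kernels)
    /// cannot do this: it expects all keys laid out back-to-back in memory.
    fn key_at<'a>(&self, pool: &'a KvPool, pos: usize) -> &'a [f32; HEAD_DIM] {
        let block = self.block_table[pos / BLOCK_SIZE];
        &pool.blocks[block][pos % BLOCK_SIZE]
    }
}

fn main() {
    let pool = KvPool {
        blocks: vec![[[0.0; HEAD_DIM]; BLOCK_SIZE]; 8],
    };
    // This sequence's 32 tokens live in non-adjacent physical blocks 5 and 2.
    let seq = Sequence { block_table: vec![5, 2], len: 32 };
    // Attention must go through key_at for every one of seq.len tokens;
    // it never sees a single contiguous [len, HEAD_DIM] tensor.
    let k = seq.key_at(&pool, 20); // token 20 -> block_table[1] = 2, slot 4
    assert_eq!(k.len(), HEAD_DIM);
    let _ = seq.len;
}
```

The point is that every key/value access goes through the block table, and that lookup is exactly what a contiguous-tensor kernel would have to be rewritten to support.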
-
@kir-gadjello @lucasavila00 we now have PagedAttention!
-
Advanced inference engine features I think many will find desirable:
Research Territory:
Thanks for hearing me out!