Inference becomes unbearably slow as sequence length grows. Without a cache, the attention module re-computes the key/value pairs for every past token each time a new token is generated, so the work per step grows with the sequence length. Caching the key/value pairs, so that each one is computed only once, eliminates this redundant work.
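For illustration, a minimal sketch of the idea (the class and attribute names here are hypothetical, not taken from our attention module): project K/V only for the newest token and append to a running cache, instead of re-projecting the whole prefix every step.

```python
import torch
import torch.nn.functional as F
from torch import nn


class CachedSelfAttention(nn.Module):
    """Single-head attention with a per-instance K/V cache (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.cached_k = None
        self.cached_v = None

    def forward(self, new_hidden: torch.Tensor) -> torch.Tensor:
        # new_hidden: (batch, 1, dim) -- only the newest token's hidden state.
        q = self.q_proj(new_hidden)
        k = self.k_proj(new_hidden)
        v = self.v_proj(new_hidden)
        if self.cached_k is None:
            self.cached_k, self.cached_v = k, v
        else:
            # Append the new K/V instead of re-projecting all past tokens.
            self.cached_k = torch.cat([self.cached_k, k], dim=1)
            self.cached_v = torch.cat([self.cached_v, v], dim=1)
        # Attend from the new token's query over all cached keys/values.
        return F.scaled_dot_product_attention(q, self.cached_k, self.cached_v)
```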
We likely want to implement something natively supported by the HF Transformers library, such as DynamicCache, but the actual integration will not be easy. It will require an extensive redesign of the current attention module, plus careful consideration of Hivemind, which completely breaks our ability to use the HF API in the intended way. In HF, key/value states are typically managed centrally and passed to the attention module through a past_key_values argument in the forward pass (see the sketch below); however, Hivemind does not allow dynamic arguments in the forward pass, and we almost certainly don't want to be passing caches around to peers on the Internet.
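For reference, the centrally-managed pattern in HF looks roughly like this. This is a sketch against a vanilla causal LM ("gpt2" is just a placeholder model, not ours); the point is that the caller owns the cache and threads it through past_key_values, which is exactly the flow Hivemind's fixed forward signature prevents.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

cache = DynamicCache()  # holds K/V for every layer, owned by the caller
input_ids = tokenizer("Hello", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        # The cache is passed in explicitly as a dynamic forward argument.
        out = model(input_ids, past_key_values=cache, use_cache=True)
        next_token = out.logits[:, -1:].argmax(dim=-1)
        cache = out.past_key_values  # updated cache comes back to the caller
        input_ids = next_token       # only the new token is fed next step
```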
I don't know what the solution is going to be here.