Implement K/V caching in the attention module #25

Open
Vectorrent opened this issue Nov 27, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@Vectorrent
Contributor

Inference becomes unbearably slow as sequence length grows. This is because the attention module must essentially re-compute key/value pairs for all past tokens on every new token. That is a ton of redundant work, which can be avoided by caching key/value pairs so they are only computed once.
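
For illustration, here is a minimal sketch of the idea (not this repo's code; the class and attribute names are hypothetical, and multi-head splitting and causal masking are omitted for brevity):

```python
import torch
import torch.nn.functional as F

class CachedSelfAttention(torch.nn.Module):
    """Hypothetical single-head attention with a module-owned K/V cache."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.cached_k = None  # (batch, past_len, dim)
        self.cached_v = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x holds only the *new* tokens; K/V for past tokens come from the cache.
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        if self.cached_k is not None:
            k = torch.cat([self.cached_k, k], dim=1)
            v = torch.cat([self.cached_v, v], dim=1)
        # Each key/value pair is computed exactly once, then reused.
        self.cached_k, self.cached_v = k, v
        return F.scaled_dot_product_attention(q, k, v)
```

With something like this in place, each decode step feeds in only the newest token, turning the per-step attention cost from quadratic re-computation into a single projection plus one attention pass over the cached history.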

Vectorrent added the enhancement label Nov 27, 2024
@Vectorrent
Contributor Author

We likely want to use something natively supported by the HF Transformers library, such as DynamicCache, but the actual process of integration will not be easy. It will require an extensive redesign of the current attention module, as well as consideration of Hivemind, which completely breaks our ability to use the HF API in the intended way. In HF, you would typically manage key/value states centrally and pass them to an attention module via a past_key_values argument in the forward pass; however, Hivemind does not allow dynamic arguments in the forward pass, and we almost certainly don't want to be passing caches around to peers on the Internet.
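
For reference, the HF-style flow looks roughly like this (a sketch under the standard DynamicCache API, not this project's code; attention_forward and the shapes are assumptions):

```python
import torch
from transformers import DynamicCache

# Hypothetical attention forward in the HF style; tensor shapes are
# (batch, heads, seq, head_dim). The cache object is owned by the
# caller and threaded through every forward call.
def attention_forward(q, k, v, layer_idx: int, past_key_values: DynamicCache):
    # DynamicCache.update appends this step's K/V for the given layer
    # and returns the full history, so attention sees all past tokens.
    k, v = past_key_values.update(k, v, layer_idx)
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

cache = DynamicCache()
# Each generation step would pass the same cache object back in:
#   out = attention_forward(q_step, k_step, v_step, layer_idx=0,
#                           past_key_values=cache)
# That per-call cache argument is exactly the kind of dynamic forward
# argument Hivemind's forward pass does not support.
```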

I don't know what the solution is going to be here.
