Inference becomes unbearably slow as sequence length grows. Without a cache, the attention module re-computes the key/value pairs for every past token each time a new token is generated, so the work per step grows with the sequence length. Caching the key/value pairs, so that each one is computed only once, eliminates this redundant work.
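For illustration, a minimal sketch of the idea (the class and attribute names here are hypothetical, not taken from our attention module): project K/V only for the newest token and append to a running cache, instead of re-projecting the whole prefix every step.

```python
import torch
import torch.nn.functional as F
from torch import nn


class CachedSelfAttention(nn.Module):
    """Single-head attention with a per-instance K/V cache (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.cached_k = None
        self.cached_v = None

    def forward(self, new_hidden: torch.Tensor) -> torch.Tensor:
        # new_hidden: (batch, 1, dim) -- only the newest token's hidden state.
        q = self.q_proj(new_hidden)
        k = self.k_proj(new_hidden)
        v = self.v_proj(new_hidden)
        if self.cached_k is None:
            self.cached_k, self.cached_v = k, v
        else:
            # Append the new K/V instead of re-projecting all past tokens.
            self.cached_k = torch.cat([self.cached_k, k], dim=1)
            self.cached_v = torch.cat([self.cached_v, v], dim=1)
        # Attend from the new token's query over all cached keys/values.
        return F.scaled_dot_product_attention(q, self.cached_k, self.cached_v)
```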
We likely want to implement something natively supported by the HF Transformers library, such as DynamicCache, but the actual integration will not be easy. It will require an extensive redesign of the current attention module, plus careful consideration of Hivemind, which completely breaks our ability to use the HF API in the intended way. In HF, key/value states are typically managed centrally and passed to the attention module through a past_key_values argument in the forward pass (see the sketch below); however, Hivemind does not allow dynamic arguments in the forward pass, and we almost certainly don't want to be passing caches around to peers on the Internet.
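For reference, the centrally-managed pattern in HF looks roughly like this. This is a sketch against a vanilla causal LM ("gpt2" is just a placeholder model, not ours); the point is that the caller owns the cache and threads it through past_key_values, which is exactly the flow Hivemind's fixed forward signature prevents.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

cache = DynamicCache()  # holds K/V for every layer, owned by the caller
input_ids = tokenizer("Hello", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        # The cache is passed in explicitly as a dynamic forward argument.
        out = model(input_ids, past_key_values=cache, use_cache=True)
        next_token = out.logits[:, -1:].argmax(dim=-1)
        cache = out.past_key_values  # updated cache comes back to the caller
        input_ids = next_token       # only the new token is fed next step
```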
I don't know what the solution is going to be here.