The first generation token output sees the whole cache key and value #27
Comments
Because this happens during the prefill stage, it is unrelated to KV compression, so the full KV is used for the computation.
To my understanding, the first generated token comes from the last output logit of the prefill stage (SnapKV/snapkv/monkeypatch/mistral_hijack_4_37.py, lines 168 to 176 in 82135ce).
If so, the first generated (predicted) token sees the whole KV from the input prompt.
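A minimal sketch of the point made above, assuming a toy single-head attention in plain Python rather than the actual SnapKV code. The helpers `attend` and `compress`, the `keep` indices, and all numbers are hypothetical and chosen only for illustration. It compares the last prompt position's attention output over the full keys and values with the output over a compressed subset:

```python
import math

def attend(query, keys, values):
    """Single-head softmax attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def compress(keys, values, keep):
    """Hypothetical stand-in for KV compression: keep only the listed positions."""
    return [keys[i] for i in keep], [values[i] for i in keep]

# Toy prompt of four positions with 2-dimensional heads (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
last_query = [0.3, 0.7]

# Prefill: the last prompt position attends to the full keys and values.
prefill_out = attend(last_query, keys, values)

# What the cache holds afterwards in this sketch: only the kept positions.
cached_keys, cached_values = compress(keys, values, keep=[2, 3])
compressed_out = attend(last_query, cached_keys, cached_values)

# Largest per-dimension difference between the two outputs.
gap = max(abs(a - b) for a, b in zip(prefill_out, compressed_out))
```

In this toy setup `gap` is nonzero, so the output that produces the first generated token differs depending on whether attention runs over the full or the compressed keys and values. Whether that difference matters in practice is the question this thread raises.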
Hello, I think the simple fix is to do this here: Thanks!
SnapKV/snapkv/monkeypatch/mistral_hijack_4_37.py
Line 130 in 82135ce
Hi there~
Thanks for your great work!
The past_key_value at L130 is indeed updated with the new compressed key and value.
However, the first generated token (L168) is still computed with the full cached key and value, even after the prompt compression.
SnapKV/snapkv/monkeypatch/mistral_hijack_4_37.py
Lines 168 to 176 in 82135ce
Is this a bug?