
The first generation token output sees the whole cache key and value #27

Open
PengWenChen opened this issue Jan 6, 2025 · 3 comments

@PengWenChen

past_key_value.update(key_states_compress, value_states_compress, self.layer_idx, cache_kwargs)

Hi there~
Thanks for your great work!
The past_key_value at L130 does get updated with the new compressed key and value.
However, the first generated token (L168) is still produced with the full cache key and value after the prompt compression.

attn_output = self._flash_attention_forward(
    query_states,
    key_states,
    value_states,
    attention_mask,
    q_len,
    dropout=dropout_rate,
    use_sliding_windows=use_sliding_windows,
)
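
To make this concrete, here is a tiny self-contained toy in plain PyTorch (not the repo's code; the "compression" below just keeps the last few positions as a stand-in for the real scoring-based selection): the cache ends up holding the compressed KV, while the attention that produces the first token's logit still runs over the full KV, and the two outputs differ.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, head_dim, keep = 8, 16, 4

q = torch.randn(1, seq_len, head_dim)
k = torch.randn(1, seq_len, head_dim)
v = torch.randn(1, seq_len, head_dim)

# Stand-in "compression": keep only the last `keep` prompt positions.
k_compress, v_compress = k[:, -keep:], v[:, -keep:]

# What the cache stores for later decode steps: the compressed KV.
cache_k, cache_v = k_compress, v_compress

# What the prefill attention computes with: the full KV. The output at the
# last prompt position is what selects the first generated token.
out_full = F.scaled_dot_product_attention(q[:, -1:], k, v)
out_comp = F.scaled_dot_product_attention(q[:, -1:], cache_k, cache_v)

print(torch.allclose(out_full, out_comp))  # False: the first token sees the full KV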

Is this a bug?

@XiongxiaoL

Because this happens during the prefill stage, it is unrelated to KV compression, so the full KV is used for the computation.

@PengWenChen
Author

To my understanding, the first generated token is the last output logit of the prefill stage.
So the first token of the model's response comes from the attn_output here, right?

attn_output = self._flash_attention_forward(
    query_states,
    key_states,
    value_states,
    attention_mask,
    q_len,
    dropout=dropout_rate,
    use_sliding_windows=use_sliding_windows,
)

If so, then the first generated (predicted) token sees the whole KV of the input prompt.
If not, what is the input token for the first generated token after KV compression? There must be an input token that becomes a hidden state and predicts the first response token, right?
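
As a minimal illustration of what I mean (a sketch with a stock HF model, not this repo's code): the first response token is taken from the logits at the last prompt position of the prefill forward pass, and only the tokens after it are decoded against the cache.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is used here purely for illustration, not the model used in this repo.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tok("An example prompt", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(input_ids=prompt_ids, use_cache=True)   # prefill over the full prompt
first_token = out.logits[:, -1, :].argmax(dim=-1)       # first generated (response) token

# Only from the second generated token onward does decoding run against the cache.
with torch.no_grad():
    next_out = model(input_ids=first_token.unsqueeze(-1),
                     past_key_values=out.past_key_values,
                     use_cache=True)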

@akhauriyash

akhauriyash commented Jan 27, 2025

Hello,
Has there been a resolution / more discussion on this?

I think the simple fix is to do this here:
key_states, value_states = past_key_value.update(key_states_compress, value_states_compress, self.layer_idx, cache_kwargs)
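
That is, assign the returned tensors back instead of discarding them, so the attention below the update runs on what the cache now holds. Roughly (a sketch, not the exact repository lines):

# before: the return value of update() is discarded, and the attention below
# still runs on the full key/value states of the prompt
past_key_value.update(key_states_compress, value_states_compress,
                      self.layer_idx, cache_kwargs)

# after: reuse the cache contents so the first generated token also attends
# over the compressed KV
key_states, value_states = past_key_value.update(
    key_states_compress, value_states_compress, self.layer_idx, cache_kwargs)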

Thanks!
