FA2 broken for Cohere2 if Optional Mask is not passed in forward #35547

Open
Qubitium opened this issue Jan 7, 2025 · 5 comments
Qubitium (Contributor) commented Jan 7, 2025

System Info

transformers==4.48.0.dev0 (from git+https://github.com/huggingface/transformers.git@5615a393691c81e00251e420c73e4d04c6fe22e5)

Who can help?

@ArthurZucker @Cyrilvallez @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Check our CI test failures:

Gemma

https://github.com/ModelCloud/GPTQModel/actions/runs/12651906072/job/35253942521#step:12:1164

Cohere2

https://github.com/ModelCloud/GPTQModel/actions/runs/12651906072/job/35253938235#step:12:922
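
A minimal sketch of the failure path outside our CI harness (the checkpoint name below is only an example Cohere2 model, and an FA2-capable GPU is assumed):

# Sketch only: load a Cohere2 checkpoint with FA2 and call forward without an attention mask.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/c4ai-command-r7b-12-2024"  # example checkpoint, substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)

inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
# No attention_mask is passed, so `mask` arrives in flash_attention_forward as None
# and `seq_len` is never assigned before it is used.
with torch.no_grad():
    model(input_ids=inputs["input_ids"])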

We enabled FA2 by default in GPTQModel for inference of GPTQ-quantized models, and our CI tests are failing for multiple models. This looks like a regression in the FA2 attention code: seq_len is never set if mask is None, but the FA2 forward requires seq_len:

def flash_attention_forward(
    config: Cohere2Config,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    mask: Optional[torch.Tensor],
    target_dtype: torch.dtype = torch.float16,
    **_kwargs,
) -> Tuple[torch.Tensor, None]:
    if mask is not None:
        seq_len = mask.shape[1]
        query = query[:, :, :seq_len]
        value = value[:, :, :seq_len]

    # TODO: These transpose are quite inefficient but Flash Attention requires the layout
    # [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor rotary embedding
    query_states = query.transpose(1, 2)
    key_states = key.transpose(1, 2)
    value_states = value.transpose(1, 2)

    dropout_rate = config.attention_dropout if config.training else 0.0

    input_dtype = query_states.dtype
    if input_dtype == torch.float32:
        query_states = query_states.to(target_dtype)
        key_states = key_states.to(target_dtype)
        value_states = value_states.to(target_dtype)

    attn_output = _flash_attention_forward(
        query_states,
        key_states,
        value_states,
        mask,
        seq_len,
        dropout=dropout_rate,
        is_causal=config.is_causal,
    )

@SunMarc I don't think this is related to quantization, and @ArthurZucker the FA2 code above is broken if mask is not passed or is None, since seq_len will never be set. The mask param is explicitly declared as Optional.
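
For illustration, one possible way to handle the missing mask (a sketch only, not necessarily the fix that will land upstream) is to fall back to the query's own sequence length:

    # Sketch of a possible fix (assumption, not the upstream patch).
    if mask is not None:
        seq_len = mask.shape[1]
        query = query[:, :, :seq_len]
        value = value[:, :, :seq_len]
    else:
        # query is laid out as [batch_size, num_heads, seq_len, head_dim] at this point
        seq_len = query.shape[2]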

Expected behavior

Work and not crash.

Qubitium added the bug label on Jan 7, 2025
SunMarc (Member) commented Jan 7, 2025

The Cohere2 flash attention 2 code is the original one from the author, as you can see here. Cohere2 is one of the few models that codes its own flash_attention_forward function. Maybe @alexrs-cohere can help you fix this issue.

Also, we are refactoring the attention code in #35235; please let us know if you face any issues with other models!

Cyrilvallez (Member) commented Jan 7, 2025

Not entirely sure why you chose this particular commit as a version, but this does not seem to be an issue on main

Qubitium changed the title from "FA2 broken if Optional Mask is not passed in forward" to "FA2 broken for Cohere2 if Optional Mask is not passed in forward" on Jan 7, 2025
Qubitium (Contributor, Author) commented Jan 7, 2025

> Not entirely sure why you chose this particular commit as a version, but this does not seem to be an issue on main

@Cyrilvallez My mistake. Our CI was force-checking out a commit after 4.47.1, but not the latest main, since the Cohere2 code was merged right after the 4.47.1 release.

So it looks like Cohere2 is the only model that still has the broken FA2 implementation.

@alexrs-cohere Please check.

Cyrilvallez (Member) commented

Ha indeed the issue persists for Cohere2! Thanks, I'll open a PR!

alexrs-cohere (Contributor) commented

Thanks for reporting this @Qubitium!

@Cyrilvallez let me know when the PR is ready and if you need any support from me!
