Replies: 10 comments
-
By the way, my working branch is chenwanqq/mistral.rs. However, it's a mess, and there are tons of local paths and println! calls for debugging.
-
Hello @chenwanqq! Our llama block in this repo is no different in design from the candle-transformers one. The only difference is that, as you pointed out, we use the fused RoPE kernel instead of the Candle RoPE. Our fused RoPE is faster and does all the operations for multiple batches in one kernel. Does this behavior occur after the first block runs (so in the second block), or at some other block? That could be a clue to which part of the code is not behaving correctly.
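For reference, a standard (non-fused) RoPE application in candle looks roughly like the sketch below. This is a generic illustration of the "original" rotary embedding (the rotate-half formulation), not the fused kernel from this repo nor the exact candle-transformers code; the shapes, helper names, and dummy cos/sin tables are assumptions.

```rust
use candle_core::{DType, Device, Result, Tensor, D};

/// Standard (non-fused) rotary embedding: rotate the two halves of the last
/// dim and mix them with per-position cos/sin tables.
/// Assumed shapes: x is (b, n_heads, seq_len, head_dim), cos/sin are (seq_len, head_dim / 2).
fn apply_rope(x: &Tensor, cos: &Tensor, sin: &Tensor) -> Result<Tensor> {
    let (_b, _h, seq_len, head_dim) = x.dims4()?;
    let half = head_dim / 2;

    // Duplicate the tables along the last dim and add broadcast dims:
    // (seq_len, head_dim / 2) -> (1, 1, seq_len, head_dim).
    let cos = Tensor::cat(&[cos, cos], D::Minus1)?.reshape((1, 1, seq_len, head_dim))?;
    let sin = Tensor::cat(&[sin, sin], D::Minus1)?.reshape((1, 1, seq_len, head_dim))?;

    // rotate_half(x) = cat(-x2, x1) along the last dim.
    let x1 = x.narrow(D::Minus1, 0, half)?;
    let x2 = x.narrow(D::Minus1, half, half)?;
    let rotated = Tensor::cat(&[&x2.neg()?, &x1], D::Minus1)?;

    // out = x * cos + rotate_half(x) * sin
    let out = x.broadcast_mul(&cos)?.add(&rotated.broadcast_mul(&sin)?)?;
    Ok(out)
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let (b, h, seq, hd) = (1usize, 32, 8, 128);
    // Dummy tables and activations just to exercise the op; real code derives
    // cos/sin from the rotary base frequency and the position offsets.
    let cos = Tensor::ones((seq, hd / 2), DType::F32, &dev)?;
    let sin = Tensor::zeros((seq, hd / 2), DType::F32, &dev)?;
    let q = Tensor::randn(0f32, 1f32, (b, h, seq, hd), &dev)?;
    let q_rot = apply_rope(&q, &cos, &sin)?;
    println!("{:?}", q_rot.dims());
    Ok(())
}
```

The fused version described above performs all of these per-position operations for every batch inside a single kernel, which is where the speedup comes from, whereas a sketch like this issues several separate tensor ops per call.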
-
Unfortunately, it occurs in the last few rounds... 😭
What makes things more complicated is that if I switch from the 7b to the 13b model, this problem doesn't happen.
-
Hey, I think I can confirm that it's the fused RoPE causing the problem. I modified the llama block to use the original RoPE implementation from candle, as follows:

```rust
//self.rotary_emb
//    .forward(seqlen_offsets, &start_offsets_kernel, &mut q, &mut k, b_sz)?;
q = q.transpose(0, 1)?.reshape((b_sz, self.num_attention_heads, seq_len, self.head_dim))?;
k = k.transpose(0, 1)?.reshape((b_sz, self.num_key_value_heads, seq_len, self.head_dim))?;
q = self.apply_rotary_emb(&q, seqlen_offsets[0])?; // [1, 32, 2975, 128]
k = self.apply_rotary_emb(&k, seqlen_offsets[0])?;
q = q.transpose(1, 2)?.reshape((b_sz * seq_len, self.num_attention_heads, self.head_dim))?;
k = k.transpose(1, 2)?.reshape((b_sz * seq_len, self.num_key_value_heads, self.head_dim))?;
```

And it works, outputting the correct result! (Tested on a 7b model.)
Perhaps a temporary solution would be for me to implement my own version of LLaMA without fused RoPE? Or should I wait for a fix to this issue?
-
That's great. If you could prepare a PR with a copied llama impl that doesn't use fused RoPE, that would be very helpful. I'll put priority on fixing the RoPE instability, but that can be done after this PR is merged. I haven't looked too much at the llava model, but does it increase the sequence length proportionally to the image dimensions after some sort of vision embedding stack? This may have to do with #339.
-
In LLaVA, the length of the image feature (num_per_image_token) is fixed for one image. For LLaVANext (LLaVA 1.6) it is:

```rust
let patch_per_side = image_size / patch_size;
let num_img_token = patch_per_side * patch_per_side + (patch_per_side * 2) * (patch_per_side * 2 + 1);
```

The image feature is concatenated right in place of the image placeholder, with the text features before and after it.
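For concreteness, plugging in the vision-tower geometry that LLaVA-style models commonly use (CLIP ViT-L/14 at 336×336, i.e. image_size = 336 and patch_size = 14; these numbers are assumed here, not stated in the thread) reproduces the 2928-token figure quoted below:

```rust
fn main() {
    // Assumed vision tower: CLIP ViT-L/14 at 336x336 (not stated explicitly in the thread).
    let image_size = 336usize;
    let patch_size = 14usize;

    let patch_per_side = image_size / patch_size; // 24
    let num_img_token =
        patch_per_side * patch_per_side + (patch_per_side * 2) * (patch_per_side * 2 + 1);

    // 24*24 + 48*49 = 576 + 2352 = 2928 image tokens, matching the LLaVANext figure below.
    println!("num_img_token = {num_img_token}");
}
```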
-
Thanks for clarifying! It seems like it scales up with the image size then; I'll see if the instability you found here applies to #339.
-
Emm... I don't know what you mean by "scales up". It does not scale up if you input different sizes of images to a specific model... However, it does generate a long feature sequence (2928 tokens for LLaVANext) for each image.
-
Ok, that makes sense. Looking forward to a PR 😄!
-
@chenwanqq, I'm converting this to a discussion as I think the problem has been resolved, but please feel free to open another issue!
-
Hello, as we discussed before, I am integrating LLaVA into this repo, but I encountered an extremely difficult problem that is hard to debug.
In LLaVA, the image features (obtained through CLIP) and text features need to be concatenated and fed into the LLaMA model. However, during the processing of the llama block, all activation values became NaN. I used the following code for debugging:
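(The exact snippet isn't reproduced above; a minimal sketch of this kind of per-block min/max check with candle might look like the following, with the label and tensor names as placeholders.)

```rust
use candle_core::{DType, Result, Tensor};

/// Print the min/max of an activation tensor and flag non-finite values.
/// `label` is just a tag, e.g. the block index.
fn check_activations(label: &str, xs: &Tensor) -> Result<()> {
    let vals = xs.to_dtype(DType::F32)?.flatten_all()?.to_vec1::<f32>()?;
    let max_value = vals.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let min_value = vals.iter().copied().fold(f32::INFINITY, f32::min);
    let non_finite = vals.iter().any(|v| !v.is_finite());
    println!("{label}: min={min_value}, max={max_value}, non_finite={non_finite}");
    Ok(())
}

fn main() -> Result<()> {
    // Tiny dummy tensor just to show the call; in practice this runs on the
    // hidden states after each llama block.
    let xs = Tensor::new(&[[0.5f32, -30.0, 29.0]], &candle_core::Device::Cpu)?;
    check_activations("block 0 output", &xs)
}
```

Calling something like `check_activations(&format!("block {i}"), &xs)?` after each block is what surfaces the point where the values blow up.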
I found that after passing through two blocks, the max_value and min_value became infinity and negative infinity, respectively, which caused subsequent issues.
Here are some key points:
(1) If I only input text, this problem does not occur.
(2) This problem does not occur in my implementation based on candle-transformers. I measured the maximum and minimum values of the features before entering the block; in both implementations they are around ±30, so the block inputs themselves look essentially the same.
After further reading the code, I found that the llama implementation used in this repo adopts a "fused RoPE". I'm not sure what this is. Could you please explain the difference between this and the original version of RoPE?