Replies: 10 comments
-
By the way, my working branch is chenwanqq/mistral.rs. However, it's a mess, and there are tons of local paths and println! calls for debugging.
-
Hello @chenwanqq! Our llama block in this repo is no different in design from the candle-transformers one. The only difference is that, as you pointed out, we use the fused RoPE kernel instead of the Candle RoPE. Our fused RoPE is faster and does all the operations for multiple batches in one kernel. Does this behavior occur after the first block runs (so in the second block), or at some other block? That could be a clue to which part of the code is not behaving correctly.
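For reference, a standard (non-fused) RoPE application in candle looks roughly like the sketch below. This is a generic illustration of the "original" rotary embedding (the rotate-half formulation), not the fused kernel from this repo nor the exact candle-transformers code; the shapes, helper names, and dummy cos/sin tables are assumptions.

```rust
use candle_core::{DType, Device, Result, Tensor, D};

/// Standard (non-fused) rotary embedding: rotate the two halves of the last
/// dim and mix them with per-position cos/sin tables.
/// Assumed shapes: x is (b, n_heads, seq_len, head_dim), cos/sin are (seq_len, head_dim / 2).
fn apply_rope(x: &Tensor, cos: &Tensor, sin: &Tensor) -> Result<Tensor> {
    let (_b, _h, seq_len, head_dim) = x.dims4()?;
    let half = head_dim / 2;

    // Duplicate the tables along the last dim and add broadcast dims:
    // (seq_len, head_dim / 2) -> (1, 1, seq_len, head_dim).
    let cos = Tensor::cat(&[cos, cos], D::Minus1)?.reshape((1, 1, seq_len, head_dim))?;
    let sin = Tensor::cat(&[sin, sin], D::Minus1)?.reshape((1, 1, seq_len, head_dim))?;

    // rotate_half(x) = cat(-x2, x1) along the last dim.
    let x1 = x.narrow(D::Minus1, 0, half)?;
    let x2 = x.narrow(D::Minus1, half, half)?;
    let rotated = Tensor::cat(&[&x2.neg()?, &x1], D::Minus1)?;

    // out = x * cos + rotate_half(x) * sin
    let out = x.broadcast_mul(&cos)?.add(&rotated.broadcast_mul(&sin)?)?;
    Ok(out)
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let (b, h, seq, hd) = (1usize, 32, 8, 128);
    // Dummy tables and activations just to exercise the op; real code derives
    // cos/sin from the rotary base frequency and the position offsets.
    let cos = Tensor::ones((seq, hd / 2), DType::F32, &dev)?;
    let sin = Tensor::zeros((seq, hd / 2), DType::F32, &dev)?;
    let q = Tensor::randn(0f32, 1f32, (b, h, seq, hd), &dev)?;
    let q_rot = apply_rope(&q, &cos, &sin)?;
    println!("{:?}", q_rot.dims());
    Ok(())
}
```

The fused version described above performs all of these per-position operations for every batch inside a single kernel, which is where the speedup comes from, whereas a sketch like this issues several separate tensor ops per call.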
-
Unfortunately, it occurs in the last few rounds... 😭
What makes things more complicated is that if I switch from the 7b to the 13b model, this problem doesn't happen.
-
Hey, I think I can confirm that it's the fused RoPE causing the problem. I modified the llama block to use the original RoPE implementation from candle, as follows:

```rust
//self.rotary_emb
//    .forward(seqlen_offsets, &start_offsets_kernel, &mut q, &mut k, b_sz)?;
q = q.transpose(0, 1)?.reshape((b_sz, self.num_attention_heads, seq_len, self.head_dim))?;
k = k.transpose(0, 1)?.reshape((b_sz, self.num_key_value_heads, seq_len, self.head_dim))?;
q = self.apply_rotary_emb(&q, seqlen_offsets[0])?; // [1, 32, 2975, 128]
k = self.apply_rotary_emb(&k, seqlen_offsets[0])?;
q = q.transpose(1, 2)?.reshape((b_sz * seq_len, self.num_attention_heads, self.head_dim))?;
k = k.transpose(1, 2)?.reshape((b_sz * seq_len, self.num_key_value_heads, self.head_dim))?;
```

And it works, outputting the correct result! (Tested on a 7b model.)
Perhaps a temporary solution would be for me to implement my own version of LLaMA without fused RoPE? Or should I wait for a fix to this issue?
-
That's great. If you could prepare a PR with a copied llama impl that doesn't use fused RoPE, that would be very helpful. I'll put priority on fixing the RoPE instability, but that can be done after this PR is merged. I haven't looked too much at the llava model, but does it increase the sequence length proportionally to the image dimensions after some sort of vision embedding stack? This may have to do with #339.
-
In LLaVA, the length of the image feature (num_per_image_token) is fixed for one image. For LLaVANext (LLaVA 1.6) it is:

```rust
let patch_per_side = image_size / patch_size;
let num_img_token = patch_per_side * patch_per_side + (patch_per_side * 2) * (patch_per_side * 2 + 1);
```

The image feature is concatenated right in place of the image placeholder, with the text features before and after it.
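For concreteness, plugging in the vision-tower geometry that LLaVA-style models commonly use (CLIP ViT-L/14 at 336×336, i.e. image_size = 336 and patch_size = 14; these numbers are assumed here, not stated in the thread) reproduces the 2928-token figure quoted below:

```rust
fn main() {
    // Assumed vision tower: CLIP ViT-L/14 at 336x336 (not stated explicitly in the thread).
    let image_size = 336usize;
    let patch_size = 14usize;

    let patch_per_side = image_size / patch_size; // 24
    let num_img_token =
        patch_per_side * patch_per_side + (patch_per_side * 2) * (patch_per_side * 2 + 1);

    // 24*24 + 48*49 = 576 + 2352 = 2928 image tokens, matching the LLaVANext figure below.
    println!("num_img_token = {num_img_token}");
}
```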
-
Thanks for clarifying! It seems like it scales up with the image size then; I'll see if the instability you found here applies to #339.
-
Emm... I don't know what you mean by "scales up". It does not scale up if you input different sizes of images to a specific model... However, it does generate a long feature sequence (2928 tokens for LLaVANext) for each image.
-
Ok, that makes sense. Looking forward to a PR 😄!
-
@chenwanqq, I'm converting this to a discussion as I think the problem has been resolved, but please feel free to open another issue!
-
Hello, as we discussed before, I am integrating LLaVA into this repo, but I encountered an extremely difficult problem that is hard to debug.
In LLaVA, the image features (obtained through CLIP) and text features need to be concatenated and fed into the LLaMA model. However, during the processing of the llama block, all activation values became NaN. I used the following code for debugging:
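(The exact snippet isn't reproduced above; a minimal sketch of this kind of per-block min/max check with candle might look like the following, with the label and tensor names as placeholders.)

```rust
use candle_core::{DType, Result, Tensor};

/// Print the min/max of an activation tensor and flag non-finite values.
/// `label` is just a tag, e.g. the block index.
fn check_activations(label: &str, xs: &Tensor) -> Result<()> {
    let vals = xs.to_dtype(DType::F32)?.flatten_all()?.to_vec1::<f32>()?;
    let max_value = vals.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let min_value = vals.iter().copied().fold(f32::INFINITY, f32::min);
    let non_finite = vals.iter().any(|v| !v.is_finite());
    println!("{label}: min={min_value}, max={max_value}, non_finite={non_finite}");
    Ok(())
}

fn main() -> Result<()> {
    // Tiny dummy tensor just to show the call; in practice this runs on the
    // hidden states after each llama block.
    let xs = Tensor::new(&[[0.5f32, -30.0, 29.0]], &candle_core::Device::Cpu)?;
    check_activations("block 0 output", &xs)
}
```

Calling something like `check_activations(&format!("block {i}"), &xs)?` after each block is what surfaces the point where the values blow up.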
I found that after passing through two blocks, the max_value and min_value became infinity and negative infinity, respectively, which caused subsequent issues.
Here are some key points:
(1) If I only input text, this problem does not occur.
(2) This problem does not occur in my implementation based on candle-transformers. I measured the maximum and minimum values of the features before entering the block; in both implementations they are around ±30, so the block inputs themselves look essentially the same.
After further reading the code, I found that the llama implementation used in this repo adopts a "fused RoPE". I'm not sure what this is. Could you please explain the difference between this and the original version of RoPE?