RoPE for Long-CLIP - perfect match for Flux? #67
Adding the relevant / changed parts of the Long-CLIP code for RoPE.
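For readers unfamiliar with RoPE, here is a minimal, generic sketch of how rotary embeddings are typically applied to the query/key projections inside an attention layer - the names and shapes are purely illustrative, not the actual changed Long-CLIP code:

```python
import torch

def build_rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    # One rotation frequency per channel pair, as in the standard RoPE formulation.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, heads, seq_len, head_dim) -> rotate each channel pair by a position-dependent angle.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated_even = x_even * cos - x_odd * sin
    rotated_odd = x_even * sin + x_odd * cos
    return torch.stack((rotated_even, rotated_odd), dim=-1).flatten(-2)

# Inside the attention forward pass, q and k are rotated before the dot product:
# cos, sin = build_rope_cache(seq_len, head_dim)
# q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
```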
Note, in case you think it's interesting and want to try it (with a more suitable batch size, perhaps): As SPRIGHT-T2I is GPT-4V etc. / AI labeled, I noticed the dataset contains a few "glitch labels", where the label is just a repetition of "a small airplane and a large airplane. a small airplane and a large airplane. [etc] [etc]", exceeding 248 tokens. I recommend running the entire dataset against the tokenizer and just deleting any examples that cause an error. I've found about 10 in 100,000 so far - and they are often low-quality images (e.g. images full of text), so I think it's best to just auto-delete them.
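A rough sketch of that filtering pass - assuming a tokenizer that raises an error on over-long captions (like CLIP's tokenize with truncate=False); the tokenize callable and the CSV layout are placeholders for whatever format you actually use:

```python
import csv

def filter_overlong_captions(in_csv: str, out_csv: str, tokenize):
    # Load all (image, caption) rows, keep only those the tokenizer accepts.
    with open(in_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    kept, dropped = [], []
    for row in rows:
        try:
            # e.g. clip.tokenize(caption, truncate=False) raises if the caption
            # exceeds the model's context length (77 for CLIP, 248 for Long-CLIP).
            tokenize(row["caption"], truncate=False)
            kept.append(row)
        except Exception:
            dropped.append(row["image"])
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(kept)
    print(f"kept {len(kept)}, dropped {len(dropped)} over-long captions")
```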
Indeed, this is a very promising direction, to be honest. We have published a new arXiv paper discussing the extension of context length for the CLIP model using RoPE and CoPE techniques. Our findings show that RoPE outperforms CoPE. This was achieved through a two-stage process involving model distillation and context length expansion. In the area of long-caption image retrieval, our approach surpasses that of Long-CLIP. Would love to hear your comments.
First of all, in case anybody sees this thread and thinks "Oh, I want to use Long-CLIP with Flux!": I made a ComfyUI custom node for it, and you can find it here: https://github.com/zer0int/ComfyUI-Long-CLIP
Now, about the actual topic. The developers of Flux1 have stated in their announcement that (besides it being a gigantic 12B-parameter diffusion transformer), their model's superior performance is also due to rotary positional embeddings. And indeed, the "red cube on top of a blue cube" problem doesn't exist for Flux1. It can even accurately generate this:
"a red cube on top of a blue cube, with a green kitten sitting on top of the red cube, the cat is holding a sign that says 'rotary positional embeddings', and in the background there are many tiny pink spheres and yellow triangles"
In terms of general spatial prompt following, Flux1 can do this (though details can be problematic).
Flux1 uses T5 + CLIP ViT-L/14 as Text Encoders. Besides Long-CLIP nicely complementing the maximum sequence length of T5, I naturally also wondered: what if CLIP had RoPE?
My previous MLP modification was Geometric Parametrization (GmP), which "splits" the .weight into .theta and .r, and thus preserves the learned information. Not a big deal.
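For reference, a minimal sketch of what the GmP split looks like for a linear layer - the weight is reconstructed as a per-row radius times a unit direction, so the pretrained values are preserved exactly at initialization (illustrative code, not the exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Linear layer with the weight re-parametrized as radius (.r) * unit direction (.theta) per output row."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                                 # (out_features, in_features)
        self.r = nn.Parameter(w.norm(dim=1, keepdim=True))     # per-row magnitude
        self.theta = nn.Parameter(w.clone())                   # direction, normalized on the fly
        self.bias = nn.Parameter(linear.bias.data.clone()) if linear.bias is not None else None

    def forward(self, x):
        # Reconstruct w = r * theta / ||theta||, so the forward pass is unchanged at step 0.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```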
However, RoPE changes how the attention mechanism works. So it needs to "learn how to see" again after this change.
Nevertheless, I tried it! 🤓
First, I fine-tuned with the COCO-SPRIGHT-40k spatial labels dataset I used previously, with labels <77 tokens, to compare CLIP vs. Long-CLIP. The models show similar learning patterns, but their validation accuracy and F1 remain very poor after this - BUT they are improving, albeit slowly.
So I continued fine-tuning with CC12M SPRIGHT, using the original long captions >77 tokens for Long-CLIP.
Now, after ~150,000 text-image pairs (split across two separate runs), the latest results are:
Start: Validation Acc: 0.0880, Validation F1: 0.0616
End: Validation Acc: 0.1537, Validation F1: 0.1492
Fine-tuned Model Accuracy on MVT ImageNet/ObjectNet: 0.79064
Down from (your model): 0.81134
So I am hoping that "just" 1-5 million text-image pairs of re-training may be enough for CLIP to learn how to "see" with its new attention.
I am using GmP, label smoothing, and RoPE. GmP is necessary because I am "GPU poor": I am training on a single RTX 4090. RoPE further increases the model's VRAM requirements, so I have to train with a batch size of 26 (!) - definitely NOT ideal for CLIP!
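For reference, label smoothing on the CLIP contrastive objective can be as simple as passing label_smoothing to the cross-entropy over the image-text similarity matrix - a generic sketch, not my exact training code:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, logit_scale, smoothing: float = 0.1):
    # Normalized features -> similarity logits; the matching pairs lie on the diagonal.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits_per_image = logit_scale * image_feats @ text_feats.t()
    targets = torch.arange(image_feats.size(0), device=image_feats.device)
    loss_i = F.cross_entropy(logits_per_image, targets, label_smoothing=smoothing)
    loss_t = F.cross_entropy(logits_per_image.t(), targets, label_smoothing=smoothing)
    return (loss_i + loss_t) / 2
```

With a batch size of 26, the similarity matrix is only 26x26, which is why such a small batch is far from ideal for this contrastive objective.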
I would love to get your feedback on this idea (RoPE for CLIP in general). While GmP has been shown to "just work well" empirically, I am uncertain about RoPE. So, even if you have negative feedback and criticism / think RoPE is a very bad idea, I would very much appreciate that feedback, too!
Any feedback is welcome! Thank you!
Supplementary images.