Use RADIOV2 as VLM's vision encoder. #60
Hello, in the experiments we published in our paper, we used an image pre-processor that resizes the longest edge to 432, adjusts the shortest edge to keep the original aspect ratio, and then crops the shortest edge down to the nearest multiple of the patch size. This should be mostly equivalent to …. Are you using …?
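For illustration, such a pre-processor might look roughly like the sketch below. The helper name, the use of torchvision, and the patch size of 16 are assumptions for the example, not the exact pipeline used in the paper:

```python
from PIL import Image
import torchvision.transforms.functional as TF

def resize_and_crop(img: Image.Image, longest_edge: int = 432, patch_size: int = 16):
    # Resize so the longest edge equals `longest_edge`, keeping the aspect ratio.
    w, h = img.size
    scale = longest_edge / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    img = TF.resize(img, [new_h, new_w])

    # Center-crop each side down to the nearest multiple of the patch size
    # (432 is already a multiple of 16, so in practice only the shortest edge shrinks).
    crop_h = (new_h // patch_size) * patch_size
    crop_w = (new_w // patch_size) * patch_size
    return TF.center_crop(img, [crop_h, crop_w])
```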
Thank you for your response! Yes, during finetuning we used image_aspect_ratio == 'pad'. I'm now re-running the experiment following your instructions. Thank you very much!
Hello, RADIOv2 still scores lower than SigLIP. I would like to know whether I have missed any operations in the feature extraction code below. Do I need to extract features from the second-to-last layer of the vision tower, as LLaVA does? Have I overlooked a normalization operation? Or do I need to add the summary token?
Hello, I have not worked with the HuggingFace model in LLaVA; however, you should equivalently be able to use the TorchHub model. In my LLaVA integration I used standard normalization instead of the built-in input conditioner (i.e. I make a call to …). This is my code (pardon the untidiness):
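The original snippet is not reproduced here; as a rough sketch only, a TorchHub-based integration with external normalization might look like the following. The version string, the input-conditioner bypass, and the surrounding wiring are assumptions for illustration, not the author's actual code:

```python
import torch
import torch.nn as nn
from torchvision import transforms
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load RADIO from TorchHub (version string is illustrative; see the repo for available versions).
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2', progress=True)
model.eval().to(device)

# Standard ImageNet normalization applied externally. If the hub model still applies its own
# input conditioner internally, it would need to be bypassed; replacing it with an identity
# module is one assumed way to do so (not a documented API -- check the repo for the
# supported mechanism).
model.input_conditioner = nn.Identity()

normalize = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

pil_image = Image.open('example.jpg').convert('RGB')  # placeholder path; H and W should be multiples of the patch size

with torch.no_grad():
    x = normalize(pil_image).unsqueeze(0).to(device)
    summary, spatial_features = model(x)  # spatial_features: (B, num_patches, C) patch tokens for the LLM projector
```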
Thank you! I'm also curious about the settings of "extra_config" and "config_items". Are the following parameters set to true or false?
Hi, in my standard configuration the adaptor is ….
Thank you for your prompt response and your great work!
Hello, have you been able to get RADIO to perform well in your VLM setup?
I'm sorry, to be honest I can't achieve better results than SigLIP under the same settings. SigLIP has a resolution of 384, while RADIO's resolution is dynamic (with the maximum size set to 1280). To save time, we use Qwen2-0.5B as the LLM, and we also add some OCR data such as DocVQA and TextVQA. Both experiments use the same setting.
Hello, our RADIOv2.5 substantially improves VLM metrics; see the release notes at the root of this repo. Would you like to try it?
Hello, the results of RADIOv2.5 are indeed quite impressive. I'm curious whether RADIOv2.5 supports dynamic resolution. Given that different tasks and images may require different resolution settings, recent VLMs, such as the one detailed in this paper, have adopted the strategy of splitting the original image so that the input resolution stays close to the original, achieving very good results. Since RADIOv2.5 naturally supports arbitrary resolutions, I'm wondering if it's possible to use only a single instance of RADIOv2.5 to support dynamic resolution. Additionally, does RADIOv2.5 have a maximum resolution limit of 768? If it can support larger resolutions, using RADIOv2.5 as a visual encoder might yield better performance on document images (DocVQA). Thank you for your great work!
Yes, the RADIOv2.5 family of models supports dynamic resolution. Indeed, with RADIO it wouldn't be necessary to tile. The image also doesn't need to be square. The only requirement is that each dimension is a multiple of 16. If you check out the tech report, you'll see that the model does well all the way up to 2048px. It can go even higher, although we haven't spent much time assessing that.
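As a concrete illustration of that constraint, here is a small sketch that rounds an arbitrary resolution to multiples of 16 before the forward pass; the helper name and the choice of bilinear interpolation are assumptions, not a prescribed recipe:

```python
import torch
import torch.nn.functional as F

def round_to_patch_multiple(x: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Resize a (B, 3, H, W) batch so H and W are each a multiple of `patch_size`."""
    _, _, h, w = x.shape
    new_h = max(patch_size, round(h / patch_size) * patch_size)
    new_w = max(patch_size, round(w / patch_size) * patch_size)
    if (new_h, new_w) == (h, w):
        return x
    return F.interpolate(x, size=(new_h, new_w), mode='bilinear', align_corners=False)

# e.g. a 1000x1500 image is resized to the nearest valid shape -- no tiling, no square crop.
x = torch.rand(1, 3, 1000, 1500)
x = round_to_patch_multiple(x)
print(x.shape)  # torch.Size([1, 3, 992, 1504])
```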
Yes, fixing "mode switching" is a major thing we addressed in the latest release. That was probably the primary reason you were seeing weird results with dynamic resolution and RADIOv2.1. Definitely give it a try if you have the chance. The gist is that RADIOv2.1 and below behaved differently at resolutions below ~720px versus above it; the representations would change dramatically around that threshold. This no longer happens with the new models, and we demonstrate how increasing the resolution from 432 up to 768 dramatically improves our LLaVA metrics. Depending on the language model, you could go even higher for even better results (particularly for OCR tasks).
Thank you for your response; I'm very willing to give RADIOv2.5 a try.
Is there any good news? |
Hello, thank you for your great work!
We are currently exploring the use of RADIO as a vision encoder for vision-language models. In our specific setup, we employ SigLIP and RADIOv2 as the vision encoder, while Phi-2 serves as the language model. The obtained results are as follows:
Both runs use the same data and configuration; the only difference is the vision encoder. Is it normal to observe worse performance when using RADIOv2 compared to SigLIP?
Could you give me some suggestions?