
Integrations into popular evaluation frameworks like lmms_eval or vlmevalkit #2

Open
wizyoung opened this issue Nov 1, 2024 · 6 comments

Comments


wizyoung commented Nov 1, 2024

Thank you for your great work!
I wonder if it can be integrated into popular evaluation frameworks like lmms_eval or vlmevalkit for easier use by everyone?

woodfrog (Collaborator) commented Nov 1, 2024

Thank you for the suggestion! Yes, we will integrate our benchmark into lmms_eval and VLMEvalKit. We will work on this soon after adding the results of some recent VLMs.

wizyoung (Author) commented Nov 1, 2024

Looking forward to that!


Violettttee commented Nov 7, 2024

Hi again~
Recently our team wanted to integrate your code into our lmms_eval-like repo, but I found one problem: your messages input seems to have an order. For example:

(screenshot of the messages input, showing multiple images passed in a fixed order)

The multiple images appear to be ordered, which raises a problem: in lmms_eval-style pipelines (where the text input is separate from the visual input), we must know the exact number of images (especially for video), since for video tasks we must read the video and count the frames to put every frame in order.
So the questions are:
1. Does the sequence of images matter?
2. If so, is there a good plan to solve the above problem?

woodfrog (Collaborator) commented Nov 7, 2024

@Violettttee Thanks for your questions.

  1. Yes, the sequence/order of the images matters, as some multi-image tasks (e.g., image retrieval, temporal/spatial understanding) require the model to answer with the image index or sort the image indices.
  2. I don't fully understand your problem here. Is your question "how do we fully judge the number of images"?

@Violettttee

@woodfrog Hi~
The question "how do we fully judge the number of images" actually means: in the lmms_eval repo, the text inputs and the visual inputs are not processed at the same time, but for the text inputs our model (based on LLaVA) needs to add an <image> token to the text for each image if the sequence of images matters, for example:

text = "This is the first example.<image>This is the second example.<image>"

and for video tasks, the examples look like:

text = "This is the first example.<image><image><image>...<image>This is the second example.<image>..."

So, under this circumstance, for video tasks this means we must load the video and get all the frames just to add the tokens to the text. That seems to take a lot of memory and time.
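To make the concern concrete, here is a minimal sketch of the naive approach being described (the decord-based loader, the function name, and the <image> token handling are illustrative assumptions, not code from either repo):

import decord  # assumed video reader; any frame-accurate reader works

IMAGE_TOKEN = "<image>"  # placeholder a LLaVA-style model expects per image

def build_video_prompt_naive(question, video_path):
    # Naive approach: decode the whole video just to learn how many
    # <image> tokens must be inserted into the text prompt.
    reader = decord.VideoReader(video_path)
    frames = [reader[i].asnumpy() for i in range(len(reader))]  # loads ALL frames
    # One placeholder per frame -- the text prompt cannot be built
    # until every frame has been read and counted.
    prompt = IMAGE_TOKEN * len(frames) + "\n" + question
    return prompt, frames

Because the prompt depends on the frame count, every video has to be fully decoded before the text side of the pipeline can run, which is the memory/time cost mentioned above.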

woodfrog (Collaborator) commented Nov 8, 2024

@Violettttee Thanks for the clarification. So the question is specifically for video tasks, am I right?

In our evaluation pipeline, we indeed read the video and do frame sub-sampling based on per-model hyper-parameters (models with a larger context window size will have a larger sampling rate). The <video> placeholder will be replaced by a sequence of placeholders based on how many frames are sub-sampled from the video (see our pipeline for InternVL2 for an example).

We didn't pre-convert a video into a list of images mainly for two reasons: 1) If the pre-defined sampling rate is too large, the size of the image sequence will be very large; 2) If the pre-defined sampling rate is too small, we might lose critical temporal information, which is also unfair to models with long context windows.

Maybe you can follow our sub-sampling pipeline to prepare the data? Feel free to ask if you have any further questions.
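For reference, a hedged sketch of what such a frame sub-sampling step could look like (the frame budgets, model names, and placeholder convention are assumptions for illustration; the repo's actual InternVL2 pipeline is the authoritative reference):

import numpy as np
import decord  # assumed video reader

# Illustrative per-model frame budgets -- models with larger context
# windows get a larger budget, i.e. a higher effective sampling rate.
FRAME_BUDGET = {"internvl2-8b": 16, "long-context-model": 64}

def expand_video_placeholder(prompt, video_path, max_frames):
    # Replace the single <video> placeholder with one <image> placeholder
    # per sub-sampled frame, without decoding the entire video.
    reader = decord.VideoReader(video_path)
    total = len(reader)
    num_frames = min(max_frames, total)
    # Uniformly spaced frame indices across the whole clip.
    indices = np.linspace(0, total - 1, num_frames).round().astype(int).tolist()
    frames = reader.get_batch(indices).asnumpy()  # decodes only the sampled frames
    prompt = prompt.replace("<video>", "<image>" * num_frames, 1)
    return prompt, frames

This keeps the number of image tokens bounded by the per-model budget while still spreading the sampled frames over the full video.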
