
Integrations into popular evaluation frameworks like lmms_eval or vlmevalkit #2

Open
wizyoung opened this issue Nov 1, 2024 · 6 comments

Comments


wizyoung commented Nov 1, 2024

Thank you for your great work!
I wonder if it can be integrated into popular evaluation frameworks like lmms_eval or vlmevalkit for easier use by everyone?

woodfrog (Collaborator) commented Nov 1, 2024

Thank you for the suggestion! Yes, we will integrate our benchmark into lmms_eval and VLMEvalKit. We will work on this soon after adding the results of some recent VLMs.

wizyoung (Author) commented Nov 1, 2024

Looking forward to that!


Violettttee commented Nov 7, 2024

Hi again~
Recently our team wanted to integrate your code into our lmms_eval-like repo, but I found one problem: your messages input seems to have an order. For example:

(screenshot of the messages input, showing multiple images passed in a fixed order)

The multiple images appear to be ordered, which raises a problem: in lmms_eval-style pipelines (where the text input is separate from the visual input), we must know the exact number of images (especially for video), since for video tasks we must read the video and count the frames to put every frame in order.
So the questions are:
1. Does the sequence of images matter?
2. If so, is there a good plan to solve the above problem?

woodfrog (Collaborator) commented Nov 7, 2024

@Violettttee Thanks for your questions.

  1. Yes, the sequence/order of the images matters, as some multi-image tasks (e.g., image retrieval, temporal/spatial understanding) require the model to answer with the image index or sort the image indices.
  2. I don't fully understand your problem here. Is your question "how do we fully judge the number of images"?

@Violettttee

@woodfrog Hi~
The question "how do we fully judge the number of images" actually means: in the lmms_eval repo, the text inputs and the visual inputs are not processed at the same time, but for the text inputs our model (based on LLaVA) needs to add an <image> token to the text for each image if the sequence of images matters, for example:

text = "This is the first example.<image>This is the second example.<image>"

and for video tasks, the examples look like:

text = "This is the first example.<image><image><image>...<image>This is the second example.<image>..."

So, under this circumstance, for video tasks this means we must load the video and get all the frames just to add the tokens to the text. That seems to take a lot of memory and time.
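To make the concern concrete, here is a minimal sketch of the naive approach being described (the decord-based loader, the function name, and the <image> token handling are illustrative assumptions, not code from either repo):

import decord  # assumed video reader; any frame-accurate reader works

IMAGE_TOKEN = "<image>"  # placeholder a LLaVA-style model expects per image

def build_video_prompt_naive(question, video_path):
    # Naive approach: decode the whole video just to learn how many
    # <image> tokens must be inserted into the text prompt.
    reader = decord.VideoReader(video_path)
    frames = [reader[i].asnumpy() for i in range(len(reader))]  # loads ALL frames
    # One placeholder per frame -- the text prompt cannot be built
    # until every frame has been read and counted.
    prompt = IMAGE_TOKEN * len(frames) + "\n" + question
    return prompt, frames

Because the prompt depends on the frame count, every video has to be fully decoded before the text side of the pipeline can run, which is the memory/time cost mentioned above.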

woodfrog (Collaborator) commented Nov 8, 2024

@Violettttee Thanks for the clarification. So the question is specifically for video tasks, am I right?

In our evaluation pipeline, we indeed read the video and do frame sub-sampling based on per-model hyper-parameters (models with a larger context window size will have a larger sampling rate). The <video> placeholder will be replaced by a sequence of placeholders based on how many frames are sub-sampled from the video (see our pipeline for InternVL2 for an example).

We didn't pre-convert a video into a list of images mainly for two reasons: 1) If the pre-defined sampling rate is too large, the size of the image sequence will be very large; 2) If the pre-defined sampling rate is too small, we might lose critical temporal information, which is also unfair to models with long context windows.

Maybe you can follow our sub-sampling pipeline to prepare the data? Feel free to ask if you have any further questions.
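For reference, a hedged sketch of what such a frame sub-sampling step could look like (the frame budgets, model names, and placeholder convention are assumptions for illustration; the repo's actual InternVL2 pipeline is the authoritative reference):

import numpy as np
import decord  # assumed video reader

# Illustrative per-model frame budgets -- models with larger context
# windows get a larger budget, i.e. a higher effective sampling rate.
FRAME_BUDGET = {"internvl2-8b": 16, "long-context-model": 64}

def expand_video_placeholder(prompt, video_path, max_frames):
    # Replace the single <video> placeholder with one <image> placeholder
    # per sub-sampled frame, without decoding the entire video.
    reader = decord.VideoReader(video_path)
    total = len(reader)
    num_frames = min(max_frames, total)
    # Uniformly spaced frame indices across the whole clip.
    indices = np.linspace(0, total - 1, num_frames).round().astype(int).tolist()
    frames = reader.get_batch(indices).asnumpy()  # decodes only the sampled frames
    prompt = prompt.replace("<video>", "<image>" * num_frames, 1)
    return prompt, frames

This keeps the number of image tokens bounded by the per-model budget while still spreading the sampled frames over the full video.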
