🏠 LMMs-Lab Homepage | discord/lmms-eval | 🎓 Project Page | 📝 arXiv Paper
- [2025-01] 🎉🎉 We introduce VideoMMMU, a massive, multi-modal, multi-disciplinary video benchmark that evaluates knowledge acquisition from educational videos.

The evaluation of VideoMMMU is integrated into LMMs-Eval. Detailed instructions for running the evaluation are provided below.
For formal usage, you can install the package from PyPI by running the following command:

```bash
pip install lmms-eval
```
For development, you can install the package by cloning the repository and running the following commands:

```bash
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
```
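Once installed, you can check that the VideoMMMU tasks are registered. This is a minimal sanity check; it assumes your lmms-eval version supports the `--tasks list` flag for printing all registered task names.

```bash
# List all registered tasks and filter for the VideoMMMU entries
# (assumes the --tasks list flag is available in your lmms-eval version).
python -m lmms_eval --tasks list | grep video_mmmu
```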
If you want to test LLaVA, you will have to clone the LLaVA-NeXT repository and install it:

```bash
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .
```
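As a quick sanity check that both packages are importable from the same environment (a minimal sketch; the module names `lmms_eval` and `llava` are taken from the two repositories above):

```bash
# Verify that lmms-eval and LLaVA-NeXT are installed in the active environment.
python -c "import lmms_eval, llava; print('lmms-eval and LLaVA-NeXT are importable')"
```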
Evaluation of LLaVA-OneVision on VideoMMMU
```bash
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
    --tasks video_mmmu \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/
```
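The command above runs a single process. If you have multiple GPUs, you can shard the evaluation across them by raising `--num_processes`; the sketch below assumes one data-parallel process per GPU (here 8) and is otherwise identical to the command above.

```bash
# Example multi-GPU launch: 8 data-parallel processes, one per GPU (adjust to your machine).
accelerate launch --num_processes=8 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
    --tasks video_mmmu \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/
```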
Evaluate a single track of VideoMMMU
```bash
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
    --tasks video_mmmu_perception \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/
```
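The other tracks follow the same pattern with a different task name. The names `video_mmmu_comprehension` and `video_mmmu_adaptation` below are assumptions based on the track naming; confirm them against the task registry (e.g. `--tasks list`) before running.

```bash
# Same command as above with the task swapped to the (assumed) Comprehension-track task name.
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
    --tasks video_mmmu_comprehension \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/
```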
Evaluate the question_only track of VideoMMMU (Knowledge Acquisition Experiment)
```bash
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=1,torch_dtype=bfloat16 \
    --tasks video_mmmu_adaptation_question_only \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/
```
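The question_only run provides the "before video" accuracy for the knowledge acquisition experiment. The sketch below shows how the Δknowledge column in the table can then be derived, assuming the normalized-gain definition from the paper, Δknowledge = (Acc_after − Acc_before) / (100 − Acc_before) × 100, where Acc_before is the question_only score and Acc_after is the Adaptation-track score with video; the accuracy values are placeholders.

```bash
# Placeholder accuracies (replace with your model's scores, in percent):
#   acc_before - Adaptation-track accuracy without video (video_mmmu_adaptation_question_only)
#   acc_after  - Adaptation-track accuracy with video
acc_before=40.0
acc_after=50.0
# Normalized knowledge gain (assumed formula): (after - before) / (100 - before) * 100
awk -v b="$acc_before" -v a="$acc_after" \
    'BEGIN { printf "Delta_knowledge = %+.1f\n", (a - b) / (100 - b) * 100 }'
```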
We evaluate various open-source and proprietary LMMs; the table below provides a detailed comparison. To submit your model's results, please contact Kairui Hu at [email protected] or Bo Li at [email protected].
Model | Overall | Perception | Comprehension | Adaptation | Δknowledge |
---|---|---|---|---|---|
Human Expert | 74.44 | 84.33 | 78.67 | 60.33 | +33.1 |
Claude-3.5-Sonnet | 65.78 | 72.00 | 69.67 | 55.67 | +11.4 |
GPT-4o | 61.22 | 66.00 | 62.00 | 55.67 | +15.6 |
Qwen-2.5-VL-72B | 60.22 | 69.33 | 61.00 | 50.33 | +9.7 |
Gemini 1.5 Pro | 53.89 | 59.00 | 53.33 | 49.33 | +8.7 |
Aria | 50.78 | 65.67 | 46.67 | 40.00 | +3.2 |
Gemini 1.5 Flash | 49.78 | 57.33 | 49.00 | 43.00 | -3.3 |
LLaVA-Video-72B | 49.67 | 59.67 | 46.00 | 43.33 | +7.1 |
LLaVA-OneVision-72B | 48.33 | 59.67 | 42.33 | 43.00 | +6.6 |
Qwen-2.5-VL-7B | 47.44 | 58.33 | 44.33 | 39.67 | +2.2 |
mPLUG-Owl3-7B | 42.00 | 49.33 | 38.67 | 38.00 | +7.5 |
MAmmoTH-VL-8B | 41.78 | 51.67 | 40.00 | 33.67 | +1.5 |
InternVL2-8B | 37.44 | 47.33 | 33.33 | 31.67 | -8.5 |
LLaVA-Video-7B | 36.11 | 41.67 | 33.33 | 33.33 | -5.3 |
VILA1.5-40B | 34.00 | 38.67 | 30.67 | 32.67 | +9.4 |
Llama-3.2-11B | 30.00 | 35.67 | 32.33 | 22.00 | - |
LongVA-7B | 23.98 | 24.00 | 24.33 | 23.67 | -7.0 |
VILA1.5-8B | 20.89 | 20.33 | 17.33 | 25.00 | +5.9 |
```bibtex
@article{hu2025videommmu,
  title={Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos},
  author={Kairui Hu and Penghao Wu and Fanyi Pu and Wang Xiao and Yuanhan Zhang and Xiang Yue and Bo Li and Ziwei Liu},
  journal={arXiv preprint arXiv:2501.13826},
  year={2025},
  url={https://arxiv.org/abs/2501.13826}
}
```