Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

🏠 LMMs-Lab Homepage | Discord: lmms-eval | 🎓 Project Page | 📝 arXiv Paper


Announcement

  • [2025-1] 🎉🎉 We introduce Video-MMMU, a massive, multi-modal, multi-discipline video benchmark that evaluates models' ability to acquire knowledge from educational videos.

Evaluation

The evaluation of VideoMMMU is integrated into LMMs-Eval. Detailed instructions for running it follow.

Installation

For standard usage, install the package from PyPI:

pip install lmms-eval
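
If the install succeeded, the package should be importable and report a version. A minimal sanity check in Python, using only the standard library (no lmms-eval-specific API assumed):

import importlib.metadata

# Look up the installed distribution's version string; this raises
# PackageNotFoundError if lmms-eval is not installed in this environment.
print(importlib.metadata.version("lmms-eval"))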

For development, clone the repository and install it in editable mode:

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .

If you want to evaluate LLaVA models, you will also have to clone the LLaVA-NeXT repository and install it:

git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .
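
As a quick check that the editable install is visible to your environment, try importing the package. The top-level module name llava is an assumption based on the repository's layout:

import llava  # assumed top-level module of the LLaVA-NeXT package

# Print where the import resolved, to confirm it points at your checkout.
print(llava.__file__)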

Command

Evaluation of LLaVA-OneVision on VideoMMMU

accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
    --tasks video_mmmu \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/
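
With --log_samples, per-sample records are written under --output_path alongside the aggregate scores. The exact directory layout and file names depend on the lmms-eval version, so the sketch below simply walks ./logs/ and summarizes whatever JSON it finds:

import glob
import json

# Hypothetical inspection loop: the structure under ./logs/ is version-dependent.
for path in glob.glob("./logs/**/*.json", recursive=True):
    with open(path) as f:
        data = json.load(f)
    size = len(data) if isinstance(data, (list, dict)) else "n/a"
    print(f"{path}: {type(data).__name__} with {size} top-level entries")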

Evaluate a single track of VideoMMMU (Perception shown here; see the note after the command for the other track names)

accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
    --tasks video_mmmu_perception \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/
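
The perception task above is one of three tracks. Assuming the task names follow the same pattern as video_mmmu_adaptation_question_only below, the Comprehension and Adaptation tracks should be selectable as video_mmmu_comprehension and video_mmmu_adaptation, respectively.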

Evaluate the question_only track of VideoMMMU (Knowledge Acquisition Experiment)

accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=1,torch_dtype=bfloat16 \
    --tasks video_mmmu_adaptation_question_only \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/
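
This run withholds the video (hence max_frames_num=1) and asks the Adaptation-track questions directly; comparing its accuracy with the regular Adaptation run yields the Δknowledge column in the leaderboard below. A minimal sketch of the normalized knowledge gain as defined in the paper; the function name and the example accuracies are illustrative, not reported results:

def delta_knowledge(acc_after: float, acc_before: float) -> float:
    """Normalized knowledge gain in percent (the paper's Δknowledge).

    acc_before: Adaptation accuracy without watching the video (question_only run).
    acc_after:  Adaptation accuracy after watching the video.
    """
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0

# Illustrative placeholder accuracies, not reported results:
print(f"{delta_knowledge(acc_after=55.0, acc_before=50.0):+.1f}")  # +10.0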

Video-MMMU Leaderboard


We evaluate various open-source and proprietary LMMs; the table below reports accuracy (%) on each track. To submit your model's results, please contact Kairui Hu at [email protected] or Bo Li at [email protected].

| Model | Overall | Perception | Comprehension | Adaptation | Δknowledge |
|---|---|---|---|---|---|
| Human Expert | 74.44 | 84.33 | 78.67 | 60.33 | +33.1 |
| Claude-3.5-Sonnet | 65.78 | 72.00 | 69.67 | 55.67 | +11.4 |
| GPT-4o | 61.22 | 66.00 | 62.00 | 55.67 | +15.6 |
| Qwen-2.5-VL-72B | 60.22 | 69.33 | 61.00 | 50.33 | +9.7 |
| Gemini 1.5 Pro | 53.89 | 59.00 | 53.33 | 49.33 | +8.7 |
| Aria | 50.78 | 65.67 | 46.67 | 40.00 | +3.2 |
| Gemini 1.5 Flash | 49.78 | 57.33 | 49.00 | 43.00 | -3.3 |
| LLaVA-Video-72B | 49.67 | 59.67 | 46.00 | 43.33 | +7.1 |
| LLaVA-OneVision-72B | 48.33 | 59.67 | 42.33 | 43.00 | +6.6 |
| Qwen-2.5-VL-7B | 47.44 | 58.33 | 44.33 | 39.67 | +2.2 |
| mPLUG-Owl3-7B | 42.00 | 49.33 | 38.67 | 38.00 | +7.5 |
| MAmmoTH-VL-8B | 41.78 | 51.67 | 40.00 | 33.67 | +1.5 |
| InternVL2-8B | 37.44 | 47.33 | 33.33 | 31.67 | -8.5 |
| LLaVA-Video-7B | 36.11 | 41.67 | 33.33 | 33.33 | -5.3 |
| VILA1.5-40B | 34.00 | 38.67 | 30.67 | 32.67 | +9.4 |
| Llama-3.2-11B | 30.00 | 35.67 | 32.33 | 22.00 | - |
| LongVA-7B | 23.98 | 24.00 | 24.33 | 23.67 | -7.0 |
| VILA1.5-8B | 20.89 | 20.33 | 17.33 | 25.00 | +5.9 |

Citation

@article{hu2025videommmu,
    title={Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos},
    author={Kairui Hu and Penghao Wu and Fanyi Pu and Wang Xiao and Yuanhan Zhang and Xiang Yue and Bo Li and Ziwei Liu},
    journal={arXiv preprint arXiv:2501.13826},
    year={2025},
    url={https://arxiv.org/abs/2501.13826}
}