InternVL-Chat

This folder contains the implementation of InternVL-Chat.

🛠️ Installation

See INSTALLATION.md

In addition, using this codebase requires executing the following steps:

  • Install other requirements:

    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .

📦 Model Preparation

| model name | type | download | size |
| --- | --- | --- | --- |
| InternViT-6B-448px-V1-2 | ViT | 🤗 HF link | 11.1 GB |
| Nous-Hermes-2-Yi-34B | LLM | 🤗 HF link | 65.0 GB |

Please download the above model weights and place them in the pretrained/ folder.

cd pretrained/
# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternViT-6B-448px-V1-2 --local-dir intern_vit_6b_448px_v1_2
huggingface-cli download --resume-download --local-dir-use-symlinks False NousResearch/Nous-Hermes-2-Yi-34B --local-dir Nous-Hermes-2-Yi-34B

The directory structure is:

pretrained
├── intern_vit_6b_448px_v1_2/
└── Nous-Hermes-2-Yi-34B/
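
As an optional sanity check, compare the on-disk sizes with the table above (exact numbers can differ slightly across filesystems):

du -sh pretrained/*/
# expect roughly 11 GB for intern_vit_6b_448px_v1_2 and 65 GB for Nous-Hermes-2-Yi-34B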

🔥 Supervised Fine-tuning

Prepare Training Datasets

Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1.2, using approximately 1.2M visual instruction tuning samples in total, all fully open-source. At a high level, we build upon ShareGPT-4V and additionally integrate LLaVA-ZH, DVQA, ChartQA, AI2D, DocVQA, GeoQA+, and SynthDoG-EN. Most of the data remains consistent with LLaVA-NeXT.

First, download the annotation files and place them in the playground/ folder.

Second, download all the images we used.

Then, organize the data as follows in playground/data:

playground/
├── sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
├── llava_instruct_150k_zh.jsonl
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl
├── dvqa_train_200k.jsonl
├── chartqa_train_18k.jsonl
├── ai2d_train_12k.jsonl
├── docvqa_train_10k.jsonl
├── geoqa+.jsonl
├── synthdog_en.jsonl
├── data
│   ├── ai2d
│   │   └── images
│   ├── chartqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── coco
│   │   └── train2017
│   ├── docvqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── dvqa
│   │   └── images
│   ├── gqa
│   │   └── images
│   ├── llava
│   │   └── llava_pretrain
│   │       └── images
│   ├── ocr_vqa
│   │   └── images
│   ├── sam
│   │   └── images
│   ├── share_textvqa
│   │   └── images
│   ├── synthdog-en
│   │   └── images
│   ├── textvqa
│   │   └── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   └── VG_100K_2
│   ├── web-celebrity
│   │   └── images
│   ├── web-landmark
│   │   └── images
│   ├── wikiart
│   │   └── images
│   ├── geoqa+
│   │   └── images
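
Before launching training, you can run a rough sanity check that the annotation files are present and non-empty. This only checks the layout shown above; it does not validate the JSON contents:

cd playground/
wc -l *.jsonl    # every annotation file should report a non-zero line count
ls data/         # should list ai2d, chartqa, coco, docvqa, dvqa, gqa, llava, ocr_vqa, sam, ...
cd ..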

Start Training

We provide Slurm scripts for multi-node, multi-GPU training. You can train this model with either 32 or 64 GPUs; with 64 GPUs, training takes approximately 18 hours.

  • If you encounter an OOM error, decrease PER_DEVICE_BATCH_SIZE, for example, set PER_DEVICE_BATCH_SIZE=4.
# using 32 GPUs
PARTITION='your partition' GPUS=32 PER_DEVICE_BATCH_SIZE=8 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh
# using 64 GPUs
PARTITION='your partition' GPUS=64 PER_DEVICE_BATCH_SIZE=8 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh

The hyperparameters used for finetuning are listed in the following table.

| Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- | --- |
| InternVL-Chat-V1.2 | 40B | 512 | 1e-5 | 1 | 2048 | 0.05 |
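
For reference, the global batch size equals GPUS × PER_DEVICE_BATCH_SIZE × gradient accumulation steps. Assuming the training script keeps the global batch size fixed at 512 and derives the accumulation steps from it (the variable names below are illustrative, not the script's actual flags):

# illustrative arithmetic only
GLOBAL_BATCH_SIZE=512
PER_DEVICE_BATCH_SIZE=8
for GPUS in 32 64; do
  echo "GPUS=$GPUS -> gradient accumulation steps = $(( GLOBAL_BATCH_SIZE / (GPUS * PER_DEVICE_BATCH_SIZE) ))"
done
# GPUS=32 -> 2, GPUS=64 -> 1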

📊 Evaluation

MultiModal Benchmark

| model | MME | MMB (dev/test) | MMB-CN (dev/test) | POPE | MMVP | MathVista |
| --- | --- | --- | --- | --- | --- | --- |
| InternVL-Chat-V1.1 | 1672.3 / 341.1 | 76.6 / 75.4 | 71.5 / 70.1 | 87.2 | 44.7 | 34.5 |
| InternVL-Chat-V1.2 | 1672.1 / 509.3 | 81.4 / 82.2 | 79.5 / 81.2 | 88.0 | 56.7 | 47.7 |

| model | MMMU (val/test) | CMMMU (val/test) | TinyLVLM | LLaVA-Bench | MM-Vet |
| --- | --- | --- | --- | --- | --- |
| InternVL-Chat-V1.1 | 39.1 / 35.3 | 34.8 / 34.0 | 344.5 | 76.3 | 45.0 |
| InternVL-Chat-V1.2 | 51.6 / 46.2 | TODO | 350.3 | - | 48.9 |

Visual Question Answering

| model | VQAv2 (test) | OKVQA (val) | TextVQA (val) | VizWiz (val/test) | AI2D (test) | GQA (test) | SQA (test) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL-Chat-V1.1 | 80.9 | 64.2 | 65.8 | 58.3 / 57.3 | 70.2 | 62.4 | 91.2 |
| InternVL-Chat-V1.2 | - | 62.5 | 69.7 | 61.9 / 60.0 | 71.6 | 64.0 | 83.3 |

Image Captioning

| model | COCO (test) | Flickr30K (test) | NoCaps (val) |
| --- | --- | --- | --- |
| InternVL-Chat-V1.1 | 141.8* | 84.3 | 120.4 |
| InternVL-Chat-V1.2 | 113.9 | 92.4 | 112.5 |

📊 Evaluation (Legacy Models)

| model | QLLaMA | LLM | res | COCO | Flickr | NoCaps | VQAv2 | GQA | VizWiz | TextVQA | MME | POPE | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL-Chat | ✔️ | frozen V-7B | 224 | 141.4 | 89.7 | 120.5 | 72.3 | 57.7 | 44.5 | 42.1 | 1298.5 | 85.2 | TODO |
| InternVL-Chat | ✔️ | frozen V-13B | 224 | 142.4 | 89.9 | 123.1 | 71.7 | 59.5 | 54.0 | 49.1 | 1317.2 | 85.4 | TODO |
| InternVL-Chat | ✔️ | V-13B | 336 | 146.2 | 92.2 | 126.2 | 81.2 | 66.6 | 58.5 | 61.5 | 1586.4 | 87.6 | TODO |

❓ How to Evaluate

Please prepare the data according to the following directory structure.

Directory Structure
data
├── flickr30k
│   ├── flickr30k_test_karpathy.json
│   └── Images/
├── coco
│   ├── annotations
│   │   ├── coco_karpathy_test_gt.json
│   │   ├── coco_karpathy_test.json
│   │   └── ...
│   ├── train2014/
│   ├── val2014/
│   └── test2015/
├── nocaps
│   ├── nocaps_val_4500_captions.json
│   └── images/
├── vqav2
│   ├── v2_mscoco_train2014_annotations.json
│   ├── v2_mscoco_train2014_complementary_pairs.json
│   ├── v2_mscoco_val2014_annotations.json
│   ├── v2_OpenEnded_mscoco_test2015_questions.json
│   ├── v2_OpenEnded_mscoco_test-dev2015_questions.json
│   ├── v2_OpenEnded_mscoco_train2014_questions.json
│   ├── v2_OpenEnded_mscoco_val2014_questions.json
│   ├── vqav2_testdev.jsonl
│   ├── vqav2_train.jsonl
│   ├── vqav2_val.jsonl
│   ├── train2014/ -> ../coco/train2014/
│   ├── val2014/ -> ../coco/val2014/
│   └── test2015/ -> ../coco/test2015/
├── okvqa
│   ├── mscoco_train2014_annotations.json
│   ├── mscoco_val2014_annotations.json
│   ├── OpenEnded_mscoco_train2014_questions.json
│   ├── OpenEnded_mscoco_val2014_questions.json
│   ├── okvqa_train.jsonl
│   ├── okvqa_val.jsonl
│   ├── train2014/ -> ../coco/train2014/
│   └── val2014/ -> ../coco/val2014/
├── textvqa
│   ├── textvqa_train_annotations.json
│   ├── textvqa_train.jsonl
│   ├── textvqa_train_questions.json
│   ├── textvqa_val_annotations.json
│   ├── textvqa_val.jsonl
│   ├── textvqa_val_questions.json
│   ├── textvqa_val_llava.jsonl
│   └── train_images/
├── vizwiz
│   ├── vizwiz_test.jsonl
│   ├── vizwiz_train_annotations.json
│   ├── vizwiz_train.jsonl
│   ├── vizwiz_train_questions.json
│   ├── vizwiz_val_annotations.json
│   ├── vizwiz_val.jsonl
│   ├── vizwiz_val_questions.json
│   ├── test/
│   ├── train/
│   ├── val/
│   └── annotations/
├── docvqa
│   ├── test.jsonl
│   ├── train.jsonl
│   ├── val.jsonl
│   ├── test/
│   ├── train/
│   └── val/
├── chartqa
│   ├── ChartQA Dataset/
│   │   ├── train/
│   │   ├── test/
│   │   ├── val/
│   ├── test_augmented.jsonl
│   ├── test_human.jsonl
│   ├── train_augmented.jsonl
│   └── train_human.jsonl
├── gqa
│   ├── images/
│   ├── eval.py
│   ├── challenge_all_questions.json
│   ├── challenge_balanced_questions.json
│   ├── llava_gqa_testdev_balanced_qwen_format.jsonl
│   ├── submission_all_questions.json
│   ├── test_all_questions.json
│   ├── test_balanced.jsonl
│   ├── test_balanced_questions.json
│   ├── testdev_all_questions.json
│   ├── testdev_balanced_all_questions.json
│   ├── testdev_balanced_questions.json
│   ├── train_all_questions/
│   ├── train_balanced.jsonl
│   ├── train_balanced_questions.json
│   ├── val_all_questions.json
│   └── val_balanced_questions.json
├── ocrvqa
│   ├── images/
│   ├── ocrvqa_test.jsonl
│   ├── ocrvqa_train.jsonl
│   └── ocrvqa_val.jsonl
├── ai2diagram
│   ├── ai2d/
│   ├── test.jsonl
│   └── train.jsonl
├── scienceqa
│   ├── images/
│   ├── problems.json
│   └── scienceqa_test_img.jsonl
├── refcoco
│   ├── refcocog_test.jsonl
│   ├── refcocog_val.jsonl
│   ├── refcoco_testA.jsonl
│   ├── refcoco+_testA.jsonl
│   ├── refcoco_testB.jsonl
│   ├── refcoco+_testB.jsonl
│   ├── refcoco_val.jsonl
│   └── refcoco+_val.jsonl
├── mme
│   ├── MME_Benchmark_release/
│   └── images/
├── pope
│   ├── coco/
│   │    ├── coco_pope_adversarial.json
│   │    ├── coco_pope_popular.json
│   │    └── coco_pope_random.json
│   ├── val2014/ -> ../coco/val2014/
│   └── llava_pope_test.jsonl
├── tiny_lvlm
│   └── updated_datasets
│       ├── Object_Hallucination
│       ├── ...
│       └── Visual_Reasoning
├── mmbench
│   ├── mmbench_dev_20230712.tsv
│   ├── mmbench_dev_cn_20231003.tsv
│   ├── mmbench_dev_en_20231003.tsv
│   ├── mmbench_test_cn_20231003.tsv
│   └── mmbench_test_en_20231003.tsv
├── llava-bench-in-the-wild
│   ├── answers_gpt4.jsonl
│   ├── ...
│   └── images/
├── mmmu
│   ├── Accounting/
│   ├── ...
│   └── Sociology
├── mm-vet
│   └── images/
├── MMVP
│   ├── MMVP Images/
│   ├── Questions.csv
│   └── Questions.xlsx
├── MMVP_VLM
│   ├── MLLM_VLM Images/
│   └── Questions.csv
├── MathVista
│   ├── annot_testmini.json
│   └── AI4Math___math_vista/
├── SEED
│   ├── SEED-Bench.json
│   ├── SEED-Bench-image/
│   └── SEED-Bench-video-image-1/
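
After organizing the data, a quick way to spot missing benchmarks is to check which of the expected top-level folders exist. This is only a shallow check against the structure above; it does not verify the files inside each folder:

for d in coco flickr30k nocaps vqav2 okvqa textvqa vizwiz docvqa chartqa gqa ocrvqa \
         ai2diagram scienceqa refcoco mme pope tiny_lvlm mmbench llava-bench-in-the-wild \
         mmmu mm-vet MMVP MMVP_VLM MathVista SEED; do
  [ -d "data/$d" ] || echo "missing: data/$d"
done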

Image Caption

COCO images are used in VQAv2/OK-VQA/RefCOCO/RefCOCO+/RefCOCOg. Make sure you have already downloaded COCO images before evaluating on these benchmarks.

Data Preparation
mkdir -p data/coco && cd data/coco

# download coco images
wget http://images.cocodataset.org/zips/train2014.zip && unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip && unzip val2014.zip
wget http://images.cocodataset.org/zips/test2015.zip && unzip test2015.zip

mkdir -p annotations && cd annotations/
# download converted annotation files
wget https://github.com/OpenGVLab/InternVL/releases/download/data/coco_karpathy_test.json
wget https://github.com/OpenGVLab/InternVL/releases/download/data/coco_karpathy_test_gt.json

cd ../../../
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> caption-coco
Data Preparation
mkdir -p data/flickr30k && cd data/flickr30k

# download images from https://bryanplummer.com/Flickr30kEntities/
# karpathy split annotations can be downloaded from https://cs.stanford.edu/people/karpathy/deepimagesent/
# download converted files
wget https://github.com/OpenGVLab/InternVL/releases/download/data/flickr30k_test_karpathy.json

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> caption-flickr30k
Data Preparation
mkdir -p data/nocaps && cd data/nocaps

# download images from https://nocaps.org/download
# original annotations can be downloaded from https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json
wget https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> caption-nocaps

General VQA

Data Preparation
mkdir -p data/vqav2 && cd data/vqav2

# make sure you have downloaded COCO images
ln -s ../coco/train2014 ./
ln -s ../coco/val2014 ./
ln -s ../coco/test2015 ./

# download questions and annotations
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip && unzip v2_Annotations_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip && unzip v2_Questions_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip && unzip v2_Annotations_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip && unzip v2_Questions_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Test_mscoco.zip && unzip v2_Questions_Test_mscoco.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_testdev.jsonl

cd ../..
Evaluation
# VQAv2-val
GPUS=8 sh evaluate.sh <checkpoint> vqa-vqav2-val
# VQAv2-testdev
GPUS=8 sh evaluate.sh <checkpoint> vqa-vqav2-testdev

For the testdev set, submit the results to the evaluation server.

Data Preparation
mkdir -p data/okvqa && cd data/okvqa

# make sure you have downloaded COCO images
ln -s ../coco/train2014 ./
ln -s ../coco/val2014 ./

# download annotations and questions
wget https://okvqa.allenai.org/static/data/mscoco_train2014_annotations.json.zip && unzip mscoco_train2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_train2014_questions.json.zip && unzip OpenEnded_mscoco_train2014_questions.json.zip
wget https://okvqa.allenai.org/static/data/mscoco_val2014_annotations.json.zip && unzip mscoco_val2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_val2014_questions.json.zip && unzip OpenEnded_mscoco_val2014_questions.json.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_val.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> vqa-okvqa-val
Data Preparation
mkdir -p data/textvqa && cd data/textvqa

# download images
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip && unzip train_val_images.zip

# download annotations and questions
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val.jsonl
wget https://github.com/OpenGVLab/InternVL/releases/download/data/textvqa_val_llava.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> vqa-textvqa-val
Data Preparation
mkdir -p data/vizwiz && cd data/vizwiz

# download images
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/train.zip && unzip train.zip
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/val.zip && unzip val.zip
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip && unzip test.zip

# download annotations
wget https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip && unzip Annotations.zip

# download converted files
# train
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train.jsonl
# val
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val.jsonl
# test
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_test.jsonl
cd ../..
Evaluation
# VizWiz val
GPUS=8 sh evaluate.sh <checkpoint> vqa-vizwiz-val
# VizWiz test
GPUS=8 sh evaluate.sh <checkpoint> vqa-vizwiz-test

For the test set, submit the results to the evaluation server.

Data Preparation
mkdir -p data/docvqa && cd data/docvqa

# download images and annotations from https://www.docvqa.org/datasets

# download converted files
# train
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/train.jsonl
# val
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/val.jsonl
# test
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/test.jsonl
cd ../..
Evaluation
# DocVQA-val
GPUS=8 sh evaluate.sh <checkpoint> vqa-docvqa-val
# DocVQA-test
GPUS=8 sh evaluate.sh <checkpoint> vqa-docvqa-test

For the test set, submit the results to the evaluation server.

Data Preparation
mkdir -p data/chartqa && cd data/chartqa

# download images from https://drive.google.com/file/d/1Lm_w6zeET1Hyl_9ks6w5nEsgpoyPHalV/view

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_augmented.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_augmented.jsonl

cd ../..
Evaluation
# ChartQA-test-human
GPUS=8 sh evaluate.sh <checkpoint> vqa-chartqa-test-human
# ChartQA-test-augmented
GPUS=8 sh evaluate.sh <checkpoint> vqa-chartqa-test-augmented
Data Preparation
mkdir -p data/gqa && cd data/gqa

# download images
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
unzip images.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/testdev_balanced.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/train_balanced.jsonl
wget https://github.com/OpenGVLab/InternVL/releases/download/data/llava_gqa_testdev_balanced_qwen_format.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> vqa-gqa-testdev
Data Preparation
mkdir -p data/ocrvqa && cd data/ocrvqa

# download images by following instructions at https://ocr-vqa.github.io/kvqa_ProjectFiles/README.txt

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_test.jsonl

cd ../..
Evaluation
# OCRVQA-val
GPUS=8 sh evaluate.sh <checkpoint> vqa-ocrvqa-val
# OCRVQA-test
GPUS=8 sh evaluate.sh <checkpoint> vqa-ocrvqa-test
Data Preparation
mkdir -p data/ai2diagram && cd data/ai2diagram

# download images
wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip
unzip ai2d-all.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/test.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> vqa-ai2d-test
Data Preparation
mkdir -p data/scienceqa/images && cd data/scienceqa/images

# download images
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip && unzip test.zip

cd ..

# download original questions
wget https://raw.githubusercontent.com/lupantech/ScienceQA/main/data/scienceqa/problems.json

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/scienceqa/scienceqa_test_img.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> scienceqa

Refer Expression Comprehension

RefCOCO/RefCOCO+/RefCOCOg

Data Preparation
mkdir -p data/refcoco && cd data/refcoco

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testB.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testB.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_test.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> refcoco

MultiModal Dialogue

Data Preparation
mkdir -p data/mme && cd data/mme

# 1. Download MME images and eval_tool from the MME repo:
#    https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/blob/Evaluation/README.md
# 2. Rearrange the images by running get_images.py:
python get_images.py
cd ../..
Evaluation
# single GPU testing
CUDA_VISIBLE_DEVICES=0 sh evaluate.sh <checkpoint> mme
Data Preparation
mkdir -p data/mmbench && cd data/mmbench

# download csv files of mmbench
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_en_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_en_20231003.tsv

cd ../..
Evaluation
# mmbench_dev_20230712
GPUS=8 sh evaluate.sh <checkpoint> mmbench-dev-en
# mmbench_dev_cn_20231003
GPUS=8 sh evaluate.sh <checkpoint> mmbench-dev-cn
# mmbench_test_en_20231003
GPUS=8 sh evaluate.sh <checkpoint> mmbench-test-en
# mmbench_test_cn_20231003
GPUS=8 sh evaluate.sh <checkpoint> mmbench-test-cn

Then, submit the results to the evaluation server.

Data Preparation
mkdir -p data/pope && cd data/pope

# make sure you have downloaded COCO images
ln -s ../coco/val2014 ./
wget https://github.com/OpenGVLab/InternVL/releases/download/data/llava_pope_test.jsonl

# download `coco` from POPE
mkdir -p coco && cd coco
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_adversarial.json
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_popular.json
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_random.json
cd ../../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> pope
Data Preparation

The evaluation code will automatically download the dataset from Hugging Face.
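
If the default Hugging Face cache sits on a small disk, you can point it somewhere else before running the evaluation; the path below is just a placeholder:

export HF_HOME=/path/to/a/larger/disk/huggingface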

Evaluation
# dev set
GPUS=8 sh evaluate.sh <checkpoint> mmmu-dev
# val set
GPUS=8 sh evaluate.sh <checkpoint> mmmu-val
# test set
GPUS=8 sh evaluate.sh <checkpoint> mmmu-test

For the test set, submit the results to the evaluation server.

CMMMU

Data Preparation
mkdir -p data/tiny_lvlm && cd data/tiny_lvlm

# download dataset from https://github.com/OpenGVLab/Multi-Modality-Arena/tree/main/tiny_lvlm_evaluation
# i.e., download `updated_datasets.tar.gz` from https://drive.google.com/file/d/1PuFC612XzOmKwzRldtBb1CFZnIjiR7we/view
tar -xzvf updated_datasets.tar.gz

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> tiny_lvlm
Data Preparation
cd data/
# download dataset from https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild
git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild
cd llava-bench-in-the-wild/
rm -rf images && mkdir -p images && cd images
# download all 24 images
wget https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild/resolve/main/images/001.jpg
# ...
wget https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild/resolve/main/images/024.jpg
cd ../../../
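
Equivalently, assuming all 24 files follow the NNN.jpg pattern shown above, the per-file downloads can be written as a loop:

cd data/llava-bench-in-the-wild/images
for i in $(seq -f "%03g" 1 24); do
  wget https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild/resolve/main/images/$i.jpg
done
cd ../../../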
Evaluation
# single GPU testing
export OPENAI_API_KEY='your_gpt4_key'
CUDA_VISIBLE_DEVICES=0 sh evaluate.sh <checkpoint> llava-bench
Data Preparation
mkdir -p data/mm-vet && cd data/mm-vet
wget https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip
unzip mm-vet.zip
cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> mmvet
Data Preparation
cd data
git lfs install
git clone https://huggingface.co/datasets/MMVP/MMVP
git clone https://huggingface.co/datasets/MMVP/MMVP_VLM
cd ..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> mmvp
Data Preparation
mkdir -p data/MathVista && cd data/MathVista
# Execute the following python code
# from datasets import load_dataset
# dataset = load_dataset("AI4Math/MathVista")
# dataset.save_to_disk('./MathVista')
wget https://huggingface.co/datasets/AI4Math/MathVista/raw/main/annot_testmini.json
cd ../..
Evaluation
# testmini set
GPUS=8 sh evaluate.sh <checkpoint> mathvista-testmini
# test set
GPUS=8 sh evaluate.sh <checkpoint> mathvista-test
Data Preparation
  1. Follow the official SEED-Bench-1 data preparation instructions to download the images and the videos, and put the images under ./playground/data/eval/seed_bench/SEED-Bench-image.
  2. Extract the middle frame from each downloaded video and put the frames under ./playground/data/eval/seed_bench/SEED-Bench-video-image. We provide the script extract_video_frames.py, modified from the official one; a minimal ffmpeg sketch of this step is shown below.
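The sketch below shows what the frame-extraction step amounts to, using ffmpeg/ffprobe. It is illustrative only: the video extension, output directory, and frame naming are assumptions, so rely on the provided extract_video_frames.py for the layout the evaluation expects.

mkdir -p SEED-Bench-video-image
for f in videos/*.mp4; do
  # probe the duration, seek to the midpoint, and dump a single frame
  dur=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$f")
  mid=$(awk -v d="$dur" 'BEGIN { printf "%.3f", d / 2 }')
  ffmpeg -y -ss "$mid" -i "$f" -frames:v 1 "SEED-Bench-video-image/$(basename "${f%.*}").png"
done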
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> seed