InternVL-Chat

This folder contains the implementation of InternVL-Chat.

🛠️ Installation

See INSTALLATION.md

In addition, using this codebase requires executing the following steps:

  • Install other requirements:

    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .

📦 Model Preparation

| model name | type | download | size |
| --- | --- | --- | --- |
| InternViT-6B-448px-V1-2 | ViT | 🤗 HF link | 11.1 GB |
| Nous-Hermes-2-Yi-34B | LLM | 🤗 HF link | 65.0 GB |

Please download the above model weights and place them in the pretrained/ folder.

cd pretrained/
# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternViT-6B-448px-V1-2 --local-dir intern_vit_6b_448px_v1_2
huggingface-cli download --resume-download --local-dir-use-symlinks False NousResearch/Nous-Hermes-2-Yi-34B --local-dir Nous-Hermes-2-Yi-34B

The directory structure is:

pretrained
├── intern_vit_6b_448px_v1_2/
└── Nous-Hermes-2-Yi-34B/
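
As an optional sanity check, compare the on-disk sizes with the table above (exact numbers can differ slightly across filesystems):

du -sh pretrained/*/
# expect roughly 11 GB for intern_vit_6b_448px_v1_2 and 65 GB for Nous-Hermes-2-Yi-34B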

🔥 Supervised Fine-tuning

Prepare Training Datasets

Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1.2, using approximately 1.2M visual instruction tuning samples in total, all fully open-source. At a high level, we build upon ShareGPT-4V and additionally integrate LLaVA-ZH, DVQA, ChartQA, AI2D, DocVQA, GeoQA+, and SynthDoG-EN. Most of the data remains consistent with LLaVA-NeXT.

First, download the annotation files and place them in the playground/ folder.

Second, download all the images we used.

Then, organize the data as follows in playground/data:

playground/
├── sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
├── llava_instruct_150k_zh.jsonl
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl
├── dvqa_train_200k.jsonl
├── chartqa_train_18k.jsonl
├── ai2d_train_12k.jsonl
├── docvqa_train_10k.jsonl
├── geoqa+.jsonl
├── synthdog_en.jsonl
├── data
│   ├── ai2d
│   │   └── images
│   ├── chartqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── coco
│   │   └── train2017
│   ├── docvqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── dvqa
│   │   └── images
│   ├── gqa
│   │   └── images
│   ├── llava
│   │   └── llava_pretrain
│   │       └── images
│   ├── ocr_vqa
│   │   └── images
│   ├── sam
│   │   └── images
│   ├── share_textvqa
│   │   └── images
│   ├── synthdog-en
│   │   └── images
│   ├── textvqa
│   │   └── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   └── VG_100K_2
│   ├── web-celebrity
│   │   └── images
│   ├── web-landmark
│   │   └── images
│   ├── wikiart
│   │   └── images
│   ├── geoqa+
│   │   └── images
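
Before launching training, you can run a rough sanity check that the annotation files are present and non-empty. This only checks the layout shown above; it does not validate the JSON contents:

cd playground/
wc -l *.jsonl    # every annotation file should report a non-zero line count
ls data/         # should list ai2d, chartqa, coco, docvqa, dvqa, gqa, llava, ocr_vqa, sam, ...
cd ..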

Start Training

We provide Slurm scripts for multi-node, multi-GPU training. You can train this model with either 32 or 64 GPUs; with 64 GPUs, training takes approximately 18 hours.

  • If you encounter an OOM error, decrease PER_DEVICE_BATCH_SIZE, for example, set PER_DEVICE_BATCH_SIZE=4.
# using 32 GPUs
PARTITION='your partition' GPUS=32 PER_DEVICE_BATCH_SIZE=8 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh
# using 64 GPUs
PARTITION='your partition' GPUS=64 PER_DEVICE_BATCH_SIZE=8 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh

The hyperparameters used for finetuning are listed in the following table.

| Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- | --- |
| InternVL-Chat-V1.2 | 40B | 512 | 1e-5 | 1 | 2048 | 0.05 |
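
For reference, the global batch size equals GPUS × PER_DEVICE_BATCH_SIZE × gradient accumulation steps. Assuming the training script keeps the global batch size fixed at 512 and derives the accumulation steps from it (the variable names below are illustrative, not the script's actual flags):

# illustrative arithmetic only
GLOBAL_BATCH_SIZE=512
PER_DEVICE_BATCH_SIZE=8
for GPUS in 32 64; do
  echo "GPUS=$GPUS -> gradient accumulation steps = $(( GLOBAL_BATCH_SIZE / (GPUS * PER_DEVICE_BATCH_SIZE) ))"
done
# GPUS=32 -> 2, GPUS=64 -> 1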

📊 Evaluation

MultiModal Benchmark

| model | MME | MMB (dev/test) | MMB-CN (dev/test) | POPE | MMVP | MathVista |
| --- | --- | --- | --- | --- | --- | --- |
| InternVL-Chat-V1.1 | 1672.3 / 341.1 | 76.6 / 75.4 | 71.5 / 70.1 | 87.2 | 44.7 | 34.5 |
| InternVL-Chat-V1.2 | 1672.1 / 509.3 | 81.4 / 82.2 | 79.5 / 81.2 | 88.0 | 56.7 | 47.7 |

| model | MMMU (val/test) | CMMMU (val/test) | TinyLVLM | LLaVA-Bench | MM-Vet |
| --- | --- | --- | --- | --- | --- |
| InternVL-Chat-V1.1 | 39.1 / 35.3 | 34.8 / 34.0 | 344.5 | 76.3 | 45.0 |
| InternVL-Chat-V1.2 | 51.6 / 46.2 | TODO | 350.3 | - | 48.9 |

Visual Question Answering

| model | VQAv2 (test) | OKVQA (val) | TextVQA (val) | VizWiz (val/test) | AI2D (test) | GQA (test) | SQA (test) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL-Chat-V1.1 | 80.9 | 64.2 | 65.8 | 58.3 / 57.3 | 70.2 | 62.4 | 91.2 |
| InternVL-Chat-V1.2 | - | 62.5 | 69.7 | 61.9 / 60.0 | 71.6 | 64.0 | 83.3 |

Image Captioning

| model | COCO (test) | Flickr30K (test) | NoCaps (val) |
| --- | --- | --- | --- |
| InternVL-Chat-V1.1 | 141.8* | 84.3 | 120.4 |
| InternVL-Chat-V1.2 | 113.9 | 92.4 | 112.5 |

📊 Evaluation (Legacy Models)

| model | QLLaMA | LLM | res | COCO | Flickr | NoCaps | VQAv2 | GQA | VizWiz | TextVQA | MME | POPE | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL-Chat | ✔️ | frozen V-7B | 224 | 141.4 | 89.7 | 120.5 | 72.3 | 57.7 | 44.5 | 42.1 | 1298.5 | 85.2 | TODO |
| InternVL-Chat | ✔️ | frozen V-13B | 224 | 142.4 | 89.9 | 123.1 | 71.7 | 59.5 | 54.0 | 49.1 | 1317.2 | 85.4 | TODO |
| InternVL-Chat | ✔️ | V-13B | 336 | 146.2 | 92.2 | 126.2 | 81.2 | 66.6 | 58.5 | 61.5 | 1586.4 | 87.6 | TODO |

❓ How to Evaluate

Please prepare the data according to the following directory structure.

Directory Structure
data
├── flickr30k
│   ├── flickr30k_test_karpathy.json
│   └── Images/
├── coco
│   ├── annotations
│   │   ├── coco_karpathy_test_gt.json
│   │   ├── coco_karpathy_test.json
│   │   └── ...
│   ├── train2014/
│   ├── val2014/
│   └── test2015/
├── nocaps
│   ├── nocaps_val_4500_captions.json
│   └── images/
├── vqav2
│   ├── v2_mscoco_train2014_annotations.json
│   ├── v2_mscoco_train2014_complementary_pairs.json
│   ├── v2_mscoco_val2014_annotations.json
│   ├── v2_OpenEnded_mscoco_test2015_questions.json
│   ├── v2_OpenEnded_mscoco_test-dev2015_questions.json
│   ├── v2_OpenEnded_mscoco_train2014_questions.json
│   ├── v2_OpenEnded_mscoco_val2014_questions.json
│   ├── vqav2_testdev.jsonl
│   ├── vqav2_train.jsonl
│   ├── vqav2_val.jsonl
│   ├── train2014/ -> ../coco/train2014/
│   ├── val2014/ -> ../coco/val2014/
│   └── test2015/ -> ../coco/test2015/
├── okvqa
│   ├── mscoco_train2014_annotations.json
│   ├── mscoco_val2014_annotations.json
│   ├── OpenEnded_mscoco_train2014_questions.json
│   ├── OpenEnded_mscoco_val2014_questions.json
│   ├── okvqa_train.jsonl
│   ├── okvqa_val.jsonl
│   ├── train2014/ -> ../coco/train2014/
│   └── val2014/ -> ../coco/val2014/
├── textvqa
│   ├── textvqa_train_annotations.json
│   ├── textvqa_train.jsonl
│   ├── textvqa_train_questions.json
│   ├── textvqa_val_annotations.json
│   ├── textvqa_val.jsonl
│   ├── textvqa_val_questions.json
│   ├── textvqa_val_llava.jsonl
│   └── train_images/
├── vizwiz
│   ├── vizwiz_test.jsonl
│   ├── vizwiz_train_annotations.json
│   ├── vizwiz_train.jsonl
│   ├── vizwiz_train_questions.json
│   ├── vizwiz_val_annotations.json
│   ├── vizwiz_val.jsonl
│   ├── vizwiz_val_questions.json
│   ├── test/
│   ├── train/
│   ├── val/
│   └── annotations/
├── docvqa
│   ├── test.jsonl
│   ├── train.jsonl
│   ├── val.jsonl
│   ├── test/
│   ├── train/
│   └── val/
├── chartqa
│   ├── ChartQA Dataset/
│   │   ├── train/
│   │   ├── test/
│   │   ├── val/
│   ├── test_augmented.jsonl
│   ├── test_human.jsonl
│   ├── train_augmented.jsonl
│   └── train_human.jsonl
├── gqa
│   ├── images/
│   ├── eval.py
│   ├── challenge_all_questions.json
│   ├── challenge_balanced_questions.json
│   ├── llava_gqa_testdev_balanced_qwen_format.jsonl
│   ├── submission_all_questions.json
│   ├── test_all_questions.json
│   ├── test_balanced.jsonl
│   ├── test_balanced_questions.json
│   ├── testdev_all_questions.json
│   ├── testdev_balanced_all_questions.json
│   ├── testdev_balanced_questions.json
│   ├── train_all_questions/
│   ├── train_balanced.jsonl
│   ├── train_balanced_questions.json
│   ├── val_all_questions.json
│   └── val_balanced_questions.json
├── ocrvqa
│   ├── images/
│   ├── ocrvqa_test.jsonl
│   ├── ocrvqa_train.jsonl
│   └── ocrvqa_val.jsonl
├── ai2diagram
│   ├── ai2d/
│   ├── test.jsonl
│   └── train.jsonl
├── scienceqa
│   ├── images/
│   ├── problems.json
│   └── scienceqa_test_img.jsonl
├── refcoco
│   ├── refcocog_test.jsonl
│   ├── refcocog_val.jsonl
│   ├── refcoco_testA.jsonl
│   ├── refcoco+_testA.jsonl
│   ├── refcoco_testB.jsonl
│   ├── refcoco+_testB.jsonl
│   ├── refcoco_val.jsonl
│   └── refcoco+_val.jsonl
├── mme
│   ├── MME_Benchmark_release/
│   └── images/
├── pope
│   ├── coco/
│   │    ├── coco_pope_adversarial.json
│   │    ├── coco_pope_popular.json
│   │    └── coco_pope_random.json
│   ├── val2014/ -> ../coco/val2014/
│   └── llava_pope_test.jsonl
├── tiny_lvlm
│   └── updated_datasets
│       ├── Object_Hallucination
│       ├── ...
│       └── Visual_Reasoning
├── mmbench
│   ├── mmbench_dev_20230712.tsv
│   ├── mmbench_dev_cn_20231003.tsv
│   ├── mmbench_dev_en_20231003.tsv
│   ├── mmbench_test_cn_20231003.tsv
│   └── mmbench_test_en_20231003.tsv
├── llava-bench-in-the-wild
│   ├── answers_gpt4.jsonl
│   ├── ...
│   └── images/
├── mmmu
│   ├── Accounting/
│   ├── ...
│   └── Sociology
├── mm-vet
│   └── images/
├── MMVP
│   ├── MMVP Images/
│   ├── Questions.csv
│   └── Questions.xlsx
├── MMVP_VLM
│   ├── MLLM_VLM Images/
│   └── Questions.csv
├── MathVista
│   ├── annot_testmini.json
│   └── AI4Math___math_vista/
├── SEED
│   ├── SEED-Bench.json
│   ├── SEED-Bench-image/
│   └── SEED-Bench-video-image-1/
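
After organizing the data, a quick way to spot missing benchmarks is to check which of the expected top-level folders exist. This is only a shallow check against the structure above; it does not verify the files inside each folder:

for d in coco flickr30k nocaps vqav2 okvqa textvqa vizwiz docvqa chartqa gqa ocrvqa \
         ai2diagram scienceqa refcoco mme pope tiny_lvlm mmbench llava-bench-in-the-wild \
         mmmu mm-vet MMVP MMVP_VLM MathVista SEED; do
  [ -d "data/$d" ] || echo "missing: data/$d"
done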

Image Caption

COCO images are used in VQAv2/OK-VQA/RefCOCO/RefCOCO+/RefCOCOg. Make sure you have already downloaded COCO images before evaluating on these benchmarks.

Data Preparation
mkdir -p data/coco && cd data/coco

# download coco images
wget http://images.cocodataset.org/zips/train2014.zip && unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip && unzip val2014.zip
wget http://images.cocodataset.org/zips/test2015.zip && unzip test2015.zip

mkdir -p annotations && cd annotations/
# download converted annotation files
wget https://github.com/OpenGVLab/InternVL/releases/download/data/coco_karpathy_test.json
wget https://github.com/OpenGVLab/InternVL/releases/download/data/coco_karpathy_test_gt.json

cd ../../../
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> caption-coco
Data Preparation
mkdir -p data/flickr30k && cd data/flickr30k

# download images from https://bryanplummer.com/Flickr30kEntities/
# karpathy split annotations can be downloaded from https://cs.stanford.edu/people/karpathy/deepimagesent/
# download converted files
wget https://github.com/OpenGVLab/InternVL/releases/download/data/flickr30k_test_karpathy.json

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> caption-flickr30k
Data Preparation
mkdir -p data/nocaps && cd data/nocaps

# download images from https://nocaps.org/download
# original annotations can be downloaded from https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json
wget https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> caption-nocaps

General VQA

Data Preparation
mkdir -p data/vqav2 && cd data/vqav2

# make sure you have downloaded COCO images
ln -s ../coco/train2014 ./
ln -s ../coco/val2014 ./
ln -s ../coco/test2015 ./

# download questions and annotations
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip && unzip v2_Annotations_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip && unzip v2_Questions_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip && unzip v2_Annotations_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip && unzip v2_Questions_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Test_mscoco.zip && unzip v2_Questions_Test_mscoco.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_testdev.jsonl

cd ../..
Evaluation
# VQAv2-val
GPUS=8 sh evaluate.sh <checkpoint> vqa-vqav2-val
# VQAv2-testdev
GPUS=8 sh evaluate.sh <checkpoint> vqa-vqav2-testdev

For the testdev set, submit the results to the evaluation server.

Data Preparation
mkdir -p data/okvqa && cd data/okvqa

# make sure you have downloaded COCO images
ln -s ../coco/train2014 ./
ln -s ../coco/val2014 ./

# download annotations and questions
wget https://okvqa.allenai.org/static/data/mscoco_train2014_annotations.json.zip && unzip mscoco_train2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_train2014_questions.json.zip && unzip OpenEnded_mscoco_train2014_questions.json.zip
wget https://okvqa.allenai.org/static/data/mscoco_val2014_annotations.json.zip && unzip mscoco_val2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_val2014_questions.json.zip && unzip OpenEnded_mscoco_val2014_questions.json.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_val.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> vqa-okvqa-val
Data Preparation
mkdir -p data/textvqa && cd data/textvqa

# download images
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip && unzip train_val_images.zip

# download annotations and questions
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val.jsonl
wget https://github.com/OpenGVLab/InternVL/releases/download/data/textvqa_val_llava.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> vqa-textvqa-val
Data Preparation
mkdir -p data/vizwiz && cd data/vizwiz

# download images
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/train.zip && unzip train.zip
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/val.zip && unzip val.zip
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip && unzip test.zip

# download annotations
wget https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip && unzip Annotations.zip

# download converted files
# train
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train.jsonl
# val
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val.jsonl
# test
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_test.jsonl
cd ../..
Evaluation
# VizWiz val
GPUS=8 sh evaluate.sh <checkpoint> vqa-vizwiz-val
# VizWiz test
GPUS=8 sh evaluate.sh <checkpoint> vqa-vizwiz-test

For the test set, submit the results to the evaluation server.

Data Preparation
mkdir -p data/docvqa && cd data/docvqa

# download images and annotations from https://www.docvqa.org/datasets

# download converted files
# train
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/train.jsonl
# val
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/val.jsonl
# test
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/test.jsonl
cd ../..
Evaluation
# DocVQA-val
GPUS=8 sh evaluate.sh <checkpoint> vqa-docvqa-val
# DocVQA-test
GPUS=8 sh evaluate.sh <checkpoint> vqa-docvqa-test

For the test set, submit the results to the evaluation server.

Data Preparation
mkdir -p data/chartqa && cd data/chartqa

# download images from https://drive.google.com/file/d/1Lm_w6zeET1Hyl_9ks6w5nEsgpoyPHalV/view

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_augmented.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_augmented.jsonl

cd ../..
Evaluation
# ChartQA-test-human
GPUS=8 sh evaluate.sh <checkpoint> vqa-chartqa-test-human
# ChartQA-test-augmented
GPUS=8 sh evaluate.sh <checkpoint> vqa-chartqa-test-augmented
Data Preparation
mkdir -p data/gqa && cd data/gqa

# download images
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
unzip images.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/testdev_balanced.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/train_balanced.jsonl
wget https://github.com/OpenGVLab/InternVL/releases/download/data/llava_gqa_testdev_balanced_qwen_format.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> vqa-gqa-testdev
Data Preparation
mkdir -p data/ocrvqa && cd data/ocrvqa

# download images by following instructions at https://ocr-vqa.github.io/kvqa_ProjectFiles/README.txt

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_test.jsonl

cd ../..
Evaluation
# OCRVQA-val
GPUS=8 sh evaluate.sh <checkpoint> vqa-ocrvqa-val
# OCRVQA-test
GPUS=8 sh evaluate.sh <checkpoint> vqa-ocrvqa-test
Data Preparation
mkdir -p data/ai2diagram && cd data/ai2diagram

# download images
wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip
unzip ai2d-all.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/test.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> vqa-ai2d-test
Data Preparation
mkdir -p data/scienceqa/images && cd data/scienceqa/images

# download images
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip && unzip test.zip

cd ..

# download original questions
wget https://raw.githubusercontent.com/lupantech/ScienceQA/main/data/scienceqa/problems.json

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/scienceqa/scienceqa_test_img.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> scienceqa

Refer Expression Comprehension

RefCOCO/RefCOCO+/RefCOCOg

Data Preparation
mkdir -p data/refcoco && cd data/refcoco

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testB.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testB.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_test.jsonl

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> refcoco

MultiModal Dialogue

Data Preparation
mkdir -p data/mme && cd data/mme

# 1. Download MME images and eval_tool from the MME repo:
#    https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/blob/Evaluation/README.md
# 2. Rearrange the images by running get_images.py:
python get_images.py
cd ../..
Evaluation
# single GPU testing
CUDA_VISIBLE_DEVICES=0 sh evaluate.sh <checkpoint> mme
Data Preparation
mkdir -p data/mmbench && cd data/mmbench

# download csv files of mmbench
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_en_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_en_20231003.tsv

cd ../..
Evaluation
# mmbench_dev_20230712
GPUS=8 sh evaluate.sh <checkpoint> mmbench-dev-en
# mmbench_dev_cn_20231003
GPUS=8 sh evaluate.sh <checkpoint> mmbench-dev-cn
# mmbench_test_en_20231003
GPUS=8 sh evaluate.sh <checkpoint> mmbench-test-en
# mmbench_test_cn_20231003
GPUS=8 sh evaluate.sh <checkpoint> mmbench-test-cn

Then, submit the results to the evaluation server.

Data Preparation
mkdir -p data/pope && cd data/pope

# make sure you have downloaded COCO images
ln -s ../coco/val2014 ./
wget https://github.com/OpenGVLab/InternVL/releases/download/data/llava_pope_test.jsonl

# download `coco` from POPE
mkdir -p coco && cd coco
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_adversarial.json
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_popular.json
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_random.json
cd ../../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> pope
Data Preparation

The evaluation code will automatically download the dataset from Hugging Face.
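
If the default Hugging Face cache sits on a small disk, you can point it somewhere else before running the evaluation; the path below is just a placeholder:

export HF_HOME=/path/to/a/larger/disk/huggingface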

Evaluation
# dev set
GPUS=8 sh evaluate.sh <checkpoint> mmmu-dev
# val set
GPUS=8 sh evaluate.sh <checkpoint> mmmu-val
# test set
GPUS=8 sh evaluate.sh <checkpoint> mmmu-test

For the test set, submit the results to the evaluation server.

CMMMU

Data Preparation
mkdir -p data/tiny_lvlm && cd data/tiny_lvlm

# download dataset from https://github.com/OpenGVLab/Multi-Modality-Arena/tree/main/tiny_lvlm_evaluation
# i.e., download `updated_datasets.tar.gz` from https://drive.google.com/file/d/1PuFC612XzOmKwzRldtBb1CFZnIjiR7we/view
tar -xzvf updated_datasets.tar.gz

cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> tiny_lvlm
Data Preparation
cd data/
# download dataset from https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild
git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild
cd llava-bench-in-the-wild/
rm -rf images && mkdir -p images && cd images
# download all 24 images
wget https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild/resolve/main/images/001.jpg
# ...
wget https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild/resolve/main/images/024.jpg
cd ../../../
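
Equivalently, assuming all 24 files follow the NNN.jpg pattern shown above, the per-file downloads can be written as a loop:

cd data/llava-bench-in-the-wild/images
for i in $(seq -f "%03g" 1 24); do
  wget https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild/resolve/main/images/$i.jpg
done
cd ../../../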
Evaluation
# single GPU testing
export OPENAI_API_KEY='your_gpt4_key'
CUDA_VISIBLE_DEVICES=0 sh evaluate.sh <checkpoint> llava-bench
Data Preparation
mkdir -p data/mm-vet && cd data/mm-vet
wget https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip
unzip mm-vet.zip
cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> mmvet
Data Preparation
cd data
git lfs install
git clone https://huggingface.co/datasets/MMVP/MMVP
git clone https://huggingface.co/datasets/MMVP/MMVP_VLM
cd ..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> mmvp
Data Preparation
mkdir -p data/MathVista && cd data/MathVista
# Execute the following python code
# from datasets import load_dataset
# dataset = load_dataset("AI4Math/MathVista")
# dataset.save_to_disk('./MathVista')
wget https://huggingface.co/datasets/AI4Math/MathVista/raw/main/annot_testmini.json
cd ../..
Evaluation
# testmini set
GPUS=8 sh evaluate.sh <checkpoint> mathvista-testmini
# test set
GPUS=8 sh evaluate.sh <checkpoint> mathvista-test
Data Preparation
  1. Follow the official SEED-Bench-1 data preparation instructions to download the images and the videos, and put the images under ./playground/data/eval/seed_bench/SEED-Bench-image.
  2. Extract the middle frame from each downloaded video and put the frames under ./playground/data/eval/seed_bench/SEED-Bench-video-image. We provide the script extract_video_frames.py, modified from the official one; a minimal ffmpeg sketch of this step is shown below.
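The sketch below shows what the frame-extraction step amounts to, using ffmpeg/ffprobe. It is illustrative only: the video extension, output directory, and frame naming are assumptions, so rely on the provided extract_video_frames.py for the layout the evaluation expects.

mkdir -p SEED-Bench-video-image
for f in videos/*.mp4; do
  # probe the duration, seek to the midpoint, and dump a single frame
  dur=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$f")
  mid=$(awk -v d="$dur" 'BEGIN { printf "%.3f", d / 2 }')
  ffmpeg -y -ss "$mid" -i "$f" -frames:v 1 "SEED-Bench-video-image/$(basename "${f%.*}").png"
done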
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> seed