Release InternVL-Chat-V1.2 (#45)
* Add chat templates

* Update to llama2 flash attention

* Add zero3 deepspeed config

* Support DeepSpeed zero3

* Fix template bug

* Support internlm2

* Rename V1.1 to V1-1

* Update README.md

* Support device_map='auto'

* Compatible with transformers 4.36.2

* Add Hermes-2 template

* Support MMVP

* Support MathVista

* Clean code

* Update MMMU

* Update select_layer to save GPU memory

* Don't use beam search when model is too large

* Fix wrong calculation of total params

* Update trainer

* Add json2jsonl tool

* Update README.md

* Update

* Add shell scripts

* Rename

* Add shell

* Update

* Update BLOG.md

* Fix bug and support loading pretrained mlp

* Update README.md

* Update README.md

* Update

* Update BLOG.md

* Update README.md

* Update SEED

---------

Co-authored-by: Wenhai Wang <[email protected]>
czczup and whai362 authored Feb 13, 2024
1 parent ac6e5c9 commit 88a146c
Showing 53 changed files with 1,762 additions and 576 deletions.
51 changes: 51 additions & 0 deletions BLOG.md
@@ -0,0 +1,51 @@
# Blog

## InternVL-Chat-V1.2

> Date: 2024/02/12<br>
> Developed by: Zhe Chen, Weiyun Wang, Wenhai Wang

In January 2024, we released [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1), featuring a structure similar to LLaVA, including a ViT, an MLP projector, and an LLM. In that version, we explored increasing the resolution to 448x448, enhancing OCR capabilities, and improving support for Chinese conversations. However, it still lagged behind existing state-of-the-art models on some benchmarks.

<img width="600" alt="image" src="https://github.com/czczup/InternVL-MoE/assets/23737120/9b68aa35-40fd-4e81-9595-d404cbbfc6bd">

Today, we are excited to introduce InternVL-Chat-V1.2. Inspired by [LLaVA-NeXT-34B](https://llava-vl.github.io/blog/2024-01-30-llava-next/), we have also adopted [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the language model.
From the experimental results, **we've observed that a stronger language model (34B) can better leverage the powerful capabilities of our vision foundation model ([InternViT-6B](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)).**

For better training reproducibility, we follow a minimalist design and data-efficient recipe similar to LLaVA-NeXT. To reduce training costs, we provide a pre-trained MLP projector and employ only around 1 million visual instruction tuning samples for SFT. Our model has a total of 40 billion parameters and can be trained within 1.5 days using 32 A100 GPUs. The code, data, and model will be made publicly available.
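
For illustration, below is a minimal PyTorch sketch of this composition (class names, dimensions, and the projector checkpoint path are assumptions for illustration, not the released code): InternViT-6B produces patch tokens, the MLP projector maps them into the LLM embedding space, and the projector can be initialized from the provided pre-trained weights.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Sketch of an MLP projector mapping vision features to the LLM hidden size."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class InternVLChatSketch(nn.Module):
    """Hypothetical composition: vision encoder (InternViT-6B) -> MLP projector -> LLM."""
    def __init__(self, vision_encoder: nn.Module, projector: MLPProjector, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm

    def forward(self, pixel_values: torch.Tensor, input_embeds: torch.Tensor):
        # Encode the 448x448 image into patch tokens, project them into the
        # LLM embedding space, then prepend them to the text embeddings.
        vision_tokens = self.vision_encoder(pixel_values)   # (B, N, vision_dim)
        vision_embeds = self.projector(vision_tokens)       # (B, N, llm_dim)
        inputs = torch.cat([vision_embeds, input_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)

# Initializing the projector from pre-trained weights (hypothetical path):
# projector.load_state_dict(torch.load("mlp_projector_pretrained.bin"))
```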

### Data Preparation

Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1.2, utilizing approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. At a high level, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.

For more details about data preparation, please see [here](./internvl_chat#prepare-training-datasets).
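
As a rough sketch of how such sources can be converted and merged (in the spirit of the json2jsonl tool added in this commit), the snippet below assumes a generic one-record-per-sample JSON schema; the file names and fields are hypothetical, not the actual data format:

```python
import json
from pathlib import Path

def json_to_jsonl(src: Path, dst: Path) -> int:
    """Convert a JSON list of samples into JSON Lines, one record per line."""
    records = json.loads(src.read_text(encoding="utf-8"))
    with dst.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)

def merge_jsonl(sources: list[Path], dst: Path) -> int:
    """Concatenate several JSONL instruction-tuning sets into one SFT mixture."""
    total = 0
    with dst.open("w", encoding="utf-8") as out:
        for src in sources:
            for line in src.open(encoding="utf-8"):
                out.write(line)
                total += 1
    return total

# Hypothetical usage: convert each dataset, then merge into one SFT file.
# json_to_jsonl(Path("sharegpt4v.json"), Path("sharegpt4v.jsonl"))
# merge_jsonl([Path("sharegpt4v.jsonl"), Path("dvqa.jsonl")], Path("sft_mix.jsonl"))
```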

### Performance

\* Proprietary Model

| name | image size | MMMU<br>(val) | MMMU<br>(test) | MathVista<br>(testmini) | MMB<br>(test) | MMB−CN<br>(test) | MMVP | MME | ScienceQA<br>(image) | POPE | TextVQA | SEEDv1<br>(image) | VizWiz<br>(test) | GQA<br>(test) |
| ------------------ | ---------- | ------------- | -------------- | ----------------------- | ------------- | ---------------- | ---- | -------- | -------------------- | ---- | ------- | ----------------- | ---------------- | ------------- |
| GPT-4V\* | unknown | 56.8 | 55.7 | 49.9 | 77.0 | 74.4 | 38.7 | 1409/517 | - | - | 78.0 | 71.6 | - | - |
| Gemini Ultra\* | unknown | 59.4 | - | 53.0 | - | - | - | - | - | - | 82.3 | - | - | - |
| Gemini Pro\* | unknown | 47.9 | - | 45.2 | 73.6 | 74.3 | 40.7 | 1497/437 | - | - | 74.6 | 70.7 | - | - |
| Qwen-VL-Plus\* | unknown | 45.2 | 40.8 | 43.3 | 67.0 | 70.7 | - | 1681/502 | - | - | 78.9 | 65.7 | - | - |
| Qwen-VL-Max\* | unknown | 51.4 | 46.8 | 51.0 | 77.6 | 75.7 | - | - | - | - | 79.5 | - | - | - |
| | | | | | | | | | | | | | | |
| LLaVA-NeXT-34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
| InternVL-Chat-V1.2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1672/509 | 83.3 | 88.0 | 69.7 | 75.6 | 60.0 | 64.0 |

- MMBench results are collected from the [leaderboard](https://mmbench.opencompass.org.cn/leaderboard).
- In most benchmarks, InternVL-Chat-V1.2 achieves better performance than LLaVA-NeXT-34B.

### Training (SFT)

We provide [slurm scripts](./internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.

For more details about training, please see [here](./internvl_chat#start-training).

The hyperparameters used for finetuning are listed in the following table.

| Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
| InternVL-Chat-V1.2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |
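
For reference, here is a hedged sketch of how these hyperparameters might be expressed with Hugging Face `TrainingArguments` together with the DeepSpeed ZeRO-3 setup added in this commit; the per-device batch size / gradient accumulation split used to reach the global batch size of 512, and the config file path, are assumptions rather than the released scripts.

```python
from transformers import TrainingArguments

# Assumed split: 32 GPUs x per_device_train_batch_size 4 x gradient_accumulation_steps 4 = 512.
training_args = TrainingArguments(
    output_dir="work_dirs/internvl_chat_v1_2_sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    weight_decay=0.05,
    num_train_epochs=1,
    bf16=True,
    deepspeed="zero_stage3_config.json",  # hypothetical path to the ZeRO-3 config
)

# Minimal ZeRO-3 config in the same spirit (values other than the stage are assumptions).
# The max sequence length of 2048 is typically enforced in the tokenization/packing step.
zero3_config = {
    "zero_optimization": {"stage": 3, "overlap_comm": True, "contiguous_gradients": True},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```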
10 changes: 6 additions & 4 deletions INSTALLATION.md
@@ -23,9 +23,9 @@
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
```

- Install `flash-attn==0.2.8` :
- Install `flash-attn==0.2.8` or `flash-attn==2.3.6`:

If you want to fully replicate my results, please install `v0.2.8`, otherwise install the latest version.
If you want to fully replicate my results in the paper, please install `v0.2.8`; otherwise, install `v2.3.6`.

This is because different versions of flash attention yield slight differences in results.

@@ -44,10 +44,10 @@
mim install mmcv-full==1.6.2
```

- Install `transformers==4.32.0`:
- Install `transformers==4.36.2`:

```bash
pip install transformers==4.32.0
pip install transformers==4.36.2
```

- Install `apex` (optional):
@@ -66,4 +66,6 @@

```bash
pip install opencv-python termcolor yacs pyyaml scipy
pip install deepspeed==0.10.0
pip install pycocoevalcap tqdm
```
67 changes: 37 additions & 30 deletions README.md
@@ -1,47 +1,39 @@
# <img width="60" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/8529570/5aa4cda8-b453-40a0-9336-17012b430ae8"> InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks —— An Open-Source Alternative to ViT-22B

\[[InternVL-Chat-V1.2 Blog](./BLOG.md)\] \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Quick Start](#quick-start-with-huggingface)\] \[[中文解读](https://mp.weixin.qq.com/s/bdfAJRqOF9tUk8Vy9KC_XQ)\]

## News🚀🚀🚀

- `2024/02/12`: InternVL-Chat-V1.2 has been released, utilizing [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the LLM. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our [blog](BLOG.md) or try our [demo](https://internvl.opengvlab.com/). The model is now available on [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2), and both training/evaluation data and scripts are open-sourced.
- `2024/02/04`: [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) achieves 44.67% on [MMVP](https://github.com/tsb0601/MMVP), higher than GPT-4V!
- `2024/01/27`: We release the 448-resolution model, achieving 76.6 on MMBench dev; see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#-evaluation-chinese-models).
- `2024/01/24`: InternVL-Chat-V1.1 is released; it supports Chinese and has stronger OCR capability. See [here](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) or try our [demo](https://internvl.opengvlab.com/).
- `2024/01/16`: We release our [customized mmcv/mmsegmentation/mmdetection code](https://github.com/OpenGVLab/InternVL-MMDetSeg), integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.

## What is InternVL?

\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Quick Start](#quick-start-with-huggingface)\] \[[中文解读](https://mp.weixin.qq.com/s/bdfAJRqOF9tUk8Vy9KC_XQ)\]

InternVL scales up the ViT to _**6B parameters**_ and aligns it with an LLM.

It is _**the largest open-source vision/vision-language foundation model (14B)**_ to date, achieving _**32 state-of-the-art**_ results on a wide range of tasks such as visual perception, cross-modal retrieval, and multimodal dialogue.

<img width="1204" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/23737120/47878df8-2aec-446e-8a58-00640a2e1327">

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-cross-modal-retrieval-on-coco-2014)](https://paperswithcode.com/sota/zero-shot-cross-modal-retrieval-on-coco-2014?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-image-retrieval-on-coco-cn)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-coco-cn?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-cross-modal-retrieval-on-flickr30k)](https://paperswithcode.com/sota/zero-shot-cross-modal-retrieval-on-flickr30k?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/image-to-text-retrieval-on-flickr30k)](https://paperswithcode.com/sota/image-to-text-retrieval-on-flickr30k?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-image-retrieval-on-flickr30k-cn)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-flickr30k-cn?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/image-retrieval-on-flickr30k-cn)](https://paperswithcode.com/sota/image-retrieval-on-flickr30k-cn?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-image-retrieval-on-xtd10)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-xtd10?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-cn)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-cn?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-8)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-8?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msr-vtt?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-6)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-6?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-5)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-5?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-3)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-3?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-1)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-1?p=internvl-scaling-up-vision-foundation-models)

## Model Zoo

| Model | Date | Download | Note |
| ------------------ | ---------- | ------------------------------------------------------------------------------ | -------------------------------- |
| InternViT-6B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px) | vision foundation model |
| InternVL-14B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px) | vision-language foundation model |
| InternVL-Chat-13B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B) | English multimodal dialogue |
| InternVL-Chat-19B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B) | English multimodal dialogue |
| InternVL-Chat-V1.1 | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | support Chinese and stronger OCR |
| InternViT-6B-448px | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px) | 448 resolution |
**Vision-Language Foundation Model**

| Model | Date | Download | Note |
| ----------------------- | ---------- | ---------------------------------------------------------------------- | -------------------------------- |
| InternViT-6B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px) | vision foundation model |
| InternVL-14B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px) | vision-language foundation model |
| InternViT-6B-448px | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px) | 448 resolution |
| InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) | 448 resolution (🔥new) |

**Vision Large Language Model**

| Model | Date | Download | Note |
| ----------------------- | ---------- | ------------------------------------------------------------------------------------ | -------------------------------- |
| InternVL-Chat-13B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B) | English multimodal dialogue |
| InternVL-Chat-19B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B) | English multimodal dialogue |
| InternVL-Chat-19B-448px | 2024.02.03 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B-448px) | 448 resolution |
| InternVL-Chat-V1.1 | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | support Chinese and stronger OCR |
| InternVL-Chat-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | scaling up LLM to 34B (🔥new) |

## What can InternVL do?

@@ -174,6 +166,22 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
<td>82.7</td>
<td>85.1</td>
</tr>
<tr align=center>
<td align=left>EVA-CLIP-8B</td>
<td>95.6</td>
<td>99.6</td>
<td>99.9</td>
<td>80.8</td>
<td>95.5</td>
<td>97.6</td>
<td>70.3</td>
<td>89.3</td>
<td>93.9</td>
<td>53.0</td>
<td>76.0</td>
<td>83.4</td>
<td>86.2</td>
</tr>
<tr align=center>
<td align=left>InternVL-C (ours)</td>
<td>94.7</td>
@@ -396,7 +404,6 @@ pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

outputs = model(pixel_values)

```

</details>
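
This commit also adds `device_map='auto'` support and pins `transformers==4.36.2`. Below is a minimal, hedged loading sketch in the same spirit as the quick-start excerpt above; the exact model class and chat interface may differ, so please check the model card for the released usage.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL-Chat-Chinese-V1-2"

# device_map='auto' shards the ~40B-parameter model across the available GPUs.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```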
63 changes: 2 additions & 61 deletions classification/README.md
@@ -10,66 +10,7 @@ InternViT-6B follows the structure of vanilla ViT, and its hyperparameters are l

## 🛠️ Installation

> If you have already installed the environment as per the instructions in other folders, you can skip this section.

- Clone this repository:

```bash
git clone https://github.com/OpenGVLab/InternVL.git
cd InternVL/classification
```

- Create a conda virtual environment and activate it:

```bash
conda create -n internvl python=3.9 -y
conda activate internvl
```

- Install `PyTorch>=2.0` and `torchvision>=0.15.2` with `CUDA>=11.6`:

For example, to install `torch==2.0.1` with `CUDA==11.8`:

```bash
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
# or
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
```

- Install `flash-attn==0.2.8` :

If you want to fully replicate my results, please install `v0.2.8`, otherwise install the latest version.

This is because different versions of flash attention yield slight differences in results.

```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v0.2.8
python setup.py install
```

- Install `timm==0.9.12` and `mmcv-full==1.6.2`:

```bash
pip install -U openmim
pip install timm==0.9.12
mim install mmcv-full==1.6.2
```

- Install `apex`:

```bash
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 # https://github.com/NVIDIA/apex/issues/1735
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
```

- Install other requirements:

```bash
pip install opencv-python termcolor yacs pyyaml scipy
```
See [INSTALLATION.md](../INSTALLATION.md)

## 📦 Data Preparation

@@ -150,7 +91,7 @@ pretrained

> Note, please install apex before training (see installation guide above for details).

To train a linear classifier for `InternViT-6b` on ImageNet with 8 GPUs, run:
To train a linear classifier for `InternViT-6B` on ImageNet with 8 GPUs, run:

```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --cfg configs/intern_vit_6b_1k_224.yaml