Release InternVL-Chat-V1.2 (#45)
* Add chat templates

* Update to llama2 flash attention

* Add zero3 deepspeed config

* Support DeepSpeed zero3

* Fix template bug

* Support internlm2

* Rename V1.1 to V1-1

* Update README.md

* Support device_map='auto'

* Compatible with transformers 4.36.2

* Add Hermes-2 template

* Support MMVP

* Support MathVista

* Clean code

* Update MMMU

* Update select_layer to save GPU memory

* Don't use beam search when model is too large

* Fix wrong calculation of total params

* Update trainer

* Add json2jsonl tool

* Update README.md

* Update

* Add shell scripts

* Rename

* Add shell

* Update

* Update BLOG.md

* Fix bug and support loading pretrained mlp

* Update README.md

* Update README.md

* Update

* Update BLOG.md

* Update README.md

* Update SEED

---------

Co-authored-by: Wenhai Wang <[email protected]>
czczup and whai362 authored Feb 13, 2024
1 parent ac6e5c9 commit 88a146c
Showing 53 changed files with 1,762 additions and 576 deletions.
51 changes: 51 additions & 0 deletions BLOG.md
@@ -0,0 +1,51 @@
# Blog

## InternVL-Chat-V1.2

> Date: 2024/02/12<br>
> Developed by: Zhe Chen, Weiyun Wang, Wenhai Wang

In January 2024, we released [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1), featuring a structure similar to LLaVA, including a ViT, an MLP projector, and an LLM. In that version, we explored increasing the resolution to 448x448, enhancing OCR capabilities, and improving support for Chinese conversations. However, it still lagged behind existing state-of-the-art models on some benchmarks.

<img width="600" alt="image" src="https://github.com/czczup/InternVL-MoE/assets/23737120/9b68aa35-40fd-4e81-9595-d404cbbfc6bd">

Today, we are excited to introduce InternVL-Chat-V1.2. Inspired by [LLaVA-NeXT-34B](https://llava-vl.github.io/blog/2024-01-30-llava-next/), we have also adopted [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the language model.
From the experimental results, **we've observed that a stronger language model (34B) can better leverage the powerful capabilities of our vision foundation model ([InternViT-6B](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)).**

For better training reproducibility, we follow a minimalist design and data-efficient recipe similar to LLaVA-NeXT. To reduce training costs, we provide a pre-trained MLP projector and employ only around 1 million visual instruction tuning samples for SFT. Our model has a total of 40 billion parameters and can be trained within 1.5 days using 32 A100 GPUs. The code, data, and model will be made publicly available.
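
For illustration, below is a minimal PyTorch sketch of this composition (class names, dimensions, and the projector checkpoint path are assumptions for illustration, not the released code): InternViT-6B produces patch tokens, the MLP projector maps them into the LLM embedding space, and the projector can be initialized from the provided pre-trained weights.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Sketch of an MLP projector mapping vision features to the LLM hidden size."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class InternVLChatSketch(nn.Module):
    """Hypothetical composition: vision encoder (InternViT-6B) -> MLP projector -> LLM."""
    def __init__(self, vision_encoder: nn.Module, projector: MLPProjector, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm

    def forward(self, pixel_values: torch.Tensor, input_embeds: torch.Tensor):
        # Encode the 448x448 image into patch tokens, project them into the
        # LLM embedding space, then prepend them to the text embeddings.
        vision_tokens = self.vision_encoder(pixel_values)   # (B, N, vision_dim)
        vision_embeds = self.projector(vision_tokens)       # (B, N, llm_dim)
        inputs = torch.cat([vision_embeds, input_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)

# Initializing the projector from pre-trained weights (hypothetical path):
# projector.load_state_dict(torch.load("mlp_projector_pretrained.bin"))
```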

### Data Preparation

Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1.2, utilizing approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. At a high level, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.

For more details about data preparation, please see [here](./internvl_chat#prepare-training-datasets).
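
As a rough sketch of how such sources can be converted and merged (in the spirit of the json2jsonl tool added in this commit), the snippet below assumes a generic one-record-per-sample JSON schema; the file names and fields are hypothetical, not the actual data format:

```python
import json
from pathlib import Path

def json_to_jsonl(src: Path, dst: Path) -> int:
    """Convert a JSON list of samples into JSON Lines, one record per line."""
    records = json.loads(src.read_text(encoding="utf-8"))
    with dst.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)

def merge_jsonl(sources: list[Path], dst: Path) -> int:
    """Concatenate several JSONL instruction-tuning sets into one SFT mixture."""
    total = 0
    with dst.open("w", encoding="utf-8") as out:
        for src in sources:
            for line in src.open(encoding="utf-8"):
                out.write(line)
                total += 1
    return total

# Hypothetical usage: convert each dataset, then merge into one SFT file.
# json_to_jsonl(Path("sharegpt4v.json"), Path("sharegpt4v.jsonl"))
# merge_jsonl([Path("sharegpt4v.jsonl"), Path("dvqa.jsonl")], Path("sft_mix.jsonl"))
```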

### Performance

\* Proprietary Model

| name | image size | MMMU<br>(val) | MMMU<br>(test) | MathVista<br>(testmini) | MMB<br>(test) | MMB−CN<br>(test) | MMVP | MME | ScienceQA<br>(image) | POPE | TextVQA | SEEDv1<br>(image) | VizWiz<br>(test) | GQA<br>(test) |
| ------------------ | ---------- | ------------- | -------------- | ----------------------- | ------------- | ---------------- | ---- | -------- | -------------------- | ---- | ------- | ----------------- | ---------------- | ------------- |
| GPT-4V\* | unknown | 56.8 | 55.7 | 49.9 | 77.0 | 74.4 | 38.7 | 1409/517 | - | - | 78.0 | 71.6 | - | - |
| Gemini Ultra\* | unknown | 59.4 | - | 53.0 | - | - | - | - | - | - | 82.3 | - | - | - |
| Gemini Pro\* | unknown | 47.9 | - | 45.2 | 73.6 | 74.3 | 40.7 | 1497/437 | - | - | 74.6 | 70.7 | - | - |
| Qwen-VL-Plus\* | unknown | 45.2 | 40.8 | 43.3 | 67.0 | 70.7 | - | 1681/502 | - | - | 78.9 | 65.7 | - | - |
| Qwen-VL-Max\* | unknown | 51.4 | 46.8 | 51.0 | 77.6 | 75.7 | - | - | - | - | 79.5 | - | - | - |
| | | | | | | | | | | | | | | |
| LLaVA-NeXT-34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
| InternVL-Chat-V1.2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1672/509 | 83.3 | 88.0 | 69.7 | 75.6 | 60.0 | 64.0 |

- MMBench results are collected from the [leaderboard](https://mmbench.opencompass.org.cn/leaderboard).
- In most benchmarks, InternVL-Chat-V1.2 achieves better performance than LLaVA-NeXT-34B.

### Training (SFT)

We provide [slurm scripts](./internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.

For more details about training, please see [here](./internvl_chat#start-training).

The hyperparameters used for finetuning are listed in the following table.

| Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
| InternVL-Chat-V1.2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |
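
For reference, here is a hedged sketch of how these hyperparameters might be expressed with Hugging Face `TrainingArguments` together with the DeepSpeed ZeRO-3 setup added in this commit; the per-device batch size / gradient accumulation split used to reach the global batch size of 512, and the config file path, are assumptions rather than the released scripts.

```python
from transformers import TrainingArguments

# Assumed split: 32 GPUs x per_device_train_batch_size 4 x gradient_accumulation_steps 4 = 512.
training_args = TrainingArguments(
    output_dir="work_dirs/internvl_chat_v1_2_sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    weight_decay=0.05,
    num_train_epochs=1,
    bf16=True,
    deepspeed="zero_stage3_config.json",  # hypothetical path to the ZeRO-3 config
)

# Minimal ZeRO-3 config in the same spirit (values other than the stage are assumptions).
# The max sequence length of 2048 is typically enforced in the tokenization/packing step.
zero3_config = {
    "zero_optimization": {"stage": 3, "overlap_comm": True, "contiguous_gradients": True},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```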
10 changes: 6 additions & 4 deletions INSTALLATION.md
@@ -23,9 +23,9 @@
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
```

- Install `flash-attn==0.2.8` :
- Install `flash-attn==0.2.8` or `flash-attn==2.3.6`:

If you want to fully replicate my results, please install `v0.2.8`, otherwise install the latest version.
If you want to fully replicate my results in the paper, please install `v0.2.8`; otherwise, install `v2.3.6`.

This is because different versions of flash attention yield slight differences in results.

@@ -44,10 +44,10 @@
mim install mmcv-full==1.6.2
```

- Install `transformers==4.32.0`:
- Install `transformers==4.36.2`:

```bash
pip install transformers==4.32.0
pip install transformers==4.36.2
```

- Install `apex` (optional):
@@ -66,4 +66,6 @@

```bash
pip install opencv-python termcolor yacs pyyaml scipy
pip install deepspeed==0.10.0
pip install pycocoevalcap tqdm
```
67 changes: 37 additions & 30 deletions README.md
@@ -1,47 +1,39 @@
# <img width="60" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/8529570/5aa4cda8-b453-40a0-9336-17012b430ae8"> InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks —— An Open-Source Alternative to ViT-22B

\[[InternVL-Chat-V1.2 Blog](./BLOG.md)\] \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Quick Start](#quick-start-with-huggingface)\] \[[中文解读](https://mp.weixin.qq.com/s/bdfAJRqOF9tUk8Vy9KC_XQ)\]

## News🚀🚀🚀

- `2024/02/12`: InternVL-Chat-V1.2 has been released, utilizing [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the LLM. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our [blog](BLOG.md) or try our [demo](https://internvl.opengvlab.com/). The model is now available on [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2), and both training/evaluation data and scripts are open-sourced.
- `2024/02/04`: [InternVL-Chat-V1.1](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) achieves 44.67% on [MMVP](https://github.com/tsb0601/MMVP), higher than GPT-4V!
- `2024/01/27`: We release the 448-resolution model, achieving 76.6 on MMBench dev; see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#-evaluation-chinese-models).
- `2024/01/24`: InternVL-Chat-V1.1 is released; it supports Chinese and has stronger OCR capability. See [here](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) or try our [demo](https://internvl.opengvlab.com/).
- `2024/01/16`: We release our [customized mmcv/mmsegmentation/mmdetection code](https://github.com/OpenGVLab/InternVL-MMDetSeg), integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.

## What is InternVL?

\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Quick Start](#quick-start-with-huggingface)\] \[[中文解读](https://mp.weixin.qq.com/s/bdfAJRqOF9tUk8Vy9KC_XQ)\]

InternVL scales up the ViT to _**6B parameters**_ and aligns it with an LLM.

It is _**the largest open-source vision/vision-language foundation model (14B)**_ to date, achieving _**32 state-of-the-art**_ results on a wide range of tasks such as visual perception, cross-modal retrieval, and multimodal dialogue.

<img width="1204" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/23737120/47878df8-2aec-446e-8a58-00640a2e1327">

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-cross-modal-retrieval-on-coco-2014)](https://paperswithcode.com/sota/zero-shot-cross-modal-retrieval-on-coco-2014?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-image-retrieval-on-coco-cn)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-coco-cn?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-cross-modal-retrieval-on-flickr30k)](https://paperswithcode.com/sota/zero-shot-cross-modal-retrieval-on-flickr30k?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/image-to-text-retrieval-on-flickr30k)](https://paperswithcode.com/sota/image-to-text-retrieval-on-flickr30k?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-image-retrieval-on-flickr30k-cn)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-flickr30k-cn?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/image-retrieval-on-flickr30k-cn)](https://paperswithcode.com/sota/image-retrieval-on-flickr30k-cn?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-image-retrieval-on-xtd10)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-xtd10?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-cn)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-cn?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-8)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-8?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msr-vtt?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-6)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-6?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-5)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-5?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-3)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-3?p=internvl-scaling-up-vision-foundation-models)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-1)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-1?p=internvl-scaling-up-vision-foundation-models)

## Model Zoo

| Model | Date | Download | Note |
| ------------------ | ---------- | ------------------------------------------------------------------------------ | -------------------------------- |
| InternViT-6B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px) | vision foundation model |
| InternVL-14B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px) | vision-language foundation model |
| InternVL-Chat-13B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B) | English multimodal dialogue |
| InternVL-Chat-19B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B) | English multimodal dialogue |
| InternVL-Chat-V1.1 | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | support Chinese and stronger OCR |
| InternViT-6B-448px | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px) | 448 resolution |
**Vision-Language Foundation Model**

| Model | Date | Download | Note |
| ----------------------- | ---------- | ---------------------------------------------------------------------- | -------------------------------- |
| InternViT-6B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px) | vision foundation model |
| InternVL-14B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px) | vision-language foundation model |
| InternViT-6B-448px | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px) | 448 resolution |
| InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) | 448 resolution (🔥new) |

**Vision Large Language Model**

| Model | Date | Download | Note |
| ----------------------- | ---------- | ------------------------------------------------------------------------------------ | -------------------------------- |
| InternVL-Chat-13B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B) | English multimodal dialogue |
| InternVL-Chat-19B | 2023.12.25 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B) | English multimodal dialogue |
| InternVL-Chat-19B-448px | 2024.02.03 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B-448px) | 448 resolution |
| InternVL-Chat-V1.1 | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | support Chinese and stronger OCR |
| InternVL-Chat-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | scaling up LLM to 34B (🔥new) |

## What can InternVL do?

@@ -174,6 +166,22 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
<td>82.7</td>
<td>85.1</td>
</tr>
<tr align=center>
<td align=left>EVA-CLIP-8B</td>
<td>95.6</td>
<td>99.6</td>
<td>99.9</td>
<td>80.8</td>
<td>95.5</td>
<td>97.6</td>
<td>70.3</td>
<td>89.3</td>
<td>93.9</td>
<td>53.0</td>
<td>76.0</td>
<td>83.4</td>
<td>86.2</td>
</tr>
<tr align=center>
<td align=left>InternVL-C (ours)</td>
<td>94.7</td>
@@ -396,7 +404,6 @@ pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

outputs = model(pixel_values)

```

</details>
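
This commit also adds `device_map='auto'` support and pins `transformers==4.36.2`. Below is a minimal, hedged loading sketch in the same spirit as the quick-start excerpt above; the exact model class and chat interface may differ, so please check the model card for the released usage.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL-Chat-Chinese-V1-2"

# device_map='auto' shards the ~40B-parameter model across the available GPUs.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```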
63 changes: 2 additions & 61 deletions classification/README.md
@@ -10,66 +10,7 @@ InternViT-6B follows the structure of vanilla ViT, and its hyperparameters are l

## 🛠️ Installation

> If you have already installed the environment as per the instructions in other folders, you can skip this section.

- Clone this repository:

```bash
git clone https://github.com/OpenGVLab/InternVL.git
cd InternVL/classification
```

- Create a conda virtual environment and activate it:

```bash
conda create -n internvl python=3.9 -y
conda activate internvl
```

- Install `PyTorch>=2.0` and `torchvision>=0.15.2` with `CUDA>=11.6`:

For example, to install `torch==2.0.1` with `CUDA==11.8`:

```bash
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
# or
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
```

- Install `flash-attn==0.2.8` :

If you want to fully replicate my results, please install `v0.2.8`, otherwise install the latest version.

This is because different versions of flash attention yield slight differences in results.

```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v0.2.8
python setup.py install
```

- Install `timm==0.9.12` and `mmcv-full==1.6.2`:

```bash
pip install -U openmim
pip install timm==0.9.12
mim install mmcv-full==1.6.2
```

- Install `apex`:

```bash
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 # https://github.com/NVIDIA/apex/issues/1735
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
```

- Install other requirements:

```bash
pip install opencv-python termcolor yacs pyyaml scipy
```
See [INSTALLATION.md](../INSTALLATION.md)

## 📦 Data Preparation

@@ -150,7 +91,7 @@ pretrained

> Note, please install apex before training (see installation guide above for details).

To train a linear classifier for `InternViT-6b` on ImageNet with 8 GPUs, run:
To train a linear classifier for `InternViT-6B` on ImageNet with 8 GPUs, run:

```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --cfg configs/intern_vit_6b_1k_224.yaml