Parameter-Inverted Image Pyramid Networks (PIIP)

[Paper] [Chinese explainer] [Slides] [Video]

The official implementation of the paper "Parameter-Inverted Image Pyramid Networks"

NeurIPS 2024 Spotlight (Top 2.08%)

📰 News

[2025/1/15] We introduce PIIP-LLaVA, an MLLM that uses the PIIP design to improve performance and reduce computational cost. We also extend PIIP to CNN-based and ViT-CNN hybrid structures. Code and models will be released soon. See our new paper for details.

⭐️ Highlights

TL;DR: We introduce Parameter-Inverted Image Pyramid Networks (PIIP), which follow a parameter-inverted paradigm: models of different parameter sizes process different resolution levels of the image pyramid, with smaller models handling higher resolutions. This reduces computational cost while improving performance.

  • Supports object detection, instance segmentation, semantic segmentation, image classification, and multimodal understanding.
  • Surpasses single-branch and other multi-resolution methods, with higher performance at lower computational cost.
  • Achieves 60.0 $\rm AP^b$ on COCO object detection with InternViT-6B, and 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data.


🖼 Qualitative Results

Figures: detection visualization; multimodal understanding visualization.

📌 Abstract

Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data.
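The compute argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It uses `parameters × tokens` as a crude proxy for backbone cost, which ignores attention cost and the cross-branch interaction modules. The parameter counts (22M / 86M / 304M for ViT-S/B/L) and the patch size of 16 are assumptions made for illustration, not values taken from this repository, and `num_tokens` and `compute_proxy` are hypothetical helper names.

```python
# Rough proxy for backbone compute: parameters x number of tokens.
# Assumptions (not from this repo): round parameter counts for standard
# ViT-S/B/L and a patch size of 16 for every branch.
VIT_PARAMS = {"S": 22e6, "B": 86e6, "L": 304e6}
PATCH_SIZE = 16


def num_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch tokens for a square image of the given side length."""
    side = resolution // patch_size
    return side * side


def compute_proxy(branches) -> float:
    """Sum of params x tokens over (model_size, resolution) branches."""
    return sum(VIT_PARAMS[size] * num_tokens(res) for size, res in branches)


# Single large model on one resolution vs. a parameter-inverted pyramid:
# the smallest model sees the highest resolution, the largest the lowest.
single_branch = [("L", 1024)]
piip_sbl = [("S", 1120), ("B", 672), ("L", 448)]

ratio = compute_proxy(piip_sbl) / compute_proxy(single_branch)
print(f"proxy compute of PIIP-SBL relative to ViT-L: {ratio:.2f}")
```

Under these assumptions the proxy ratio comes out near 0.40. The backbone FLOPs reported in the COCO table for the same pair (727G versus 1542G) give about 0.47, so the proxy only captures the trend, not the exact cost.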

🔍 Method

Architecture

PIIP-LLaVA

🛠️ Usage

For instructions on installation, pretrained models, training and evaluation, please refer to the README file in each subfolder.

🚀 Released Models

COCO Object Detection and Instance Segmentation

Note:

  1. We report the number of parameters and FLOPs of the backbone.
  2. Results in the paper were obtained with an internal codebase, so results from this repo may differ slightly (within $\pm0.2$).
  3. Unlike the experiments in the paper, experiments involving InternViT-6B in this repo do not use window attention.

| Backbone | Detector | Resolution | Schd | Box mAP | Mask mAP | #Param | #FLOPs | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B | Mask R-CNN | 1024 | 1x | 43.7 | 39.7 | 90M | 463G | log / ckpt / cfg |
| PIIP-TSB | Mask R-CNN | 1120/896/448 | 1x | 43.6 | 38.7 | 146M | 243G | log / ckpt / cfg |
| PIIP-TSB | Mask R-CNN | 1568/896/448 | 1x | 45.0 | 40.3 | 147M | 287G | log / ckpt / cfg |
| PIIP-TSB | Mask R-CNN | 1568/1120/672 | 1x | 46.5 | 41.3 | 149M | 453G | log / ckpt / cfg |
| ViT-L | Mask R-CNN | 1024 | 1x | 46.7 | 42.5 | 308M | 1542G | log / ckpt / cfg |
| PIIP-SBL | Mask R-CNN | 1120/672/448 | 1x | 46.5 | 40.8 | 493M | 727G | log / ckpt / cfg |
| PIIP-SBL | Mask R-CNN | 1344/896/448 | 1x | 48.3 | 42.7 | 495M | 1002G | log / ckpt / cfg |
| PIIP-SBL | Mask R-CNN | 1568/896/672 | 1x | 49.3 | 43.7 | 497M | 1464G | log / ckpt / cfg |
| PIIP-TSBL | Mask R-CNN | 1344/896/672/448 | 1x | 47.1 | 41.9 | 506M | 755G | log / ckpt / cfg |
| PIIP-TSBL | Mask R-CNN | 1568/1120/672/448 | 1x | 48.2 | 42.9 | 507M | 861G | log / ckpt / cfg |
| PIIP-TSBL | Mask R-CNN | 1792/1568/1120/448 | 1x | 49.4 | 44.1 | 512M | 1535G | log / ckpt / cfg |
| InternViT-6B | Mask R-CNN | 1024 | 1x | 53.8 | 48.1 | 5919M | 29323G | log / ckpt / cfg |
| PIIP-H6B | Mask R-CNN | 1024/512 | 1x | 55.8 | 49.0 | 6872M | 11080G | log / ckpt / cfg |

| Backbone | Detector | Pretrain | Resolution | Schd | Box mAP | Mask mAP | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PIIP-SBL | Mask R-CNN | AugReg (384) | 1568/1120/672 | 1x | 48.3 | 42.6 | log / ckpt / cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + Uni-Perceiver (BL) | 1568/1120/672 | 1x | 48.8 | 42.9 | log / ckpt / cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + MAE (BL) | 1568/1120/672 | 1x | 49.1 | 43.0 | log / ckpt / cfg |
| PIIP-SBL | Mask R-CNN | DeiT III | 1568/1120/672 | 1x | 50.0 | 44.4 | log / ckpt / cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + DINOv2 (BL) | 1568/1120/672 | 1x | 51.0 | 44.7 | log / ckpt / cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + BEiTv2 (BL) | 1568/1120/672 | 1x | 51.8 | 45.4 | log / ckpt / cfg |
| PIIP-SBL | DINO | DeiT III (384) | 1792/1120/672 | 3x | 57.8 | - | log / ckpt / cfg |
| PIIP-H6B | DINO | MAE (H) + InternVL (6B) | 1024/768 | 1x | 60.0 | - | log / ckpt / cfg |
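
To read the accuracy/compute trade-off off the first COCO table, a small script like the following may help. It copies the ViT-L baseline and the three PIIP-SBL rows (Mask R-CNN, 1x schedule) and prints the change in box mAP next to the fraction of baseline backbone FLOPs. The `Row` type and `compare` helper are hypothetical names introduced only for this sketch.

```python
from typing import NamedTuple


class Row(NamedTuple):
    backbone: str
    resolution: str
    box_map: float
    flops_g: int  # backbone FLOPs in G, as listed in the table


# Values copied from the Mask R-CNN 1x table above.
baseline = Row("ViT-L", "1024", 46.7, 1542)
candidates = [
    Row("PIIP-SBL", "1120/672/448", 46.5, 727),
    Row("PIIP-SBL", "1344/896/448", 48.3, 1002),
    Row("PIIP-SBL", "1568/896/672", 49.3, 1464),
]


def compare(row: Row, base: Row):
    """Return (box mAP difference, fraction of baseline FLOPs)."""
    return row.box_map - base.box_map, row.flops_g / base.flops_g


for row in candidates:
    delta, frac = compare(row, baseline)
    print(f"{row.backbone} {row.resolution}: "
          f"{delta:+.1f} box mAP at {frac:.0%} of baseline FLOPs")
```

For example, PIIP-SBL at 1344/896/448 gains 1.6 box mAP over ViT-L while using roughly 65% of its backbone FLOPs.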

ADE20K Semantic Segmentation

| Backbone | Segmentor | Resolution | Schd | mIoU | #Param | #FLOPs | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InternViT-6B | UperNet | 512 | 80k | 58.42 | 5910M | 6364G | log / ckpt / cfg |
| PIIP-H6B | UperNet | 512/192 | 80k | 57.81 | 6745M | 1663G | log / ckpt / cfg |
| PIIP-H6B | UperNet | 512/256 | 80k | 58.35 | 6745M | 2354G | log / ckpt / cfg |
| PIIP-H6B | UperNet | 512/384 | 80k | 59.32 | 6746M | 4374G | log / ckpt / cfg |
| PIIP-H6B | UperNet | 512/512 | 80k | 59.85 | 6747M | 7308G | log / ckpt / cfg |

ImageNet-1K Image Classification

| Model | Resolution | #Param | #FLOPs | Top-1 Acc | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| PIIP-TSB | 368/192/128 | 144M | 17.4G | 82.1 | config | log / ckpt |
| PIIP-SBL | 320/160/96 | 489M | 39.0G | 85.2 | config | log / ckpt |
| PIIP-SBL | 384/192/128 | 489M | 61.2G | 85.9 | config | log / ckpt |
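
Across the tables above, the model name and the Resolution column appear to pair up positionally: the letters after `PIIP-` list the branches from smallest to largest model, and the resolutions are listed from highest to lowest. The sketch below encodes that reading. It is an assumption inferred from the tables and the abstract, not a documented convention, and `parse_branches` is a hypothetical helper rather than part of this codebase.

```python
import re


def parse_branches(backbone: str, resolutions: str):
    """Pair each branch of a 'PIIP-XYZ' name with its input resolution.

    Assumes the suffix lists branches in the same order as the
    slash-separated resolutions (an inferred convention).
    """
    suffix = backbone.split("-", 1)[1]          # e.g. "TSB" or "H6B"
    sizes = re.findall(r"6B|[TSBLH]", suffix)   # try "6B" before a lone "B"
    res = [int(r) for r in resolutions.split("/")]
    if len(sizes) != len(res):
        raise ValueError(f"{backbone!r} does not match {resolutions!r}")
    return list(zip(sizes, res))


print(parse_branches("PIIP-TSB", "368/192/128"))
print(parse_branches("PIIP-H6B", "1024/512"))
```

Read this way, the smallest branch always receives the highest-resolution input, which matches the parameter-inverted design described in the abstract.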

Multimodal Understanding

Will be released soon

📅 Schedule

  • detection code
  • classification code
  • segmentation code
  • multimodal understanding code

🖊️ Citation

If you find this work helpful for your research, please consider giving this repo a star ⭐ and citing our paper:

@article{piip,
  title={Parameter-Inverted Image Pyramid Networks},
  author={Zhu, Xizhou and Yang, Xue and Wang, Zhaokai and Li, Hao and Dou, Wenhan and Ge, Junqi and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2406.04330},
  year={2024}
}

@article{piip_v2,
  title={Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding},
  author={Wang, Zhaokai and Zhu, Xizhou and Yang, Xue and Luo, Gen and Li, Hao and Tian, Changyao and Dou, Wenhan and Ge, Junqi and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2501.07783},
  year={2025}
}

📃 License

This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.

🙏 Acknowledgements

Our code is built with reference to the code of the following projects: InternVL-MMDetSeg, ViT-Adapter, DeiT, MMDetection, MMSegmentation, and timm. Thanks for their awesome work!