[Paper] [Chinese Explanation] [Slides] [Video]
The official implementation of the paper "Parameter-Inverted Image Pyramid Networks"
NeurIPS 2024 Spotlight (Top 2.08%)
[2025/1/15] We introduce PIIP-LLaVA, an MLLM that uses the PIIP design to improve performance and save computational cost. We also extend PIIP to CNN-based structures and ViT-CNN hybrid structures. Code and models will be released soon. Check out our new paper for details.
TL;DR: We introduce Parameter-Inverted Image Pyramid Networks (PIIP), which employ a parameter-inverted paradigm: models with different parameter sizes process different resolution levels of the image pyramid, saving computational cost while improving performance. A minimal code sketch of this idea follows the highlights below.
- Supports object detection, instance segmentation, semantic segmentation, image classification, and multimodal understanding.
- Surpasses single-branch and other multi-resolution approaches with higher performance and lower computational cost.
- Achieves 60.0 box AP ($\rm AP^b$) on COCO object detection with InternViT-6B, and 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data.
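As a rough illustration of the parameter-inverted paradigm, here is a minimal PyTorch-style sketch: the smallest branch sees the highest-resolution image and the largest branch sees the lowest-resolution one. The class name and the plain resize-and-forward loop are hypothetical simplifications; the actual PIIP branches are pretrained ViTs/CNNs connected by cross-branch interactions (sketched after the abstract below).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParameterInvertedPyramid(nn.Module):
    """Toy sketch of the parameter-inverted paradigm: the *smallest*
    branch processes the *highest*-resolution image, and vice versa."""

    def __init__(self, branches, resolutions):
        # branches: backbones ordered from smallest to largest capacity
        # resolutions: input sizes ordered from highest to lowest
        super().__init__()
        assert len(branches) == len(resolutions)
        self.branches = nn.ModuleList(branches)
        self.resolutions = resolutions

    def forward(self, image):
        # image: (B, 3, H, W); each branch gets its own resized copy
        feats = []
        for branch, res in zip(self.branches, self.resolutions):
            x = F.interpolate(image, size=(res, res),
                              mode="bilinear", align_corners=False)
            feats.append(branch(x))
        return feats  # multi-scale features for downstream fusion

# e.g. a PIIP-TSB-like configuration (toy backbones, smallest first):
# pyramid = ParameterInvertedPyramid([vit_tiny, vit_small, vit_base],
#                                    resolutions=[1568, 896, 448])
```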
Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data.
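The cross-branch feature interaction can be pictured as tokens of one branch attending to tokens of a neighboring branch. The sketch below uses plain multi-head cross-attention purely as a stand-in to show the data flow, not the paper's exact operator; `CrossBranchInteraction`, `dim_q`, and `dim_kv` are illustrative names, not identifiers from this repo.

```python
import torch
import torch.nn as nn

class CrossBranchInteraction(nn.Module):
    """Stand-in for a cross-branch interaction unit: tokens of one branch
    attend to tokens of a neighboring branch and are updated residually."""

    def __init__(self, dim_q, dim_kv, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(dim_kv, dim_q)  # align the two channel widths
        self.attn = nn.MultiheadAttention(dim_q, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim_q)

    def forward(self, x_q, x_kv):
        # x_q:  (B, N_q, dim_q)   tokens of the receiving branch
        # x_kv: (B, N_kv, dim_kv) tokens of the neighboring branch
        kv = self.proj(x_kv)
        out, _ = self.attn(self.norm(x_q), kv, kv, need_weights=False)
        return x_q + out  # residual update of the receiving branch
```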
For instructions on installation, pretrained models, training, and evaluation, please refer to the README files under each subfolder.
Note:
- We report the number of parameters and FLOPs of the backbone (one way to reproduce such counts is sketched after these notes).
- Results in the paper were obtained with an internal codebase, which may exhibit slightly different performance from this repo ($\leq\pm0.2$).
- Experiments involving InternViT-6B do not use window attention, different from those in the paper.
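For reference, backbone-only parameter and FLOP counts like those in the tables can be obtained with, e.g., fvcore. This is an assumption about tooling (the repo may count them differently), and the timm backbone below is only a stand-in:

```python
import timm
import torch
from fvcore.nn import FlopCountAnalysis, parameter_count

# Stand-in backbone; substitute the actual PIIP backbone module.
backbone = timm.create_model("vit_base_patch16_224", num_classes=0)
inputs = torch.randn(1, 3, 224, 224)

flops = FlopCountAnalysis(backbone, inputs)  # backbone only, no detector head
print(f"#Param: {parameter_count(backbone)[''] / 1e6:.0f}M")
print(f"#FLOPs: {flops.total() / 1e9:.1f}G")
```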
Object detection and instance segmentation on COCO:

| Backbone | Detector | Resolution | Schd | Box mAP | Mask mAP | #Param | #FLOPs | Download |
|---|---|---|---|---|---|---|---|---|
| ViT-B | Mask R-CNN | 1024 | 1x | 43.7 | 39.7 | 90M | 463G | log \| ckpt \| cfg |
| PIIP-TSB | Mask R-CNN | 1120/896/448 | 1x | 43.6 | 38.7 | 146M | 243G | log \| ckpt \| cfg |
| PIIP-TSB | Mask R-CNN | 1568/896/448 | 1x | 45.0 | 40.3 | 147M | 287G | log \| ckpt \| cfg |
| PIIP-TSB | Mask R-CNN | 1568/1120/672 | 1x | 46.5 | 41.3 | 149M | 453G | log \| ckpt \| cfg |
| ViT-L | Mask R-CNN | 1024 | 1x | 46.7 | 42.5 | 308M | 1542G | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | 1120/672/448 | 1x | 46.5 | 40.8 | 493M | 727G | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | 1344/896/448 | 1x | 48.3 | 42.7 | 495M | 1002G | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | 1568/896/672 | 1x | 49.3 | 43.7 | 497M | 1464G | log \| ckpt \| cfg |
| PIIP-TSBL | Mask R-CNN | 1344/896/672/448 | 1x | 47.1 | 41.9 | 506M | 755G | log \| ckpt \| cfg |
| PIIP-TSBL | Mask R-CNN | 1568/1120/672/448 | 1x | 48.2 | 42.9 | 507M | 861G | log \| ckpt \| cfg |
| PIIP-TSBL | Mask R-CNN | 1792/1568/1120/448 | 1x | 49.4 | 44.1 | 512M | 1535G | log \| ckpt \| cfg |
| InternViT-6B | Mask R-CNN | 1024 | 1x | 53.8 | 48.1 | 5919M | 29323G | log \| ckpt \| cfg |
| PIIP-H6B | Mask R-CNN | 1024/512 | 1x | 55.8 | 49.0 | 6872M | 11080G | log \| ckpt \| cfg |
Object detection on COCO with different pretrained models:

| Backbone | Detector | Pretrain | Resolution | Schd | Box mAP | Mask mAP | Download |
|---|---|---|---|---|---|---|---|
| PIIP-SBL | Mask R-CNN | AugReg (384) | 1568/1120/672 | 1x | 48.3 | 42.6 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + Uni-Perceiver (BL) | 1568/1120/672 | 1x | 48.8 | 42.9 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + MAE (BL) | 1568/1120/672 | 1x | 49.1 | 43.0 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III | 1568/1120/672 | 1x | 50.0 | 44.4 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + DINOv2 (BL) | 1568/1120/672 | 1x | 51.0 | 44.7 | log \| ckpt \| cfg |
| PIIP-SBL | Mask R-CNN | DeiT III (S) + BEiTv2 (BL) | 1568/1120/672 | 1x | 51.8 | 45.4 | log \| ckpt \| cfg |
| PIIP-SBL | DINO | DeiT III (384) | 1792/1120/672 | 3x | 57.8 | - | log \| ckpt \| cfg |
| PIIP-H6B | DINO | MAE (H) + InternVL (6B) | 1024/768 | 1x | 60.0 | - | log \| ckpt \| cfg |
Semantic segmentation on ADE20K:

| Backbone | Segmentor | Resolution | Schd | mIoU | #Param | #FLOPs | Download |
|---|---|---|---|---|---|---|---|
| InternViT-6B | UperNet | 512 | 80k | 58.42 | 5910M | 6364G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/192 | 80k | 57.81 | 6745M | 1663G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/256 | 80k | 58.35 | 6745M | 2354G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/384 | 80k | 59.32 | 6746M | 4374G | log \| ckpt \| cfg |
| PIIP-H6B | UperNet | 512/512 | 80k | 59.85 | 6747M | 7308G | log \| ckpt \| cfg |
Image classification on ImageNet-1K:

| Model | Resolution | #Param | #FLOPs | Top-1 Acc | Config | Download |
|---|---|---|---|---|---|---|
| PIIP-TSB | 368/192/128 | 144M | 17.4G | 82.1 | config | log \| ckpt |
| PIIP-SBL | 320/160/96 | 489M | 39.0G | 85.2 | config | log \| ckpt |
| PIIP-SBL | 384/192/128 | 489M | 61.2G | 85.9 | config | log \| ckpt |
Will be released soon:
- Detection code
- Classification code
- Segmentation code
- Multimodal understanding code
If you find this work helpful for your research, please consider giving this repo a star ⭐ and citing our paper:
@article{piip,
  title={Parameter-Inverted Image Pyramid Networks},
  author={Zhu, Xizhou and Yang, Xue and Wang, Zhaokai and Li, Hao and Dou, Wenhan and Ge, Junqi and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2406.04330},
  year={2024}
}

@article{piip_v2,
  title={Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding},
  author={Wang, Zhaokai and Zhu, Xizhou and Yang, Xue and Luo, Gen and Li, Hao and Tian, Changyao and Dou, Wenhan and Ge, Junqi and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2501.07783},
  year={2025}
}
This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.
Our code is built with reference to the code of the following projects: InternVL-MMDetSeg, ViT-Adapter, DeiT, MMDetection, MMSegmentation, and timm. Thanks for their awesome work!