[FEATURE] ViT with average pooling like ResNet #1555
Replies: 4 comments 3 replies
-
@CharlesLeeeee there are already some there, but please use discussions for these sorts of questions.
-
Most of these use global pooling, but they use relative position embeddings: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/vision_transformer_relpos.py They don't really train much differently than class-token models, though. Most of the newer weights for these were trained with a Swin-style recipe, which isn't far from DeiT. DeiT should not blow up; it's possible you didn't adapt the LR to your batch size or didn't enable grad clipping. I have hparams somewhere in a cloud bucket but have to find them.
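The head difference under discussion can be sketched in a few lines of NumPy. This is an illustration of the two pooling choices on a hypothetical encoder output, not code from the repo: a class-token head reads the first token, while a ResNet-style global-average-pooling head drops the class token and averages the patch tokens.

```python
import numpy as np

# Hypothetical ViT encoder output: batch of 2, 1 class token + 196 patch
# tokens (14x14 patches at 224px / patch 16), embedding dim 768.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 197, 768))

# Class-token head: classify from the first token only.
cls_feat = tokens[:, 0]                # shape (2, 768)

# Global-average-pooling head: average the patch tokens instead.
gap_feat = tokens[:, 1:].mean(axis=1)  # shape (2, 768)

print(cls_feat.shape, gap_feat.shape)  # (2, 768) (2, 768)
```

Either feature then feeds the same linear classifier; only the pooling differs.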
-
So far I have been able to train a DeiT model with average pooling successfully by increasing warmup-epochs from 5 to 20. However, when using a res-post-norm structure with avg pooling, it keeps getting NaNs. Are you using repeated-aug to get these results? I know that the Swin models don't use repeated-aug.
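The warmup change above amounts to stretching the linear-warmup phase of the usual warmup-then-cosine schedule. A minimal sketch of that schedule shape (all numbers illustrative, not the actual hparams from the thread):

```python
import math

def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=20,
                total_epochs=300, min_lr=1e-5):
    """Linear warmup followed by cosine decay.

    Raising warmup_epochs (e.g. from 5 to 20, as in the thread) keeps the
    LR small for longer early in training, which often helps stability.
    """
    if epoch < warmup_epochs:
        # Linear ramp from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

With warmup_epochs=20 the LR reaches its peak at epoch 19 and decays from epoch 20 onward.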
-
hparams from the rpn b16 attached; it was trained on a TPU v4-8, so it's a 4x256 global batch size, but it's using
-
Is there any attempt at training ViT models with global average pooling, like ResNet models? If yes, what are the exact hyperparameters used to get the best performance when training on ImageNet-1k from scratch?
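For reference, a hedged sketch of how such a run might be launched with the repo's train.py; this is a config fragment, not the hparams asked about, and flag values here are placeholders. Recent timm versions let you switch the head via --model-kwargs (global_pool=avg, class_token=False); check your installed version's train.py --help before relying on any of these flags.

```shell
# Illustrative only -- LR, batch size, and epochs are NOT the tuned hparams.
python train.py /path/to/imagenet \
  --model vit_base_patch16_224 \
  --model-kwargs global_pool=avg class_token=False \
  -b 256 --lr 1e-3 --warmup-epochs 20 --epochs 300 \
  --clip-grad 1.0
```

Grad clipping (--clip-grad) and a longer warmup are the two stability knobs mentioned earlier in the thread.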