Training advice with swin_transformer - initialization with GELU, etc. #3
Comments
Cool! Would be great to keep us posted on whether you could replicate the results!
I tested with both gradient clipping as in the paper and with adaptive gradient clipping.
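For anyone wanting to try the same, standard clipping is one line in PyTorch, and a per-tensor simplification of adaptive gradient clipping (AGC, from Brock et al.'s NFNets paper; the real version clips unit-wise) might look like the sketch below. The `clip=0.01` threshold and `max_norm` value are illustrative, not values from this thread, and `model` stands in for your network:

```python
import torch

# Standard max-norm clipping, called between loss.backward() and optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)

# Simplified per-tensor AGC: rescale each gradient so its norm stays below a
# fixed fraction of the corresponding parameter's norm.
def adaptive_grad_clip(parameters, clip=0.01, eps=1e-3):
    for p in parameters:
        if p.grad is None:
            continue
        p_norm = p.detach().norm().clamp(min=eps)  # floor so near-zero params aren't frozen
        g_norm = p.grad.detach().norm()
        max_norm = clip * p_norm
        if g_norm > max_norm:
            p.grad.detach().mul_(max_norm / (g_norm + 1e-6))
```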
Hi, I want to use the swin transformer to replace a feature pyramid network; what should I do to modify the code?
So madgrad blew away all my previous results: nearly an 18% improvement for the same limited run time (22 epochs).
Exciting! Thanks for keeping us posted ...
Running madgrad with AdamW-style (decoupled) weight decay, using a similar decay value as for AdamW, has given the best results so far (slightly better accuracy and loss vs. no weight decay or minor weight decay).
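A minimal sketch of that setup, assuming facebookresearch's `madgrad` package; the decay value is illustrative, and `model`, `criterion`, `images`, and `labels` stand in for your own training loop:

```python
import torch
from madgrad import MADGRAD  # pip install madgrad

optimizer = MADGRAD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=0.0)
decay = 1e-4  # illustrative; the comment doesn't state the exact value used

# One step with AdamW-style (decoupled) weight decay applied directly to the
# weights, rather than via the optimizer's L2-style weight_decay argument:
loss = criterion(model(images), labels)
loss.backward()
with torch.no_grad():
    for group in optimizer.param_groups:
        for p in group["params"]:
            p.mul_(1.0 - group["lr"] * decay)  # p <- p * (1 - lr * wd), as in AdamW
optimizer.step()
optimizer.zero_grad()
```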
Thanks for sharing the experiments. I also tried a simple classification task but couldn't get it to work; I would appreciate any advice. I noticed the avg-pooling comment, and it seems `x = x.mean(dim=[2, 3])` in the forward path is already doing that, so all I did was a simple training setup with `model_ft = swin_l(hidden_dim=192, layers=(2, 2, 18, 2), heads=(6, 12, 24, 48))` and `nn.CrossEntropyLoss()` (a fuller sketch is below).
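A self-contained version of that setup, assuming this repo's `swin_l` factory accepts a `num_classes` keyword; the class count, optimizer choice, and dummy batch are assumptions for illustration:

```python
import torch
import torch.nn as nn
from swin_transformer_pytorch import swin_l

model_ft = swin_l(hidden_dim=192, layers=(2, 2, 18, 2), heads=(6, 12, 24, 48),
                  num_classes=2)  # num_classes is an assumed example value
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model_ft.parameters(), lr=3e-4, weight_decay=0.05)

images = torch.randn(4, 3, 224, 224)  # dummy batch; 224x224 fits the default window size
labels = torch.randint(0, 2, (4,))

logits = model_ft(images)  # forward() already average-pools via x.mean(dim=[2, 3])
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```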
Hi @yueming-zhang, |
Appreciate your response @lessw2020. I tested AdamW with my custom scheduler (note the very small LR), and there seems to be a negative correlation between LR and accuracy. I briefly tested madgrad, but the result was not ideal.
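The custom scheduler itself isn't shown in the thread, but a generic small-LR AdamW setup of the kind described might look like this; every value here is illustrative, and `model`, `criterion`, and `train_loader` stand in for your own pipeline:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

for epoch in range(30):
    for images, labels in train_loader:
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    scheduler.step()  # decay the (already small) LR once per epoch
```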
@lessw2020 Could you share how you implemented the above discussion in code?
@yueming-zhang I ran into the same situation as above; could you tell me how you solved it, and did you use any pre-trained weights?
Hi jm-R512, |
Thank you @lessw2020. The AdamW optimizer has a very good effect on training; do you have any further suggestions for improvements?
Hi,
I've been setting up swin_transformer but am having a hard time getting it to actually train.
I figured one immediate issue is the lack of init, so I'm using the truncated-normal init setup from rwightman's pytorch-image-models that he used in his ViT implementation, since that also uses GELU (a sketch is below).
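For reference, that init looks roughly like this; `std=0.02` is the ViT default in timm, and applying it to `Linear` and `LayerNorm` modules is an assumption, not something spelled out in this issue:

```python
import torch.nn as nn
from timm.models.layers import trunc_normal_  # rwightman's pytorch-image-models

def init_weights(m):
    # Truncated-normal init for Linear weights, as in timm's ViT implementation.
    if isinstance(m, nn.Linear):
        trunc_normal_(m.weight, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.LayerNorm):
        nn.init.zeros_(m.bias)
        nn.init.ones_(m.weight)

model.apply(init_weights)  # model stands in for the SwinTransformer instance
```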
But regardless, I'm not able to get it to learn at the moment, even after testing out a range of learning rates.
So I'm wondering if anyone has found some starting hyperparams and/or an init method to get it up and training?