loss becomes nan #49
Comments
I haven't run into this problem myself. You can try clipping your gradients to avoid it.
Thanks very much, I will try that! However, I am running the latest code unmodified; I only changed the batch size and thread count, and I am training on 8 NVIDIA V100 GPUs. What batch size and thread count did you use for training?
Changing the batch size will not cause a NaN loss. I have run into the NaN loss problem before because of the crop operation: if the depth image is invalid (0) everywhere after cropping, the loss becomes NaN. I will try to debug this and avoid it, but it may take some time because it requires 8 NVIDIA V100 GPUs. How many iterations did you train before the loss became NaN? You can try clipping the gradient to avoid it, or wait for my debugging. Thank you!
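A minimal, framework-free sketch of the failure described above, assuming the loss averages an error over pixels whose ground-truth depth is valid (> 0). The names `masked_l1_mean` and `safe_masked_l1_mean` are hypothetical and not taken from the repository, whose actual losses are more involved; the point is only that averaging over an empty set of valid pixels yields NaN unless it is guarded.

```python
import math

def masked_l1_mean(pred, gt):
    """Mean absolute error over pixels with valid (> 0) ground-truth depth.

    Mimics tensor-library semantics, where the mean of an empty
    selection is NaN rather than an exception.
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if g > 0]
    if not errors:
        return float("nan")  # mean over zero elements
    return sum(errors) / len(errors)

def safe_masked_l1_mean(pred, gt):
    """Same loss, but returns None when the crop has no valid depth,
    so the caller can skip the sample instead of propagating NaN."""
    errors = [abs(p - g) for p, g in zip(pred, gt) if g > 0]
    if not errors:
        return None
    return sum(errors) / len(errors)

pred = [1.0, 2.5, 3.0]
crop_with_depth = [0.0, 2.0, 4.0]   # one invalid pixel, two valid
crop_all_invalid = [0.0, 0.0, 0.0]  # every pixel invalid after cropping

print(masked_l1_mean(pred, crop_with_depth))        # finite loss
print(masked_l1_mean(pred, crop_all_invalid))       # nan
print(safe_masked_l1_mean(pred, crop_all_invalid))  # None, so skip this sample
```

In a real training loop the same guard would typically be a check on the number of valid pixels before the loss is computed, or a re-crop of the sample.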
Thanks for your reply! The loss became NaN after about 12000 iterations (the 3rd epoch). I see that the code you released already includes gradient clipping, but it does not seem to help.
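One plausible reason clipping does not help once the loss is already NaN: norm-based clipping rescales gradients by a factor computed from their total norm, so a single NaN gradient makes the norm NaN and the values stay NaN. The sketch below shows this on plain Python lists. `clip_by_norm` and `should_skip_step` are hypothetical helpers written for illustration, not the repository's code, and the skip-on-non-finite-loss check is a suggested workaround rather than the maintainer's fix.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale grads so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total + 1e-6)
    if scale < 1.0:  # comparison is False when scale is NaN
        return [g * scale for g in grads]
    return list(grads)

def should_skip_step(loss_value):
    """True when the loss is NaN or infinite and the update should be skipped."""
    return not math.isfinite(loss_value)

print(clip_by_norm([3.0, 4.0], 1.0))           # rescaled to roughly unit norm
print(clip_by_norm([float("nan"), 4.0], 1.0))  # NaN survives clipping
print(should_skip_step(float("nan")))          # True
```

Checking the loss before the backward pass and skipping the batch keeps one bad crop from corrupting the weights, which clipping alone cannot do.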
lib.utils.logging INFO: [Step 10470/182650] [Epoch 2/50] [multi]
loss: nan, time: 5.862533, eta: 11 days, 16:23:31
meanstd-tanh_auxiloss: nan, meanstd-tanh_loss: nan, msg_normal_loss: nan, pairwise-normal-regress-edge_loss: nan, pairwise-normal-regress-plane_loss: nan, ranking-edge_auxiloss: nan, ranking-edge_loss: nan, abs_rel: 0.211080, whdr: 0.087764,
group0_lr: 0.001000, group1_lr: 0.001000,
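Hello, when I train on the four datasets taskonomy, DiverseDepth, HRWSI, and Holopix50k, the loss becomes NaN. Did you run into this problem during training? If so, how should it be solved? Thank you! Below are the arguments I used: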
--backbone resnext101
--dataset_list taskonomy DiverseDepth HRWSI Holopix50k
--batchsize 16
--base_lr 0.001
--use_tfboard
--thread 8
--loss_mode ranking-edge_pairwise-normal-regress-edge_msgil-normal_meanstd-tanh_pairwise-normal-regress-plane_ranking-edge-auxi_meanstd-tanh-auxi
--epoch 50
--lr_scheduler_multiepochs 10 25 40
--val_step 5000
--snapshot_iters 5000
--log_interval 10 \