
NaN loss halfway through training the model. #5

Open
LT1st opened this issue Mar 7, 2024 · 2 comments

Comments

LT1st commented Mar 7, 2024

I am using your framework for an image translation task.

The loss was fine at the very beginning, but became NaN in epoch 006. Have you ever run into this problem, and is there a known fix?

The log output:

Epoch: 005 - 025
Epoch: [5][0/1250]	Loss: 0.5269870162010193, LR: 0.00020489077162409578
Epoch: [5][800/1250]	Loss: 0.44976134161080017, LR: 0.00022987952270145035
Epoch: [5][900/1250]	Loss: 0.44960858015453115, LR: 0.00023297791072928454
Epoch: [5][1000/1250]	Loss: 0.4489031859657743, LR: 0.00023607105232487043
Epoch: [5][1100/1250]	Loss: 0.45127786015186605, LR: 0.00023915904358565203
Epoch: [5][1200/1250]	Loss: 0.45215647128549447, LR: 0.00024224197730360856

Epoch: 005 - 025
====================================================================================================
        d1         d2         d3    abs_rel     sq_rel       rmse   rmse_log      log10      silog 
    0.2776     0.6083     0.7993     0.8979    77.8871    60.1445     0.6755     0.2110     0.6634 
====================================================================================================

Epoch: 006 - 025
Epoch: [6][0/1250]	Loss: 0.4362526535987854, LR: 0.00024378157571854745
Epoch: [6][100/1250]	Loss: nan, LR: 0.0002468570902802666
Epoch: [6][200/1250]	Loss: nan, LR: 0.0002499277658752283
Epoch: [6][900/1250]	Loss: nan, LR: 0.00027129361225600624
Epoch: [6][1000/1250]	Loss: nan, LR: 0.0002743283372183186
Epoch: [6][1100/1250]	Loss: nan, LR: 0.00027735887967590467
Epoch: [6][1200/1250]	Loss: nan, LR: 0.0002803853021768897

The output then shows:

NaN or Inf found in input tensor.

====================================================================================================
        d1         d2         d3    abs_rel     sq_rel       rmse   rmse_log      log10      silog 
    0.0000     0.0000     0.0000     0.9861   131.8554   146.8084    11.4800     4.9767     8.1322 
====================================================================================================


Epoch: 009 - 025
Epoch: [9][0/1250]      Loss: nan, LR: 0.0003563283532129068
Epoch: [9][1200/1250]   Loss: nan, LR: 0.00039136562872899835
NaN or Inf found in input tensor.
nan
nan
nan
nan
nan

Something seems to be going wrong, but the inputs are correct at the beginning.

LT1st (Author) commented Mar 7, 2024

I've just double-checked the dataset; it looks fine.

    from torch.utils.data import DataLoader

    # `dataset` is the same training dataset object passed to the trainer
    dataloader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2)
    for i, x in enumerate(dataloader):
        print(f'Batch {i}:')
        print(x['image'].shape, x['depth'].shape)
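
A stricter version of this check would also scan for NaN/Inf values in the tensors themselves. This is just a sketch, assuming the same `dataset` object as above and that each sample returns torch tensors under the `image` and `depth` keys:

    import torch
    from torch.utils.data import DataLoader

    dataloader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=2)
    for i, x in enumerate(dataloader):
        for key in ('image', 'depth'):
            # flag any batch that contains NaN or Inf before it ever reaches the model
            if not torch.isfinite(x[key]).all():
                print(f'Batch {i}: non-finite values found in {key}')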

wwqq (Collaborator) commented Mar 11, 2024

Reducing the learning rate or increasing the weight decay might solve this problem, but it could also impact performance.
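
As a concrete sketch of that suggestion (the optimizer type and the values below are illustrative assumptions, not this repo's actual config), the change amounts to lowering the peak learning rate, raising the weight decay, and optionally clipping gradients:

    import torch

    # stand-in model and data, only to make the snippet runnable
    model = torch.nn.Linear(8, 1)
    x, y = torch.randn(4, 8), torch.randn(4, 1)

    # lower learning rate and stronger weight decay than in the failing run
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    # optional: clip gradients so a single bad batch cannot push the weights to NaN
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()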
