
training on own data, and RMSE is nan #25

Open
mlcoop opened this issue Dec 5, 2018 · 2 comments

mlcoop commented Dec 5, 2018

Hey @okuchaiev,

I have been trying to train on my own data.
The dataset consists of 539278 user_ids and 1551731 items, so the data is extremely sparse.
While training, my RMSE is nan. Should I take the absolute value of the MSE loss?

I have PyTorch 0.4 and CUDA 9.0, training on a GTX 1080 Ti.

```
Using GPUs: [0]
Doing epoch 0 of 12
run.py:198: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  t_loss += loss.data[0]
[0, 0] RMSE: 8.0848995
run.py:212: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  total_epoch_loss += loss.data[0]
[0, 1000] RMSE: nan
[0, 2000] RMSE: nan
[0, 3000] RMSE: nan
[0, 4000] RMSE: nan
[0, 5000] RMSE: nan
[0, 6000] RMSE: nan
[0, 7000] RMSE: nan
[0, 8000] RMSE: nan
Total epoch 0 finished in 1966.838391304016 seconds with TRAINING RMSE loss: nan
run.py:74: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  total_epoch_loss += loss.data[0]
run.py:75: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  denom += num_ratings.data[0]
Epoch 0 EVALUATION LOSS: nan
Saving model to model_save/model.epoch_0
Doing epoch 1 of 12
[1, 0] RMSE: nan
[1, 1000] RMSE: nan
[1, 2000] RMSE: nan
[1, 3000] RMSE: nan
[1, 4000] RMSE: nan
[1, 5000] RMSE: nan
[1, 6000] RMSE: nan
[1, 7000] RMSE: nan
[1, 8000] RMSE: nan
```
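As an aside, the UserWarning in the log comes from indexing a 0-dim tensor with `loss.data[0]`, which PyTorch 0.4 deprecates in favor of `.item()`. A minimal sketch of the fix (the tensors here are stand-ins; the real accumulators live in run.py):

```python
import torch

loss = torch.tensor(8.0848995)    # stand-in for the 0-dim loss tensor
num_ratings = torch.tensor(64.0)  # stand-in for the 0-dim ratings count

t_loss, total_epoch_loss, denom = 0.0, 0.0, 0.0

# Old style (warns in PyTorch 0.4, errors in 0.5):
#   t_loss += loss.data[0]
# New style: .item() converts a 0-dim tensor to a Python number.
t_loss += loss.item()
total_epoch_loss += loss.item()
denom += num_ratings.item()
```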

Could you please help me out?

okuchaiev (Member) commented Dec 6, 2018

Can you try lowering the learning rate? Try 1/10 or 1/100 of whatever you are using. Also, what is the range of your labels, e.g. 1-5 or some other range?
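A quick way to answer the label-range question and rule out bad values in the input is to scan the ratings file directly. This is a hypothetical standalone helper, assuming a tab-separated `user_id<TAB>item_id<TAB>rating` layout; adjust the parsing to match your actual files:

```python
import numpy as np

def check_ratings(path):
    """Scan a ratings file and report the label range and any bad values.
    Assumes one "user_id<TAB>item_id<TAB>rating" triple per line."""
    ratings = []
    with open(path) as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) == 3:
                ratings.append(float(parts[2]))
    r = np.asarray(ratings)
    print(f"count={r.size}, min={r.min()}, max={r.max()}, "
          f"nan={np.isnan(r).sum()}, zeros={(r == 0).sum()}")

check_ratings("my_train_data/ratings.txt")  # hypothetical path
```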

mlcoop (Author) commented Dec 6, 2018

@okuchaiev
I have lowered my lr to 0.001 and switched the optimizer to Adam. The nans disappeared, but the loss won't converge.
The labels range from 1 to 5. Also, my batch size is 64 and my hidden_layers are 128,196,256,320.
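One generic stabilizer worth trying when nans give way to a non-converging loss is gradient clipping. This is a generic PyTorch sketch, not this repo's code; `model` and `optimizer` are assumed to come from the surrounding training script:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, inputs, targets, clip_norm=1.0):
    """One training step with gradient-norm clipping."""
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = F.mse_loss(outputs, targets)
    loss.backward()
    # Clip exploding gradients, a frequent cause of nan losses.
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    return loss.item()
```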

I have tried different sets of hidden layers, and these are the ones that show some decrease in loss.
For example, 128,128,256 made the loss increase after each epoch.

I know it's an art of tuning, but could you please point me in the right direction?
Maybe my data is too sparse and my batch size and hidden layers are too small?
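On the earlier question about taking the absolute value of the MSE loss: MSE is already non-negative, so the nans were not a sign problem. For sparse data like this, the loss is typically masked so that only observed ratings contribute; here is a minimal sketch of such a masked RMSE (my reconstruction, assuming zeros mark missing entries; not the repo's exact code):

```python
import torch

def masked_rmse(outputs, targets):
    """RMSE over observed ratings only: entries where targets == 0
    are treated as missing and excluded from the loss."""
    mask = (targets != 0).float()
    num_ratings = mask.sum()
    mse = ((outputs - targets) * mask).pow(2).sum() / num_ratings
    return torch.sqrt(mse)

# Toy usage: 1-5 ratings, zeros mark missing entries.
targets = torch.tensor([[5.0, 0.0, 3.0], [0.0, 1.0, 0.0]])
outputs = torch.tensor([[4.5, 2.0, 3.2], [1.0, 1.5, 0.5]])
print(masked_rmse(outputs, targets))
# Note: a batch with zero observed ratings makes num_ratings == 0
# and the result nan, which is one mundane way nans can appear.
```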
