
training on own data, and RMSE is nan #25

Open
mlcoop opened this issue Dec 5, 2018 · 2 comments

mlcoop commented Dec 5, 2018

Hey @okuchaiev,

I have been trying to train on my own data.
The dataset consists of 539278 user_ids and 1551731 items, so the data is extremely sparse.
While training, my RMSE is nan. Should I take the absolute value of the MSE loss?

I have PyTorch 0.4 and CUDA 9.0, training on a GTX 1080 Ti.

```
Using GPUs: [0]
Doing epoch 0 of 12
run.py:198: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  t_loss += loss.data[0]
[0, 0] RMSE: 8.0848995
run.py:212: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  total_epoch_loss += loss.data[0]
[0, 1000] RMSE: nan
[0, 2000] RMSE: nan
[0, 3000] RMSE: nan
[0, 4000] RMSE: nan
[0, 5000] RMSE: nan
[0, 6000] RMSE: nan
[0, 7000] RMSE: nan
[0, 8000] RMSE: nan
Total epoch 0 finished in 1966.838391304016 seconds with TRAINING RMSE loss: nan
run.py:74: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  total_epoch_loss += loss.data[0]
run.py:75: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  denom += num_ratings.data[0]
Epoch 0 EVALUATION LOSS: nan
Saving model to model_save/model.epoch_0
Doing epoch 1 of 12
[1, 0] RMSE: nan
[1, 1000] RMSE: nan
[1, 2000] RMSE: nan
[1, 3000] RMSE: nan
[1, 4000] RMSE: nan
[1, 5000] RMSE: nan
[1, 6000] RMSE: nan
[1, 7000] RMSE: nan
[1, 8000] RMSE: nan
```
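As an aside, the UserWarning in the log comes from indexing a 0-dim tensor with `loss.data[0]`, which PyTorch 0.4 deprecates in favor of `.item()`. A minimal sketch of the fix (the tensors here are stand-ins; the real accumulators live in run.py):

```python
import torch

loss = torch.tensor(8.0848995)    # stand-in for the 0-dim loss tensor
num_ratings = torch.tensor(64.0)  # stand-in for the 0-dim ratings count

t_loss, total_epoch_loss, denom = 0.0, 0.0, 0.0

# Old style (warns in PyTorch 0.4, errors in 0.5):
#   t_loss += loss.data[0]
# New style: .item() converts a 0-dim tensor to a Python number.
t_loss += loss.item()
total_epoch_loss += loss.item()
denom += num_ratings.item()
```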

Could you please help me out?

okuchaiev (Member) commented Dec 6, 2018

Can you try lowering the learning rate? Try 1/10 or 1/100 of whatever you are using. Also, what is the range of your labels, e.g. 1-5 or some other range?
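A quick way to answer the label-range question and rule out bad values in the input is to scan the ratings file directly. This is a hypothetical standalone helper, assuming a tab-separated `user_id<TAB>item_id<TAB>rating` layout; adjust the parsing to match your actual files:

```python
import numpy as np

def check_ratings(path):
    """Scan a ratings file and report the label range and any bad values.
    Assumes one "user_id<TAB>item_id<TAB>rating" triple per line."""
    ratings = []
    with open(path) as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) == 3:
                ratings.append(float(parts[2]))
    r = np.asarray(ratings)
    print(f"count={r.size}, min={r.min()}, max={r.max()}, "
          f"nan={np.isnan(r).sum()}, zeros={(r == 0).sum()}")

check_ratings("my_train_data/ratings.txt")  # hypothetical path
```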

mlcoop (Author) commented Dec 6, 2018

@okuchaiev
I have lowered my lr to 0.001 and switched the optimizer to Adam. The nans disappeared, but the loss won't converge.
The labels range from 1 to 5. Also, my batch size is 64 and my hidden_layers are 128,196,256,320.
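One generic stabilizer worth trying when nans give way to a non-converging loss is gradient clipping. This is a generic PyTorch sketch, not this repo's code; `model` and `optimizer` are assumed to come from the surrounding training script:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, inputs, targets, clip_norm=1.0):
    """One training step with gradient-norm clipping."""
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = F.mse_loss(outputs, targets)
    loss.backward()
    # Clip exploding gradients, a frequent cause of nan losses.
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    return loss.item()
```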

I have tried different sets of hidden layers, and these are the ones that show some decrease in loss.
For example, 128,128,256 made the loss increase after each epoch.

I know it's an art of tuning, but could you please point me in the right direction?
Maybe my data is too sparse and my batch size and hidden layers are too small?
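On the earlier question about taking the absolute value of the MSE loss: MSE is already non-negative, so the nans were not a sign problem. For sparse data like this, the loss is typically masked so that only observed ratings contribute; here is a minimal sketch of such a masked RMSE (my reconstruction, assuming zeros mark missing entries; not the repo's exact code):

```python
import torch

def masked_rmse(outputs, targets):
    """RMSE over observed ratings only: entries where targets == 0
    are treated as missing and excluded from the loss."""
    mask = (targets != 0).float()
    num_ratings = mask.sum()
    mse = ((outputs - targets) * mask).pow(2).sum() / num_ratings
    return torch.sqrt(mse)

# Toy usage: 1-5 ratings, zeros mark missing entries.
targets = torch.tensor([[5.0, 0.0, 3.0], [0.0, 1.0, 0.0]])
outputs = torch.tensor([[4.5, 2.0, 3.2], [1.0, 1.5, 0.5]])
print(masked_rmse(outputs, targets))
# Note: a batch with zero observed ratings makes num_ratings == 0
# and the result nan, which is one mundane way nans can appear.
```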
