
Loss is inf or nan, bug?, example file attached! #678

Open
Spiegel-Leser opened this issue Nov 19, 2021 · 5 comments

Spiegel-Leser commented Nov 19, 2021

Training the agent often fails with the message "Loss is inf or nan". I found another thread where missing normalization was the culprit, but I don't know what that means; I could find nothing about normalization in the documentation (though maybe I searched for the wrong terms). Can you clear this up for me?

Debugging shows:
dqn_agent.py, l. 451: transition = self._as_transition(experience)
calls
data_converter.py, l. 423: value = trajectory.to_n_step_transition(value, gamma=self._gamma)
calls
trajectory.py, l. 780-787,
which create NaNs in time_steps.
After that, dqn_agent.py, l. 456, 469 and 471 (q_values, td_error, td_loss) carry those NaNs into the loss.
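The propagation itself is easy to reproduce outside TF-Agents. A minimal sketch (plain NumPy, all values hypothetical) showing how a single NaN reward poisons the TD error and therefore the whole loss:

```python
import numpy as np

# Hypothetical minimal TD-loss computation: one NaN reward is enough
# to make the final loss NaN, mirroring the behaviour described above.
rewards = np.array([1.0, np.nan, 0.5])      # NaN injected, as in trajectory.py
discounts = np.array([0.99, 0.99, 0.99])
q_values = np.array([2.0, 1.5, 1.0])
next_q = np.array([1.8, 1.2, 0.9])

td_targets = rewards + discounts * next_q   # NaN propagates into the target
td_errors = td_targets - q_values           # ...and into the error
td_loss = np.mean(np.square(td_errors))     # a mean over a NaN is NaN

print(np.isnan(td_loss))  # True: the whole loss is poisoned
```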

So trajectory.py, l. 780-787, are the responsible lines. In a working case they don't produce NaN. Their exact code is:

  time_steps = ts.TimeStep(
      first_frame.step_type,
      # unknown
      reward=tf.nest.map_structure(
          lambda r: np.nan * tf.ones_like(r), first_frame.reward),
      # unknown
      discount=np.nan * tf.ones_like(first_frame.discount),
      observation=first_frame.observation)
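The `# unknown` comments suggest the NaNs here are deliberate sentinels for fields that have no meaningful value on the first frame of an n-step transition. A simplified stand-in (plain NumPy and a namedtuple instead of the real `ts.TimeStep` class, values hypothetical) shows the pattern:

```python
import numpy as np
from collections import namedtuple

# Simplified stand-in for ts.TimeStep (the real class lives in
# tf_agents.trajectories); field names match the quoted call.
TimeStep = namedtuple('TimeStep', ['step_type', 'reward', 'discount', 'observation'])

first_frame_reward = np.array([0.7])
first_frame_discount = np.array([1.0])

# Same pattern as the quoted lines: NaN deliberately marks "unknown"
# fields, while step_type and observation are carried over intact.
time_step = TimeStep(
    step_type=0,
    reward=np.nan * np.ones_like(first_frame_reward),
    discount=np.nan * np.ones_like(first_frame_discount),
    observation=np.array([0.1, 0.2]))

print(bool(np.isnan(time_step.reward).all()))  # True: sentinel, not data
print(time_step.observation)                   # observation is untouched
```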
Spiegel-Leser (Author):

The problem returned, so reopening this

@Spiegel-Leser Spiegel-Leser reopened this Dec 17, 2021

Spiegel-Leser commented Dec 23, 2021

Perhaps no one answers this thread because you cannot reproduce the behaviour. I wrote a file that produces the NaN values (attached). Please help me!
produce_nan_bug.zip

@Spiegel-Leser Spiegel-Leser changed the title Loss is inf or nan Loss is inf or nan, bug?, example file attached! Dec 23, 2021

sguada commented Jan 4, 2022

Although that code adds NaNs during the conversion to a transition, those values are discarded afterwards, so they shouldn't be the cause.

Spiegel-Leser (Author):

I don't understand. As stated above, the NaNs definitely produce the error during training. How can I improve the error description for you?


sguada commented Jan 4, 2022

What I meant is that this code shouldn't be the cause of the NaNs you are seeing in training, unless there is some problem with the data. It's hard to know how you got NaNs in your training loop, but yes, debugging the inputs to DQNAgent.train(), which shouldn't contain any NaNs, is a good first step.

When you run any of the existing examples, do you see a similar error, or is it just your specific example?
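A simple first check along these lines is to scan every array fed to the agent for non-finite values before calling train(). This is a sketch, not TF-Agents API: the helper name and the flat-dict batch layout are hypothetical. (With TensorFlow available, tf.debugging.check_numerics does a similar per-tensor check inside the graph.)

```python
import numpy as np

def find_nans(batch):
    """Return the keys of any arrays in `batch` containing NaN or inf.

    `batch` is assumed to be a flat dict of NumPy arrays, e.g. the fields
    of an experience sample just before DQNAgent.train() is called.
    """
    return [key for key, arr in batch.items()
            if not np.all(np.isfinite(arr))]

# Hypothetical experience batch with a corrupted reward.
batch = {
    'observation': np.zeros((4, 8)),
    'reward': np.array([0.0, np.nan, 1.0, 0.5]),
    'discount': np.ones(4),
}
print(find_nans(batch))  # ['reward']
```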
