
Loss is inf or nan, bug?, example file attached! #678

Open
Spiegel-Leser opened this issue Nov 19, 2021 · 5 comments

Spiegel-Leser commented Nov 19, 2021

Training the agent often fails with the message "Loss is inf or nan". I found another thread where missing normalization was the culprit, but I don't know what that means; I could find nothing about normalization in the documentation (though maybe I searched for the wrong terms). Can you clear this up for me?

Debugging shows:
dqn_agent.py, l. 451: transition = self._as_transition(experience)
calls
data_converter.py, l. 423: value = trajectory.to_n_step_transition(value, gamma=self._gamma)
calls
trajectory.py, l. 780-787,
which create NaNs in time_steps.
After that, dqn_agent.py, l. 456, 469 and 471 (q_values, td_error, td_loss) carry those NaNs into the loss.
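The propagation itself is easy to reproduce outside TF-Agents. A minimal sketch (plain NumPy, all values hypothetical) showing how a single NaN reward poisons the TD error and therefore the whole loss:

```python
import numpy as np

# Hypothetical minimal TD-loss computation: one NaN reward is enough
# to make the final loss NaN, mirroring the behaviour described above.
rewards = np.array([1.0, np.nan, 0.5])      # NaN injected, as in trajectory.py
discounts = np.array([0.99, 0.99, 0.99])
q_values = np.array([2.0, 1.5, 1.0])
next_q = np.array([1.8, 1.2, 0.9])

td_targets = rewards + discounts * next_q   # NaN propagates into the target
td_errors = td_targets - q_values           # ...and into the error
td_loss = np.mean(np.square(td_errors))     # a mean over a NaN is NaN

print(np.isnan(td_loss))  # True: the whole loss is poisoned
```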

So trajectory.py, l. 780-787, are the responsible lines. In a working case they don't produce NaN. Their exact code is:

  time_steps = ts.TimeStep(
      first_frame.step_type,
      # unknown
      reward=tf.nest.map_structure(
          lambda r: np.nan * tf.ones_like(r), first_frame.reward),
      # unknown
      discount=np.nan * tf.ones_like(first_frame.discount),
      observation=first_frame.observation)
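The `# unknown` comments suggest the NaNs here are deliberate sentinels for fields that have no meaningful value on the first frame of an n-step transition. A simplified stand-in (plain NumPy and a namedtuple instead of the real `ts.TimeStep` class, values hypothetical) shows the pattern:

```python
import numpy as np
from collections import namedtuple

# Simplified stand-in for ts.TimeStep (the real class lives in
# tf_agents.trajectories); field names match the quoted call.
TimeStep = namedtuple('TimeStep', ['step_type', 'reward', 'discount', 'observation'])

first_frame_reward = np.array([0.7])
first_frame_discount = np.array([1.0])

# Same pattern as the quoted lines: NaN deliberately marks "unknown"
# fields, while step_type and observation are carried over intact.
time_step = TimeStep(
    step_type=0,
    reward=np.nan * np.ones_like(first_frame_reward),
    discount=np.nan * np.ones_like(first_frame_discount),
    observation=np.array([0.1, 0.2]))

print(bool(np.isnan(time_step.reward).all()))  # True: sentinel, not data
print(time_step.observation)                   # observation is untouched
```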
Spiegel-Leser (Author):

The problem returned, so reopening this

@Spiegel-Leser Spiegel-Leser reopened this Dec 17, 2021

Spiegel-Leser commented Dec 23, 2021

Perhaps no one answers this thread because you cannot reproduce the behaviour. I wrote a file that produces the NaN values (attached). Please help me!
produce_nan_bug.zip

@Spiegel-Leser Spiegel-Leser changed the title Loss is inf or nan Loss is inf or nan, bug?, example file attached! Dec 23, 2021

sguada commented Jan 4, 2022

Although that code adds NaNs during the conversion to a transition, those values are discarded afterwards, so they shouldn't be the cause.

Spiegel-Leser (Author):

I don't understand. As stated above, the NaNs definitely produce the error during training. How can I improve the error description for you?


sguada commented Jan 4, 2022

What I meant is that this code shouldn't be the cause of the NaNs you are seeing in training, unless there is some problem with the data. It's hard to know how you got NaNs in your training loop, but yes, debugging the inputs to DQNAgent.train(), which shouldn't contain any NaNs, is a good first step.

When you run any of the existing examples, do you see a similar error, or is it just your specific example?
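A simple first check along these lines is to scan every array fed to the agent for non-finite values before calling train(). This is a sketch, not TF-Agents API: the helper name and the flat-dict batch layout are hypothetical. (With TensorFlow available, tf.debugging.check_numerics does a similar per-tensor check inside the graph.)

```python
import numpy as np

def find_nans(batch):
    """Return the keys of any arrays in `batch` containing NaN or inf.

    `batch` is assumed to be a flat dict of NumPy arrays, e.g. the fields
    of an experience sample just before DQNAgent.train() is called.
    """
    return [key for key, arr in batch.items()
            if not np.all(np.isfinite(arr))]

# Hypothetical experience batch with a corrupted reward.
batch = {
    'observation': np.zeros((4, 8)),
    'reward': np.array([0.0, np.nan, 1.0, 0.5]),
    'discount': np.ones(4),
}
print(find_nans(batch))  # ['reward']
```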
