Categorical Q Network Performing Worse than Q Network #331
Comments
You may need to adjust the value range of the Categorical Q-Network. For instance, if you never get negative rewards, or only -1 when the agent dies, you may want to change min_q_value=-1; you may also need to adjust n_atoms=11 or n_atoms=21. The default values are the ones used for Atari.
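For concreteness, here is a minimal sketch of where those values plug into the TF-Agents C51 API used in tutorial 9. The specific numbers (num_atoms=21, min_q_value=-1, max_q_value=20), the optimizer, and the `train_env` variable are illustrative assumptions, not a verified fix:

```python
import tensorflow as tf
from tf_agents.agents.categorical_dqn import categorical_dqn_agent
from tf_agents.networks import categorical_q_network
from tf_agents.utils import common

# Support of the return distribution: num_atoms evenly spaced values between
# min_q_value and max_q_value. These should bracket the returns the
# environment can actually produce.
num_atoms = 21        # default is 51 (Atari); 11 or 21 may suit a small reward range
min_q_value = -1.0    # lowest return expected (e.g. -1 on death)
max_q_value = 20.0    # highest return expected

# train_env is assumed to be an existing TFPyEnvironment wrapping the game.
# Note: the network argument is num_atoms; the range goes to the agent.
categorical_q_net = categorical_q_network.CategoricalQNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    num_atoms=num_atoms,
    fc_layer_params=(100, 100, 100))

agent = categorical_dqn_agent.CategoricalDqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    categorical_q_network=categorical_q_net,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3),
    min_q_value=min_q_value,
    max_q_value=max_q_value,
    n_step_update=2,
    td_errors_loss_fn=common.element_wise_squared_loss,
    gamma=0.99,
    train_step_counter=tf.Variable(0))
agent.initialize()
```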
What's the reasoning for adjusting n_atoms to 11 or 21? Wouldn't that just give a distribution of Q-values with less resolution? Has this ever been shown to help?
Take a look at Figure 3 of https://arxiv.org/pdf/1707.06887.pdf, where they run different numbers of atoms to find the best setting for Atari. 5, 11, 21, and 51 were the values they tried.
I see... so 21 returns (atoms) seemed to perform better for Asterix. I am making the following changes for trial 1: min_q_value=-1, max_q_value=20, n_atoms=21, n_step_update=2. I will come back in 8 hours once my code has finished running and comment on the results.
I should also note that I am using an epsilon-greedy policy (decayed to eps=0.01 over 500000 steps) for the Q Network but not for the Categorical Q Network (my reasoning for not using one was that there wasn't one in tutorial 9). Could this be the reason why the agent is performing poorly, i.e. not enough initial exploration while learning? @sguada
Yeah, that could explain the bad performance; you need some extra exploration at the beginning of training, even for Categorical Q-Networks.
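One way to add that exploration is to pass a decaying epsilon to the agent's epsilon_greedy argument (it is forwarded to the epsilon-greedy collect policy and can be a callable). A rough sketch, reusing train_env and categorical_q_net from the earlier snippet; the schedule mirrors the decay to 0.01 over 500000 steps mentioned above, and the initial epsilon of 1.0 is an assumption:

```python
import tensorflow as tf
from tf_agents.agents.categorical_dqn import categorical_dqn_agent

train_step_counter = tf.Variable(0, dtype=tf.int64)

# Linearly decay epsilon from 1.0 to 0.01 over the first 500k training steps.
# PolynomialDecay is a schedule object; calling it with a step returns epsilon.
epsilon_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1.0,   # interpreted here as the initial epsilon
    decay_steps=500_000,
    end_learning_rate=0.01)      # final epsilon

agent = categorical_dqn_agent.CategoricalDqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    categorical_q_network=categorical_q_net,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3),
    # Callable epsilon: the collect policy re-evaluates it as training advances.
    epsilon_greedy=lambda: epsilon_schedule(train_step_counter),
    min_q_value=-1.0,
    max_q_value=20.0,
    n_step_update=2,
    train_step_counter=train_step_counter)
agent.initialize()

# Collect experience with agent.collect_policy (epsilon-greedy exploration)
# and evaluate with agent.policy (greedy).
```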
@sguada Here are the results for n_atoms=21 vs n_atoms=51. Recall that this was trained with a uniform replay buffer and an epsilon-greedy policy decayed over 500000 steps. The initial negative reward is simply because I give the snake -0.5 for running into itself (and then take a different random action where the snake does not run into itself; initially it runs into itself a lot).

The most interesting feature I see here is that the reward gets worse for n_atoms=51 over time but stays static for n_atoms=21. What could be the reason for this? It seems to suggest that n_atoms=21 has somehow found a more accurate distribution of the Q-values. Is it possibly hinting at a design flaw somewhere else? This seems to be the exact opposite of Figure 3 in the paper you linked (see the Breakout graph, for example), where a small number of atoms causes the reward to decrease over time while a larger number of atoms causes it to increase.
I am currently training an agent that receives sparse rewards (a reward of +1 approximately every 5-10 moves). I have the discount factor set to 0.99.
I trained both a Q Network and a Categorical Q Network on the environment. For both I used fc_layer_params=[100,100,100], and for the Categorical Q Network I used n_atoms=51, min_q_value=-20, max_q_value=20, and n_step_update=2. The Categorical Q Network performs significantly worse on the environment: its average reward initially increases up to +10 (after 1 million steps) and then gradually decreases to +6 (from 1 million to 10 million steps).
Might there be a reason for this? Are the min and max q values inappropriate? I'm quite confused: the C51 paper linked in your tutorial says that C51 is particularly suited for complex environments with sparse rewards, which is precisely my environment (I'm training a snake game where the agent gets +1 reward for eating food and -1 reward for dying).
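As a rough sanity check on whether the min and max q values are appropriate (a back-of-the-envelope sketch, assuming the reward structure described above: +1 roughly every 5-10 moves, -1 on death, gamma=0.99), the discounted return is bounded by a geometric series, which can guide the choice of min_q_value/max_q_value:

```python
# Rough bound on the discounted return when a +1 reward arrives every
# `steps_per_reward` steps forever: a geometric series in gamma**steps_per_reward.
gamma = 0.99

def max_discounted_return(steps_per_reward, gamma=gamma):
    return 1.0 / (1.0 - gamma ** steps_per_reward)

print(max_discounted_return(5))   # ~20.4 -> max_q_value around +20 is plausible
print(max_discounted_return(10))  # ~10.5
# The only negative reward is -1 on death, so min_q_value=-1 (rather than -20)
# would keep more of the atoms on returns the agent can actually experience.
```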
Any help or suggestions would be appreciated.