Categorical Q Network Performing Worse than Q Network #331
Comments
You may need to adjust the value range of the Categorical Q-Network. For instance, if you never get negative rewards, or only -1 when the agent dies, you may want to change min_q_value=-1; you may also need to adjust n_atoms=11 or n_atoms=21. The default values are the ones used for Atari.
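For concreteness, here is a minimal sketch of where those values plug into the TF-Agents C51 API used in tutorial 9. The specific numbers (num_atoms=21, min_q_value=-1, max_q_value=20), the optimizer, and the `train_env` variable are illustrative assumptions, not a verified fix:

```python
import tensorflow as tf
from tf_agents.agents.categorical_dqn import categorical_dqn_agent
from tf_agents.networks import categorical_q_network
from tf_agents.utils import common

# Support of the return distribution: num_atoms evenly spaced values between
# min_q_value and max_q_value. These should bracket the returns the
# environment can actually produce.
num_atoms = 21        # default is 51 (Atari); 11 or 21 may suit a small reward range
min_q_value = -1.0    # lowest return expected (e.g. -1 on death)
max_q_value = 20.0    # highest return expected

# train_env is assumed to be an existing TFPyEnvironment wrapping the game.
# Note: the network argument is num_atoms; the range goes to the agent.
categorical_q_net = categorical_q_network.CategoricalQNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    num_atoms=num_atoms,
    fc_layer_params=(100, 100, 100))

agent = categorical_dqn_agent.CategoricalDqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    categorical_q_network=categorical_q_net,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3),
    min_q_value=min_q_value,
    max_q_value=max_q_value,
    n_step_update=2,
    td_errors_loss_fn=common.element_wise_squared_loss,
    gamma=0.99,
    train_step_counter=tf.Variable(0))
agent.initialize()
```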
What's the reasoning for adjusting n_atoms to 11 or 21? Wouldn't that just give a distribution of Q-values with less resolution? Has this ever been shown to help?
Take a look at Figure 3 of https://arxiv.org/pdf/1707.06887.pdf, where they run different numbers of atoms to find the best setting for Atari. 5, 11, 21, and 51 were the values they tried.
I see... so 21 returns (atoms) seemed to perform better for Asterix. I am making the following changes for trial 1: min_q_value=-1, max_q_value=20, n_atoms=21, n_step_update=2. I will come back in 8 hours once my code has finished running and comment on the results.
I should also note that I am using an epsilon-greedy policy (decayed to eps=0.01 over 500000 steps) for the Q Network but not for the Categorical Q Network (my reasoning for not using one was that there wasn't one in tutorial 9). Could this be the reason why the agent is performing poorly, i.e. not enough initial exploration while learning? @sguada
Yeah, that could explain the bad performance; you need some extra exploration at the beginning of training, even for Categorical Q-Networks.
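One way to add that exploration is to pass a decaying epsilon to the agent's epsilon_greedy argument (it is forwarded to the epsilon-greedy collect policy and can be a callable). A rough sketch, reusing train_env and categorical_q_net from the earlier snippet; the schedule mirrors the decay to 0.01 over 500000 steps mentioned above, and the initial epsilon of 1.0 is an assumption:

```python
import tensorflow as tf
from tf_agents.agents.categorical_dqn import categorical_dqn_agent

train_step_counter = tf.Variable(0, dtype=tf.int64)

# Linearly decay epsilon from 1.0 to 0.01 over the first 500k training steps.
# PolynomialDecay is a schedule object; calling it with a step returns epsilon.
epsilon_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1.0,   # interpreted here as the initial epsilon
    decay_steps=500_000,
    end_learning_rate=0.01)      # final epsilon

agent = categorical_dqn_agent.CategoricalDqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    categorical_q_network=categorical_q_net,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3),
    # Callable epsilon: the collect policy re-evaluates it as training advances.
    epsilon_greedy=lambda: epsilon_schedule(train_step_counter),
    min_q_value=-1.0,
    max_q_value=20.0,
    n_step_update=2,
    train_step_counter=train_step_counter)
agent.initialize()

# Collect experience with agent.collect_policy (epsilon-greedy exploration)
# and evaluate with agent.policy (greedy).
```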
@sguada Here are the results for n_atoms=21 vs n_atoms=51. Recall that this was trained with a uniform replay buffer and an epsilon-greedy policy decayed over 500000 steps. The initial negative reward is simply because I give the snake -0.5 for running into itself (and then take a different random action where the snake does not run into itself; initially it runs into itself a lot).

The most interesting feature I see here is that the reward gets worse for n_atoms=51 over time but stays static for n_atoms=21. What could be the reason for this? It seems to suggest that n_atoms=21 has somehow found a more accurate distribution of the Q-values. Is it possibly hinting at a design flaw somewhere else? This seems to be the exact opposite of Figure 3 in the paper you linked (see the Breakout graph, for example), where a small number of atoms causes the reward to decrease over time while a larger number of atoms causes it to increase.
I am currently training an agent that receives sparse rewards (a reward of +1 approximately every 5-10 moves). I have the discount factor set to 0.99.
I trained both a Q Network and a Categorical Q Network on the environment. For both I used fc_layer_params=[100,100,100], and for the Categorical Q Network I used n_atoms=51, min_q_value=-20, max_q_value=20, and n_step_update=2. The Categorical Q Network performs significantly worse on the environment: its average reward initially increases up to +10 (after 1 million steps) and then gradually decreases to +6 (from 1 million to 10 million steps).
Might there be a reason for this? Are the min and max q values inappropriate? I'm quite confused: the C51 paper linked in your tutorial says that C51 is particularly suited for complex environments with sparse rewards, which is precisely my environment (I'm training a snake game where the agent gets +1 reward for eating food and -1 reward for dying).
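As a rough sanity check on whether the min and max q values are appropriate (a back-of-the-envelope sketch, assuming the reward structure described above: +1 roughly every 5-10 moves, -1 on death, gamma=0.99), the discounted return is bounded by a geometric series, which can guide the choice of min_q_value/max_q_value:

```python
# Rough bound on the discounted return when a +1 reward arrives every
# `steps_per_reward` steps forever: a geometric series in gamma**steps_per_reward.
gamma = 0.99

def max_discounted_return(steps_per_reward, gamma=gamma):
    return 1.0 / (1.0 - gamma ** steps_per_reward)

print(max_discounted_return(5))   # ~20.4 -> max_q_value around +20 is plausible
print(max_discounted_return(10))  # ~10.5
# The only negative reward is -1 on death, so min_q_value=-1 (rather than -20)
# would keep more of the atoms on returns the agent can actually experience.
```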
Any help or suggestions would be appreciated.