
PPOAgent Entropy Regularization, Clipping, GAE Are Working Incorrectly #681

Open
kochlisGit opened this issue Nov 25, 2021 · 3 comments

kochlisGit commented Nov 25, 2021

I have been trying to implement a PPO agent that solves LunarLander-v2, as in the official example in the GitHub repo:
https://github.com/tensorflow/agents/blob/master/tf_agents/agents/ppo/examples/v2/train_eval_clip_agent.py

In this example, a PPOClipAgent is used. However, I would like to use both clipping and the KL penalty, so I used the PPOAgent class, which provides both options according to the documentation here:
https://www.tensorflow.org/agents/api_docs/python/tf_agents/agents/PPOAgent

As you may notice, all of the KL-penalty parameters already default to the values from the original paper. However, importance_ratio_clipping (clipping), entropy_regularization (entropy coefficient), and use_gae (generalized advantage estimation) are disabled by default.

I left the rest of the parameters at their defaults and made only the following changes (see the sketch after this list):

importance_ratio_clipping=0.3
entropy_regularization=0.01
use_gae=True
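
For reference, here is a minimal sketch of how I construct the agent. The network sizes, optimizer, and learning rate below are illustrative placeholders rather than my exact training setup; everything not listed is left at the documented defaults.

```python
import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.networks import actor_distribution_network
from tf_agents.networks import value_network

# Wrap the Gym environment for TF-Agents.
env = tf_py_environment.TFPyEnvironment(suite_gym.load('LunarLander-v2'))

# Illustrative network sizes.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    env.observation_spec(), env.action_spec(), fc_layer_params=(64, 64))
value_net = value_network.ValueNetwork(
    env.observation_spec(), fc_layer_params=(64, 64))

agent = ppo_agent.PPOAgent(
    env.time_step_spec(),
    env.action_spec(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
    actor_net=actor_net,
    value_net=value_net,
    # The three changes relative to the defaults:
    importance_ratio_clipping=0.3,
    entropy_regularization=0.01,
    use_gae=True,
    # All KL-penalty parameters are left at the documented defaults.
)
agent.initialize()
```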

While the original PPOAgent works perfectly without changing those parameters, changing one of them, or all of them, makes the agent diverge quickly and consistently yields negative rewards, no matter how long I train. I ran these experiments many times to see if I could get a better result, but the algorithm always performed very poorly.

To test whether this is a bug in the tf-agents PPOAgent class, I ran the same algorithm with the same parameters using RLlib, and I also adjusted the remaining parameters so that they match the tf-agents defaults. Surprisingly, their implementation has no trouble converging while using clipping, the KL penalty, the entropy coefficient, and GAE at the same time! Here are the results:

https://github.com/kochlisGit/DRL-Frameworks/blob/main/rllib/ppo_average_return.png
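
For completeness, the RLlib run was configured roughly as in the sketch below. It is a simplified reconstruction rather than a verbatim copy of my script; the values mirror the tf-agents settings above, and kl_coeff / kl_target are RLlib's adaptive-KL-penalty parameters.

```python
import ray
from ray import tune

ray.init()

# PPO with clipping, adaptive KL penalty, entropy bonus, and GAE all enabled.
config = {
    'env': 'LunarLander-v2',
    'framework': 'tf2',
    'clip_param': 0.3,        # importance ratio clipping
    'entropy_coeff': 0.01,    # entropy regularization
    'use_gae': True,
    'lambda': 0.95,           # GAE lambda, matching tf-agents' lambda_value default
    'kl_coeff': 0.2,          # adaptive KL penalty coefficient
    'kl_target': 0.01,        # KL target
}

tune.run('PPO', config=config, stop={'timesteps_total': 1_000_000})
```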

@summer-yue
Member

Thank you for raising this issue and for the detailed description of the problem!

Could you please try this with our new example: https://github.com/tensorflow/agents/blob/master/tf_agents/examples/ppo/schulman17/ppo_clip_train_eval.py

The new examples are tested nightly and verified against the numbers reported in the paper, so they are more reliable.

Once you try the new examples, could you verify (1) whether you get the expected learning with the schulman17 parameters and just clipping, and (2) whether learning stops once you add the KL terms?

This will help us narrow down where the issue is. My guess is that something in the KL-related implementation might be scaled differently; that logic is less widely used than the clipping version. Thanks in advance!

@summer-yue
Member

In the meantime, our team will look into pointing users to the new examples more clearly, instead of the older versions. Thanks again.

@kochlisGit
Author

I have tested PPOClipAgent on LunarLander-v2 three times using the example
https://github.com/tensorflow/agents/blob/master/tf_agents/examples/ppo/schulman17/ppo_clip_train_eval.py

It works fine: the agent learns and converges quickly. Then I added entropy_regularization=0.01, which didn't change the training process much (GAE was already True by default). To use the KL parameters, I had to switch from PPOClipAgent to PPOAgent and set importance_ratio_clipping=0.2. This did not reach the expected returns.
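
To make the comparison concrete, the swap looks roughly like the sketch below. The network sizes and optimizer are illustrative placeholders, and the KL-penalty parameters are left at their documented defaults; this is not the full train_eval script from the example.

```python
import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.agents.ppo import ppo_clip_agent
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.networks import actor_distribution_network
from tf_agents.networks import value_network

env = tf_py_environment.TFPyEnvironment(suite_gym.load('LunarLander-v2'))
actor_net = actor_distribution_network.ActorDistributionNetwork(
    env.observation_spec(), env.action_spec(), fc_layer_params=(64, 64))
value_net = value_network.ValueNetwork(
    env.observation_spec(), fc_layer_params=(64, 64))

common_kwargs = dict(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
    actor_net=actor_net,
    value_net=value_net,
    importance_ratio_clipping=0.2,
    entropy_regularization=0.01,
    use_gae=True,
)

# Clipping only: learns and converges quickly.
clip_agent = ppo_clip_agent.PPOClipAgent(
    env.time_step_spec(), env.action_spec(), **common_kwargs)

# Clipping plus the adaptive KL penalty: does not reach the expected returns.
kl_agent = ppo_agent.PPOAgent(
    env.time_step_spec(), env.action_spec(), **common_kwargs)
```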
