I have been trying to implement a PPO agent that solves LunarLander-v2, as in the official example in the GitHub repo:
https://github.com/tensorflow/agents/blob/master/tf_agents/agents/ppo/examples/v2/train_eval_clip_agent.py
In this example, a PPOClipAgent is used. However, I would like to use both clipping and the KL penalty, so I used the PPOAgent class, which provides both options according to the documentation here:
https://www.tensorflow.org/agents/api_docs/python/tf_agents/agents/PPOAgent
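For reference, the setup I'm describing is roughly the sketch below; the network sizes, learning rate, and num_epochs are placeholders of my own, not the exact values from the example script:

```python
# Rough sketch of a PPOAgent setup for LunarLander-v2 (hyperparameter
# values below are placeholders, not the ones from the official example).
import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import actor_distribution_network, value_network

# Load the Gym environment and wrap it for TensorFlow.
train_env = tf_py_environment.TFPyEnvironment(suite_gym.load('LunarLander-v2'))

# Actor and value networks used by the agent.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=(64, 64))
value_net = value_network.ValueNetwork(
    train_env.observation_spec(),
    fc_layer_params=(64, 64))

# PPOAgent with clipping, entropy regularization and GAE left at their
# defaults, i.e. only the adaptive KL penalty is active.
agent = ppo_agent.PPOAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3),
    actor_net=actor_net,
    value_net=value_net,
    num_epochs=25)
agent.initialize()
```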
As you may notice, all of the KL-penalty parameters already default to the values from the original paper. However, importance_ratio_clipping (clipping), entropy_regularization (entropy coefficient), and use_gae (Generalized Advantage Estimation) are disabled by default.
I left the rest of the parameters as they are, but made the following changes (a sketch of the resulting constructor call follows the list):
importance_ratio_clipping=0.3
entropy_regularization=0.01
use_gae=True
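Concretely, the modified construction is the same call as in the sketch above, with only these three overrides added (the KL-penalty arguments keep their paper defaults). Again, this is a sketch rather than my exact script:

```python
# Same PPOAgent construction as above, with only the three overrides added.
agent = ppo_agent.PPOAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3),
    actor_net=actor_net,
    value_net=value_net,
    importance_ratio_clipping=0.3,  # enable PPO-style ratio clipping
    entropy_regularization=0.01,    # enable the entropy bonus
    use_gae=True,                   # enable GAE (lambda_value defaults to 0.95)
    num_epochs=25)
```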
While the original PPOAgent works perfectly when those parameters are left untouched, changing one of them, or all of them, makes the agent diverge quickly and always end up with negative rewards, no matter how long I train it. I ran those experiments many times to see if I could get a better result, but the algorithm always did very poorly.
To test whether this is a bug in the tf-agents PPOAgent class, I ran the same algorithm with the same parameters using RLlib, changing the remaining parameters so that they match the tf-agents defaults. Surprisingly, their implementation has no problem converging while using clipping, the KL penalty, the entropy coefficient, and GAE at the same time! Here are the results:
https://github.com/kochlisGit/DRL-Frameworks/blob/main/rllib/ppo_average_return.png
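For anyone who wants to reproduce the comparison, an equivalent RLlib run can be configured roughly as below. The values are chosen to mirror the tf-agents settings; this is a reconstruction, not a copy of the exact script I used:

```python
# Rough sketch of an RLlib PPO run on LunarLander-v2 that enables clipping,
# the adaptive KL penalty, entropy regularization and GAE at the same time.
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"episode_reward_mean": 200},  # LunarLander-v2 is considered solved at 200
    config={
        "env": "LunarLander-v2",
        "clip_param": 0.3,       # importance-ratio clipping
        "kl_coeff": 1.0,         # initial adaptive KL-penalty coefficient
        "kl_target": 0.01,       # adaptive KL target
        "entropy_coeff": 0.01,   # entropy regularization
        "use_gae": True,
        "lambda": 0.95,          # GAE lambda
        "gamma": 0.99,
    },
)
```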
The new versions of the examples are nightly tested and verified against the numbers reported in the paper (and are therefore more reliable). Once you try the new examples, could you verify: (1) whether you get the expected learning with the schulman17 parameters and just clipping, and (2) whether it stops learning when you add the KL terms? This will help us narrow down where the issues are. My guess is that something in the KL-related implementation might be scaled differently; that logic is less widely used than the clipping version. Thanks in advance!

It is working fine: the agent learns and converges quickly. Then I added entropy_regularization=0.01, which didn't change much in the training process (GAE was already True by default). In order to use the KL parameters, I had to switch from PPOClipAgent to PPOAgent and set importance_ratio_clipping=0.2. This did not produce the expected returns.