I'm using the code from this GitHub repository to train an RLHF model with PPOTrainer. My custom reward model outputs scores ranging from 0 (bad) to 1 (good).
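For context, here is a minimal sketch of the kind of PPOTrainer loop this setup follows (illustrative only; the model name, prompts, and the `reward_model` stub below are placeholders rather than my actual rl_training.py):

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Placeholder model name and hyperparameters; the real run uses the repo's
# rl_training.py defaults plus my custom reward model.
config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=2, mini_batch_size=2)

tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ppo_trainer = PPOTrainer(config=config, model=model, ref_model=ref_model, tokenizer=tokenizer)

def reward_model(texts):
    # Hypothetical stand-in for the custom reward model: one scalar in [0, 1] per response.
    return [torch.tensor(min(1.0, len(t) / 100)) for t in texts]

prompts = ["The movie was", "I really think the service"]
query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

for step in range(150):
    # Generate response tokens only (no prompt), as PPOTrainer.step expects.
    response_tensors = ppo_trainer.generate(
        query_tensors,
        return_prompt=False,
        max_new_tokens=32,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    responses = tokenizer.batch_decode(response_tensors)

    rewards = reward_model(responses)  # list of scalar tensors in [0, 1]
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, {"query": prompts, "response": responses}, rewards)
```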
After 150 steps, I've observed some issues: objective/kl increases to 10, ppo/loss/value decreases significantly, ppo/loss/total varies around 0 and can be negative, and the mean reward per epoch (which should ideally increase) gradually drops, even going negative against a 0.5 baseline.
I think there might be a problem with the PPOTrainer's loss function, though I'm not fully familiar with the code base. Any advice/pointers would be greatly appreciated, thank you!
(Note: The KL value in the graph is scaled down by a factor of 10 for better visualization.)
Here are my questions:
Why do ppo/loss/total and ppo/loss/value become negative? Could it be due to the high KL value? Shouldn't the loss function keep the KL value from getting too large?
At which point in the current epoch should I have stopped?
I also notice here that ppo/loss/total = ppo/loss/value + ppo/loss/policy, but I'm not sure I'm seeing that in the stats_dict. The ppo/mean_non_score_reward (KL penalty) values also didn't make much sense to me given the high KL values.
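For reference, here is a toy sketch of how I currently understand these logged quantities to relate (my reading of the trainer code may well be off; kl_coef and vf_coef are my guesses at the relevant coefficient names, and the numbers are made up):

```python
import torch

# Toy numbers just to illustrate the relationships I'm assuming,
# not the actual trainer code.
kl_coef, vf_coef = 0.2, 0.1
logprobs = torch.tensor([[-1.0, -2.0, -1.5]])      # policy logprobs for one response
ref_logprobs = torch.tensor([[-1.2, -2.5, -3.0]])  # reference-model logprobs
score = torch.tensor(0.8)                          # reward model output in [0, 1]

per_token_kl = logprobs - ref_logprobs             # objective/kl aggregates this per-token KL
non_score_reward = -kl_coef * per_token_kl         # ppo/mean_non_score_reward ~ its mean
rewards = non_score_reward.clone()
rewards[:, -1] += score                            # the score only enters at the final response token

pg_loss = torch.tensor(0.05)                       # stand-in for ppo/loss/policy
vf_loss = torch.tensor(0.30)                       # stand-in for ppo/loss/value
total_loss = pg_loss + vf_coef * vf_loss           # my reading: the value term is scaled by vf_coef,
                                                   # which might be why total != policy + value exactly

print(per_token_kl.mean(), non_score_reward.mean(), total_loss)
```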
I'm attaching stats_loss.csv for further analysis of the graph and losses. The machine_ranks column indicates the rank of each machine; I use four GPUs, each running the same rl_training.py script but with my custom reward model. File: https://drive.google.com/file/d/1To8POOxcxmHcw298S7M24_ql8f6lyOGV/view?usp=sharing
stats_dict is the output of log_stats, with numpy arrays converted to lists for serialization purposes.
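The conversion is just a recursive numpy-to-list pass, roughly along these lines (an illustrative helper, not the exact code I used):

```python
import json
import numpy as np

def make_serializable(obj):
    # Recursively convert numpy arrays/scalars so json.dump accepts the stats dict.
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, (np.floating, np.integer)):
        return obj.item()
    if isinstance(obj, dict):
        return {k: make_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [make_serializable(v) for v in obj]
    return obj

stats = {"objective/kl": np.float32(10.2), "ppo/loss/total": np.array([-0.03, 0.01])}
with open("stats_dict.json", "w") as f:
    json.dump(make_serializable(stats), f, indent=2)
```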
Thank you! Looking forward to any suggestions!
viethoangtranduong changed the title from "PPOTrainer: PPO loss coming down, but objective/kl increases a lot, and mean_rewards decreases" to "PPOTrainer issue: PPO loss seems coming down, but objective/kl increases a lot, and mean_rewards decreases" on Dec 15, 2023.
Hi @viethoangtranduong, in general things look OK (besides the reward not going up): the KL can sit at ~10 while the loss goes down or hovers around a low value. So I would do the following:
- start with the example script at examples/scripts/ppo.py with the default parameters
- then customize it step by step to your needs: add your model, your reward model, etc. (a minimal starting configuration is sketched below)
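For instance, a starting configuration in the spirit of that example script could look like the sketch below (parameter names follow the TRL PPOConfig of that period; the example script itself is the reference for the actual default values):

```python
from trl import PPOConfig

# Start close to the library defaults, then swap in your own model / reward model.
# Values here are illustrative; check examples/scripts/ppo.py for the real ones.
config = PPOConfig(
    model_name="gpt2",      # replace with your SFT model later
    learning_rate=1.41e-5,
    batch_size=128,
    mini_batch_size=128,
    ppo_epochs=4,
    init_kl_coef=0.2,       # starting KL penalty coefficient
    target=6.0,             # target KL for the adaptive controller
    adap_kl_ctrl=True,      # let the controller raise/lower the KL coefficient
)
```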
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.