Update how_to_train.md (#1003)
* Update how_to_train.md

fix description about `min_new_tokens`

* Update docs/source/how_to_train.md

Co-authored-by: Costa Huang <[email protected]>

---------

Co-authored-by: Costa Huang <[email protected]>
halfrot and vwxyzjn authored Nov 20, 2023
1 parent 28bdb6a commit e5eb4db
Showing 1 changed file with 3 additions and 3 deletions.
docs/source/how_to_train.md (6 changes: 3 additions & 3 deletions)
@@ -29,8 +29,8 @@ To address this issue, we add a penalty to the reward function based on the KL divergence…
If you generate text by purely sampling from the model distribution, things generally work fine. But when you use the `generate` method there are a few caveats, because depending on the settings it does not always purely sample, which can cause the KL divergence to go negative. Essentially, whenever the active model achieves `log_p_token_active < log_p_token_ref` for a sampled token, we get a negative KL divergence. This can happen in several cases (a short numerical sketch follows the list):

- **top-k sampling**: the model can smooth out the probability distribution, causing the top-k tokens to have a smaller probability than under the reference model while still being selected
- **min_length**: this ignores the EOS token until `min_length` is reached, thus the model can assign a very low log prob to the EOS token and very high probs to all others until `min_length` is reached
- **batched generation**: finished sequences in a batch are padded until all generations are finished. The model can learn to assign very low probabilities to the padding tokens unless they are properly masked or removed.
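
As a minimal sketch (not TRL's actual implementation) of the per-token estimate `log_p_token_active - log_p_token_ref`, and of how an EOS-suppressing setting like `min_length` can drive it negative; the vocabulary size, random logits, and EOS id below are made up for illustration:

```python
import torch

torch.manual_seed(0)
vocab_size = 8  # toy vocabulary for the example

# Hypothetical next-token logits for one position (made up for illustration).
logits_active = torch.randn(vocab_size)
logits_ref = torch.randn(vocab_size)

log_p_active = torch.log_softmax(logits_active, dim=-1)
log_p_ref = torch.log_softmax(logits_ref, dim=-1)

# Pure sampling from the active model: the per-token KL estimate
# log_p_active[token] - log_p_ref[token] is non-negative in expectation.
token = torch.multinomial(log_p_active.exp(), num_samples=1).item()
print("pure sampling :", (log_p_active[token] - log_p_ref[token]).item())

# Emulate a `min_length`-style constraint: the EOS token (id 0 here, an
# assumption) is masked out during generation, so tokens the active model
# assigns little probability to can still be selected, which can push
# log_p_active below log_p_ref and make the KL estimate negative.
masked_logits = logits_active.clone()
masked_logits[0] = float("-inf")  # EOS suppressed until min_length is reached
forced = torch.multinomial(torch.softmax(masked_logits, dim=-1), num_samples=1).item()
print("EOS suppressed:", (log_p_active[forced] - log_p_ref[forced]).item())
```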

These are just a few examples. Why is negative KL an issue? The total reward `R` is computed as `R = r - beta * KL`, so if the model learns how to drive the KL divergence negative it effectively receives a positive reward. In many cases it can be much easier to exploit such a bug in the generation than to actually learn the reward function. In addition, the KL term can become arbitrarily negative, so the actual reward `r` can be very small compared to it.
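
For concreteness, a tiny made-up calculation of `R = r - beta * KL` (the numbers and the `beta` value are invented for illustration, not TRL defaults):

```python
beta = 0.1   # KL penalty coefficient (made-up value)
r = 0.02     # reward-model score for the generation
kl = -5.0    # strongly negative (exploited) KL estimate
R = r - beta * kl
print(R)     # 0.52: positive total reward without actually pleasing the reward model
```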

@@ -63,4 +63,4 @@ Debugging the RL pipeline can be challenging due to its complexity. Here are some tips:
- **Inspect the generations**: It's always a good idea to inspect what the model is generating. Maybe there is a bug in your post-processing or your prompt, or bad settings cause generations to be cut off too soon. These things are very hard to see in the metrics but very obvious if you look at the generations.
- **Inspect the reward model**: If your reward is not improving over time, maybe there's an issue with the reward model. You can look at extreme cases to see if it does what it should: e.g. in the sentiment case you can check whether simple positive and negative examples really get different rewards. You can also look at the distribution of your dataset. Finally, maybe the reward is dominated by the query, which the model can't affect, so you might need to normalize it (e.g. reward of query+response minus reward of the query; see the sketch after this list).
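
A short sketch of that query-normalization idea; the sentiment pipeline used as a stand-in reward model and the example texts are assumptions for illustration, not part of the original doc:

```python
# Sketch of "reward of query+response minus reward of the query".
from transformers import pipeline

# Sentiment classifier used as a stand-in reward model (an assumption for this example).
reward_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

def reward(text: str) -> float:
    # Use the positive-class score as a scalar reward.
    scores = reward_pipe(text, top_k=None)
    return next(s["score"] for s in scores if s["label"] == "POSITIVE")

query = "The movie was"
response = " absolutely wonderful and I loved every minute of it."

# Subtract the query-only reward so the model is only credited for its own tokens.
normalized_reward = reward(query + response) - reward(query)
print(normalized_reward)
```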

These are just a few tips that we find helpful - if you have more useful tricks feel free to open a PR to add them as well!
