- Rewards are sparse in our case, so we find it difficult to see how we can use this.
- I imagined fine-tuning the language model to simply predict responses given the metadata placed after the prompt, but the sparse labels may prevent forming a sum of future rewards over multiple responses; a rough sketch of that setup follows below.
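A minimal sketch of that reward-conditioned ("Decision Transformer"-style) fine-tuning idea, assuming a Hugging Face causal LM; the model name, the `<reward=...>` serialization, and the toy data are illustrative assumptions, not anything from this thread:

```python
# Sketch: serialize the reward metadata into the prompt and fine-tune a causal
# LM to predict the response that follows it. At inference time we condition
# on a high target reward to steer generation. All names/data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def build_example(prompt: str, response: str, reward: float) -> dict:
    """Serialize (prompt, reward, response) into one training sequence."""
    text = f"{prompt}\n<reward={reward:.1f}>\n{response}{tokenizer.eos_token}"
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Standard causal-LM objective: labels are the input ids themselves.
    # Masking out the prompt tokens in the loss is optional.
    enc["labels"] = enc["input_ids"].clone()
    return enc

# Toy training step (illustrative only; real data would be batched).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
batch = build_example(
    prompt="Summarize: The meeting was moved to Friday.",
    response="Meeting rescheduled to Friday.",
    reward=1.0,  # sparse reward observed for this response
)
loss = model(**batch).loss
loss.backward()
optimizer.step()

# At inference time, condition on the highest reward seen in training.
query = "Summarize: The launch slipped by two weeks.\n<reward=1.0>\n"
inputs = tokenizer(query, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

With truly sparse rewards there may be only a single terminal reward per episode, in which case the "sum of future reward" collapses to that one value and the conditioning token carries very little signal.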
- I would like to see whether the PPO method is better than the Decision Transformer approach at learning to maximize reward.