RSO (Statistical Rejection Sampling Improves Preference Optimization) #902

Closed · wants to merge 4 commits

Conversation

@gaetanlop (Contributor) commented Oct 22, 2023

Implementation of *Statistical Rejection Sampling Improves Preference Optimization*.

Responds to #816.

  1. To do:
  • Add a function to generate completions from the SFT model
  • Add a function to score the generated completions with the reward model
  • Add an example script showing how to perform statistical rejection sampling
  • Implement the two ranking methods introduced in the paper, i.e. tournament ranking and first-round ranking
  2. To be discussed:
  • I have added `generate` and `score` functions to `trainer.utils`, since they will also be useful for the RAFT and ReST implementations ([WIP] Reward ranked finetuning (RAFT) and Reinforced Self-Training (ReST) #704). They could also live in a `utils.py` script inside the examples folder. What do you think?
  • Same question for the `conduct_rejection_sampling` function (a rough sketch of the sampling step follows after this list).
  • RSO leaves the DPO loss function nearly unchanged and mainly introduces a method for generating preference pairs from the target optimal policy, so the training script is the same as `dpo.py`. I don't think adding another, nearly identical script is useful. What's your opinion?
  • I'm not sure where the RSO example scripts should live.
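
To make the rejection-sampling step concrete, here is a rough sketch of what `conduct_rejection_sampling` could look like, following Algorithm 1 of the paper; the signature and names here are illustrative and may differ from what ends up in the PR:

```python
import math
import random

def conduct_rejection_sampling(candidates, rewards, num_samples, beta):
    """Keep up to `num_samples` responses from `candidates` so that the kept set
    approximates samples from the optimal policy
    pi*(y|x) ~ pi_sft(y|x) * exp(r(x, y) / beta).

    candidates: responses generated by the SFT model for one prompt
    rewards:    reward-model scores, one per candidate
    beta:       reward temperature; smaller values select more aggressively
    """
    remaining = list(range(len(candidates)))  # indices still in the pool
    accepted = []
    while len(accepted) < num_samples and remaining:
        max_reward = max(rewards[i] for i in remaining)
        still_remaining = []
        for i in remaining:
            # Accept with probability exp((r_i - max_r) / beta); the current best
            # candidate is accepted with probability 1, so the loop always makes progress.
            if random.random() < math.exp((rewards[i] - max_reward) / beta):
                accepted.append(candidates[i])
                if len(accepted) == num_samples:
                    break
            else:
                still_remaining.append(i)
        remaining = still_remaining
    return accepted
```

The accepted responses would then be turned into preference pairs with one of the two ranking methods listed above (tournament or first-round ranking).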

cc @kashif @younesbelkada @lvwerra @philschmid

gaetanlop marked this pull request as draft October 22, 2023 00:27
gaetanlop marked this pull request as ready for review October 22, 2023 22:10
gaetanlop changed the title from "[WIP] RSO (Statistical Rejection Sampling Improves Preference Optimization)" to "RSO (Statistical Rejection Sampling Improves Preference Optimization)" Oct 22, 2023
@kashif (Collaborator) commented Oct 23, 2023

  • Thanks @gaetanlop. I added the `"hinge"` loss function, which is what the RSO authors also use, right?
  • I am also refactoring the DPO code a bit so that the reference-model log-prob calculation is done externally to the trainer, as a dataset mapper (see the sketch after this list).
  • I had some questions about the seq2seq models for these methods (asked in the merged seq2seq PR).
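
Roughly, the dataset-mapper idea looks like the sketch below (column names, the helper function, and the use of `gpt2` are placeholders here, not the actual refactor):

```python
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and dataset; in the real refactor these would be the frozen
# reference model and the full preference dataset.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(prompt, completion):
    """Sum of log-probs the reference model assigns to `completion` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = ref_model(full_ids).logits
    # Logits at position t predict token t+1; keep only the completion tokens.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logprobs[:, prompt_len - 1 :].sum().item()

def add_reference_logps(example):
    example["reference_chosen_logps"] = sequence_logprob(example["prompt"], example["chosen"])
    example["reference_rejected_logps"] = sequence_logprob(example["prompt"], example["rejected"])
    return example

dataset = Dataset.from_dict(
    {"prompt": ["2 + 2 ="], "chosen": [" 4"], "rejected": [" 5"]}
)
# Reference log-probs are computed once, outside the trainer.
dataset = dataset.map(add_reference_logps)
```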

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

gaetanlop marked this pull request as draft October 23, 2023 17:52
@gaetanlop (Contributor, Author)

Hello @kashif, thanks for looking at this. It looks like the RSO authors experimented with both the sigmoid (used in DPO) and hinge (used in SLiC) loss functions. Could you send the link to where you posted the questions about the seq2seq models, please? I cannot find it.
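
For reference, the two variants differ only in how the DPO-style margin between the policy and reference log-ratios is penalized. A minimal sketch (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    reference_chosen_logps, reference_rejected_logps,
                    beta=0.1, loss_type="sigmoid"):
    # Margin between the policy and reference log-ratios of chosen vs. rejected.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    logits = pi_logratios - ref_logratios

    if loss_type == "sigmoid":
        # DPO: logistic loss on the margin.
        losses = -F.logsigmoid(beta * logits)
    elif loss_type == "hinge":
        # SLiC-style hinge loss on the margin.
        losses = torch.relu(1 - beta * logits)
    else:
        raise ValueError(f"Unknown loss_type: {loss_type}")
    return losses.mean()
```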


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@lvwerra (Member) commented Nov 24, 2023

Stay calm stale-bot :D

@hendrydong

Looks cool!

Are there any recent updates?


github-actions bot commented Jan 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions bot closed this Jan 16, 2024