RSO (Statistical Rejection Sampling Improves Preference Optimization) #902

Closed · wants to merge 4 commits

Conversation

@gaetanlop (Contributor) commented Oct 22, 2023

Implementation of *Statistical Rejection Sampling Improves Preference Optimization*.

Responds to #816.

  1. To do:
  • Add a function to generate completions from the SFT model
  • Add a function to score the generated completions with the reward model
  • Add an example script showing how to perform statistical rejection sampling
  • Implement the two ranking methods introduced in the paper, i.e. tournament ranking and first-round ranking
  2. To be discussed:
  • I have added `generate` and `score` functions to `trainer.utils`, since they will also be useful for the RAFT and ReST implementations ([WIP] Reward ranked finetuning (RAFT) and Reinforced Self-Training (ReST) #704). They could also live in a `utils.py` script inside the examples folder. What do you think?
  • Same question for the `conduct_rejection_sampling` function (a rough sketch of the sampling step follows after this list).
  • RSO leaves the DPO loss function nearly unchanged and mainly introduces a method for generating preference pairs from the target optimal policy, so the training script is the same as `dpo.py`. I don't think adding another, nearly identical script is useful. What's your opinion?
  • I'm not sure where the RSO example scripts should live.
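
To make the rejection-sampling step concrete, here is a rough sketch of what `conduct_rejection_sampling` could look like, following Algorithm 1 of the paper; the signature and names here are illustrative and may differ from what ends up in the PR:

```python
import math
import random

def conduct_rejection_sampling(candidates, rewards, num_samples, beta):
    """Keep up to `num_samples` responses from `candidates` so that the kept set
    approximates samples from the optimal policy
    pi*(y|x) ~ pi_sft(y|x) * exp(r(x, y) / beta).

    candidates: responses generated by the SFT model for one prompt
    rewards:    reward-model scores, one per candidate
    beta:       reward temperature; smaller values select more aggressively
    """
    remaining = list(range(len(candidates)))  # indices still in the pool
    accepted = []
    while len(accepted) < num_samples and remaining:
        max_reward = max(rewards[i] for i in remaining)
        still_remaining = []
        for i in remaining:
            # Accept with probability exp((r_i - max_r) / beta); the current best
            # candidate is accepted with probability 1, so the loop always makes progress.
            if random.random() < math.exp((rewards[i] - max_reward) / beta):
                accepted.append(candidates[i])
                if len(accepted) == num_samples:
                    break
            else:
                still_remaining.append(i)
        remaining = still_remaining
    return accepted
```

The accepted responses would then be turned into preference pairs with one of the two ranking methods listed above (tournament or first-round ranking).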

cc @kashif @younesbelkada @lvwerra @philschmid

gaetanlop marked this pull request as draft October 22, 2023 00:27
gaetanlop marked this pull request as ready for review October 22, 2023 22:10
gaetanlop changed the title from "[WIP] RSO (Statistical Rejection Sampling Improves Preference Optimization)" to "RSO (Statistical Rejection Sampling Improves Preference Optimization)" Oct 22, 2023
@kashif (Collaborator) commented Oct 23, 2023

  • Thanks @gaetanlop. I added the `"hinge"` loss function, which is what the RSO authors also use, right?
  • I am also refactoring the DPO code a bit so that the reference-model log-prob calculation is done externally to the trainer, as a dataset mapper (see the sketch after this list).
  • I had some questions about the seq2seq models for these methods (asked in the merged seq2seq PR).
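
Roughly, the dataset-mapper idea looks like the sketch below (column names, the helper function, and the use of `gpt2` are placeholders here, not the actual refactor):

```python
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and dataset; in the real refactor these would be the frozen
# reference model and the full preference dataset.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(prompt, completion):
    """Sum of log-probs the reference model assigns to `completion` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = ref_model(full_ids).logits
    # Logits at position t predict token t+1; keep only the completion tokens.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logprobs[:, prompt_len - 1 :].sum().item()

def add_reference_logps(example):
    example["reference_chosen_logps"] = sequence_logprob(example["prompt"], example["chosen"])
    example["reference_rejected_logps"] = sequence_logprob(example["prompt"], example["rejected"])
    return example

dataset = Dataset.from_dict(
    {"prompt": ["2 + 2 ="], "chosen": [" 4"], "rejected": [" 5"]}
)
# Reference log-probs are computed once, outside the trainer.
dataset = dataset.map(add_reference_logps)
```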

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

gaetanlop marked this pull request as draft October 23, 2023 17:52
@gaetanlop (Contributor, Author)

Hello @kashif, thanks for looking at this. It looks like the RSO authors experimented with both the sigmoid (used in DPO) and hinge (used in SLiC) loss functions. Could you send the link to where you posted the questions about the seq2seq models, please? I cannot find it.
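
For reference, the two variants differ only in how the DPO-style margin between the policy and reference log-ratios is penalized. A minimal sketch (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    reference_chosen_logps, reference_rejected_logps,
                    beta=0.1, loss_type="sigmoid"):
    # Margin between the policy and reference log-ratios of chosen vs. rejected.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    logits = pi_logratios - ref_logratios

    if loss_type == "sigmoid":
        # DPO: logistic loss on the margin.
        losses = -F.logsigmoid(beta * logits)
    elif loss_type == "hinge":
        # SLiC-style hinge loss on the margin.
        losses = torch.relu(1 - beta * logits)
    else:
        raise ValueError(f"Unknown loss_type: {loss_type}")
    return losses.mean()
```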


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@lvwerra (Member) commented Nov 24, 2023

Stay calm stale-bot :D

@hendrydong

Looks cool!

Are there any recent updates?


github-actions bot commented Jan 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions bot closed this Jan 16, 2024