
RefDPO: Understanding Reference Policies in Direct Preference Optimization

The repository contains the training scripts, datasets, and model checkpoints for the paper "Understanding Reference Policies in Direct Preference Optimization".

Quick Links

  • Installation
  • Running the code
  • Code structure
  • Resource requirements
  • Datasets
  • Experimental Results

Installation

Our codebase builds on Hugging Face's Transformers library, DeepSpeed, and PyTorch. To install the required dependencies, run the following command:

pip install -r requirements.txt

Our codebase is adapted from the open-instruct repository.

Running the code

To run the code, use the following command (it assumes 8 GPUs are available):

accelerate launch \
    --mixed_precision bf16 \
    --num_machines 1 \
    --num_processes 8 \
    --use_deepspeed \
    --deepspeed_config_file deepspeed.conf \
    main.py \
    --cuda \
    --dataset 'yale-nlp/RefDPO' \
    --data_split 'mistral' \
    --epoch 3 \
    --beta 0.1 \
    --dpo_weight 1.0 \
    --model_type 'HuggingFaceH4/mistral-7b-sft-beta' \
    --insert_eos \
    -l

Each argument is explained in the main.py file.

To run the training without the reference model, use the --ref_free flag (note that this example uses a much larger $\beta$ than the standard DPO run above):

accelerate launch \
    --mixed_precision bf16 \
    --num_machines 1 \
    --num_processes 8 \
    --use_deepspeed \
    --deepspeed_config_file deepspeed.conf \
    main.py \
    --cuda \
    --dataset 'yale-nlp/RefDPO' \
    --data_split 'mistral' \
    --epoch 3 \
    --beta 10.0 \
    --dpo_weight 1.0 \
    --ref_free \
    --model_type 'HuggingFaceH4/mistral-7b-sft-beta' \
    --insert_eos \
    -l

Code structure

  • main.py: The main file to run the code.
  • data_utils.py: The data processing utilities.
  • utils.py: Utility functions.
  • deepspeed.conf: The deepspeed configuration file.
  • dpo_utils.py: The DPO utilities.

Resource requirements

Our training requires 8 GPUs with 48 GB of memory each. We use the DeepSpeed library to distribute training across multiple GPUs.
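For illustration, below is a hypothetical DeepSpeed configuration in the spirit of the repository's deepspeed.conf (the shipped file is authoritative). ZeRO stage-2 sharding with bf16 is one common way to fit 7B-scale training on GPUs of this size; all keys are standard DeepSpeed options, but the specific values here are our assumption:

```python
import json

# Hypothetical example only; use the repository's own deepspeed.conf
# for actual runs.
ds_config = {
    "bf16": {"enabled": True},                 # matches --mixed_precision bf16
    "zero_optimization": {"stage": 2},         # shard optimizer states and gradients
    "train_micro_batch_size_per_gpu": "auto",  # filled in by accelerate
    "gradient_accumulation_steps": "auto",
}

with open("deepspeed.conf", "w") as f:
    json.dump(ds_config, f, indent=2)
```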

Datasets

We have made the datasets used in the paper available on the Hugging Face dataset hub: yale-nlp/RefDPO. It contains 5 different dataset splits. Each split is built upon the UltraFeedback dataset, specifically its binarized version ultrafeedback_binarized_cleaned, converted from ultrafeedback_binarized. The splits contain pre-computed log-probabilities of the corresponding reference policy/model for the output pairs in the UltraFeedback dataset.

| Dataset | Reference Model | Description |
| --- | --- | --- |
| mistral | HuggingFaceH4/mistral-7b-sft-beta | The log-probabilities are computed using the Mistral-7B-SFT model. |
| tulu2 | allenai/tulu-2-7b | The log-probabilities are computed using the Tulu-2-7B model. |
| mistral_prior | HuggingFaceH4/mistral-7b-sft-beta | The prior (unconditional) log-probabilities are computed using the Mistral-7B-SFT model. |
| mistralv2 | mistralai/Mistral-7B-Instruct-v0.2 | The log-probabilities are computed using the Mistral-7B-Instruct-v0.2 model. |
| llama3 | meta-llama/Meta-Llama-3-70B-Instruct | The log-probabilities are computed using the Meta-Llama-3-70B-Instruct model. |
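As a quick illustration, a split can be loaded with the datasets library. The sketch below is an assumption-laden example: the split name follows the table above, but the column names are not documented here, so inspect the schema before relying on it.

```python
from datasets import load_dataset

# Load the split whose reference log-probabilities were pre-computed
# with HuggingFaceH4/mistral-7b-sft-beta (see the table above).
ds = load_dataset("yale-nlp/RefDPO", split="mistral")

print(ds.column_names)  # inspect the actual fields
print(ds[0])            # one preference pair with its cached log-probabilities
```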

Experimental Results

Below are the checkpoints of the models trained in the paper.

RQ1: What Is the Optimal KL Constraint Strength for DPO?

Below are model checkpoints fine-tuned with DPO from mistral-7b-sft and tulu-2-7b using different KL constraint strengths ($\beta$).

The checkpoints are available on the Hugging Face model hub. They are evaluated using the length-controlled AlpacaEval2 score (LC-Score).
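For context, $\beta$ sets the scale of DPO's implicit reward and thus the strength of the KL constraint toward the reference policy. Below is a minimal sketch of the standard DPO loss showing where $\beta$ enters; it is illustrative only (the function name and signature are our own, and the repository's dpo_utils.py is authoritative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss. A larger beta keeps the policy closer to the
    reference; a smaller beta loosens the KL constraint."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference loss over the implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```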

Checkpoints fine-tuned from mistral-7b-sft

| $\beta$ | HF Checkpoint | AlpacaEval2 LC-Score |
| --- | --- | --- |
| 0.1 | yale-nlp/mistral-7b-dpo-beta-0.1 | 14.03 |
| 0.05 | yale-nlp/mistral-7b-dpo-beta-0.05 | 13.29 |
| 0.02 | yale-nlp/mistral-7b-dpo-beta-0.02 | 16.06 |
| 0.01 | yale-nlp/mistral-7b-dpo-beta-0.01 | 16.25 |
| 0.005 | yale-nlp/mistral-7b-dpo-beta-0.005 | 12.36 |

Checkpoints fine-tuned from tulu-2-7b

| $\beta$ | HF Checkpoint | AlpacaEval2 LC-Score |
| --- | --- | --- |
| 0.1 | yale-nlp/tulu2-7b-dpo-beta-0.1 | 9.38 |
| 0.05 | yale-nlp/tulu2-7b-dpo-beta-0.05 | 9.96 |
| 0.02 | yale-nlp/tulu2-7b-dpo-beta-0.02 | 10.46 |
| 0.01 | yale-nlp/tulu2-7b-dpo-beta-0.01 | 7.86 |
| 0.005 | yale-nlp/tulu2-7b-dpo-beta-0.005 | [degenerate] |

RQ2: Is a Reference Policy Necessary for Effective Preference Learning?

Below are the best-performing checkpoints fine-tuned from mistral-7b-sft under three different reward parameterizations (each at its best $\beta$).

| Reward Parameterization | HF Checkpoint | AlpacaEval2 LC-Score | $\beta$ |
| --- | --- | --- | --- |
| $\beta \cdot \log \frac{p_\theta(y \mid x)}{p_{\mathrm{ref}}(y \mid x)}$ (DPO) | yale-nlp/mistral-7b-dpo-beta-0.01 | 16.25 | 0.01 |
| $\beta \cdot \log p_\theta(y \mid x)$ (Posterior Probability) | yale-nlp/mistral-probability | 12.84 | 100.0 |
| $\beta \cdot \log p_\theta(x \mid y)$ (Likelihood Function) | yale-nlp/mistral-likelihood | 13.63 | 0.01 |
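All three parameterizations plug their reward $r$ into the same Bradley-Terry preference objective; written out (our reconstruction, following the standard DPO derivation):

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(r(x, y_w) - r(x, y_l)\big)\Big]$$

where $y_w$ and $y_l$ are the chosen and rejected outputs and $\sigma$ is the sigmoid function; only the definition of $r$ differs across the three rows above.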

RQ3: Does DPO Benefit from Stronger Reference Policies?

Here we use two stronger reference models, Mistral-7B-Instruct-v0.2 and Meta-Llama-3-70B-Instruct, for DPO training.

Checkpoints fine-tuned from mistral-7b-sft using Mistral-7B-Instruct-v0.2 as the reference policy.

| $\beta$ | HF Checkpoint | AlpacaEval2 LC-Score |
| --- | --- | --- |
| 10.0 | yale-nlp/mistral-7b-dpo-mistralv2-7b-beta-10.0 | 18.74 |
| 1.00 | yale-nlp/mistral-7b-dpo-mistralv2-7b-beta-1.0 | 20.25 |
| 0.10 | yale-nlp/mistral-7b-dpo-mistralv2-7b-beta-0.1 | 19.58 |
| 0.01 | yale-nlp/mistral-7b-dpo-mistralv2-7b-beta-0.01 | 17.18 |
| 0.005 | yale-nlp/mistral-7b-dpo-mistralv2-7b-beta-0.005 | 15.34 |

Checkpoints fine-tuned from mistral-7b-sft using Meta-Llama-3-70B-Instruct as the reference policy.

| $\beta$ | HF Checkpoint | AlpacaEval2 LC-Score |
| --- | --- | --- |
| 10.0 | yale-nlp/mistral-7b-dpo-llama3-70b-beta-10.0 | 13.29 |
| 1.00 | yale-nlp/mistral-7b-dpo-llama3-70b-beta-1.0 | 9.59 |
| 0.10 | yale-nlp/mistral-7b-dpo-llama3-70b-beta-0.1 | 10.99 |
| 0.01 | yale-nlp/mistral-7b-dpo-llama3-70b-beta-0.01 | 15.37 |
| 0.005 | yale-nlp/mistral-7b-dpo-llama3-70b-beta-0.005 | 11.70 |

Checkpoints fine-tuned from tulu-2-7b using Mistral-7B-Instruct-v0.2 as the reference policy.

| $\beta$ | HF Checkpoint | AlpacaEval2 LC-Score |
| --- | --- | --- |
| 10.0 | yale-nlp/tulu2-7b-dpo-mistralv2-7b-beta-10.0 | 7.61 |
| 1.00 | yale-nlp/tulu2-7b-dpo-mistralv2-7b-beta-1.0 | 7.85 |
| 0.10 | yale-nlp/tulu2-7b-dpo-mistralv2-7b-beta-0.1 | [degenerate] |
| 0.01 | yale-nlp/tulu2-7b-dpo-mistralv2-7b-beta-0.01 | [degenerate] |
| 0.005 | yale-nlp/tulu2-7b-dpo-mistralv2-7b-beta-0.005 | [degenerate] |

Checkpoints fine-tuned from tulu-2-7b using Meta-Llama-3-70B-Instruct as the reference policy.

| $\beta$ | HF Checkpoint | AlpacaEval2 LC-Score |
| --- | --- | --- |
| 10.0 | yale-nlp/tulu2-7b-dpo-llama3-70b-beta-10.0 | 9.79 |
| 1.00 | yale-nlp/tulu2-7b-dpo-llama3-70b-beta-1.0 | 11.17 |
| 0.10 | yale-nlp/tulu2-7b-dpo-llama3-70b-beta-0.1 | 10.31 |
| 0.01 | yale-nlp/tulu2-7b-dpo-llama3-70b-beta-0.01 | 9.16 |
| 0.005 | yale-nlp/tulu2-7b-dpo-llama3-70b-beta-0.005 | 3.29 |
