DPO did not achieve the expected experimental effect #56

Open
Vance0124 opened this issue Dec 7, 2023 · 2 comments

@Vance0124

I replicated the pythia28 experiments on hh (Anthropic/hh-rlhf) using the open-source code. Here are some of my experimental results:

SFT1:

python -u train.py exp_name=sft gradient_accumulation_steps=1 batch_size=4 eval_batch_size=16 model.policy_dtype=float32

i.e., with batch_size=4.
I then evaluated the model using GPT-4:
[screenshot: sft3 — GPT-4 evaluation of SFT1]
The result does not look very promising.
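For context, my GPT-4 evaluation is a pairwise comparison between the model's samples and the chosen responses from the test split, roughly along the lines of the sketch below. The judge prompt and client setup here are simplified placeholders, not code from this repo:

```python
# Simplified sketch of the pairwise GPT-4 judging I used (placeholder prompt;
# in practice the two responses should also be shuffled to avoid position bias).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """For the following query to a chatbot, which response is more helpful?

Query: {prompt}

Response A: {a}

Response B: {b}

Answer with a single letter: "A" or "B"."""

def gpt4_prefers_a(prompt: str, a: str, b: str) -> bool:
    """Ask GPT-4 which of two responses to the same prompt it prefers."""
    out = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b)}],
    )
    return out.choices[0].message.content.strip().upper().startswith("A")

def win_rate(pairs):
    """pairs: iterable of (prompt, model_sample, reference_response) triples."""
    pairs = list(pairs)
    wins = sum(gpt4_prefers_a(p, sample, ref) for p, sample, ref in pairs)
    return wins / len(pairs)
```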
I also ran three other SFT variants:

SFT2:

python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28_sft gradient_accumulation_steps=8 batch_size=64 eval_batch_size=32 trainer=TensorParallelTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.policy_dtype=float32

The result:
[screenshot: sft4 — GPT-4 evaluation of SFT2]
This also does not look good.

SFT3 (bfloat16):

python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28_sft gradient_accumulation_steps=8 batch_size=64 eval_batch_size=16 trainer=TensorParallelTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.policy_dtype=bfloat16

The result:
[screenshot: sft1 — GPT-4 evaluation of SFT3 (bfloat16)]

SFT4 (bfloat16):

python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28_sft gradient_accumulation_steps=8 batch_size=64 eval_batch_size=16 trainer=BasicTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.policy_dtype=bfloat16

The result:
[screenshot: sft2 — GPT-4 evaluation of SFT4 (bfloat16)]

The evaluations of SFT3 (bfloat16) and SFT4 (bfloat16) seem to be even worse.

Based on SFT1 (which I think is probably the best of these), I trained DPO1 with the following command:

python -u train.py loss=dpo loss.beta=0.1 exp_name=DPO_pythia28 gradient_accumulation_steps=1 batch_size=4 eval_batch_size=16 model.policy_dtype=float32 model.archive=.cache/yanxue/DPO_SFT_2023-11-25_20-19-34_103923/LATEST/policy.pt 

The result:
[screenshot: DPO1 — GPT-4 win rates]
The highest win rate is only about 50%, but I don't know what I did wrong or what I missed.
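For reference, my understanding is that loss=dpo with loss.beta=0.1 optimizes the standard DPO objective. A minimal sketch of that loss (not the repo's exact implementation):

```python
# Minimal sketch of the objective that loss=dpo loss.beta=0.1 optimizes
# (the standard DPO loss; not copied from the repo's trainers).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a batch of summed response log-probs under the given model."""
    # Log-ratios of the policy against the frozen SFT reference model.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): minimized when the policy raises the likelihood
    # of chosen responses relative to rejected ones (measured against the reference).
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```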

I also ran another DPO experiment, based on SFT2.
DPO2:

python -u train.py model=pythia28 datasets=[hh] loss=dpo loss.beta=0.1 exp_name=anthropic_dpo_pythia28_dpo model.policy_dtype=bfloat16 gradient_accumulation_steps=16 batch_size=64 eval_batch_size=16 trainer=TensorParallelTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.archive=.cache/root/anthropic_dpo_pythia28_sft_2023-12-03_15-39-23_344965/LATEST/policy.pt

Compared to the training results in the public WandB runs, my own runs did not meet expectations.

The "rewards_train/accuracies" curve from the public WandB runs:
[screenshot: wandb_train]
versus my own "rewards_train/accuracies":
[screenshot: result_train]
The public WandB runs reach more than 70% train accuracy, while mine only reaches around 60% at most.
The evaluation results show the same pattern.
"rewards_eval/accuracies" from the public WandB runs:
[screenshot: wandb_eval]
versus my own "rewards_eval/accuracies":
[screenshot: result_eval_dpo2]
The eval accuracies also show a gap of close to 10 percentage points.
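For clarity on what I am comparing: my understanding is that these accuracies measure how often the implicit DPO reward of the chosen response exceeds that of the rejected one. A minimal sketch of that computation (my assumption, not the repo's exact logging code):

```python
# Sketch of how I understand rewards/accuracies is computed: the implicit DPO
# reward is beta * (policy logp - reference logp) for each response, and the
# accuracy is the fraction of pairs where the chosen reward beats the rejected one.
import torch

def reward_accuracy(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return (chosen_rewards > rejected_rewards).float().mean()
```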
All other parameters use the defaults (e.g. lr=5e-7).
I'm not sure whether I made a mistake somewhere or what needs to change to close this gap. Any help would be sincerely appreciated.

@Vance0124
Author

My main questions are:

  1. Is there anything wrong with how I trained my SFT model (float32)?
  2. Is it normal for the evaluation performance of the bfloat16 SFT model to be much worse than that of the float32 SFT model, and is there a way to compensate for it? Likewise, if a DPO model trained on top of such a bfloat16 SFT model performs poorly, is there a way to remedy that? (My understanding of what model.policy_dtype changes is sketched below.)
  3. Is there any method or parameter adjustment that would close the gap between my DPO results and the published ones?
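For question 2, this is my assumption about what model.policy_dtype does (a sketch, not the repo's actual loading code): it sets the torch dtype the policy weights are loaded and trained in.

```python
# My assumption (not the repo's exact code) about what model.policy_dtype does:
# it sets the torch dtype the policy weights are loaded and trained in, so
# bfloat16 halves memory at the cost of precision during pure-bf16 training.
import torch
import transformers

def load_policy(name: str = "EleutherAI/pythia-2.8b", policy_dtype: str = "float32"):
    dtype = getattr(torch, policy_dtype)  # torch.float32 or torch.bfloat16
    return transformers.AutoModelForCausalLM.from_pretrained(name, torch_dtype=dtype)
```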

@eric-mitchell
Owner

  1. Changing model.fsdp_policy_mp when not using the FSDPTrainer will have no effect (see the sketch below). The reference wandb run used the command:
train.py model=pythia28 datasets=[hh] loss=sft exp_name=pythia28_hh_sft_bf16 gradient_accumulation_steps=2 batch_size=64 n_epochs=1 eval_batch_size=32 trainer=FSDPTrainer eval_every=5000 sample_during_eval=false model.fsdp_policy_mp=bfloat16
  2. It's possible this has to do with an interaction between TensorParallelTrainer and reduced precision. Can you try again with the FSDPTrainer?
  3. Can you try again with the exact commands used in the demo runs? https://wandb.ai/eric_anthony_mitchell/dpo-demos
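For reference, fsdp_policy_mp configures FSDP's mixed-precision policy, which is only applied when the policy is wrapped by the FSDPTrainer. A rough sketch of the mechanism (not the repo's exact wrapping code):

```python
# Rough sketch (not the repo's exact wrapping code) of why fsdp_policy_mp only
# matters with the FSDPTrainer: it configures FSDP's MixedPrecision policy,
# which is only applied when the policy is wrapped in FSDP.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def wrap_policy(policy: torch.nn.Module, fsdp_policy_mp: str = "bfloat16"):
    mp_dtype = getattr(torch, fsdp_policy_mp)
    mp_policy = MixedPrecision(param_dtype=mp_dtype,
                               reduce_dtype=mp_dtype,
                               buffer_dtype=mp_dtype)
    # BasicTrainer / TensorParallelTrainer never build this wrapper, so the
    # setting is silently ignored outside the FSDPTrainer.
    return FSDP(policy, mixed_precision=mp_policy)
```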
