DPO did not achieve the expected experimental effect #56

Open
Vance0124 opened this issue Dec 7, 2023 · 2 comments

@Vance0124

I replicated the pythia28 experiments on hh (Anthropic/hh-rlhf) using the open-source code. Here are some of my experimental results:

SFT1:

python -u train.py exp_name=sft gradient_accumulation_steps=1 batch_size=4 eval_batch_size=16 model.policy_dtype=float32

i.e., with batch_size=4.
I then evaluated the model using GPT-4:
[screenshot: sft3 — GPT-4 evaluation of SFT1]
The result does not look very promising.
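For context, my GPT-4 evaluation is a pairwise comparison between the model's samples and the chosen responses from the test split, roughly along the lines of the sketch below. The judge prompt and client setup here are simplified placeholders, not code from this repo:

```python
# Simplified sketch of the pairwise GPT-4 judging I used (placeholder prompt;
# in practice the two responses should also be shuffled to avoid position bias).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """For the following query to a chatbot, which response is more helpful?

Query: {prompt}

Response A: {a}

Response B: {b}

Answer with a single letter: "A" or "B"."""

def gpt4_prefers_a(prompt: str, a: str, b: str) -> bool:
    """Ask GPT-4 which of two responses to the same prompt it prefers."""
    out = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b)}],
    )
    return out.choices[0].message.content.strip().upper().startswith("A")

def win_rate(pairs):
    """pairs: iterable of (prompt, model_sample, reference_response) triples."""
    pairs = list(pairs)
    wins = sum(gpt4_prefers_a(p, sample, ref) for p, sample, ref in pairs)
    return wins / len(pairs)
```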
I also ran three other SFT variants:

SFT2:

python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28_sft gradient_accumulation_steps=8 batch_size=64 eval_batch_size=32 trainer=TensorParallelTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.policy_dtype=float32

The result:
[screenshot: sft4 — GPT-4 evaluation of SFT2]
This also does not look good.

SFT3 (bfloat16):

python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28_sft gradient_accumulation_steps=8 batch_size=64 eval_batch_size=16 trainer=TensorParallelTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.policy_dtype=bfloat16

The result:
[screenshot: sft1 — GPT-4 evaluation of SFT3 (bfloat16)]

SFT4 (bfloat16):

python -u train.py model=pythia28 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia28_sft gradient_accumulation_steps=8 batch_size=64 eval_batch_size=16 trainer=BasicTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.policy_dtype=bfloat16

The result:
[screenshot: sft2 — GPT-4 evaluation of SFT4 (bfloat16)]

The evaluations of SFT3 (bfloat16) and SFT4 (bfloat16) seem to be even worse.

Based on SFT1 (which I think is probably the best of these), I trained DPO1 with the following command:

python -u train.py loss=dpo loss.beta=0.1 exp_name=DPO_pythia28 gradient_accumulation_steps=1 batch_size=4 eval_batch_size=16 model.policy_dtype=float32 model.archive=.cache/yanxue/DPO_SFT_2023-11-25_20-19-34_103923/LATEST/policy.pt 

The result:
[screenshot: DPO1 — GPT-4 win rates]
The highest win rate is only about 50%, but I don't know what I did wrong or what I missed.
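For reference, my understanding is that loss=dpo with loss.beta=0.1 optimizes the standard DPO objective. A minimal sketch of that loss (not the repo's exact implementation):

```python
# Minimal sketch of the objective that loss=dpo loss.beta=0.1 optimizes
# (the standard DPO loss; not copied from the repo's trainers).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a batch of summed response log-probs under the given model."""
    # Log-ratios of the policy against the frozen SFT reference model.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): minimized when the policy raises the likelihood
    # of chosen responses relative to rejected ones (measured against the reference).
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```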

I also ran another DPO experiment, based on SFT2.
DPO2:

python -u train.py model=pythia28 datasets=[hh] loss=dpo loss.beta=0.1 exp_name=anthropic_dpo_pythia28_dpo model.policy_dtype=bfloat16 gradient_accumulation_steps=16 batch_size=64 eval_batch_size=16 trainer=TensorParallelTrainer sample_during_eval=false model.fsdp_policy_mp=bfloat16 model.archive=.cache/root/anthropic_dpo_pythia28_sft_2023-12-03_15-39-23_344965/LATEST/policy.pt

Compared to the training results in the public WandB runs, my own runs did not meet expectations.

The "rewards_train/accuracies" curve from the public WandB runs:
[screenshot: wandb_train]
versus my own "rewards_train/accuracies":
[screenshot: result_train]
The public WandB runs reach more than 70% train accuracy, while mine only reaches around 60% at most.
The evaluation results show the same pattern.
"rewards_eval/accuracies" from the public WandB runs:
[screenshot: wandb_eval]
versus my own "rewards_eval/accuracies":
[screenshot: result_eval_dpo2]
The eval accuracies also show a gap of close to 10 percentage points.
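For clarity on what I am comparing: my understanding is that these accuracies measure how often the implicit DPO reward of the chosen response exceeds that of the rejected one. A minimal sketch of that computation (my assumption, not the repo's exact logging code):

```python
# Sketch of how I understand rewards/accuracies is computed: the implicit DPO
# reward is beta * (policy logp - reference logp) for each response, and the
# accuracy is the fraction of pairs where the chosen reward beats the rejected one.
import torch

def reward_accuracy(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return (chosen_rewards > rejected_rewards).float().mean()
```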
All other parameters use the defaults (e.g. lr=5e-7).
I'm not sure whether I made a mistake somewhere or what needs to change to close this gap. Any help would be sincerely appreciated.

@Vance0124
Author

My main questions are:

  1. Is there anything wrong with how I trained my SFT model (float32)?
  2. Is it normal for the evaluation performance of the bfloat16 SFT model to be much worse than that of the float32 SFT model, and is there a way to compensate for it? Likewise, if a DPO model trained on top of such a bfloat16 SFT model performs poorly, is there a way to remedy that? (My understanding of what model.policy_dtype changes is sketched below.)
  3. Is there any method or parameter adjustment that would close the gap between my DPO results and the published ones?
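For question 2, this is my assumption about what model.policy_dtype does (a sketch, not the repo's actual loading code): it sets the torch dtype the policy weights are loaded and trained in.

```python
# My assumption (not the repo's exact code) about what model.policy_dtype does:
# it sets the torch dtype the policy weights are loaded and trained in, so
# bfloat16 halves memory at the cost of precision during pure-bf16 training.
import torch
import transformers

def load_policy(name: str = "EleutherAI/pythia-2.8b", policy_dtype: str = "float32"):
    dtype = getattr(torch, policy_dtype)  # torch.float32 or torch.bfloat16
    return transformers.AutoModelForCausalLM.from_pretrained(name, torch_dtype=dtype)
```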

@eric-mitchell
Owner

  1. Changing model.fsdp_policy_mp when not using the FSDPTrainer will have no effect (see the sketch below). The reference wandb run used the command:
train.py model=pythia28 datasets=[hh] loss=sft exp_name=pythia28_hh_sft_bf16 gradient_accumulation_steps=2 batch_size=64 n_epochs=1 eval_batch_size=32 trainer=FSDPTrainer eval_every=5000 sample_during_eval=false model.fsdp_policy_mp=bfloat16
  2. It's possible this has to do with an interaction between TensorParallelTrainer and reduced precision. Can you try again with the FSDPTrainer?
  3. Can you try again with the exact commands used in the demo runs? https://wandb.ai/eric_anthony_mitchell/dpo-demos
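For reference, fsdp_policy_mp configures FSDP's mixed-precision policy, which is only applied when the policy is wrapped by the FSDPTrainer. A rough sketch of the mechanism (not the repo's exact wrapping code):

```python
# Rough sketch (not the repo's exact wrapping code) of why fsdp_policy_mp only
# matters with the FSDPTrainer: it configures FSDP's MixedPrecision policy,
# which is only applied when the policy is wrapped in FSDP.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def wrap_policy(policy: torch.nn.Module, fsdp_policy_mp: str = "bfloat16"):
    mp_dtype = getattr(torch, fsdp_policy_mp)
    mp_policy = MixedPrecision(param_dtype=mp_dtype,
                               reduce_dtype=mp_dtype,
                               buffer_dtype=mp_dtype)
    # BasicTrainer / TensorParallelTrainer never build this wrapper, so the
    # setting is silently ignored outside the FSDPTrainer.
    return FSDP(policy, mixed_precision=mp_policy)
```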
