My questions:
Is there any mistake in how I trained my SFT model (float32)?
Is it normal for the evaluation performance of the SFT model (bfloat16) to be much worse than that of the SFT model (float32)? Is there a way to compensate for this? And if a DPO model trained on top of such a bfloat16 SFT model performs poorly, is there a way to remedy it?
Is there any method or parameter adjustment that would bridge the gap between my DPO results and the published ones?
I replicated the pythia28 experiments on hh (Anthropic/hh-rlhf) using the open-source code. Here are some of the experimental results:
SFT1: trained with batch_size=4. Then I evaluated the model using GPT-4:
But the result doesn't seem very promising.
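To clarify what I mean by "evaluated using GPT-4": I compute a pairwise win rate for sampled completions, with GPT-4 as the judge. Below is a minimal sketch of that kind of judging setup; the judge prompt, helper names, and the choice of comparing against the dataset's chosen responses are placeholders rather than my exact script.

```python
# Minimal sketch of pairwise GPT-4 win-rate judging (placeholder prompt/helpers,
# not my exact evaluation script). Requires the openai>=1.0 client and an API key.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """Which of the two responses to the query is more helpful?

Query: {query}

Response A: {response_a}

Response B: {response_b}

Answer with a single letter, A or B."""

def judge(query: str, model_response: str, baseline_response: str) -> bool:
    """Return True if GPT-4 prefers the model's response over the baseline."""
    prompt = JUDGE_TEMPLATE.format(
        query=query, response_a=model_response, response_b=baseline_response
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1,
    )
    return completion.choices[0].message.content.strip().upper().startswith("A")

# Example usage over (query, model_response, chosen_response) triples:
# wins = sum(judge(q, m, c) for q, m, c in eval_triples)
# win_rate = wins / len(eval_triples)
```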
I also trained three other versions:
SFT2:
The result:
This also doesn't seem good.
SFT3 (bfloat16):
The result:
SFT4 (bfloat16):
The result:
The evaluations of SFT3 (bfloat16) and SFT4 (bfloat16) seem to be even worse.
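For the bfloat16 variants, what I mean by evaluating in bfloat16 is loading the saved SFT policy and casting it before sampling, roughly like the sketch below. The base model name, checkpoint path, and assumed checkpoint layout are placeholders, not my exact code.

```python
# Sketch of loading a saved SFT policy for evaluation in a chosen precision.
# The checkpoint path and the assumption that it is a plain state_dict
# (possibly stored under a "state" key) are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "EleutherAI/pythia-2.8b"
CHECKPOINT = "/path/to/sft/policy.pt"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)  # needed later for sampling
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float32)

ckpt = torch.load(CHECKPOINT, map_location="cpu")
state_dict = ckpt["state"] if isinstance(ckpt, dict) and "state" in ckpt else ckpt
model.load_state_dict(state_dict)

# Evaluating in bfloat16 (SFT3/SFT4) vs. keeping float32 (SFT1/SFT2):
model = model.to(torch.bfloat16)  # comment this line out for the float32 variant
model = model.to("cuda").eval()
```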
Based on SFT1 (which I think is probably the best among these), I trained DPO1 with the following command:
The result:
The highest win rate is only about 50%, but I don't know what I did wrong or what I might have missed.
I also ran another DPO experiment based on SFT2:
DPO2:
Compared to the training results on the public WandB run, my own results did not meet expectations.
The training "rewards_train/accuracies" on the public WandB run:
My training "rewards_train/accuracies":
The "rewards_train/accuracies" on the public WandB run reaches more than 70%, but mine only gets to around 60% at most.
And the evaluation results:
"rewards_eval/accuracies" on the public WandB run:
My "rewards_eval/accuracies":
The eval results also show a gap (close to 10 percentage points).
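To make sure I'm comparing the same quantity, this is how I understand the implicit rewards and "rewards_*/accuracies" to be computed in DPO. The sketch below uses placeholder log-probability tensors, not the repo's actual code.

```python
import torch
import torch.nn.functional as F

beta = 0.1  # DPO temperature; the value I am using

# Placeholder per-example sums of log-probabilities for the chosen and
# rejected responses under the policy and the frozen reference model.
policy_chosen_logps = torch.tensor([-120.3, -98.7, -110.2])
policy_rejected_logps = torch.tensor([-118.9, -105.4, -112.8])
reference_chosen_logps = torch.tensor([-121.0, -101.2, -111.0])
reference_rejected_logps = torch.tensor([-117.5, -104.9, -112.1])

# Implicit DPO rewards: beta times the policy/reference log-ratio.
chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)

# "rewards_*/accuracies": fraction of pairs where the chosen response
# gets a higher implicit reward than the rejected one.
accuracy = (chosen_rewards > rejected_rewards).float().mean()

# The DPO loss on the same quantities, for reference.
logits = (policy_chosen_logps - policy_rejected_logps) - (
    reference_chosen_logps - reference_rejected_logps
)
loss = -F.logsigmoid(beta * logits).mean()

print(f"accuracy: {accuracy.item():.2f}, loss: {loss.item():.4f}")
```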
All other parameters are left at their defaults (e.g. lr=5e-7). I'm not sure whether I made a mistake somewhere or what needs to be changed to close this gap. Please help me, and I sincerely appreciate it.
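As a postscript: to compare the curves more precisely than by eyeballing screenshots, I was thinking of pulling both histories through the wandb API, roughly like this (both run paths below are placeholders):

```python
# Sketch of pulling "rewards_train/accuracies" from two wandb runs for a
# direct comparison. Both run paths are placeholders.
import wandb

api = wandb.Api()
public_run = api.run("entity/project/public_run_id")  # the public WandB run
my_run = api.run("my_entity/my_project/my_run_id")    # my own run

metric = "rewards_train/accuracies"
public_hist = public_run.history(keys=[metric])
my_hist = my_run.history(keys=[metric])

print("public max:", public_hist[metric].max())
print("mine   max:", my_hist[metric].max())
```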