Conversational dataset support for `DPOTrainer` #2131

qgallouedec · 2024-09-26T16:20:27Z

What does this PR do?

Part of #2071

It includes:

Overwriting the default learning rate of TrainingArguments, so that the user can just use it with the default values. (same as [KTO] learning rate recomentations for kto #2070)
Extending extract_prompt to support standard dataset.
Add some doc about vision datasets
Add script for RLAIF-V dataset generation

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2024-09-26T16:24:40Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2024-09-30T12:23:10Z

trl/data_utils.py

+        if example["chosen"][idx] != example["rejected"][idx]:
+            if example["chosen"][idx - 1] == " ":  # remove space before the prompt
+                idx -= 1


str1 = "I am Quentin" str2 = "I am in Lyon" # What we want: prompt = "I am" # What we don't want: prompt = "I am "

That's why, when the prompt ends with a space, we take idx-1 instead.

…ngface/trl into dpo-conversational-dataset

lewtun

Really nice clean up of the DPO preprocessing and docs 🔥. Great stuff @qgallouedec !

I know we don't currently have any proper regression tests, but WDYT about running the dpo.py script on main and this branch to sanity check the margins / loss look ok before merging?

docs/source/dataset_formats.mdx

docs/source/dpo_trainer.mdx

lewtun · 2024-10-01T15:36:13Z

docs/source/dpo_trainer.mdx


-Note that the `beta` is the temperature parameter for the DPO loss, typically something in the range of `0.1` to `0.5`. We ignore the reference model as `beta` -> 0.
+To see how the [trained model](https://huggingface.co/trl-lib/dpo-qwen2) performs, use the following code to generate completions:


WDYT about using trl chat here as a nice demo to show users it exists?

Something like this?

docs/source/dpo_trainer.mdx

trl/trainer/dpo_config.py

Co-authored-by: lewtun <[email protected]>

docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <[email protected]>

qgallouedec · 2024-10-02T07:42:19Z

accelerate launch examples/scripts/dpo.py \
    --dataset_name trl-lib/ultrafeedback_binarized \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --learning_rate 5.0e-7 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --logging_steps 25 \
    --eval_strategy steps \
    --eval_steps 50 \
    --output_dir Qwen2-0.5B-DPO-main \
    --no_remove_unused_columns

https://wandb.ai/huggingface/huggingface/runs/b499t4i2
https://wandb.ai/huggingface/huggingface/runs/jqdy51ld

conversational dataset support for dpo

ad6b9a8

qgallouedec and others added 7 commits September 26, 2024 18:49

support standard dataset for extract prompt

94a29ef

test standard dataset for extract prompt

69a6933

fix maybe

e479c85

fix maybe apply prompt

4301724

Merge branch 'main' into dpo-conversational-dataset

2fbca62

style

ead4114

Merge branch 'main' into dpo-conversational-dataset

6f058df

qgallouedec mentioned this pull request Sep 27, 2024

[Tracking issue] General dataset support #2071

Open

29 tasks

qgallouedec added 2 commits September 27, 2024 15:33

overwrite default learning rate of DPO

3428813

style

61c589f

kashif approved these changes Sep 27, 2024

View reviewed changes

qgallouedec and others added 16 commits September 27, 2024 17:07

rlaif script

9c6769b

writer_batch_size in train_test_split

c656c99

initial dpo doc refactoring

8ddf39e

vision data section in doc

d461963

lil format modif

e513cfe

Merge branch 'main' into dpo-conversational-dataset

dbf003e

refine Vision datasets

b22bb82

refine doc

5b8e75f

test new loss type format

93f87b8

restrcture loss function

0671ab5

table loss type

840db37

simplify unsloth

08b21b1

improve doc

083aeb5

looged metrics up

92bed88

refine loss section

985227e

Fix label_smoothing parameter in DPOConfig

9ba55e8

qgallouedec commented Sep 30, 2024

View reviewed changes

Merge branch 'main' into dpo-conversational-dataset

208f34e

qgallouedec marked this pull request as ready for review September 30, 2024 12:25

qgallouedec requested review from kashif, edbeeching and lewtun September 30, 2024 12:25

qgallouedec added 3 commits September 30, 2024 12:32

dataset for test

c2d1836

update readme

9869467

Merge branch 'dpo-conversational-dataset' of https://github.com/huggi…

063628d

…ngface/trl into dpo-conversational-dataset

kashif approved these changes Sep 30, 2024

View reviewed changes

This was referenced Oct 1, 2024

Fix attention mask warning chat cli #2147

Merged

[DPO] Adding weighted preference optimization (WPO) #2141

Merged

lewtun approved these changes Oct 1, 2024

View reviewed changes

qgallouedec mentioned this pull request Oct 1, 2024

Doc: Hub filters in trainers doc #2149

Closed

qgallouedec and others added 5 commits October 1, 2024 18:27

Update docs/source/dpo_trainer.mdx

f50a4bb

Co-authored-by: lewtun <[email protected]>

try colorized code block

df7cb6a

Merge branch 'main' into dpo-conversational-dataset

8749c70

refine doc style

bb2b368

further refine doc

3d8e0b6

kashif reviewed Oct 1, 2024

View reviewed changes

docs/source/dpo_trainer.mdx Outdated Show resolved Hide resolved

kashif reviewed Oct 1, 2024

View reviewed changes

docs/source/dpo_trainer.mdx Outdated Show resolved Hide resolved

qgallouedec and others added 2 commits October 1, 2024 21:07

Update docs/source/dpo_trainer.mdx

2f591f5

Co-authored-by: Kashif Rasul <[email protected]>

Merge branch 'main' into dpo-conversational-dataset

4ac091b

re add pali gemma test

a55c8ec

kashif approved these changes Oct 2, 2024

View reviewed changes

Add missing period

94a31e4

qgallouedec merged commit 78249d9 into main Oct 2, 2024
3 checks passed

qgallouedec deleted the dpo-conversational-dataset branch October 2, 2024 08:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Conversational dataset support for `DPOTrainer` #2131

Conversational dataset support for `DPOTrainer` #2131

qgallouedec commented Sep 26, 2024 •

edited

Loading

HuggingFaceDocBuilderDev commented Sep 26, 2024

qgallouedec Sep 30, 2024

lewtun left a comment

lewtun Oct 1, 2024

qgallouedec Oct 1, 2024

qgallouedec commented Oct 2, 2024 •

edited

Loading


		Note that the `beta` is the temperature parameter for the DPO loss, typically something in the range of `0.1` to `0.5`. We ignore the reference model as `beta` -> 0.
		To see how the [trained model](https://huggingface.co/trl-lib/dpo-qwen2) performs, use the following code to generate completions:

Conversational dataset support for DPOTrainer #2131

Conversational dataset support for DPOTrainer #2131

Conversation

qgallouedec commented Sep 26, 2024 • edited Loading

What does this PR do?

Before submitting

Who can review?

HuggingFaceDocBuilderDev commented Sep 26, 2024

qgallouedec Sep 30, 2024

Choose a reason for hiding this comment

lewtun left a comment

Choose a reason for hiding this comment

lewtun Oct 1, 2024

Choose a reason for hiding this comment

qgallouedec Oct 1, 2024

Choose a reason for hiding this comment

qgallouedec commented Oct 2, 2024 • edited Loading

Conversational dataset support for `DPOTrainer` #2131

Conversational dataset support for `DPOTrainer` #2131

qgallouedec commented Sep 26, 2024 •

edited

Loading

qgallouedec commented Oct 2, 2024 •

edited

Loading