
Add option to check dataset labels in SFTTrainer #1414

Closed
wants to merge 8 commits into from

Conversation

geronimi73

Hi everyone!

This PR introduces a new option check_dataset_labels in SFTTrainer. When enabled, the trainer calls the collator on the first sample of the training set and logs each token_id, the decoded token, and the corresponding label. This helps uncover mistakes early, such as setting the tokenizer's pad_token to eos_token. The idea is taken from axolotl, where such an option already exists.

Problem(s):
It's common practice to set the tokenizer's pad_token to eos_token. This is problematic because DataCollatorForCompletionOnlyLM sets the label for all occurrences of pad_token to -100. If pad=eos, then the model will never learn to output eos. This issue can be hard to debug and many people struggle with it.
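The effect can be sketched in a few lines (an illustrative toy, not the actual DataCollatorForCompletionOnlyLM implementation; the token ids are made up): if pad_token and eos_token share an id, masking pad positions also masks every eos.

```python
def mask_pad_labels(input_ids, pad_id):
    # Toy version of the pad-masking behaviour: the label becomes -100
    # (ignored by the loss) wherever the token equals the pad token.
    return [-100 if tok == pad_id else tok for tok in input_ids]

EOS_ID = 2
PAD_ID = EOS_ID  # the common (and problematic) pad=eos setup

input_ids = [1, 24126, 446, EOS_ID]  # e.g. <bos> Lu ke <eos>
labels = mask_pad_labels(input_ids, PAD_ID)
print(labels)  # the final <eos> gets label -100, so eos is never trained on
```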

Additionally, when using instruction_template and response_template, logging the tokens and labels helps verify that the labels are set to train only on the output and ignore the instruction. Furthermore, tokenizers can be complex, and the output makes it quick to spot tokenization issues such as the handling of special tokens.

Solution:

  • Add a new option check_dataset_labels to the trainer.
  • When check_dataset_labels is enabled, the trainer calls the collator on the first sample of the training set and logs each token_id, the decoded token, and the corresponding label.
  • This creates transparency and helps to uncover mistakes early.
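The logging itself could look roughly like this (a minimal sketch, not the PR's actual code; the decode callback and the toy vocabulary are hypothetical stand-ins for a real tokenizer):

```python
def log_labels(input_ids, labels, decode):
    # One line per token: token_id, decoded token, label
    # (-100 means the position is ignored by the loss).
    return [f"{tok} {decode(tok)!r} {lab}" for tok, lab in zip(input_ids, labels)]

# hypothetical token ids standing in for a real tokenizer's vocabulary
vocab = {32000: "<|im_start|>", 1792: "user", 13: "<0x0A>"}
for line in log_labels([32000, 1792, 13], [-100, -100, -100], vocab.get):
    print(line)
```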

Usage:

from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# load model and tokenizer
...

messages = [
    {"role": "user", "content": "Hello who are you?"},
    {"role": "assistant", "content": "Luke, I am your father"},
    {"role": "user", "content": "WTF"},
]
dataset = Dataset.from_list([dict(messages=messages)])

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    data_collator = DataCollatorForCompletionOnlyLM(
        instruction_template = "<|im_start|>user", 
        response_template = "<|im_start|>assistant", 
        tokenizer = tokenizer, 
        mlm = False),
    check_dataset_labels = True,
    dataset_kwargs=dict(add_special_tokens=False),
    args = TrainingArguments(output_dir = "out")
)

output:

check_dataset_labels:
<|im_start|> user
Hello who are you? <|im_end|> 
 <|im_start|> assistant
Luke, I am your father <|im_end|> 
 <|im_start|> user
WTF <|im_end|> 

32000 '<|im_start|>' -100
1792 'user' -100
13 '<0x0A>' -100
10994 'Hello' -100
1058 'who' -100
526 'are' -100
366 'you' -100
29973 '?' -100
32001 '<|im_end|>' -100
13 '<0x0A>' -100
32000 '<|im_start|>' -100
465 'ass' -100
22137 'istant' -100
13 '<0x0A>' 13
24126 'Lu' 24126
446 'ke' 446
29892 ',' 29892
306 'I' 306
626 'am' 626
596 'your' 596
4783 'father' 4783
32001 '<|im_end|>' 32001
13 '<0x0A>' 13
32000 '<|im_start|>' -100
1792 'user' -100
13 '<0x0A>' -100
29956 'W' -100
8969 'TF' -100
32001 '<|im_end|>' -100
13 '<0x0A>' -100

Related issues:

Contributor

@younesbelkada younesbelkada left a comment

Amazing work!
Can you add the docstring of that arg in SFTTrainer's docstring, together with a few lines explaining what you posted in the PR? 🙏

Contributor

@younesbelkada younesbelkada left a comment

Thanks @geronimi73
the changes look great! Can you just run the styling checks? make precommit, then we can merge

@geronimi73
Author

Thanks @geronimi73 the changes look great! Can you just run the styling checks? make precommit, then we can merge

What about the print statements I mentioned? Leave them like this?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@younesbelkada younesbelkada left a comment

Yes, I think we can leave the print statements.

trl/trainer/sft_trainer.py (outdated review comment, resolved)
Contributor

@younesbelkada younesbelkada left a comment

Hi @geronimi73 !
Again, thank you very much for this contribution. After thinking a bit, I think we should standardize and use logger.info instead of print, so that the approach is harmonized with coding practice across the HF codebase - sorry for that, as you already asked the question and I said yes! Would you mind switching the print statements to logger.info? You will need to import logging at the top of the file and import the logger properly. Let me know if you need any help, or if you think we should keep the print statements.

@younesbelkada
Copy link
Contributor

(to fix the failing tests you just have to rebase with main)

@geronimi73
Author

you will need to import logging at the top of the file and import the logger properly

I tried. It works, but I'm not sure if this is the way to do it. Please check my comment on the code.

Contributor

@younesbelkada younesbelkada left a comment

Thanks! I left one comment. Can you also run the styling checks? make precommit

trl/trainer/sft_trainer.py (outdated review comment, resolved)
geronimi73 and others added 2 commits March 14, 2024 10:09
@younesbelkada
Contributor

Hi @geronimi73 sorry for all the iteration ! can you re-run the styling checks again? 🙏

@geronimi73
Author

geronimi73 commented Mar 15, 2024

Hi @geronimi73 sorry for all the iteration ! can you re-run the styling checks again? 🙏

Sure! But I'm still wondering whether it works correctly. The problem is, I never see the output of logger.info()!

How am I supposed to set the log level (as a user of SFTTrainer)? I tried this, and it doesn't work:

import transformers
transformers.logging.set_verbosity_info()

@geronimi73
Author
Also, this one does not enable it either:

args = TrainingArguments(
    output_dir = "out",
    log_level = "info"
    )
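(For reference, a sketch using only the stdlib logging module; the logger name below assumes the PR creates its logger with __name__ inside trl/trainer/sft_trainer.py. Setting the level on that logger directly makes the INFO records visible:)

```python
import logging

# Ensure a handler exists on the root logger so records are printed at all
logging.basicConfig(level=logging.WARNING)

# A logger obtained via __name__ in trl/trainer/sft_trainer.py is named
# "trl.trainer.sft_trainer"; lowering its level lets INFO records through
# regardless of the transformers verbosity setting.
logger = logging.getLogger("trl.trainer.sft_trainer")
logger.setLevel(logging.INFO)
logger.info("this INFO record is now emitted")
```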

@geronimi73
Author

Sorry to keep bothering you @younesbelkada, but I think the problem is that we are using logging from transformers.utils and obtaining a logger via __name__ (= trl.trainer.sft_trainer). Setting log_level in TrainingArguments therefore has no effect on the logger we obtained, because this line in trainer.py affects only the transformers root logger, not our trl.trainer.sft_trainer logger.

To do this correctly I think, we would have to either

  • add a trl.utils.logger analogous to transformers.utils.logger (too much effort?)
  • OR call logger.setLevel(log_level) in SFTTrainer.__init__(). Otherwise we will not see the output of logger.info().

what do you think? option 2?
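The hierarchy point above can be reproduced with the stdlib logging module alone (a sketch): logger levels only propagate down a dotted-name hierarchy, so nothing configured on "transformers" governs "trl.trainer.sft_trainer".

```python
import logging

transformers_logger = logging.getLogger("transformers")
trl_logger = logging.getLogger("trl.trainer.sft_trainer")
trl_logger.setLevel(logging.NOTSET)  # clean slate for the demo

# Setting INFO on the "transformers" logger...
transformers_logger.setLevel(logging.INFO)

# ...does not affect "trl.trainer.sft_trainer", which is not one of its
# descendants; it still inherits the Python root default (WARNING).
assert trl_logger.getEffectiveLevel() == logging.WARNING

# Option 2 from the comment above: set the level on the trl logger itself.
trl_logger.setLevel(logging.INFO)
assert trl_logger.getEffectiveLevel() == logging.INFO
```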


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this Apr 18, 2024
3 participants