Add option to check dataset labels in SFTTrainer #1414
Conversation
Amazing work!
Can you add the docstring of that arg to SFTTrainer's docstring, together with a few lines explaining what you posted on the PR? 🙏
Thanks @geronimi73, the changes look great! Can you just run the styling checks (`make precommit`)? Then we can merge.
what about the
Yes, I think we can leave the `print`.
Co-authored-by: Younes Belkada <[email protected]>
Hi @geronimi73 !
Again, thank you very much for this contribution. After thinking a bit, I think we should standardize things and use `logger.info` instead of `print`, so that the approach is harmonized across our coding practice in the HF codebase - sorry for that, as you already asked the question and I said yes! Would you mind switching the `print` statements to `logger.info`? You will need to `import logging` at the top of the file and set up the logger properly! Let me know if you need any help, or if you think we should keep the `print` statements.
(To fix the failing tests, you just have to rebase with main.)
I tried. It works, but I'm not sure if this is the way to do it. Please check my comment on the code.
Thanks! I left one comment! Can you also run the styling checks (`make precommit`)?
Co-authored-by: Younes Belkada <[email protected]>
Hi @geronimi73, sorry for all the iteration! Can you re-run the styling checks again? 🙏
Sure! But I'm still wondering whether it works correctly. Problem is, I never see the output. How am I supposed to set the log level (as a user of `SFTTrainer`)? I tried this, doesn't work:

```python
import transformers

transformers.logging.set_verbosity_info()
```
Also this one does not enable it:

```python
args = TrainingArguments(
    output_dir = "out",
    log_level = "info"
)
```
Sorry to keep bothering you @younesbelkada, but I think the problem is that we are using logging from […]. To do this correctly, I think we would have to either […]

What do you think? Option 2?
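The symptom described above (no INFO output from a plain stdlib logger) can be reproduced without `transformers` at all: a logger's effective level defaults to WARNING, so `logger.info` records are dropped until someone raises the level. A minimal sketch (logger name is made up):

```python
import io
import logging

logger = logging.getLogger("sft_label_check_demo")  # hypothetical name
buf = io.StringIO()
logger.addHandler(logging.StreamHandler(buf))

# Mimic the default effective level: INFO records are below WARNING,
# so this one is dropped even though a handler is attached.
logger.setLevel(logging.WARNING)
logger.info("you never see this")

# After raising the level, INFO records reach the handler.
logger.setLevel(logging.INFO)
logger.info("now visible")
```

Only the second message ends up in `buf`, which matches the "I never see the output" behaviour: the level has to be set on the logger (or its ancestors) that the trainer actually uses.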
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hi everyone!

This PR introduces a new option `check_dataset_labels` in `SFTTrainer`. When enabled, the trainer calls the collator on the first sample from the training set and logs `token_id`, decoded `token_id`, and the corresponding `label`. This helps to uncover mistakes early, such as setting the tokenizer's `pad_token` to `eos_token`. The idea is taken from axolotl, where such an option already exists.

Problem(s):
It's common practice to set the tokenizer's `pad_token` to `eos_token`. This is problematic because `DataCollatorForCompletionOnlyLM` sets the `label` for all occurrences of `pad_token` to `-100`. If `pad == eos`, then the model will never learn to output `eos`. This issue can be hard to debug, and many people struggle with it.

Additionally, when using `instruction_template` and `response_template`, logging the tokens and labels is helpful to ensure that all the labels are correctly set to train only on the output and ignore the instruction. Furthermore, tokenizers can be complex, and the output helps to quickly spot tokenization issues such as the handling of special tokens.

Solution:
Add `check_dataset_labels` to the trainer.

Usage:
output:
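To make the pad/eos problem above concrete, here is a minimal plain-Python sketch of the masking rule (token ids are made up, and `mask_pad_labels` is a hypothetical stand-in for what the collator does to padded positions, not the actual TRL implementation):

```python
# -100 is the label value the loss ignores.
IGNORE_INDEX = -100

def mask_pad_labels(input_ids, pad_token_id):
    """Hypothetical stand-in for the collator: copy input_ids to labels,
    masking every pad_token_id position with IGNORE_INDEX."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in input_ids]

EOS, PAD = 2, 0  # made-up token ids

# Distinct pad token: the final EOS keeps its label, so the model
# gets a training signal to stop.
print(mask_pad_labels([15, 27, 31, EOS, PAD, PAD], PAD))
# [15, 27, 31, 2, -100, -100]

# pad_token == eos_token: padding uses EOS, so every EOS is masked
# and the model never learns to emit it.
print(mask_pad_labels([15, 27, 31, EOS, EOS, EOS], EOS))
# [15, 27, 31, -100, -100, -100]
```

This is exactly the kind of silent mistake the logged token/label dump is meant to surface.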
Related issues:

- `eos_token_id` #976
- `eos_token_id` at the end. transformers#22794