A question about the SFTTrainer (also a theoretical question about SFT in general) #1083
Comments
It is not obvious, hence the question. For example, the InstructGPT paper mentions SFT but mainly points to the (seemingly) first attempt at SFT in this paper, which addresses a "Summarization" task but not a "Dialogue" task. In that paper, human labelers are asked to write summaries, and when the paper says "Behavioral Cloning" is used to fine-tune the LLM to this task, I'd imagine that only the "Summary" section is treated as the label, not the entire prompt/document. Following that principle, for "Dialogue" tasks I'd intuitively expect that only the "assistant" turns should be part of the labels.
We offer both options: doing "vanilla" CLM or masking out the user queries: https://huggingface.co/docs/trl/sft_trainer#advanced-usage I don't think there is a systematic distinction between fine-tuning, supervised fine-tuning, or even instruction tuning; they are just terms people use to describe essentially the same thing :)
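For reference, here is a minimal sketch of the masking option from the linked docs. The "### Answer:" response template and the toy dataset are made up for illustration; adapt them to your own prompt format.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy dataset; the "### Answer:" template is an arbitrary choice for this sketch.
dataset = Dataset.from_dict(
    {"text": ["### Question: What is 2+2?\n### Answer: 4"]}
)

# Labels for everything up to and including the response template are set to
# -100, so the loss is computed on the completion tokens only.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Answer:", tokenizer=tokenizer
)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    data_collator=collator,
)
```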
Thank you for the pointer! DataCollatorForCompletionOnlyLM is good to know (and what I was looking for, in a sense :-)). As for the terms, yes, I can see that they can be loosely interchangeable. Based on my reading of the literature, I have a view of how they are similar and how they are (or should be) different, but perhaps everyone has their own view/interpretation :-)
Hi @PradeepKadubandi, thank you for raising this question; it was indeed confusing when I first tried to fine-tune a model. The traditional approach separates the input (e.g., document) from the label (e.g., summary), and the loss is computed on the generation compared against the label. The SFTTrainer, however, wraps the input and label together as one instruction sequence (so input and label are the same) and trains on it as a next-token prediction task. While these approaches seem similar, I wonder whether there is a performance difference between the two. Do you have any sense of which method is better suited to which scenarios?
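To make the comparison concrete, here is a hedged sketch (GPT-2 standing in for any causal LM; the prompt and answer strings are invented) showing that the two framings share the same forward pass and differ only in which positions contribute to the cross-entropy loss, via the -100 ignore index:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Summarize: The cat sat on the mat.\nSummary:"
answer = " A cat sat."
prompt_ids = tokenizer(prompt).input_ids
answer_ids = tokenizer(answer).input_ids

input_ids = torch.tensor([prompt_ids + answer_ids])

# "Traditional" style: loss only on the label (answer) tokens.
masked_labels = input_ids.clone()
masked_labels[0, : len(prompt_ids)] = -100  # ignored by the cross-entropy loss

# SFTTrainer default: loss on every token of the packed sequence.
full_labels = input_ids.clone()

loss_answer_only = model(input_ids=input_ids, labels=masked_labels).loss
loss_full = model(input_ids=input_ids, labels=full_labels).loss
print(loss_answer_only.item(), loss_full.item())
```

The only difference between the two calls is the -100 mask; the model, tokenization, and internal shift are identical in both.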
I have a general question about Supervised Fine Tuning (SFT) for Dialogue applications.
Should the SFT process use the same LM objective (next-token prediction) that is used in pre-training a language model?
The "Dialogue" task is predicting "assistant" tokens, right? Shouldn't the objective be predicting only those tokens? Is one way to do this is to set labels for only assistant tokens and ignore the labels on others?
The SFTTrainer implementation does not set labels explicitly; as far as I understand, this leads to "input_ids" being cloned into "labels" and shifted by one position (within the transformers code), which amounts to the "next-token" prediction objective.
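For what it's worth, a small sketch of that default path (assuming, as far as I can tell from the source, that SFTTrainer falls back to the stock DataCollatorForLanguageModeling when no collator is given): with mlm=False the collator clones input_ids into labels, and the one-position shift happens later, inside the model's loss computation.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # needed for the collator's padding

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
batch = collator([tokenizer("User: hi\nAssistant: hello there")])

# labels is a clone of input_ids (padding positions, if any, become -100);
# the shift for next-token prediction happens inside the model forward.
print(batch["input_ids"])
print(batch["labels"])
```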
On a more philosophical note: if SFT uses the same objective as pre-training, why shouldn't that be called simply "fine-tuning" the model (on a dialogue dataset, of course) rather than "supervised fine-tuning"? What am I missing? Is there a reference paper that explains this well, along with the right approach to SFT for dialogue applications?