
A question about the SFTTrainer (also a theoretical question about SFT in general) #1083

Closed
PradeepKadubandi opened this issue Dec 11, 2023 · 5 comments


I have a general question about Supervised Fine Tuning (SFT) for Dialogue applications.

Should the SFT process use the same LM objective (next-token prediction) that is used in pre-training a language model?

The "Dialogue" task is predicting "assistant" tokens, right? Shouldn't the objective be predicting only those tokens? Is one way to do this is to set labels for only assistant tokens and ignore the labels on others?

The SFTTrainer implementation does not set labels; as far as I understand, this leads to "input_ids" being cloned into "labels" and shifted (within the transformers code), i.e., the plain next-token prediction objective over the whole sequence.
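
Roughly, I believe the loss computation inside transformers looks like this (a paraphrase, not the exact library code):

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels):
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    # position t predicts token t+1, i.e. next-token prediction
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,  # masked positions contribute no loss
    )
```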

On a more philosophical note: if SFT uses the same objective as pre-training, why isn't it simply called "fine-tuning" the model (on a dialogue dataset, of course) rather than "supervised fine-tuning"? What am I missing? Is there a reference paper that explains this well, along with the right approach to SFT for dialogue applications?

PradeepKadubandi commented Dec 11, 2023

It is not obvious, hence the question. For example, the InstructGPT paper mentions SFT but mainly redirects to the (seemingly) first attempt at SFT in this paper, which addresses a "Summarization" task rather than a "Dialogue" task.

In that paper, where human labelers are asked to write summaries and "Behavioral Cloning" is then used to fine-tune the LLM on the task, I'd imagine that only the "Summary" section is treated as the label, not the entire prompt/document. Following that principle, for "Dialogue" tasks I'd intuitively expect that only the "assistant" turns should be part of the labels.

lvwerra commented Dec 14, 2023

We offer both options: doing "vanilla" CLM or masking out the user queries: https://huggingface.co/docs/trl/sft_trainer#advanced-usage
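
For reference, a sketch along the lines of that docs page (the model, dataset, and response_template below are illustrative; the template string has to match your prompt format exactly):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model_name = "facebook/opt-350m"  # illustrative; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# illustrative dataset with a "text" field formatted as
# "### Human: ...### Assistant: ..."
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

# Everything up to and including the response template gets label -100,
# so only the assistant completion contributes to the loss.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Assistant:", tokenizer=tokenizer
)

trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    data_collator=collator,  # omit this to get "vanilla" CLM instead
)
trainer.train()
```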

I don't think there is a systematic distinction between fine-tuning, supervised fine-tuning, or even instruction tuning; they are just terms people use to describe essentially the same thing :)

@PradeepKadubandi

Thank you for the pointer! DataCollatorForCompletionOnlyLM is good to know (and what I was looking for in a sense :-))

About the terms, yeah, I can see that they can be used loosely interchangeably. Based on my reading of the literature, I have a view of how they are similar and how they are (or should be) different, but perhaps everyone has their own interpretation :-)


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.


Hyfred commented Nov 9, 2024

> these can be loosely interchangeable

Hi @PradeepKadubandi, thank you for raising this question; it was indeed confusing when I tried to fine-tune a model. The traditional approach separates the input (e.g., a document) from the label (e.g., its summary), and the loss is computed by comparing the generated tokens against the label.

However, the SFTTrainer wraps the input and label together into one sequence (so the labels are the same as the input_ids) and trains on it as a next-token prediction task. While these approaches seem similar, I wonder whether there is a performance difference between the two. Do you have any sense of which method is better suited to which scenarios?
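
To make the contrast concrete, here is a toy sketch of the label tensors under each setup (the token ids are made up):

```python
import torch

prompt = torch.tensor([101, 102, 103])    # e.g. the document / user turn
response = torch.tensor([201, 202])       # e.g. the summary / assistant turn
input_ids = torch.cat([prompt, response])

# SFTTrainer default ("vanilla" CLM): loss on every token
labels_full = input_ids.clone()           # tensor([101, 102, 103, 201, 202])

# traditional / completion-only: loss only on the response tokens
labels_masked = input_ids.clone()
labels_masked[: len(prompt)] = -100       # tensor([-100, -100, -100, 201, 202])
```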
