Guidance with the correct format of the validation dataset #981

Closed
Sosycs opened this issue Nov 10, 2023 · 11 comments
Sosycs commented Nov 10, 2023

Hello everyone,

I am in the process of fine-tuning Llama 2 with the SFT trainer and quantization using LoRA.
My dataset is composed of questions structured like:
<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Finally, the ice in glaciers cause abrasion. Pieces of rock embedded in ice at the bottom of a glacier scrape against the rock below. If you have ever collected beach glass or pebbles from a stream, you have witnessed the work of abrasion. Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement Answer: [/INST]

and a column 'label' represents the ground truth.
My question is about this part of my code:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

Do I provide the "label" column to the model for the validation dataset, or do I leave it empty?
If I leave it empty, how can I access my model's predictions on the validation set?


BayesRulez commented Nov 10, 2023

Hi @Sosycs,

Fine-tuning these LLMs is not like training a supervised machine learning model, where you have some inputs and a target to compare your prediction with. Decoder-only transformers like LLaMA 2 simply predict the next token in a sequence.

When you fine-tune these models, you hand them a complete sequence. The trainer steps through your input, handing the model one extra token at a time and requesting a prediction for the next token. The training loss is computed across all of those predictions vs. the actual next tokens.
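To make that concrete, here is a minimal sketch of the loss computation (the checkpoint name is just a placeholder; any Hugging Face causal LM behaves the same way):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute whatever model you are fine-tuning
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "Context: ... Question: ... Answer: D"
inputs = tokenizer(text, return_tensors="pt")

# For causal LMs, the labels are the input IDs themselves; the model shifts
# them one position internally, so every token is scored on how well the
# model predicted the token that follows it.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # cross-entropy averaged over every position in the sequence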

What this means for you (and this applies to both the training and validation datasets) is that you need to compile your question and your answer into a single string. You could do it like this:

template = """<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: {question} Answer: [/INST] {answer}"""
question = "Abrasion is another type of mechanical weathering..."
answer = "D"  # I hope...
prompt = template.replace("{question}", question).replace("{answer}", answer)
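If your data lives in a datasets.Dataset, here is a short sketch of building that single string per row (the column names "question" and "answer" are assumptions; rename them to match your schema):

def build_text(row):
    # Assumed column names; adjust to your dataset
    prompt = template.replace("{question}", row["question"]).replace("{answer}", row["answer"])
    return {"text": prompt}

train_dataset = train_dataset.map(build_text)
val_dataset = val_dataset.map(build_text)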

It's worth noting that it's pretty inefficient to fine-tune a model by scoring its predictions across the entire prompt. You don't really care about teaching it that "weathering" is likely to follow "mechanical"; you care much more that it learns to produce "D" (assuming the answer is D...) given the question.

Take a look at the use of the DataCollatorForCompletionOnlyLM class here: https://huggingface.co/docs/trl/sft_trainer

It lets you compute the loss using only the predictions for tokens that appear to the right of the response_template parameter. This is a much more efficient way of getting better, task-specific performance quickly.
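For example, a sketch of wiring that collator into the trainer from your first post (argument names follow the TRL version you appear to be using; note that packing must stay disabled when using a completion-only collator):

from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

response_template = "Answer: [/INST]"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    data_collator=collator,  # loss is now computed only on tokens after the template
)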

Best of luck!


Sosycs commented Nov 12, 2023

I am familiar with providing the answer separately and computing the loss on the predicted values. What is the name for this kind of fine-tuning? Instruction fine-tuning?
And what is the counterpart used for other model types, not only decoder-only LLMs?
(I am sorry if my questions sound silly; this is my first time doing this kind of fine-tuning and I want the right names to search and read more.)

But I am currently using what you suggested, @BayesRulez.
In my case this is the instruction structure:
<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement Answer: [/INST] D </s>

Do I provide the context, question, and options as an instruction_template, like this:

instruction_template = "</SYS>>\n\n Context:"
response_template = "Answer: [/INST]"
collator = DataCollatorForCompletionOnlyLM(
    instruction_template=instruction_template,
    response_template=response_template,
    tokenizer=tokenizer,
    mlm=False,
)

OR

response_template = "Answer: [/INST]"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template,
    tokenizer=tokenizer,
    mlm=False,
)


Sosycs commented Nov 14, 2023

I have tried multiple response templates but always get the error:
RuntimeError: Could not find response key [835, 4007, 22137, 29901] in token IDs tensor([ 1, 835, ...])

@BayesRulez Can you please guide me to the correct one?

@BayesRulez

Hi @Sosycs,

I literally just had the same problem when using the Mistral tokenizer, and somebody here was kind enough to point out why.

The reason for the error you're seeing is explained here.

To respond to both of your questions above at the same time, your code should look as follows:

response_template = "Answer: [/INST]"

# Work-around for context-sensitive tokenizers
response_template_tokenized = tokenizer.encode(f"\n{response_template}", add_special_tokens=False)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template=response_template_tokenized, tokenizer=tokenizer)

Note that you'll only need the work-around for context-sensitive tokenizers. LLaMA and Mistral both use them; I'm not sure whether any other tokenizers do (Mistral was the first I came across).
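You can see the context sensitivity for yourself with a quick check (a sketch, assuming the LLaMA tokenizer is already loaded):

template = "Answer: [/INST]"

# The same characters can tokenize differently on their own vs. after other text
alone = tokenizer.encode(template, add_special_tokens=False)
in_context = tokenizer.encode(f"\n{template}", add_special_tokens=False)[2:]
print(alone)
print(in_context)  # if these differ, the collator's naive ID search fails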

Hope that helps.

@younesbelkada

Great point, yes @BayesRulez, thanks a lot for your help!
This is also a duplicate of #989.


Sosycs commented Nov 16, 2023

(Quoting @BayesRulez's workaround above.)

Thank you very much. I had tried this solution on Hugging Face, but I must have missed something, as I used "\nAnswer: [/INST]".

So I tried your exact code and got (I also got the same from my own code, as both produce the same tokenized IDs):
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

According to this, it has something to do with a None value being present, but I don't know where that comes from in my case.


Sosycs commented Nov 16, 2023

Hello @younesbelkada,
Thank you very much.
I have tried both "\n[/INST]" and "[/INST]", but I get the same error:
RuntimeError: Could not find response key [835, 4007, 22137, 29901] in token IDs tensor([ 1, 835, ...])
Regarding the Stack Overflow link, I appreciate it! As I understand it, this error comes from a None value present in my text column, but I don't have any such value.


Sosycs commented Nov 16, 2023

@BayesRulez @younesbelkada Shall I remove all the whitespace around "Answer: [/INST] " in my dataset?

@younesbelkada

Hi @Sosycs
Can you try passing the token IDs directly instead? Something like:

# Encode the template once and hand the collator the raw token IDs
response_template_with_context = "[/INST]"
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)

data_collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)
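Whichever template you settle on, it may help to sanity-check that its IDs actually occur inside a fully formatted example before training (a hypothetical check built on the names above; "text" is the column from your SFTTrainer config):

example_ids = tokenizer.encode(train_dataset[0]["text"], add_special_tokens=False)
found = any(
    example_ids[i:i + len(response_template_ids)] == response_template_ids
    for i in range(len(example_ids) - len(response_template_ids) + 1)
)
print(found)  # False means the collator will raise the "Could not find response key" error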

@BayesRulez

@Sosycs, can you paste your full code here? The following works for me:

response_template = "Answer: [/INST]"

# Work-around for context-sensitive tokenizers
response_template_tokenized = tokenizer.encode(f"\n{response_template}", add_special_tokens=False)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template=response_template_tokenized, tokenizer=tokenizer)

example = """<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> 
Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Finally, the ice in glaciers cause abrasion. Pieces of rock embedded in ice at the bottom of a glacier scrape against the rock below. If you have ever collected beach glass or pebbles from a stream, you have witnessed the work of abrasion. 
Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement 
Answer: [/INST]"""

example_encoded = tokenizer(example)

collator([example_encoded])

It returns a dict of {input_ids, attention_mask, labels}. You should see that every value of the labels tensor is -100 (cross-entropy loss ignores this value) except for the final ones, which encode the " D".
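If you want to double-check which tokens will actually contribute to the loss, a quick sketch:

batch = collator([example_encoded])
kept = [tok for tok in batch["labels"][0].tolist() if tok != -100]
print(tokenizer.decode(kept))  # should print just the answer portion, e.g. " D"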
