Guidance with the correct format of the validation dataset #981

Closed
Sosycs opened this issue Nov 10, 2023 · 11 comments
Sosycs commented Nov 10, 2023

Hello everyone,

I am in the process of fine-tuning Llama 2 with the SFT trainer and quantization using LoRA.
My dataset is composed of questions structured like:
<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Finally, the ice in glaciers cause abrasion. Pieces of rock embedded in ice at the bottom of a glacier scrape against the rock below. If you have ever collected beach glass or pebbles from a stream, you have witnessed the work of abrasion. Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement Answer: [/INST]

and a column 'label' represents the ground truth.
My question is about this part of my code:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

Do I provide the "label" column to the model for the validation dataset, or do I leave it empty?
If I leave it empty, how can I access my model's predictions on the validation set?


BayesRulez commented Nov 10, 2023

Hi @Sosycs,

Fine-tuning these LLMs is not like training a supervised machine learning model, where you have some inputs and a target to compare your prediction with. Decoder-only transformers like LLaMA 2 simply predict the next token in a sequence.

When you fine-tune these models, you hand them a complete sequence. The trainer steps through your input, handing the model one extra token at a time and requesting a prediction for the next token. The training loss is computed across all of those predictions vs. the actual next tokens.
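To make that concrete, here is a minimal sketch of the loss computation (the checkpoint name is just a placeholder; any Hugging Face causal LM behaves the same way):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute whatever model you are fine-tuning
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "Context: ... Question: ... Answer: D"
inputs = tokenizer(text, return_tensors="pt")

# For causal LMs, the labels are the input IDs themselves; the model shifts
# them one position internally, so every token is scored on how well the
# model predicted the token that follows it.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # cross-entropy averaged over every position in the sequence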

What this means for you (and this applies to both the training and validation datasets) is that you need to compile your question and your answer into a single string. You could do it like this:

template = """<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: {question} Answer: [/INST] {answer}"""
question = "Abrasion is another type of mechanical weathering..."
answer = "D"  # I hope...
prompt = template.replace("{question}", question).replace("{answer}", answer)
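If your data lives in a datasets.Dataset, here is a short sketch of building that single string per row (the column names "question" and "answer" are assumptions; rename them to match your schema):

def build_text(row):
    # Assumed column names; adjust to your dataset
    prompt = template.replace("{question}", row["question"]).replace("{answer}", row["answer"])
    return {"text": prompt}

train_dataset = train_dataset.map(build_text)
val_dataset = val_dataset.map(build_text)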

It's worth noting that it's pretty inefficient to fine-tune a model by scoring its predictions across the entire prompt. You don't really care about teaching it that "weathering" is likely to follow "mechanical"; you care much more that it learns to produce "D" (assuming the answer is D...) given the question.

Take a look at the use of the DataCollatorForCompletionOnlyLM class here: https://huggingface.co/docs/trl/sft_trainer

It lets you compute the loss using only the predictions for tokens that appear to the right of the response_template parameter. This is a much more efficient way of getting better, task-specific performance quickly.
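For example, a sketch of wiring that collator into the trainer from your first post (argument names follow the TRL version you appear to be using; note that packing must stay disabled when using a completion-only collator):

from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

response_template = "Answer: [/INST]"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    data_collator=collator,  # loss is now computed only on tokens after the template
)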

Best of luck!


Sosycs commented Nov 12, 2023

I am familiar with providing the answer separately and computing the loss on the predicted values. What is the name for this kind of fine-tuning? Instruction fine-tuning?
And what is the counterpart used for other model types, not only decoder-only LLMs?
(I am sorry if my questions sound silly; this is my first time doing this kind of fine-tuning and I want the right names to search and read more.)

But I am currently using what you suggested, @BayesRulez.
In my case this is the instruction structure:
<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement Answer: [/INST] D </s>

Do I provide the context, question, and options as an instruction_template, like this:

instruction_template = "</SYS>>\n\n Context:"
response_template = "Answer: [/INST]"
collator = DataCollatorForCompletionOnlyLM(
    instruction_template=instruction_template,
    response_template=response_template,
    tokenizer=tokenizer,
    mlm=False,
)

OR

response_template = "Answer: [/INST]"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template,
    tokenizer=tokenizer,
    mlm=False,
)


Sosycs commented Nov 14, 2023

I have tried multiple response templates but always get the error:
RuntimeError: Could not find response key [835, 4007, 22137, 29901] in token IDs tensor([ 1, 835, ...])

@BayesRulez Can you please guide me to the correct one?

@BayesRulez

Hi @Sosycs,

I literally just had the same problem when using the Mistral tokenizer, and somebody here was kind enough to point out why.

The reason for the error you're seeing is explained here.

To respond to both of your questions above at the same time, your code should look as follows:

response_template = "Answer: [/INST]"

# Work-around for context-sensitive tokenizers
response_template_tokenized = tokenizer.encode(f"\n{response_template}", add_special_tokens=False)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template=response_template_tokenized, tokenizer=tokenizer)

Note that you'll only need the work-around for context-sensitive tokenizers. LLaMA and Mistral both use them; I'm not sure whether any other tokenizers do (Mistral was the first I came across).
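You can see the context sensitivity for yourself with a quick check (a sketch, assuming the LLaMA tokenizer is already loaded):

template = "Answer: [/INST]"

# The same characters can tokenize differently on their own vs. after other text
alone = tokenizer.encode(template, add_special_tokens=False)
in_context = tokenizer.encode(f"\n{template}", add_special_tokens=False)[2:]
print(alone)
print(in_context)  # if these differ, the collator's naive ID search fails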

Hope that helps.

@younesbelkada

Great point, yes @BayesRulez, thanks a lot for your help!
This is also a duplicate of #989.


Sosycs commented Nov 16, 2023

(Quoting @BayesRulez's workaround above.)

Thank you very much. I had tried this solution on Hugging Face, but I must have missed something, as I used "\nAnswer: [/INST]".

So I tried your exact code and got (I also got the same from my own code, as both produce the same tokenized IDs):
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

According to this, it has something to do with a None value being present, but I don't know where that comes from in my case.


Sosycs commented Nov 16, 2023

Hello @younesbelkada,
Thank you very much.
I have tried both "\n[/INST]" and "[/INST]", but I get the same error:
RuntimeError: Could not find response key [835, 4007, 22137, 29901] in token IDs tensor([ 1, 835, ...])
Regarding the Stack Overflow link, I appreciate it! As I understand it, this error comes from a None value present in my text column, but I don't have any such value.


Sosycs commented Nov 16, 2023

@BayesRulez @younesbelkada Shall I remove all the whitespace around "Answer: [/INST] " in my dataset?

@younesbelkada

Hi @Sosycs
Can you try passing the token IDs directly instead? Something like:

# Encode the template once and hand the collator the raw token IDs
response_template_with_context = "[/INST]"
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)

data_collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)
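Whichever template you settle on, it may help to sanity-check that its IDs actually occur inside a fully formatted example before training (a hypothetical check built on the names above; "text" is the column from your SFTTrainer config):

example_ids = tokenizer.encode(train_dataset[0]["text"], add_special_tokens=False)
found = any(
    example_ids[i:i + len(response_template_ids)] == response_template_ids
    for i in range(len(example_ids) - len(response_template_ids) + 1)
)
print(found)  # False means the collator will raise the "Could not find response key" error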

@BayesRulez

@Sosycs, can you paste your full code here? The following works for me:

response_template = "Answer: [/INST]"

# Work-around for context-sensitive tokenizers
response_template_tokenized = tokenizer.encode(f"\n{response_template}", add_special_tokens=False)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template=response_template_tokenized, tokenizer=tokenizer)

example = """<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> 
Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Finally, the ice in glaciers cause abrasion. Pieces of rock embedded in ice at the bottom of a glacier scrape against the rock below. If you have ever collected beach glass or pebbles from a stream, you have witnessed the work of abrasion. 
Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement 
Answer: [/INST]"""

example_encoded = tokenizer(example)

collator([example_encoded])

It returns a dict of {input_ids, attention_mask, labels}. You should see that every value of the labels tensor is -100 (cross-entropy loss ignores this value) except for the final ones, which encode the " D".
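If you want to double-check which tokens will actually contribute to the loss, a quick sketch:

batch = collator([example_encoded])
kept = [tok for tok in batch["labels"][0].tolist() if tok != -100]
print(tokenizer.decode(kept))  # should print just the answer portion, e.g. " D"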
