Guidance with the correct format of the validation dataset #981
Hi @Sosycs, Fine-tuning these LLMs is not like training a supervised machine learning model, where you have some inputs and a target to compare your prediction with. Decoder-only transformers like LLaMA 2 are simply predicting the next token in a sequence. When you are fine-tuning these models, you hand them a complete sequence. The trainer steps through your input, handing the model a single extra token at a time and requesting a prediction for the next token. The training loss is calculated across all of the predictions vs. the actual next tokens. What this means for you (and this applies to both the training and validation datasets) is that you need to compile your question and your answer into a single string. You could do it like this:
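The code example that originally followed this comment was lost in the page extraction. As a minimal sketch of the idea (the column names "question" and "label", the helper name build_text, and the closing "</s>" token are assumptions, not the commenter's actual code):

```python
# Sketch: merge the prompt and the ground-truth answer into one string so the
# trainer sees a single continuous sequence. Column names ("question",
# "label") are assumed from the dataset format described in this thread.

def build_text(example):
    """Concatenate the [INST]-wrapped prompt and the answer letter."""
    return {"text": example["question"] + " " + example["label"] + " </s>"}

# With a Hugging Face datasets.Dataset you would then run something like:
#   dataset = dataset.map(build_text)
# and point SFTTrainer at the "text" column. The validation split is built
# the same way, with the answer included.
```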
It's worth noting that it's pretty inefficient to fine-tune a model by scoring its predictions across the entire prompt. You don't really care about teaching it that "weathering" is likely to follow "mechanical". You care much more that it learns to produce "D" (assuming the answer is D...), given the question. Take a look at the use of the DataCollatorForCompletionOnlyLM. It enables you to use only the predictions for tokens that appear to the right of the response template. Best of luck!
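To make the masking idea concrete, here is a rough pure-Python sketch (not TRL's actual implementation) of what completion-only loss masking does: every label up to and including the response template is set to -100, the index the cross-entropy loss ignores. The token ids below are invented for illustration.

```python
IGNORE_INDEX = -100  # positions carrying this label are skipped by the loss

def mask_prompt_tokens(input_ids, template_ids):
    """Return labels where only tokens after the response template are scored."""
    labels = list(input_ids)
    n = len(template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == template_ids:
            # mask everything up to and including the template itself
            labels[:start + n] = [IGNORE_INDEX] * (start + n)
            break
    return labels

# Example with made-up ids, where [9, 9] stands in for the "[/INST]" template:
# mask_prompt_tokens([1, 2, 3, 9, 9, 4, 5], [9, 9])
# -> [-100, -100, -100, -100, -100, 4, 5]
```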
I am familiar with providing the answer separately and computing the loss based on the predicted values. What is the name for this kind of fine-tuning? Instruction fine-tuning? But I am currently using what you suggested, @BayesRulez. Do I provide the context, question and options as an instruction_template?
I have tried multiple response templates but always get the error: @BayesRulez, can you please guide me to the correct one?
Hi @Sosycs, I literally just had the same problem when using the Mistral tokenizer and somebody here was kind enough to point out why. The reason for the error you're seeing is explained here. To respond to both of your questions above at the same time, your code should look as follows:
Note that you'll only need to use the work-around for context-sensitive tokenizers. LLaMA and Mistral both use them. I'm not sure if any other tokenizers do (Mistral was the first I came across). Hope that helps.
Great point, yes @BayesRulez, thanks a lot for your help!
Thank you very much. I have tried this solution in Hugging Face but I must have missed something, as I used "\nAnswer: [/INST]". So I tried your exact code and got the same error (I also got it from my own code, as they produce the same tokenized IDs). According to this, it has something to do with a None value being present, but I don't know where that comes from in my case.
Hello @younesbelkada,
@BayesRulez @younesbelkada Shall I remove all the white spaces in my dataset in "Answer: [/INST] "?
Hi @Sosycs

```python
response_template_with_context = "[/INST]"  # We added context here: "\n". This is enough for this tokenizer
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)
data_collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)
```
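For context-sensitive tokenizers, the TRL documentation's variant of this work-around encodes the template together with some preceding context and then slices off the ids the context contributed. A toy helper to illustrate the slicing (the offset of 2 and the example ids are made up; the right offset depends on the tokenizer):

```python
def drop_context_ids(ids_with_context, offset=2):
    """Keep only the ids that belong to the response template itself,
    discarding the leading ids contributed by the added context."""
    return ids_with_context[offset:]

# With made-up ids: if encoding the context plus template yields
# [10, 11, 12, 13] and the context contributed the first two ids,
# drop_context_ids([10, 11, 12, 13]) leaves [12, 13].
```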
@Sosycs can you paste your full code here? The following works for me:
It returns a dict of
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
Hello everyone,
I am in the process of fine-tuning Llama 2 using the SFT trainer with quantization via LoRA.
My dataset is composed of questions structured like:
<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Finally, the ice in glaciers cause abrasion. Pieces of rock embedded in ice at the bottom of a glacier scrape against the rock below. If you have ever collected beach glass or pebbles from a stream, you have witnessed the work of abrasion. Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement Answer: [/INST]
and a column "label" represents the ground truth.
My question about my code is:
do I provide the "label" column to the model for the validation dataset?
Or do I leave it empty? If so, how can I access my model's predictions on the validation set?