
Log Probabilities of the T5 Model #4

Open
dmachapu opened this issue Mar 4, 2024 · 2 comments

Comments

dmachapu commented Mar 4, 2024

Dear Team,

This is Divya. First of all, congratulations on the great work. I am referring to your work to estimate the confidence of a language model, and I am using the Flan-T5 model. I would like to know how to get the log probabilities from the T5 model.

In your repository, under the path "data/predictions/T0_prompts/flan/cos_e", I could see the probabilities and the log probabilities, for example:

{"dataset_name": "cos_e", "dataset_config_name": "v1.11", "template_name": "description_question_option_id", "context_id": "080ef6941410139d6869e78122bc741e", "target": 2, "prediction": 2, "probabilities": [0.00020205100008752197, 0.005021515768021345, 0.9925065636634827, 0.00039718663902021945, 1.838841853896156e-05], "log_probabilities": [-8.506990432739258, -5.294023513793945, -0.007521673105657101, -7.831104278564453, -10.903789520263672]}

As far as I understand, the model does not provide this information in its response. Could you give me an idea of how to derive it?

I would be very thankful to you if you could help me out.

Thanks and Regards,
Divya

@AADeLucia
Member

The HuggingFace generate interface provides the logits when you pass the following options:

with torch.no_grad():
    outputs = model.generate(inputs, return_dict_in_generate=True, output_scores=True)

This returns the logits for all tokens in the output. However, you only want the scores for the answer choices.

Example code (file paths and other details should be changed to match the format above):

from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig
import json
import pandas as pd


def get_dataset_names(filename):
    with open(filename) as fp:
        lines = fp.readlines()
        dataset_names = [line.rstrip() for line in lines]
    return dataset_names


PROMPT_OUTPUT_TYPE = 'MC ID'  # TODO use only rows of dataframe where dataset_df['Output type'] == PROMPT_OUTPUT_TYPE

model_name = "T0"
config = AutoConfig.from_pretrained('bigscience/' + model_name)
model = AutoModelForSeq2SeqLM.from_pretrained('bigscience/' + model_name)
tokenizer = AutoTokenizer.from_pretrained('bigscience/' + model_name)

dataset_df = pd.read_csv('../data/p3_dataset_info.csv')

# dataset_names = get_dataset_names('../data/p3_datasets_mc_id.txt')

for index, row in dataset_df.iterrows():
    predictions_list = []

    # Only getting validation set predictions
    dataset_name = row['Dataset']

    # TODO fix--currently skipping a dataset with no answer choice field
    if dataset_name == 'wiqa_effect_with_label_answer':
        continue

    split = row['Validation split']
    answer_choice_field = row['Answer choice field']
    # other_split = row['Other split']  # TODO

    output_file = '../data/predictions/' + model_name + '_' + PROMPT_OUTPUT_TYPE + '_' + dataset_name + '_' + split + '.json'
    print(f"Getting predictions for {dataset_name} and saving to {output_file}")

    dataset = load_dataset("bigscience/P3", dataset_name)

    counter = 0
    for example in dataset[split]:
        # print(tokenizer.decode(example['targets']))

        with torch.no_grad():
            inputs = torch.tensor(example['inputs'], dtype=torch.long).view(1,-1)
            outputs = model.generate(inputs, return_dict_in_generate=True, output_scores=True)
        text_out = tokenizer.decode(outputs.sequences[0], skip_special_tokens=False)
        scores = outputs['scores']
        scores = scores[0]  # The answer is a single token, so we only need the scores for the first generated position
        output_token = torch.argmax(scores).item()

        options = example[answer_choice_field]
        indices = tokenizer.batch_encode_plus(options, add_special_tokens=False, return_attention_mask=False, return_tensors="pt")['input_ids']

        correct_token = example['targets'][0]
        output_is_correct = output_token == correct_token
        unnormalized_scores = scores[0][indices]  # unnormalized meaning no softmax
        normalized_scores = torch.nn.functional.softmax(scores, dim=-1)[0][indices]  # normalized meaning that softmax was applied

        # Get dictionary of key=token index, value=option
        indices = indices.tolist()
        tokens_to_options_dict = {}
        for option, index in zip(options, indices):
            tokens_to_options_dict[index[0]] = option

        # Get dictionary of key=option, value=token index
        options_to_tokens_dict = {}
        for option, index in zip(options, indices):
            options_to_tokens_dict[option] = index[0]

        # Get dictionary of key=option, value=unnormalized scores
        unnormalized_scores = unnormalized_scores.tolist()
        unnormalized_scores_dict = {}
        for option, score in zip(options, unnormalized_scores):
            unnormalized_scores_dict[option] = score[0]

        # Get dictionary of key=option, value=normalized scores
        normalized_scores = normalized_scores.tolist()
        normalized_scores_dict = {}
        for option, score in zip(options, normalized_scores):
            normalized_scores_dict[option] = score[0]

        correct_option = tokens_to_options_dict[correct_token]
        output_option = tokens_to_options_dict[output_token]

        dictObj = {'dataset': dataset_name, 'options': options, 'tokens_to_options_dict': tokens_to_options_dict, 'options_to_tokens_dict': options_to_tokens_dict, 'unnormalized_scores': unnormalized_scores_dict, 'normalized_scores': normalized_scores_dict, 'correct_option': correct_option, 'output_option': output_option, 'output_is_correct': output_is_correct}
        # print(dictObj)
        predictions_list.append(dictObj)

        counter += 1
        if counter % 100 == 0:
            print(f"{dataset_name}: Finished with {counter} examples")  # TODO /{total_count}

    print(f"Saving predictions to {output_file}...")
    with open(output_file, 'w') as f:
        json.dump(predictions_list, f)
    print(f"Done with {dataset_name}")
    print()
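
If you also want the log_probabilities field shown in the prediction files, one option (a small sketch on top of the script above, not something it currently computes) is to apply log_softmax to the same scores and index the answer-choice tokens, next to the normalized_scores line inside the per-example loop. The log probabilities are just the natural log of the softmax probabilities:

# Sketch only: log probabilities over the answer-choice tokens, to match
# the "log_probabilities" field in the prediction files.
log_scores = torch.nn.functional.log_softmax(scores, dim=-1)[0][indices]
log_scores_dict = {option: s[0] for option, s in zip(options, log_scores.tolist())}

The values line up with the probabilities in the JSON example at the top of the issue, e.g. log(0.99250656...) ≈ -0.00752.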

dmachapu commented Mar 6, 2024

@AADeLucia Thank you so much for the quick and detailed response. I am dealing with a text generation task, so I am still trying to adapt the solution you provided to my task. I will keep you posted in case of any further queries.
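
In case it is useful to anyone else adapting this to generation, my rough plan (a sketch only, assuming a recent transformers version with compute_transition_scores and the google/flan-t5-base checkpoint) is to recover per-token log probabilities from the same generate outputs:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

inputs = tokenizer("Answer the question: what is the capital of France?", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)

# One log probability per generated token (requires transformers >= 4.26)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
print(transition_scores[0].tolist())  # log probs of the generated tokens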

Once again, many thanks for your prompt response.
