MultiMedQA (EleutherAI#1198)
* multimedqa

* Update medqa.yaml

* move to benchmarks folder

* add README.md

---------

Co-authored-by: Lintang Sutawika <[email protected]>
tmabraham and lintangsutawika authored Jan 11, 2024
1 parent 692e0f8 commit 818c056
Showing 6 changed files with 115 additions and 0 deletions.
43 changes: 43 additions & 0 deletions lm_eval/tasks/benchmarks/multimedqa/README.md
@@ -0,0 +1,43 @@
# MultiMedQA (multiple-choice subset)

### Paper

Title: Large Language Models Encode Clinical Knowledge

Abstract: https://arxiv.org/abs/2212.13138

A benchmark combining four existing multiple-choice question answering datasets spanning professional medical exams and research queries.

### Citation

```
@Article{Singhal2023,
  author={Singhal, Karan and Azizi, Shekoofeh and Tu, Tao and Mahdavi, S. Sara and Wei, Jason and Chung, Hyung Won and Scales, Nathan and Tanwani, Ajay and Cole-Lewis, Heather and Pfohl, Stephen and Payne, Perry and Seneviratne, Martin and Gamble, Paul and Kelly, Chris and Babiker, Abubakr and Sch{\"a}rli, Nathanael and Chowdhery, Aakanksha and Mansfield, Philip and Demner-Fushman, Dina and Ag{\"u}era y Arcas, Blaise and Webster, Dale and Corrado, Greg S. and Matias, Yossi and Chou, Katherine and Gottweis, Juraj and Tomasev, Nenad and Liu, Yun and Rajkomar, Alvin and Barral, Joelle and Semturs, Christopher and Karthikesalingam, Alan and Natarajan, Vivek},
  title={Large language models encode clinical knowledge},
  journal={Nature},
  year={2023},
  month={Aug},
  day={01},
  volume={620},
  number={7972},
  pages={172-180},
  issn={1476-4687},
  doi={10.1038/s41586-023-06291-2},
  url={https://doi.org/10.1038/s41586-023-06291-2}
}
```

### Tasks

* [PubMedQA](https://pubmedqa.github.io/) - 1,000 expert-labeled Q&A pairs in which a question and a corresponding PubMed abstract are given as context, and a yes/maybe/no answer must be produced. Unlike the rest of the tasks in this suite, PubMedQA is a closed-domain Q&A task.
* [MedQA](https://github.com/jind11/MedQA) - US Medical Licensing Examination (USMLE) questions with 4 or 5 possible answers. Typically, only the 4-option questions are used.
* [MedMCQA](https://medmcqa.github.io/) - 4-option multiple-choice questions from Indian medical entrance examinations, >191k total questions.
* [MMLU](https://arxiv.org/abs/2009.03300) - 4-option multiple-choice exam questions from a variety of domains. The following six medical domains are used here:
* Anatomy
* Clinical Knowledge
* College Medicine
* Medical Genetics
* Professional Medicine
* College Biology

Note that MultiMedQA also includes some short-form and long-form Q&A tasks (LiveQA, MedicationQA, HealthSearchQA). Those are typically graded by human experts rather than automatically, so they are excluded here.
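After this change, the whole suite can be scored as a single group. Below is a minimal sketch, assuming the harness's `simple_evaluate` entry point; the model name and few-shot setting are placeholders, not recommendations:

```python
# A sketch of running the multimedqa group programmatically.
# Model choice and num_fewshot here are illustrative only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-1b",  # any HF causal LM
    tasks=["multimedqa"],  # the group defined in multimedqa.yaml
    num_fewshot=0,
)

# Per-task metrics (acc / acc_norm) land under results["results"].
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```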
11 changes: 11 additions & 0 deletions lm_eval/tasks/benchmarks/multimedqa/multimedqa.yaml
@@ -0,0 +1,11 @@
group: multimedqa
task:
  - pubmedqa
  - medmcqa
  - medqa_4options
  - mmlu_anatomy
  - mmlu_clinical_knowledge
  - mmlu_college_medicine
  - mmlu_medical_genetics
  - mmlu_professional_medicine
  - mmlu_college_biology
18 changes: 18 additions & 0 deletions lm_eval/tasks/medmcqa/medmcqa.yaml
@@ -0,0 +1,18 @@
task: medmcqa
dataset_path: medmcqa
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: validation
doc_to_text: !function utils_medmcqa.doc_to_text
doc_to_target: cop
doc_to_choice: [ 'A','B','C','D' ]
should_decontaminate: true
doc_to_decontamination_query: "{{question}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
19 changes: 19 additions & 0 deletions lm_eval/tasks/medmcqa/utils_medmcqa.py
@@ -0,0 +1,19 @@
# Copied from Master
def doc_to_text(doc) -> str:
    """
    Question: <question>
    Choices:
    A. <choice1>
    B. <choice2>
    C. <choice3>
    D. <choice4>
    Answer:
    """
    # Map each option letter directly to its answer text.
    option_choices = {
        "A": doc["opa"],
        "B": doc["opb"],
        "C": doc["opc"],
        "D": doc["opd"],
    }

    prompt = "Question: " + doc["question"] + "\nChoices:\n"
    for choice, option in option_choices.items():
        prompt += f"{choice}. {option}\n"
    prompt += "Answer:"
    return prompt
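As a sanity check, the formatter can be run on a hand-written record. The sample below is invented, but its field names (`question`, `opa`-`opd`) match the MedMCQA schema the function reads:

```python
# Quick check of the prompt format with a made-up MedMCQA-style record.
sample = {
    "question": "Which vitamin deficiency causes scurvy?",
    "opa": "Vitamin A",
    "opb": "Vitamin B12",
    "opc": "Vitamin C",
    "opd": "Vitamin D",
}
print(doc_to_text(sample))
# Question: Which vitamin deficiency causes scurvy?
# Choices:
# A. Vitamin A
# B. Vitamin B12
# C. Vitamin C
# D. Vitamin D
# Answer:
```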
16 changes: 16 additions & 0 deletions lm_eval/tasks/medqa/medqa.yaml
@@ -0,0 +1,16 @@
task: medqa_4options
dataset_path: GBaker/MedQA-USMLE-4-options-hf
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function preprocess_medqa.doc_to_text
doc_to_target: !function preprocess_medqa.doc_to_target
doc_to_choice: [ 'A', 'B', 'C', 'D' ]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
8 changes: 8 additions & 0 deletions lm_eval/tasks/medqa/preprocess_medqa.py
@@ -0,0 +1,8 @@
def doc_to_text(doc) -> str:
    # Option letters mapped to the four candidate endings from the dataset.
    option_choices = {
        "A": doc["ending0"],
        "B": doc["ending1"],
        "C": doc["ending2"],
        "D": doc["ending3"],
    }
    answers = "".join(f"{k}. {v}\n" for k, v in option_choices.items())
    return f"Question: {doc['sent1']}\n{answers}Answer:"


def doc_to_target(doc) -> int:
    # `label` holds the index (0-3) of the correct option.
    return doc["label"]
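The same kind of spot check works here. The record below is invented, but the field names (`sent1`, `ending0`-`ending3`, `label`) follow the `GBaker/MedQA-USMLE-4-options-hf` schema referenced in medqa.yaml:

```python
# Illustrative check with a made-up MedQA-style record.
sample = {
    "sent1": "A 25-year-old man presents with fatigue. What is the next step?",
    "ending0": "Order a complete blood count",
    "ending1": "Start empiric antibiotics",
    "ending2": "Refer for immediate surgery",
    "ending3": "Reassure and discharge",
    "label": 0,
}
print(doc_to_text(sample))   # Question: ... / A.-D. options / Answer:
print(doc_to_target(sample)) # 0, i.e. choice 'A' per doc_to_choice
```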
