Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (EleutherAI#1867)

* glianorex tasks
* Create README.md
* Update README.md
* Update README.md
* fix formatting
* fix internal formatting
OpenLLM-France · Jun 5, 2024 · 7257aa2
1 parent 070d31d

Showing 5 changed files with 87 additions and 0 deletions.
diff --git a/lm_eval/tasks/glianorex/README.md b/lm_eval/tasks/glianorex/README.md
@@ -0,0 +1,20 @@
+# Glianorex
+
+The goal of this benchmark is to isolate a model's test-taking ability from its knowledge of the content.
+
+### Paper
+
+Title: Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data
+
+Abstract: https://arxiv.org/abs/2406.02394
+
+To test whether MCQs are a relevant way to assess LLM performance without prior data exposure, we created a fictional medical benchmark and knowledge base about a non-existent gland, the Glianorex. Using GPT-4, we generated a comprehensive textbook on the Glianorex and a set of multiple-choice questions, both in English and in French.
+
+### Tasks
+
+All tasks are multiple-choice questions with 4 options, exactly one of which is correct.
+
+- `glianorex`: Evaluates the accuracy on the full, unfiltered dataset, which covers the questions of both tasks listed below.
+
+- `glianorex_en`: Evaluates the accuracy on 264 questions in English.
+- `glianorex_fr`: Evaluates the accuracy on 264 questions in French.
diff --git a/lm_eval/tasks/glianorex/glianorex.yaml b/lm_eval/tasks/glianorex/glianorex.yaml
@@ -0,0 +1,14 @@
+task: glianorex
+dataset_path: maximegmd/glianorex
+output_type: multiple_choice
+test_split: train
+doc_to_text: !function preprocess_glianorex.doc_to_text
+doc_to_target: !function preprocess_glianorex.doc_to_target
+doc_to_choice: [ 'A', 'B', 'C', 'D' ]
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
diff --git a/lm_eval/tasks/glianorex/glianorex_en.yaml b/lm_eval/tasks/glianorex/glianorex_en.yaml
@@ -0,0 +1,15 @@
+task: glianorex_en
+dataset_path: maximegmd/glianorex
+output_type: multiple_choice
+test_split: train
+doc_to_text: !function preprocess_glianorex.doc_to_text
+doc_to_target: !function preprocess_glianorex.doc_to_target
+process_docs: !function preprocess_glianorex.filter_english
+doc_to_choice: [ 'A', 'B', 'C', 'D' ]
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
diff --git a/lm_eval/tasks/glianorex/glianorex_fr.yaml b/lm_eval/tasks/glianorex/glianorex_fr.yaml
@@ -0,0 +1,15 @@
+task: glianorex_fr
+dataset_path: maximegmd/glianorex
+output_type: multiple_choice
+test_split: train
+doc_to_text: !function preprocess_glianorex.doc_to_text
+doc_to_target: !function preprocess_glianorex.doc_to_target
+process_docs: !function preprocess_glianorex.filter_french
+doc_to_choice: [ 'A', 'B', 'C', 'D' ]
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
diff --git a/lm_eval/tasks/glianorex/preprocess_glianorex.py b/lm_eval/tasks/glianorex/preprocess_glianorex.py
@@ -0,0 +1,23 @@
+import datasets
+
+
+def doc_to_text(doc) -> str:
+    option_choices = doc["options"]
+    answers = "".join((f"{k}. {v}\n") for k, v in option_choices.items())
+    return f"Question: {doc['question']}\n{answers}Answer:"
+
+
+def doc_to_target(doc) -> int:
+    return doc["answer_idx"]
+
+
+def filter_dataset(dataset: datasets.Dataset, lang: str) -> datasets.Dataset:
+    return dataset.filter(lambda example: example["language"].startswith(lang))
+
+
+def filter_french(dataset: datasets.Dataset) -> datasets.Dataset:
+    return filter_dataset(dataset, "fr")
+
+
+def filter_english(dataset: datasets.Dataset) -> datasets.Dataset:
+    return filter_dataset(dataset, "en")
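
To make the prompt format concrete, here is a minimal sketch of what `doc_to_text` produces and how the language filter decides which records to keep. The record below is hypothetical: the field names (`question`, `options`, `language`) are the ones read by `preprocess_glianorex.py`, but the question text, option texts, and the `"en"` language value are invented for illustration. `matches_language` is a hypothetical stand-in for the lambda passed to `dataset.filter`, so the sketch runs without the `datasets` package.

```python
# Hypothetical record: field names follow preprocess_glianorex.py,
# all values are invented for illustration only.
sample_doc = {
    "question": "Example question text?",
    "options": {
        "A": "first option",
        "B": "second option",
        "C": "third option",
        "D": "fourth option",
    },
    "language": "en",  # assumed value; the real dataset may use other codes
}


def doc_to_text(doc) -> str:
    # Same formatting as in the diff: one "KEY. value" line per option,
    # followed by a trailing "Answer:" cue for the model to complete.
    option_choices = doc["options"]
    answers = "".join(f"{k}. {v}\n" for k, v in option_choices.items())
    return f"Question: {doc['question']}\n{answers}Answer:"


def matches_language(doc, lang: str) -> bool:
    # Hypothetical helper mirroring the predicate used in filter_dataset.
    return doc["language"].startswith(lang)


prompt = doc_to_text(sample_doc)
print(prompt)
print(matches_language(sample_doc, "en"))
```

Because the predicate uses `startswith`, a language value with a region suffix would still match its base code, which may be why the helper was written this way rather than with an equality check.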