diff --git a/lm_eval/tasks/README.md b/lm_eval/tasks/README.md
index 6a7847b108..bb04d4f279 100644
--- a/lm_eval/tasks/README.md
+++ b/lm_eval/tasks/README.md
@@ -26,6 +26,7 @@
 | [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
 | [cmmlu](cmmlu/README.md) | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese |
 | code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby |
+| [commonsense_qa](commonsense_qa/README.md) | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. | English |
 | [copal_id](copal_id/README.md) | Indonesian causal commonsense reasoning dataset that captures local nuances. | Indonesian |
 | [coqa](coqa/README.md) | Conversational question answering tasks to test dialog understanding. | English |
 | [crows_pairs](crows_pairs/README.md) | Tasks designed to test model biases in various sociodemographic groups. | English, French |
diff --git a/lm_eval/tasks/commonsense_qa/README.md b/lm_eval/tasks/commonsense_qa/README.md
new file mode 100644
index 0000000000..94ef87a57a
--- /dev/null
+++ b/lm_eval/tasks/commonsense_qa/README.md
@@ -0,0 +1,60 @@
+# CommonsenseQA
+
+### Paper
+
+Title: `COMMONSENSEQA: A Question Answering Challenge Targeting
+Commonsense Knowledge`
+
+Abstract: https://arxiv.org/pdf/1811.00937.pdf
+
+CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers.
+It contains 12,102 questions with one correct answer and four distractor answers.
+
+Homepage: https://www.tau-nlp.org/commonsenseqa
+
+
+### Citation
+
+```
+@inproceedings{talmor-etal-2019-commonsenseqa,
+    title = "{C}ommonsense{QA}: A Question Answering Challenge Targeting Commonsense Knowledge",
+    author = "Talmor, Alon  and
+      Herzig, Jonathan  and
+      Lourie, Nicholas  and
+      Berant, Jonathan",
+    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
+    month = jun,
+    year = "2019",
+    address = "Minneapolis, Minnesota",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/N19-1421",
+    doi = "10.18653/v1/N19-1421",
+    pages = "4149--4158",
+    archivePrefix = "arXiv",
+    eprint = "1811.00937",
+    primaryClass = "cs",
+}
+```
+
+### Groups and Tasks
+
+#### Groups
+
+* Not part of a group yet.
+
+#### Tasks
+
+* `commonsense_qa`: Represents the "random" split from the paper. Uses an MMLU-style prompt, as (presumably) used by Llama evaluations.
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [x] Is the task an existing benchmark in the literature?
+  * [x] Have you referenced the original paper that introduced the task?
+  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
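For context on the schema the config below relies on, each record exposes `question`, `choices['text']`, and `answerKey`. A minimal sketch of inspecting those fields, assuming the Hugging Face `datasets` library is installed (printed values are illustrative):

```python
# Inspect the record schema the commonsense_qa task config depends on.
# Field names follow the tau/commonsense_qa dataset used in default.yaml.
from datasets import load_dataset

ds = load_dataset("tau/commonsense_qa", split="validation")
doc = ds[0]

print(doc["question"])           # free-form question string
print(doc["choices"]["label"])   # ['A', 'B', 'C', 'D', 'E']
print(doc["choices"]["text"])    # five candidate answers, one correct
print(doc["answerKey"])          # gold label, e.g. 'A'
```

The test split ships without labels, which is why the config below evaluates on `validation`.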
diff --git a/lm_eval/tasks/commonsense_qa/default.yaml b/lm_eval/tasks/commonsense_qa/default.yaml
new file mode 100644
index 0000000000..31d31b0125
--- /dev/null
+++ b/lm_eval/tasks/commonsense_qa/default.yaml
@@ -0,0 +1,12 @@
+task: commonsense_qa
+dataset_path: tau/commonsense_qa
+training_split: train
+validation_split: validation
+output_type: multiple_choice
+doc_to_text: "Question: {{ question.strip() }}\nA. {{choices['text'][0]}}\nB. {{choices['text'][1]}}\nC. {{choices['text'][2]}}\nD. {{choices['text'][3]}}\nE. {{choices['text'][4]}}\nAnswer:"
+doc_to_target: answerKey
+doc_to_choice: ['A', 'B', 'C', 'D', 'E']
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
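To make the "MMLU-style prompt" concrete: `doc_to_text` is a Jinja2 template, so rendering it against a record yields the exact prompt the model sees. A minimal sketch, where the sample record is illustrative rather than pulled from the dataset:

```python
# Render the doc_to_text template from default.yaml against a sample doc.
# Assumes jinja2 is installed; the sample record is made up for illustration.
from jinja2 import Template

DOC_TO_TEXT = (
    "Question: {{ question.strip() }}\n"
    "A. {{choices['text'][0]}}\nB. {{choices['text'][1]}}\n"
    "C. {{choices['text'][2]}}\nD. {{choices['text'][3]}}\n"
    "E. {{choices['text'][4]}}\nAnswer:"
)

doc = {
    "question": "Where would you expect to find a pizzeria while shopping?",
    "choices": {"text": ["chicago", "street", "little italy",
                         "food court", "capital cities"]},
    "answerKey": "D",
}

print(Template(DOC_TO_TEXT).render(**doc))
# Question: Where would you expect to find a pizzeria while shopping?
# A. chicago
# B. street
# C. little italy
# D. food court
# E. capital cities
# Answer:
```

With `output_type: multiple_choice`, the harness scores each letter in `doc_to_choice` as a continuation of this prompt, and `doc_to_target: answerKey` selects the gold letter for the accuracy metric.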