Correctly Print Task Versioning (EleutherAI#1173)
* change version field formatting in metadata

* mention versioning in new task guide

* add instructions for changelog

* run linters
haileyschoelkopf authored Dec 21, 2023
1 parent a0cfe3f commit 9cd7989
Showing 128 changed files with 150 additions and 125 deletions.
19 changes: 19 additions & 0 deletions docs/new_task_guide.md
@@ -315,6 +315,25 @@ python -m scripts.write_out \
Open the file specified at the `--output_base_path <path>` and ensure it passes
a simple eye test.

## Versioning

One key feature of the LM Evaluation Harness is the ability to version tasks: each task carries a version number that is bumped whenever a breaking change is made.

This version info can be provided by adding the following to your new task config file:

```
metadata:
  version: 0
```
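
As a rough illustration of why the shape of this field matters, the sketch below compares the mapping form shown above with a one-item list form (`- version: 0`). The dictionaries are written by hand to mirror what a YAML parser would produce, and `get_task_version` is a hypothetical helper, not the harness's actual lookup code.

```python
from typing import Any, Optional


def get_task_version(config: dict[str, Any]) -> Optional[Any]:
    """Return metadata.version when metadata is a mapping, else None.

    Hypothetical helper for illustration only; not part of lm_eval.
    """
    metadata = config.get("metadata")
    if isinstance(metadata, dict):
        return metadata.get("version")
    return None


# Hand-written equivalent of "metadata:\n  version: 0" (a mapping).
mapping_style = {"metadata": {"version": 0}}

# Hand-written equivalent of "metadata:\n  - version: 0" (a one-item list).
list_style = {"metadata": [{"version": 0}]}

print(get_task_version(mapping_style))  # 0
print(get_task_version(list_style))     # None
```

With the mapping form, the version is reachable by a plain key lookup; with the list form, a lookup written this way finds nothing.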

Whenever a change is made to your task in the future, please increase the version number by 1 so that users can tell the different versions of the task apart.

If you are incrementing a task's version, please also consider adding a changelog to the task's README.md noting the date, PR number, what version you have updated to, and a one-liner describing the change.

For example:

* [Dec 25, 2023] (PR #999) Version 0.0 -> 1.0: Fixed a bug with answer extraction that led to underestimated performance.
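
If it helps to keep entries uniform, a changelog line in this shape could be generated with a small script. This is only a sketch: `bump_version` and `format_changelog_entry` are hypothetical names, not harness utilities, and the date, PR number, and summary are the placeholder values from the example above.

```python
from datetime import date

# Month abbreviations are spelled out so the output does not depend on locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def bump_version(version: float) -> float:
    """Increase a task version by 1 (hypothetical helper)."""
    return version + 1


def format_changelog_entry(day: date, pr_number: int, old_version: float,
                           new_version: float, summary: str) -> str:
    """Build one README changelog bullet: date, PR, versions, one-liner."""
    stamp = f"{MONTHS[day.month - 1]} {day.day}, {day.year}"
    return (f"* [{stamp}] (PR #{pr_number}) "
            f"Version {old_version} -> {new_version}: {summary}")


old = 0.0
new = bump_version(old)
entry = format_changelog_entry(
    date(2023, 12, 25),
    999,
    old,
    new,
    "Fixed a bug with answer extraction that led to underestimated performance.",
)
print(entry)
```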

## Checking performance + equivalence

It's now time to check models' performance on your task! In the evaluation harness, we intend to support a wide range of evaluation tasks and setups, but prioritize the inclusion of already-proven benchmarks following the precise evaluation setups in the literature where possible.
2 changes: 1 addition & 1 deletion lm_eval/tasks/anli/anli_r1.yaml
@@ -23,4 +23,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/arc/arc_easy.yaml
@@ -20,4 +20,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/arithmetic/arithmetic_1dc.yaml
@@ -13,4 +13,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/asdiv/default.yaml
@@ -11,4 +11,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/babi/babi.yaml
@@ -17,4 +17,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/bbh/cot_fewshot/_cot_fewshot_template_yaml
@@ -27,4 +27,4 @@ filter_list:
- function: "take_first"
num_fewshot: 0
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/bbh/cot_zeroshot/_cot_zeroshot_template_yaml
@@ -24,4 +24,4 @@ filter_list:
- function: "take_first"
num_fewshot: 0
metadata:
- version: 0
version: 0
2 changes: 1 addition & 1 deletion lm_eval/tasks/bbh/fewshot/_fewshot_template_yaml
@@ -18,4 +18,4 @@ generation_kwargs:
temperature: 0.0
num_fewshot: 0
metadata:
- version: 0
version: 0
2 changes: 1 addition & 1 deletion lm_eval/tasks/bbh/zeroshot/_zeroshot_template_yaml
@@ -18,4 +18,4 @@ generation_kwargs:
temperature: 0.0
num_fewshot: 0
metadata:
- version: 0
version: 0
2 changes: 1 addition & 1 deletion lm_eval/tasks/belebele/_default_template_yaml
@@ -18,4 +18,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/bigbench/generate_until_template_yaml
@@ -15,4 +15,4 @@ metric_list:
higher_is_better: true
ignore_punctuation: true
metadata:
- version: 0.0
version: 0.0
4 changes: 4 additions & 0 deletions lm_eval/tasks/bigbench/multiple_choice/causal_judgement.yaml
@@ -0,0 +1,4 @@
# Generated by utils.py
dataset_name: causal_judgment_zero_shot
include: ../multiple_choice_template_yaml
task: bigbench_causal_judgement_multiple_choice
2 changes: 1 addition & 1 deletion lm_eval/tasks/bigbench/multiple_choice_template_yaml
@@ -12,4 +12,4 @@ metric_list:
- metric: acc
# TODO: brier score and other metrics
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/blimp/_template_yaml
@@ -11,4 +11,4 @@ doc_to_decontamination_query: "{{sentence_good}} {{sentence_bad}}"
metric_list:
- metric: acc
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/ceval/_default_ceval_yaml
@@ -16,4 +16,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/cmmlu/_default_template_yaml
@@ -16,4 +16,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/code_x_glue/code-text/go.yaml
@@ -18,4 +18,4 @@ metric_list:
aggregation: mean
higher_is_better: True
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/code_x_glue/code-text/java.yaml
@@ -18,4 +18,4 @@ metric_list:
aggregation: mean
higher_is_better: True
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/code_x_glue/code-text/javascript.yaml
@@ -18,4 +18,4 @@ metric_list:
aggregation: mean
higher_is_better: True
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/code_x_glue/code-text/php.yaml
@@ -18,4 +18,4 @@ metric_list:
aggregation: mean
higher_is_better: True
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/code_x_glue/code-text/python.yaml
@@ -18,4 +18,4 @@ metric_list:
aggregation: mean
higher_is_better: True
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/code_x_glue/code-text/ruby.yaml
@@ -18,4 +18,4 @@ metric_list:
aggregation: mean
higher_is_better: True
metadata:
- version: 2.0
version: 2.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/coqa/default.yaml
@@ -19,4 +19,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 2.0
version: 2.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/crows_pairs/crows_pairs_english.yaml
@@ -20,4 +20,4 @@ metric_list:
aggregation: mean
higher_is_better: false
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/csatqa/_default_csatqa_yaml
@@ -14,4 +14,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/drop/default.yaml
@@ -21,4 +21,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 2.0
version: 2.0
2 changes: 2 additions & 0 deletions lm_eval/tasks/fld/fld_default.yaml
@@ -12,3 +12,5 @@ metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
metadata:
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/glue/cola/default.yaml
@@ -13,4 +13,4 @@ doc_to_decontamination_query: sentence
metric_list:
- metric: mcc
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/glue/mnli/default.yaml
@@ -11,4 +11,4 @@ doc_to_choice: ["True", "Neither", "False"]
metric_list:
- metric: acc
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/glue/mrpc/default.yaml
@@ -12,4 +12,4 @@ metric_list:
- metric: acc
- metric: f1
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/glue/qnli/default.yaml
@@ -11,4 +11,4 @@ doc_to_choice: ["yes", "no"]
metric_list:
- metric: acc
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/glue/qqp/default.yaml
@@ -12,4 +12,4 @@ metric_list:
- metric: acc
- metric: f1
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/glue/rte/default.yaml
@@ -11,4 +11,4 @@ doc_to_choice: ["True", "False"]
metric_list:
- metric: acc
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/glue/sst2/default.yaml
@@ -11,4 +11,4 @@ doc_to_choice: ["negative", "positive"]
metric_list:
- metric: acc
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/glue/wnli/default.yaml
@@ -11,4 +11,4 @@ doc_to_choice: ["False", "True"]
metric_list:
- metric: acc
metadata:
- version: 2.0
version: 2.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml
@@ -31,4 +31,4 @@ filter_list:
- function: "majority_vote"
- function: "take_first"
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/gsm8k/gsm8k-cot.yaml
@@ -41,4 +41,4 @@ filter_list:
regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)."
- function: "take_first"
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/gsm8k/gsm8k.yaml
@@ -34,4 +34,4 @@ filter_list:
regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
- function: "take_first"
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/headqa/headqa_en.yaml
@@ -20,4 +20,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/hellaswag/hellaswag.yaml
@@ -19,4 +19,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/hendrycks_ethics/commonsense.yaml
@@ -12,4 +12,4 @@ doc_to_choice: ['no', 'yes']
metric_list:
- metric: acc
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/hendrycks_ethics/deontology.yaml
@@ -5,5 +5,5 @@ doc_to_text: "Question: Would most people believe this reasonable or unreasonabl
doc_to_target: label
doc_to_choice: ['unreasonable', 'reasonable']
metadata:
- version: 1.0
version: 1.0
# TODO: implement exact-match metric for this subset
2 changes: 1 addition & 1 deletion lm_eval/tasks/hendrycks_ethics/justice.yaml
@@ -6,4 +6,4 @@ dataset_name: justice
doc_to_text: "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}}\"\nAnswer:"
# TODO: impl. exact match for this and deontology
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/hendrycks_ethics/utilitarianism.yaml
@@ -9,4 +9,4 @@ doc_to_choice: ['no', 'yes']
metric_list:
- metric: acc
metadata:
- version: 1.0
version: 1.0
@@ -13,4 +13,4 @@
# - metric: acc
# TODO: we want this to be implemented as a winograd_schema task type, actually
# metadata:
# - version: 1.0
# version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/hendrycks_ethics/virtue.yaml
@@ -7,4 +7,4 @@ doc_to_text: "Sentence: {{scenario}}\nQuestion: Does the character in this sente
doc_to_target: label
doc_to_choice: ['no', 'yes']
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/ifeval/ifeval.yaml
@@ -26,4 +26,4 @@ metric_list:
aggregation: !function utils.agg_inst_level_acc
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/lambada/lambada_openai.yaml
@@ -17,4 +17,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/lambada/lambada_standard.yaml
@@ -18,4 +18,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/lambada_cloze/lambada_openai_cloze.yaml
@@ -17,4 +17,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/lambada_cloze/lambada_standard_cloze.yaml
@@ -18,4 +18,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/lambada_multilingual/lambada_mt_en.yaml
@@ -17,4 +17,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/logiqa/logiqa.yaml
@@ -18,4 +18,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/logiqa2/logieval.yaml
@@ -24,4 +24,4 @@ filter_list:
regex_pattern: "^\\s*([A-D])"
- function: "take_first"
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/logiqa2/logiqa2.yaml
@@ -18,4 +18,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/mathqa/mathqa.yaml
@@ -19,4 +19,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/mc_taco/default.yaml
@@ -12,4 +12,4 @@ metric_list:
- metric: acc
- metric: f1
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/mgsm/direct/direct_yaml
@@ -26,4 +26,4 @@ metric_list:
ignore_case: true
ignore_punctuation: true
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/mgsm/en_cot/cot_yaml
@@ -28,4 +28,4 @@ filter_list:
regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
- function: "take_first"
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/mgsm/native_cot/cot_yaml
@@ -28,4 +28,4 @@ filter_list:
regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
- function: "take_first"
metadata:
- version: 1.0
version: 1.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/minerva_math/minerva_math_algebra.yaml
@@ -21,4 +21,4 @@ metric_list:
higher_is_better: true
num_fewshot: 0
metadata:
- version: 0.0
version: 0.0
2 changes: 1 addition & 1 deletion lm_eval/tasks/mmlu/default/_default_template_yaml
@@ -12,4 +12,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
- version: 0.0
version: 0.0