Skip to content

Commit

Permalink
add group metric to leaderboard group
Browse files Browse the repository at this point in the history
  • Loading branch information
NathanHB committed Nov 5, 2024
1 parent a0530fa commit d35f589
Showing 1 changed file with 24 additions and 0 deletions.
24 changes: 24 additions & 0 deletions lm_eval/tasks/leaderboard/leaderboard.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,3 +6,27 @@ task:
- leaderboard_math_hard
- leaderboard_ifeval
- leaderboard_musr
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true # defaults to `true`. Set this to `false` to do a "macro" average (taking each subtask's average accuracy, and summing those accuracies and dividing by 3)--by default we do a "micro" average (retain all subtasks' per-document accuracies, and take the mean over all documents' accuracies to get our aggregate mean).
- metric: acc_norm
aggregation: mean
weight_by_size: true # defaults to `true`. Set this to `false` to do a "macro" average (taking each subtask's average accuracy, and summing those accuracies and dividing by 3)--by default we do a "micro" average (retain all subtasks' per-document accuracies, and take the mean over all documents' accuracies to get our aggregate mean).
- metric: exact_match
aggregation: mean
weight_by_size: true # defaults to `true`. Set this to `false` to do a "macro" average (taking each subtask's average accuracy, and summing those accuracies and dividing by 3)--by default we do a "micro" average (retain all subtasks' per-document accuracies, and take the mean over all documents' accuracies to get our aggregate mean).
- metric: inst_level_loose_acc
aggregation: mean
weight_by_size: true # defaults to `true`. Set this to `false` to do a "macro" average (taking each subtask's average accuracy, and summing those accuracies and dividing by 3)--by default we do a "micro" average (retain all subtasks' per-document accuracies, and take the mean over all documents' accuracies to get our aggregate mean).
- metric: inst_level_strict_acc
aggregation: mean
weight_by_size: true # defaults to `true`. Set this to `false` to do a "macro" average (taking each subtask's average accuracy, and summing those accuracies and dividing by 3)--by default we do a "micro" average (retain all subtasks' per-document accuracies, and take the mean over all documents' accuracies to get our aggregate mean).
- metric: prompt_level_loose_acc
aggregation: mean
weight_by_size: true # defaults to `true`. Set this to `false` to do a "macro" average (taking each subtask's average accuracy, and summing those accuracies and dividing by 3)--by default we do a "micro" average (retain all subtasks' per-document accuracies, and take the mean over all documents' accuracies to get our aggregate mean).
- metric: prompt_level_strict_acc
aggregation: mean
weight_by_size: true # defaults to `true`. Set this to `false` to do a "macro" average (taking each subtask's average accuracy, and summing those accuracies and dividing by 3)--by default we do a "micro" average (retain all subtasks' per-document accuracies, and take the mean over all documents' accuracies to get our aggregate mean).
metadata:
version: 1.0

0 comments on commit d35f589

Please sign in to comment.