# update livemathbench-judge #3

**Merged** · 1 commit · Jan 10, 2025
**README.md** · 7 changes: 5 additions & 2 deletions

````diff
@@ -16,8 +16,9 @@
 [📚[LeaderBoard](https://github.com/open-compass/GPassK/index.html)] -->
 
 ## 🚀 News
+- **[2025.1.10]** 🔥 We release a small-scale judge model [LiveMath-Judge](https://huggingface.co/jnanliu/LiveMath-Judge).
 - **[2025.1.6]** 🔥 **[LiveMathBench](https://huggingface.co/datasets/opencompass/LiveMathBench)** now can be accessed through hugginface, and you can now evaluate your LLMs on it using G-Pass@k in OpenCompass. We have addressed potential errors in LiveMathBench and inconsistencies in the sampling parameters. Please also refer to our updated version of the **[Paper](http://arxiv.org/abs/2412.13147)** for further details.
-- **[2024.12.18]** We release the **[ArXiv Paper](http://arxiv.org/abs/2412.13147)** of G-Pass@k. 🎉🎉🎉
+- **[2024.12.18]** 🎉 We release the **[ArXiv Paper](http://arxiv.org/abs/2412.13147)** of G-Pass@k.
 
 
 ## ☀️Introduction
@@ -87,10 +88,12 @@ lmdeploy serve api_server Qwen/Qwen2.5-72B-Instruct --server-port 8000 \
     --cache-max-entry-count 0.9 \
     --log-level INFO
 ```
-After setting up the judge model, define the URLs in the `eval_urls` within `opencompass_config_templates/*.py`. Adjust other parameters such as `k`, `temperatures`, `llm_infos`, and other params according to your needs.
+After setting up the judge model, define the URLs in the `eval_urls` and `eval_model_name` within `opencompass_config_templates/*.py`. Adjust other parameters such as `k`, `temperatures`, `llm_infos`, and other params according to your needs.
 
 > ❗️Note that omitting `eval_urls` will default to an internal rule-based judge, which might only apply to datasets with numerical answers
 
+> 💡Now you can use the [LiveMath-Judge](https://huggingface.co/jnanliu/LiveMath-Judge) for judging, which greatly reduces deploy and inference costs.
+
 ### 4. Evaluation
 
 To begin the evaluation, first generate the necessary configuration files by running the following script:
````
**opencompass_config_templates/nono1.py** · 3 changes: 2 additions & 1 deletion

```diff
@@ -21,6 +21,7 @@
 eval_urls = [
     # Put your judge model urls urls here
 ]
+eval_model_name = 'Qwen/Qwen2.5-72B-Instruct'
 
 
 llm_infos = [
@@ -81,7 +82,7 @@
     abbr=f'LiveMathBench-v{version}_k{"-".join(map(str, [k] if isinstance(k, int) else k))}_r{replication}'
 ))
 livemathbench_dataset['eval_cfg']['evaluator'].update(dict(
-    model_name='Qwen/Qwen2.5-72B-Instruct',
+    model_name=eval_model_name,
     url=eval_urls,
     k=k,
     replication=replication
```
**opencompass_config_templates/o1.py** · 3 changes: 2 additions & 1 deletion

```diff
@@ -21,6 +21,7 @@
 eval_urls = [
     # Put your judge model urls urls here
 ]
+eval_model_name = 'Qwen/Qwen2.5-72B-Instruct'
 
 
 llm_infos = [
@@ -69,7 +70,7 @@
     abbr=f'LiveMathBench-v{version}-k{"_".join(map(str, [k] if isinstance(k, int) else k))}-r{replication}'
 ))
 livemathbench_dataset['eval_cfg']['evaluator'].update(dict(
-    model_name='Qwen/Qwen2.5-72B-Instruct',
+    model_name=eval_model_name,
     url=eval_urls,
     k=k,
     replication=replication
```
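Both template diffs make the same change: the judge model name is hoisted into an `eval_model_name` variable next to `eval_urls`, instead of being hard-coded inside the evaluator config. A minimal sketch of the resulting pattern follows; `livemathbench_dataset` is a stand-in plain dict here (the real templates build a full OpenCompass dataset config), and the `k`/`replication` values are illustrative, not taken from the PR.

```python
# Pattern introduced by this PR: define the judge endpoint and model name
# once at the top of the template, then reuse them in the evaluator config.
eval_urls = []  # put your judge model URLs here
eval_model_name = 'Qwen/Qwen2.5-72B-Instruct'

# Stand-in for the dataset config the real templates construct.
livemathbench_dataset = {'eval_cfg': {'evaluator': {}}}

livemathbench_dataset['eval_cfg']['evaluator'].update(dict(
    model_name=eval_model_name,  # was a hard-coded string before this PR
    url=eval_urls,
    k=16,           # illustrative value
    replication=3,  # illustrative value
))

print(livemathbench_dataset['eval_cfg']['evaluator']['model_name'])
# → Qwen/Qwen2.5-72B-Instruct
```

With the name in one variable, switching every evaluator to the smaller judge is a one-line edit (e.g. pointing `eval_model_name` at LiveMath-Judge) rather than a search-and-replace across dataset configs.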