- Official code repository for the paper *Can LLMs Solve Longer Math Word Problems Better?*
- The Extended Grade-School Math (E-GSM) benchmark is an arithmetic reasoning dataset built upon GSM8K by extending the problem descriptions into longer ones.
- E-GSM is constructed to evaluate the Context Length Generalizability (CoLeG) of LLMs, i.e., their ability to solve long math word problems.
- For proprietary LLMs, we introduce Condition-Retrieving Instruction (CoRe), an instructional prompt.
- For open-source LLMs, we suggest incorporating extension as an auxiliary fine-tuning task, and we release our SFT data.
Clone CoLeG-Math and install the required packages:
git clone https://github.com/XinXU-USTC/CoLeG-Math.git
cd CoLeG-Math
pip install -r requirements.txt
For vLLM installation problems, please refer to the vLLM documentation.
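As a quick sanity check that vLLM installed correctly (our suggestion, not part of the original setup), you can try importing it:

```python
# Sanity check: vLLM imports and reports its version
import vllm
print(vllm.__version__)
```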
For now, E-GSM and our SFT data are under the ./data folder. Hugging Face link: coming soon...
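E-GSM is distributed as JSONL files (e.g., data/E-GSM/Q1.jsonl, the path used in the evaluation command further below). Here is a minimal sketch for inspecting a split; the exact fields of each record are not documented here, so the snippet only prints the keys:

```python
import json

# Load one E-GSM split from the ./data folder (path taken from the eval command below)
with open("data/E-GSM/Q1.jsonl", "r", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

print(f"{len(examples)} problems loaded")
print(examples[0].keys())  # inspect the record schema before writing any custom code
```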
For proprietary LLMs, you need to put your API key in proprietary-llms/api_keys.py
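The variable names that main.py expects in api_keys.py are not documented here; the following is only an assumed sketch of a minimal key file, so check how the module is imported before copying it:

```python
# proprietary-llms/api_keys.py (sketch; the variable name below is an assumption,
# verify how main.py actually reads keys from this module)
OPENAI_API_KEY = "your-openai-key"
```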
cd proprietary-llms
python3 main.py config.yaml
or
cd proprietary-llms
bash ../scripts/eval_proprietary.sh
or
python3 main.py \
--llm gpt-3.5-turbo-0125 \
--n 1 \
--top_p 0.7 \
--temperature 0.0 \
--max_tokens 1024 \
--prompt_name zero-shot-cot \
--generate_log_file \
--use_core_instruction \
--dataset_filepath /path/to/datafile \
--output_filepath /path/to/save
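The config.yaml passed to main.py in the first variant presumably mirrors these command-line arguments; the sketch below assumes the keys match the flag names one-to-one, which may not be exactly how the repository's config is structured:

```yaml
# Assumed config.yaml layout; keys are taken from the CLI flags above
llm: gpt-3.5-turbo-0125
n: 1
top_p: 0.7
temperature: 0.0
max_tokens: 1024
prompt_name: zero-shot-cot
generate_log_file: true
use_core_instruction: true
dataset_filepath: /path/to/datafile
output_filepath: /path/to/save
```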
For open-source LLMs:
cd opensource-llms
bash ../scripts/eval_opensource.sh
or
python opensource-llms/eval_gsm8k.py --model "path/to/save" --dataset_filepath data/E-GSM/Q1.jsonl --output_filepath Q1_results.jsonl
You need to prepare the LLM to be fine-tuned, then run:
bash scripts/train.sh
Thanks to the authors of MetaMath, WizardMath, and RFT for open-sourcing their code. Parts of our code are based on theirs.
Please cite our paper if you use our dataset or extend our work:
@article{xu2024coleg-math,
title={Can LLMs Solve Longer Math Word Problems Better?},
author={Xu, Xin and Xiao, Tong and Chao, Zitong and Huang, Zhenya and Yang, Can and Wang, Yang},
journal={arXiv preprint arXiv:2405.14804},
year={2024}
}