Performance Report of Qwen2.5-Math-7B-Instruct on GaoKao Dataset #35

Open
xiaobanni opened this issue Nov 1, 2024 · 0 comments

I used the default command:

PROMPT_TYPE="qwen25-math-cot"
MODEL_NAME_OR_PATH="Qwen/Qwen2.5-Math-7B-Instruct"
bash sh/eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH

With the following default setup:

DATA_NAME="gaokao2024_I,gaokao2024_II,gaokao2024_mix,gaokao_math_cloze,gaokao_math_qa"
TOKENIZERS_PARALLELISM=false \
python3 -u math_eval.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --data_name ${DATA_NAME} \
    --output_dir ${OUTPUT_DIR} \
    --split ${SPLIT} \
    --prompt_type ${PROMPT_TYPE} \
    --num_test_sample ${NUM_TEST_SAMPLE} \
    --seed 0 \
    --temperature 0 \
    --n_sampling 1 \
    --top_p 1 \
    --start 0 \
    --end -1 \
    --use_vllm \
    --save_outputs \
    --overwrite \
    --adapt_few_shot

While attempting to reproduce the results on the GaoKao datasets, I observed the following per-subset scores:

gaokao_math_cloze

{
    "num_samples": 118,
    "num_scores": 118,
    "timeout_samples": 0,
    "empty_samples": 0,
    "acc": 68.6,
    "time_use_in_second": 33.38,
    "time_use_in_minute": "0:33"
}

gaokao_math_qa

{
    "num_samples": 351,
    "num_scores": 351,
    "timeout_samples": 0,
    "empty_samples": 2,
    "acc": 57.0,
    "time_use_in_second": 126.39,
    "time_use_in_minute": "2:06"
}

gaokao2024_I

{
    "num_samples": 14,
    "num_scores": 14,
    "timeout_samples": 0,
    "empty_samples": 0,
    "acc": 50.0,
    "type_acc": {
        "blank": 33.3,
        "multi": 0.0,
        "single": 75.0
    },
    "time_use_in_second": 23.06,
    "time_use_in_minute": "0:23"
}

gaokao2024_II

{
    "num_samples": 14,
    "num_scores": 14,
    "timeout_samples": 0,
    "empty_samples": 0,
    "acc": 57.1,
    "type_acc": {
        "blank": 33.3,
        "multi": 100.0,
        "single": 50.0
    },
    "time_use_in_second": 26.19,
    "time_use_in_minute": "0:26"
}

gaokao2024_mix

{
    "num_samples": 91,
    "num_scores": 91,
    "timeout_samples": 0,
    "empty_samples": 0,
    "acc": 59.3,
    "time_use_in_second": 39.97,
    "time_use_in_minute": "0:39"
}

The final sample-weighted average accuracy comes out to approximately 59.5%, a significant gap from the 66.3% reported in the paper. Could you help me identify any potential issues?

Code to calculate average accuracy:

# Per-subset results from above, used to compute the total `num_samples`
# and the sample-weighted average accuracy across all five GaoKao subsets
data = [
    {"num_samples": 118, "acc": 68.6},  # gaokao_math_cloze
    {"num_samples": 351, "acc": 57.0},  # gaokao_math_qa
    {"num_samples": 14, "acc": 50.0},   # gaokao2024_I
    {"num_samples": 14, "acc": 57.1},   # gaokao2024_II
    {"num_samples": 91, "acc": 59.3},   # gaokao2024_mix
]

# Total sample count and accuracy weighted by each subset's size
total_samples = sum(item["num_samples"] for item in data)
weighted_acc = sum(item["num_samples"] * item["acc"] for item in data) / total_samples

print(total_samples, round(weighted_acc, 1))  # 588 59.5
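For comparison, in case the paper reports a simple unweighted (macro) average over the five subsets rather than a sample-weighted one — this is an assumption on my part, not something the paper states — here is a quick check. It still lands well below 66.3%:

```python
# Hypothetical alternative: unweighted (macro) average of the five
# subset accuracies above, in case the paper averages subsets equally
accs = [68.6, 57.0, 50.0, 57.1, 59.3]
macro_acc = sum(accs) / len(accs)
print(round(macro_acc, 1))  # 58.4
```

So neither averaging convention reaches the reported number, which suggests the gap is in the evaluation setup rather than in how the subsets are aggregated.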