[Feature] Support downloading dataset from OpenMind #1792

Open. Wants to merge 1 commit into base: main.

@FightingZhen FightingZhen commented Dec 27, 2024


Motivation

Modelers is an open-source community that has recently gained a lot of attention. It hosts popular datasets and models that support Ascend NPUs. With the accompanying openMind library, users can easily train models on Ascend NPUs.

With this PR, we hope to integrate the dataset resources of the OpenMind community into opencompass. Afterwards, we plan to integrate the evaluation capabilities of opencompass into the openMind library, so that users can evaluate models more easily; this should also help promote opencompass.

By setting the environment variable DATASET_SOURCE=OpenMind, users can load datasets from the OpenMind community when running opencompass.

Modification

This PR establishes the process for integrating opencompass with dataset resources from the OpenMind community, using the GSM8K dataset as a pilot. More datasets will be supported soon.

The modifications are:

  • opencompass/datasets/gsm8k.py: Use the openMind library to automatically download the GSM8K dataset from the OpenMind community when the environment variable DATASET_SOURCE=OpenMind is set.
  • opencompass/utils/datasets.py: Look up the OpenMind dataset id in DATASETS_MAPPING through a new key, om_id (om is short for OpenMind).
  • opencompass/utils/datasets_info.py: Add "om_id": "OpenCompass/gsm8k" to the opencompass/gsm8k dict. The string OpenCompass/gsm8k is the id of the GSM8K dataset in the OpenMind community.
  • tests/dataset/test_om_datasets.py: Add a test script for datasets from the OpenMind community.
  • Update the English and Chinese READMEs.
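
The lookup described in the list above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the function name resolve_dataset_id, the trimmed mapping, and the "local" placeholder path are assumptions; only the DATASET_SOURCE variable, the om_id key, and the OpenCompass/gsm8k value come from the description.

```python
import os

# Hypothetical, trimmed-down stand-in for DATASETS_MAPPING.
# Only the "om_id" entry is taken from the PR description; the
# "local" path is a placeholder for illustration.
DATASETS_MAPPING = {
    "opencompass/gsm8k": {
        "om_id": "OpenCompass/gsm8k",
        "local": "./data/gsm8k/",
    },
}


def resolve_dataset_id(dataset_id, mapping=DATASETS_MAPPING, environ=None):
    """Return the OpenMind id when DATASET_SOURCE=OpenMind, else the local path.

    Hypothetical helper; the real lookup lives in opencompass/utils/datasets.py.
    """
    environ = os.environ if environ is None else environ
    entry = mapping[dataset_id]
    if environ.get("DATASET_SOURCE") == "OpenMind":
        om_id = entry.get("om_id")
        if om_id is None:
            # A dataset without an om_id is not yet available from OpenMind.
            raise KeyError(f"{dataset_id} has no om_id entry")
        return om_id
    return entry["local"]
```

Passing the environment as a parameter keeps the sketch easy to test; a dataset that has no om_id yet fails loudly instead of silently falling back to another source.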

BC-breaking (Optional)

Not related.

Use cases (Optional)

We verified this PR with Python 3.10.16 on Windows as follows:

Install dependencies and launch:

# opencompass-related packages have been installed with pip install -e .
pip install openmind
set DATASET_SOURCE=OpenMind
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
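
The set command above is specific to the Windows command prompt. As a platform-independent alternative, the same launch could be driven from Python. This is only a sketch: the helper names build_launch and launch are hypothetical, and it assumes the opencompass executable is on PATH.

```python
import os
import subprocess


def build_launch(models, datasets, base_env=None):
    """Build the opencompass command line and an environment that selects OpenMind.

    Hypothetical helper, not part of opencompass.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DATASET_SOURCE"] = "OpenMind"
    cmd = ["opencompass", "--models", models, "--datasets", datasets]
    return cmd, env


def launch(models, datasets):
    """Run the evaluation; raises CalledProcessError on a non-zero exit code."""
    cmd, env = build_launch(models, datasets)
    return subprocess.run(cmd, env=env, check=True)


# Example (not executed here):
# launch("hf_internlm2_5_1_8b_chat", "demo_gsm8k_chat_gen")
```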

Running result:

(openmind) D:\workspace>opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
d:\workspace\opencompass-openmind\opencompass\__init__.py:19: UserWarning: Starting from v0.4.0, all AMOTIC configuration files currently located in `./configs/datasets`, `./configs/models`, and `./configs/summarizers` will be migrated to the `opencompass/configs/` package. Please update your configuration file paths accordingly.
  _warn_about_config_migration()
signal.SIGALRM is not available on this platform
signal.SIGALRM is not available on this platform
01/03 22:21:28 - OpenCompass - INFO - Loading demo_gsm8k_chat_gen: d:\workspace\opencompass-openmind\opencompass\configs\./datasets\demo\demo_gsm8k_chat_gen.py
01/03 22:21:28 - OpenCompass - INFO - Loading hf_internlm2_5_1_8b_chat: d:\workspace\opencompass-openmind\opencompass\configs\./models\hf_internlm\hf_internlm2_5_1_8b_chat.py
01/03 22:21:28 - OpenCompass - INFO - Loading example: d:\workspace\opencompass-openmind\opencompass\configs\./summarizers\example.py
01/03 22:21:28 - OpenCompass - INFO - Current exp folder: outputs\default\20250103_222128
01/03 22:21:28 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.95k/7.95k [00:00<00:00, 993kB/s]
C:\ProgramData\anaconda3\envs\openmind\lib\site-packages\datasets\builder.py:885: FutureWarning: 'try_from_hf_gcs' was deprecated in version 2.16.0 and will be removed in 3.0.0.
  warnings.warn(
(…)k@main/main/train-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.31M/2.31M [00:00<00:00, 5.50MB/s]
Downloading data:   0%|                                                                                                                                                                                                                                    | 0.00/2.31M [00:02<?, ?B/s]
(…)8k@main/main/test-00000-of-00001.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 419k/419k [00:00<00:00, 6.16MB/s]
Downloading data:   0%|                                                                                                                                                                                                                                     | 0.00/419k [00:00<?, ?B/s]
Generating train split: 7473 examples [00:00, 995491.13 examples/s]
Generating test split: 1319 examples [00:00, 659312.00 examples/s]
[INFO][2025-01-03 22:21:42,781]: Context manager of om-dataset exited.
01/03 22:21:42 - OpenCompass - INFO - Partitioned into 1 tasks.
launch OpenICLInfer[internlm2_5-1_8b-chat-hf/demo_gsm8k] on GPU 0
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:59<00:00, 119.21s/it]
01/03 22:23:41 - OpenCompass - INFO - Partitioned into 1 tasks.
launch OpenICLEval[internlm2_5-1_8b-chat-hf/demo_gsm8k] on CPU
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.33s/it]
dataset     version    metric    mode      internlm2_5-1_8b-chat-hf
----------  ---------  --------  ------  --------------------------
demo_gsm8k  1d7fe4     accuracy  gen                          51.56
01/03 22:23:57 - OpenCompass - INFO - write summary to D:\workspace\outputs\default\20250103_222128\summary\summary_20250103_222128.txt
01/03 22:23:57 - OpenCompass - INFO - write csv to D:\workspace\outputs\default\20250103_222128\summary\summary_20250103_222128.csv


The markdown format results is as below:

| dataset | version | metric | mode | internlm2_5-1_8b-chat-hf |
|----- | ----- | ----- | ----- | -----|
| demo_gsm8k | 1d7fe4 | accuracy | gen | 51.56 |

01/03 22:23:57 - OpenCompass - INFO - write markdown summary to D:\workspace\outputs\default\20250103_222128\summary\summary_20250103_222128.md

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

@FightingZhen FightingZhen changed the title support downloading dataset from OpenMind [WIP] support downloading dataset from OpenMind Dec 27, 2024
@FightingZhen FightingZhen changed the title [WIP] support downloading dataset from OpenMind [WIP] [Feature] Support downloading dataset from OpenMind Dec 28, 2024
@FightingZhen FightingZhen marked this pull request as draft December 30, 2024 01:05
@FightingZhen FightingZhen changed the title [WIP] [Feature] Support downloading dataset from OpenMind [Feature] Support downloading dataset from OpenMind Dec 30, 2024
@FightingZhen FightingZhen force-pushed the main branch 9 times, most recently from 10c2c82 to 04b1b4a Compare January 3, 2025 15:30
@FightingZhen FightingZhen marked this pull request as ready for review January 3, 2025 15:37
@FightingZhen

@tonysy This PR is ready for review. We look forward to any feedback you may have. Thanks : )

@FightingZhen

@liushz Hello, I have noticed that some other people are also following the integration progress of this PR. Three days have passed with no progress. Could you help review it? Thanks 😄
