Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Docs] new_dataset.md update #1827

Merged
merged 3 commits into from
Jan 17, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 36 additions & 0 deletions docs/en/advanced_guides/new_dataset.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,5 +53,41 @@ Although OpenCompass has already included most commonly used datasets, users nee
eval_cfg=mydataset_eval_cfg)
]
```

- To facilitate the access of your datasets to other users, you need to specify the channels for downloading the datasets in the configuration file. Specifically, you need to first fill in a dataset name given by yourself in the `path` field in the `mydataset_datasets` configuration, and this name will be mapped to the actual download path in the `opencompass/utils/datasets_info.py` file. Here's an example:

```python
mmlu_datasets = [an
dict(
...,
path='opencompass/mmlu',
...,
)
]
```
- Next, you need to create a dictionary key in `opencompass/utils/datasets_info.py` with the same name as the one you provided above. If you have already hosted the dataset on HuggingFace or Modelscope, please add a dictionary key to the `DATASETS_MAPPING` dictionary and fill in the HuggingFace or Modelscope dataset address in the `hf_id` or `ms_id` key, respectively. You can also specify a default local address. Here's an example:

```python
"opencompass/mmlu": {
"ms_id": "opencompass/mmlu",
"hf_id": "opencompass/mmlu",
"local": "./data/mmlu/",
}
```

- If you wish for the provided dataset to be directly accessible from the OpenCompass OSS repository when used by others, you need to submit the dataset files in the Pull Request phase. We will then transfer the dataset to the OSS on your behalf and create a new dictionary key in the `DATASET_URL`.

- To ensure the optionality of data sources, you need to improve the method `load` in the dataset script `mydataset.py`. Specifically, you need to implement a functionality to switch among different download sources based on the setting of the environment variable `DATASET_SOURCE`. It shoule be noted that if the environment variable `DATASET_SOURCE` is not set, the dataset will default to being downloaded from the OSS repository. Here's an example from `opencompass/dataset/cmmlu.py`:

```python
def load(path: str, name: str, **kwargs):
...
if environ.get('DATASET_SOURCE') == 'ModelScope':
...
else:
...
return dataset
```


Detailed dataset configuration files and other required configuration files can be referred to in the [Configuration Files](../user_guides/config.md) tutorial. For guides on launching tasks, please refer to the [Quick Start](../get_started/quick_start.md) tutorial.
40 changes: 39 additions & 1 deletion docs/zh_cn/advanced_guides/new_dataset.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,5 +54,43 @@
eval_cfg=mydataset_eval_cfg)
]
```

- 为了使用户提供的数据集能够被其他使用者更方便的获取,需要用户在配置文件中给出下载数据集的渠道。具体的方式是首先在`mydataset_datasets`配置中的`path`字段填写用户指定的数据集名称,该名称将以mapping的方式映射到`opencompass/utils/datasets_info.py`中的实际下载路径。具体示例如下:

```python
mmlu_datasets = [
dict(
...,
path='opencompass/mmlu',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

强调一下这里是个 mapping 的映射

...,
)
]
```

- 接着,需要在`opencompass/utils/datasets_info.py`中创建对应名称的字典字段。如果用户已将数据集托管到huggingface或modelscope,那么请在`DATASETS_MAPPING`字典中添加对应名称的字段,并将对应的huggingface或modelscope数据集地址填入`ms_id`和`hf_id`;另外,还允许指定一个默认的`local`地址。具体示例如下:

```python
"opencompass/mmlu": {
"ms_id": "opencompass/mmlu",
"hf_id": "opencompass/mmlu",
"local": "./data/mmlu/",
}
```

Comment on lines +70 to +79
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

用户需要在 load 函数中实现根据不同的环境变量切换数据源的操作,具体可以参考 cmmlu 中的实现
https://github.com/open-compass/opencompass/blob/main/opencompass/datasets/cmmlu.py。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已添加

- 如果希望提供的数据集在其他用户使用时能够直接从OpenCompass官方的OSS仓库获取,则需要在Pull Request阶段向我们提交数据集文件,我们将代为传输数据集至OSS,并在`DATASET_URL`新建字段。

- 为了确保数据来源的可选择性,用户需要根据所提供数据集的下载路径类型来完善数据集脚本`mydataset.py`中的`load`方法的功能。具体而言,需要用户实现根据环境变量`DATASET_SOURCE`的不同设置来切换不同的下载数据源的功能。需要注意的是,若未设置`DATASET_SOURCE`的值,将默认从OSS仓库下载数据。`opencompass/dataset/cmmlu.py`中的具体示例如下:

```python
def load(path: str, name: str, **kwargs):
...
if environ.get('DATASET_SOURCE') == 'ModelScope':
...
else:
...
return dataset
```



详细的数据集配置文件以及其他需要的配置文件可以参考[配置文件](../user_guides/config.md)教程,启动任务相关的教程可以参考[快速开始](../get_started/quick_start.md)教程。
详细的数据集配置文件以及其他需要的配置文件可以参考[配置文件](../user_guides/config.md)教程,启动任务相关的教程可以参考[快速开始](../get_started/quick_start.md)教程。
Loading