🎴 Add readme for datasets (#2491)
* adding readme for ultrafeedback dataset

* using ModelCard as DatasetCard since hf datasets is understaffed

* more info in readme.md of the dataset

* generated readme for all dataset scripts

* pre-commit

* fixing test

* md format; corrections; generation script link

* some collections

---------

Co-authored-by: Quentin Gallouédec <[email protected]>
3 people authored Jan 8, 2025
1 parent beb892b commit ed7de87
Showing 11 changed files with 304 additions and 0 deletions.
10 changes: 10 additions & 0 deletions docs/source/dataset_formats.mdx
@@ -161,6 +161,8 @@ prompt_only_example = {"prompt": "The sky is"}
prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
```

For examples of prompt-only datasets, refer to the [Prompt-only datasets collection](https://huggingface.co/collections/trl-lib/prompt-only-datasets-677ea25245d20252cea00368).

<Tip>

While the prompt-only and language modeling types look similar, they differ in how the input is handled. In the prompt-only type, the prompt represents a partial input that the model is expected to complete or continue, while in the language modeling type, the input is treated as a complete sentence or sequence. These two types are processed differently by TRL. Below is an example showing the difference in the output of the `apply_chat_template` function for each type:
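
A minimal stand-in sketch of that difference, assuming TRL's `apply_chat_template` helper and an illustrative instruct tokenizer (the exact rendered strings depend on the model's chat template):

```python
from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # illustrative choice

# Prompt-only: the prompt is a partial input awaiting a completion, so the
# rendered string ends with a generation prompt for the assistant,
# e.g. "...<|im_start|>assistant\n".
prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
print(apply_chat_template(prompt_only_example, tokenizer))  # -> {"prompt": "..."}

# Language modeling: the messages form a complete sequence, so the template
# is rendered without a trailing generation prompt.
lm_example = {"messages": [{"role": "user", "content": "What color is the sky?"}]}
print(apply_chat_template(lm_example, tokenizer))  # -> {"text": "..."}
```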
@@ -199,6 +201,8 @@ prompt_completion_example = {"prompt": [{"role": "user", "content": "What color
"completion": [{"role": "assistant", "content": "It is blue."}]}
```

For examples of prompt-completion datasets, refer to the [Prompt-completion datasets collection](https://huggingface.co/collections/trl-lib/prompt-completion-datasets-677ea2bb20bbb6bdccada216).

#### Preference

A preference dataset is used for tasks where the model is trained to choose between two or more possible completions to the same prompt. This dataset includes a `"prompt"`, a `"chosen"` completion, and a `"rejected"` completion. The model is trained to select the `"chosen"` response over the `"rejected"` response.
@@ -223,6 +227,8 @@ preference_example = {"chosen": [{"role": "user", "content": "What color is the
{"role": "assistant", "content": "It is green."}]}
```

For examples of preference datasets, refer to the [Preference datasets collection](https://huggingface.co/collections/trl-lib/preference-datasets-677e99b581018fcad9abd82c).

Some preference datasets can be found with [the tag `dpo` on Hugging Face Hub](https://huggingface.co/datasets?other=dpo). You can also explore the [librarian-bots' DPO Collections](https://huggingface.co/collections/librarian-bots/direct-preference-optimization-datasets-66964b12835f46289b6ef2fc) to identify preference datasets.

#### Unpaired preference
@@ -238,6 +244,8 @@ unpaired_preference_example = {"prompt": [{"role": "user", "content": "What colo
"label": True}
```

For examples of unpaired preference datasets, refer to the [Unpaired preference datasets collection](https://huggingface.co/collections/trl-lib/unpaired-preference-datasets-677ea22bf5f528c125b0bcdf).

#### Stepwise supervision

A stepwise (or process) supervision dataset is similar to an [unpaired preference](#unpaired-preference) dataset but includes multiple steps of completions, each with its own label. This structure is useful for tasks that need detailed, step-by-step labeling, such as reasoning tasks. By evaluating each step separately and providing targeted labels, this approach helps identify precisely where the reasoning is correct and where errors occur, allowing for targeted feedback on each part of the reasoning process.
@@ -250,6 +258,8 @@ stepwise_example = {
}
```

For examples of stepwise supervision datasets, refer to the [Stepwise supervision datasets collection](https://huggingface.co/collections/trl-lib/stepwise-supervision-datasets-677ea27fd4c5941beed7a96e).

## Which dataset type to use?

Choosing the right dataset type depends on the task you are working on and the specific requirements of the TRL trainer you are using. Below is a brief overview of the dataset types supported by each TRL trainer.
30 changes: 30 additions & 0 deletions examples/datasets/hh-rlhf-helpful-base.py
@@ -17,6 +17,7 @@
from typing import Optional

from datasets import load_dataset
from huggingface_hub import ModelCard
from transformers import HfArgumentParser


@@ -92,6 +93,34 @@ def extract_dialogue(example: str) -> list[dict[str, str]]:
return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


model_card = ModelCard("""
---
tags: [trl]
---

# HH-RLHF-Helpful-Base Dataset

## Summary

The HH-RLHF-Helpful-Base dataset is a processed version of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset, specifically curated to train models using the [TRL library](https://github.com/huggingface/trl) for preference learning and alignment tasks. It contains pairs of text samples, each labeled as either "chosen" or "rejected," based on human preferences regarding the helpfulness of the responses. This dataset enables models to learn human preferences in generating helpful responses, enhancing their ability to assist users effectively.

## Data Structure

- **Format**: [Conversational](https://huggingface.co/docs/trl/main/dataset_formats#conversational)
- **Type**: [Preference](https://huggingface.co/docs/trl/main/dataset_formats#preference)

Columns:
- `"prompt"`: The user query.
- `"chosen"`: A response deemed helpful by human evaluators.
- `"rejected"`: A response considered less helpful or unhelpful.

This structure allows models to learn to prefer the _chosen_ response over the _rejected_ one, thereby aligning with human preferences in helpfulness.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/hh-rlhf-helpful-base.py).
""")

if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
@@ -101,3 +130,4 @@ def extract_dialogue(example: str) -> list[dict[str, str]]:

if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)
model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
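
Once the script has pushed its output, the preference pairs can be consumed directly with `datasets`. A hedged usage sketch — the `trl-lib/hh-rlhf-helpful-base` repo id is an assumption; the script pushes to whatever `--repo_id` it is given:

```python
from datasets import load_dataset

# Assumed repo id; substitute the --repo_id the script was run with.
dataset = load_dataset("trl-lib/hh-rlhf-helpful-base", split="train")

example = dataset[0]
print(example["prompt"])    # list of {"role", "content"} messages
print(example["chosen"])    # preferred assistant reply, conversational format
print(example["rejected"])  # dispreferred assistant reply
```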
30 changes: 30 additions & 0 deletions examples/datasets/lm-human-preferences-descriptiveness.py
@@ -16,6 +16,7 @@
from typing import Optional

from datasets import load_dataset
from huggingface_hub import ModelCard
from transformers import AutoTokenizer, HfArgumentParser


@@ -64,6 +65,34 @@ def to_prompt_completion(example, tokenizer):
return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


model_card = ModelCard("""
---
tags: [trl]
---

# LM-Human-Preferences-Descriptiveness Dataset

## Summary

The LM-Human-Preferences-Descriptiveness dataset is a processed subset of [OpenAI's LM-Human-Preferences](https://github.com/openai/lm-human-preferences), focusing specifically on enhancing the descriptiveness of generated text. It contains pairs of text samples, each labeled as either "chosen" or "rejected," based on human preferences regarding the level of detail and vividness in the descriptions. This dataset enables models to learn human preferences in descriptive language, improving their ability to generate rich and engaging narratives.

## Data Structure

- **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
- **Type**: [Preference](https://huggingface.co/docs/trl/main/dataset_formats#preference)

Columns:
- `"prompt"`: The text sample.
- `"chosen"`: A version of the text with enhanced descriptiveness.
- `"rejected"`: A version of the text with less descriptiveness.

This structure allows models to learn to prefer the _chosen_ response over the _rejected_ one, thereby aligning with human preferences in descriptive language.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/lm-human-preferences-descriptiveness.py).
""")

if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
@@ -88,3 +117,4 @@ def to_prompt_completion(example, tokenizer):

if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)
model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
30 changes: 30 additions & 0 deletions examples/datasets/lm-human-preferences-sentiment.py
@@ -16,6 +16,7 @@
from typing import Optional

from datasets import load_dataset
from huggingface_hub import ModelCard
from transformers import AutoTokenizer, HfArgumentParser


@@ -59,6 +60,34 @@ def to_prompt_completion(example, tokenizer):
return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


model_card = ModelCard("""
---
tags: [trl]
---

# LM-Human-Preferences-Sentiment Dataset

## Summary

The LM-Human-Preferences-Sentiment dataset is a processed subset of [OpenAI's LM-Human-Preferences](https://github.com/openai/lm-human-preferences), focusing specifically on sentiment analysis tasks. It contains pairs of text samples, each labeled as either "chosen" or "rejected," based on human preferences regarding the sentiment conveyed in the text. This dataset enables models to learn human preferences in sentiment expression, enhancing their ability to generate and evaluate text with desired emotional tones.

## Data Structure

- **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
- **Type**: [Preference](https://huggingface.co/docs/trl/main/dataset_formats#preference)

Columns:
- `"prompt"`: The text sample.
- `"chosen"`: A version of the text that conveys the desired sentiment.
- `"rejected"`: A version of the text that does not convey the desired sentiment.

This structure allows models to learn to prefer the _chosen_ response over the _rejected_ one, thereby aligning with human preferences in sentiment expression.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/lm-human-preferences-sentiment.py).
""")

if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
@@ -81,3 +110,4 @@ def to_prompt_completion(example, tokenizer):

if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)
model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
30 changes: 30 additions & 0 deletions examples/datasets/math_shepherd.py
@@ -18,6 +18,7 @@
from typing import Optional

from datasets import load_dataset
from huggingface_hub import ModelCard
from transformers import HfArgumentParser


@@ -123,6 +124,34 @@ def process_example(example):
return {"prompt": prompt, "completions": completions, "labels": labels}


model_card = ModelCard("""
---
tags: [trl]
---

# Math-Shepherd Dataset

## Summary

The Math-Shepherd dataset is a processed version of the [Math-Shepherd dataset](https://huggingface.co/datasets/peiyi9979/Math-Shepherd), designed to train models using the [TRL library](https://github.com/huggingface/trl) for stepwise supervision tasks. It provides step-by-step solutions to mathematical problems, enabling models to learn and verify each step of a solution, thereby enhancing their reasoning capabilities.

## Data Structure

- **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
- **Type**: [Stepwise supervision](https://huggingface.co/docs/trl/main/dataset_formats#stepwise-supervision)

Columns:
- `"prompt"`: The problem statement.
- `"completions"`: A list of reasoning steps generated to solve the problem.
- `"labels"`: A list of booleans or floats indicating the correctness of each corresponding reasoning step.

This structure allows models to learn the correctness of each step in a solution, facilitating improved reasoning and problem-solving abilities.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/math_shepherd.py).
""")

if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
@@ -138,3 +167,4 @@ def process_example(example):

if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)
model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
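
Since each entry pairs a list of completions with a list of per-step labels, iterating the two in lockstep shows how the stepwise supervision is laid out. A sketch under the same assumption that the repo id matches the one passed to the script:

```python
from datasets import load_dataset

# Assumed repo id; substitute the --repo_id the script was run with.
dataset = load_dataset("trl-lib/math_shepherd", split="train")

example = dataset[0]
print(example["prompt"])
# Each reasoning step carries its own correctness label.
for step, is_correct in zip(example["completions"], example["labels"]):
    print(f"[{'correct' if is_correct else 'incorrect'}] {step}")
```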
30 changes: 30 additions & 0 deletions examples/datasets/prm800k.py
@@ -16,6 +16,7 @@
from typing import Optional

from datasets import load_dataset
from huggingface_hub import ModelCard
from transformers import HfArgumentParser


@@ -97,6 +98,34 @@ def process_batch(examples):
return outputs


model_card = ModelCard("""
---
tags: [trl]
---

# PRM800K Dataset

## Summary

The PRM800K dataset is a processed version of [OpenAI's PRM800K](https://github.com/openai/prm800k), designed to train models using the [TRL library](https://github.com/huggingface/trl) for stepwise supervision tasks. It contains 800,000 step-level correctness labels for model-generated solutions to problems from the MATH dataset. This dataset enables models to learn and verify each step of a solution, enhancing their reasoning capabilities.

## Data Structure

- **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
- **Type**: [Stepwise supervision](https://huggingface.co/docs/trl/main/dataset_formats#stepwise-supervision)

Columns:
- `"prompt"`: The problem statement.
- `"completions"`: A list of reasoning steps generated to solve the problem.
- `"labels"`: A list of booleans or floats indicating the correctness of each corresponding reasoning step.

This structure allows models to learn the correctness of each step in a solution, facilitating improved reasoning and problem-solving abilities.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/prm800k.py).
""")

if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
@@ -125,3 +154,4 @@ def process_batch(examples):

if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)
model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
31 changes: 31 additions & 0 deletions examples/datasets/rlaif-v.py
@@ -16,6 +16,7 @@
from typing import Optional

from datasets import features, load_dataset
from huggingface_hub import ModelCard
from transformers import HfArgumentParser


@@ -59,6 +60,35 @@ def to_conversational(example):
return {"prompt": prompt, "images": [example["image"]], "chosen": chosen, "rejected": rejected}


model_card = ModelCard("""
---
tags: [trl]
---

# RLAIF-V Dataset

## Summary

The RLAIF-V dataset is a processed version of the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset#dataset-card-for-rlaif-v-dataset), specifically curated to train vision-language models using the [TRL library](https://github.com/huggingface/trl) for preference learning tasks. It contains 83,132 high-quality comparison pairs, each comprising an image and two textual descriptions: one preferred and one rejected. This dataset enables models to learn human preferences in visual contexts, enhancing their ability to generate and evaluate image captions.

## Data Structure

- **Format**: [Conversational](https://huggingface.co/docs/trl/main/dataset_formats#conversational)
- **Type**: [Preference](https://huggingface.co/docs/trl/main/dataset_formats#preference)

Columns:
- `"prompt"`: The task related to the image.
- `"images"`: The image.
- `"chosen"`: The preferred answer.
- `"rejected"`: An alternative answer that was not preferred.

This structure allows models to learn to prefer the _chosen_ response over the _rejected_ one, thereby aligning with human preferences in visual tasks.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/rlaif-v.py).
""")

if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
@@ -80,3 +110,4 @@ def to_conversational(example):

if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)
model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
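
The script imports `features` from `datasets`, suggesting the image column is stored as actual images; loading the pushed dataset should then yield PIL images alongside the conversational preference pair. A sketch, again with the repo id as an assumption:

```python
from datasets import load_dataset

# Assumed repo id; substitute the --repo_id the script was run with.
dataset = load_dataset("trl-lib/rlaif-v", split="train")

example = dataset[0]
print(example["images"][0].size)  # PIL image paired with the prompt
print(example["prompt"][0])       # user turn; content may mix image and text parts
print(example["chosen"][0])       # preferred description
```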
29 changes: 29 additions & 0 deletions examples/datasets/tldr.py
@@ -16,6 +16,7 @@
from typing import Optional

from datasets import load_dataset
from huggingface_hub import ModelCard
from transformers import HfArgumentParser


@@ -54,6 +55,33 @@ def to_prompt_completion(example):
return {"prompt": prompt, "completion": completion}


model_card = ModelCard("""
---
tags: [trl]
---

# TL;DR Dataset

## Summary

The TL;DR dataset is a processed version of Reddit posts, specifically curated to train models using the [TRL library](https://github.com/huggingface/trl) for summarization tasks. It leverages the common practice on Reddit where users append "TL;DR" (Too Long; Didn't Read) summaries to lengthy posts, providing a rich source of paired text data for training summarization models.

## Data Structure

- **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
- **Type**: [Prompt-completion](https://huggingface.co/docs/trl/main/dataset_formats#prompt-completion)

Columns:
- `"prompt"`: The unabridged Reddit post.
- `"completion"`: The concise "TL;DR" summary appended by the author.

This structure enables models to learn the relationship between detailed content and its abbreviated form, enhancing their summarization capabilities.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/tldr.py).
""")

if __name__ == "__main__":
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]
@@ -74,3 +102,4 @@ def to_prompt_completion(example):

if script_args.push_to_hub:
dataset.push_to_hub(script_args.repo_id)
model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
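
A prompt-completion dataset like this one plugs straight into supervised fine-tuning. A hedged sketch — the repo id and model name are illustrative assumptions:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed repo id; substitute the --repo_id the script was run with.
dataset = load_dataset("trl-lib/tldr", split="train")

# SFTTrainer consumes prompt-completion datasets directly.
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # illustrative small base model
    train_dataset=dataset,
    args=SFTConfig(output_dir="tldr-sft"),
)
trainer.train()
```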