
[Tutorials] Update tutorial documentations #241

Merged (1 commit) on Sep 11, 2024
1 change: 1 addition & 0 deletions README.md
@@ -65,6 +65,7 @@ These modules offer flexibility and permit reordering, with only a few exception
- [Scale and Curate High-Quality Datasets for LLM Training with NVIDIA NeMo Curator](https://developer.nvidia.com/blog/scale-and-curate-high-quality-datasets-for-llm-training-with-nemo-curator/)
- [Curating Custom Datasets for LLM Training with NVIDIA NeMo Curator](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-training-with-nvidia-nemo-curator/)
- [Curating Custom Datasets for LLM Parameter-Efficient Fine-Tuning with NVIDIA NeMo Curator](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-parameter-efficient-fine-tuning-with-nvidia-nemo-curator/)
- [Streamlining Data Processing for Domain Adaptive Pretraining with NVIDIA NeMo Curator](https://developer.nvidia.com/blog/streamlining-data-processing-for-domain-adaptive-pretraining-with-nvidia-nemo-curator/)

## Get Started

28 changes: 28 additions & 0 deletions tutorials/README.md
@@ -0,0 +1,28 @@
# Tutorials
The following is a set of tutorials that demonstrate various functionalities and features of NeMo Curator. These tutorials are meant to provide a coding foundation for building applications that consume data curated with NeMo Curator.

## Get Started
To get started, we recommend the following tutorials, which introduce various functionalities of NeMo Curator and give an idea of what a data curation pipeline might look like:
1. **[tinystories](./tinystories)**, which overviews core functionalities such as downloading, filtering, PII removal, and exact deduplication (a minimal pipeline sketch follows this list).
2. **[peft-curation](./peft-curation)**, which overviews operations suitable for curating small-scale datasets used for task-specific fine-tuning.
3. **[synthetic-data-hello-world](./synthetic-data-hello-world)**, which overviews basic synthetic data generation facilities for interfacing with external models such as [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct).
4. **[peft-curation-with-sdg](./peft-curation-with-sdg)**, which combines data processing operations and synthetic data generation using [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) or [LLaMa 3.1 405B Instruct](https://build.nvidia.com/meta/llama-3_1-405b-instruct) into a single pipeline. This tutorial also demonstrates advanced functionality such as reward score assignment via [Nemotron-4 340B Reward](https://build.nvidia.com/nvidia/nemotron-4-340b-reward), as well as semantic deduplication to remove semantically similar real or synthetic records.
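
For orientation, the sketch below shows the shape of a minimal NeMo Curator pipeline of the kind these tutorials build: read a JSONL dataset, filter out very short documents, normalize Unicode, and write the result. This is a simplified illustration rather than part of any tutorial; the file paths are placeholders, and exact module paths may vary slightly between NeMo Curator releases.

```python
from nemo_curator import Modify, ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter

# Load a JSONL dataset with a "text" field (placeholder path).
dataset = DocumentDataset.read_json("data/raw/")

# A small curation pipeline: drop very short documents, then clean up Unicode.
# The tutorials above add further steps such as PII redaction and deduplication.
pipeline = Sequential([
    ScoreFilter(WordCountFilter(min_words=80)),
    Modify(UnicodeReformatter()),
])

curated = pipeline(dataset)
curated.to_json("data/curated/")
```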


## List of Tutorials

<div align="center">

| Tutorial | Description | Additional Resources |
| --- | --- | --- |
| [dapt-curation](./dapt-curation) | Data curation sample for domain-adaptive pre-training (DAPT), focusing on [ChipNeMo](https://blogs.nvidia.com/blog/llm-semiconductors-chip-nemo/) data curation as an example | [Blog post](https://developer.nvidia.com/blog/streamlining-data-processing-for-domain-adaptive-pretraining-with-nvidia-nemo-curator/) |
| [distributed_data_classification](./distributed_data_classification) | Demonstrates data domain and data quality classification at scale in a distributed environment | |
| [nemotron_340B_synthetic_datagen](./nemotron_340B_synthetic_datagen) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for generating synthetic preference data | |
| [peft-curation](./peft-curation/) | Data curation sample for parameter efficient fine-tuning (PEFT) use-cases | [Blog post](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-parameter-efficient-fine-tuning-with-nvidia-nemo-curator/) |
| [peft-curation-with-sdg](./peft-curation-with-sdg/) | Demonstrates a pipeline to leverage external models such as [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) for synthetic data generation, data quality annotation via [Nemotron-4 340B Reward](https://build.nvidia.com/nvidia/nemotron-4-340b-reward), as well as other data processing steps (semantic deduplication, HTML tag removal, etc.) for parameter efficient fine-tuning (PEFT) use-cases | [Use this data to fine-tune your own model](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-nemofw.ipynb) |
| [single_node_tutorial](./single_node_tutorial) | A comprehensive example to demonstrate running various NeMo Curator functionalities locally | |
| [synthetic-data-hello-world](./synthetic-data-hello-world) | An introductory example of synthetic data generation using NeMo Curator | |
| [synthetic-preference-data](./synthetic-preference-data) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [LLaMa 3.1 405B Instruct](https://build.nvidia.com/meta/llama-3_1-405b-instruct) for generating synthetic preference data | |
| [synthetic-retrieval-evaluation](./synthetic-retrieval-evaluation) | Demonstrates the use of NeMo Curator synthetic data generation modules to leverage [LLaMa 3.1 405B Instruct](https://build.nvidia.com/meta/llama-3_1-405b-instruct) for generating synthetic data to evaluate retrieval pipelines | |
| [tinystories](./tinystories) | A comprehensive example of curating a small dataset to use for model pre-training | [Blog post](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-training-with-nvidia-nemo-curator/) |
</div>
27 changes: 26 additions & 1 deletion tutorials/peft-curation-with-sdg/README.md
@@ -9,13 +9,38 @@ well as human-provided answers.

In this tutorial, we implement various filtering and processing operations on the records. We then
demonstrate the usage of external LLM services for synthetic data generation and reward models to
- assign qualitative metrics to each synthetic record. We further NeMo Curator's facilities
+ assign qualitative metrics to each synthetic record. We further use NeMo Curator's facilities
to iteratively augment and refine the data until the dataset has reached the desired size.

> **Note:** The use of external LLM services for synthetic data generation is entirely optional.
> Similarly, this tutorial can be executed on a local machine without the need for a GPU. To fully
> experience all the capabilities of this code, see the "Optional Prerequisites" section below.

## Overview of the Pipeline

The pipeline in this tutorial demonstrates a basic loop with two stages, which are repeated until the desired dataset size is achieved (a code sketch of the loop follows the diagram below):

1. **Data processing**: perform operations such as HTML tag cleaning, quality-based filtering, and semantic deduplication on the records.
2. **Synthetic data generation**: query a synthetic data generation model (such as [LLaMa 3.1 405B Instruct](https://build.nvidia.com/meta/llama-3_1-405b-instruct) or [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct)) to produce synthetic variants of existing records. Each synthetic record is then fed to a reward model (such as [Nemotron-4 340B Reward](https://build.nvidia.com/nvidia/nemotron-4-340b-reward)) and assigned a quality score. All records are then fed back into the data processing stage.

The following diagram depicts the pipeline demonstrated in this tutorial:

![image](images/peft-sdg.png)
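
For readers who prefer code to diagrams, the following sketch outlines the same loop. It is illustrative only: the helper callables are stand-ins for the processing and generation steps implemented in this tutorial (described under "Code Structure" below), not its actual function names.

```python
from typing import Callable, List

Record = dict  # one dataset row

def curation_loop(
    records: List[Record],
    target_size: int,
    process_records: Callable[[List[Record]], List[Record]],
    generate_and_score: Callable[[List[Record]], List[Record]],
) -> List[Record]:
    """Illustrative outline of the two-stage loop; not the tutorial's actual code."""
    while len(records) < target_size:
        # Stage 1: data processing (HTML tag cleaning, quality-based filtering,
        # and semantic deduplication when a GPU is available).
        records = process_records(records)
        # Stage 2: synthetic data generation plus reward-model scoring; the new
        # records are merged in and flow through stage 1 on the next iteration.
        records = records + generate_and_score(records)
    return records
```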


### Code Structure

This code is organized as follows:
- **[main.py](main.py)**: the entry point to the code. Implements the data curation and synthetic data generation pipeline and provides the following high-level functionality:
  - `download_and_convert_to_jsonl()`: contains the logic necessary to download the sample dataset and convert it into JSONL.
  - `random_split_rows()`: contains the logic for splitting the dataset into training/validation/test splits.
  - `semantic_dedupe()`: implements the semantic deduplication functionality (requires an NVIDIA GPU).
  - `run_curation_pipeline()`: the main curation pipeline implementation. Captures the data processing as well as the synthetic data generation operations.
- **[docbuilder.py](docbuilder.py)**: contains the implementations of NeMo Curator document builder modules that facilitate dataset download and conversion into the JSONL format.
- **[filters.py](filters.py)**: contains the implementation of a score-based filtering mechanism to filter out low-quality documents. Used in `run_curation_pipeline()`.
- **[modifiers.py](modifiers.py)**: contains the implementation of the HTML-cleaning logic. Used in `run_curation_pipeline()`.
- **[synthetic_gen.py](synthetic_gen.py)**: abstracts the logic for invoking the synthetic data generation model and assigning reward scores to each record. Used in `run_curation_pipeline()`; a sketch of the kind of API call it wraps follows this list.
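
As a rough illustration of the kind of call that `synthetic_gen.py` wraps, the snippet below queries an external instruct model through NVIDIA's OpenAI-compatible endpoint and then asks the reward model to score the result. This is a hedged sketch, not the tutorial's code: the prompt is a placeholder, the API key comes from an environment variable you must set yourself, and the tutorial's wrapper adds batching, retries, and record bookkeeping on top of calls like these.

```python
import os
from openai import OpenAI  # the endpoint below is OpenAI API-compatible

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # placeholder environment variable
)

# Ask the instruct model for a synthetic variant of an existing record.
prompt = "Paraphrase the following question: ..."  # placeholder prompt
generation = client.chat.completions.create(
    model="meta/llama-3.1-405b-instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)
synthetic_text = generation.choices[0].message.content

# Score the (prompt, response) pair with the reward model. Its per-attribute
# scores (helpfulness, correctness, etc.) are returned via the logprobs field.
reward = client.chat.completions.create(
    model="nvidia/nemotron-4-340b-reward",
    messages=[
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": synthetic_text},
    ],
)
scores = {t.token: t.logprob for t in reward.choices[0].logprobs.content}
```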

## Optional Prerequisites

The following is a list of optional dependencies to allow experimentation with all the features