[Tutorials] Update the SDG tutorial's documentation
- Add a structure outline to the README file to make the tutorial more
accessible.
- Add a block diagram to further clarify the pipeline.

Signed-off-by: Mehran Maghoumi <[email protected]>
Maghoumi committed Sep 10, 2024
1 parent c2f296c commit c90c617
Showing 2 changed files with 26 additions and 1 deletion.
27 changes: 26 additions & 1 deletion tutorials/peft-curation-with-sdg/README.md
@@ -9,13 +9,38 @@ well as human-provided answers.

In this tutorial, we implement various filtering and processing operations on the records. We then
demonstrate the usage of external LLM services for synthetic data generation and reward models to
assign qualitative metrics to each synthetic record. We further use NeMo Curator's facilities
to iteratively augment and refine the data until the dataset has reached the desired size.

> **Note:** The use of external LLM services for synthetic data generation is entirely optional.
> Similarly, this tutorial can be executed on a local machine without the need for a GPU. To fully
> experience all the capabilities of this code, see the "Optional Prerequisites" section below.

## Overview of the Pipeline

The pipeline in this tutorial demonstrates a basic loop with two stages, repeated until the dataset reaches the desired size:

1. **Data processing**: perform operations such as HTML tag cleaning, quality-based filtering, and semantic deduplication on the records.
2. **Synthetic data generation**: query a synthetic data generation model (such as Llama 3.1 405B Instruct or Nemotron-4 340B Instruct) to produce synthetic variants of existing records. Each synthetic record is then fed to a reward model (such as Nemotron-4 340B Reward) and assigned a quality score. All records are then fed back to the data processing stage. A minimal sketch of the generation call is shown below.
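
To make the second stage concrete, here is a minimal sketch of requesting one synthetic variant of a record from an external LLM service. It assumes an OpenAI-compatible endpoint such as NVIDIA's hosted models; the endpoint URL, model identifier, and prompt are illustrative assumptions, not necessarily what this tutorial's code uses:

```python
# Minimal sketch: ask an external LLM service for one synthetic variant
# of a record. The endpoint and model name below are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

record = "How do I reset my password if I no longer have access to my email?"
response = client.chat.completions.create(
    model="meta/llama-3.1-405b-instruct",  # assumed model identifier
    messages=[
        {"role": "user", "content": f"Rewrite this question in different words:\n{record}"},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)  # the synthetic variant
```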

The following diagram depicts the pipeline demonstrated in this tutorial:

![image](images/peft-sdg.png)
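
As the diagram shows, the two stages form a feedback loop. A self-contained sketch of that control flow (with placeholder stage functions passed in as callables; not the tutorial's actual API) might look like this:

```python
from typing import Callable, List

def curate(
    records: List[dict],
    target_size: int,
    process: Callable[[List[dict]], List[dict]],   # stage 1: clean, filter, dedupe
    generate: Callable[[List[dict]], List[dict]],  # stage 2: synthetic variants + scores
    max_rounds: int = 10,
) -> List[dict]:
    """Alternate the two stages until the dataset reaches target_size."""
    for _ in range(max_rounds):
        if len(records) >= target_size:
            break
        records = process(records)
        records = records + generate(records)
    return records
```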


### Code Structure

This code is organized as follows:
- **[main.py](main.py)**: the entry point to the code. Implements the data curation and synthetic data generation pipeline, and provides the following high-level functions:
- `download_and_convert_to_jsonl()`: contains the logic necessary to download the sample dataset and convert it into JSONL.
  - `random_split_rows()`: contains the logic for splitting the dataset into training/validation/test splits.
  - `semantic_dedupe()`: implements the semantic deduplication functionality (requires an NVIDIA GPU).
  - `run_curation_pipeline()`: the main curation pipeline implementation, covering both the data processing and the synthetic data generation operations.
- **[docbuilder.py](docbuilder.py)**: contains the implementations of NeMo Curator document builder modules to facilitate dataset download and conversion into the JSONL format.
- **[filters.py](filters.py)**: contains the implementation of a score-based filtering mechanism to filter out low-quality documents. Used in `run_curation_pipeline()` (see the sketch after this list).
- **[modifiers.py](modifiers.py)**: contains the implementation of the HTML-cleaning logic. Used in `run_curation_pipeline()`.
- **[synthetic_gen.py](synthetic_gen.py)**: abstracts the logic for invoking the synthetic data generation model and for assigning reward scores to each record. Used in `run_curation_pipeline()`.
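
For a flavor of how one of these pieces looks, below is a minimal sketch of a score-based filter in the spirit of [filters.py](filters.py). It assumes NeMo Curator's `DocumentFilter` interface (`score_document()`/`keep_document()`); the stand-in scoring heuristic is an assumption, since the tutorial derives its scores from a reward model:

```python
from nemo_curator.filters import DocumentFilter

class QualityScoreFilter(DocumentFilter):
    """Illustrative score-based filter that drops low-scoring documents.

    A sketch only: the actual filter in filters.py may score and
    threshold documents differently.
    """

    def __init__(self, threshold: float = 0.5):
        super().__init__()
        self._threshold = threshold

    def score_document(self, text: str) -> float:
        # Stand-in heuristic; the tutorial uses reward-model scores instead.
        return min(len(text.split()) / 100.0, 1.0)

    def keep_document(self, score: float) -> bool:
        return score >= self._threshold
```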

## Optional Prerequisites

The following is a list of optional dependencies to allow experimentation with all the features
tutorials/peft-curation-with-sdg/images/peft-sdg.png (binary image file; not displayed)
