[Tutorials] Update the SDG tutorial and expose the inference endpoint (NVIDIA#301)

This PR ensures that users can run the PEFT SDG tutorial using arbitrary
API endpoints by exposing the URL that is used for synthetic data
generation.

Signed-off-by: Mehran Maghoumi <[email protected]>
Maghoumi authored Oct 21, 2024
1 parent 94d41ee commit 4ad1a4d
Showing 2 changed files with 40 additions and 19 deletions.
26 changes: 15 additions & 11 deletions tutorials/peft-curation-with-sdg/README.md
@@ -48,45 +48,49 @@ showcased in this code:

* In order to run the data curation pipeline with semantic deduplication enabled, you would need an
NVIDIA GPU.
* To generate synthetic data, you would need a synthetic data generation model compatible with the OpenAI API. Out of the box, this tutorial supports the following model through the [build.nvidia.com](https://build.nvidia.com) API gateway:
* To generate synthetic data, you would need a synthetic data generation model compatible with the [OpenAI API](https://platform.openai.com/docs/api-reference/introduction). Out of the box, this tutorial supports the following model through the [build.nvidia.com](https://build.nvidia.com) API gateway:
* [Nemotron-4 340B Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct)
* [LLaMa 3.1 405B Instruct](https://build.nvidia.com/meta/llama-3_1-405b-instruct)
* For assigning qualitative metrics to the generated records, you would need a reward model compatible with the OpenAI API (such as the [Nemotron-4 340B Reward](https://build.nvidia.com/nvidia/nemotron-4-340b-reward) model).
* For assigning qualitative metrics to the generated records, you would need a reward model compatible with the [OpenAI API](https://platform.openai.com/docs/api-reference/introduction) (such as the [Nemotron-4 340B Reward](https://build.nvidia.com/nvidia/nemotron-4-340b-reward) model).

> **Note:** A valid [build.nvidia.com](https://build.nvidia.com) API key is required to use any of the above models.
> **Note:** A valid [build.nvidia.com](https://build.nvidia.com) API key is required to use any of the above models. You can obtain a free API key by visiting [build.nvidia.com](https://build.nvidia.com) and creating an account with your email address.
## Usage
After installing the NeMo Curator package, you can simply run the following commands:
```bash
# Running the basic pipeline (no GPUs or external LLMs needed)
python tutorials/peft-curation-with-sdg/main.py

# Run with synthetic data generation and semantic deduplication
# Running with synthetic data generation and semantic deduplication using
# an external LLM inference endpoint located at "https://api.example.com/v1/chat/completions"
# and the model called "my-llm-model" that is served at that endpoint:
python tutorials/peft-curation-with-sdg/main.py \
--api-key YOUR_BUILD.NVIDIA.COM_API_KEY \
--synth-gen-endpoint https://api.example.com/v1/chat/completions \
--synth-gen-model my-llm-model \
--api-key API_KEY_FOR_LLM_ENDPOINT \
--device gpu

# Here are some examples that:
# - Use the GPU and enable semantic deduplication
# - Use the specified model from build.nvidia.com for synthetic data generation
# - Do 1 round of synthetic data generation
# - Generate synthetic data using 0.1% of the real data
# - Use the specified model from build.nvidia.com for synthetic data generation
# - Use the GPU and enable semantic deduplication

# Using LLaMa 3.1 405B:
python tutorials/peft-curation-with-sdg/main.py \
--api-key YOUR_BUILD.NVIDIA.COM_API_KEY \
--device gpu \
--synth-gen-model "meta/llama-3.1-405b-instruct" \
--synth-gen-rounds 1 \
--synth-gen-ratio 0.001 \
--synth-gen-model "meta/llama-3.1-405b-instruct"
--device gpu

# Using Nemotron-4 340B:
python tutorials/peft-curation-with-sdg/main.py \
--api-key YOUR_BUILD.NVIDIA.COM_API_KEY \
--device gpu \
--synth-gen-model "nvidia/nemotron-4-340b-instruct" \
--synth-gen-rounds 1 \
--synth-gen-ratio 0.001 \
--synth-gen-model "nvidia/nemotron-4-340b-instruct"
--device gpu
```
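A note on the endpoint value: `main.py` passes `--synth-gen-endpoint` to the OpenAI client as `base_url`, and the OpenAI Python client appends routes such as `/chat/completions` to that base URL. The value therefore most likely needs to be the API root (as in the default `https://integrate.api.nvidia.com/v1`) rather than the full route. The sketch below shows one way to normalize a full route into a base URL. `to_base_url` and `ROUTE_SUFFIXES` are hypothetical names used for illustration and are not part of the tutorial.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical helper, not part of the tutorial: route suffixes that an
# OpenAI-compatible client appends to its base URL on its own.
# The longer suffix is listed first so it is matched before "/completions".
ROUTE_SUFFIXES = ("/chat/completions", "/completions")


def to_base_url(endpoint: str) -> str:
    """Strip a trailing route so the URL can be used as a client base_url.

    Any query string or fragment is dropped as well.
    """
    parts = urlsplit(endpoint.strip())
    path = parts.path.rstrip("/")
    for suffix in ROUTE_SUFFIXES:
        if path.endswith(suffix):
            path = path[: -len(suffix)]
            break
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


# A full route is reduced to its API root; a base URL is left unchanged.
print(to_base_url("https://api.example.com/v1/chat/completions"))
print(to_base_url("https://integrate.api.nvidia.com/v1"))
```

With a helper like this, either form of the URL could be accepted on the command line, at the cost of one extra normalization step before the client is created.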

By default, this tutorial will use at most 8 workers to run the curation pipeline. If you face any
33 changes: 25 additions & 8 deletions tutorials/peft-curation-with-sdg/main.py
@@ -242,16 +242,28 @@ def run_pipeline(args, jsonl_fp):
Returns:
The file path to the final curated JSONL file.
"""
# Disable synthetic data generation if no model specified, or no API key is provided.
if args.synth_gen_model is None or args.synth_gen_model == "":
# Disable synthetic data generation if the necessary arguments are not provided.
if not args.synth_gen_endpoint:
print(
"No synthetic data generation endpoint provided. Skipping synthetic data generation."
)
args.synth_gen_rounds = 0
if not args.synth_gen_model:
print(
"No synthetic data generation model provided. Skipping synthetic data generation."
)
args.synth_gen_rounds = 0
if args.api_key is None:
print("No API key provided. Skipping synthetic data generation.")
args.synth_gen_rounds = 0
if not args.api_key:
print(
"No synthetic data generation API key provided. Skipping synthetic data generation."
)
args.synth_gen_rounds = 0

if args.synth_gen_rounds:
print(
f"Using {args.synth_gen_endpoint}/{args.synth_gen_model} for synthetic data generation."
)

synth_gen_ratio = args.synth_gen_ratio
synth_gen_rounds = args.synth_gen_rounds
synth_n_variants = args.synth_n_variants
@@ -277,7 +289,7 @@ def run_pipeline(args, jsonl_fp):
# Create the synthetic data generator.
llm_client = AsyncOpenAIClient(
AsyncOpenAI(
base_url="https://integrate.api.nvidia.com/v1",
base_url=args.synth_gen_endpoint,
api_key=args.api_key or "",
timeout=args.api_timeout,
)
@@ -348,12 +360,17 @@ def run_pipeline(args, jsonl_fp):
def main():
parser = argparse.ArgumentParser()
parser = ArgumentHelper(parser).add_distributed_args()
parser.add_argument(
"--synth-gen-endpoint",
type=str,
default="https://integrate.api.nvidia.com/v1",
help="The API endpoint to use for synthetic data generation. Any endpoint compatible with the OpenAI API can be used.",
)
parser.add_argument(
"--synth-gen-model",
type=str,
default="nvidia/nemotron-4-340b-instruct",
choices=["nvidia/nemotron-4-340b-instruct", "meta/llama-3.1-405b-instruct", ""],
help="The model from build.nvidia.com to use for synthetic data generation. Leave blank to skip synthetic data generation.",
help="The model from the provided API endpoint to use for synthetic data generation. Leave blank to skip synthetic data generation.",
)
parser.add_argument(
"--synth-gen-ratio",
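The argument handling in this change can be summarized in a small standalone sketch: synthetic data generation is skipped when the endpoint, the model, or the API key is missing. This is a minimal illustration and not the code from `main.py`. The helper names `build_parser` and `resolve_synth_gen_rounds`, the `--api-key` definition, and the `--synth-gen-rounds` default of 1 are assumptions, since the diff does not show them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser with only the arguments relevant to this check."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--synth-gen-endpoint",
        type=str,
        default="https://integrate.api.nvidia.com/v1",
    )
    parser.add_argument(
        "--synth-gen-model",
        type=str,
        default="nvidia/nemotron-4-340b-instruct",
    )
    # Assumed definitions: the diff does not show these two arguments.
    parser.add_argument("--synth-gen-rounds", type=int, default=1)
    parser.add_argument("--api-key", type=str, default=None)
    return parser


def resolve_synth_gen_rounds(args: argparse.Namespace) -> int:
    """Return 0 rounds if any required setting is missing or empty."""
    required = {
        "endpoint": args.synth_gen_endpoint,
        "model": args.synth_gen_model,
        "API key": args.api_key,
    }
    missing = [name for name, value in required.items() if not value]
    for name in missing:
        print(
            f"No synthetic data generation {name} provided. "
            "Skipping synthetic data generation."
        )
    return 0 if missing else args.synth_gen_rounds


# Example: an empty model name and no API key both disable generation.
args = build_parser().parse_args(["--synth-gen-model", ""])
rounds = resolve_synth_gen_rounds(args)
```

Returning the round count from one function, instead of assigning to `args` in several branches, keeps the decision in a single place and makes it straightforward to test.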
