Commit

[Tutorials] Improve the usability of the SDG tutorial's command (NVIDIA#286)

Addresses feedback from an internal review.

Signed-off-by: Mehran Maghoumi <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
Maghoumi authored and vinay-raman committed Nov 12, 2024
1 parent d4bffca commit a2b6581
Showing 2 changed files with 20 additions and 11 deletions.
28 changes: 17 additions & 11 deletions tutorials/peft-curation-with-sdg/README.md
@@ -66,21 +66,27 @@ python tutorials/peft-curation-with-sdg/main.py \
     --api-key YOUR_BUILD.NVIDIA.COM_API_KEY \
     --device gpu

-# To control the amount of synthetic data to generate using LLaMa 3.1 405B
+# Here are some examples that:
+# - Use the GPU and enable semantic deduplication
+# - Do 1 round of synthetic data generation
+# - Generate synthetic data using 0.1% of the real data
+# - Use the specified model from build.nvidia.com for synthetic data generation
+
+# Using LLaMa 3.1 405B:
 python tutorials/peft-curation-with-sdg/main.py \
     --api-key YOUR_BUILD.NVIDIA.COM_API_KEY \
-    --device gpu \ # Use the GPU and enable semantic deduplication
-    --synth-gen-rounds 1 \ # Do 1 round of synthetic data generation
-    --synth-gen-ratio 0.001 \ # Generate synthetic data using 0.1% of the real data
-    --synth-gen-model "meta/llama-3.1-405b-instruct" # Use LLaMa 3.1 405B
+    --device gpu \
+    --synth-gen-rounds 1 \
+    --synth-gen-ratio 0.001 \
+    --synth-gen-model "meta/llama-3.1-405b-instruct"

-# To control the amount of synthetic data to generate using Nemotron-4 340B
+# Using Nemotron-4 340B:
 python tutorials/peft-curation-with-sdg/main.py \
     --api-key YOUR_BUILD.NVIDIA.COM_API_KEY \
-    --device gpu \ # Use the GPU and enable semantic deduplication
-    --synth-gen-rounds 1 \ # Do 1 round of synthetic data generation
-    --synth-gen-ratio 0.001 \ # Generate synthetic data using 0.1% of the real data
-    --synth-gen-model "nvidia/nemotron-4-340b-instruct" # Use Nemotron-4 340B
+    --device gpu \
+    --synth-gen-rounds 1 \
+    --synth-gen-ratio 0.001 \
+    --synth-gen-model "nvidia/nemotron-4-340b-instruct"
 ```

 By default, this tutorial will use at most 8 workers to run the curation pipeline. If you face any
@@ -91,4 +97,4 @@ Once the code finishes executing, the curated dataset will be available under `d
 By default, the script outputs splits for training (80%), validation (10%) and testing (10%).

 ## Next Step: Fine-tune Your Own Model
-The curated dataset from this tutorial can be readily used for model customization and fine-tuning using the [NeMo Framework](https://github.com/NVIDIA/NeMo). Please refer to the [law title generation tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-nemofw.ipynb) in the NeMo Framework repository to learn more.
+The curated dataset from this tutorial can be readily used for model customization and fine-tuning using the [NeMo Framework](https://github.com/NVIDIA/NeMo). Please refer to the [law title generation tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-nemofw.ipynb) in the NeMo Framework repository to learn more. In that tutorial, you will learn more about using the data you just curated to fine-tune a model that can read a legal question and generate a title for that question.
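
The README text in this diff mentions two quantities: `--synth-gen-ratio 0.001` ("0.1% of the real data") and the default 80%/10%/10% train/validation/test split. As a rough illustration of that arithmetic, and not the tutorial's actual implementation, the sketch below assumes the ratio is multiplied by the record count and rounded, and that the splits are cut from a seeded shuffle. The names `seed_record_count` and `split_80_10_10` are hypothetical.

```python
import random


def seed_record_count(num_real_records: int, ratio: float) -> int:
    """Number of real records used for synthetic generation.

    Assumption: the count is simply ratio * dataset size, rounded to
    the nearest integer. The tutorial may compute this differently.
    """
    return round(num_real_records * ratio)


def split_80_10_10(records: list, seed: int = 42) -> tuple[list, list, list]:
    """Shuffle a copy of the records and cut it into train/val/test parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Under these assumptions, a dataset of 10,000 real records with a ratio of 0.001 would contribute about 10 records per generation round, and 1,000 records would be split into 800, 100, and 100.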
3 changes: 3 additions & 0 deletions tutorials/peft-curation-with-sdg/main.py
@@ -402,6 +402,9 @@ def main():
     curated_dir = os.path.dirname(train_fp_curated)
     os.system(f"cp {val_fp} {curated_dir}")
     os.system(f"cp {test_fp} {curated_dir}")
+    print(
+        "--------------------------------------------------------------------------------"
+    )
     print(f"Curated files are saved in '{curated_dir}'.")
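
The context lines in this hunk copy the validation and test files with `os.system("cp ...")`, which depends on a `cp` command being available and can break on paths that contain spaces. A minimal sketch of a portable alternative using only the standard library is shown below. It is not part of this commit, and `copy_into_dir` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def copy_into_dir(src_files, dest_dir):
    """Copy each source file into dest_dir and return the new paths.

    shutil.copy keeps the original file name when the destination is
    a directory, which mirrors the behavior of `cp file dir`.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy(src, dest)) for src in src_files]
```

In `main()`, a call such as `copy_into_dir([val_fp, test_fp], curated_dir)` would stand in for the two `os.system` lines, assuming those variables hold file paths as they appear to in the hunk.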

