From ad12e155809a461e4d4e609e0154e65df6bc3a8f Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Mon, 28 Oct 2024 15:56:25 -0700 Subject: [PATCH 1/5] save progress Signed-off-by: Sarah Yurick --- CONTRIBUTING.md | 2 +- docs/user-guide/cpuvsgpu.rst | 8 ++++---- docs/user-guide/kubernetescurator.rst | 6 +++--- ...bleinformationidentificationandremoval.rst | 4 ++-- examples/README.md | 1 + examples/classifiers/README.md | 19 +++++++++++++++++++ examples/k8s/README.md | 3 +++ examples/nemo_run/README.md | 3 +++ examples/nemo_run/launch_slurm.py | 4 ++-- examples/slurm/README.md | 1 + examples/slurm/start-slurm.sh | 2 +- nemo_curator/nemo_run/slurm.py | 6 +++--- nemo_curator/scripts/README.md | 1 + nemo_curator/scripts/classifiers/README.md | 1 + tutorials/pretraining-data-curation/README.md | 2 +- .../red-pajama-v2-curation-tutorial.ipynb | 2 +- .../start-distributed-notebook.sh | 2 +- 17 files changed, 48 insertions(+), 19 deletions(-) create mode 100644 examples/README.md create mode 100644 examples/classifiers/README.md create mode 100644 examples/k8s/README.md create mode 100644 examples/nemo_run/README.md create mode 100644 examples/slurm/README.md create mode 100644 nemo_curator/scripts/README.md create mode 100644 nemo_curator/scripts/classifiers/README.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 70ab2eda..5dcb2f98 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -37,7 +37,7 @@ There should be at least one example per module in the curator. They should be incredibly lightweight and rely on the core `nemo_curator` modules for their functionality. Most should be designed for a user to get up and running on their local machines, but distributed examples are welcomed if it makes sense. Python scripts should be the primary way to showcase your module. -Though, SLURM scripts or other cluster scripts should be included if there are special steps needed to run the module. +Though, Slurm scripts or other cluster scripts should be included if there are special steps needed to run the module. The documentation should complement each example by going through the motivation behind why a user would use each module. It should include both an explanation of the module, and how it's used in its corresponding example. diff --git a/docs/user-guide/cpuvsgpu.rst b/docs/user-guide/cpuvsgpu.rst index 683723b2..7bb5858b 100644 --- a/docs/user-guide/cpuvsgpu.rst +++ b/docs/user-guide/cpuvsgpu.rst @@ -35,7 +35,7 @@ All of the ``examples/`` use it to set up a Dask cluster. It is possible to run entirely CPU-based workflows on a GPU cluster, though the process count (and therefore the number of parallel tasks) will be limited by the number of GPUs on your machine. * ``scheduler_address`` and ``scheduler_file`` are used for connecting to an existing Dask cluster. - Supplying one of these is essential if you are running a Dask cluster on SLURM or Kubernetes. + Supplying one of these is essential if you are running a Dask cluster on Slurm or Kubernetes. All other arguments are ignored if either of these are passed, as the cluster configuration will be done when you create the schduler and works on your cluster. * The remaining arguments can be modified `here `_. @@ -82,12 +82,12 @@ Even if you start a GPU dask cluster, you can't operate on datasets that use a ` The ``DocuemntDataset`` must either have been originally read in with a ``cudf`` backend, or it must be transferred during the script. 
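To make the backend requirement above concrete, a GPU-bound script would typically read its data with the ``cudf`` backend from the start. The snippet below is only a sketch; the ``cluster_type`` argument to ``get_client`` and the exact reader options are assumptions to verify against your installed NeMo Curator version.

.. code-block:: python

    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.distributed_utils import get_client

    # Assumed: get_client can start a local GPU (dask-cuda) cluster.
    client = get_client(cluster_type="gpu")

    # Reading with backend="cudf" keeps the DocumentDataset on the GPU,
    # which is what the GPU-based modules expect.
    dataset = DocumentDataset.read_json("/path/to/jsonl/", backend="cudf")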
----------------------------------------- -Dask with SLURM +Dask with Slurm ----------------------------------------- -We provide an example SLURM script pipeline in ``examples/slurm``. +We provide an example Slurm script pipeline in ``examples/slurm``. This pipeline has a script ``start-slurm.sh`` that provides configuration options similar to what ``get_client`` provides. -Every SLURM cluster is different, so make sure you understand how your SLURM cluster works so the scripts can be easily adapted. +Every Slurm cluster is different, so make sure you understand how your Slurm cluster works so the scripts can be easily adapted. ``start-slurm.sh`` calls ``containter-entrypoint.sh`` which sets up a Dask scheduler and workers across the cluster. Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` to run on multiple nodes. diff --git a/docs/user-guide/kubernetescurator.rst b/docs/user-guide/kubernetescurator.rst index d695286a..e85b2170 100644 --- a/docs/user-guide/kubernetescurator.rst +++ b/docs/user-guide/kubernetescurator.rst @@ -139,7 +139,7 @@ use ``kubectl cp``, but ``exec`` has fewer surprises regarding compressed files: Create a Dask Cluster --------------------- -Use the ``create_dask_cluster.py`` to create a CPU or GPU dask cluster. +Use the ``create_dask_cluster.py`` to create a CPU or GPU Dask cluster. .. note:: If you are creating another Dask cluster with the same ``--name ``, first delete it via:: @@ -289,7 +289,7 @@ container, we will need to build a custom image with your code installed: # Fill in // kubectl create secret docker-registry my-private-registry --docker-server= --docker-username= --docker-password= - And with this new secret, you create your new dask cluster: + And with this new secret, you create your new Dask cluster: .. code-block:: bash @@ -360,7 +360,7 @@ At this point you can tail the logs and look for ``Finished!`` in ``/nemo-worksp Deleting Cluster ---------------- -After you have finished using the created dask cluster, you can delete it to release the resources: +After you have finished using the created Dask cluster, you can delete it to release the resources: .. code-block:: bash diff --git a/docs/user-guide/personalidentifiableinformationidentificationandremoval.rst b/docs/user-guide/personalidentifiableinformationidentificationandremoval.rst index 81f17d0e..6ebe3bff 100644 --- a/docs/user-guide/personalidentifiableinformationidentificationandremoval.rst +++ b/docs/user-guide/personalidentifiableinformationidentificationandremoval.rst @@ -26,7 +26,7 @@ The tool utilizes `Dask `_ to parallelize tasks and hence it c used to scale up to terabytes of data easily. Although Dask can be deployed on various distributed compute environments such as HPC clusters, Kubernetes and other cloud offerings such as AWS EKS, Google cloud etc, the current implementation only supports -Dask on HPC clusters that use SLURM as the resource manager. +Dask on HPC clusters that use Slurm as the resource manager. ----------------------------------------- Usage @@ -92,7 +92,7 @@ The PII redaction module can also be invoked via ``script/find_pii_and_deidentif ``python nemo_curator/scripts/find_pii_and_deidentify.py`` -To launch the script from within a SLURM environment, the script ``examples/slurm/start-slurm.sh`` can be modified and used. +To launch the script from within a Slurm environment, the script ``examples/slurm/start-slurm.sh`` can be modified and used. 
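For quick experiments, the same de-identification can also be driven from Python with the ``PiiModifier`` and ``Modify`` classes instead of the script. The following is a minimal sketch; the modifier arguments shown here (``supported_entities``, ``anonymize_action``, ``batch_size``, ``device``) are indicative only and should be checked against ``examples/find_pii_and_deidentify.py`` in your version of NeMo Curator.

.. code-block:: python

    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers.pii_modifier import PiiModifier
    from nemo_curator.modules.modify import Modify
    from nemo_curator.utils.distributed_utils import get_client

    client = get_client()  # or pass scheduler_file=... when running under Slurm

    dataset = DocumentDataset.read_json("input_data/")

    # Assumed argument names; adjust to your installed version.
    modifier = PiiModifier(
        language="en",
        supported_entities=["PERSON", "EMAIL_ADDRESS"],
        anonymize_action="replace",
        batch_size=1000,
        device="gpu",
    )

    redacted = Modify(modifier)(dataset)
    redacted.to_json("redacted_data/")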
############################
diff --git a/examples/README.md b/examples/README.md
new file mode 100644
index 00000000..46409041
--- /dev/null
+++ b/examples/README.md
@@ -0,0 +1 @@
+# TODO
diff --git a/examples/classifiers/README.md b/examples/classifiers/README.md
new file mode 100644
index 00000000..782732b5
--- /dev/null
+++ b/examples/classifiers/README.md
@@ -0,0 +1,19 @@
+## Text Classification
+
+The Python scripts in this directory demonstrate how to run classification on your text data with each of these 4 classifiers:
+
+- Domain Classifier
+- Quality Classifier
+- AEGIS Safety Models
+- FineWeb Educational Content Classifier
+
+For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).
+
+Each of these scripts provides a simple example of what your own Python scripts might look like.
+
+At a high level, you will:
+
+1. Create a Dask client by using the `get_client` function
+2. Use `DocumentDataset.read_json` (or `DocumentDataset.read_parquet`) to read your data
+3. Initialize and call the classifier on your data
+4. Write your results to the desired output type with `to_json` or `to_parquet`
diff --git a/examples/k8s/README.md b/examples/k8s/README.md
new file mode 100644
index 00000000..75da4e8f
--- /dev/null
+++ b/examples/k8s/README.md
@@ -0,0 +1,3 @@
+The `create_dask_cluster.py` can be used to create a CPU or GPU Dask cluster.
+
+See [Running NeMo Curator on Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/kubernetescurator.html) for more information.
diff --git a/examples/nemo_run/README.md b/examples/nemo_run/README.md
new file mode 100644
index 00000000..cc30508b
--- /dev/null
+++ b/examples/nemo_run/README.md
@@ -0,0 +1,3 @@
+The `launch_slurm.py` script shows an example of how to run a Slurm job via Python APIs.
+
+See the [Dask with Slurm](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html?highlight=slurm#dask-with-slurm) and [NeMo-Run Quickstart](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html?highlight=slurm#execute-on-a-slurm-cluster) pages for more information.
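As a rough illustration of the four numbered steps in `examples/classifiers/README.md` above, a minimal classification script might look like the sketch below. It assumes the `DomainClassifier` import path and default arguments shown here; the scripts in `examples/classifiers/` remain the authoritative reference.

```python
from nemo_curator.classifiers import DomainClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client

# 1. Create a Dask client (the classifiers run on GPU workers).
client = get_client(cluster_type="gpu")

# 2. Read the input data.
dataset = DocumentDataset.read_json("input_data/")

# 3. Initialize and call the classifier on the data.
classifier = DomainClassifier()
result = classifier(dataset)

# 4. Write the results to the desired output type.
result.to_json("output_data/")
```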
diff --git a/examples/nemo_run/launch_slurm.py b/examples/nemo_run/launch_slurm.py index daa46cab..005d7ff9 100644 --- a/examples/nemo_run/launch_slurm.py +++ b/examples/nemo_run/launch_slurm.py @@ -21,7 +21,7 @@ @run.factory def nemo_curator_slurm_executor() -> SlurmExecutor: """ - Configure the following function with the details of your SLURM cluster + Configure the following function with the details of your Slurm cluster """ return SlurmExecutor( job_name_prefix="nemo-curator", @@ -35,7 +35,7 @@ def nemo_curator_slurm_executor() -> SlurmExecutor: def main(): - # Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the SLURM cluster + # Path to NeMo-Curator/examples/slurm/container_entrypoint.sh on the Slurm cluster container_entrypoint = "/cluster/path/slurm/container_entrypoint.sh" # The NeMo Curator command to run # This command can be susbstituted with any NeMo Curator command diff --git a/examples/slurm/README.md b/examples/slurm/README.md new file mode 100644 index 00000000..46409041 --- /dev/null +++ b/examples/slurm/README.md @@ -0,0 +1 @@ +# TODO diff --git a/examples/slurm/start-slurm.sh b/examples/slurm/start-slurm.sh index a408f4b8..0a9767fe 100644 --- a/examples/slurm/start-slurm.sh +++ b/examples/slurm/start-slurm.sh @@ -23,7 +23,7 @@ # Begin easy customization # ================================================================= -# Base directory for all SLURM job logs and files +# Base directory for all Slurm job logs and files # Does not affect directories referenced in your script export BASE_JOB_DIR=`pwd`/nemo-curator-jobs export JOB_DIR=$BASE_JOB_DIR/$SLURM_JOB_ID diff --git a/nemo_curator/nemo_run/slurm.py b/nemo_curator/nemo_run/slurm.py index 3ab8c838..5ed5cc56 100644 --- a/nemo_curator/nemo_run/slurm.py +++ b/nemo_curator/nemo_run/slurm.py @@ -23,7 +23,7 @@ @dataclass class SlurmJobConfig: """ - Configuration for running a NeMo Curator script on a SLURM cluster using + Configuration for running a NeMo Curator script on a Slurm cluster using NeMo Run Args: @@ -74,13 +74,13 @@ def to_script(self, add_scheduler_file: bool = True, add_device: bool = True): add_scheduler_file: Automatically appends a '--scheduler-file' argument to the script_command where the value is job_dir/logs/scheduler.json. All scripts included in NeMo Curator accept and require this argument to scale - properly on SLURM clusters. + properly on Slurm clusters. add_device: Automatically appends a '--device' argument to the script_command where the value is the member variable of device. All scripts included in NeMo Curator accept and require this argument. Returns: A NeMo Run Script that will intialize a Dask cluster, and run the specified command. 
-        It is designed to be executed on a SLURM cluster
+        It is designed to be executed on a Slurm cluster
         """
         env_vars = self._build_env_vars()
diff --git a/nemo_curator/scripts/README.md b/nemo_curator/scripts/README.md
new file mode 100644
index 00000000..46409041
--- /dev/null
+++ b/nemo_curator/scripts/README.md
@@ -0,0 +1 @@
+# TODO
diff --git a/nemo_curator/scripts/classifiers/README.md b/nemo_curator/scripts/classifiers/README.md
new file mode 100644
index 00000000..46409041
--- /dev/null
+++ b/nemo_curator/scripts/classifiers/README.md
@@ -0,0 +1 @@
+# TODO
diff --git a/tutorials/pretraining-data-curation/README.md b/tutorials/pretraining-data-curation/README.md
index d3637281..3d32c2d7 100644
--- a/tutorials/pretraining-data-curation/README.md
+++ b/tutorials/pretraining-data-curation/README.md
@@ -6,6 +6,6 @@ This tutorial demonstrates the usage of NeMo Curator to curate the RedPajama-Dat
 RedPajama-V2 (RPV2) is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. In this tutorial, we will be perform data curation on two raw snapshots from RPV2 for demonstration purposes.
 
 ## Getting Started
-This tutorial is designed to run in multi-node environment due to the pre-training dataset scale. To start the tutorial, run the slurm script `start-distributed-notebook.sh` in this directory which will start the Jupyter notebook that demonstrates the step by step walkthrough of the end to end curation pipeline. To access the Jupyter notebook running on the scheduler node from your local machine, you can establish an SSH tunnel by running the following command:
+This tutorial is designed to run in a multi-node environment due to the pre-training dataset scale. To start the tutorial, run the Slurm script `start-distributed-notebook.sh` in this directory, which will start the Jupyter notebook that demonstrates a step-by-step walkthrough of the end-to-end curation pipeline. To access the Jupyter notebook running on the scheduler node from your local machine, you can establish an SSH tunnel by running the following command:
 
 `ssh -L :localhost:8888 @`
diff --git a/tutorials/pretraining-data-curation/red-pajama-v2-curation-tutorial.ipynb b/tutorials/pretraining-data-curation/red-pajama-v2-curation-tutorial.ipynb
index 42c92bfa..d854c9d9 100644
--- a/tutorials/pretraining-data-curation/red-pajama-v2-curation-tutorial.ipynb
+++ b/tutorials/pretraining-data-curation/red-pajama-v2-curation-tutorial.ipynb
@@ -88,7 +88,7 @@
     "# 2. Getting started\n",
     "\n",
     "\n",
-    "NeMo-Curator uses dask for parallelization. Before we start using curator, we need to start a dask cluster. To start a multi-node dask cluster in slurm, we can use the `start-distributed-notebook.sh` script in this directory to start the cluster. The user will need to change the following variables:\n",
+    "NeMo-Curator uses Dask for parallelization. Before we start using Curator, we need to start a Dask cluster. To start a multi-node Dask cluster with Slurm, we can use the `start-distributed-notebook.sh` script in this directory to start the cluster. The user will need to change the following variables:\n",
     "\n",
     "- Slurm job directives\n",
     "- Device type (`cpu` or `gpu`). Curator has both cpu and gpu modules.
Check [here](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html) to see which modules are cpu/gpu\n",
diff --git a/tutorials/pretraining-data-curation/start-distributed-notebook.sh b/tutorials/pretraining-data-curation/start-distributed-notebook.sh
index 0c1cd7ee..0108975c 100644
--- a/tutorials/pretraining-data-curation/start-distributed-notebook.sh
+++ b/tutorials/pretraining-data-curation/start-distributed-notebook.sh
@@ -23,7 +23,7 @@
 # Begin easy customization
 # =================================================================
 
-# Base directory for all SLURM job logs and files
+# Base directory for all Slurm job logs and files
 # Does not affect directories referenced in your script
 export BASE_JOB_DIR=`pwd`/nemo-curator-jobs
 export JOB_DIR=$BASE_JOB_DIR/$SLURM_JOB_ID

From 71aedab1ec4167da6dce323bd27ffd51b39c9384 Mon Sep 17 00:00:00 2001
From: Sarah Yurick 
Date: Tue, 29 Oct 2024 13:00:30 -0700
Subject: [PATCH 2/5] add remaining docs

Signed-off-by: Sarah Yurick 
---
 docs/user-guide/cpuvsgpu.rst               |  4 +-
 examples/README.md                         | 22 ++++-
 examples/slurm/README.md                   | 10 ++-
 nemo_curator/modules/dataset_ops.py        |  2 +-
 nemo_curator/scripts/README.md             | 30 ++++++-
 nemo_curator/scripts/classifiers/README.md | 93 +++++++++++++++++++++-
 6 files changed, 154 insertions(+), 7 deletions(-)

diff --git a/docs/user-guide/cpuvsgpu.rst b/docs/user-guide/cpuvsgpu.rst
index 7bb5858b..8d8cd64a 100644
--- a/docs/user-guide/cpuvsgpu.rst
+++ b/docs/user-guide/cpuvsgpu.rst
@@ -88,9 +88,9 @@ Dask with Slurm
 We provide an example Slurm script pipeline in ``examples/slurm``.
 This pipeline has a script ``start-slurm.sh`` that provides configuration options similar to what ``get_client`` provides.
 Every Slurm cluster is different, so make sure you understand how your Slurm cluster works so the scripts can be easily adapted.
-``start-slurm.sh`` calls ``containter-entrypoint.sh`` which sets up a Dask scheduler and workers across the cluster.
+``start-slurm.sh`` calls ``container-entrypoint.sh``, which sets up a Dask scheduler and workers across the cluster.
 
-Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` to run on multiple nodes.
+Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the ``start-slurm.sh`` script to run on multiple nodes.
 You can adapt your scripts easily too by simply following the pattern of adding ``get_client`` with ``add_distributed_args``.
 
 -----------------------------------------
diff --git a/examples/README.md b/examples/README.md
index 46409041..a61dea8b 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -1 +1,21 @@
-# TODO
+# NeMo Curator Python API examples
+
+This directory contains multiple Python scripts with examples of how to use various NeMo Curator classes and functions.
+The goal of these examples is to give the user an overview of many of the ways your text data can be curated.
+These include:
+
+- `blend_and_shuffle.py`: Combine multiple datasets into one with different amounts of each dataset, then randomly permute the dataset.
+- `classifier_filtering.py`: Train a fastText classifier, then use it to filter high and low quality data.
+- `download_arxiv.py`: Download Arxiv tar files and extract them.
+- `download_common_crawl.py`: Download Common Crawl WARC snapshots and extract them.
+- `download_wikipedia.py`: Download the latest Wikipedia dumps and extract them.
+- `exact_deduplication.py`: Use the `ExactDuplicates` class to perform exact deduplication on text data.
+- `find_pii_and_deidentify.py`: Use the `PiiModifier` and `Modify` classes to remove personally identifiable information from text data.
+- `fuzzy_deduplication.py`: Use the `FuzzyDuplicatesConfig` and `FuzzyDuplicates` classes to perform fuzzy deduplication on text data.
+- `identify_languages_and_fix_unicode.py`: Use `FastTextLangId` to filter data by language, then fix the unicode in it.
+- `raw_download_common_crawl.py`: Download the raw compressed WARC files from Common Crawl without extracting them.
+- `semdedup_example.py`: Use the `SemDedup` class to perform semantic deduplication on text data.
+- `task_decontamination.py`: Remove segments of downstream evaluation tasks from a dataset.
+- `translation_example.py`: Create and use an `IndicTranslation` model for language translation.
+
+The `classifiers`, `k8s`, `nemo_run`, and `slurm` subdirectories contain even more examples of NeMo Curator's capabilities.
diff --git a/examples/slurm/README.md b/examples/slurm/README.md
index 46409041..df6e4408 100644
--- a/examples/slurm/README.md
+++ b/examples/slurm/README.md
@@ -1 +1,9 @@
-# TODO
+# Dask with Slurm
+
+This directory provides an example Slurm script pipeline.
+This pipeline has a script `start-slurm.sh` that provides configuration options similar to what `get_client` provides.
+Every Slurm cluster is different, so make sure you understand how your Slurm cluster works so the scripts can be easily adapted.
+`start-slurm.sh` calls `container-entrypoint.sh`, which sets up a Dask scheduler and workers across the cluster.
+
+Our Python examples are designed to work such that they can be run locally on their own, or easily substituted into the `start-slurm.sh` script to run on multiple nodes.
+You can adapt your scripts easily too by simply following the pattern of adding `get_client` with `add_distributed_args`.
diff --git a/nemo_curator/modules/dataset_ops.py b/nemo_curator/modules/dataset_ops.py
index 38589b1e..745d741c 100644
--- a/nemo_curator/modules/dataset_ops.py
+++ b/nemo_curator/modules/dataset_ops.py
@@ -117,7 +117,7 @@ def blend_datasets(
     target_size: int, datasets: List[DocumentDataset], sampling_weights: List[float]
 ) -> DocumentDataset:
     """
-    Combined multiple datasets into one with different amounts of each dataset
+    Combines multiple datasets into one with different amounts of each dataset.
 
     Args:
         target_size: The number of documents the resulting dataset should have. The actual size of the dataset may be slightly larger if the normalized weights do not allow
diff --git a/nemo_curator/scripts/README.md b/nemo_curator/scripts/README.md
index 46409041..418043e4 100644
--- a/nemo_curator/scripts/README.md
+++ b/nemo_curator/scripts/README.md
@@ -1 +1,29 @@
-# TODO
+# NeMo Curator CLI Scripts
+
+The following Python scripts are designed to be executed from the command line (terminal) only.
+ +Here, we list all of the Python scripts and their terminal commands: + +| Python Command | CLI Command | +|------------------------------------------|--------------------------------| +| python add_id.py | add_id | +| python blend_datasets.py | blend_datasets | +| python download_and_extract.py | download_and_extract | +| python filter_documents.py | filter_documents | +| python find_exact_duplicates.py | gpu_exact_dups | +| python find_matching_ngrams.py | find_matching_ngrams | +| python find_pii_and_deidentify.py | deidentify | +| python get_common_crawl_urls.py | get_common_crawl_urls | +| python get_wikipedia_urls.py | get_wikipedia_urls | +| python make_data_shards.py | make_data_shards | +| python prepare_fasttext_training_data.py | prepare_fasttext_training_data | +| python prepare_task_data.py | prepare_task_data | +| python remove_matching_ngrams.py | remove_matching_ngrams | +| python separate_by_metadata.py | separate_by_metadata | +| python text_cleaning.py | text_cleaning | +| python train_fasttext.py | train_fasttext | +| python verify_classification_results.py | verify_classification_results | + +For more information about the arguments needed for each script, you can use `add_id --help`, etc. + +More scripts can be found in the `classifiers`, `fuzzy_deduplication`, and `semdedup` subdirectories. diff --git a/nemo_curator/scripts/classifiers/README.md b/nemo_curator/scripts/classifiers/README.md index 46409041..0499e370 100644 --- a/nemo_curator/scripts/classifiers/README.md +++ b/nemo_curator/scripts/classifiers/README.md @@ -1 +1,92 @@ -# TODO +## Text Classification + +The Python scripts in this directory demonstrate how to run classification on your text data with each of these 4 classifiers: + +- Domain Classifier +- Quality Classifier +- AEGIS Safety Models +- FineWeb Educational Content Classifier + +For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html). + +### Usage + +#### Domain classifier inference + +```bash +# same as `python domain_classifier_inference.py` +domain_classifier_inference \ + --input-data-dir /path/to/data/directory \ + --output-data-dir /path/to/output/directory \ + --input-file-type "jsonl" \ + --input-file-extension "jsonl" \ + --output-file-type "jsonl" \ + --input-text-field "text" \ + --batch-size 64 \ + --autocast \ + --max-chars 2000 \ + --device "gpu" +``` + +Additional arguments may be added for customizing a Dask cluster and client. Run `domain_classifier_inference --help` for more information. + +#### Quality classifier inference + +```bash +# same as `python quality_classifier_inference.py` +quality_classifier_inference \ + --input-data-dir /path/to/data/directory \ + --output-data-dir /path/to/output/directory \ + --input-file-type "jsonl" \ + --input-file-extension "jsonl" \ + --output-file-type "jsonl" \ + --input-text-field "text" \ + --batch-size 64 \ + --autocast \ + --max-chars 2000 \ + --device "gpu" +``` + +Additional arguments may be added for customizing a Dask cluster and client. Run `quality_classifier_inference --help` for more information. 
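These CLI entry points expose the same classifiers that are available through the Python API, and, as the `nemo_curator/nemo_run/slurm.py` docstring earlier in this series notes, every included script accepts a `--scheduler-file` argument so it can attach to a Dask cluster started by `start-slurm.sh`. A hedged Python sketch of the equivalent multi-node usage, with the `QualityClassifier` argument names assumed rather than verified:

```python
from nemo_curator.classifiers import QualityClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client

# Attach to an existing Dask cluster (e.g., one launched by start-slurm.sh);
# the scheduler file is typically written to <job_dir>/logs/scheduler.json.
client = get_client(scheduler_file="/path/to/logs/scheduler.json")

dataset = DocumentDataset.read_parquet("input_data/")

# batch_size here mirrors the --batch-size CLI flag; assumed keyword name.
classified = QualityClassifier(batch_size=64)(dataset)
classified.to_parquet("output_data/")
```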
+ +#### AEGIS classifier inference + +```bash +# same as `python aegis_classifier_inference.py` +aegis_classifier_inference \ + --input-data-dir /path/to/data/directory \ + --output-data-dir /path/to/output/directory \ + --input-file-type "jsonl" \ + --input-file-extension "jsonl" \ + --output-file-type "jsonl" \ + --input-text-field "text" \ + --batch-size 64 \ + --max-chars 6000 \ + --device "gpu" \ + --aegis-variant "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0" \ + --token "hf_1234" +``` + +- `--aegis-variant` can be `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0`, `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0`, or a path to your own PEFT of LlamaGuard 2. +- `--token` is your HuggingFace token, which is used when downloading the base Llama Guard model. + +Additional arguments may be added for customizing a Dask cluster and client. Run `aegis_classifier_inference --help` for more information. + +#### FineWeb-Edu classifier inference + +```bash +# same as `python fineweb_edu_classifier_inference.py` +fineweb_edu_classifier_inference \ + --input-data-dir /path/to/data/directory \ + --output-data-dir /path/to/output/directory \ + --input-file-type "jsonl" \ + --input-file-extension "jsonl" \ + --output-file-type "jsonl" \ + --input-text-field "text" \ + --batch-size 64 \ + --autocast \ + --max-chars 2000 \ + --device "gpu" +``` + +Additional arguments may be added for customizing a Dask cluster and client. Run `fineweb_edu_classifier_inference --help` for more information. From afad6bd771378992eec8725cab5958f4577970b9 Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Tue, 29 Oct 2024 13:11:21 -0700 Subject: [PATCH 3/5] add titles and table Signed-off-by: Sarah Yurick --- examples/README.md | 28 +++++++++++++++------------- examples/k8s/README.md | 2 ++ examples/nemo_run/README.md | 2 ++ 3 files changed, 19 insertions(+), 13 deletions(-) diff --git a/examples/README.md b/examples/README.md index a61dea8b..0e7f5952 100644 --- a/examples/README.md +++ b/examples/README.md @@ -4,18 +4,20 @@ This directory contains multiple Python scripts with examples of how to use vari The goal of these examples is to give the user an overview of many of the ways your text data can be curated. These include: -- `blend_and_shuffle.py`: Combine multiple datasets into one with different amounts of each dataset, then randomly permute the dataset. -- `classifier_filtering.py`: Train a fastText classifier, then use it to filter high and low quality data. -- `download_arxiv.py`: Download Arxiv tar files and extract them. -- `download_common_crawl.py`: Download Common Crawl WARC snapshots and extract them. -- `download_wikipedia.py`: Download the latest Wikipedia dumps and extract them. -- `exact_deduplication.py`: Use the `ExactDuplicates` class to perform exact deduplication on text data. -- `find_pii_and_deidentify.py`: Use the `PiiModifier` and `Modify` classes to remove personally identifiable information from text data. -- `fuzzy_deduplication.py`: Use the `FuzzyDuplicatesConfig` and `FuzzyDuplicates` classes to perform fuzzy deduplication on text data. -- `identify_languages_and_fix_unicode.py`: Use `FastTextLangId` to filter data by language, then fix the unicode in it. -- `raw_download_common_crawl.py`: Download the raw compressed WARC files from Common Crawl without extracting them. -- `semdedup_example.py`: Use the `SemDedup` class to perform semantic deduplication on text data. -- `task_decontamination.py`: Remove segments of downstream evaluation tasks from a dataset. 
-- `translation_example.py`: Create and use an `IndicTranslation` model for language translation. +| Python Script | Description | +|---------------------------------------|---------------------------------------------------------------------------------------------------------------| +| blend_and_shuffle.py | Combine multiple datasets into one with different amounts of each dataset, then randomly permute the dataset. | +| classifier_filtering.py | Train a fastText classifier, then use it to filter high and low quality data. | +| download_arxiv.py | Download Arxiv tar files and extract them. | +| download_common_crawl.py | Download Common Crawl WARC snapshots and extract them. | +| download_wikipedia.py | Download the latest Wikipedia dumps and extract them. | +| exact_deduplication.py | Use the `ExactDuplicates` class to perform exact deduplication on text data. | +| find_pii_and_deidentify.py | Use the `PiiModifier` and `Modify` classes to remove personally identifiable information from text data. | +| fuzzy_deduplication.py | Use the `FuzzyDuplicatesConfig` and `FuzzyDuplicates` classes to perform fuzzy deduplication on text data. | +| identify_languages_and_fix_unicode.py | Use `FastTextLangId` to filter data by language, then fix the unicode in it. | +| raw_download_common_crawl.py | Download the raw compressed WARC files from Common Crawl without extracting them. | +| semdedup_example.py | Use the `SemDedup` class to perform semantic deduplication on text data. | +| task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. | +| translation_example.py | Create and use an `IndicTranslation` model for language translation. | The `classifiers`, `k8s`, `nemo_run`, and `slurm` subdirectories contain even more examples of NeMo Curator's capabilities. diff --git a/examples/k8s/README.md b/examples/k8s/README.md index 75da4e8f..ce15fe4f 100644 --- a/examples/k8s/README.md +++ b/examples/k8s/README.md @@ -1,3 +1,5 @@ +## Kubernetes + The `create_dask_cluster.py` can be used to create a CPU or GPU Dask cluster. See [Running NeMo Curator on Kubernetes](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/kubernetescurator.html) for more information. diff --git a/examples/nemo_run/README.md b/examples/nemo_run/README.md index cc30508b..f419c414 100644 --- a/examples/nemo_run/README.md +++ b/examples/nemo_run/README.md @@ -1,3 +1,5 @@ +## NeMo-Run + The `launch_slurm.py` script shows an example of how to run a Slurm job via Python APIs. See the [Dask with Slurm](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html?highlight=slurm#dask-with-slurm) and [NeMo-Run Quickstart](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html?highlight=slurm#execute-on-a-slurm-cluster) pages for more information. From 4141699c0f5baaf87164dc4c3dd091d2c4f90951 Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Tue, 29 Oct 2024 13:13:46 -0700 Subject: [PATCH 4/5] remove trailing whitespace Signed-off-by: Sarah Yurick --- examples/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/README.md b/examples/README.md index 0e7f5952..2f5ba19f 100644 --- a/examples/README.md +++ b/examples/README.md @@ -17,7 +17,7 @@ These include: | identify_languages_and_fix_unicode.py | Use `FastTextLangId` to filter data by language, then fix the unicode in it. | | raw_download_common_crawl.py | Download the raw compressed WARC files from Common Crawl without extracting them. 
| | semdedup_example.py | Use the `SemDedup` class to perform semantic deduplication on text data. | -| task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. | +| task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. | | translation_example.py | Create and use an `IndicTranslation` model for language translation. | The `classifiers`, `k8s`, `nemo_run`, and `slurm` subdirectories contain even more examples of NeMo Curator's capabilities. From 5816b76611d587450727b74cbcc05b487cf3ac2e Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Mon, 18 Nov 2024 14:36:06 -0800 Subject: [PATCH 5/5] add --help instructions Signed-off-by: Sarah Yurick --- examples/README.md | 2 ++ examples/classifiers/README.md | 2 ++ 2 files changed, 4 insertions(+) diff --git a/examples/README.md b/examples/README.md index 2f5ba19f..3e101a1e 100644 --- a/examples/README.md +++ b/examples/README.md @@ -20,4 +20,6 @@ These include: | task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. | | translation_example.py | Create and use an `IndicTranslation` model for language translation. | +Before running any of these scripts, we strongly recommend displaying `python