Commit

Merge remote-tracking branch 'upstream/main' into filter_by_docs
sarahyurick committed Feb 7, 2025
2 parents 06a7605 + 34a1cc6 commit 519e825
Showing 121 changed files with 7,072 additions and 3,241 deletions.
12 changes: 2 additions & 10 deletions .github/workflows/build-test-publish-wheel.yml
@@ -27,20 +27,12 @@ defaults:

jobs:
build-test-publish-wheel:
uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_build_test_publish_wheel.yml@v0.7.0
uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_build_test_publish_wheel.yml@v0.20.0
with:
image-name: nemo_curator_container
dockerfile: Dockerfile
image-label: nemo-curator
build-args: |
IMAGE_LABEL=nemo-curator
REPO_URL=https://github.com/${{ github.repository }}.git
CURATOR_COMMIT=${{ github.sha }}
prune-filter-timerange: 24h
dry-run: true
python-package: nemo_curator
container-workdir: /opt/NeMo-Curator/
environment: public
python-version: '3.10'
secrets:
TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
18 changes: 7 additions & 11 deletions .github/workflows/release.yml
@@ -25,24 +25,20 @@ on:
required: true
default: true
type: boolean

version-bump-branch:
type: string
required: true
description: Branch to target for version bump
jobs:
release:
uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_release_library.yml@v0.17.4
uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_release_library.yml@v0.20.1
with:
release-ref: ${{ inputs.release-ref }}
image-name: nemo_curator_container
dockerfile: Dockerfile
image-label: nemo-curator
build-args: |
IMAGE_LABEL=nemo-curator
REPO_URL=https://github.com/${{ github.repository }}.git
CURATOR_COMMIT=${{ inputs.release-ref }}
prune-filter-timerange: 24h
python-package: nemo_curator
container-workdir: /opt/NeMo-Curator
python-version: '3.10'
library-name: NeMo Curator
dry-run: ${{ inputs.dry-run }}
version-bump-branch: ${{ inputs.version-bump-branch }}
secrets:
TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -19,7 +19,7 @@ jobs:
fail-fast: false
matrix:
os: [ubuntu-latest]
python-version: ["3.10"]
python-version: ["3.10", "3.12"]
steps:
- uses: actions/checkout@v4
- name: Optionally free up space on Ubuntu
18 changes: 15 additions & 3 deletions CHANGELOG.md
@@ -1,6 +1,18 @@
# Changelog

## NeMo Curator 0.5.0
## NVIDIA NeMo Curator 0.6.0

- Synthetic Data Generation for Text Retrieval
- LLM-based Filters
- Easiness
- Answerability
- Q&A Retrieval Generation Pipeline
- Parallel Dataset Curation for Machine Translation
- Load/Write Bitext Files
- Heuristic filtering (Histogram, Length Ratio)
- Classifier filtering (Comet, Cometoid)

## NVIDIA NeMo Curator 0.5.0

### Highlights

@@ -16,15 +28,15 @@

**Full Changelog**: <https://github.com/NVIDIA/NeMo-Curator/commits/v0.5.0>

## NeMo Curator 0.4.1
## NVIDIA NeMo Curator 0.4.1

## What's Changed

* Add spacy<3.8 pin to r0.4.1 by @ayushdg in <https://github.com/NVIDIA/NeMo-Curator/pull/279>

**Full Changelog**: <https://github.com/NVIDIA/NeMo-Curator/compare/v0.4.0...v0.4.1>

## NeMo Curator 0.4.0
## NVIDIA NeMo Curator 0.4.0

## Highlights

6 changes: 3 additions & 3 deletions Dockerfile
@@ -1,8 +1,8 @@
# See https://github.com/rapidsai/ci-imgs for ARG options
# NeMo Curator requires Python 3.10, Ubuntu 22.04/20.04, and CUDA 12 (or above)
# NeMo Curator requires Python 3.12, Ubuntu 22.04/20.04, and CUDA 12 (or above)
ARG CUDA_VER=12.5.1
ARG LINUX_VER=ubuntu22.04
ARG PYTHON_VER=3.10
ARG PYTHON_VER=3.12
ARG IMAGE_LABEL
ARG REPO_URL
ARG CURATOR_COMMIT
@@ -33,7 +33,7 @@ ARG CUDA_VER

# Install the minimal libcu* libraries needed by NeMo Curator
RUN conda create -y --name curator -c nvidia/label/cuda-${CUDA_VER} -c conda-forge \
python=3.10 \
python=3.12 \
cuda-cudart \
libcufft \
libcublas \
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
include LICENSE
8 changes: 4 additions & 4 deletions README.md
@@ -23,8 +23,8 @@ All of our text pipelines have great multilingual support.
- [Download and Extraction](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/download.html)
- Default implementations for Common Crawl, Wikipedia, and ArXiv sources
- Easily customize and extend to other sources
- [Language Identification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentificationunicodeformatting.html)
- [Unicode Reformatting](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentificationunicodeformatting.html)
- [Language Identification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentification.html)
- [Text Cleaning](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/textcleaning.html)
- [Heuristic Filtering](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)
- Classifier Filtering
- [fastText](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)
@@ -69,7 +69,7 @@ This section explains how to install NeMo Curator and use the Python library, Py

Before installing NeMo Curator, ensure that the following requirements are met:

- Python 3.10
- Python 3.10 or higher
- Ubuntu 22.04/20.04
- NVIDIA GPU (optional)
- Volta™ or higher ([compute capability 7.0+](https://developer.nvidia.com/cuda-gpus))
@@ -158,7 +158,7 @@ To get started with NeMo Curator, you can follow the tutorials [available here](

- [`tinystories`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/tinystories) which focuses on data curation for training LLMs from scratch.
- [`peft-curation`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation) which focuses on data curation for LLM parameter-efficient fine-tuning (PEFT) use-cases.
- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the domain and quality classifiers to help with data annotation.
- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which demonstrates how to use NVIDIA's Hugging Face classifiers to help with data annotation.
- [`single_node_tutorial`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/single_node_tutorial) which demonstrates an end-to-end data curation pipeline for curating Wikipedia data in Thai.
- [`image-curation`](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/image-curation/image-curation.ipynb) which explores the scalable image curation modules.

1 change: 1 addition & 0 deletions config/sem_dedup_config.yaml
@@ -6,6 +6,7 @@ num_files: 16
embeddings_save_loc: "embeddings"
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
write_embeddings_to_disk: true

# Clustering configuration
clustering_save_loc: "clustering_results"
5 changes: 4 additions & 1 deletion docs/user-guide/api/dask.rst
@@ -4,4 +4,7 @@ Dask Cluster Functions

.. autofunction:: nemo_curator.get_client

.. autofunction:: nemo_curator.get_network_interfaces
.. autofunction:: nemo_curator.get_network_interfaces

.. autoclass:: nemo_curator.ToBackend
:members:
9 changes: 9 additions & 0 deletions docs/user-guide/api/deduplication.rst
@@ -13,12 +13,21 @@ Exact
Fuzzy
------------------------

.. autoclass:: nemo_curator.BucketsToEdges
:members:

.. autoclass:: nemo_curator.ConnectedComponents
:members:

.. autoclass:: nemo_curator.FuzzyDuplicatesConfig
:members:

.. autoclass:: nemo_curator.FuzzyDuplicates
:members:

.. autoclass:: nemo_curator.JaccardSimilarity
:members:

.. autoclass:: nemo_curator.LSH
:members:

44 changes: 42 additions & 2 deletions docs/user-guide/cpuvsgpu.rst
@@ -69,10 +69,10 @@ The following NeMo Curator modules are GPU based.

* Domain Classification (English and multilingual)
* Quality Classification
* AEGIS and Instruction-Data-Guard Safety Models
* AEGIS and Instruction Data Guard Safety Models
* FineWeb Educational Content Classification
* Content Type Classification
* Prompt Task/Complexity Classification
* Prompt Task and Complexity Classification

GPU modules store the ``DocumentDataset`` using a ``cudf`` backend instead of a ``pandas`` one.
To read a dataset into GPU memory, one could use the following function call.
@@ -85,6 +85,46 @@ To read a dataset into GPU memory, one could use the following function call.
Even if you start a GPU Dask cluster, you can't operate on datasets that use a ``pandas`` backend.
The ``DocumentDataset`` must either have been originally read in with a ``cudf`` backend, or it must be transferred to the GPU during the script.
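
A minimal sketch of such a read, assuming the ``backend`` argument of ``DocumentDataset.read_json`` selects between the ``pandas`` and ``cudf`` backends:

.. code-block:: python

    from nemo_curator.datasets import DocumentDataset

    # Request the cuDF backend so the data is loaded straight into GPU memory.
    gpu_dataset = DocumentDataset.read_json("books.jsonl", backend="cudf")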

-----------------------------------------
Moving data between CPU and GPU
-----------------------------------------

The ``ToBackend`` module provides a way to move data between CPU memory and GPU memory by swapping between pandas and cuDF backends for your dataset.
To see how it works, take a look at this example.

.. code-block:: python

    from nemo_curator import Sequential, ToBackend, ScoreFilter, get_client
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.classifiers import DomainClassifier
    from nemo_curator.filters import RepeatingTopNGramsFilter, NonAlphaNumericFilter

    def main():
        client = get_client(cluster_type="gpu")
        dataset = DocumentDataset.read_json("books.jsonl")
        curation_pipeline = Sequential([
            ScoreFilter(RepeatingTopNGramsFilter(n=5)),
            ToBackend("cudf"),
            DomainClassifier(),
            ToBackend("pandas"),
            ScoreFilter(NonAlphaNumericFilter()),
        ])
        curated_dataset = curation_pipeline(dataset)
        curated_dataset.to_json("curated_books.jsonl")

    if __name__ == "__main__":
        main()

Let's highlight some of the important parts of this example.

* ``client = get_client(cluster_type="gpu")``: Creates a local Dask cluster with access to the GPUs. In order to use or swap to a cuDF dataframe backend, you need to make sure you are running on a GPU Dask cluster.
* ``dataset = DocumentDataset.read_json("books.jsonl")``: Reads in the dataset to a pandas (CPU) backend by default.
* ``curation_pipeline = ...``: Defines a curation pipeline consisting of a CPU filtering step, a GPU classifier step, and another CPU filtering step. The ``ToBackend("cudf")`` moves the dataset from CPU to GPU for the classifier, and the ``ToBackend("pandas")`` moves the dataset back to the CPU from the GPU for the last filter.
* ``curated_dataset.to_json("curated_books.jsonl")``: Writes the dataset directly to disk from the GPU. There is no need to transfer back to the CPU before writing to disk.

-----------------------------------------
Dask with Slurm
-----------------------------------------
28 changes: 14 additions & 14 deletions docs/user-guide/distributeddataclassification.rst
@@ -15,7 +15,7 @@ NeMo Curator provides a module to help users run inference with pre-trained mode
This is achieved by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to accelerate the classification task in a distributed manner.
Since the classification of a single text document is independent of other documents within the dataset, we can distribute the workload across multiple nodes and GPUs to perform parallel processing.

Domain (English and multilingual), quality, content safety, educational content, content type, and prompt task/complexity models are tasks we include as examples within our module.
Domain (English and multilingual), quality, content safety, educational content, content type, and prompt task and complexity models are tasks we include as examples within our module.
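
For example, a distributed classification run typically follows the construct-and-call pattern sketched below; the ``DomainClassifier`` arguments and file paths here are illustrative rather than prescriptive:

.. code-block:: python

    from nemo_curator import get_client
    from nemo_curator.classifiers import DomainClassifier
    from nemo_curator.datasets import DocumentDataset

    # Start a GPU Dask cluster so inference is distributed across all available GPUs.
    client = get_client(cluster_type="gpu")

    input_dataset = DocumentDataset.read_json("books.jsonl", backend="cudf")

    # Keep only documents predicted to belong to the listed domains.
    domain_classifier = DomainClassifier(filter_by=["Games", "Sports"])
    result_dataset = domain_classifier(dataset=input_dataset)

    result_dataset.to_json("games_and_sports/")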

Here, we summarize why each is useful for training an LLM:

@@ -27,13 +27,13 @@ Here, we summarize why each is useful for training an LLM:

- The **AEGIS Safety Models** are essential for filtering harmful or risky content, which is critical for training models that should avoid learning from unsafe data. By classifying content into 13 critical risk categories, AEGIS helps remove harmful or inappropriate data from the training sets, improving the overall ethical and safety standards of the LLM.

- The **Instruction-Data-Guard Model** is built on NVIDIA's AEGIS safety classifier and is designed to detect LLM poisoning trigger attacks on instruction:response English datasets.
- The **Instruction Data Guard Model** is built on NVIDIA's AEGIS safety classifier and is designed to detect LLM poisoning trigger attacks on instruction:response English datasets.

- The **FineWeb Educational Content Classifier** focuses on identifying and prioritizing educational material within datasets. This classifier is especially useful for training LLMs on specialized educational content, which can improve their performance on knowledge-intensive tasks. Models trained on high-quality educational content demonstrate enhanced capabilities on academic benchmarks such as MMLU and ARC, showcasing the classifier's impact on improving the knowledge-intensive task performance of LLMs.

- The **Content Type Classifier** is designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.

- The **Prompt Task/Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.
- The **Prompt Task and Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.

-----------------------------------------
Usage
@@ -95,8 +95,8 @@ Using the ``MultilingualDomainClassifier`` is very similar to using the ``Domain
For more information about the multilingual domain classifier, including its supported languages, please see the `nvidia/multilingual-domain-classifier <https://huggingface.co/nvidia/multilingual-domain-classifier>`_ on Hugging Face.

Quality Classifier
^^^^^^^^^^^^^^^^^^
Quality Classifier DeBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^

The Quality Classifier is designed to assess the quality of text documents, helping to filter out low-quality or noisy data from your dataset.
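
A minimal sketch, assuming ``QualityClassifier`` follows the same construct-and-call pattern as the other classifiers and that ``filter_by`` accepts its label names:

.. code-block:: python

    from nemo_curator.classifiers import QualityClassifier
    from nemo_curator.datasets import DocumentDataset

    input_dataset = DocumentDataset.read_json("books.jsonl", backend="cudf")

    # Keep only documents the model labels as high quality.
    quality_classifier = QualityClassifier(filter_by=["High"])
    result_dataset = quality_classifier(dataset=input_dataset)
    result_dataset.to_json("high_quality/")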

@@ -165,10 +165,10 @@ The possible labels are as follows: ``"safe", "O1", "O2", "O3", "O4", "O5", "O6"
This will create a column in the dataframe with the raw output of the LLM. You can choose to parse this response however you want.
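
For instance, one way to act on the parsed labels is to keep only rows predicted as ``safe``; the ``aegis_pred`` column name below is an assumption, so substitute whichever prediction column your classifier is configured to write:

.. code-block:: python

    # "aegis_pred" is assumed here; use the prediction column your classifier writes.
    safe_dataset = result_dataset[result_dataset["aegis_pred"] == "safe"]
    safe_dataset.to_json("safe_documents/")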

Instruction-Data-Guard Model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Instruction Data Guard
^^^^^^^^^^^^^^^^^^^^^^

Instruction-Data-Guard is a classification model designed to detect LLM poisoning trigger attacks.
Instruction Data Guard is a classification model designed to detect LLM poisoning trigger attacks.
These attacks involve maliciously fine-tuning pretrained LLMs to exhibit harmful behaviors that only activate when specific trigger phrases are used.
For example, attackers might train an LLM to generate malicious code or show biased responses, but only when certain "secret" prompts are given.

@@ -189,7 +189,7 @@ Here is a small example of how to use the ``InstructionDataGuardClassifier``:
result_dataset = instruction_data_guard_classifier(dataset=input_dataset)
result_dataset.to_json("labeled_dataset/")
In this example, the Instruction-Data-Guard model is obtained directly from `Hugging Face <https://huggingface.co/nvidia/instruction-data-guard>`_.
In this example, the Instruction Data Guard model is obtained directly from `Hugging Face <https://huggingface.co/nvidia/instruction-data-guard>`_.
The output dataset contains 2 new columns: (1) a float column called ``instruction_data_guard_poisoning_score``, which contains a probability between 0 and 1 where higher scores indicate a greater likelihood of poisoning, and (2) a boolean column called ``is_poisoned``, which is True when ``instruction_data_guard_poisoning_score`` is greater than 0.5 and False otherwise.
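
A short follow-up sketch using those two columns, mirroring the filtering pattern used elsewhere in these docs:

.. code-block:: python

    # Drop documents flagged as poisoned before any further curation.
    clean_dataset = result_dataset[result_dataset["is_poisoned"] == False]
    clean_dataset.to_json("clean_dataset/")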

FineWeb Educational Content Classifier
Expand Down Expand Up @@ -236,8 +236,8 @@ For example, to create a dataset with only highly educational content (scores 4
high_edu_dataset = result_dataset[result_dataset["fineweb-edu-score-int"] >= 4]
high_edu_dataset.to_json("high_educational_content/")
Content Type Classifier
^^^^^^^^^^^^^^^^^^^^^^^
Content Type Classifier DeBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Content Type Classifier is used to categorize speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.

@@ -258,10 +258,10 @@ Let's see how ``ContentTypeClassifier`` works in a small excerpt taken from ``ex
In this example, the content type classifier is obtained directly from `Hugging Face <https://huggingface.co/nvidia/content-type-classifier-deberta>`_.
It filters the input dataset to include only documents classified as "Blogs" or "News".
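
A sketch of that filtering step, assuming ``ContentTypeClassifier`` exposes the same ``filter_by`` argument as the other classifiers:

.. code-block:: python

    from nemo_curator.classifiers import ContentTypeClassifier
    from nemo_curator.datasets import DocumentDataset

    input_dataset = DocumentDataset.read_json("books.jsonl", backend="cudf")

    # Keep only documents classified as "Blogs" or "News".
    content_type_classifier = ContentTypeClassifier(filter_by=["Blogs", "News"])
    result_dataset = content_type_classifier(dataset=input_dataset)
    result_dataset.to_json("blogs_and_news/")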

Prompt Task/Complexity Classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Prompt Task and Complexity Classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Prompt Task/Complexity Classifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score.
The Prompt Task and Complexity Classifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score.

Here's an example of how to use the ``PromptTaskComplexityClassifier``:
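
The original snippet is collapsed in this diff; a minimal sketch, assuming the classifier follows the same construct-and-call pattern as the classifiers above (no ``filter_by`` is used because the model predicts multiple heads), might look like this:

.. code-block:: python

    from nemo_curator.classifiers import PromptTaskComplexityClassifier
    from nemo_curator.datasets import DocumentDataset

    input_dataset = DocumentDataset.read_json("prompts.jsonl", backend="cudf")

    classifier = PromptTaskComplexityClassifier()
    result_dataset = classifier(dataset=input_dataset)

    # The output includes task-type and per-dimension complexity columns.
    result_dataset.to_json("labeled_prompts/")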

6 changes: 4 additions & 2 deletions docs/user-guide/documentdataset.rst
@@ -68,14 +68,16 @@ Let's walk through this code line by line.
"books_dataset/books_02.jsonl"]
* ``books = DocumentDataset.read_json(files, add_filename=True)`` This will read the files listed into memory.
The ``add_filename=True`` option preserves the name of the shard (``books_00.jsonl``, ``books_01.jsonl``, etc.) as an additional ``filename`` field.
When the dataset is written back to disk, this option (in conjunction with the ``write_to_filename`` option) ensure that documents stay in their original shard.
The ``add_filename=True`` option preserves the name of the shard (``books_00.jsonl``, ``books_01.jsonl``, etc.) as an additional ``file_name`` field.
When the dataset is written back to disk, this option (in conjunction with the ``write_to_filename`` option and ``filename_col``) ensures that documents stay in their original shard.
This can be useful for manually inspecting the results of filtering shard by shard.
The ``add_filename`` option can also be used as a string, in which case it will be used as the name of the column (instead of the default ``file_name``).
* ``filter_step = ...`` This constructs and applies a heuristic filter for the length of the document.
More information is provided in the filtering page of the documentation.
* ``long_books.to_json("long_books/", write_to_filename=True)`` This writes the filtered dataset to a new directory.
As mentioned above, the ``write_to_filename=True`` preserves the sharding of the dataset.
If the dataset was not read in with ``add_filename=True``, setting ``write_to_filename=True`` will throw an error.
If the dataset was read with ``add_filename="path"``, then ``filename_col="path"`` will need to be set along with ``write_to_filename=True``, as shown in the sketch below.
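
A minimal round-trip sketch of the behavior described above, assuming the ``add_filename``/``filename_col`` semantics just outlined (the file names are illustrative):

.. code-block:: python

    from nemo_curator.datasets import DocumentDataset

    # Record each document's source shard in a "path" column while reading.
    books = DocumentDataset.read_json(
        ["books_dataset/books_00.jsonl", "books_dataset/books_01.jsonl"],
        add_filename="path",
    )

    # ... filtering steps ...

    # Write back, grouping documents by their original shard name.
    books.to_json("long_books/", write_to_filename=True, filename_col="path")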

``DocumentDataset`` is just a wrapper around a `Dask dataframe <https://docs.dask.org/en/stable/dataframe.html>`_.
The underlying dataframe can be accessed with the ``DocumentDataset.df`` member variable.
2 changes: 1 addition & 1 deletion docs/user-guide/image/gettingstarted.rst
@@ -12,7 +12,7 @@ Install NeMo Curator
---------------------
To install the image curation modules of NeMo Curator, ensure you meet the following requirements:

* Python 3.10
* Python 3.10 or higher
* Ubuntu 22.04/20.04
* NVIDIA GPU
* Volta™ or higher (compute capability 7.0+)