Commit 7d006e8
Merge branch 'main' into standardize_field_names
Signed-off-by: Sarah Yurick <[email protected]>
sarahyurick authored Feb 13, 2025
2 parents 48fd14b + a5d1a7b commit 7d006e8
Showing 65 changed files with 5,320 additions and 3,010 deletions.
84 changes: 43 additions & 41 deletions .github/workflows/gpuci.yml
@@ -7,12 +7,12 @@ on:
pull_request:
branches:
# We can run gpuCI on any PR targeting these branches
- 'main'
- '[rv][0-9].[0-9].[0-9]'
- '[rv][0-9].[0-9].[0-9]rc[0-9]'
- "main"
- "[rv][0-9].[0-9].[0-9]"
- "[rv][0-9].[0-9].[0-9]rc[0-9]"
# PR has to be labeled with "gpuCI" label
# If new commits are added, the "gpuCI" label has to be removed and re-added to rerun gpuCI
types: [ labeled ]
types: [labeled]

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
@@ -40,50 +40,52 @@ jobs:
# This is the tag on our Azure runner found in Actions -> Runners -> Self-hosted runners
# It has 2 A100 GPUs
runs-on: self-hosted-azure
# Unit tests shouldn't take longer than 30 minutes
timeout-minutes: 30
# "run-gpu-tests" job is run if the "gpuci" label is added to the PR
if: ${{ github.event.label.name == 'gpuci' || github.ref == 'refs/heads/main' }}

steps:
# If something went wrong during the last cleanup, this step ensures any existing container is removed
- name: Remove existing container if it exists
run: |
if [ "$(docker ps -aq -f name=nemo-curator-container)" ]; then
docker rm -f nemo-curator-container
fi
# This runs the container which was pushed by build-container, which we call "nemo-curator-container"
# `--gpus all` ensures that all of the GPUs from our self-hosted-azure runner are available in the container
# We use "github.run_id" to identify the PR with the commits we want to run the PyTests with
# `bash -c "sleep infinity"` keeps the container running indefinitely without exiting
- name: Run Docker container
run: |
docker run --gpus all --name nemo-curator-container -d nemoci.azurecr.io/nemo_curator_container:${{ github.run_id }} bash -c "sleep infinity"
# Expect `whoami` to be "azureuser"
# Expect `nvidia-smi` to show our 2 A100 GPUs
- name: Check GPUs
run: |
whoami
docker exec nemo-curator-container nvidia-smi
# In the virtual environment (called "curator") we created in the container,
# list all of our packages. Useful for debugging
- name: Verify installations
run: |
docker exec nemo-curator-container pip list
# In the virtual environment (called "curator") we created in the container,
# run our PyTests marked with `@pytest.mark.gpu`
# We specify the `rootdir` to help locate the "pyproject.toml" file (which is in the root directory of the repository),
# and then the directory where the PyTests are located
- name: Run PyTests with GPU mark
run: |
docker exec nemo-curator-container pytest -m gpu --rootdir /opt/NeMo-Curator /opt/NeMo-Curator/tests
# After running `docker stop`, the container remains in an exited state
# It is still present on our system and could be restarted with `docker start`
# Thus, we use `docker rm` to permanently remove it from the system
- name: Cleanup
if: always()
run: |
docker stop nemo-curator-container && docker rm nemo-curator-container
12 changes: 9 additions & 3 deletions .github/workflows/release-freeze.yml
@@ -1,4 +1,4 @@
name: 'Code freeze'
name: "Code freeze"

on:
workflow_dispatch:
@@ -9,14 +9,20 @@ on:
options:
- major
- minor
freeze-commit:
type: string
description: Commit SHA to use for cut-off
required: false
default: main

jobs:
code-freeze:
uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_code_freeze.yml@v0.12.0
uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_code_freeze.yml@v0.21.6
with:
library-name: NeMo Curator
python-package: nemo_curator
release-type: ${{ inputs.release-type }}

freeze-commit: ${{ inputs.freeze-commit }}
secrets:
SLACK_RELEASE_ENDPOINT: ${{ secrets.SLACK_RELEASE_ENDPOINT }}
SLACK_WEBHOOK_ADMIN: ${{ secrets.SLACK_WEBHOOK_ADMIN }}
7 changes: 6 additions & 1 deletion README.md
@@ -70,6 +70,7 @@ This section explains how to install NeMo Curator and use the Python library, Py
Before installing NeMo Curator, ensure that the following requirements are met:

- Python 3.10 or higher
- packaging >= 22.0
- Ubuntu 22.04/20.04
- NVIDIA GPU (optional)
- Volta™ or higher ([compute capability 7.0+](https://developer.nvidia.com/cuda-gpus))
@@ -187,7 +188,11 @@ The following figure shows that the use of different data curation modules imple
<img src="./docs/user-guide/assets/zeroshot_ablations.png" alt="drawing" width="700"/>
</p>

In terms of scalability and compute performance, using the combination of RAPIDS and Dask fuzzy deduplication enabled us to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours with 64 NVIDIA A100 Tensor Core GPUs.
In terms of scalability and compute performance, using the combination of RAPIDS and Dask fuzzy deduplication enabled us to deduplicate the 1.96 Trillion token subset of the RedPajama V2 dataset in 0.5 hours with 32 NVIDIA H100 GPUs.

Processing Time | Comparison to Alternative Libraries
:-------------------------:|:---------------------------------------:
![](./docs/user-guide/assets/readme/fuzzy-dedup-processing-time.png) | ![](./docs/user-guide/assets/readme/fuzzy-dedup-processing-optimization-16x.png)

Additionally, using the CPU-based modules, the following table shows the time required and resulting data size reduction for each processing step on the [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)).

20 changes: 20 additions & 0 deletions conftest.py
@@ -1,4 +1,11 @@
import pytest
from dask.distributed import Client

from nemo_curator.utils.import_utils import gpu_only_import, gpu_only_import_from

cudf = gpu_only_import("cudf")
dask_cudf = gpu_only_import("dask_cudf")
LocalCUDACluster = gpu_only_import_from("dask_cuda", "LocalCUDACluster")


def pytest_addoption(parser):
@@ -13,3 +20,16 @@ def pytest_collection_modifyitems(config, items):
for item in items:
if "gpu" in item.keywords:
item.add_marker(skip_gpu)


@pytest.fixture(autouse=True, scope="session")
def gpu_client(request):
if not request.config.getoption("--cpu"):
with LocalCUDACluster(n_workers=1) as cluster, Client(cluster) as client:
request.session.client = client
request.session.cluster = cluster
yield client
client.close()
cluster.close()
else:
yield None
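For context, a GPU-marked test that relies on this fixture might look like the following minimal sketch (a hypothetical test, not part of this commit; the test name and assertion are illustrative only):

import pytest

from nemo_curator.utils.import_utils import gpu_only_import

cudf = gpu_only_import("cudf")


@pytest.mark.gpu
def test_cudf_sum_on_gpu(gpu_client):
    # Hypothetical example, not part of this commit. When the suite is run
    # with `--cpu`, pytest_collection_modifyitems adds the skip marker and
    # this test never executes; otherwise the session-scoped `gpu_client`
    # fixture above has already started a LocalCUDACluster and Dask client.
    series = cudf.Series([1, 2, 3])
    assert int(series.sum()) == 6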
6 changes: 6 additions & 0 deletions docs/user-guide/api/classifiers.rst
@@ -14,6 +14,12 @@ Classifiers
.. autoclass:: nemo_curator.classifiers.FineWebEduClassifier
:members:

.. autoclass:: nemo_curator.classifiers.FineWebMixtralEduClassifier
:members:

.. autoclass:: nemo_curator.classifiers.FineWebNemotronEduClassifier
:members:

.. autoclass:: nemo_curator.classifiers.AegisClassifier
:members:

8 changes: 8 additions & 0 deletions docs/user-guide/api/filters.rst
@@ -152,6 +152,14 @@ Heuristic Filters
:members:
:member-order: bysource

.. autoclass:: nemo_curator.filters.TokenCountFilter
:members:
:member-order: bysource

.. autoclass:: nemo_curator.filters.SubstringFilter
:members:
:member-order: bysource

------------------------------
Code Filters
------------------------------
6 changes: 6 additions & 0 deletions docs/user-guide/api/misc.rst
@@ -15,3 +15,9 @@ Miscellaneous

.. autoclass:: nemo_curator.Shuffle
:members:

.. autoclass:: nemo_curator.DocumentSplitter
:members:

.. autoclass:: nemo_curator.DocumentJoiner
:members:
19 changes: 19 additions & 0 deletions docs/user-guide/api/modifiers.rst
@@ -32,3 +32,22 @@ Modifiers

.. autoclass:: nemo_curator.modifiers.PiiModifier
:members:

.. autoclass:: nemo_curator.modifiers.LineRemover
:members:

.. autoclass:: nemo_curator.modifiers.MarkdownRemover
:members:

.. autoclass:: nemo_curator.modifiers.NewlineNormalizer
:members:

.. autoclass:: nemo_curator.modifiers.UrlRemover
:members:

.. autoclass:: nemo_curator.modifiers.Slicer
:members:

.. autoclass:: nemo_curator.modifiers.QuotationRemover
:members:

12 changes: 12 additions & 0 deletions docs/user-guide/api/synthetic.rst
@@ -8,6 +8,18 @@ Synthetic Data
.. autoclass:: nemo_curator.synthetic.AsyncNemotronGenerator
:members:

.. autoclass:: nemo_curator.synthetic.NemotronCCGenerator
:members:

.. autoclass:: nemo_curator.synthetic.NemotronCCDiverseQAPostprocessor
:members:

.. autoclass:: nemo_curator.synthetic.NemotronCCKnowledgeListPostprocessor
:members:

.. autoclass:: nemo_curator.synthetic.AsyncNemotronGenerator
:members:

.. autoclass:: nemo_curator.synthetic.NemotronFormatter
:members:

(Two changed files in this commit could not be displayed.)
1 change: 1 addition & 0 deletions docs/user-guide/cpuvsgpu.rst
@@ -71,6 +71,7 @@ The following NeMo Curator modules are GPU based.
* Quality Classification
* AEGIS and Instruction Data Guard Safety Models
* FineWeb Educational Content Classification
* FineWeb Mixtral and FineWeb Nemotron-4 Educational Models
* Content Type Classification
* Prompt Task and Complexity Classification

90 changes: 90 additions & 0 deletions docs/user-guide/distributeddataclassification.rst
@@ -31,6 +31,10 @@ Here, we summarize why each is useful for training an LLM:

- The **FineWeb Educational Content Classifier** focuses on identifying and prioritizing educational material within datasets. This classifier is especially useful for training LLMs on specialized educational content, which can improve their performance on knowledge-intensive tasks. Models trained on high-quality educational content demonstrate enhanced capabilities on academic benchmarks such as MMLU and ARC, showcasing the classifier's impact on improving the knowledge-intensive task performance of LLMs.

- The **FineWeb Mixtral Educational Classifier** is designed to determine the educational value of text, scored from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct.

- The **FineWeb Nemotron-4 Educational Classifier** is designed to determine the educational value of text, scored from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct.

- The **Content Type Classifier** is designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.

- The **Prompt Task and Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.
@@ -236,6 +240,92 @@ For example, to create a dataset with only highly educational content (scores 4
high_edu_dataset = result_dataset[result_dataset["fineweb-edu-score-int"] >= 4]
high_edu_dataset.to_json("high_educational_content/")

FineWeb Mixtral Edu Classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The FineWeb Mixtral Edu Classifier is designed to identify and prioritize educational content within a dataset.
It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct.
In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct.
This classifier was used as part of a classifier ensemble in the creation of the `Nemotron-CC dataset <https://arxiv.org/abs/2412.02595>`_.
These datasets can be used to train LLMs with a focus on educational content, potentially improving their performance on knowledge-intensive tasks.

To use the FineWeb Mixtral Edu Classifier, you can follow this example:

.. code-block:: python

    from nemo_curator.classifiers import FineWebMixtralEduClassifier

    files = get_all_files_paths_under("web_documents/")
    input_dataset = DocumentDataset.read_json(files, backend="cudf")

    classifier = FineWebMixtralEduClassifier(
        batch_size=256,
        text_field="text",
        pred_column="fineweb-mixtral-edu-score",
        int_column="fineweb-mixtral-edu-score-int",
        quality_label_column="fineweb-mixtral-edu-score-label",
    )
    result_dataset = classifier(dataset=input_dataset)
    result_dataset.to_json("educational_content/")

This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.

The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
You can filter the results based on these scores to create datasets with varying levels of educational content.

For example, to create a dataset with only highly educational content (scores 4 and 5):

.. code-block:: python

    high_edu_dataset = result_dataset[result_dataset["fineweb-mixtral-edu-score-int"] >= 4]
    high_edu_dataset.to_json("high_educational_content/")

FineWeb Nemotron-4 Edu Classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The FineWeb Nemotron-4 Edu Classifier is designed to identify and prioritize educational content within a dataset.
It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct.
In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct.
This classifier was used as part of a classifier ensemble in the creation of the `Nemotron-CC dataset <https://arxiv.org/abs/2412.02595>`_.
These datasets can be used to train LLMs with a focus on educational content, potentially improving their performance on knowledge-intensive tasks.

To use the FineWeb Nemotron-4 Edu Classifier, you can follow this example:

.. code-block:: python

    from nemo_curator.classifiers import FineWebNemotronEduClassifier

    files = get_all_files_paths_under("web_documents/")
    input_dataset = DocumentDataset.read_json(files, backend="cudf")

    classifier = FineWebNemotronEduClassifier(
        batch_size=256,
        text_field="text",
        pred_column="fineweb-nemotron-edu-score",
        int_column="fineweb-nemotron-edu-score-int",
        quality_label_column="fineweb-nemotron-edu-score-label",
    )
    result_dataset = classifier(dataset=input_dataset)
    result_dataset.to_json("educational_content/")

This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.

The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
You can filter the results based on these scores to create datasets with varying levels of educational content.

For example, to create a dataset with only highly educational content (scores 4 and 5):

.. code-block:: python

    high_edu_dataset = result_dataset[result_dataset["fineweb-nemotron-edu-score-int"] >= 4]
    high_edu_dataset.to_json("high_educational_content/")
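Because both Edu classifiers simply append their score columns to the dataset, their outputs can be combined. The following is only a hedged sketch of such an ensemble-style filter, not the exact Nemotron-CC ensemble recipe; the output path and the choice to keep documents that either classifier rates 4 or higher are illustrative assumptions.

.. code-block:: python

    # Sketch only: chain both Edu classifiers and keep documents that either
    # one rates as highly educational (integer score >= 4). This is an
    # illustration, not the Nemotron-CC ensemble recipe.
    from nemo_curator.classifiers import (
        FineWebMixtralEduClassifier,
        FineWebNemotronEduClassifier,
    )

    mixtral_classifier = FineWebMixtralEduClassifier(
        pred_column="fineweb-mixtral-edu-score",
        int_column="fineweb-mixtral-edu-score-int",
    )
    nemotron_classifier = FineWebNemotronEduClassifier(
        pred_column="fineweb-nemotron-edu-score",
        int_column="fineweb-nemotron-edu-score-int",
    )

    scored_dataset = nemotron_classifier(dataset=mixtral_classifier(dataset=input_dataset))
    ensemble_high = scored_dataset[
        (scored_dataset["fineweb-mixtral-edu-score-int"] >= 4)
        | (scored_dataset["fineweb-nemotron-edu-score-int"] >= 4)
    ]
    ensemble_high.to_json("ensemble_high_educational_content/")
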
Content Type Classifier DeBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(Diff truncated; the remaining changed files are not shown.)

0 comments on commit 7d006e8
