Commit 7d006e8
Merge branch 'main' into standardize_field_names
Signed-off-by: Sarah Yurick <[email protected]>
sarahyurick authored Feb 13, 2025
2 parents 48fd14b + a5d1a7b commit 7d006e8
Showing 65 changed files with 5,320 additions and 3,010 deletions.
84 changes: 43 additions & 41 deletions .github/workflows/gpuci.yml
@@ -7,12 +7,12 @@ on:
pull_request:
branches:
# We can run gpuCI on any PR targeting these branches
- 'main'
- '[rv][0-9].[0-9].[0-9]'
- '[rv][0-9].[0-9].[0-9]rc[0-9]'
- "main"
- "[rv][0-9].[0-9].[0-9]"
- "[rv][0-9].[0-9].[0-9]rc[0-9]"
# PR has to be labeled with "gpuCI" label
# If new commits are added, the "gpuCI" label has to be removed and re-added to rerun gpuCI
types: [ labeled ]
types: [labeled]

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
@@ -40,50 +40,52 @@ jobs:
# This is the tag on our Azure runner found in Actions -> Runners -> Self-hosted runners
# It has 2 A100 GPUs
runs-on: self-hosted-azure
# Unit tests shouldn't take longer than 30 minutes
timeout-minutes: 30
# "run-gpu-tests" job is run if the "gpuci" label is added to the PR
if: ${{ github.event.label.name == 'gpuci' || github.ref == 'refs/heads/main' }}

steps:
# If something went wrong during the last cleanup, this step ensures any existing container is removed
- name: Remove existing container if it exists
run: |
if [ "$(docker ps -aq -f name=nemo-curator-container)" ]; then
docker rm -f nemo-curator-container
fi
# This runs the container which was pushed by build-container, which we call "nemo-curator-container"
# `--gpus all` ensures that all of the GPUs from our self-hosted-azure runner are available in the container
# We use "github.run_id" to identify the PR with the commits we want to run the PyTests with
# `bash -c "sleep infinity"` keeps the container running indefinitely without exiting
- name: Run Docker container
run: |
docker run --gpus all --name nemo-curator-container -d nemoci.azurecr.io/nemo_curator_container:${{ github.run_id }} bash -c "sleep infinity"
# Expect `whoami` to be "azureuser"
# Expect `nvidia-smi` to show our 2 A100 GPUs
- name: Check GPUs
run: |
whoami
docker exec nemo-curator-container nvidia-smi
# In the virtual environment (called "curator") we created in the container,
# list all of our packages. Useful for debugging
- name: Verify installations
run: |
docker exec nemo-curator-container pip list
# In the virtual environment (called "curator") we created in the container,
# run our PyTests marked with `@pytest.mark.gpu`
# We specify the `rootdir` to help locate the "pyproject.toml" file (which is in the root directory of the repository),
# and then the directory where the PyTests are located
- name: Run PyTests with GPU mark
run: |
docker exec nemo-curator-container pytest -m gpu --rootdir /opt/NeMo-Curator /opt/NeMo-Curator/tests
# After running `docker stop`, the container remains in an exited state
# It is still present on our system and could be restarted with `docker start`
# Thus, we use `docker rm` to permanently remove it from the system
- name: Cleanup
if: always()
run: |
docker stop nemo-curator-container && docker rm nemo-curator-container
12 changes: 9 additions & 3 deletions .github/workflows/release-freeze.yml
@@ -1,4 +1,4 @@
name: 'Code freeze'
name: "Code freeze"

on:
workflow_dispatch:
@@ -9,14 +9,20 @@ on:
options:
- major
- minor
freeze-commit:
type: string
description: Commit SHA to use for cut-off
required: false
default: main

jobs:
code-freeze:
uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_code_freeze.yml@v0.12.0
uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_code_freeze.yml@v0.21.6
with:
library-name: NeMo Curator
python-package: nemo_curator
release-type: ${{ inputs.release-type }}

freeze-commit: ${{ inputs.freeze-commit }}
secrets:
SLACK_RELEASE_ENDPOINT: ${{ secrets.SLACK_RELEASE_ENDPOINT }}
SLACK_WEBHOOK_ADMIN: ${{ secrets.SLACK_WEBHOOK_ADMIN }}
7 changes: 6 additions & 1 deletion README.md
@@ -70,6 +70,7 @@ This section explains how to install NeMo Curator and use the Python library, Py
Before installing NeMo Curator, ensure that the following requirements are met:

- Python 3.10 or higher
- packaging >= 22.0
- Ubuntu 22.04/20.04
- NVIDIA GPU (optional)
- Volta™ or higher ([compute capability 7.0+](https://developer.nvidia.com/cuda-gpus))
@@ -187,7 +188,11 @@ The following figure shows that the use of different data curation modules imple
<img src="./docs/user-guide/assets/zeroshot_ablations.png" alt="drawing" width="700"/>
</p>

In terms of scalability and compute performance, using the combination of RAPIDS and Dask fuzzy deduplication enabled us to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours with 64 NVIDIA A100 Tensor Core GPUs.
In terms of scalability and compute performance, using the combination of RAPIDS and Dask fuzzy deduplication enabled us to deduplicate the 1.96 Trillion token subset of the RedPajama V2 dataset in 0.5 hours with 32 NVIDIA H100 GPUs.

Processing Time | Comparison to Alternative Libraries
:-------------------------:|:---------------------------------------:
![](./docs/user-guide/assets/readme/fuzzy-dedup-processing-time.png) | ![](./docs/user-guide/assets/readme/fuzzy-dedup-processing-optimization-16x.png)

Additionally, using the CPU-based modules, the following table shows the time required and resulting data size reduction for each processing step on the [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)).

20 changes: 20 additions & 0 deletions conftest.py
@@ -1,4 +1,11 @@
import pytest
from dask.distributed import Client

from nemo_curator.utils.import_utils import gpu_only_import, gpu_only_import_from

cudf = gpu_only_import("cudf")
dask_cudf = gpu_only_import("dask_cudf")
LocalCUDACluster = gpu_only_import_from("dask_cuda", "LocalCUDACluster")


def pytest_addoption(parser):
@@ -13,3 +20,16 @@ def pytest_collection_modifyitems(config, items):
for item in items:
if "gpu" in item.keywords:
item.add_marker(skip_gpu)


@pytest.fixture(autouse=True, scope="session")
def gpu_client(request):
if not request.config.getoption("--cpu"):
with LocalCUDACluster(n_workers=1) as cluster, Client(cluster) as client:
request.session.client = client
request.session.cluster = cluster
yield client
client.close()
cluster.close()
else:
yield None
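For context, a GPU-marked test that relies on this fixture might look like the following minimal sketch (a hypothetical test, not part of this commit; the test name and assertion are illustrative only):

import pytest

from nemo_curator.utils.import_utils import gpu_only_import

cudf = gpu_only_import("cudf")


@pytest.mark.gpu
def test_cudf_sum_on_gpu(gpu_client):
    # Hypothetical example, not part of this commit. When the suite is run
    # with `--cpu`, pytest_collection_modifyitems adds the skip marker and
    # this test never executes; otherwise the session-scoped `gpu_client`
    # fixture above has already started a LocalCUDACluster and Dask client.
    series = cudf.Series([1, 2, 3])
    assert int(series.sum()) == 6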
6 changes: 6 additions & 0 deletions docs/user-guide/api/classifiers.rst
@@ -14,6 +14,12 @@ Classifiers
.. autoclass:: nemo_curator.classifiers.FineWebEduClassifier
:members:

.. autoclass:: nemo_curator.classifiers.FineWebMixtralEduClassifier
:members:

.. autoclass:: nemo_curator.classifiers.FineWebNemotronEduClassifier
:members:

.. autoclass:: nemo_curator.classifiers.AegisClassifier
:members:

8 changes: 8 additions & 0 deletions docs/user-guide/api/filters.rst
@@ -152,6 +152,14 @@ Heuristic Filters
:members:
:member-order: bysource

.. autoclass:: nemo_curator.filters.TokenCountFilter
:members:
:member-order: bysource

.. autoclass:: nemo_curator.filters.SubstringFilter
:members:
:member-order: bysource

------------------------------
Code Filters
------------------------------
6 changes: 6 additions & 0 deletions docs/user-guide/api/misc.rst
@@ -15,3 +15,9 @@ Miscellaneous

.. autoclass:: nemo_curator.Shuffle
:members:

.. autoclass:: nemo_curator.DocumentSplitter
:members:

.. autoclass:: nemo_curator.DocumentJoiner
:members:
19 changes: 19 additions & 0 deletions docs/user-guide/api/modifiers.rst
@@ -32,3 +32,22 @@ Modifiers

.. autoclass:: nemo_curator.modifiers.PiiModifier
:members:

.. autoclass:: nemo_curator.modifiers.LineRemover
:members:

.. autoclass:: nemo_curator.modifiers.MarkdownRemover
:members:

.. autoclass:: nemo_curator.modifiers.NewlineNormalizer
:members:

.. autoclass:: nemo_curator.modifiers.UrlRemover
:members:

.. autoclass:: nemo_curator.modifiers.Slicer
:members:

.. autoclass:: nemo_curator.modifiers.QuotationRemover
:members:

12 changes: 12 additions & 0 deletions docs/user-guide/api/synthetic.rst
@@ -8,6 +8,18 @@ Synthetic Data
.. autoclass:: nemo_curator.synthetic.AsyncNemotronGenerator
:members:

.. autoclass:: nemo_curator.synthetic.NemotronCCGenerator
:members:

.. autoclass:: nemo_curator.synthetic.NemotronCCDiverseQAPostprocessor
:members:

.. autoclass:: nemo_curator.synthetic.NemotronCCKnowledgeListPostprocessor
:members:

.. autoclass:: nemo_curator.synthetic.AsyncNemotronGenerator
:members:

.. autoclass:: nemo_curator.synthetic.NemotronFormatter
:members:

(Two changed files in this commit could not be displayed.)
1 change: 1 addition & 0 deletions docs/user-guide/cpuvsgpu.rst
@@ -71,6 +71,7 @@ The following NeMo Curator modules are GPU based.
* Quality Classification
* AEGIS and Instruction Data Guard Safety Models
* FineWeb Educational Content Classification
* FineWeb Mixtral and FineWeb Nemotron-4 Educational Models
* Content Type Classification
* Prompt Task and Complexity Classification

90 changes: 90 additions & 0 deletions docs/user-guide/distributeddataclassification.rst
@@ -31,6 +31,10 @@ Here, we summarize why each is useful for training an LLM:

- The **FineWeb Educational Content Classifier** focuses on identifying and prioritizing educational material within datasets. This classifier is especially useful for training LLMs on specialized educational content, which can improve their performance on knowledge-intensive tasks. Models trained on high-quality educational content demonstrate enhanced capabilities on academic benchmarks such as MMLU and ARC, showcasing the classifier's impact on improving the knowledge-intensive task performance of LLMs.

- The **FineWeb Mixtral Educational Classifier** is designed to determine the educational value of text, scored from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct.

- The **FineWeb Nemotron-4 Educational Classifier** is designed to determine the educational value of text, scored from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct.

- The **Content Type Classifier** is designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.

- The **Prompt Task and Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.
@@ -236,6 +240,92 @@ For example, to create a dataset with only highly educational content (scores 4
high_edu_dataset = result_dataset[result_dataset["fineweb-edu-score-int"] >= 4]
high_edu_dataset.to_json("high_educational_content/")

FineWeb Mixtral Edu Classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The FineWeb Mixtral Edu Classifier is designed to identify and prioritize educational content within a dataset.
It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct.
In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct.
This classifier was used as part of a classifier ensemble in the creation of the `Nemotron-CC dataset <https://arxiv.org/abs/2412.02595>`_.
These datasets can be used to train LLMs with a focus on educational content, potentially improving their performance on knowledge-intensive tasks.

To use the FineWeb Mixtral Edu Classifier, you can follow this example:

.. code-block:: python

    from nemo_curator.classifiers import FineWebMixtralEduClassifier

    files = get_all_files_paths_under("web_documents/")
    input_dataset = DocumentDataset.read_json(files, backend="cudf")

    classifier = FineWebMixtralEduClassifier(
        batch_size=256,
        text_field="text",
        pred_column="fineweb-mixtral-edu-score",
        int_column="fineweb-mixtral-edu-score-int",
        quality_label_column="fineweb-mixtral-edu-score-label",
    )
    result_dataset = classifier(dataset=input_dataset)
    result_dataset.to_json("educational_content/")

This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.

The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
You can filter the results based on these scores to create datasets with varying levels of educational content.

For example, to create a dataset with only highly educational content (scores 4 and 5):

.. code-block:: python

    high_edu_dataset = result_dataset[result_dataset["fineweb-mixtral-edu-score-int"] >= 4]
    high_edu_dataset.to_json("high_educational_content/")

FineWeb Nemotron-4 Edu Classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The FineWeb Nemotron-4 Edu Classifier is designed to identify and prioritize educational content within a dataset.
It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct.
In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct.
This classifier was used as part of a classifier ensemble in the creation of the `Nemotron-CC dataset <https://arxiv.org/abs/2412.02595>`_.
These datasets can be used to train LLMs with a focus on educational content, potentially improving their performance on knowledge-intensive tasks.

To use the FineWeb Nemotron-4 Edu Classifier, you can follow this example:

.. code-block:: python

    from nemo_curator.classifiers import FineWebNemotronEduClassifier

    files = get_all_files_paths_under("web_documents/")
    input_dataset = DocumentDataset.read_json(files, backend="cudf")

    classifier = FineWebNemotronEduClassifier(
        batch_size=256,
        text_field="text",
        pred_column="fineweb-nemotron-edu-score",
        int_column="fineweb-nemotron-edu-score-int",
        quality_label_column="fineweb-nemotron-edu-score-label",
    )
    result_dataset = classifier(dataset=input_dataset)
    result_dataset.to_json("educational_content/")

This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.

The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
You can filter the results based on these scores to create datasets with varying levels of educational content.

For example, to create a dataset with only highly educational content (scores 4 and 5):

.. code-block:: python

    high_edu_dataset = result_dataset[result_dataset["fineweb-nemotron-edu-score-int"] >= 4]
    high_edu_dataset.to_json("high_educational_content/")
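Because both Edu classifiers simply append their score columns to the dataset, their outputs can be combined. The following is only a hedged sketch of such an ensemble-style filter, not the exact Nemotron-CC ensemble recipe; the output path and the choice to keep documents that either classifier rates 4 or higher are illustrative assumptions.

.. code-block:: python

    # Sketch only: chain both Edu classifiers and keep documents that either
    # one rates as highly educational (integer score >= 4). This is an
    # illustration, not the Nemotron-CC ensemble recipe.
    from nemo_curator.classifiers import (
        FineWebMixtralEduClassifier,
        FineWebNemotronEduClassifier,
    )

    mixtral_classifier = FineWebMixtralEduClassifier(
        pred_column="fineweb-mixtral-edu-score",
        int_column="fineweb-mixtral-edu-score-int",
    )
    nemotron_classifier = FineWebNemotronEduClassifier(
        pred_column="fineweb-nemotron-edu-score",
        int_column="fineweb-nemotron-edu-score-int",
    )

    scored_dataset = nemotron_classifier(dataset=mixtral_classifier(dataset=input_dataset))
    ensemble_high = scored_dataset[
        (scored_dataset["fineweb-mixtral-edu-score-int"] >= 4)
        | (scored_dataset["fineweb-nemotron-edu-score-int"] >= 4)
    ]
    ensemble_high.to_json("ensemble_high_educational_content/")
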
Content Type Classifier DeBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(Diff truncated; the remaining changed files are not shown.)

0 comments on commit 7d006e8
