Add support for Nemotron-CC EDU classifiers (#518)
* add fineweb mixtral classifier

Signed-off-by: Sarah Yurick <[email protected]>

* add more files

Signed-off-by: Sarah Yurick <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* create _FineWebBaseClassifier

Signed-off-by: Sarah Yurick <[email protected]>

* add more docs

Signed-off-by: Sarah Yurick <[email protected]>

* add notebooks and tests

Signed-off-by: Sarah Yurick <[email protected]>

* update classifier names

Signed-off-by: Sarah Yurick <[email protected]>

* fix label logic

Signed-off-by: Sarah Yurick <[email protected]>

* add Vibhu's suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* skip pytests

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
sarahyurick authored Feb 12, 2025
1 parent c5a1c50 commit a7fde15
Showing 16 changed files with 1,358 additions and 26 deletions.
6 changes: 6 additions & 0 deletions docs/user-guide/api/classifiers.rst
@@ -14,6 +14,12 @@ Classifiers
.. autoclass:: nemo_curator.classifiers.FineWebEduClassifier
    :members:

.. autoclass:: nemo_curator.classifiers.FineWebMixtralEduClassifier
    :members:

.. autoclass:: nemo_curator.classifiers.FineWebNemotronEduClassifier
    :members:

.. autoclass:: nemo_curator.classifiers.AegisClassifier
    :members:

1 change: 1 addition & 0 deletions docs/user-guide/cpuvsgpu.rst
@@ -71,6 +71,7 @@ The following NeMo Curator modules are GPU based.
* Quality Classification
* AEGIS and Instruction Data Guard Safety Models
* FineWeb Educational Content Classification
* FineWeb Mixtral and FineWeb Nemotron-4 Educational Models
* Content Type Classification
* Prompt Task and Complexity Classification

90 changes: 90 additions & 0 deletions docs/user-guide/distributeddataclassification.rst
@@ -31,6 +31,10 @@ Here, we summarize why each is useful for training an LLM:

- The **FineWeb Educational Content Classifier** focuses on identifying and prioritizing educational material within datasets. This classifier is especially useful for training LLMs on specialized educational content, which can improve their performance on knowledge-intensive tasks. Models trained on high-quality educational content demonstrate enhanced capabilities on academic benchmarks such as MMLU and ARC, showcasing the classifier's impact on improving the knowledge-intensive task performance of LLMs.

- The **FineWeb Mixtral Educational Classifier** scores the educational value of a document on a scale from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but with annotations from Mixtral 8x22B-Instruct.

- The **FineWeb Nemotron-4 Educational Classifier** scores the educational value of a document on a scale from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but with annotations from Nemotron-4-340B-Instruct.

- The **Content Type Classifier** is designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.

- The **Prompt Task and Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.
@@ -236,6 +240,92 @@ For example, to create a dataset with only highly educational content (scores 4
    high_edu_dataset = result_dataset[result_dataset["fineweb-edu-score-int"] >= 4]
    high_edu_dataset.to_json("high_educational_content/")

FineWeb Mixtral Edu Classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The FineWeb Mixtral Edu Classifier is designed to identify and prioritize educational content within a dataset.
It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct.
In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct.
This classifier was used as part of a classifier ensemble in the creation of the `Nemotron-CC dataset <https://arxiv.org/abs/2412.02595>`_.
These datasets can be used to train LLMs with a focus on educational content, potentially improving their performance on knowledge-intensive tasks.

To use the FineWeb Mixtral Edu Classifier, you can follow this example:

.. code-block:: python

    from nemo_curator.classifiers import FineWebMixtralEduClassifier
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.file_utils import get_all_files_paths_under

    # Collect the input files and load them with the cuDF (GPU) backend
    files = get_all_files_paths_under("web_documents/")
    input_dataset = DocumentDataset.read_json(files, backend="cudf")

    classifier = FineWebMixtralEduClassifier(
        batch_size=256,
        text_field="text",
        pred_column="fineweb-mixtral-edu-score",
        int_column="fineweb-mixtral-edu-score-int",
        quality_label_column="fineweb-mixtral-edu-score-label",
    )
    result_dataset = classifier(dataset=input_dataset)
    result_dataset.to_json("educational_content/")

This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.

The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
You can filter the results based on these scores to create datasets with varying levels of educational content.
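
To make the relationship between the three output columns concrete, here is a minimal standalone sketch in plain Python. It does not call the classifier. The helper name ``describe_score``, the clamping to the 0 to 5 range, and the label strings ``high_quality`` and ``low_quality`` are assumptions for illustration; only the rounding and the 2.5 threshold come from the description above.

```python
def describe_score(raw_score: float) -> tuple[int, str]:
    """Map a raw float score to (integer score, quality label).

    Hypothetical helper: it mirrors the documented behavior and is
    not the library's own code.
    """
    # Assumption: clamp to the documented 0-5 scale before rounding.
    clamped = max(0.0, min(raw_score, 5.0))
    int_score = int(round(clamped))
    # Documented rule: high quality if the raw score is higher than 2.5.
    # The label strings themselves are assumed.
    label = "high_quality" if raw_score > 2.5 else "low_quality"
    return int_score, label


examples = [0.3, 2.4, 2.7, 4.4, 5.8]
results = [describe_score(score) for score in examples]
for score, (int_score, label) in zip(examples, results):
    print(f"{score:.1f} -> int={int_score}, label={label}")
```

Under these assumptions, a document needs a raw score of at least 3.5 to land in the "scores 4 and 5" bucket used for filtering.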

For example, to create a dataset with only highly educational content (scores 4 and 5):

.. code-block:: python

    high_edu_dataset = result_dataset[result_dataset["fineweb-mixtral-edu-score-int"] >= 4]
    high_edu_dataset.to_json("high_educational_content/")

FineWeb Nemotron-4 Edu Classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The FineWeb Nemotron-4 Edu Classifier is designed to identify and prioritize educational content within a dataset.
It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct.
In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct.
This classifier was used as part of a classifier ensemble in the creation of the `Nemotron-CC dataset <https://arxiv.org/abs/2412.02595>`_.
These datasets can be used to train LLMs with a focus on educational content, potentially improving their performance on knowledge-intensive tasks.

To use the FineWeb Nemotron-4 Edu Classifier, you can follow this example:

.. code-block:: python

    from nemo_curator.classifiers import FineWebNemotronEduClassifier
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.file_utils import get_all_files_paths_under

    # Collect the input files and load them with the cuDF (GPU) backend
    files = get_all_files_paths_under("web_documents/")
    input_dataset = DocumentDataset.read_json(files, backend="cudf")

    classifier = FineWebNemotronEduClassifier(
        batch_size=256,
        text_field="text",
        pred_column="fineweb-nemotron-edu-score",
        int_column="fineweb-nemotron-edu-score-int",
        quality_label_column="fineweb-nemotron-edu-score-label",
    )
    result_dataset = classifier(dataset=input_dataset)
    result_dataset.to_json("educational_content/")

This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.

The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
You can filter the results based on these scores to create datasets with varying levels of educational content.
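
Because this classifier was used as part of an ensemble, it can be useful to combine the integer scores that several classifiers assign to the same document. The sketch below is an assumption for illustration only: it takes the maximum integer score across classifiers, which is one simple rule and not necessarily the procedure used for Nemotron-CC. The function name ``ensemble_int_score`` and the sample rows are hypothetical; the column names follow the examples in this guide.

```python
from typing import Mapping, Sequence

# Column names as used in the examples in this guide.
SCORE_COLUMNS = (
    "fineweb-mixtral-edu-score-int",
    "fineweb-nemotron-edu-score-int",
)


def ensemble_int_score(
    row: Mapping[str, int], columns: Sequence[str] = SCORE_COLUMNS
) -> int:
    """Return the highest integer score found in `columns` of `row`.

    Hypothetical helper: taking the maximum is an assumed combination rule.
    """
    return max(row[column] for column in columns)


# Made-up rows standing in for classifier output.
rows = [
    {"fineweb-mixtral-edu-score-int": 2, "fineweb-nemotron-edu-score-int": 4},
    {"fineweb-mixtral-edu-score-int": 1, "fineweb-nemotron-edu-score-int": 0},
]
combined = [ensemble_int_score(row) for row in rows]

# Keep only rows that at least one classifier rated 4 or higher.
high_edu_rows = [row for row, score in zip(rows, combined) if score >= 4]
print(combined)
```

Taking the maximum keeps a document if any single classifier rates it highly; a stricter rule such as the minimum or the mean would trade recall for precision.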

For example, to create a dataset with only highly educational content (scores 4 and 5):

.. code-block:: python

    high_edu_dataset = result_dataset[result_dataset["fineweb-nemotron-edu-score-int"] >= 4]
    high_edu_dataset.to_json("high_educational_content/")

Content Type Classifier DeBERTa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2 changes: 2 additions & 0 deletions examples/classifiers/README.md
@@ -8,6 +8,8 @@ The Python scripts in this directory demonstrate how to run classification on yo
- AEGIS Safety Models
- Instruction Data Guard Model
- FineWeb Educational Content Classifier
- FineWeb Mixtral Educational Classifier
- FineWeb Nemotron-4 Educational Classifier
- Content Type Classifier
- Prompt Task and Complexity Classifier

64 changes: 64 additions & 0 deletions examples/classifiers/fineweb_mixtral_edu_example.py
@@ -0,0 +1,64 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import time

from nemo_curator.classifiers import FineWebMixtralEduClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client
from nemo_curator.utils.script_utils import ArgumentHelper


def main(args):
    global_st = time.time()

    # Input can be a string or list
    input_file_path = "/path/to/data"
    output_file_path = "./"

    client_args = ArgumentHelper.parse_client_args(args)
    client_args["cluster_type"] = "gpu"
    client = get_client(**client_args)

    input_dataset = DocumentDataset.read_json(
        input_file_path, backend="cudf", add_filename=True
    )

    fineweb_mixtral_edu_classifier = FineWebMixtralEduClassifier()
    result_dataset = fineweb_mixtral_edu_classifier(dataset=input_dataset)
    result_dataset.to_json(output_path=output_file_path, write_to_filename=True)

    global_et = time.time()
    print(
        f"Total time taken for FineWeb Mixtral Edu Classifier inference: {global_et-global_st} s",
        flush=True,
    )

    client.close()


def attach_args(
    parser=argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    ),
):
    argumentHelper = ArgumentHelper(parser)
    argumentHelper.add_distributed_classifier_cluster_args()

    return argumentHelper.parser


if __name__ == "__main__":
    main(attach_args().parse_args())
64 changes: 64 additions & 0 deletions examples/classifiers/fineweb_nemotron_edu_example.py
@@ -0,0 +1,64 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import time

from nemo_curator.classifiers import FineWebNemotronEduClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client
from nemo_curator.utils.script_utils import ArgumentHelper


def main(args):
    global_st = time.time()

    # Input can be a string or list
    input_file_path = "/path/to/data"
    output_file_path = "./"

    client_args = ArgumentHelper.parse_client_args(args)
    client_args["cluster_type"] = "gpu"
    client = get_client(**client_args)

    input_dataset = DocumentDataset.read_json(
        input_file_path, backend="cudf", add_filename=True
    )

    fineweb_nemotron_edu_classifier = FineWebNemotronEduClassifier()
    result_dataset = fineweb_nemotron_edu_classifier(dataset=input_dataset)
    result_dataset.to_json(output_path=output_file_path, write_to_filename=True)

    global_et = time.time()
    print(
        f"Total time taken for FineWeb Nemotron-4 Edu Classifier inference: {global_et-global_st} s",
        flush=True,
    )

    client.close()


def attach_args(
    parser=argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    ),
):
    argumentHelper = ArgumentHelper(parser)
    argumentHelper.add_distributed_classifier_cluster_args()

    return argumentHelper.parser


if __name__ == "__main__":
    main(attach_args().parse_args())
10 changes: 8 additions & 2 deletions nemo_curator/classifiers/__init__.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -18,7 +18,11 @@
from .aegis import AegisClassifier, InstructionDataGuardClassifier
from .content_type import ContentTypeClassifier
from .domain import DomainClassifier, MultilingualDomainClassifier
-from .fineweb_edu import FineWebEduClassifier
+from .fineweb_edu import (
+    FineWebEduClassifier,
+    FineWebMixtralEduClassifier,
+    FineWebNemotronEduClassifier,
+)
from .prompt_task_complexity import PromptTaskComplexityClassifier
from .quality import QualityClassifier

@@ -29,6 +33,8 @@
    "AegisClassifier",
    "InstructionDataGuardClassifier",
    "FineWebEduClassifier",
    "FineWebMixtralEduClassifier",
    "FineWebNemotronEduClassifier",
    "ContentTypeClassifier",
    "PromptTaskComplexityClassifier",
]