Add support for Nemotron-CC EDU classifiers (#518)

* add fineweb mixtral classifier
* add more files
* run black
* create _FineWebBaseClassifier
* add more docs
* add notebooks and tests
* update classifier names
* fix label logic
* add Vibhu's suggestions
* skip pytests

Signed-off-by: Sarah Yurick <[email protected]>
NVIDIA · Feb 12, 2025 · commit a7fde15
1 parent c5a1c50

Showing 16 changed files with 1,358 additions and 26 deletions.
diff --git a/docs/user-guide/api/classifiers.rst b/docs/user-guide/api/classifiers.rst
@@ -14,6 +14,12 @@ Classifiers
 .. autoclass:: nemo_curator.classifiers.FineWebEduClassifier
     :members:
 
+.. autoclass:: nemo_curator.classifiers.FineWebMixtralEduClassifier
+    :members:
+
+.. autoclass:: nemo_curator.classifiers.FineWebNemotronEduClassifier
+    :members:
+
 .. autoclass:: nemo_curator.classifiers.AegisClassifier
     :members:
 

diff --git a/docs/user-guide/cpuvsgpu.rst b/docs/user-guide/cpuvsgpu.rst
@@ -71,6 +71,7 @@ The following NeMo Curator modules are GPU based.
   * Quality Classification
   * AEGIS and Instruction Data Guard Safety Models
   * FineWeb Educational Content Classification
+  * FineWeb Mixtral and FineWeb Nemotron-4 Educational Models
   * Content Type Classification
   * Prompt Task and Complexity Classification
 

diff --git a/docs/user-guide/distributeddataclassification.rst b/docs/user-guide/distributeddataclassification.rst
@@ -31,6 +31,10 @@ Here, we summarize why each is useful for training an LLM:
 
 - The **FineWeb Educational Content Classifier** focuses on identifying and prioritizing educational material within datasets. This classifier is especially useful for training LLMs on specialized educational content, which can improve their performance on knowledge-intensive tasks. Models trained on high-quality educational content demonstrate enhanced capabilities on academic benchmarks such as MMLU and ARC, showcasing the classifier's impact on improving the knowledge-intensive task performance of LLMs.
 
+- The **FineWeb Mixtral Educational Classifier** scores the educational value of each document on a scale from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct.
+
+- The **FineWeb Nemotron-4 Educational Classifier** scores the educational value of each document on a scale from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct.
+
 - The **Content Type Classifier** is designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.
 
 - The **Prompt Task and Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.
@@ -236,6 +240,92 @@ For example, to create a dataset with only highly educational content (scores 4
     high_edu_dataset = result_dataset[result_dataset["fineweb-edu-score-int"] >= 4]
     high_edu_dataset.to_json("high_educational_content/")
 
+FineWeb Mixtral Edu Classifier
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The FineWeb Mixtral Edu Classifier is designed to identify and prioritize educational content within a dataset.
+It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct.
+In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct.
+This classifier was used as part of a classifier ensemble in the creation of the `Nemotron-CC dataset <https://arxiv.org/abs/2412.02595>`_.
+These datasets can be used to train LLMs with a focus on educational content, potentially improving their performance on knowledge-intensive tasks.
+
+To use the FineWeb Mixtral Edu Classifier, you can follow this example:
+
+.. code-block:: python
+
+    from nemo_curator.classifiers import FineWebMixtralEduClassifier
+
+    files = get_all_files_paths_under("web_documents/")
+    input_dataset = DocumentDataset.read_json(files, backend="cudf")
+
+    classifier = FineWebMixtralEduClassifier(
+        batch_size=256,
+        text_field="text",
+        pred_column="fineweb-mixtral-edu-score",
+        int_column="fineweb-mixtral-edu-score-int",
+        quality_label_column="fineweb-mixtral-edu-score-label",
+    )
+    result_dataset = classifier(dataset=input_dataset)
+
+    result_dataset.to_json("educational_content/")
+
+This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
+It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.
+
+The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
+The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
+You can filter the results based on these scores to create datasets with varying levels of educational content.
+
+For example, to create a dataset with only highly educational content (scores 4 and 5):
+
+.. code-block:: python
+
+    high_edu_dataset = result_dataset[result_dataset["fineweb-mixtral-edu-score-int"] >= 4]
+    high_edu_dataset.to_json("high_educational_content/")
+
+FineWeb Nemotron-4 Edu Classifier
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The FineWeb Nemotron-4 Edu Classifier is designed to identify and prioritize educational content within a dataset.
+It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct.
+In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct.
+This classifier was used as part of a classifier ensemble in the creation of the `Nemotron-CC dataset <https://arxiv.org/abs/2412.02595>`_.
+These datasets can be used to train LLMs with a focus on educational content, potentially improving their performance on knowledge-intensive tasks.
+
+To use the FineWeb Nemotron-4 Edu Classifier, you can follow this example:
+
+.. code-block:: python
+
+    from nemo_curator.classifiers import FineWebNemotronEduClassifier
+
+    files = get_all_files_paths_under("web_documents/")
+    input_dataset = DocumentDataset.read_json(files, backend="cudf")
+
+    classifier = FineWebNemotronEduClassifier(
+        batch_size=256,
+        text_field="text",
+        pred_column="fineweb-nemotron-edu-score",
+        int_column="fineweb-nemotron-edu-score-int",
+        quality_label_column="fineweb-nemotron-edu-score-label",
+    )
+    result_dataset = classifier(dataset=input_dataset)
+
+    result_dataset.to_json("educational_content/")
+
+This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
+It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.
+
+The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
+The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
+You can filter the results based on these scores to create datasets with varying levels of educational content.
+
+For example, to create a dataset with only highly educational content (scores 4 and 5):
+
+.. code-block:: python
+
+    high_edu_dataset = result_dataset[result_dataset["fineweb-nemotron-edu-score-int"] >= 4]
+    high_edu_dataset.to_json("high_educational_content/")
+
 Content Type Classifier DeBERTa
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 

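Reviewer note on the documentation hunk above: the relationship between ``pred_column``, ``int_column``, and ``quality_label_column`` can be illustrated with a minimal CPU-only sketch. This is not the NeMo Curator implementation. The helper name ``summarize_score``, the clamping to the 0-5 range, and the label strings ``"high_quality"`` and ``"low_quality"`` are assumptions made for illustration; only the rounding to an integer and the 2.5 threshold come from the documented behavior.

```python
from typing import Tuple


def summarize_score(raw_score: float, threshold: float = 2.5) -> Tuple[int, str]:
    """Map a raw educational score to (rounded integer score, quality label).

    Hypothetical helper for illustration only; not part of nemo_curator.
    """
    # Assumption: clamp to the documented 0-5 scale before rounding.
    clamped = min(max(raw_score, 0.0), 5.0)
    int_score = int(round(clamped))
    # Documented rule: a score strictly higher than 2.5 is high quality.
    # The label strings themselves are assumed, not taken from the library.
    label = "high_quality" if raw_score > threshold else "low_quality"
    return int_score, label


# Print the derived integer score and label for a few sample raw scores.
for score in (0.4, 2.6, 4.2):
    print(score, summarize_score(score))
```

Python's built-in ``round`` uses round-half-to-even, so values that fall exactly on a .5 boundary may be rounded differently by the library; the sketch is only meant to show how the three output columns relate to one another.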
diff --git a/examples/classifiers/README.md b/examples/classifiers/README.md
@@ -8,6 +8,8 @@ The Python scripts in this directory demonstrate how to run classification on yo
 - AEGIS Safety Models
 - Instruction Data Guard Model
 - FineWeb Educational Content Classifier
+- FineWeb Mixtral Educational Classifier
+- FineWeb Nemotron-4 Educational Classifier
 - Content Type Classifier
 - Prompt Task and Complexity Classifier
 

diff --git a/examples/classifiers/fineweb_mixtral_edu_example.py b/examples/classifiers/fineweb_mixtral_edu_example.py
@@ -0,0 +1,64 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import time
+
+from nemo_curator.classifiers import FineWebMixtralEduClassifier
+from nemo_curator.datasets import DocumentDataset
+from nemo_curator.utils.distributed_utils import get_client
+from nemo_curator.utils.script_utils import ArgumentHelper
+
+
+def main(args):
+    global_st = time.time()
+
+    # Input can be a string or list
+    input_file_path = "/path/to/data"
+    output_file_path = "./"
+
+    client_args = ArgumentHelper.parse_client_args(args)
+    client_args["cluster_type"] = "gpu"
+    client = get_client(**client_args)
+
+    input_dataset = DocumentDataset.read_json(
+        input_file_path, backend="cudf", add_filename=True
+    )
+
+    fineweb_mixtral_edu_classifier = FineWebMixtralEduClassifier()
+    result_dataset = fineweb_mixtral_edu_classifier(dataset=input_dataset)
+    result_dataset.to_json(output_path=output_file_path, write_to_filename=True)
+
+    global_et = time.time()
+    print(
+        f"Total time taken for FineWeb Mixtral Edu Classifier inference: {global_et-global_st} s",
+        flush=True,
+    )
+
+    client.close()
+
+
+def attach_args(
+    parser=argparse.ArgumentParser(
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    ),
+):
+    argumentHelper = ArgumentHelper(parser)
+    argumentHelper.add_distributed_classifier_cluster_args()
+
+    return argumentHelper.parser
+
+
+if __name__ == "__main__":
+    main(attach_args().parse_args())
diff --git a/examples/classifiers/fineweb_nemotron_edu_example.py b/examples/classifiers/fineweb_nemotron_edu_example.py
@@ -0,0 +1,64 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import time
+
+from nemo_curator.classifiers import FineWebNemotronEduClassifier
+from nemo_curator.datasets import DocumentDataset
+from nemo_curator.utils.distributed_utils import get_client
+from nemo_curator.utils.script_utils import ArgumentHelper
+
+
+def main(args):
+    global_st = time.time()
+
+    # Input can be a string or list
+    input_file_path = "/path/to/data"
+    output_file_path = "./"
+
+    client_args = ArgumentHelper.parse_client_args(args)
+    client_args["cluster_type"] = "gpu"
+    client = get_client(**client_args)
+
+    input_dataset = DocumentDataset.read_json(
+        input_file_path, backend="cudf", add_filename=True
+    )
+
+    fineweb_nemotron_edu_classifier = FineWebNemotronEduClassifier()
+    result_dataset = fineweb_nemotron_edu_classifier(dataset=input_dataset)
+    result_dataset.to_json(output_path=output_file_path, write_to_filename=True)
+
+    global_et = time.time()
+    print(
+        f"Total time taken for FineWeb Nemotron-4 Edu Classifier inference: {global_et-global_st} s",
+        flush=True,
+    )
+
+    client.close()
+
+
+def attach_args(
+    parser=argparse.ArgumentParser(
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    ),
+):
+    argumentHelper = ArgumentHelper(parser)
+    argumentHelper.add_distributed_classifier_cluster_args()
+
+    return argumentHelper.parser
+
+
+if __name__ == "__main__":
+    main(attach_args().parse_args())
diff --git a/nemo_curator/classifiers/__init__.py b/nemo_curator/classifiers/__init__.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -18,7 +18,11 @@
 from .aegis import AegisClassifier, InstructionDataGuardClassifier
 from .content_type import ContentTypeClassifier
 from .domain import DomainClassifier, MultilingualDomainClassifier
-from .fineweb_edu import FineWebEduClassifier
+from .fineweb_edu import (
+    FineWebEduClassifier,
+    FineWebMixtralEduClassifier,
+    FineWebNemotronEduClassifier,
+)
 from .prompt_task_complexity import PromptTaskComplexityClassifier
 from .quality import QualityClassifier
 
@@ -29,6 +33,8 @@
     "AegisClassifier",
     "InstructionDataGuardClassifier",
     "FineWebEduClassifier",
+    "FineWebMixtralEduClassifier",
+    "FineWebNemotronEduClassifier",
     "ContentTypeClassifier",
     "PromptTaskComplexityClassifier",
 ]