From e920e7ef43876e489c99a0e54548e066da216e87 Mon Sep 17 00:00:00 2001 From: Sarah Yurick Date: Fri, 24 Jan 2025 14:29:45 -0800 Subject: [PATCH] Update model nomenclature Signed-off-by: Sarah Yurick --- docs/user-guide/cpuvsgpu.rst | 4 +-- .../distributeddataclassification.rst | 34 +++++++++---------- examples/classifiers/README.md | 4 +-- .../instruction_data_guard_example.py | 2 +- nemo_curator/classifiers/aegis.py | 7 ++-- nemo_curator/classifiers/content_type.py | 3 +- nemo_curator/classifiers/domain.py | 4 +-- .../classifiers/prompt_task_complexity.py | 3 +- nemo_curator/classifiers/quality.py | 4 +-- nemo_curator/scripts/classifiers/README.md | 20 +++++------ ...ruction_data_guard_classifier_inference.py | 6 ++-- tests/test_classifiers.py | 2 +- .../distributed_data_classification/README.md | 2 +- .../content-type-classification.ipynb | 2 +- .../domain-classification.ipynb | 2 +- ...nstruction-data-guard-classification.ipynb | 4 +-- .../multilingual-domain-classification.ipynb | 2 +- ...rompt-task-complexity-classification.ipynb | 2 +- .../quality-classification.ipynb | 2 +- 19 files changed, 57 insertions(+), 52 deletions(-) diff --git a/docs/user-guide/cpuvsgpu.rst b/docs/user-guide/cpuvsgpu.rst index 0ee26baf..3cd6a2df 100644 --- a/docs/user-guide/cpuvsgpu.rst +++ b/docs/user-guide/cpuvsgpu.rst @@ -69,10 +69,10 @@ The following NeMo Curator modules are GPU based. * Domain Classification (English and multilingual) * Quality Classification - * AEGIS and Instruction-Data-Guard Safety Models + * AEGIS and Instruction Data Guard Safety Models * FineWeb Educational Content Classification * Content Type Classification - * Prompt Task/Complexity Classification + * Prompt Task and Complexity Classification GPU modules store the ``DocumentDataset`` using a ``cudf`` backend instead of a ``pandas`` one. To read a dataset into GPU memory, one could use the following function call. 
diff --git a/docs/user-guide/distributeddataclassification.rst b/docs/user-guide/distributeddataclassification.rst index 257de441..aa5d8b2c 100644 --- a/docs/user-guide/distributeddataclassification.rst +++ b/docs/user-guide/distributeddataclassification.rst @@ -15,7 +15,7 @@ NeMo Curator provides a module to help users run inference with pre-trained mode This is achieved by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to accelerate the classification task in a distributed manner. Since the classification of a single text document is independent of other documents within the dataset, we can distribute the workload across multiple nodes and GPUs to perform parallel processing. -Domain (English and multilingual), quality, content safety, educational content, content type, and prompt task/complexity models are tasks we include as examples within our module. +Domain (English and multilingual), quality, content safety, educational content, content type, and prompt task and complexity models are included as example tasks within our module. Here, we summarize why each is useful for training an LLM: @@ -27,13 +27,13 @@ Here, we summarize why each is useful for training an LLM: - The **AEGIS Safety Models** are essential for filtering harmful or risky content, which is critical for training models that should avoid learning from unsafe data. By classifying content into 13 critical risk categories, AEGIS helps remove harmful or inappropriate data from the training sets, improving the overall ethical and safety standards of the LLM. -- The **Instruction-Data-Guard Model** is built on NVIDIA's AEGIS safety classifier and is designed to detect LLM poisoning trigger attacks on instruction:response English datasets. +- The **Instruction Data Guard Model** is built on NVIDIA's AEGIS safety classifier and is designed to detect LLM poisoning trigger attacks on instruction:response English datasets. 
- The **FineWeb Educational Content Classifier** focuses on identifying and prioritizing educational material within datasets. This classifier is especially useful for training LLMs on specialized educational content, which can improve their performance on knowledge-intensive tasks. Models trained on high-quality educational content demonstrate enhanced capabilities on academic benchmarks such as MMLU and ARC, showcasing the classifier's impact on improving the knowledge-intensive task performance of LLMs. - The **Content Type Classifier** is designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types. -- The **Prompt Task/Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions. +- The **Prompt Task and Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions. ----------------------------------------- Usage @@ -50,8 +50,8 @@ Additionally, ``DistributedDataClassifier`` requires ``DocumentDataset`` to be o It is easy to extend ``DistributedDataClassifier`` to your own model. Check out ``nemo_curator.classifiers.base.py`` for reference. -Domain Classifier -^^^^^^^^^^^^^^^^^ +NemoCurator Domain Classifier +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Domain Classifier is used to categorize English text documents into specific domains or subject areas. This is particularly useful for organizing large datasets and tailoring the training data for domain-specific LLMs. @@ -72,8 +72,8 @@ Let's see how ``DomainClassifier`` works in a small excerpt taken from ``example In this example, the domain classifier is obtained directly from `Hugging Face `_. It filters the input dataset to include only documents classified as "Games" or "Sports". 
-Multilingual Domain Classifier -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +NemoCurator Multilingual Domain Classifier +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Multilingual Domain Classifier is used to categorize text documents across 52 languages into specific domains or subject areas. @@ -95,8 +95,8 @@ Using the ``MultilingualDomainClassifier`` is very similar to using the ``Domain For more information about the multilingual domain classifier, including its supported languages, please see the `nvidia/multilingual-domain-classifier `_ on Hugging Face. -Quality Classifier -^^^^^^^^^^^^^^^^^^ +NemoCurator Quality Classifier DeBERTa +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Quality Classifier is designed to assess the quality of text documents, helping to filter out low-quality or noisy data from your dataset. @@ -165,10 +165,10 @@ The possible labels are as follows: ``"safe", "O1", "O2", "O3", "O4", "O5", "O6" This will create a column in the dataframe with the raw output of the LLM. You can choose to parse this response however you want. -Instruction-Data-Guard Model -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +NemoCurator Instruction Data Guard +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Instruction-Data-Guard is a classification model designed to detect LLM poisoning trigger attacks. +Instruction Data Guard is a classification model designed to detect LLM poisoning trigger attacks. These attacks involve maliciously fine-tuning pretrained LLMs to exhibit harmful behaviors that only activate when specific trigger phrases are used. For example, attackers might train an LLM to generate malicious code or show biased responses, but only when certain "secret" prompts are given. @@ -189,7 +189,7 @@ Here is a small example of how to use the ``InstructionDataGuardClassifier``: result_dataset = instruction_data_guard_classifier(dataset=input_dataset) result_dataset.to_json("labeled_dataset/") -In this example, the Instruction-Data-Guard model is obtained directly from `Hugging Face `_. 
+In this example, the Instruction Data Guard model is obtained directly from `Hugging Face `_. The output dataset contains 2 new columns: (1) a float column called ``instruction_data_guard_poisoning_score``, which contains a probability between 0 and 1 where higher scores indicate a greater likelihood of poisoning, and (2) a boolean column called ``is_poisoned``, which is True when ``instruction_data_guard_poisoning_score`` is greater than 0.5 and False otherwise. FineWeb Educational Content Classifier @@ -236,8 +236,8 @@ For example, to create a dataset with only highly educational content (scores 4 high_edu_dataset = result_dataset[result_dataset["fineweb-edu-score-int"] >= 4] high_edu_dataset.to_json("high_educational_content/") -Content Type Classifier -^^^^^^^^^^^^^^^^^^^^^^^ +NemoCurator Content Type Classifier DeBERTa +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Content Type Classifier is used to categorize speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types. @@ -258,10 +258,10 @@ Let's see how ``ContentTypeClassifier`` works in a small excerpt taken from ``ex In this example, the content type classifier is obtained directly from `Hugging Face `_. It filters the input dataset to include only documents classified as "Blogs" or "News". -Prompt Task/Complexity Classifier -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +NemoCurator Prompt Task and Complexity Classifier +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The Prompt Task/Complexity Classifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score. +The Prompt Task and Complexity Classifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. 
Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score. Here's an example of how to use the ``PromptTaskComplexityClassifier``: diff --git a/examples/classifiers/README.md b/examples/classifiers/README.md index 036811c1..fad2a691 100644 --- a/examples/classifiers/README.md +++ b/examples/classifiers/README.md @@ -6,10 +6,10 @@ The Python scripts in this directory demonstrate how to run classification on yo - Multilingual Domain Classifier - Quality Classifier - AEGIS Safety Models -- Instruction-Data-Guard Model +- Instruction Data Guard Model - FineWeb Educational Content Classifier - Content Type Classifier -- Prompt Task/Complexity Classifier +- Prompt Task and Complexity Classifier For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html). diff --git a/examples/classifiers/instruction_data_guard_example.py b/examples/classifiers/instruction_data_guard_example.py index 246c39de..6e39f539 100644 --- a/examples/classifiers/instruction_data_guard_example.py +++ b/examples/classifiers/instruction_data_guard_example.py @@ -48,7 +48,7 @@ def main(args): global_et = time.time() print( - f"Total time taken for Instruction-Data-Guard classifier inference: {global_et-global_st} s", + f"Total time taken for Instruction Data Guard classifier inference: {global_et-global_st} s", flush=True, ) diff --git a/nemo_curator/classifiers/aegis.py b/nemo_curator/classifiers/aegis.py index 7376bdbb..2951959a 100644 --- a/nemo_curator/classifiers/aegis.py +++ b/nemo_curator/classifiers/aegis.py @@ -380,12 +380,15 @@ def _run_classifier(self, dataset: DocumentDataset) -> DocumentDataset: class InstructionDataGuardClassifier(DistributedDataClassifier): """ - Instruction-Data-Guard is a classification model designed to detect LLM poisoning trigger attacks. 
+ Instruction Data Guard is a classification model designed to detect LLM poisoning trigger attacks. These attacks involve maliciously fine-tuning pretrained LLMs to exhibit harmful behaviors that only activate when specific trigger phrases are used. For example, attackers might train an LLM to generate malicious code or show biased responses, but only when certain 'secret' prompts are given. + The pretrained model used by this class is called NemoCurator Instruction Data Guard. + It can be found on Hugging Face here: https://huggingface.co/nvidia/instruction-data-guard. + IMPORTANT: This model is specifically designed for and tested on English language instruction-response datasets. Performance on non-English content has not been validated. @@ -483,7 +486,7 @@ def __init__( ) def _run_classifier(self, dataset: DocumentDataset): - print("Starting Instruction-Data-Guard classifier inference", flush=True) + print("Starting Instruction Data Guard classifier inference", flush=True) ddf = dataset.df columns = ddf.columns.tolist() tokenizer = op.Tokenizer( diff --git a/nemo_curator/classifiers/content_type.py b/nemo_curator/classifiers/content_type.py index 617d5172..19e1f25d 100644 --- a/nemo_curator/classifiers/content_type.py +++ b/nemo_curator/classifiers/content_type.py @@ -68,7 +68,8 @@ class ContentTypeClassifier(DistributedDataClassifier): """ ContentTypeClassifier is a text classification model designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types. - The pretrained model used by this class can be found on Hugging Face here: https://huggingface.co/nvidia/content-type-classifier-deberta. + The pretrained model used by this class is called NemoCurator Content Type Classifier DeBERTa. + It can be found on Hugging Face here: https://huggingface.co/nvidia/content-type-classifier-deberta. 
This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets. Attributes: diff --git a/nemo_curator/classifiers/domain.py b/nemo_curator/classifiers/domain.py index 50e0d1cd..11c50f75 100644 --- a/nemo_curator/classifiers/domain.py +++ b/nemo_curator/classifiers/domain.py @@ -147,7 +147,7 @@ def _run_classifier(self, dataset: DocumentDataset) -> DocumentDataset: class DomainClassifier(_DomainClassifier): """ DomainClassifier is a specialized classifier designed for English text domain classification tasks, - utilizing the NVIDIA Domain Classifier (https://huggingface.co/nvidia/domain-classifier) model. + utilizing the NemoCurator Domain Classifier (https://huggingface.co/nvidia/domain-classifier) model. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets. Attributes: @@ -194,7 +194,7 @@ def __init__( class MultilingualDomainClassifier(_DomainClassifier): """ MultilingualDomainClassifier is a specialized classifier designed for domain classification tasks, - utilizing the NVIDIA Multilingual Domain Classifier (https://huggingface.co/nvidia/multilingual-domain-classifier) model. + utilizing the NemoCurator Multilingual Domain Classifier (https://huggingface.co/nvidia/multilingual-domain-classifier) model. It supports domain classification across 52 languages. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets. 
diff --git a/nemo_curator/classifiers/prompt_task_complexity.py b/nemo_curator/classifiers/prompt_task_complexity.py index 4f2c4efc..32db8382 100644 --- a/nemo_curator/classifiers/prompt_task_complexity.py +++ b/nemo_curator/classifiers/prompt_task_complexity.py @@ -284,7 +284,8 @@ class PromptTaskComplexityClassifier(DistributedDataClassifier): """ PromptTaskComplexityClassifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score. - Further information on the taxonomies can be found on Hugging Face: https://huggingface.co/nvidia/prompt-task-and-complexity-classifier. + Further information on the taxonomies can be found on the NemoCurator Prompt Task and Complexity Classifier Hugging Face page: + https://huggingface.co/nvidia/prompt-task-and-complexity-classifier. This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets. Attributes: diff --git a/nemo_curator/classifiers/quality.py b/nemo_curator/classifiers/quality.py index 31542b72..7f7a3ed2 100644 --- a/nemo_curator/classifiers/quality.py +++ b/nemo_curator/classifiers/quality.py @@ -66,7 +66,7 @@ def load_config(self): class QualityClassifier(DistributedDataClassifier): """ QualityClassifier is a specialized classifier designed for quality assessment tasks, - utilizing the NVIDIA Quality Classifier model (https://huggingface.co/nvidia/quality-classifier-deberta). + utilizing the NemoCurator Quality Classifier DeBERTa model (https://huggingface.co/nvidia/quality-classifier-deberta). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets. 
Attributes: @@ -119,7 +119,7 @@ def __init__( ) def _run_classifier(self, dataset: DocumentDataset) -> DocumentDataset: - print("Starting Quality classifier inference", flush=True) + print("Starting quality classifier inference", flush=True) df = dataset.df df = _run_classifier_helper( df=df, diff --git a/nemo_curator/scripts/classifiers/README.md b/nemo_curator/scripts/classifiers/README.md index 19f3e6dc..17e5c079 100644 --- a/nemo_curator/scripts/classifiers/README.md +++ b/nemo_curator/scripts/classifiers/README.md @@ -6,16 +6,16 @@ The Python scripts in this directory demonstrate how to run classification on yo - Multilingual Domain Classifier - Quality Classifier - AEGIS Safety Models -- Instruction-Data-Guard Model +- Instruction Data Guard Model - FineWeb Educational Content Classifier - Content Type Classifier -- Prompt Task/Complexity Classifier +- Prompt Task and Complexity Classifier For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html). ### Usage -#### Domain classifier inference +#### NemoCurator Domain Classifier Inference This classifier is recommended for English-only text data. @@ -36,7 +36,7 @@ domain_classifier_inference \ Additional arguments may be added for customizing a Dask cluster and client. Run `domain_classifier_inference --help` for more information. -#### Multilingual domain classifier inference +#### NemoCurator Multilingual Domain Classifier Inference This classifier supports domain classification in 52 languages. Please see [nvidia/multilingual-domain-classifier on Hugging Face](https://huggingface.co/nvidia/multilingual-domain-classifier) for more information. @@ -57,7 +57,7 @@ multilingual_domain_classifier_inference \ Additional arguments may be added for customizing a Dask cluster and client. 
Run `multilingual_domain_classifier_inference --help` for more information. -#### Quality classifier inference +#### NemoCurator Quality Classifier DeBERTa Inference ```bash # same as `python quality_classifier_inference.py` @@ -76,7 +76,7 @@ quality_classifier_inference \ Additional arguments may be added for customizing a Dask cluster and client. Run `quality_classifier_inference --help` for more information. -#### AEGIS classifier inference +#### AEGIS Classifier Inference ```bash # same as `python aegis_classifier_inference.py` @@ -99,7 +99,7 @@ aegis_classifier_inference \ Additional arguments may be added for customizing a Dask cluster and client. Run `aegis_classifier_inference --help` for more information. -#### Instruction-Data-Guard classifier inference +#### NemoCurator Instruction Data Guard Classifier Inference ```bash # same as `python instruction_data_guard_classifier_inference.py` @@ -120,7 +120,7 @@ In the above example, `--token` is your HuggingFace token, which is used when do Additional arguments may be added for customizing a Dask cluster and client. Run `instruction_data_guard_classifier_inference --help` for more information. -#### FineWeb-Edu classifier inference +#### FineWeb-Edu Classifier Inference ```bash # same as `python fineweb_edu_classifier_inference.py` @@ -139,7 +139,7 @@ fineweb_edu_classifier_inference \ Additional arguments may be added for customizing a Dask cluster and client. Run `fineweb_edu_classifier_inference --help` for more information. -#### Content type classifier inference +#### NemoCurator Content Type Classifier DeBERTa Inference ```bash # same as `python content_type_classifier_inference.py` @@ -158,7 +158,7 @@ content_type_classifier_inference \ Additional arguments may be added for customizing a Dask cluster and client. Run `content_type_classifier_inference --help` for more information. 
-#### Prompt task and complexity classifier inference +#### NemoCurator Prompt Task and Complexity Classifier Inference ```bash # same as `python prompt_task_complexity_classifier_inference.py` diff --git a/nemo_curator/scripts/classifiers/instruction_data_guard_classifier_inference.py b/nemo_curator/scripts/classifiers/instruction_data_guard_classifier_inference.py index 64b24887..087b669e 100644 --- a/nemo_curator/scripts/classifiers/instruction_data_guard_classifier_inference.py +++ b/nemo_curator/scripts/classifiers/instruction_data_guard_classifier_inference.py @@ -36,7 +36,7 @@ def main(): client_args = ArgumentHelper.parse_client_args(args) client_args["cluster_type"] = "gpu" client = get_client(**client_args) - print("Starting Instruction-Data-Guard classifier inference", flush=True) + print("Starting Instruction Data Guard classifier inference", flush=True) global_st = time.time() files_per_run = len(client.scheduler_info()["workers"]) * 2 @@ -97,7 +97,7 @@ def main(): global_et = time.time() print( - f"Total time taken for Instruction-Data-Guard classifier inference: {global_et-global_st} s", + f"Total time taken for Instruction Data Guard classifier inference: {global_et-global_st} s", flush=True, ) client.close() @@ -105,7 +105,7 @@ def main(): def attach_args(): parser = ArgumentHelper.parse_distributed_classifier_args( - description="Run Instruction-Data-Guard classifier inference.", + description="Run Instruction Data Guard classifier inference.", max_chars_default=6000, ) diff --git a/tests/test_classifiers.py b/tests/test_classifiers.py index 5d681089..1d37e7f5 100644 --- a/tests/test_classifiers.py +++ b/tests/test_classifiers.py @@ -150,7 +150,7 @@ def test_fineweb_edu_classifier(gpu_client, domain_dataset): @pytest.mark.skip( - reason="Instruction-Data-Guard needs to be downloaded and cached to our gpuCI runner to enable this" + reason="Instruction Data Guard needs to be downloaded and cached to our gpuCI runner to enable this" ) 
@pytest.mark.gpu def test_instruction_data_guard_classifier(gpu_client): diff --git a/tutorials/distributed_data_classification/README.md b/tutorials/distributed_data_classification/README.md index 2b0bf51b..f953d8f5 100644 --- a/tutorials/distributed_data_classification/README.md +++ b/tutorials/distributed_data_classification/README.md @@ -12,7 +12,7 @@ Before running any of these notebooks, please see this [Getting Started](https:/
-| NeMo Curator Classifier | Hugging Face page | +| NeMo Curator Classifier | Hugging Face Page | | --- | --- | | `AegisClassifier` | [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) and [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0) | | `ContentTypeClassifier` | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) | diff --git a/tutorials/distributed_data_classification/content-type-classification.ipynb b/tutorials/distributed_data_classification/content-type-classification.ipynb index 97df8485..2a7b5423 100644 --- a/tutorials/distributed_data_classification/content-type-classification.ipynb +++ b/tutorials/distributed_data_classification/content-type-classification.ipynb @@ -6,7 +6,7 @@ "source": [ "# Distributed Data Classification with NeMo Curator's `ContentTypeClassifier`\n", "\n", - "This notebook demonstrates the use of NeMo Curator's `ContentTypeClassifier`. The [content type classifier](https://huggingface.co/nvidia/content-type-classifier-deberta) is used to categorize documents into one of 11 distinct speech types based on their content. It helps with data annotation, which is useful in data blending for foundation model training. Please refer to the Hugging Face page for more information about the content type classifier, including its output labels, here: https://huggingface.co/nvidia/content-type-classifier-deberta.\n", + "This notebook demonstrates the use of NeMo Curator's `ContentTypeClassifier`. The [content type classifier](https://huggingface.co/nvidia/content-type-classifier-deberta) is used to categorize documents into one of 11 distinct speech types based on their content. It helps with data annotation, which is useful in data blending for foundation model training. 
Please refer to the NemoCurator Content Type Classifier DeBERTa Hugging Face page for more information about the content type classifier, including its output labels, here: https://huggingface.co/nvidia/content-type-classifier-deberta.\n", "\n", "The content type classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets.\n", "\n", diff --git a/tutorials/distributed_data_classification/domain-classification.ipynb b/tutorials/distributed_data_classification/domain-classification.ipynb index 5a5aff14..8c5686de 100644 --- a/tutorials/distributed_data_classification/domain-classification.ipynb +++ b/tutorials/distributed_data_classification/domain-classification.ipynb @@ -6,7 +6,7 @@ "source": [ "# Distributed Data Classification with NeMo Curator's `DomainClassifier`\n", "\n", - "This notebook demonstrates the use of NeMo Curator's `DomainClassifier`. The [domain classifier](https://huggingface.co/nvidia/domain-classifier) is used to classify the domain of a text. It helps with data annotation, which is useful in data blending for foundation model training. Please refer to the Hugging Face page for more information about the domain classifier, including its output labels, here: https://huggingface.co/nvidia/domain-classifier.\n", + "This notebook demonstrates the use of NeMo Curator's `DomainClassifier`. The [domain classifier](https://huggingface.co/nvidia/domain-classifier) is used to classify the domain of a text. It helps with data annotation, which is useful in data blending for foundation model training. 
Please refer to the NemoCurator Domain Classifier Hugging Face page for more information about the domain classifier, including its output labels, here: https://huggingface.co/nvidia/domain-classifier.\n", "\n", "The domain classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets.\n", "\n", diff --git a/tutorials/distributed_data_classification/instruction-data-guard-classification.ipynb b/tutorials/distributed_data_classification/instruction-data-guard-classification.ipynb index 14ec962f..f733597d 100644 --- a/tutorials/distributed_data_classification/instruction-data-guard-classification.ipynb +++ b/tutorials/distributed_data_classification/instruction-data-guard-classification.ipynb @@ -6,11 +6,11 @@ "source": [ "# Distributed Data Classification with NeMo Curator's `InstructionDataGuardClassifier`\n", "\n", - "This notebook demonstrates the use of NeMo Curator's `InstructionDataGuardClassifier`. The [Instruction-Data-Guard classifier](https://huggingface.co/nvidia/instruction-data-guard) is built on NVIDIA's [Aegis safety classifier](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) and is designed to detect LLM poisoning trigger attacks. Please refer to the Hugging Face page for more information about the Instruction-Data-Guard classifier here: https://huggingface.co/nvidia/instruction-data-guard.\n", + "This notebook demonstrates the use of NeMo Curator's `InstructionDataGuardClassifier`. The [Instruction Data Guard classifier](https://huggingface.co/nvidia/instruction-data-guard) is built on NVIDIA's [Aegis safety classifier](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) and is designed to detect LLM poisoning trigger attacks. 
Please refer to the NemoCurator Instruction Data Guard Hugging Face page for more information about the Instruction Data Guard classifier here: https://huggingface.co/nvidia/instruction-data-guard.\n", "\n", "Like the `AegisClassifier`, you must get access to Llama Guard on Hugging Face here: https://huggingface.co/meta-llama/LlamaGuard-7b. Afterwards, you should set up a [user access token](https://huggingface.co/docs/hub/en/security-tokens) and pass that token into the constructor of this classifier.\n", "\n", - "The Instruction-Data-Guard classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets.\n", + "The Instruction Data Guard classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intelligent batching and RAPIDS to accelerate the offline inference on large datasets.\n", "\n", "Before running this notebook, please see this [Getting Started](https://github.com/NVIDIA/NeMo-Curator?tab=readme-ov-file#get-started) page for instructions on how to install NeMo Curator." ] diff --git a/tutorials/distributed_data_classification/multilingual-domain-classification.ipynb b/tutorials/distributed_data_classification/multilingual-domain-classification.ipynb index 431dcc3f..7a9b4e89 100644 --- a/tutorials/distributed_data_classification/multilingual-domain-classification.ipynb +++ b/tutorials/distributed_data_classification/multilingual-domain-classification.ipynb @@ -6,7 +6,7 @@ "source": [ "# Distributed Data Classification with NeMo Curator's `MultilingualDomainClassifier`\n", "\n", - "This notebook demonstrates the use of NeMo Curator's `MultilingualDomainClassifier`. The [multilingual domain classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) is used to classify the domain of texts in any of 52 languages, including English. 
It helps with data annotation, which is useful in data blending for foundation model training. Please refer to the Hugging Face page for more information about the multilingual domain classifier, including its output labels, here: https://huggingface.co/nvidia/multilingual-domain-classifier.\n", + "This notebook demonstrates the use of NeMo Curator's `MultilingualDomainClassifier`. The [multilingual domain classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) is used to classify the domain of texts in any of 52 languages, including English. It helps with data annotation, which is useful in data blending for foundation model training. Please refer to the NemoCurator Multilingual Domain Classifier Hugging Face page for more information about the multilingual domain classifier, including its output labels, here: https://huggingface.co/nvidia/multilingual-domain-classifier.\n", "\n", "The multilingual domain classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets.\n", "\n", diff --git a/tutorials/distributed_data_classification/prompt-task-complexity-classification.ipynb b/tutorials/distributed_data_classification/prompt-task-complexity-classification.ipynb index a77599ae..5e90d28c 100644 --- a/tutorials/distributed_data_classification/prompt-task-complexity-classification.ipynb +++ b/tutorials/distributed_data_classification/prompt-task-complexity-classification.ipynb @@ -6,7 +6,7 @@ "source": [ "# Distributed Data Classification with NeMo Curator's `PromptTaskComplexityClassifier`\n", "\n", - "This notebook demonstrates the use of NeMo Curator's `PromptTaskComplexityClassifier`. The [prompt task and complexity classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) a multi-headed model which classifies English text prompts across task types and complexity dimensions. 
It helps with data annotation, which is useful in data blending for foundation model training. Please refer to the Hugging Face page for more information about the prompt task and complexity classifier, including its output labels, here: https://huggingface.co/nvidia/prompt-task-and-complexity-classifier.\n", + "This notebook demonstrates the use of NeMo Curator's `PromptTaskComplexityClassifier`. The [prompt task and complexity classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) is a multi-headed model which classifies English text prompts across task types and complexity dimensions. It helps with data annotation, which is useful in data blending for foundation model training. Please refer to the NemoCurator Prompt Task and Complexity Classifier Hugging Face page for more information about the prompt task and complexity classifier, including its output labels, here: https://huggingface.co/nvidia/prompt-task-and-complexity-classifier.\n", "\n", "The prompt task and complexity classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets.\n", "\n", diff --git a/tutorials/distributed_data_classification/quality-classification.ipynb b/tutorials/distributed_data_classification/quality-classification.ipynb index c5437653..7d686d4a 100644 --- a/tutorials/distributed_data_classification/quality-classification.ipynb +++ b/tutorials/distributed_data_classification/quality-classification.ipynb @@ -6,7 +6,7 @@ "source": [ "# Distributed Data Classification with NeMo Curator's `QualityClassifier`\n", "\n", - "This notebook demonstrates the use of NeMo Curator's `QualityClassifier`. The [quality classifier](https://huggingface.co/nvidia/quality-classifier-deberta) is used to classify text as high, medium, or low quality. This helps with data annotation, which is useful in data blending for foundation model training. 
Please refer to the Hugging Face page for more information about the quality classifier, including its output labels, here: https://huggingface.co/nvidia/quality-classifier-deberta.\n", + "This notebook demonstrates the use of NeMo Curator's `QualityClassifier`. The [quality classifier](https://huggingface.co/nvidia/quality-classifier-deberta) is used to classify text as high, medium, or low quality. This helps with data annotation, which is useful in data blending for foundation model training. Please refer to the NemoCurator Quality Classifier DeBERTa Hugging Face page for more information about the quality classifier, including its output labels, here: https://huggingface.co/nvidia/quality-classifier-deberta.\n", "\n", "The quality classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets.\n", "\n",
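The output semantics documented in this patch (Instruction Data Guard's strict 0.5 poisoning threshold and the FineWeb-Edu ``fineweb-edu-score-int >= 4`` filter) can be sketched in plain Python. This is a minimal sketch, independent of NeMo Curator's actual GPU-backed API: the column names come from the documentation above, while the helper functions and sample rows are hypothetical.

```python
def label_poisoning(scores, threshold=0.5):
    """Mirror the documented Instruction Data Guard output: a float
    probability column plus a boolean column that is True only when
    the score is strictly greater than the 0.5 threshold."""
    return [
        {"instruction_data_guard_poisoning_score": s, "is_poisoned": s > threshold}
        for s in scores
    ]


def keep_highly_educational(rows, min_score=4):
    """Mirror the documented FineWeb-Edu filter: keep only rows whose
    integer score is 4 or 5 (highly educational content)."""
    return [r for r in rows if r["fineweb-edu-score-int"] >= min_score]


poison_rows = label_poisoning([0.05, 0.93, 0.5])
# A score of exactly 0.5 is NOT flagged, matching "greater than 0.5".

edu_rows = keep_highly_educational(
    [{"fineweb-edu-score-int": 2}, {"fineweb-edu-score-int": 5}]
)
```

In the real pipeline these columns live in a `cudf`-backed `DocumentDataset` rather than plain lists of dicts, but the thresholding logic is the same.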