Add AEGIS classifier #172

ryantwolf · 2024-07-30T16:07:41Z

Description

Adds the AEGIS Classifier and refactors the layout of the distributed data classifiers.

Usage

from nemo_curator.classifiers import AegisClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator import get_client

client = get_client(cluster_type="gpu")

input_dataset = DocumentDataset.read_json(
    "input/", backend="cudf", add_filename=True
)

safety_classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0",
    token="hf_1234",
    filter_by=["safe", "O13"],
)
result_dataset = safety_classifier(dataset=input_dataset)

result_dataset.to_json("output/", write_to_filename=True)

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

The docs will be updated in a separate PR for the technical reviewer to review so they don't block this PR.

Signed-off-by: Ryan Wolf <[email protected]>

ryantwolf · 2024-08-05T15:29:14Z

Should be ready for review now @sarahyurick @VibhuJawa. We need to hold off on merging it, though, until this PR is merged in crossfit.

sarahyurick

Tested with https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/distributed_data_classification/distributed_data_classification.ipynb to confirm domain and quality classification are still intact.

I'm currently not able to run the AEGIS classifier, though.

examples/aegis_classifier_example.py

nemo_curator/scripts/aegis_classifier_inference.py

nemo_curator/classifiers/aegis.py

nemo_curator/classifiers/__init__.py

sarahyurick · 2024-08-05T19:33:11Z

examples/aegis_classifier_example.py

@@ -0,0 +1,77 @@
+# Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
+#


When testing your example in the PR description (and using rapidsai/crossfit#66), I get:

TypeError: HFModel.__init__() got an unexpected keyword argument 'start_batch_size'

?

@VibhuJawa I see start_batch_size is in rapidsai/crossfit#66, do you know what could be causing this error?

I think you might have an installation issue: if you pip install the crossfit PR and then install this PR, there is a chance you end up using mainline crossfit. The fix is just to install the crossfit PR again.

I also ran into that when testing this PR, so that will be my best guess.
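One way to confirm which build is active before running the classifier is to inspect the constructor signature. The following is a minimal sketch, not part of crossfit or NeMo Curator: `accepts_kwarg` is a hypothetical helper, and the two stand-in classes (including their `max_mem_gb` parameter) only mimic the mainline and PR cases discussed here.

```python
import inspect


def accepts_kwarg(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all also accepts the name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-in for a mainline build of a model class.
class MainlineModel:
    def __init__(self, path_or_name, max_mem_gb=16):
        self.path_or_name = path_or_name


# Hypothetical stand-in for the PR build that adds start_batch_size.
class PatchedModel:
    def __init__(self, path_or_name, max_mem_gb=16, start_batch_size=1):
        self.path_or_name = path_or_name
        self.start_batch_size = start_batch_size


print(accepts_kwarg(MainlineModel, "start_batch_size"))
print(accepts_kwarg(PatchedModel, "start_batch_size"))
```

In a real environment the same check would be run against the installed `HFModel` class, and printing `crossfit.__file__` shows which install Python actually picked up.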

nemo_curator/scripts/aegis_classifier_inference.py

nemo_curator/classifiers/quality.py

VibhuJawa

Mostly looks good; I have requested some minor changes.

examples/domain_classifier_example.py

examples/quality_classifier_example.py

nemo_curator/classifiers/aegis.py

VibhuJawa · 2024-08-05T23:18:31Z

nemo_curator/classifiers/aegis.py

+        columns = ddf.columns.tolist()
+        pipe = op.Sequential(
+            op.Tokenizer(
+                self.model, cols=["_hidden_text"], tokenizer_type="sentencepiece"


Suggested change

-                self.model, cols=["_hidden_text"], tokenizer_type="sentencepiece"
+                self.model, cols=["_hidden_text"], tokenizer_type="default"

I made the change, but what is this doing exactly?

So there are mostly three flavors of tokenizers in use today:

subword: used by BERT-like models

BPE (Byte Pair Encoding): used by GPT-like models

sentence-piece: another flavor of BPE, used by models like llama-3 etc.

Today, only subword is GPU accelerated; in the future, BPE and SentencePiece will be GPU accelerated as well.

https://github.com/rapidsai/crossfit/blob/97febffff4bbc32317afaae0aaa94b57d13645c6/crossfit/op/tokenize.py#L31-L34

What default does (which is functionally the same as sentence-piece) is use the tokenizer provided with the model directly. See below:

https://github.com/rapidsai/crossfit/blob/97febffff4bbc32317afaae0aaa94b57d13645c6/crossfit/op/tokenize.py#L59-L75

Why it matters:
Even though there is no functional difference between them today, there might be one in the future.
Having default means that no unexpected changes happen here when we update crossfit.
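To illustrate the point, here is a minimal sketch of the dispatch described above. It is not crossfit's actual code: `TokenizerKind` and `pick_backend` are hypothetical names, the backend labels are made up for the example, and the string values are assumed from the suggested change.

```python
from enum import Enum


class TokenizerKind(Enum):
    """Hypothetical stand-in for the tokenizer types discussed above."""

    SUBWORD = "subword"
    SENTENCE_PIECE = "sentencepiece"
    DEFAULT = "default"


def pick_backend(tokenizer_type):
    """Map a tokenizer_type string to the backend that would run it.

    Mirrors the behavior described in the thread: only subword has a GPU
    path today, while sentencepiece and default both use the tokenizer
    that ships with the model.
    """
    kind = TokenizerKind(tokenizer_type)  # raises ValueError on unknown names
    if kind is TokenizerKind.SUBWORD:
        return "gpu-subword"
    return "model-tokenizer"


for name in ("subword", "sentencepiece", "default"):
    print(name, "->", pick_backend(name))
```

If a later release gave sentencepiece its own GPU path, only that branch would change, and callers passing "default" would keep using the model's own tokenizer.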

nemo_curator/classifiers/base.py

Signed-off-by: Ryan Wolf <[email protected]>

ryantwolf · 2024-08-06T01:38:36Z

@sarahyurick @VibhuJawa thanks for the reviews. Should be good for another round.

VibhuJawa

Mostly there, one small nit. Thanks for pushing on this @ryantwolf

nemo_curator/classifiers/aegis.py

Signed-off-by: Ryan Wolf <[email protected]>

VibhuJawa

LGTM. Thanks again @ryantwolf

sarahyurick

Can confirm domain and quality classifiers work on my end. I'm still working on running the AEGIS classifier, though.

nemo_curator/classifiers/aegis.py

Signed-off-by: Ryan Wolf <[email protected]>

sarahyurick

LGTM, thanks!

ryantwolf added 3 commits July 30, 2024 09:07

Add aegis classifier

733535b

Signed-off-by: Ryan Wolf <[email protected]>

Add init file

8e4c0d5

Signed-off-by: Ryan Wolf <[email protected]>

Fix metadata

3a3041a

Signed-off-by: Ryan Wolf <[email protected]>

ryantwolf mentioned this pull request Jul 30, 2024

Error with different padding size in batch rapidsai/crossfit#65

Closed

ryantwolf added 9 commits August 1, 2024 15:47

Merge branch 'main' into rywolf/aegis

58ce05b

Remove autocast

551bf90

Signed-off-by: Ryan Wolf <[email protected]>

Reorganize distributed classifiers

f85d2ac

Signed-off-by: Ryan Wolf <[email protected]>

Add script for aegis inference

8ce6891

Signed-off-by: Ryan Wolf <[email protected]>

Change name in logging

d96ef0a

Signed-off-by: Ryan Wolf <[email protected]>

Include aegis in setup.py

01b276d

Signed-off-by: Ryan Wolf <[email protected]>

Fix bugs

c35ba52

Signed-off-by: Ryan Wolf <[email protected]>

Add example script

af67855

Signed-off-by: Ryan Wolf <[email protected]>

Fix logging name

076f1da

Signed-off-by: Ryan Wolf <[email protected]>

ryantwolf marked this pull request as ready for review August 5, 2024 15:27

ryantwolf requested review from VibhuJawa and sarahyurick August 5, 2024 15:27

sarahyurick requested changes Aug 5, 2024

sarahyurick reviewed Aug 5, 2024

nemo_curator/classifiers/quality.py

VibhuJawa requested changes Aug 5, 2024

ryantwolf added 2 commits August 5, 2024 18:12

Address Sarah's Review

49e51db

Signed-off-by: Ryan Wolf <[email protected]>

Address Vibhu's Review

15d89b0

Signed-off-by: Ryan Wolf <[email protected]>

VibhuJawa requested changes Aug 6, 2024

nemo_curator/classifiers/aegis.py

Fix load_cfg

6fd46c9

Signed-off-by: Ryan Wolf <[email protected]>

VibhuJawa approved these changes Aug 6, 2024

ryantwolf requested a review from sarahyurick August 6, 2024 20:20

sarahyurick requested changes Aug 6, 2024

nemo_curator/classifiers/aegis.py

nemo_curator/classifiers/aegis.py

Create more descriptive error message

d496afa

Signed-off-by: Ryan Wolf <[email protected]>

sarahyurick approved these changes Aug 7, 2024

ryantwolf merged commit ec0d067 into main Aug 7, 2024
3 checks passed

ryantwolf deleted the rywolf/aegis branch August 7, 2024 15:27

sarahyurick mentioned this pull request Aug 7, 2024

Add fineweb classifier #168

Merged

3 tasks
