Add AEGIS classifier #172

Merged
merged 16 commits into from
Aug 7, 2024

Conversation

ryantwolf
Collaborator

@ryantwolf ryantwolf commented Jul 30, 2024

Description

Adds the AEGIS Classifier and refactors the layout of the distributed data classifiers.

Usage

from nemo_curator.classifiers import AegisClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator import get_client

client = get_client(cluster_type="gpu")

input_dataset = DocumentDataset.read_json(
    "input/", backend="cudf", add_filename=True
)

safety_classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0",
    token="hf_1234",
    filter_by=["safe", "O13"],
)
result_dataset = safety_classifier(dataset=input_dataset)

result_dataset.to_json("output/", write_to_filename=True)
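
As a rough illustration of what `filter_by` is expected to do in the usage above, here is a minimal plain-Python sketch. It assumes the classifier attaches one predicted label per document and that `filter_by` keeps only documents whose label appears in the list. The `aegis_pred` field name and the `filter_by_labels` helper are hypothetical and are not part of NeMo Curator.

```python
from typing import Dict, Iterable, List


def filter_by_labels(
    records: Iterable[Dict[str, str]],
    keep: List[str],
    pred_field: str = "aegis_pred",  # hypothetical column name for the predicted label
) -> List[Dict[str, str]]:
    """Keep only the records whose predicted label is in `keep`."""
    keep_set = set(keep)
    return [r for r in records if r.get(pred_field) in keep_set]


# Toy documents with made-up predictions.
docs = [
    {"text": "a", "aegis_pred": "safe"},
    {"text": "b", "aegis_pred": "O3"},
    {"text": "c", "aegis_pred": "O13"},
]

# Mirrors filter_by=["safe", "O13"] from the usage example.
kept = filter_by_labels(docs, ["safe", "O13"])
```

The real classifier runs on a distributed GPU dataframe, so the actual filtering would happen on the dataset rather than on a Python list.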

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

The docs will be updated in a separate PR for the technical reviewer to review so they don't block this PR.

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
@ryantwolf ryantwolf marked this pull request as ready for review August 5, 2024 15:27
@ryantwolf
Collaborator Author

This should be ready for review now, @sarahyurick @VibhuJawa. We need to hold off on merging it, though, until this PR is merged in crossfit.

Collaborator

@sarahyurick sarahyurick left a comment

Tested with https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/distributed_data_classification/distributed_data_classification.ipynb to confirm domain and quality classification are still intact.

I'm currently not able to run the AEGIS classifier, though.

@@ -0,0 +1,77 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
Collaborator

When testing your example in the PR description (and using rapidsai/crossfit#66), I get:

TypeError: HFModel.__init__() got an unexpected keyword argument 'start_batch_size'

?

@VibhuJawa I see start_batch_size is in rapidsai/crossfit#66, do you know what could be causing this error?

Collaborator

I think this might be an installation issue: if you pip install the crossfit PR and then this PR, there is a chance you end up with mainline crossfit instead. The fix is to install the crossfit PR again.

I also ran into that when testing this PR, so that is my best guess.

Collaborator

@VibhuJawa VibhuJawa left a comment

Mostly looks good; I have requested some minor changes.

columns = ddf.columns.tolist()
pipe = op.Sequential(
    op.Tokenizer(
        self.model, cols=["_hidden_text"], tokenizer_type="sentencepiece"
Collaborator

Suggested change
- self.model, cols=["_hidden_text"], tokenizer_type="sentencepiece"
+ self.model, cols=["_hidden_text"], tokenizer_type="default"

Collaborator Author

I made the change, but what is this doing exactly?

Collaborator

There are mainly three flavors of tokenizers in use today:

  • subword (WordPiece): used by BERT-like models
  • BPE (Byte Pair Encoding): used by GPT-like models
  • SentencePiece: a related approach, often BPE-based, used by models such as Llama 2

Today, only subword is GPU accelerated; in the future, we plan to accelerate BPE and SentencePiece as well.

https://github.com/rapidsai/crossfit/blob/97febffff4bbc32317afaae0aaa94b57d13645c6/crossfit/op/tokenize.py#L31-L34

What default does (which is functionally the same as sentence-piece today) is use the tokenizer provided with the model directly. See below:

https://github.com/rapidsai/crossfit/blob/97febffff4bbc32317afaae0aaa94b57d13645c6/crossfit/op/tokenize.py#L59-L75

Why it matters:
Even though there is no functional difference between them today, there might be one in the future.
Using default means that no unexpected changes happen here when we update crossfit.
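
The behavior described above can be pictured as a small dispatch on the tokenizer type. The sketch below is illustrative only and is not crossfit's implementation: the option strings mirror the ones used in this thread, while the `TokenizerType` enum values, the `pick_backend` function, and the backend names it returns are hypothetical.

```python
from enum import Enum


class TokenizerType(Enum):
    SUBWORD = "subword"
    SENTENCE_PIECE = "sentencepiece"
    DEFAULT = "default"


def pick_backend(tokenizer_type: str) -> str:
    """Map a tokenizer_type string to the (hypothetical) backend that would run."""
    kind = TokenizerType(tokenizer_type)  # raises ValueError on an unknown string
    if kind is TokenizerType.SUBWORD:
        # The only path described as GPU accelerated today.
        return "gpu_subword_tokenizer"
    # "sentencepiece" and "default" currently behave the same way:
    # both use the tokenizer that ships with the model.
    return "model_tokenizer"
```

If a GPU path for SentencePiece were added later, only the `SENTENCE_PIECE` branch would change in a design like this, which is why callers that pass "default" would keep their current behavior.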

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
@ryantwolf
Collaborator Author

@sarahyurick @VibhuJawa thanks for the reviews. Should be good for another round.

Collaborator

@VibhuJawa VibhuJawa left a comment

Mostly there, one small nit. Thanks for pushing on this @ryantwolf

Signed-off-by: Ryan Wolf <[email protected]>
Collaborator

@VibhuJawa VibhuJawa left a comment

LGTM. Thanks again @ryantwolf

@ryantwolf ryantwolf requested a review from sarahyurick August 6, 2024 20:20
Collaborator

@sarahyurick sarahyurick left a comment

Can confirm domain and quality classifiers work on my end. I'm still working on running the AEGIS classifier, though.

Collaborator

@sarahyurick sarahyurick left a comment

LGTM, thanks!

@ryantwolf ryantwolf merged commit ec0d067 into main Aug 7, 2024
3 checks passed
@ryantwolf ryantwolf deleted the rywolf/aegis branch August 7, 2024 15:27
@sarahyurick sarahyurick mentioned this pull request Aug 7, 2024
3 tasks
3 participants