-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AEGIS classifier #172
Add AEGIS classifier #172
Conversation
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Should be ready for review now @sarahyurick @VibhuJawa. We need to wait on merging it in though until this PR is merged in crossfit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested with https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/distributed_data_classification/distributed_data_classification.ipynb to confirm domain and quality classification are still intact.
I'm currently not able to run the AEGIS classifier, though.
examples/aegis_classifier_example.py
Outdated
@@ -0,0 +1,77 @@ | |||
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. | |||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When testing your example in the PR description (and using rapidsai/crossfit#66), I get:
TypeError: HFModel.__init__() got an unexpected keyword argument 'start_batch_size'
?
@VibhuJawa I see start_batch_size
is in rapidsai/crossfit#66, do you know what could be causing this error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might have installation issues, like if you pip install this crossfit and then this PR , there is a chance we end up using mainline. The fix is just to install the crossfit PR again.
I also ran into that when testing this PR, so that will be my best guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good, have requested some minor changes
nemo_curator/classifiers/aegis.py
Outdated
columns = ddf.columns.tolist() | ||
pipe = op.Sequential( | ||
op.Tokenizer( | ||
self.model, cols=["_hidden_text"], tokenizer_type="sentencepiece" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.model, cols=["_hidden_text"], tokenizer_type="sentencepiece" | |
self.model, cols=["_hidden_text"], tokenizer_type="default" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the change, but what is this doing exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there are mostly 3 flavors of tokenizers in use today:
subword
: Used bybert
like modelsBPE
(Byte Pair Encoding) : Used by GPT like modelssentence-piece
: Another flavor of BPE, Used by models likellama-3
etc.
Today, only subword
is GPU accelerated, in the future BPE
and SentencePiece
will be GPU accelerated but in future we shall accelerate all of them
What default
does is (which is functionally same as sentence-piece
) is use the provided tokenizer in the model directly. See below:
Why it matters:
Even though there is no functionality difference b/w them today, it might be there in the future.
Having detault
means that no unexpected changes happen here when we update crossfit
.
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
@sarahyurick @VibhuJawa thanks for the reviews. Should be good for another round. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly there, one small nit. Thanks for pushing on this @ryantwolf
Signed-off-by: Ryan Wolf <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM to me. Thanks again @ryantwolf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can confirm domain and quality classifiers work on my end. I'm still working on running the AEGIS classifier, though.
Signed-off-by: Ryan Wolf <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Description
Adds the AEGIS Classifier and refactors the layout of the distributed data classifiers.
Usage
Checklist
The docs will be updated in a separate PR for the technical reviewer to review so they don't block this PR.