-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fineweb classifier #168
Add fineweb classifier #168
Conversation
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Merge after we switch this repo to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Runs fine for me. I was just wondering why we are adding this to Curator? Also, will this be added to the docs somewhere?
@sarahyurick . The scope is to provide an example of using a model which we dont release but maintain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, thanks for adding this. Just some minor nits and potential reformatting.
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, just a few minor concerns.
Signed-off-by: Vibhu Jawa <[email protected]>
@ryantwolf , Thanks for the careful review. Made relevant changes and followed up with questions. Let me know what else to change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok looks good. Just need to figure out the AEGIS memory thing.
Signed-off-by: Vibhu Jawa <[email protected]>
@ryantwolf , Pushed the memory changes 18d93da Please take a another look |
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
…ference.py Signed-off-by: Vibhu Jawa <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good on my end. Thank you!
Have addressed the review, dismissing to merge this PR in, thanks again Sarah for the review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks! Found a minor bug and opened a PR for it.
return argumentHelper.parser.parse_args() | ||
|
||
|
||
if __name__ == "__main__": | ||
main(attach_args().parse_args()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there is an unnecessary parse_args()
in attach_args()
and main()
, which is also in our other classifier examples too. I opened #217 to resolve this.
* Add fineweb classifier Signed-off-by: Vibhu Jawa <[email protected]> * Fix data-quality Signed-off-by: Vibhu Jawa <[email protected]> * Move fineweb to the mainline classifiers and some code cleanup Signed-off-by: Vibhu Jawa <[email protected]> * Add fineweb examples/scripts Signed-off-by: Vibhu Jawa <[email protected]> * Remove changes to Aegis classifier Signed-off-by: Vibhu Jawa <[email protected]> * Address Reviews Signed-off-by: Vibhu Jawa <[email protected]> * Fix aegis model Signed-off-by: Vibhu Jawa <[email protected]> * Fix minor bug in docstring Signed-off-by: Vibhu Jawa <[email protected]> * Fix minor bug in max_mem_gb_classifier scripts.py Signed-off-by: Vibhu Jawa <[email protected]> * Pass in max_mem_gb_classifier scripts/classifiers/aegis_classifier_inference.py Signed-off-by: Vibhu Jawa <[email protected]> --------- Signed-off-by: Vibhu Jawa <[email protected]> Signed-off-by: Yang Yu <[email protected]>
* Add fineweb classifier Signed-off-by: Vibhu Jawa <[email protected]> * Fix data-quality Signed-off-by: Vibhu Jawa <[email protected]> * Move fineweb to the mainline classifiers and some code cleanup Signed-off-by: Vibhu Jawa <[email protected]> * Add fineweb examples/scripts Signed-off-by: Vibhu Jawa <[email protected]> * Remove changes to Aegis classifier Signed-off-by: Vibhu Jawa <[email protected]> * Address Reviews Signed-off-by: Vibhu Jawa <[email protected]> * Fix aegis model Signed-off-by: Vibhu Jawa <[email protected]> * Fix minor bug in docstring Signed-off-by: Vibhu Jawa <[email protected]> * Fix minor bug in max_mem_gb_classifier scripts.py Signed-off-by: Vibhu Jawa <[email protected]> * Pass in max_mem_gb_classifier scripts/classifiers/aegis_classifier_inference.py Signed-off-by: Vibhu Jawa <[email protected]> --------- Signed-off-by: Vibhu Jawa <[email protected]>
Description
This PR adds fineweb clasifier.
Usage
Checklist