Update fuzzy deduplication to skip false positive checks as the default #498
Signed-off-by: Ayush Dattagupta <[email protected]>
Generally LGTM, thanks! Just wanted to ask if https://github.com/NVIDIA/NeMo-Curator/blob/main/docs/user-guide/gpudeduplication.rst or any tutorials need to be updated?
EDIT: Nvm, I see the comment in the README.
Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]>
Thanks!
Description
This PR updates the FuzzyDedup modules, making `false_positive_check=False` and `char_ngrams=24` the default params going forward.
Motivation:
Internal testing and external research have shown that, depending on the values of `num_buckets` and `hashes_per_bucket`, the false positive rate can be minimized for the intended Jaccard similarity.
The defaults of 20 buckets and 13 hashes per bucket have a low false positive rate for documents with similarity < 0.8, and internal testing has shown a difference of 1-2% when skipping the fp check.
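To make the relationship between `num_buckets`, `hashes_per_bucket`, and the candidate rate concrete, here is a minimal sketch of the textbook banded-LSH estimate. It assumes independent minhashes and is not taken from the NeMo Curator code; the function name `candidate_probability` is hypothetical.

```python
def candidate_probability(similarity, num_buckets=20, hashes_per_bucket=13):
    """Textbook estimate of the chance that two documents with the given
    Jaccard similarity share at least one LSH bucket.

    Assumes every minhash agrees independently with probability `similarity`.
    """
    # Probability that all hashes within a single bucket agree.
    bucket_match = similarity ** hashes_per_bucket
    # Probability that at least one of the buckets matches.
    return 1.0 - (1.0 - bucket_match) ** num_buckets


if __name__ == "__main__":
    for s in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"similarity={s:.1f} -> candidate probability={candidate_probability(s):.4f}")
```

Under this model, pairs well below the 0.8 threshold are rarely bucketed together, which is why dropping the explicit check changes the results only slightly.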
Increasing the default character n-gram length (`char_ngrams`) to 24, which roughly corresponds to 5-word shingles, results in removal rates very similar to those of 5-char ngrams with the fp check enabled, since 5-char ngrams are prone to more false positives.
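The effect of the n-gram length can be illustrated with a small, self-contained sketch that computes exact Jaccard similarity over character n-grams. It does not use NeMo Curator or minhashing; the helper names and the sample sentences are made up for illustration.

```python
def char_ngrams(text, n):
    """Return the set of character n-grams of `text` (hypothetical helper)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n):
    """Exact Jaccard similarity between the character n-gram sets of a and b."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


# Two sentences that share common phrases but are not duplicates.
doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown cat sleeps under the lazy dog near the garden wall"

if __name__ == "__main__":
    for n in (5, 24):
        print(f"char_ngrams={n}: jaccard={jaccard(doc_a, doc_b, n):.3f}")
```

Short n-grams match on shared words and phrase fragments, so non-duplicate documents score higher than they do with 24-character n-grams, where a match requires a long shared span.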
The false positive check is computationally expensive and has significantly higher memory requirements: it needs to cache an additional copy of the corpus in memory, in addition to the minhashes and buckets.
These new defaults should produce similar results to the previous defaults while being significantly less compute intensive and faster.
Will update tutorials in a followup PR.