Update fuzzy deduplication to skip false positive checks as the default #498

ayushdg · 2025-01-28T00:41:57Z

Description

This PR updates the FuzzyDedup modules, making false_positive_check=False and char_ngrams=24 the default parameters going forward.

Motivation:

Internal testing and external research have shown that, with suitable values of num_buckets and hashes_per_bucket, the false positive rate can be minimized for the intended Jaccard similarity.
The defaults of 20 buckets and 13 hashes per bucket have a low false positive rate for documents with similarity < 0.8, and internal testing has shown a difference of 1-2% when skipping the fp check.
Increasing the default character n-gram length (char_ngrams) to 24, which roughly corresponds to 5-word shingles, results in removal rates very similar to those of 5-char ngrams with the fp check enabled, since 5-char ngrams are prone to more false positives.
The false positive check is computationally expensive and has significantly higher memory requirements. It needs a much larger in-memory cache because it keeps an additional copy of the corpus alongside the minhashes and buckets.
These new defaults should produce results similar to the previous defaults while being significantly faster and less compute intensive.
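
To make the num_buckets / hashes_per_bucket argument concrete, here is a rough sketch of the standard MinHash LSH banding estimate: the probability that two documents land in at least one common bucket is 1 - (1 - s^r)^b, where s is their Jaccard similarity, r is hashes_per_bucket and b is num_buckets. The function name candidate_probability is made up for illustration and is not part of NeMo Curator. The formula assumes idealized independent hash functions, so treat the numbers as a guide rather than a measurement of the actual implementation.

```python
def candidate_probability(
    similarity: float,
    num_buckets: int = 20,
    hashes_per_bucket: int = 13,
) -> float:
    """Estimate the chance that two documents share at least one LSH bucket.

    Hypothetical helper (not a NeMo Curator API). Uses the textbook banding
    formula 1 - (1 - s**r)**b and assumes independent, ideal hash functions.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    # Probability that all hashes in a single bucket agree.
    per_bucket_match = similarity ** hashes_per_bucket
    # Probability that at least one of the buckets matches.
    return 1.0 - (1.0 - per_bucket_match) ** num_buckets


if __name__ == "__main__":
    # Print the S-curve for the defaults mentioned in this PR (20 x 13).
    for s in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
        print(f"jaccard={s:.1f}  P(candidate)={candidate_probability(s):.4f}")
```

Under these assumptions the curve stays close to zero for clearly dissimilar pairs and rises steeply around the intended threshold, which is why skipping the fp check changes results only slightly.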

Will update tutorials in a followup PR.

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Ayush Dattagupta <[email protected]>

sarahyurick

Generally LGTM, thanks! Just wanted to ask if https://github.com/NVIDIA/NeMo-Curator/blob/main/docs/user-guide/gpudeduplication.rst or any tutorials need to be updated?

EDIT: Nvm, I see the comment in the README.

examples/fuzzy_deduplication.py

nemo_curator/modules/config.py

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]>

sarahyurick

Thanks!

sarahyurick · 2025-01-28T18:58:06Z

These are the tutorials which will potentially need updating:

ayushdg added 3 commits January 27, 2025 16:38

Update no-fp check defaults

00acab2

Signed-off-by: Ayush Dattagupta <[email protected]>

remove outdated cli docs in favor of user guide docs

f89bcb2

Signed-off-by: Ayush Dattagupta <[email protected]>

Add/update tests

c0b4ac5

Signed-off-by: Ayush Dattagupta <[email protected]>

ayushdg requested a review from sarahyurick January 28, 2025 00:41

ayushdg added the enhancement and gpuci labels Jan 28, 2025

ayushdg mentioned this pull request Jan 28, 2025

Fuzzy Dedup: Make skipping the False positive check the default #386

Closed

3 tasks

sarahyurick reviewed Jan 28, 2025

examples/fuzzy_deduplication.py (outdated, resolved)

nemo_curator/modules/config.py (outdated, resolved)

Apply suggestions from code review

39c7069

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]>

ayushdg removed and re-added the gpuci label Jan 28, 2025

sarahyurick approved these changes Jan 28, 2025

ayushdg merged commit fe41ac1 into NVIDIA:main Jan 28, 2025
6 checks passed

ayushdg mentioned this pull request Feb 3, 2025

Update fuzzy deduplication section of tutorials to skip false positive check (where applicable) #511

Merged

3 tasks
