Add threshold to score as optional performance measure #1026

Open
fgregg opened this issue May 24, 2022 · 0 comments

fgregg commented May 24, 2022

Deduplication can produce a lot of record pair scores that are very close to 0. We don't filter those out when we pass the scored pairs on to clustering, because clustering usually does a little better with the actual scores than without them.

However, clustering is an O(N^2) operation, so if there are very, very big connected components, clustering can take a lot of memory.
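To get a feel for the quadratic growth, here is a rough back-of-the-envelope sketch. It assumes the clustering step stores one value per record pair in a component (a condensed pairwise matrix) at 8 bytes per value; both the helper name `condensed_matrix_bytes` and the 8-byte item size are assumptions for illustration, not a measurement of dedupe itself.

```python
def condensed_matrix_bytes(n_records: int, itemsize: int = 8) -> int:
    """Bytes needed to hold one value per unordered record pair.

    Assumption: one stored value per pair, `itemsize` bytes each.
    """
    n_pairs = n_records * (n_records - 1) // 2
    return n_pairs * itemsize


for n in (1_000, 10_000, 50_000):
    gigabytes = condensed_matrix_bytes(n) / 1e9
    print(f"{n:>6} records in one component: about {gigabytes:.2f} GB")
```

Under these assumptions, a 10x larger component needs roughly 100x the memory, which is why a single huge component dominates the cost.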

The clustering code already splits huge connected components into digestible pieces. It does this by increasing a threshold and filtering out pairs with scores below that threshold until the connected components are of an acceptable size.
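The idea can be sketched roughly as follows. This is not the library's actual clustering code, only a simplified illustration: it applies one global threshold, and the names `connected_components`, `split_until_small`, `max_size`, and `step` are hypothetical. Scored pairs are assumed to be `(id_a, id_b, score)` tuples.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group record ids into connected components with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _score in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    components = defaultdict(set)
    for node in parent:
        components[find(node)].add(node)
    return list(components.values())


def split_until_small(pairs, max_size, step=0.1):
    """Raise the threshold until no component exceeds max_size records."""
    threshold = 0.0
    kept = list(pairs)
    while kept:
        largest = max(len(c) for c in connected_components(kept))
        if largest <= max_size:
            break
        threshold += step
        # Drop the weakest links; this is what breaks big components apart.
        kept = [p for p in kept if p[2] >= threshold]
    return threshold, kept


pairs = [(1, 2, 0.9), (2, 3, 0.05), (3, 4, 0.8)]
threshold, kept = split_until_small(pairs, max_size=2)
```

Here the near-zero pair `(2, 3)` is the only thing joining two otherwise separate clusters, so removing it splits the four-record component into two components of two records each.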

We could expose a threshold argument on the score and partition methods to filter out low-scoring pairs early, as a performance measure.
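A minimal sketch of what early filtering might look like, independent of the real method signatures. The function `filter_by_threshold` and the `((id_a, id_b), score)` tuple layout are assumptions for illustration only; the default of `0.0` keeps every pair, so behaviour would be unchanged unless the caller opts in.

```python
from typing import Iterable, Iterator, Tuple

ScoredPair = Tuple[Tuple[str, str], float]


def filter_by_threshold(
    scored_pairs: Iterable[ScoredPair], threshold: float = 0.0
) -> Iterator[ScoredPair]:
    """Lazily yield only the pairs whose score is at least `threshold`."""
    for pair, score in scored_pairs:
        if score >= threshold:
            yield pair, score


scored = [(("a", "b"), 0.97), (("a", "c"), 0.001), (("b", "c"), 0.4)]
strong = list(filter_by_threshold(scored, threshold=0.1))
```

Because the filter is a generator, the dropped pairs never need to be held in memory before clustering sees them.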

We have previously discussed setting this automatically (discussion starting at #834 (review)), but couldn't come to an agreement on a principled way to do this (though one might still exist!).

This should address the problem described in this comment: #1024 (comment)
