Add threshold to `score` as optional performance measure #1026

fgregg · 2022-05-24T13:10:04Z

Deduplication can produce a lot of record pair scores that are very close to 0. We don't filter those out when we pass the scored pairs on to the clustering, because the clustering usually does a little better with the actual scores than without them.

However, clustering is an O(N^2) operation, so if there are very, very big connected components, clustering can take a lot of memory.

The clustering code already splits huge connected components into digestible pieces. We do this by increasing a threshold and filtering out pairs with scores below the threshold until the connected components are of an acceptable size.
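For readers who haven't looked at that part of the code, here is a rough sketch of the idea, not the library's actual implementation. It assumes scored pairs are `((id_a, id_b), score)` tuples and, for simplicity, raises one global threshold rather than working component by component. The names `connected_components`, `split_until_small`, `max_size` and `step` are made up for illustration.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group record ids into connected components with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), _score in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def split_until_small(pairs, max_size, step=0.1):
    """Raise the threshold until no component has more than max_size records.

    Hypothetical helper: returns the threshold that was reached and the
    pairs that survived the filtering.
    """
    threshold = 0.0
    kept = list(pairs)
    while kept:
        components = connected_components(kept)
        if max(len(c) for c in components) <= max_size:
            break
        threshold += step
        # Drop pairs whose score is below the current threshold.
        kept = [(ids, score) for ids, score in kept if score >= threshold]
    return threshold, kept


pairs = [((1, 2), 0.9), ((2, 3), 0.05), ((3, 4), 0.8)]
threshold, kept = split_until_small(pairs, max_size=2)
```

In this toy input the weak 2-3 link is what glues everything into one component of four records, so one step of the threshold is enough to break it into two components of two.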

We could expose a threshold argument on the `score` and `partition` methods to filter out the pairs early, as a performance measure.
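Something like the following is what the early filter could look like in spirit. This is only a sketch under assumptions: scored pairs are modelled as an iterable of `((id_a, id_b), score)` tuples, and `filter_scored_pairs`, the `threshold` name and its default of `0.0` are all hypothetical, not an existing API.

```python
def filter_scored_pairs(scored_pairs, threshold=0.0):
    """Lazily yield only the pairs scoring at or above the threshold.

    With the default threshold of 0.0 nothing is dropped, so the
    behaviour would stay the same unless a caller opts in.
    """
    for ids, score in scored_pairs:
        if score >= threshold:
            yield ids, score


scored = [(("a", "b"), 0.97), (("a", "c"), 0.001), (("b", "c"), 0.4)]
survivors = list(filter_scored_pairs(scored, threshold=0.01))
```

Filtering lazily like this means the near-zero pairs never reach clustering at all, which is where the memory saving would come from.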

We have previously discussed setting this automatically (discussion starting at #834 (review)), but couldn't come to an agreement on a principled way to do this (though one might still exist!).

This should address the problem described in this comment: #1024 (comment)


fgregg mentioned this issue May 24, 2022

Local sparsification #1027

Open
