Deduplication can produce many record-pair scores that are very close to 0. We don't filter those out when we pass the scored pairs on to clustering, because clustering usually does a little better with the actual scores than without them.
However, clustering is an O(N^2) operation, so if there are very large connected components, clustering can take a lot of memory.
In the clustering code, we have code that splits up huge connected components into digestible pieces. We do this by increasing a threshold and filtering out pairs with scores below the threshold until the connected components are of an acceptable size.
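The threshold-raising idea described above can be sketched roughly as follows. This is a simplified illustration, not the library's actual clustering code: the function names, the `(id_a, id_b, score)` tuple layout, the fixed `step`, and the use of a single global threshold are all assumptions made for the example.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group record ids into connected components with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, _score in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


def split_until_small(pairs, max_size, step=0.1):
    """Raise a score threshold until no component exceeds max_size.

    Hypothetical helper: returns the threshold reached and the pairs kept.
    """
    threshold = 0.0
    kept = list(pairs)
    while kept:
        largest = max(len(c) for c in connected_components(kept))
        if largest <= max_size:
            break
        threshold += step
        kept = [p for p in kept if p[2] >= threshold]
    return threshold, kept
```

In this sketch, a weak edge that bridges two otherwise tight groups is dropped as soon as the threshold passes its score, which is what breaks the large component apart.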
We could expose a threshold argument to the `score` and `partition` methods to filter out the pairs early, as a performance measure.
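As a rough sketch of what early filtering might look like, assuming the scored pairs can be treated as an iterable of `(pair_ids, score)` tuples. The function name and signature here are hypothetical and are not part of the existing API; the default of `0.0` keeps every pair, so current behavior would be unchanged unless a caller opts in.

```python
def filter_scored_pairs(scored_pairs, threshold=0.0):
    """Yield only the pairs whose score is at least `threshold`.

    Hypothetical helper: `scored_pairs` is assumed to be an iterable of
    (pair_ids, score) tuples.
    """
    for pair_ids, score in scored_pairs:
        if score >= threshold:
            yield pair_ids, score
```

Because this is a generator, the low-scoring pairs would never need to be materialized before clustering.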
We have previously discussed setting this automatically (discussion starting at #834 (review)), but couldn't come to an agreement on a principled way to do this (though one might still exist!).
Should address the problem described in this comment: #1024 (comment)