Record linkage as classification #1025

fgregg · 2022-05-22T11:53:31Z

if you squint, the problem of record linkage (but not deduplication) can be seen as a problem of multinomial classification.

the setup in record linkage is that you have dataset A and dataset B, and for each record in dataset A you want to find which record in dataset B, if any, it matches.

typically, you compute a set of distances between records in A and B, and then use those distances to find matches.
but this distance-based approach throws away a lot of information about what was actually in the records.

it would be nice to use the actual contents of a record.

if we think of this as a classification problem, we can.

let's interpret dataset B in an unusual way:

the dataset consists of training data with a single example for each of as many classes as there are records. we can then use the entire dataset B directly to train a multinomial classifier with features extracted from our records.

now the classifier is not going to be great, since we only have one example per class, but it will learn to down-weight features that are not informative of a class (it will work a lot like tf/idf and similar normalizations).

regardless, once we have this classifier, we can then feed in records from dataset A and get a prediction about what class they belong to.
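a minimal sketch of the idea above, assuming character trigrams as the features and a plain softmax (multinomial logistic) model trained by SGD. the helper names (`trigrams`, `train`, `predict_proba`), the toy records, and the hyperparameters are all made up for illustration; nothing here is an existing dedupe API.

```python
import math
from collections import Counter


def trigrams(text):
    """Character-trigram counts, a stand-in for 'features extracted from our records'."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def predict_proba(weights, x):
    """Softmax over one linear score per class (one class per record in B)."""
    scores = [sum(w[f] * v for f, v in x.items()) for w in weights]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(records_b, epochs=50, lr=0.1):
    """Multinomial logistic regression by SGD; record i is the only example of class i."""
    examples = [trigrams(r) for r in records_b]
    weights = [Counter() for _ in records_b]  # one sparse weight vector per class
    for _ in range(epochs):
        for true_class, x in enumerate(examples):
            probs = predict_proba(weights, x)
            for k, p in enumerate(probs):
                grad = p - (1.0 if k == true_class else 0.0)
                for f, v in x.items():
                    weights[k][f] -= lr * grad * v
    return weights


# hypothetical toy data: every record in B is its own class
dataset_b = [
    "Acme Hardware, 12 Main St",
    "Acme Bakery, 40 Oak Ave",
    "Zenith Bakery, 7 Main St",
]
weights = train(dataset_b)

# feed in a record from A and read off the predicted class
record_a = "ACME Hardwre 12 Main Street"
probs = predict_proba(weights, trigrams(record_a))
best = max(range(len(probs)), key=probs.__getitem__)
print(dataset_b[best], round(probs[best], 3))

# a trigram unique to class 0 ("har") ends up with a larger class-0 weight
# than one shared with another record ("acm")
print(weights[0]["har"], weights[0]["acm"])
```

in this toy run the shared trigram gets pushed up by its own record and pushed back down by the other record that contains it, which is the tf/idf-like down-weighting described above.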

that's the heart of the idea.

so, for a pair of records, one from dataset A and the other from dataset B, we could calculate two probabilities.

the typical record linkage probability that, given the distances between the records, they refer to the same thing
the probability that the A record belongs to the class of which the B record is the exemplar

with training data about whether pairs of records from A and B are co-referent, you could update both classifiers.

will this be a useful and practical idea?

i don't know.

there will need to be something like blocking on the classification predictions, or else we are back to N*M complexity. i'm not quite sure how that would work.
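one conventional possibility, sketched here only as an assumption and not something worked out in this thread: use a token inverted index over dataset B so that each A record is scored against only the classes it shares a reasonably rare token with. `build_index`, `candidate_classes`, and `max_block_size` are hypothetical names.

```python
from collections import defaultdict


def tokens(text):
    """Lowercased whitespace tokens with commas stripped."""
    return set(text.lower().replace(",", " ").split())


def build_index(records_b):
    """Map each token to the ids of the B records (classes) that contain it."""
    index = defaultdict(set)
    for class_id, record in enumerate(records_b):
        for tok in tokens(record):
            index[tok].add(class_id)
    return index


def candidate_classes(index, record_a, max_block_size=2):
    """Classes sharing at least one not-too-common token with the A record."""
    candidates = set()
    for tok in tokens(record_a):
        block = index.get(tok, set())
        if len(block) <= max_block_size:  # skip tokens too common to be useful
            candidates |= block
    return candidates


# hypothetical toy data
dataset_b = [
    "Acme Hardware, 12 Main St",
    "Acme Bakery, 40 Oak Ave",
    "Zenith Bakery, 7 Main St",
    "Zenith Motors, 3 Elm St",
]
index = build_index(dataset_b)
print(candidate_classes(index, "ACME Hardware 12 Main Street"))
```

the classifier would then only need scores for the candidate classes rather than all M of them. whether restricting the softmax this way keeps the probabilities meaningful is an open question.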

one thing that makes me skeptical is that if this were a good idea, it would have appeared in the information retrieval literature as a way of calculating weights for tokens. i haven't seen it there (though there's plenty of that literature i don't know).

fjsj · 2022-05-24T15:33:02Z

AFAIK, the problem with multinomial classification is keeping the number of classes N flexible between training and evaluation. So instead of training with N fixed classes, deep learning researchers usually build models that learn to embed records in a vector space and then query that vector space (with kNN, LSH, clustering, etc.), which gives flexibility on the number of classes.
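A rough sketch of the "embed, then query" pattern, to show why the number of classes stays flexible. Big assumption: `embed` here is a fixed character-trigram vector, not a learned encoder, and `nearest` is brute-force kNN; both names are made up for illustration. A real system would swap in a trained model and an approximate index.

```python
import math
from collections import Counter


def embed(text):
    """Fixed, L2-normalised trigram vector; a learned encoder would replace this."""
    padded = f"  {text.lower()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {f: v / norm for f, v in counts.items()}


def cosine(u, v):
    """Dot product of two already-normalised sparse vectors."""
    return sum(w * v.get(f, 0.0) for f, w in u.items())


def nearest(index, query, k=1):
    """Brute-force kNN; LSH or an ANN index would replace this at scale."""
    q = embed(query)
    ranked = sorted(index, key=lambda rec: cosine(q, index[rec]), reverse=True)
    return ranked[:k]


# hypothetical toy data
index = {rec: embed(rec) for rec in [
    "Acme Hardware, 12 Main St",
    "Acme Bakery, 40 Oak Ave",
]}

# a new entity is just one more vector in the index; nothing is retrained
new_record = "Zenith Motors, 3 Elm St"
index[new_record] = embed(new_record)

print(nearest(index, "Zenith Motors 3 Elm St"))
print(nearest(index, "ACME Hardwre 12 Main Street"))
```

Compare this with a softmax over N fixed classes, where adding a record to B means adding an output unit and retraining.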

You can learn to embed using pairs of records; that's what ditto and entity-embed do. Do these models also "know" the internals of the records or do they blindly look at the distance between the record attributes? I argue for the former. In deep learning models, the feature extraction layers are trainable and the distance is learned in a separate layer. So I don't think a DL model "throws away a lot of information about what was actually in the records" and I also think a DL model "will learn to down-weight feature that are not informative of a class". Whether DL models are really robust to class number growth during evaluation, I don't know.

See:

https://en.wikipedia.org/wiki/Zero-shot_learning
https://en.wikipedia.org/wiki/Self-supervised_learning
There's also Open Set Learning, which seems less studied.

NickCrews · 2023-02-27T06:49:27Z

Re treating this as multinomial classification: section 4.5 of this survey mentions that "Entity Resolution Using Convolutional Neural Net" treats linking records as a classification problem. I didn't dive any deeper than that.
