Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record linkage as classification #1025

Open
fgregg opened this issue May 22, 2022 · 2 comments
Open

Record linkage as classification #1025

fgregg opened this issue May 22, 2022 · 2 comments

Comments

@fgregg
Copy link
Contributor

fgregg commented May 22, 2022

if you squint, the problem of record linkage (but not deduplication) is as a problem of multinomial classification.

the set up in record linkage is that you have dataset A and dataset B, for each record in dataset A you want to find which record, if any, it matches in dataset B.

typically, you create a set of distances between records in A and B, and then use those distances to find matches
but this distance based approach throws away a lot of information about what was actually in the records

it would be nice to use the actual contents of a record.

if we thought of this as classification problem we can.

let's interpret dataset B in an unusual way:

the dataset consists of training data of single examples of as many classes as there are records. we can then directly use the entire dataset B to train a multinomial classifier with features extracted from our records

now the classifier is not going to be great, since we only have one example per class, but the classifier will learn to down-weight feature that are not informative of a class. (it will work a lot like tf/idf and similar normalizations).

regardless, once we have this classifier, we can then feed in records from dataset A and get a prediction about what class they belong to.

that's the heart of the idea.

so, for a pair of records, one from dataset A and the other from dataset B, we could calculate two probabilities.

  1. the typical record linkage probability that, given the distances between the records, that they refer to the same thing
  2. the probability that A record belongs to the class of that the B record is the exemplar of.

with training data about whether pairs of records from A or B are co-referent, you could update both classifiers.

will this be a useful and practical idea?

i don't know.

there will need to be something like blocking on the classification predictions, or else we are back to a N*M complexity. not quite sure how that would work.

one thing that makes me skeptical is that if this was a good idea, it would have appeared in the information retrieval literature as a way of calculating weights for tokens. i haven't seen it in there

(though there's plenty of that literature i don't know).

@fjsj
Copy link
Contributor

fjsj commented May 24, 2022

AFAIK, the problem with multinomial classification is how to have flexibility between training and evaluation on the number N of classes. So instead of training with N fixed classes, deep learning researchers usually try to make models that learn to embed records in a vector space and then query that vector space (with kNN, LSH, clustering, etc., which gives flexibility on the number of classes).

You can learn to embed using pairs of records, that's what ditto and entity-embed do. Do these models also "know" the internals of the records or do they blindly look at the distance between the record attributes? I argue for the former. In deep learning models, the feature extraction layers are trainable and the distance is learned in a separate layer. So I don't think a DL model "throws away a lot of information about what was actually in the records" and I also think a DL model "will learn to down-weight feature that are not informative of a class". Whether DL models are really robust to class number growth during evaluation, I don't know.

See:

@NickCrews
Copy link
Contributor

Re treating as multinomial classification, in section 4.5 of this survey they mention that Entity Resolution Using Convolutional Neural Net treats linking records as a classification problem. I didn't dive any deeper than that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants