Refactor labeler.py #1065
Conversation
dedupe/labeler.py
Outdated
```diff
         return self.data_model.distances(pairs)

-    def fit(self, X: numpy.typing.NDArray[numpy.float_], y: LabelsLike) -> None:
+    def _fit(self, X: numpy.typing.NDArray[numpy.float_], y: LabelsLike) -> None:
         self.y: numpy.typing.NDArray[numpy.int_] = numpy.array(y)
         self.X = X
```
I know this was not your design, but it's a bit weird to have the fit method be responsible for setting these instance attributes. `mark` would be a more natural place, imo.
I agree. Still trying to figure out how to deal with the fact that the learners are stateful with their data. I almost want to split them up, so that one component is stateless like sklearn models and has a `fit()`, and that is owned by a learner that is stateful, remembers the old training data, and has the `mark()`.
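That split could be sketched roughly like this (hypothetical `Fitter`/`Learner` names, a rough illustration of the stateless/stateful separation, not the actual dedupe classes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class Fitter:
    """Stateless sklearn-style component: fit() retrains, nothing else is kept."""

    def __init__(self):
        self._model = LogisticRegression()

    def fit(self, X, y):
        self._model.fit(X, y)


class Learner:
    """Stateful owner: mark() remembers training data and delegates fitting."""

    def __init__(self, fitter):
        self.fitter = fitter
        self.X, self.y = [], []

    def mark(self, X_new, y_new):
        # accumulate labeled examples, then refit the stateless component
        self.X.extend(X_new)
        self.y.extend(y_new)
        self.fitter.fit(np.array(self.X), np.array(self.y))
```

Here `mark()` owns the training data and simply re-delegates to the stateless `fit()` each time new labels arrive.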
OK, in this new version all the sublearners are stateless (in regard to saving training data) and don't have a `mark()`, just a stateless `fit()`. Only DisagreementLearner has a `mark()` that stores training data.
dedupe/labeler.py
Outdated
```python
        self.y: numpy.typing.NDArray[numpy.int_] = numpy.array(y)
        self.X = X
        sklearn.linear_model.LogisticRegression.fit(self, self.X, self.y)
```
I think it's good to have an explicit call here, but if we do that, we shouldn't also have the class inherit from `sklearn.linear_model.LogisticRegression`.
If I turn this into `super().fit(self.X, self.y)` then mypy complains, because it needs both parents in the inheritance to have the same signature for `fit`:

```
dedupe/labeler.py:84: error: Argument 1 to "fit" of "Learner" has incompatible type "ndarray[Any, dtype[floating[_64Bit]]]"; expected "List[Tuple[Mapping[str, Any], Mapping[str, Any]]]"  [arg-type]
```

I think we should rename RLRLearner to MatchLearner (it is comparing records using distances, and this is sort of a nice opposition to BlockLearner) and make it wrap LogisticRegression (or RandomForest, or whatever) using composition, not inheritance.
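A minimal sketch of that composition idea, assuming a hypothetical `MatchLearner` that wraps a classifier rather than inheriting from it (the `distance_fn` argument stands in for the data model's distance computation; all names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class MatchLearner:
    """Wraps a sklearn classifier via composition instead of inheritance."""

    def __init__(self, distance_fn):
        # distance_fn turns record pairs into a numeric feature matrix
        self._distances = distance_fn
        self._classifier = LogisticRegression()  # could be RandomForest, etc.

    def fit(self, pairs, y):
        # public API deals in record pairs; distances stay internal
        X = self._distances(pairs)
        self._classifier.fit(X, np.asarray(y))

    def predict_proba(self, pairs):
        return self._classifier.predict_proba(self._distances(pairs))
```

Because the wrapped classifier is an attribute rather than a parent class, `fit` can take `TrainingPairs` without mypy demanding that its signature match `LogisticRegression.fit`.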
See if you like what I did.
dedupe/labeler.py
Outdated
```diff
         for learner in self.learners:
-            learner.fit_transform(pairs, y)
+            learner.fit(pairs, y)
```
`fit` is a bad verb for something that both sets the data on a learner and retrains the learner. I think maybe we want to get rid of `fit` from the public API and have `mark` be responsible for all of this.
See new version and comment above.
The meaning of these methods is not consistent with how sklearn uses them. For sklearn, `fit_transform(X, y)` means `fit(X, y).transform(X)`, but we were using it as `fit(transform(X), y)`. So just rename `fit_transform()` to `fit()`, and now it has the correct meaning.

What's more, the `transform` method isn't used external to the class, and I don't think it should be. The public API should only deal with TrainingPairs; it shouldn't deal with the calculated distances. Rename it to `_distances()`. The old `fit()` method is now just an internal helper that deals with already-calculated distance data.
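For reference, sklearn's convention can be checked directly with any transformer, for example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

# sklearn convention: fit_transform(X) is equivalent to fit(X).transform(X)
out1 = StandardScaler().fit_transform(X)
out2 = StandardScaler().fit(X).transform(X)

assert np.allclose(out1, out2)
```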
It is already done in self._fit()
Now mypy prints the error code, e.g. `[arg-type]`, when it errors, so you can add `# type: ignore[arg-type]` and be specific about the error you are silencing.
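For illustration, a made-up example (not from this codebase) of suppressing only the named error code rather than using a bare ignore:

```python
def double(x: int) -> int:
    return x * 2

# mypy reports: Argument 1 to "double" has incompatible type "str"  [arg-type]
# Naming the code silences only that error, unlike a bare "# type: ignore":
result = double("ab")  # type: ignore[arg-type]

# str * int is still valid at runtime, so this runs fine despite the type error
print(result)
```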
It's not actually used anywhere besides creating the initial set of candidate predicates.
It's only used internally, so make that obvious.
It's already part of the `Learner` base class.
DisagreementLearner is really reaching far down into the contained objects. The lower classes themselves should be responsible for this sort of thing. Sure, it makes the code more complicated, but the way it is now fools us into thinking it is simple, and makes it prone to breaking in the future.
It's not actually used anywhere. See dedupeio#1065 (comment)
We were only testing a very small component of labeler. Now we actually go through most lines of code. Check out coverage.
- Remove many unused methods of MatchLearner, like `mark()` and `pop()`. I think these used to get used when this was the only learner class, but now they aren't used anywhere.
- Just set `MatchLearner.candidates` in the constructor. It makes it way easier to reason about.
- Adjust the inheritance. Now DisagreementLearner is out of the hierarchy that MatchLearner and BlockLearner are in. This is good, because DisagreementLearner OWNS these other two; it is not an "is a" relationship.
- Remove the `mark()` function from the sub-learners. They just have the `fit()` method now, but they don't actually persist this training data, which is in line with the naming of `fit()`.
- Make `candidates` a read-only attribute. It makes it easier to reason about; we don't have to worry about someone outside of the class coming in and changing it.
- Fix a bug in BlockLearner where `remove` never actually removed the entry from `candidates`, so if you broke the cache of `_cached_scores` and ended up calling `self._predict(self.candidates)`, you would get the result from all of the original candidates.
- In the test, actually check for the values of the candidates, not just the number of them.
- Rename `_remove` to `remove` in the sub-learners, since they are publicly used in DisagreementLearner.
- Remove the unused `candidate_scores` from the DisagreementLearner public API.
- Always make a copy in `sample_records()` to avoid footguns.
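The read-only `candidates` idea mentioned above can be sketched with a plain property (illustrative only, not the actual dedupe code):

```python
class BlockLearner:
    def __init__(self, candidates):
        # store privately as a tuple; expose read-only to prevent outside mutation
        self._candidates = tuple(candidates)

    @property
    def candidates(self):
        return self._candidates


learner = BlockLearner([("rec_a", "rec_b"), ("rec_c", "rec_d")])
print(learner.candidates)
# learner.candidates = ()  # would raise AttributeError: can't set attribute
```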
Force-pushed from `3e7d4bd` to `2605b29`.
@fgregg this has gotten way bigger, and this was very much done in an experimental/haphazard way, so some commits undo/modify what previous commits did. I can rebase this in a more sensible way if the large-scale idea looks right. Maybe, though, take a look at the final result and see if this is generally heading in a good direction before I do all that cleanup work. Or I can split this into a PR with the less contentious changes, get that merged, and then take a look at the more structural changes afterwards. Thank you for looking!
Generally, I really like this direction. Sorry for the slow response, I've been out sick.
Really liking where this is going. A few comments inline.
```diff
@@ -138,9 +139,15 @@ def score(predicate: Predicate) -> float:

         return candidates

-    def cover(self, pairs: TrainingExamples) -> Cover:
+    def cover(self, pairs: TrainingExamples, index_predicates: bool = True) -> Cover:
```
I don't love this change. Is it really ugly to avoid this?
Have you seen the other changes from this commit, and the corresponding commit message? I think the current way that we deal with this, where in `DisagreementLearner.learn_predicates()` we do `self.blocker.block_learner.blocker.predicates.copy()`, is a terrible code smell, reaching 3 levels deep. In addition, `blocker`, `block_learner`, and `blocker` are all the same thing. So I wanted to actually create public APIs at each level.

The other problem there is the pattern of "temporarily delete some predicates, do some stuff, and then restore them", which I didn't like.

I agree that this commit does add more lines of code, which I don't like, but what I like less was the unofficial and haphazard methods we used.

When you say "this change", do you mean "adding `index_predicates` as an arg to this function"? We could probably adjust this. Or do you mean the larger-scale changes I mention above?
Actually, I'm good with this!
I'm really struggling with this. I think this is better than the code that it is replacing, but it seems like this filtering should not be the responsibility of the BlockLearner class.
Hmm... what about this design:
```python
class BlockLearner(ABC):
    def learn(self, matches, recall, candidate_types='simple', selected_predicates=None):
        comparison_cover = self.comparison_cover
        if selected_predicates:
            candidate_preds = comparison_cover & selected_predicates
        else:
            candidate_preds = comparison_cover.keys()
        match_cover = self.cover(matches, candidate_preds)
        ...

    def cover(self, pairs, candidate_preds) -> Cover:
        predicate_cover = {}
        for predicate in candidate_preds:  # type: ignore
            ...
```
then this labeler code could just be

```python
def learn_predicates(
    self, recall: float, selected_predicates
) -> tuple[Predicate, ...]:
    learned_preds: tuple[Predicate, ...]
    dupes = [pair for label, pair in zip(self.y, self.pairs) if label]
    learned_preds = self.blocker.block_learner.learn(
        dupes,
        recall=recall,
        candidate_types="random forest",
        selected_predicates=selected_predicates,
    )
```

and then it would ultimately be the responsibility of the `train` method in `api.py` to filter the predicates and pass them on appropriately.
I'm trying to wrap my head around the general pattern of what we're doing, correct me if I'm wrong. Right now there are three sets of predicates:

1. All the available predicates from the data model.
2. A subset of 1: all the predicates that the BlockLearners use as candidates.
3. A subset of 2: the final predicates chosen by the BlockLearners that we use to actually do the blocking in `Dedupe.pairs()`.
Right now BlockLearners are handed 1, and they filter that down to 2 using the `index_predicates` flag. Then they do the step of filtering from 2 to 3. I agree that this is bad; these concerns should be separated. The filtering from 1 to 2 should happen external to the BlockLearners. The block learners should just get handed set 2, and they select set 3 from that. That is the direction that #1052 is trying to go in.
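The three-set funnel could be sketched as plain functions (all names here are hypothetical; the real code spreads these responsibilities across Dedupe, training, and the learners):

```python
def candidate_predicates(all_preds, index_predicates=True):
    """Set 1 -> set 2: filter out index predicates before any learner sees them."""
    if index_predicates:
        return set(all_preds)
    # for this sketch, index predicates are marked by an "Index" name prefix
    return {p for p in all_preds if not p.startswith("Index")}


def learned_predicates(candidates, keep):
    """Set 2 -> set 3: what the block learner finally selects (stubbed here)."""
    return {p for p in candidates if p in keep}


# Set 1: everything the data model can offer
all_preds = {"ExactString", "IndexTfidf", "CommonToken"}

cands = candidate_predicates(all_preds, index_predicates=False)  # set 2
final = learned_predicates(cands, keep={"ExactString"})          # set 3
```

The point of the sketch is that the 1→2 filtering happens before the learner is constructed, so the learner only ever sees set 2.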
I'm not sure the 1->2 filtering should happen in `train()` though. Can't we do this at the very beginning, when the data model is first given to Dedupe? I don't like the `comparison_cover & selected_predicates` bit; this filtering is happening later than it should. I think we can make it so that `training.BlockLearner` is only ever handed the `selected_predicates` at the beginning, because in `training.DedupeBlockLearner.__init__` we call

```python
self.blocker = blocking.Fingerprinter(predicates)
self.blocker.index_all(data)
```

which makes us index ALL the predicates, even if then in subsequent calls to `learn()` or `cover()` we filter out the index predicates and don't use them.
If you think this is at least an improvement, how about you merge it, and then we can write PRs that go in this even better direction, so we have something concrete to talk about? I'm finding it hard without actually testing things out, but I don't want to have merge conflicts later, so I want main to be stable.

Or, I rebase and remove this commit from this PR, you merge everything else, and we work on this separately.
But yes I think your suggestion is better than what I have currently.
This conversation is getting too big, should it be its own issue?
@fgregg in case you didn't see this
With your refactor, it's pretty apparent that the weighted sampling methods on these classes are a bit odd. We can save that for a future PR though. Probably, the right thing to do is to create a new object, or honestly, maybe just some new functions that take in a blocker coverage object and return a weighted sample of keys.
The config in setup.cfg is ignored.
Makes it consistent between the MatchLearner and BlockLearner. Also fixes a bug where `self._fitted` was never set to True.
@fgregg I fixed two of your requests; take a look at the final one that I didn't change, because I wanted clarification.
Hi @NickCrews. Sorry for the slow response! I responded to you on the
I'm going to go ahead and bring this in, and we can keep working on improving how we interact with predicates in future PRs.
Sweet, thanks for keeping the momentum going @fgregg!
See each commit individually; there are no functional changes.
EDIT: there still shouldn't be any functional changes, though this has gotten much more substantial.