-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider removing distinction between Static
and non-Static
APIs
#1049
Comments
there's a few issues related to pickling either kind of object, and then some issues related to the differences between the objects. Common issues
pickle.dump('obj.pickle', obj)
new_obj = pickle.load('obj.pickle')
new_obj.index(data) Issues related to the distinct classes
thinking through this a bit more, i wonder if an even more radical change might be interesting. we make the Active Learner class a first class object in the API. we move all the methods related to learning to that class, ActiveLearner.train() -> Dedupe This new Dedupe class could be directly pickleable. There's a nice separation of concerns here, and it would also make it very natural to write a different Learner class that took in static training data, which is harder than it ought to be right now to fit into our current API. Another way of slicing this up could be to split all the methods related to active learning into an ActiveLearner class, and than have a Dedupe class that had a fit_transform method but didn't keep a hold of any of the data.
that could be nice too. unfortunately the data that would need to be passed to the fit_transform method is complex. it's the labeled example data, but also a set of predicate weights. |
I like the idea of splitting up the ActiveLearner portion and the Dedupe/RecordLinkage/Gazeteer model portion. Not very familiar, but it looks like this is what https://github.com/scikit-activeml/scikit-activeml does. Instead of ActiveLearner.train() -> Dedupe, it could be did you mean Does it make sense to think of this as three different layers?
sklearn combines parts 1 and 2, and doesn't even touch 3. |
From #1045 (comment)
Why do we have the distinction between Static and non-Static classes? Is it to prevent re-training an already trained model? I don't think this needs to be enforced in the API. For example, in scikitlearn you can call fit() over and over again, and the previous learned settings are thrown out each time. Seems like we could enforce a similar contract.
@fgregg do you remember the history of the design choice was made to keep them separate? It must have been an explicit choice for some reason, and I'm thinking I'm just missing it.
Benefits of merging that I see would be
The text was updated successfully, but these errors were encountered: