Consolidate loaders #305
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #305      +/-   ##
==========================================
- Coverage   62.70%   62.59%   -0.12%
==========================================
  Files          63       64       +1
  Lines        8580     8637      +57
  Branches     2444     2455      +11
==========================================
+ Hits         5380     5406      +26
- Misses       2583     2612      +29
- Partials      617      619       +2
☔ View full report in Codecov by Sentry.
Hi @sneakers-the-rat - it looks like this makes the loaders stateful. Currently they are stateless (you may ask why they are objects at all, given that they are used just as if they were functions, ...). This isn't necessarily wrong, but I was wondering if there was a specific justification, vs. for example making an iterable class...
I was just trying to preserve existing behavior as much as possible! Literally copy and pasting methods from main linkml repo to here :) So they are stateful for the things that the main linkml repo was using them for, and not stateful for the ways linkml-runtime was using them. No strong preference from me - having state would be useful in some contexts where one would like to be able to refer to the source schema in addition to the target class, etc., where the class doesn't have all the info of the schema, but that can also be accomplished in other ways. Happy to modify to make it be one or the other.
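To make the stateful/stateless distinction in this exchange concrete, here is a minimal sketch. It is not the linkml-runtime API: `JSONLoaderSketch`, its `target_class` attribute, the module-level `loads` wrapper, and the `Person` class are all hypothetical names chosen for illustration. It shows a loader that can hold a default target class as state, next to a function form that takes everything as arguments.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class JSONLoaderSketch:
    """Hypothetical loader that keeps a little state (not the real loader class)."""

    target_class: Optional[type] = None  # default class to instantiate, if any

    def loads(self, source: str, target_class: Optional[type] = None) -> Any:
        """Parse a JSON string; build an instance if a target class is known."""
        data = json.loads(source)
        cls = target_class or self.target_class
        return cls(**data) if cls is not None else data


def loads(source: str, target_class: Optional[type] = None) -> Any:
    """Stateless form: create a throwaway loader and delegate to it."""
    return JSONLoaderSketch().loads(source, target_class)


@dataclass
class Person:
    name: str
    age: int


# Stateful use: configure once, call many times.
loader = JSONLoaderSketch(target_class=Person)
alice = loader.loads('{"name": "Alice", "age": 30}')

# Stateless use: everything is passed as arguments on each call.
bob = loads('{"name": "Bob", "age": 41}', Person)
raw = loads('{"name": "Carol", "age": 25}')
```

In this sketch the function form is only a thin wrapper around the class, so existing call sites that treat loaders like functions would keep working while the class gains optional state.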
Ah, I'm the source of the confusion here! Quoting from your other PR:
It's actually the other way around. The ones in the
An attempt at a brief historical recap: The
This led to the introduction of the
I guess I had always intended that the
I think what gives me the most pause about these changes is that the resulting
I guess I'd suggest two possible paths forward:
Hopefully that all makes sense. @sneakers-the-rat I've really appreciated your enthusiasm for contributing lately so please let me know if that all makes sense and if you have other thoughts on paths forward.
OK catching up on these open PRs - thanks for the history @pkalita-lbl !!! Trying to figure out what we want to do here, bc i think it would be really good to clean up and unify the loaders/dumpers
This is totally fair, but i think it sort of points towards a need for a refactor! If the loaders can't handle loading and iterating to the point where it's easier to fork them, then let's clean them up! I think I'm interested in these bc they seem to be the main place where the 'rubber hits the road' of working with data, and it would be very cool to be able to just point it at the data and press play, particularly as they relate to the transformers/mappers sitting in between the loaders and dumpers. This'll be especially critical for translating existing data formats like NWB, which will require some special loading logic to pack them into python classes (like how HDMF does), so might as well do some tidying now before it's time for that :)
Also completely agree, i was mostly just consolidating them into one place so that they could be called identically to how they're being used now so that we can improve on them iteratively. So far we have...
Stateful/stateless
I actually like treating classes as collections of functions, makes perfect sense to me. I think we can do both here. It makes sense to want to treat them as just functions, pass arguments, receive models/data. It also makes sense to have a bit of state in them - there are a decent amount of parameters to be passed around, and the anonymity of would a combination of using we could also (and probably should) have pure functional forms that are just
Logic Unification
ya that sounds super bad lol. It seems like basing them around generators/iterators makes sense for dealing with data where one might not want to have to load the entire thing into memory esp for things like
and then:
an
Broader Unification
there's no getting away from it, linkml is a yaml-driven package! so having a single means of loading yaml i think would potentially bring clarity to some other parts of the package. I'm not saying "let's put all yaml things into the dumpers/loaders," but i am saying we could put all yaml I/O things into them. I already mentioned the
This is particularly important given the perhaps unexpected extended universe of yaml forms for ppl who aren't all that familiar with the format - getting all these weird directives and tags and whatnot might be bewildering, and offering a single
so just sketching some ideas and paths forward, but yes i think @pkalita-lbl I would pick option (1) - maybe not all in one PR, but this gets us started down that direction, and then we can roadmap out what else we want here. I'd also be down to do a bit of docs work since this was the part where i got a little lost my first time through (ok i have a schema, now what?)
lmk what ya think, sorry for long comment <3
briefly: incremental PRs that don't change existing client-facing signatures or introduce new potentially unstable public methods are most welcome! Also it seems maybe we need a bit more depth in the core developers guide explaining some things. schemaloader and rawloader are indeed part of the older-style generators and don't concern us here (but I totally appreciate how their presence and naming confuses things). There are a few bigger things I'd like to coalesce on before diving into a full refactor:
These are both touched on in other issues, but for 1, pydantic's built-in field introspection is sufficient for json/yaml loading and dumping, but for say rdf loading/dumping, there is insufficient metadata in the python itself. Now we are making great progress towards the "great convergence" where pydanticgen is on a par with the "good bits" of pythongen but I'd be a bit more comfortable finishing some of that first - e.g. making the curies/uris of classes and slots introspectable. It may be the case that we can make future versions of rdf loading and dumping standalone, with no need for the client to orchestrate creating a schemaview object, which would be nice. For 2, a design question is whether we want to have some kind of unified interface for loading/dumping from sqldbs, duckdb, mongo, etc, or whether to keep this as a separate concern (if we do this then we'd obviously want to keep linkml-runtime light and have the implementations for some of these backends be plugins... we are already addressing this to some extent with linkml-arrays and hdf5...). Either way,
you know about a billion times more of the lay of the land than i do, obviously, so thx for the perspective. re: 1) that shouldn't be too hard actually - basically it will be
Field(json_schema_extra={"linkml_meta": {
    {% for key, val in meta.items() %}
    "{{ key }}": {{ val }},
    {% endfor %}
}})
class {{ name }}({{ bases }}):
    linkml_meta: ClassVar[LinkMLMeta] = LinkMLMeta(
        {% for key, val in meta.items() %}
        {{ key }}={{ val }},
        {% endfor %}
    )

and then since we have domain models that tell us which fields are consumed by the template and which don't have direct representation, we would just do something like this:

class TemplateModel(BaseModel):
    meta: Optional[Dict[str, Any]] = None

    def fill_meta(self, definition: Definition):
        # drop unset/empty values from the definition
        items = remove_empty_items(definition)
        # keep only keys that are not already fields consumed by the template
        items = {k: v for k, v in items.items() if k not in self.model_fields}
        self.meta = items

and that's pretty much it. we don't even need to worry about formatting in the template now that we just format it all with black. dump whatever we want in there and we can make introspection methods in a parent class for making field-level metadata easier to access.

re: 2 I don't really know! I think we should clean up these classes so that we can make them hookable - if we make a clean interface by which someone might be able to write an interface to their favorite DB, then i feel like that'll be more powerful than us trying to implement them all in main repo (ultimately that is what i am trying to do with the pydantic generators). I feel like there are a few different concerns that converge towards a consolidation of loading and dumping behavior in its different forms:
and so yes it seems like we just need a plan :) i feel like getting started with refactoring the behavior we already have is low-hanging fruit en route to grander visions
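The `fill_meta` idea from the comment above can be tried out without pydantic or linkml installed. The following is a rough sketch under stated assumptions: `TemplateModelSketch` is a plain dataclass standing in for the pydantic template model, `drop_empty` is a hypothetical stand-in for `remove_empty_items`, and the definition is a plain dict rather than a real `Definition` object.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


def drop_empty(items: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for remove_empty_items: drop None and empty values."""
    return {k: v for k, v in items.items() if v not in (None, "", [], {})}


@dataclass
class TemplateModelSketch:
    """Fields the template consumes directly, plus a catch-all `meta` dict."""

    name: str
    description: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)

    def fill_meta(self, definition: Dict[str, Any]) -> None:
        """Store every non-empty key that is not already a field of this model."""
        consumed = {f.name for f in fields(self)}
        items = drop_empty(definition)
        self.meta = {k: v for k, v in items.items() if k not in consumed}


# A made-up definition: two keys are consumed by the model, one is empty,
# and the remaining two should end up in `meta`.
definition = {
    "name": "Person",
    "description": "A human being",
    "class_uri": "schema:Person",
    "aliases": [],
    "abstract": False,
}
model = TemplateModelSketch(
    name=definition["name"], description=definition["description"]
)
model.fill_meta(definition)
```

Here `model.meta` ends up holding only `class_uri` and `abstract`, which is the leftover metadata a template could then dump into something like `linkml_meta`.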
def _construct_target_class(self,
@abstractmethod
def iter_instances(self) -> Iterator[Any]:
    """Lazily load data instances from the source
[minor: can be iterated on in new PR]
Can we clarify what a data instance would be?
It seems that this is canonically a dict, never an instance of a class (whether dataclass or pydantic)? or would it be (e.g. pkl serialization)? Would rdflib_loader eventually implement this with a Triple/Quad object, or a 3-or-4-tuple?
I'm tending towards a more predictable signature (iterates over dicts) with some guarantees
totally - so the type annotation here would be refined in the child loaders, keeping it "any" here is just to say "there will be some iterator (it could actually just be Iterator
and then we would do Iterator[dict[str, JsonObj | dict | list]]
or whatever in the child objects. We would make this type a union of all the child types but it wouldn't really give us much bc the child impl should override it
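As a sketch of what "refined in the child loaders" could look like: only the `iter_instances` signature and its docstring come from the diff under review. `LoaderSketch`, `JSONLinesLoaderSketch`, the `source` attribute, and the JSON Lines input are assumptions made up for this example.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator


class LoaderSketch(ABC):
    """Hypothetical base class; the element type is left open as Any."""

    def __init__(self, source: str) -> None:
        self.source = source

    @abstractmethod
    def iter_instances(self) -> Iterator[Any]:
        """Lazily load data instances from the source."""


class JSONLinesLoaderSketch(LoaderSketch):
    """Child loader that narrows the element type to plain dicts."""

    def iter_instances(self) -> Iterator[Dict[str, Any]]:
        # Yield one parsed object per non-blank line; nothing is parsed
        # until the caller actually asks for the next item.
        for line in self.source.splitlines():
            if line.strip():
                yield json.loads(line)


source = '{"id": 1}\n{"id": 2}\n\n{"id": 3}'
instances = JSONLinesLoaderSketch(source).iter_instances()
first = next(instances)  # only the first line has been parsed at this point
rest = list(instances)   # consume the remaining lines
```

Narrowing the return annotation in the subclass is accepted by type checkers because return types are covariant, so the base class can stay at `Iterator[Any]` while each child gives callers a more predictable element type.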
Merging in functionality from main linkml loaders (see: linkml/linkml#1967)