Xarray dumpers / loaders #16
base: main
Conversation
```python
# If values are length 1 we are dealing with coords like date
v = v.values
try:
    node_dict["coords"][k] = v
```
```diff
- node_dict["coords"][k] = v
+ node_dict["coords"][k] = {"data": v, "dims": k}
```
This would be consistent with the lines below. Not sure if `dims` is necessary here, but I think so?
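For reference, this matches the per-variable layout that `xarray.Dataset.from_dict` / `to_dict` use, where every coords and data_vars entry is a mapping with at least `dims` and `data` keys (plus optional `attrs`). A minimal sketch of the target `node_dict` shape — the variable names and values here are purely illustrative:

```python
# Illustrative node_dict in the layout consumed by xarray.Dataset.from_dict:
# each coords / data_vars entry carries "dims" and "data" (optionally "attrs").
node_dict = {
    "coords": {
        "date": {"dims": "date", "data": ["2024-01-01", "2024-01-02"]},
    },
    "data_vars": {
        "temperature": {"dims": ["date"], "data": [271.3, 273.9]},
    },
    "attrs": {},
    "dims": {"date": 2},
}

# Every entry, coord or data_var, follows the same {"dims", "data"} contract.
for section in ("coords", "data_vars"):
    for entry in node_dict[section].values():
        assert "dims" in entry and "data" in entry
```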
```python
def _create_node(model, schemaview):
    """Create datatree from temperature dataset"""
```
Add a comment with a pointer to the data model of a Dataset's attrs/coords/data_vars/dims.
```python
    node_dict["coords"][k].update({"attrs": {key: value}})
else:
    # conversion factor
    node_dict["data_vars"] = {k: {"attrs": {key: value}}}
```
What if there is more than one? Here the `attrs` dict is set to a dictionary containing just this single key-value pair, so any earlier entries get overwritten. I think this should call `_create_node` recursively.
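As a hedged sketch of the merge-instead-of-overwrite part of this suggestion (whether recursing into `_create_node` is right depends on the surrounding code; `node_dict`, `k`, `key`, `value` are the names from the diff above, and the helper name is hypothetical):

```python
# Merge the new attribute into any existing data_vars entry instead of
# replacing the whole dict, so multiple attrs for one variable survive.
def add_data_var_attr(node_dict, k, key, value):
    entry = node_dict.setdefault("data_vars", {}).setdefault(k, {})
    entry.setdefault("attrs", {})[key] = value
    return node_dict

node_dict = {}
add_data_var_attr(node_dict, "temperature", "conversion_factor", 0.001)
add_data_var_attr(node_dict, "temperature", "unit", "K")
# both attrs are retained rather than the second overwriting the first
assert node_dict["data_vars"]["temperature"]["attrs"] == {
    "conversion_factor": 0.001,
    "unit": "K",
}
```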
```python
if isinstance(v, BaseModel):
    # create a subgroup and recurse
    if "values" in v.__dir__():
        dims = ["y", "x"]
```
TODO: see if we can get this from the schemaview.
```python
    return output_file_path


class YamlXarrayZarrDumper(YamlArrayFileDumper):
```
I am sorta having a tough time telling what's going on in this PR (sorry for missing the meeting, I never know when they are), but I'm working on this right now and you will definitely want to have a model for this — otherwise all the ser/des logic gets packed into a huge, nesty dumper/loader and you lose the declarative relationship between the in-memory and dumped forms.

As just a general comment: when working with linkml models, which can contain a ton of references between objects where we can't/shouldn't assume a hierarchical structure, I have found that recursive patterns become sort of a liability. It's one reason I restructured the nwb-linkml stuff into a sort of bidirectional passing of result objects up and down the system for model generation, because the decision about what to do at any given node depends on at least the parent/child (I see some switches in here for deciding whether we're a dataset or an attr, for example, and also some comments along the lines of "what if there is more than one of this"). For loading/dumping, I think you will probably want to structure it like a graph that models dependencies between the nodes. My still-pretty-crappy-and-repetitive version of this for nwb is over here: https://github.com/p2p-ld/nwb-linkml/blob/main/nwb_linkml/src/nwb_linkml/io/hdf5.py
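To make the "graph of dependencies" idea concrete, a minimal, hypothetical sketch (not the nwb-linkml implementation) using the stdlib's `graphlib` to derive a processing order from node dependencies, so each node is handled only after everything it depends on:

```python
from graphlib import TopologicalSorter

# Hypothetical dump-order resolver: each node lists the nodes it depends on,
# and we process dependencies before dependents instead of recursing blindly.
dependencies = {
    "root": {"coords", "data_vars"},
    "data_vars": {"coords"},  # data_vars reference coord dims
    "coords": set(),
}

order = list(TopologicalSorter(dependencies).static_order())
# dependencies come out before their dependents
assert order.index("coords") < order.index("data_vars") < order.index("root")
```

The per-node dump/load decision can then look at the already-processed results of its dependencies rather than having to re-derive context mid-recursion.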
Also maybe relevant in general: I'm about to do some stuff like this:

```python
class MyModel(BaseModel):
    array: NDArray
```

```python
model = MyModel(array="data/test.avi")
print_json(model.model_dump_json(round_trip=True))
```

```json
{"array": {"type": "video", "file": "data/test.avi"}}
```

```python
model.model_dump_json(round_trip=True)
```

```json
{
  "array": {
    "type": "dask",
    "name": "array-2a39187fc9fcee3f4cdbc1f2911b4b92",
    "chunks": [[2], [2]],
    "dtype": "int64",
    "array": [[1, 2], [3, 4]]
  }
}
```

```python
model.model_dump_json(
    round_trip=True,
    context={"mark_interface": True}
)
```

```json
{
  "array": {
    "interface": {
      "module": "numpydantic.interface.hdf5",
      "cls": "H5Interface",
      "version": "1.5.3"
    },
    "value": {
      "type": "hdf5",
      "file": "data/test.h5",
      "path": "/data",
      "field": null
    }
  }
}
```

and so on. The structure of that is all controllable by the interface class, so you can make your own serialization no sweat, e.g.: https://github.com/p2p-ld/numpydantic/blob/135b74aa2e4d6e5a755c27d3bb7fe170650058d3/src/numpydantic/interface/hdf5.py#L80-L91
Sorry for the late comment, but you are absolutely right @sneakers-the-rat. When implementing it this way I indeed ran into the problem that the recursiveness makes it difficult to traverse up and down the graph when required: a single level within the schema view might not provide enough information on its own about how exactly to deal with the incoming information.
@rly I think this would be good to discuss this evening
Adding xarray dumpers and loaders