
Xarray dumpers / loaders #16

Open
wants to merge 39 commits into base: main
Conversation

melonora

Adding xarray dumpers and loaders

```python
# If values are length 1 we are dealing with coords like date
v = v.values
try:
    node_dict["coords"][k] = v
```
Collaborator

Suggested change
```diff
-    node_dict["coords"][k] = v
+    node_dict["coords"][k] = {"data": v, "dims": k}
```

This would be consistent with the lines below. I'm not sure whether dims is necessary, but I think it is.
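For context, a minimal sketch of why the suggested shape matters: xarray's `Dataset.from_dict` expects each coords entry as a mapping with `dims` and `data` keys, matching the shape of the data_vars entries. The helper below is hypothetical (not from the PR) and just builds that nested form:

```python
# Hypothetical helper (not from the PR): store a coord in the nested
# {"data": ..., "dims": ...} form so coords and data_vars entries match.
def add_coord(node_dict, k, v):
    node_dict.setdefault("coords", {})[k] = {"data": v, "dims": k}
    return node_dict

node = add_coord({}, "date", ["2024-09-21"])
# node["coords"]["date"] == {"data": ["2024-09-21"], "dims": "date"}
```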



```python
def _create_node(model, schemaview):
    """Create datatree from temperature dataset"""
```
Collaborator

Add a comment with a pointer to the data model of a Dataset's attrs/coords/data_vars/dims.

```python
    node_dict["coords"][k].update({"attrs": {key: value}})
else:
    # conversion factor
    node_dict["data_vars"] = {k: {"attrs": {key: value}}}
```
Collaborator

What if there is more than one? Here the attrs dict is set to a dictionary containing just this key-value pair, overwriting any earlier data_vars entries. I think this should call _create_node recursively.
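To illustrate the overwrite problem: the assignment rebuilds the whole data_vars dict each time, so only the last attribute survives. A hedged sketch of an accumulating alternative (the names mirror the snippet above; the merge logic is an assumption, not the PR's actual code):

```python
# Hypothetical sketch: merge each attribute into the existing entry
# instead of rebuilding data_vars from scratch.
def set_var_attr(node_dict, k, key, value):
    var = node_dict.setdefault("data_vars", {}).setdefault(k, {})
    var.setdefault("attrs", {})[key] = value
    return node_dict

d = {}
set_var_attr(d, "temperature", "units", "K")
set_var_attr(d, "temperature", "conversion_factor", 1.0)
# both attributes are now present under d["data_vars"]["temperature"]["attrs"]
```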

```python
if isinstance(v, BaseModel):
    # create a subgroup and recurse
    if "values" in v.__dir__():
        dims = ["y", "x"]
```
Collaborator

TODO: see if we can get this from the schemaview.
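A hedged sketch of what "get this from the schemaview" might look like: `slot_names` stands in for whatever the SchemaView's induced slots would return for the array class, and both the filter and the axis-slot names are assumptions, not the actual schema.

```python
# Hypothetical: derive dims from slot metadata instead of hard-coding
# ["y", "x"]. `slot_names` stands in for SchemaView-derived slot names.
def dims_from_slots(slot_names, axis_slots=("y", "x")):
    # keep only slots that name an axis, preserving schema order
    return [s for s in slot_names if s in axis_slots]

dims = dims_from_slots(["name", "y", "x", "values"])
```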

```python
    return output_file_path


class YamlXarrayZarrDumper(YamlArrayFileDumper):
```


I'm sorta having a tough time telling what's going on in this PR (sorry for missing the meeting, I never know when they are), but I'm doing this right now and you will definitely want to have a model for this, otherwise all the ser/des logic gets packed into a huge nested dumper/loader and you lose the declarative relationship between the in-memory and dumped forms.

@sneakers-the-rat

As just a general comment: when working with linkml models, which can contain a ton of references between objects, and where we can't/shouldn't assume hierarchical structure, I have found that recursive patterns become a liability. It's one reason I restructured the nwb-linkml stuff into a system of bidirectionally passing results objects up and down for model generation, because the decision about what to do at any given node depends on at least the parent/child (I see some switches in here for deciding whether we're a dataset or an attr, for example, and also some comments about "what if there is more than one of this").

For loading/dumping, I think you will probably want to structure it like a graph that models dependencies between the nodes. My still-pretty-crappy-and-repetitive version of this for nwb is over here: https://github.com/p2p-ld/nwb-linkml/blob/main/nwb_linkml/src/nwb_linkml/io/hdf5.py
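The graph-structured approach can be sketched with the standard library alone: model which nodes depend on which, then dump in topological order so every dependency is serialized before its dependents. The node names and edges below are illustrative, not taken from nwb-linkml:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Illustrative dependency graph: each node maps to the set of nodes
# that must be dumped before it.
deps = {
    "dataset": {"coords", "data_vars"},
    "data_vars": {"attrs"},
    "coords": set(),
    "attrs": set(),
}
order = list(TopologicalSorter(deps).static_order())
# every node appears after all of its dependencies in `order`
```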

@sneakers-the-rat commented Sep 21, 2024

Also maybe relevant in general: I'm about to do some stuff like this:
https://numpydantic.readthedocs.io/en/dump_json/serialization.html

p2p-ld/numpydantic#20

```python
class MyModel(BaseModel):
    array: NDArray

model = MyModel(array="data/test.avi")
print_json(model.model_dump_json(round_trip=True))
```
```json
{"array": {"type": "video", "file": "data/test.avi"}}
```

```python
model.model_dump_json(round_trip=True)
```
```json
{
  "array": {
    "type": "dask",
    "name": "array-2a39187fc9fcee3f4cdbc1f2911b4b92",
    "chunks": [[2], [2]],
    "dtype": "int64",
    "array": [[1, 2], [3, 4]]
  }
}
```

```python
model.model_dump_json(
    round_trip=True,
    context={"mark_interface": True},
)
```
```json
{
  "array": {
    "interface": {
      "module": "numpydantic.interface.hdf5",
      "cls": "H5Interface",
      "version": "1.5.3"
    },
    "value": {
      "type": "hdf5",
      "file": "data/test.h5",
      "path": "/data",
      "field": null
    }
  }
}
```

and so on.

And the structure of that is all controllable by the interface class, so you can make your own serialization no sweat.

eg: https://github.com/p2p-ld/numpydantic/blob/135b74aa2e4d6e5a755c27d3bb7fe170650058d3/src/numpydantic/interface/hdf5.py#L80-L91
and: https://github.com/p2p-ld/numpydantic/blob/135b74aa2e4d6e5a755c27d3bb7fe170650058d3/src/numpydantic/interface/hdf5.py#L386-L412

@melonora (Author)

Sorry for the late comment, but you are absolutely right @sneakers-the-rat. When implementing it this way, I indeed ran into the problem that the recursiveness makes it difficult to traverse up and down the graph when required, since a single level within the schema view might not give enough information on its own about how exactly to deal with the incoming information.

@melonora (Author)

@rly I think this would be good to discuss this evening
