Discussion: what should happen when self-contained STACs are copied? #61

Open
jisantuc opened this issue Dec 23, 2019 · 4 comments
Labels
discussion An issue to capture a discussion

@jisantuc
Contributor

I have a STAC that includes a bunch of chips of tifs. When I clone that STAC, I keep all of the items, and the assets still point to the old tifs. I don't think this is necessarily wrong. I'm curious whether it's a deliberate choice to leave the references to the old tifs and not copy the tifs into the new STAC, or whether that's something that happened incidentally. I can see arguments both ways:

In favor of not copying the data:

  • since the tifs are part of a different catalog, there's no relative path from the new catalog to the old data, so path construction requires some assumptions on PySTAC's part
  • presumably if I'm building STACs from other STACs, I have access to the data in both places, so why copy?

In favor of being able within PySTAC to copy the data (obviously I can do whatever I want outside of PySTAC):

  • self-contained catalogs are nice, and there's currently no way to tell PySTAC to make a new self-contained catalog from an existing one as far as I can tell (it won't infer the copy behavior)
  • in multi-step pipelines for STAC production, I might want to delete everything but the output of the last step (i.e. only keep the "complete" catalog, where "complete" means "has had everything I want to do to it done"), which means at the end my references to assets from previous stages will be invalid
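
To make the "copy the data" option concrete, here is a minimal sketch of what copying assets next to a relocated item could look like. It uses only the Python standard library and plain dicts rather than PySTAC objects; the function name `copy_assets_alongside` and the item layout are assumptions for illustration, not an existing PySTAC API. It also assumes every asset is a local file and that no two assets of one item share a file name.

```python
import shutil
from pathlib import Path


def copy_assets_alongside(item: dict, old_item_path: Path, new_item_path: Path) -> dict:
    """Copy each local asset next to the new item location.

    Returns a copy of the item whose asset HREFs are relative to the
    new item. Hypothetical helper, not part of PySTAC.
    """
    new_item = {**item, "assets": {}}
    for key, asset in item.get("assets", {}).items():
        # Resolve the asset HREF against the directory of the old item.
        source = (old_item_path.parent / asset["href"]).resolve()
        # Place the copy in the same directory as the new item.
        target = new_item_path.parent / source.name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        new_item["assets"][key] = {**asset, "href": f"./{source.name}"}
    return new_item
```

With this approach the old catalog can be deleted after the last pipeline step without invalidating the new item's assets, at the cost of duplicating the data.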
@jisantuc
Contributor Author

I think maybe this could be a broader question. PySTAC doesn't seem to understand self-contained STACs at all, or at least seems uninterested in writing or copying assets for items. I think it would be better if save on catalogs included assets with relative links when the catalog type is SELF_CONTAINED.

@lossyrob
Member

PySTAC tries to maintain the correct path to the asset file if a self-contained STAC is copied. So if an item is moved to a new HREF, it will set up a relative path to the original asset HREF according to the new item HREF. It does not do any asset copying. If the asset is copied, then that new asset should replace the old asset HREF in the copied item.
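
As an illustration of that behavior, the sketch below recomputes a relative asset HREF when an item moves. It is not PySTAC's implementation; it uses only `posixpath` from the standard library, the function name `rebase_asset_href` is hypothetical, and it assumes local POSIX-style paths rather than URLs.

```python
import posixpath


def rebase_asset_href(old_item_href: str, asset_href: str, new_item_href: str) -> str:
    """Return a relative HREF from the moved item to the original asset."""
    # Resolve the asset against the directory of the old item.
    old_dir = posixpath.dirname(old_item_href)
    absolute_asset = posixpath.normpath(posixpath.join(old_dir, asset_href))
    # Express that same location relative to the directory of the new item.
    new_dir = posixpath.dirname(new_item_href)
    return posixpath.relpath(absolute_asset, start=new_dir)


new_href = rebase_asset_href(
    "/data/old-catalog/item-1/item-1.json",
    "./chip.tif",
    "/data/new-catalog/item-1/item-1.json",
)
print(new_href)  # ../../old-catalog/item-1/chip.tif
```

The asset itself never moves; only the path written into the copied item changes, which is why the new catalog still depends on the old data being present.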

Another option would be to leave relative paths as-is in copied items. However, since PySTAC deliberately doesn't do the asset copying (it's not always clear you want a copy of the heavy asset data when you copy the metadata), the downside is that an Item would then point to an asset HREF that doesn't exist.

What do you think should happen?

Currently, save should set all links to relative when the catalog type is self-contained. Is that not happening? This is the code that does this: https://github.com/azavea/pystac/blob/develop/pystac/catalog.py#L424

@matthewhanson
Member

Reviving this interesting discussion. I was actually going to propose that the SELF_CONTAINED catalog type could be removed. The only difference between it and RELATIVE_PUBLISHED is the inclusion of an absolute self link at the root in RELATIVE_PUBLISHED.
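
The distinction described here can be sketched as a small check on a root catalog's links. This is an illustrative helper working on plain dicts, not PySTAC's own type detection; the name `guess_relative_catalog_type` is hypothetical, and it assumes all non-self links are already relative.

```python
from urllib.parse import urlparse


def is_absolute_href(href: str) -> bool:
    """Treat HREFs with a URL scheme or a leading slash as absolute."""
    return bool(urlparse(href).scheme) or href.startswith("/")


def guess_relative_catalog_type(root_catalog: dict) -> str:
    """Distinguish the two relative catalog types by the root self link."""
    self_links = [
        link for link in root_catalog.get("links", []) if link.get("rel") == "self"
    ]
    # An absolute self link at the root is the only difference discussed here.
    if any(is_absolute_href(link["href"]) for link in self_links):
        return "RELATIVE_PUBLISHED"
    return "SELF_CONTAINED"
```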

However, I could see a SELF_CONTAINED catalog being strictly defined as one that has only relative links and whose assets are located alongside the Items (i.e., in the same directory). In this case I think it would be useful to be able to delete Items and their corresponding local assets. And maybe you want to move branches of catalogs around. But if you were going to just copy the catalog elsewhere, you could do that by saving it as a RELATIVE_PUBLISHED catalog and copying the entire tree somewhere else.
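
One way to read that stricter definition is as a per-item check: every link is relative and every asset sits in the item's own directory. The sketch below is only one interpretation of the proposal, on plain dicts; `is_strictly_self_contained` is a hypothetical name, not a PySTAC function.

```python
import posixpath
from urllib.parse import urlparse


def is_relative_href(href: str) -> bool:
    """Relative means no URL scheme and no leading slash."""
    return not urlparse(href).scheme and not href.startswith("/")


def is_strictly_self_contained(item: dict) -> bool:
    """Check that all links are relative and all assets sit beside the item."""
    links_ok = all(is_relative_href(link["href"]) for link in item.get("links", []))
    assets_ok = all(
        is_relative_href(asset["href"])
        # After normalization, an asset in the item's directory has no directory part.
        and posixpath.dirname(posixpath.normpath(asset["href"])) == ""
        for asset in item.get("assets", {}).values()
    )
    return links_ok and assets_ok
```

Under this reading, deleting an Item and its local assets amounts to removing a single directory.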

Are these use cases worth supporting?

@gadomski
Member

gadomski commented Nov 7, 2022

Moving and copying assets is currently handled by stactools, which can use third-party packages (e.g. fsspec) to do I/O on non-local filesystems. However, I could see the argument for adding local-only asset management to PySTAC, which would still be within scope for this package. This has nothing to do with the SELF_CONTAINED strategy. I've opened #911 to capture this.

Quoting @matthewhanson: "I was actually going to propose that SELF_CONTAINED catalog could be removed."

I find the SELF_CONTAINED strategy very useful when creating STAC catalogs that can live in multiple places, such as examples for stactools packages (e.g. https://github.com/stactools-packages/noaa-climate-normals/blob/2701520439ef87862113004ba80924b92ed953e4/examples/catalog.json). So I'm 👎🏽 on removing support for SELF_CONTAINED.

If #911 is rejected, then I think this issue can be closed, because PySTAC wouldn't be supporting data file move/copies anytime soon. If #911 is accepted, we should then answer the question of whether move/copy should be extended to Catalog.save and friends as well, which would directly address OP's questions (I think).
