Unable to read catalog as SELF_CONTAINED when written as RELATIVE_PUBLISHED #137

Closed
CloudNiner opened this issue Aug 3, 2020 · 2 comments

@CloudNiner
Contributor

CloudNiner commented Aug 3, 2020

I published a catalog to a remote private S3 bucket after writing it with RELATIVE_PUBLISHED, so that the catalog root contained an absolute self link.

I then wanted to read and walk the catalog locally in order to write a derived downstream catalog using some of the items in it, so I downloaded it to my machine and read it in with: catalog = pystac.Catalog.from_file('path/to/catalog.json') where this catalog.json is the root containing the absolute self link.

Since all of the catalog links remain relative (aside from the root absolute ref) I would expect to be able to walk and read items locally from the catalog as entirely relative links, as if it were SELF_CONTAINED. However I was unable to do so:

import pystac
catalog = pystac.Catalog.from_file('./data/catalog/catalog.json') 
catalog.get_child("ChildCollectionId")
FileNotFoundError: [Errno 2] No such file or directory: 'https://mybucket.s3.amazonaws.com/catalog/catalog.json'

It appears there's a make_all_links_relative() function that might do what I want, but that throws the same error:

catalog.make_all_links_relative()
FileNotFoundError: [Errno 2] No such file or directory: 'https://mybucket.s3.amazonaws.com/catalog/catalog.json'

Perhaps a few potential solutions here:

  1. IIUC, fix make_all_links_relative to not first attempt a read of an absolute url that may not be available in the user's environment
  2. Provide an additional catalog method that swaps the catalog "mode" between compatible options, such as RELATIVE_PUBLISHED and SELF_CONTAINED
  3. Allow the user to specify CatalogType in Catalog.from_file

Edit: As a manual workaround, I just deleted the absolute self link from the root catalog.json after I downloaded it.
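The manual workaround can be scripted with the standard library alone. The following is only a sketch of that idea, not part of PySTAC: the helper names `strip_self_links` and `strip_self_links_in_file` are hypothetical, and the example dict is an assumed, simplified stand-in for the downloaded root catalog.json.

```python
import json
from pathlib import Path


def strip_self_links(catalog_dict):
    """Return a shallow copy of a STAC catalog dict without its 'self' links."""
    stripped = dict(catalog_dict)
    stripped["links"] = [
        link for link in catalog_dict.get("links", []) if link.get("rel") != "self"
    ]
    return stripped


def strip_self_links_in_file(path):
    """Rewrite the JSON file at `path` in place with its 'self' links removed."""
    path = Path(path)
    catalog_dict = json.loads(path.read_text())
    path.write_text(json.dumps(strip_self_links(catalog_dict), indent=2))


# Assumed, simplified root catalog as it might look after download.
example = {
    "id": "root",
    "links": [
        {"rel": "self", "href": "https://mybucket.s3.amazonaws.com/catalog/catalog.json"},
        {"rel": "root", "href": "./catalog.json"},
        {"rel": "child", "href": "./ChildCollectionId/collection.json"},
    ],
}
cleaned = strip_self_links(example)
```

Calling `strip_self_links_in_file('./data/catalog/catalog.json')` before `pystac.Catalog.from_file` would then mirror the manual edit described above.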

@lossyrob
Member

lossyrob commented Aug 4, 2020

This is an interesting case, and raises a question for me: is a RELATIVE_PUBLISHED catalog that is copied to a location that does not match its root link HREF a valid STAC? According to the spec, I don't believe so. So in this case PySTAC should either handle this error case better, or there should be a clear way to accomplish what you're trying to do through some other means.

Currently PySTAC uses the root link of the catalog to resolve HREFs for relative links, which seems appropriate for a RELATIVE_PUBLISHED catalog in its original location. I can see why there'd be an expectation that you'd be able to traverse relative links based on the relative paths of the actual file locations. One option for handling this would be for PySTAC to always override the root link with the file location it read the catalog from when you're reading a catalog from a file directly. However, I'm not sure there is logic that could determine whether the root HREF, when it differs from the catalog read path, represents the same catalog or a different one.
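To make the difference concrete, here is a small standard-library illustration of how the same relative link resolves against the published HREF versus the local read path. This is not PySTAC's actual resolution code; `resolve_href` is a hypothetical helper, and the child path `./ChildCollectionId/collection.json` is an assumed layout.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_href(relative_href, base_href):
    """Resolve `relative_href` against the document located at `base_href`."""
    if urlparse(base_href).scheme in ("http", "https"):
        # Remote base: standard URL joining.
        return urljoin(base_href, relative_href)
    # Local base: join against the directory that contains the base file.
    base_dir = posixpath.dirname(base_href)
    return posixpath.normpath(posixpath.join(base_dir, relative_href))


child_link = "./ChildCollectionId/collection.json"

# Resolved against the absolute HREF stored in the downloaded catalog.
remote = resolve_href(
    child_link, "https://mybucket.s3.amazonaws.com/catalog/catalog.json"
)

# Resolved against the path the catalog was actually read from.
local = resolve_href(child_link, "./data/catalog/catalog.json")
```

Here `remote` points back at the bucket, which is what produces the FileNotFoundError when the bucket is not reachable as a local path, while `local` points at the downloaded copy.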

The make_all_links_relative() method mentioned above wouldn't work in this case, because it would cause PySTAC to traverse the STAC using the currently known absolute versions of the link HREFs, which are derived from the root link. Likewise, modifying the catalog type at read time or in memory would still cause problems, because PySTAC would start from the point of view that those catalog objects live at their remote locations, and so would try to read from those locations before making any modifications to the type.

I think this speaks to a need for a more consistent way to copy STACs around. This is a particular case where there is only one link's difference (the 'root' link of the root catalog) between a RELATIVE_PUBLISHED catalog and a SELF_CONTAINED one, the latter being able to be copied to different locations without a problem. However, if a user wanted to copy an ABSOLUTE_PUBLISHED catalog, this becomes more complicated: all of the absolute link HREFs would need to be modified in order to accomplish this.
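As a rough sketch of what modifying absolute link HREFs involves, the helper below converts one absolute HREF into a path relative to the document that contains the link. It is an illustration under assumed URLs, not how PySTAC or any other tool implements this, and `make_href_relative` is a hypothetical name.

```python
import posixpath
from urllib.parse import urlparse


def make_href_relative(href, self_href):
    """Express `href` relative to the document located at `self_href`.

    HREFs with a different scheme or host (or already relative ones)
    are returned unchanged.
    """
    target = urlparse(href)
    source = urlparse(self_href)
    if target.scheme != source.scheme or target.netloc != source.netloc:
        return href
    rel = posixpath.relpath(target.path, posixpath.dirname(source.path))
    return rel if rel.startswith(".") else "./" + rel


root = "https://mybucket.s3.amazonaws.com/catalog/catalog.json"
child = "https://mybucket.s3.amazonaws.com/catalog/ChildCollectionId/collection.json"

down = make_href_relative(child, root)   # link from the root to the child
up = make_href_relative(root, child)     # link from the child back to the root
other = make_href_relative("https://example.com/asset.tif", root)
```

Every link in every catalog, collection, and item would need this treatment, which is why a dedicated copy operation is more practical than editing files by hand.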

A catalog-to-catalog transfer could happen consistently via this code snippet:

cat = pystac.read_file('/original/catalog/location')
cat.normalize_and_save('/new/catalog/location', pystac.CatalogType.SELF_CONTAINED)

With this method the catalog will always be written as a valid SELF_CONTAINED catalog at the desired location. I will note that this may change the layout of the STAC to the canonical one, which may not be desired.

In building a CLI of utility methods as part of #119, I think we should consider this as a subcommand to make this easier, something like:

> stac cp --type self-contained /original/catalog/location /new/catalog/location

This way there'd be a consistent way to copy STACs around, that could even have options like copying the assets as well (addressing some points raised in #61).

@lossyrob
Member

Copying STACs was implemented in stactools. See https://stactools.readthedocs.io/en/latest/cli.html#stac-copy for the command line version and https://stactools.readthedocs.io/en/latest/api.html#copying-and-moving for the library functionality.
