Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for private remote object storage #441

Open
berombau opened this issue Jan 26, 2024 · 7 comments · May be fixed by #842
Open

Support for private remote object storage #441

berombau opened this issue Jan 26, 2024 · 7 comments · May be fixed by #842
Labels

Comments

@berombau
Copy link
Contributor

berombau commented Jan 26, 2024

Right now, there is support for local .zarr stores and remote stores publically accessible via HTTP or S3.
Private remote stores are more difficult, as they need certain options or credentials that are not representable by simply a string or Path. One option is to use a zarr.storage.FSStore, which can have storage_options or any fsspec.spec.AbstractFileSystem.

Two pull requests enable this:

Testing is difficult, but this is what I used:

import spatialdata as sd
import zarr
# works now, requires credentials in ~/.aws/credentials
root = zarr.open('s3://BUCKET/spatial-sandbox/visium_associated_xenium_io.zarr', storage_options = {'client_kwargs': {'endpoint_url': MINIO_URL}})
sd.read_zarr(root)
# still works, I think depends on zmetadata?
sd.read_zarr('https://s3.embl.de/spatialdata/spatialdata-sandbox/visium_associated_xenium_io.zarr/')
# still works
sd.read_zarr('~/visium_associated_xenium_io.zarr')
@berombau
Copy link
Contributor Author

I refactored to use UPath, which solves many issues I had with remote support. So I would recommend UPath over Path, str, ZarrLocation...

It works with my own object storage:

from upath import UPath
from spatialdata import SpatialData

p = UPath(
    "s3://BUCKET/spatial-sandbox/visium_associated_xenium_io_tables.zarr",
    endpoint_url="https://objectstor.vib.be",
)
full_sdata.write(p)
sdata = SpatialData.read(p)

I also added tests for the remote datasets and mocked remote tests. There are still some remaining issues:

  • reading from private remote storage over S3 works
  • writing to private remote storage over S3 works
  • test_remote_mock.py mock reading test using ome-zarr fails, so images and labels fail. I need to test this some more as I'm also using a patched ome_zarr.
  • test_remote.py reading the SpatialData remote datasets over HTTP fails for the points parquet files. I also can't reproduce the working implementation (maybe because of a package update?).

I will likely be a while until I can work on this some more.

@LucaMarconato @ArneDefauw

@berombau
Copy link
Contributor Author

berombau commented Nov 13, 2024

This is still an open issue. There is an old fork at https://github.com/berombau/spatialdata main with closed PR #442 that does support remote reading using UPath, but merging it would be difficult, especially because the spatialdata code base underwent a lot of changes.

The current solutions without this support:

Future work plan:

  • after the PR's from the hackathon now, start from the latest version of SpatialData again
  • add the remote mock test code from the old PR
  • change str and Path to UPath until the remote tests work
  • do a clean PR with basic read support
  • work on write support
  • add more tests to keep the remote support, try to dissuade the use of str or Path for filesystem locations

It's still very doable, but currently not a priority and easier after the merging of a lot of open PR's

@BioinfoTongLI

@edoumazane
Copy link

Here is a usecase where the full res image is remote and read-only, while the rest is local.
#831
There is a workaround, adding the pyramid as a separate image layer to Interactive._viewer

@berombau berombau linked a pull request Jan 24, 2025 that will close this issue
@berombau
Copy link
Contributor Author

I've readded the dependencies and moto tests already in the PR. Changing every Path to UPath in the _io module should be straightforward, as Path is a subtype of UPath.

Using the right interfaces and API design is a bit more complicated, as it would be nice to also think on the upcoming Zarr v3 API's. The design should be something like:

filepath in SpatialData.read / .write = str | UPath
Spatialdata._path = UPath

Right now the element interface is not really consistent or easily allows reading / writing each element separately. All elements focus on just zarr.Group for writing, but for reading:

  • _read_multiscale uses str | Path and uses ome_ngff ZarrLocation (which expects Path | str | FSStore).
    • the open_ome_zarr from open_ome_zarr uses zarr.Group, so that's the interface we probably want in the future.
    • we can go to FSStore or ZarrLocation with the group.store attribute.
  • _read_points uses str | Path | MutableMapping | zarr.Group.
  • _read_shapes uses str | Path | MutableMapping | zarr.Group.
  • _read_table uses zarr_store_path: str, group: zarr.Group, subgroup: zarr.Group (group is unused?).

The user should convert str | UPath | zarr.BaseStore | ZarrLocation | ... to zarr.Group outside of the functions or use the SpatialData interface. Since these are all private functions and the public functions only change Path -> UPath, the impact is hopefully minimal.

@LucaMarconato
Copy link
Member

LucaMarconato commented Jan 24, 2025

Thanks @berombau! One of the improvements in the IO APIs that we want to make soon is to expose public APIs for reading single components. I tracked this here: #843.

@berombau
Copy link
Contributor Author

Ah ok. The public interface should be added on top then maybe. So try to make all private read/write functions have a zarr.Group-only API and make all the public read/write functions use the various possibilities of getting to a group:

  • str | Path -> UPath
  • UPath -> FSStore
  • BaseStore | FSStore -> zarr.Group (which is the argument for the private function with the zarr.Group-only API)

There are various convenience methods like zarr v3 with StorePath and make_store_path, ZarrLocation... but it's probably easier to just have our own utility function.

The remote mock tests currently only tests read/writing a SpatialData object with one element of each type. I'll try to extend them so each element is tested also on it's own.

@LucaMarconato
Copy link
Member

Thanks, gonna isolate the action point in the item below for myself:

  • So try to make all private read/write functions have a zarr.Group-only API and make all the public read/write functions use the various possibilities of getting to a group:

    str | Path -> UPath
    UPath -> FSStore
    BaseStore | FSStore -> zarr.Group (which is the argument for the private function with the zarr.Group-only API)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants