✨ XpySTACAssetReader for reading COG, NetCDF & Zarr STAC assets (#87)

An iterable-style DataPipe for STAC asset data like Cloud-Optimized GeoTIFFs, NetCDF and Zarr! Uses xarray.open_dataset(..., engine="stac") enabled by xpystac for reading STAC assets into an xarray.Dataset object. * ➕ Add xpystac Extend xarray.open_dataset to accept pystac objects! * ➕ Add zarr An implementation of chunked, compressed, N-dimensional arrays for Python! * ✨ XpySTACAssetReader for reading COG and Zarr STAC assets An iterable-style DataPipe for STAC asset data like Cloud-Optimized GeoTIFFs and Zarr! Uses xpystac for the I/O. Included a doctest and unit test, added a new section in the API docs and some more intersphinx mappings. * 💚 Install xpystac and zarr in docs extras Include xpystac and zarr in the 'docs' extras dependencies to fix ReadtheDocs build. * ➕ Add aiohttp to dev dependencies Async http client/server framework (asyncio)! Needed to fix `ModuleNotFoundError: No module named 'aiohttp'` when accessing NetCDF files from https://nasagddp.blob.core.windows.net/nex-gddp-cmip6-references/ACCESS-CM2_historical.json * 📝 Add xpystac and zarr to pip install zen3geo[raster] docs Mention that xpystac and zarr is installed with the 'raster' extras on the main index.md page. Also mentioned that NetCDF files can be read, and added some blank lines in the XpySTACAssetReader doctest example. * ⬆️ Bump aiohttp from 3.8.1 to 3.8.4 Bumps [aiohttp](https://github.com/aio-libs/aiohttp) from 3.8.1 to 3.8.4. - [Release notes](https://github.com/aio-libs/aiohttp/releases) - [Changelog](https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst) - [Commits](aio-libs/aiohttp@v3.8.1...v3.8.4) * ✅ Unit test for a Zarr-backed STAC asset Ensure that a STAC asset pointing to a Zarr file can be loaded using XpySTACAssetReader. Using the Daymet Annual Hawaii STAC Collection at https://planetarycomputer.microsoft.com/dataset/daymet-annual-hi for this unit test. Also edited previous unit test to specify that it is for a Cloud-Optimized GeoTIFF STAC Asset. * 🔧 Put xpystac under the stac extras Decided that xpystac fits better under the 'stac' extras, because it depends on just pystac and xarray, and has a somewhat optional dependency on stackstac. This enables a more streamlined I/O option for reading STAC Assets into an xarray.Dataset. Note that Zarr is kept under the 'raster' extras. * ✏️ Fix NetCDF link and type hints Need to use trailing underscores for RST-style URLs, and remove the `pystac.Asset` type hint so that zen3geo works without `pystac` being installed.
weiji14 · Apr 2, 2023 · f5e1704 · f5e1704
1 parent 9153c89
commit f5e1704
Show file tree

Hide file tree

Showing 8 changed files with 433 additions and 89 deletions.
diff --git a/docs/_config.yml b/docs/_config.yml
@@ -96,6 +96,9 @@ sphinx:
       xbatcher:
         - 'https://xbatcher.readthedocs.io/en/latest/'
         - null
+      zarr:
+        - 'https://zarr.readthedocs.io/en/latest/'
+        - null
   extra_extensions:
     - 'sphinx.ext.autodoc'
     - 'sphinx.ext.intersphinx'

diff --git a/docs/api.md b/docs/api.md
@@ -82,3 +82,12 @@
 .. autoclass:: zen3geo.datapipes.xbatcher.XbatcherSlicerIterDataPipe
     :show-inheritance:
 ```
+
+### XpySTAC
+
+```{eval-rst}
+.. automodule:: zen3geo.datapipes.xpystac
+.. autoclass:: zen3geo.datapipes.XpySTACAssetReader
+.. autoclass:: zen3geo.datapipes.xpystac.XpySTACAssetReaderIterDataPipe
+    :show-inheritance:
+```
diff --git a/docs/index.md b/docs/index.md
@@ -7,9 +7,9 @@ Get what you need, not more, not less:
 | Command                        |  Dependencies |
 |:-------------------------------|---------------|
 | `pip install zen3geo`          | rioxarray, torchdata |
-| `pip install zen3geo[raster]`  | rioxarray, torchdata, xbatcher |
+| `pip install zen3geo[raster]`  | rioxarray, torchdata, xbatcher, zarr |
 | `pip install zen3geo[spatial]` | rioxarray, torchdata, datashader, spatialpandas |
-| `pip install zen3geo[stac]`    | rioxarray, torchdata, pystac, pystac-client, stackstac |
+| `pip install zen3geo[stac]`    | rioxarray, torchdata, pystac, pystac-client, stackstac, xpystac |
 | `pip install zen3geo[vector]`  | rioxarray, torchdata, pyogrio[geopandas] |
 
 Retrieve more ['extras'](https://github.com/weiji14/zen3geo/blob/main/pyproject.toml) using

diff --git a/poetry.lock b/poetry.lock
diff --git a/pyproject.toml b/pyproject.toml
@@ -41,6 +41,8 @@ pystac-client = {version = ">=0.4.0", optional = true}
 spatialpandas = {version = ">=0.4.0", optional = true}
 stackstac = {version = ">=0.4.0", optional = true}
 xbatcher = {version = ">=0.2.0", optional = true}
+xpystac = {version = ">=0.0.1", optional = true}
+zarr = {version = ">=2.13.0", optional = true}
 # Docs
 adlfs = {version = "*", optional = true}
 contextily = {version = "*", optional = true}
@@ -50,6 +52,7 @@ matplotlib = {version = "*", optional = true}
 planetary-computer = {version="*", optional=true}
 
 [tool.poetry.group.dev.dependencies]
+aiohttp = "*"
 black = "*"
 pytest = "*"
 
@@ -67,17 +70,23 @@ docs = [
     "pystac_client",
     "stackstac",
     "spatialpandas",
-    "xbatcher"
+    "xbatcher",
+    "xpystac",
+    "zarr"
+]
+raster = [
+    "xbatcher",
+    "zarr"
 ]
-raster = ["xbatcher"]
 spatial = [
     "datashader",
     "spatialpandas"
 ]
 stac = [
     "pystac",
     "pystac_client",
-    "stackstac"
+    "stackstac",
+    "xpystac"
 ]
 vector = ["pyogrio"]
 

diff --git a/zen3geo/datapipes/__init__.py b/zen3geo/datapipes/__init__.py
@@ -20,3 +20,6 @@
     StackSTACStackerIterDataPipe as StackSTACStacker,
 )
 from zen3geo.datapipes.xbatcher import XbatcherSlicerIterDataPipe as XbatcherSlicer
+from zen3geo.datapipes.xpystac import (
+    XpySTACAssetReaderIterDataPipe as XpySTACAssetReader,
+)
diff --git a/zen3geo/datapipes/xpystac.py b/zen3geo/datapipes/xpystac.py
@@ -0,0 +1,142 @@
+"""
+DataPipes for `xpystac <https://github.com/jsignell/xpystac>`__.
+"""
+from typing import Any, Dict, Iterator, Optional
+
+import xarray as xr
+
+try:
+    import pystac
+    import xpystac
+except ImportError:
+    pystac = None
+    xpystac = None
+from torchdata.datapipes import functional_datapipe
+from torchdata.datapipes.iter import IterDataPipe
+from torchdata.datapipes.utils import StreamWrapper
+
+
+@functional_datapipe("read_from_xpystac")
+class XpySTACAssetReaderIterDataPipe(IterDataPipe[StreamWrapper]):
+    """
+    Takes a :py:class:`pystac.Asset` object containing raster data (e.g.
+    :doc:`Zarr <zarr:index>`,
+    `NetCDF <https://www.unidata.ucar.edu/software/netcdf>`__,
+    `Cloud-Optimized GeoTIFF <https://www.cogeo.org>`__, etc) from local disk
+    or URLs (as long as they can be read by xpystac) and yields
+    :py:class:`xarray.Dataset` objects (functional name:
+    ``read_from_xpystac``).
+
+    Based on
+    https://github.com/pytorch/data/blob/v0.5.1/torchdata/datapipes/iter/load/iopath.py#L42-L97
+
+    Parameters
+    ----------
+    source_datapipe : IterDataPipe[pystac.Asset]
+        A DataPipe that contains :py:class:`pystac.Asset` objects to raster
+        files such as :doc:`Zarr <zarr:index>`,
+        `NetCDF <https://www.unidata.ucar.edu/software/netcdf>`__,
+        `Cloud-Optimized GeoTIFF <https://www.cogeo.org>`__, etc.
+
+    engine : str or xarray.backends.BackendEntrypoint
+        Engine to use when reading files. If not provided, the default engine
+        will be the "stac" backend from ``xpystac``. Alternatively, set
+        ``engine=None`` to let ``xarray`` choose the default engine based on
+        available dependencies, with a preference for "netcdf4". See also
+        :py:func:`xarray.open_dataset` for details about other engine options.
+
+    kwargs : Optional
+        Extra keyword arguments to pass to :py:func:`xarray.open_dataset`.
+
+    Yields
+    ------
+    stream_obj : xarray.Dataset
+        A :py:class:`xarray.Dataset` object containing the raster data.
+
+    Raises
+    ------
+    ModuleNotFoundError
+        If ``xpystac`` is not installed. See
+        `install instructions for xpystac
+        <https://github.com/jsignell/xpystac#install>`__,
+        (e.g. via ``pip install xpystac``) before using this class.
+
+    Example
+    -------
+    >>> import pytest
+    >>> pystac = pytest.importorskip("pystac")
+    >>> xpystac = pytest.importorskip("xpystac")
+    >>> zarr = pytest.importorskip("zarr")
+    ...
+    >>> from torchdata.datapipes.iter import IterableWrapper
+    >>> from zen3geo.datapipes import XpySTACAssetReader
+    ...
+    >>> # Read in STAC Asset using DataPipe
+    >>> collection_url: str = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/nasa-nex-gddp-cmip6"
+    >>> asset: pystac.Asset = pystac.Collection.from_file(href=collection_url).assets[
+    ...     "ACCESS-CM2.historical"
+    ... ]
+    >>> dp = IterableWrapper(iterable=[asset])
+    >>> dp_xpystac = dp.read_from_xpystac()
+    ...
+    >>> # Loop or iterate over the DataPipe stream
+    >>> it = iter(dp_xpystac)
+    >>> dataset = next(it)
+    >>> dataset.sizes
+    Frozen({'time': 23741, 'lat': 600, 'lon': 1440})
+    >>> print(dataset.data_vars)
+    Data variables:
+        hurs     (time, lat, lon) float32 ...
+        huss     (time, lat, lon) float32 ...
+        pr       (time, lat, lon) float32 ...
+        rlds     (time, lat, lon) float32 ...
+        rsds     (time, lat, lon) float32 ...
+        sfcWind  (time, lat, lon) float32 ...
+        tas      (time, lat, lon) float32 ...
+        tasmax   (time, lat, lon) float32 ...
+        tasmin   (time, lat, lon) float32 ...
+    >>> dataset.attrs  # doctest: +NORMALIZE_WHITESPACE
+    {'Conventions': 'CF-1.7',
+     'activity': 'NEX-GDDP-CMIP6',
+     'cmip6_institution_id': 'CSIRO-ARCCSS',
+     'cmip6_license': 'CC-BY-SA 4.0',
+     'cmip6_source_id': 'ACCESS-CM2',
+     ...
+     'history': '2021-10-04T13:59:21.654137+00:00: install global attributes',
+     'institution': 'NASA Earth Exchange, NASA Ames Research Center, ...
+     'product': 'output',
+     'realm': 'atmos',
+     'references': 'BCSD method: Thrasher et al., 2012, ...
+     'resolution_id': '0.25 degree',
+     'scenario': 'historical',
+     'source': 'BCSD',
+     'title': 'ACCESS-CM2, r1i1p1f1, historical, global downscaled CMIP6 ...
+     'tracking_id': '16d27564-470f-41ea-8077-f4cc3efa5bfe',
+     'variant_label': 'r1i1p1f1',
+     'version': '1.0'}
+    """
+
+    def __init__(
+        self,
+        source_datapipe: IterDataPipe,
+        engine: str = "stac",
+        **kwargs: Optional[Dict[str, Any]]
+    ) -> None:
+        if xpystac is None:
+            raise ModuleNotFoundError(
+                "Package `xpystac` is required to be installed to use this datapipe. "
+                "Please use `pip install xpystac` "
+                "to install the package"
+            )
+        self.source_datapipe: IterDataPipe = source_datapipe
+        self.engine: str = engine
+        self.kwargs = kwargs
+
+    def __iter__(self) -> Iterator[StreamWrapper]:
+        for asset in self.source_datapipe:
+            yield StreamWrapper(
+                xr.open_dataset(asset, engine=self.engine, **self.kwargs)
+            )
+
+    def __len__(self) -> int:
+        return len(self.source_datapipe)
diff --git a/zen3geo/tests/test_datapipes_xpystac.py b/zen3geo/tests/test_datapipes_xpystac.py
@@ -0,0 +1,67 @@
+"""
+Tests for pystac datapipes.
+"""
+import pytest
+from torchdata.datapipes.iter import IterableWrapper
+
+from zen3geo.datapipes import XpySTACAssetReader
+
+pystac = pytest.importorskip("pystac")
+xpystac = pytest.importorskip("xpystac")
+
+
+# %%
+def test_xpystac_asset_reader_cog():
+    """
+    Ensure that XpySTACAssetReader works to read in a pystac.Asset object
+    stored as a Cloud-Optimized GeoTIFF and output to an xarray.Dataset object.
+    """
+    item_url: str = "https://github.com/stac-utils/pystac/raw/v1.7.1/tests/data-files/raster/raster-sentinel2-example.json"
+    asset: pystac.Asset = pystac.Item.from_file(href=item_url).assets["overview"]
+    assert asset.media_type == pystac.MediaType.COG
+
+    dp = IterableWrapper(iterable=[asset])
+
+    # Using class constructors
+    dp_xpystac = XpySTACAssetReader(source_datapipe=dp)
+    # Using functional form (recommended)
+    dp_xpystac = dp.read_from_xpystac()
+
+    assert len(dp_xpystac) == 1
+    it = iter(dp_xpystac)
+    dataset = next(it)
+
+    assert dataset.sizes == {"band": 3, "x": 343, "y": 343}
+    assert dataset.band_data.dtype == "float32"
+    assert dataset.rio.bounds() == (399960.0, 4090240.0, 509720.0, 4200000.0)
+    assert dataset.rio.resolution() == (320.0, -320.0)
+    assert dataset.rio.crs == "EPSG:32633"
+
+
+def test_xpystac_asset_reader_zarr():
+    """
+    Ensure that XpySTACAssetReader works to read in a pystac.Asset object
+    stored as a Zarr file and output to an xarray.Dataset object.
+    """
+    collection_url: str = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-daily-hi"
+    asset: pystac.Asset = pystac.Collection.from_file(href=collection_url).assets[
+        "zarr-https"
+    ]
+    assert asset.media_type == "application/vnd+zarr"
+
+    dp = IterableWrapper(iterable=[asset])
+
+    # Using class constructors
+    dp_xpystac = XpySTACAssetReader(source_datapipe=dp)
+    # Using functional form (recommended)
+    dp_xpystac = dp.read_from_xpystac()
+
+    assert len(dp_xpystac) == 1
+    it = iter(dp_xpystac)
+    dataset = next(it)
+
+    assert dataset.sizes == {"time": 14965, "y": 584, "x": 284, "nv": 2}
+    assert dataset.prcp.dtype == "float32"
+    assert dataset.rio.bounds() == (-5802750.0, -622500.0, -5518750.0, -38500.0)
+    assert dataset.rio.resolution() == (1000.0, -1000.0)
+    assert dataset.rio.grid_mapping == "lambert_conformal_conic"