Skip to content

Commit

Permalink
✨ XpySTACAssetReader for reading COG, NetCDF & Zarr STAC assets (#87)
Browse files Browse the repository at this point in the history
An iterable-style DataPipe for STAC asset data like Cloud-Optimized GeoTIFFs, NetCDF and Zarr! Uses xarray.open_dataset(..., engine="stac") enabled by xpystac for reading STAC assets into an xarray.Dataset object.

* ➕ Add xpystac

Extend xarray.open_dataset to accept pystac objects!

* ➕ Add zarr

An implementation of chunked, compressed, N-dimensional arrays for Python!

* ✨ XpySTACAssetReader for reading COG and Zarr STAC assets

An iterable-style DataPipe for STAC asset data like Cloud-Optimized GeoTIFFs and Zarr! Uses xpystac for the I/O. Included a doctest and unit test, added a new section in the API docs and some more intersphinx mappings.

* 💚 Install xpystac and zarr in docs extras

Include xpystac and zarr in the 'docs' extras dependencies to fix ReadtheDocs build.

* ➕ Add aiohttp to dev dependencies

Async http client/server framework (asyncio)!

Needed to fix `ModuleNotFoundError: No module named 'aiohttp'` when accessing NetCDF files from https://nasagddp.blob.core.windows.net/nex-gddp-cmip6-references/ACCESS-CM2_historical.json

* 📝 Add xpystac and zarr to pip install zen3geo[raster] docs

Mention that xpystac and zarr is installed with the 'raster' extras on the main index.md page. Also mentioned that NetCDF files can be read, and added some blank lines in the XpySTACAssetReader doctest example.

* ⬆️ Bump aiohttp from 3.8.1 to 3.8.4

Bumps [aiohttp](https://github.com/aio-libs/aiohttp) from 3.8.1 to 3.8.4.
- [Release notes](https://github.com/aio-libs/aiohttp/releases)
- [Changelog](https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst)
- [Commits](aio-libs/aiohttp@v3.8.1...v3.8.4)

* ✅ Unit test for a Zarr-backed STAC asset

Ensure that a STAC asset pointing to a Zarr file can be loaded using XpySTACAssetReader. Using the Daymet Annual Hawaii STAC Collection at https://planetarycomputer.microsoft.com/dataset/daymet-annual-hi for this unit test. Also edited previous unit test to specify that it is for a Cloud-Optimized GeoTIFF STAC Asset.

* 🔧 Put xpystac under the stac extras

Decided that xpystac fits better under the 'stac' extras, because it depends on just pystac and xarray, and has a somewhat optional dependency on stackstac. This enables a more streamlined I/O option for reading STAC Assets into an xarray.Dataset. Note that Zarr is kept under the 'raster' extras.

* ✏️ Fix NetCDF link and type hints

Need to use trailing underscores for RST-style URLs, and remove the `pystac.Asset` type hint so that zen3geo works without `pystac` being installed.
  • Loading branch information
weiji14 authored Apr 2, 2023
1 parent 9153c89 commit f5e1704
Show file tree
Hide file tree
Showing 8 changed files with 433 additions and 89 deletions.
3 changes: 3 additions & 0 deletions docs/_config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,9 @@ sphinx:
xbatcher:
- 'https://xbatcher.readthedocs.io/en/latest/'
- null
zarr:
- 'https://zarr.readthedocs.io/en/latest/'
- null
extra_extensions:
- 'sphinx.ext.autodoc'
- 'sphinx.ext.intersphinx'
Expand Down
9 changes: 9 additions & 0 deletions docs/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -82,3 +82,12 @@
.. autoclass:: zen3geo.datapipes.xbatcher.XbatcherSlicerIterDataPipe
:show-inheritance:
```

### XpySTAC

```{eval-rst}
.. automodule:: zen3geo.datapipes.xpystac
.. autoclass:: zen3geo.datapipes.XpySTACAssetReader
.. autoclass:: zen3geo.datapipes.xpystac.XpySTACAssetReaderIterDataPipe
:show-inheritance:
```
4 changes: 2 additions & 2 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,9 @@ Get what you need, not more, not less:
| Command | Dependencies |
|:-------------------------------|---------------|
| `pip install zen3geo` | rioxarray, torchdata |
| `pip install zen3geo[raster]` | rioxarray, torchdata, xbatcher |
| `pip install zen3geo[raster]` | rioxarray, torchdata, xbatcher, zarr |
| `pip install zen3geo[spatial]` | rioxarray, torchdata, datashader, spatialpandas |
| `pip install zen3geo[stac]` | rioxarray, torchdata, pystac, pystac-client, stackstac |
| `pip install zen3geo[stac]` | rioxarray, torchdata, pystac, pystac-client, stackstac, xpystac |
| `pip install zen3geo[vector]` | rioxarray, torchdata, pyogrio[geopandas] |

Retrieve more ['extras'](https://github.com/weiji14/zen3geo/blob/main/pyproject.toml) using
Expand Down
279 changes: 195 additions & 84 deletions poetry.lock

Large diffs are not rendered by default.

15 changes: 12 additions & 3 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,8 @@ pystac-client = {version = ">=0.4.0", optional = true}
spatialpandas = {version = ">=0.4.0", optional = true}
stackstac = {version = ">=0.4.0", optional = true}
xbatcher = {version = ">=0.2.0", optional = true}
xpystac = {version = ">=0.0.1", optional = true}
zarr = {version = ">=2.13.0", optional = true}
# Docs
adlfs = {version = "*", optional = true}
contextily = {version = "*", optional = true}
Expand All @@ -50,6 +52,7 @@ matplotlib = {version = "*", optional = true}
planetary-computer = {version="*", optional=true}

[tool.poetry.group.dev.dependencies]
aiohttp = "*"
black = "*"
pytest = "*"

Expand All @@ -67,17 +70,23 @@ docs = [
"pystac_client",
"stackstac",
"spatialpandas",
"xbatcher"
"xbatcher",
"xpystac",
"zarr"
]
raster = [
"xbatcher",
"zarr"
]
raster = ["xbatcher"]
spatial = [
"datashader",
"spatialpandas"
]
stac = [
"pystac",
"pystac_client",
"stackstac"
"stackstac",
"xpystac"
]
vector = ["pyogrio"]

Expand Down
3 changes: 3 additions & 0 deletions zen3geo/datapipes/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,3 +20,6 @@
StackSTACStackerIterDataPipe as StackSTACStacker,
)
from zen3geo.datapipes.xbatcher import XbatcherSlicerIterDataPipe as XbatcherSlicer
from zen3geo.datapipes.xpystac import (
XpySTACAssetReaderIterDataPipe as XpySTACAssetReader,
)
142 changes: 142 additions & 0 deletions zen3geo/datapipes/xpystac.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
"""
DataPipes for `xpystac <https://github.com/jsignell/xpystac>`__.
"""
from typing import Any, Dict, Iterator, Optional

import xarray as xr

try:
import pystac
import xpystac
except ImportError:
pystac = None
xpystac = None
from torchdata.datapipes import functional_datapipe
from torchdata.datapipes.iter import IterDataPipe
from torchdata.datapipes.utils import StreamWrapper


@functional_datapipe("read_from_xpystac")
class XpySTACAssetReaderIterDataPipe(IterDataPipe[StreamWrapper]):
"""
Takes a :py:class:`pystac.Asset` object containing raster data (e.g.
:doc:`Zarr <zarr:index>`,
`NetCDF <https://www.unidata.ucar.edu/software/netcdf>`__,
`Cloud-Optimized GeoTIFF <https://www.cogeo.org>`__, etc) from local disk
or URLs (as long as they can be read by xpystac) and yields
:py:class:`xarray.Dataset` objects (functional name:
``read_from_xpystac``).
Based on
https://github.com/pytorch/data/blob/v0.5.1/torchdata/datapipes/iter/load/iopath.py#L42-L97
Parameters
----------
source_datapipe : IterDataPipe[pystac.Asset]
A DataPipe that contains :py:class:`pystac.Asset` objects to raster
files such as :doc:`Zarr <zarr:index>`,
`NetCDF <https://www.unidata.ucar.edu/software/netcdf>`__,
`Cloud-Optimized GeoTIFF <https://www.cogeo.org>`__, etc.
engine : str or xarray.backends.BackendEntrypoint
Engine to use when reading files. If not provided, the default engine
will be the "stac" backend from ``xpystac``. Alternatively, set
``engine=None`` to let ``xarray`` choose the default engine based on
available dependencies, with a preference for "netcdf4". See also
:py:func:`xarray.open_dataset` for details about other engine options.
kwargs : Optional
Extra keyword arguments to pass to :py:func:`xarray.open_dataset`.
Yields
------
stream_obj : xarray.Dataset
A :py:class:`xarray.Dataset` object containing the raster data.
Raises
------
ModuleNotFoundError
If ``xpystac`` is not installed. See
`install instructions for xpystac
<https://github.com/jsignell/xpystac#install>`__,
(e.g. via ``pip install xpystac``) before using this class.
Example
-------
>>> import pytest
>>> pystac = pytest.importorskip("pystac")
>>> xpystac = pytest.importorskip("xpystac")
>>> zarr = pytest.importorskip("zarr")
...
>>> from torchdata.datapipes.iter import IterableWrapper
>>> from zen3geo.datapipes import XpySTACAssetReader
...
>>> # Read in STAC Asset using DataPipe
>>> collection_url: str = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/nasa-nex-gddp-cmip6"
>>> asset: pystac.Asset = pystac.Collection.from_file(href=collection_url).assets[
... "ACCESS-CM2.historical"
... ]
>>> dp = IterableWrapper(iterable=[asset])
>>> dp_xpystac = dp.read_from_xpystac()
...
>>> # Loop or iterate over the DataPipe stream
>>> it = iter(dp_xpystac)
>>> dataset = next(it)
>>> dataset.sizes
Frozen({'time': 23741, 'lat': 600, 'lon': 1440})
>>> print(dataset.data_vars)
Data variables:
hurs (time, lat, lon) float32 ...
huss (time, lat, lon) float32 ...
pr (time, lat, lon) float32 ...
rlds (time, lat, lon) float32 ...
rsds (time, lat, lon) float32 ...
sfcWind (time, lat, lon) float32 ...
tas (time, lat, lon) float32 ...
tasmax (time, lat, lon) float32 ...
tasmin (time, lat, lon) float32 ...
>>> dataset.attrs # doctest: +NORMALIZE_WHITESPACE
{'Conventions': 'CF-1.7',
'activity': 'NEX-GDDP-CMIP6',
'cmip6_institution_id': 'CSIRO-ARCCSS',
'cmip6_license': 'CC-BY-SA 4.0',
'cmip6_source_id': 'ACCESS-CM2',
...
'history': '2021-10-04T13:59:21.654137+00:00: install global attributes',
'institution': 'NASA Earth Exchange, NASA Ames Research Center, ...
'product': 'output',
'realm': 'atmos',
'references': 'BCSD method: Thrasher et al., 2012, ...
'resolution_id': '0.25 degree',
'scenario': 'historical',
'source': 'BCSD',
'title': 'ACCESS-CM2, r1i1p1f1, historical, global downscaled CMIP6 ...
'tracking_id': '16d27564-470f-41ea-8077-f4cc3efa5bfe',
'variant_label': 'r1i1p1f1',
'version': '1.0'}
"""

def __init__(
self,
source_datapipe: IterDataPipe,
engine: str = "stac",
**kwargs: Optional[Dict[str, Any]]
) -> None:
if xpystac is None:
raise ModuleNotFoundError(
"Package `xpystac` is required to be installed to use this datapipe. "
"Please use `pip install xpystac` "
"to install the package"
)
self.source_datapipe: IterDataPipe = source_datapipe
self.engine: str = engine
self.kwargs = kwargs

def __iter__(self) -> Iterator[StreamWrapper]:
for asset in self.source_datapipe:
yield StreamWrapper(
xr.open_dataset(asset, engine=self.engine, **self.kwargs)
)

def __len__(self) -> int:
return len(self.source_datapipe)
67 changes: 67 additions & 0 deletions zen3geo/tests/test_datapipes_xpystac.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
"""
Tests for pystac datapipes.
"""
import pytest
from torchdata.datapipes.iter import IterableWrapper

from zen3geo.datapipes import XpySTACAssetReader

pystac = pytest.importorskip("pystac")
xpystac = pytest.importorskip("xpystac")


# %%
def test_xpystac_asset_reader_cog():
"""
Ensure that XpySTACAssetReader works to read in a pystac.Asset object
stored as a Cloud-Optimized GeoTIFF and output to an xarray.Dataset object.
"""
item_url: str = "https://github.com/stac-utils/pystac/raw/v1.7.1/tests/data-files/raster/raster-sentinel2-example.json"
asset: pystac.Asset = pystac.Item.from_file(href=item_url).assets["overview"]
assert asset.media_type == pystac.MediaType.COG

dp = IterableWrapper(iterable=[asset])

# Using class constructors
dp_xpystac = XpySTACAssetReader(source_datapipe=dp)
# Using functional form (recommended)
dp_xpystac = dp.read_from_xpystac()

assert len(dp_xpystac) == 1
it = iter(dp_xpystac)
dataset = next(it)

assert dataset.sizes == {"band": 3, "x": 343, "y": 343}
assert dataset.band_data.dtype == "float32"
assert dataset.rio.bounds() == (399960.0, 4090240.0, 509720.0, 4200000.0)
assert dataset.rio.resolution() == (320.0, -320.0)
assert dataset.rio.crs == "EPSG:32633"


def test_xpystac_asset_reader_zarr():
"""
Ensure that XpySTACAssetReader works to read in a pystac.Asset object
stored as a Zarr file and output to an xarray.Dataset object.
"""
collection_url: str = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-daily-hi"
asset: pystac.Asset = pystac.Collection.from_file(href=collection_url).assets[
"zarr-https"
]
assert asset.media_type == "application/vnd+zarr"

dp = IterableWrapper(iterable=[asset])

# Using class constructors
dp_xpystac = XpySTACAssetReader(source_datapipe=dp)
# Using functional form (recommended)
dp_xpystac = dp.read_from_xpystac()

assert len(dp_xpystac) == 1
it = iter(dp_xpystac)
dataset = next(it)

assert dataset.sizes == {"time": 14965, "y": 584, "x": 284, "nv": 2}
assert dataset.prcp.dtype == "float32"
assert dataset.rio.bounds() == (-5802750.0, -622500.0, -5518750.0, -38500.0)
assert dataset.rio.resolution() == (1000.0, -1000.0)
assert dataset.rio.grid_mapping == "lambert_conformal_conic"

0 comments on commit f5e1704

Please sign in to comment.