recommended way to get a whole dataset into xarray #258

Open
rabernat opened this issue Nov 16, 2018 · 9 comments

@rabernat

Sorry for raising an open-ended / more-of-a-question issue. I figured this would be the best place to discuss it.

I have been playing with siphon recently and am really enjoying it! I would now like to use it in the following way.

Let's say we have a thredds server which stores many files from the same dataset, e.g. NARR daily data:
https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.html
This is a very common scenario. Right now siphon lets me open an individual file into xarray via OPeNDAP. Instead, I would like to "open" the whole thing (all the files) lazily, and then do my subsetting from the xarray side. (Or maybe gobble the whole thing up in parallel using dask, but that's a later step.)

Here's what I came up with:

import xarray as xr
from siphon.catalog import TDSCatalog
cat_url = 'https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.xml'
cat = TDSCatalog(cat_url)
dsets = [cds.remote_access(use_xarray=True).reset_coords(drop=True).chunk({'time': 1, 'level': 1})
         for cds in cat.datasets[:30]] # eventually want to use the whole catalog here
ds = xr.auto_combine(dsets)
ds

which gives

<xarray.Dataset>
Dimensions:            (level: 29, nbnds: 2, time: 912, x: 349, y: 277)
Coordinates:
  * level              (level) float32 1000.0 975.0 950.0 925.0 900.0 875.0 ...
  * y                  (y) float32 0.0 32463.0 64926.0 97389.0 129852.0 ...
  * x                  (x) float32 0.0 32463.0 64926.0 97389.0 129852.0 ...
  * time               (time) datetime64[ns] 1979-01-01 1979-01-02 ...
Dimensions without coordinates: nbnds
Data variables:
    Lambert_Conformal  (time) int32 -2147483647 -2147483647 -2147483647 ...
    time_bnds          (time, nbnds) float64 dask.array<shape=(912, 2), chunksize=(1, 2)>
    air                (time, level, y, x) float32 dask.array<shape=(912, 29, 277, 349), chunksize=(1, 1, 277, 349)>

This seems to basically work. I chose the chunks to match the _ChunkSizes attribute of the data variables.
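For reference, a minimal sketch of deriving that chunks argument from the metadata rather than hard-coding it (chunks_from_attr is a made-up helper; whether _ChunkSizes shows up in .attrs or as chunksizes in .encoding depends on the backend):

def chunks_from_attr(ds, name):
    """Build the dict for Dataset.chunk() from a variable's chunk-size metadata."""
    var = ds[name]
    sizes = var.attrs.get('_ChunkSizes', var.encoding.get('chunksizes'))
    if sizes is None:
        return {}                        # no chunking info available
    return dict(zip(var.dims, sizes))

# e.g. ds = ds.chunk(chunks_from_attr(ds, 'air'))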

My questions are:

  • Does this make sense? Am I doing anything obviously stupid here?
  • Will I break the thredds server?
  • If the answer to the above is no, does it make sense to document this sort of workflow and / or provide a way to do it automatically from siphon?

What I have in mind is the ability to virtually open entire datasets from thredds in the same way we do with zarr. I have gotten used to this way of working and really like it. It is nice to just see the whole dataset--all timesteps and variables--from xarray.
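For comparison, the zarr workflow I have in mind looks roughly like this (the store path is just a placeholder):

import xarray as xr

# One call opens the entire store lazily; only metadata is read up front.
ds = xr.open_zarr('path/to/narr-dailies.zarr', consolidated=True)

# Subsetting and reductions stay lazy until .compute() (or plotting) forces them.
clim = ds['air'].sel(level=500.0).mean('time')
result = clim.compute()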

@dopplershift
Member

dopplershift commented Nov 16, 2018

What you've done here looks perfectly sensible to me.

As far as ramifications to the TDS, I'd defer to @lesserwhirls: how much load would we induce on the server by getting metadata from every one of the files at https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.xml? I know getting data across the many files could be intensive, but I'm curious how much burden the client places on the server just by creating the collection.

@rabernat
Author

If anyone wants to play around with this, you can check out the narr_noaa_thredds.ipynb example notebook in this binder: https://github.com/rabernat/pangeo_esgf_demo. In that notebook I not only crawl the metadata but also use dask to pull data in parallel. It works great from my side!

@lesserwhirls
Collaborator

Cracking the metadata is fairly lightweight - it's something you do each time you open an OPeNDAP URL or click the OPeNDAP link in a catalog, so no biggie, really. Being stored as netCDF4 adds a little overhead to getting the metadata that you would not see with netCDF-3 or GRIB, but it should not be much. The server should be just fine here. Ideally the server would aggregate these using an FMRC, but it looks like that's not an option for them.

In terms of connections, it all depends on how they have their server set up (which version of Tomcat they are using, whether they are load balancing, etc.) and how nice you want to be (at the risk of being blocked). The good news is that the connections are not persistent - when an OPeNDAP URL is opened by netCDF-C, a connection to the server is opened, the metadata are read (via the .das), and the connection is closed. You generally don't have much to worry about in terms of bogging down the server just by requesting metadata.

Once a slice is made, a new connection is opened and, once the data come back, that connection is closed. Depending on the size and stride of the slice, each one of these could take some time to fetch from disk. Doing the actual data requests in parallel could be OK, but could open many not-so-short-lived connections when making the slices. Unless these data are stored on a high-performance disk subsystem, you might cause the disks to thrash like crazy, and the performance would be super bad. If the parallel requests max out the number of connections the server can handle, then it's bad times for all. It looks like you are making requests based on the chunk size, and I'd think that would certainly help in terms of keeping the lifespan of each connection down, but it all depends on the total number of simultaneous requests (not sure what Dask looks like here).
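If you do want to be gentle with the server, one client-side option is to cap dask's local worker count so only a few OPeNDAP requests are ever in flight at once; a minimal sketch (the worker count is arbitrary, and ds is the combined dataset from the original post):

# ds = xr.auto_combine(dsets)   # from the original post

# Cap the number of chunks fetched at once so the server only ever sees a
# handful of concurrent OPeNDAP connections from this client.
mean_air = ds['air'].mean(dim='time').compute(scheduler='threads', num_workers=4)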

I'd say if you're happy with how it's working, and NOAA/ESRL/PSD doesn't block your IP address, then game on!

@deeplycloudy
Contributor

+1 to implementing the next step proposed above: consume the whole catalog without the list comprehension.

A further step would be to crawl the full depth starting from a top level catalog. Things are often organized by nested month and day folders, and I’d like to do:

ds = TDSCatalog('top_catalog.xml').remote_access(use_xarray=True)

From there the dream is for the pretty Jupyter representation of the dataset to become a dataset structure and scope discovery engine, with subsetting happening with the familiar xarray syntax. Compare that to clicking through the web interface and/or programmatically querying for what variables and time/space spans are present; that's an especially large activation energy when a dataset is unfamiliar.

Probably also related to #292.
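A rough sketch of what that recursive crawl could look like with the current siphon API (the catalog URL is a placeholder, and the combine/chunk choices are assumptions):

import xarray as xr
from siphon.catalog import TDSCatalog

def crawl_datasets(catalog):
    """Recursively yield every dataset reachable from a TDS catalog."""
    for ds in catalog.datasets.values():
        yield ds
    for ref in catalog.catalog_refs.values():
        # CatalogRef.follow() fetches the nested catalog.xml
        yield from crawl_datasets(ref.follow())

# Usage sketch: open everything lazily and combine along shared coordinates.
# top = TDSCatalog('https://example.com/thredds/catalog/top_catalog.xml')
# dsets = [d.remote_access(use_xarray=True).chunk({'time': 1})
#          for d in crawl_datasets(top)]
# ds = xr.combine_by_coords(dsets)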

@dopplershift
Member

I wonder if intake-thredds can do this already? intake.open_thredds_merged() seems particularly relevant.
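If so, the usage would presumably look something like this (the path glob and whether it matches the NARR file names are my guesses):

import intake  # intake-thredds registers the opener when installed

# Merge every file matching the path pattern into one lazy, dask-backed dataset.
source = intake.open_thredds_merged(
    'https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.xml',
    path=['air.*.nc'],
)
ds = source.to_dask()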

@brianmapes

I'm puzzled by .to_dask() being the key line in intake-thredds, the result being an xarray dataset. But looking now at dask.org, it is vast -- I thought it was just a parallelization thing, unnecessary for "small" users like me, but no: it seems bigger (higher-level) than that, a wrapper for so many things.

Will dask obviate the xarray and siphon layers, with their many cells' worth of steps, just as xarray has superseded explicit calls to netCDF4? That could be nice.

Siphon involves more cells (and calls) than I wish I needed to make. For instance (and specifically for today's workflow), the auto-generated Jupyter notebook from here is rather long, and it eventually fails at plotting (at the end) because it lacks a time selector:
TypeError: Invalid shape (1, 721, 1440) for image data.

Could a one-cell notebook using xarray do the same job?
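As an aside, that TypeError is usually just the length-1 time dimension being left on the array; a minimal illustration of the fix, with synthetic data standing in for the notebook's array:

import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the (1, 721, 1440) array the notebook tried to plot.
data = np.random.rand(1, 721, 1440)

fig, ax = plt.subplots()
ax.imshow(data[0])   # with xarray: da.isel(time=0) or da.squeeze('time')
plt.show()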

@dopplershift
Member

@brianmapes Dask is about making virtualized numpy-compatible arrays that wrap chunks of data. This allows you to have arrays larger than fit in memory. In order to work with that efficiently, operations on those arrays are deferred until their results are needed, at which point the entire graph of computations is executed (to manage data being loaded into and out of memory). It also just so happens that once you have all that, distributed computing falls out of it: hence dask.distributed.
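A tiny illustration of that deferral using plain dask.array (nothing siphon-specific):

import dask.array as da

# A chunked 2-D array; nothing is read into memory or computed yet.
x = da.random.random((20_000, 20_000), chunks=(1_000, 1_000))

# Building the expression only records a task graph.
result = (x + x.T).mean(axis=0)

# The whole graph runs here, chunk by chunk, keeping memory bounded.
values = result.compute()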

None of that obviates siphon or xarray. Siphon is about interacting with a thredds data server: parsing catalogs and speaking the needed protocols. Xarray provides a nice pandas-like interface on top of the netCDF data model, helping give us some of the niceties of netCDF-java's CDM.

Regarding the link, as far as the error is concerned, that's NCAR RDA's server, so you'll have to take it up with them; we don't have any input there. In general, writing generic code to work with all possible data dimensionality is non-trivial, though. I agree Siphon ends up being a bit more verbose, but in this case that's partially caused by the RDA not being set up to use Grib Collections and instead only exposing the individual files. Compare that with using the "Best", "Full Collection", or "Latest" links on Unidata's demonstration TDS. It's possible that this is easier using intake-thredds, but I haven't experimented with it. For siphon, you can open that entire collection with xarray using:

from siphon.catalog import TDSCatalog

cat = TDSCatalog('https://thredds.ucar.edu/thredds/catalog/grib/NCEP/GFS/Global_0p25deg/catalog.xml')
xarray_dataset = cat.datasets['Best GFS Quarter Degree Forecast Time Series'].remote_access(use_xarray=True)

We've done our best to simplify that as much as possible. Concrete suggestions on how to make that even easier (and work generally) are always welcome.

The core of the problem is that at some point somebody has to make and maintain a catalog of the data. TDS provides this directly in the catalog.xml, but you have to know where the server is and the hierarchy of datasets. intake-thredds does this with a YAML file, but it's essentially the same thing.

@brianmapes

Lucid, thanks! I'll try this catalog-level access.
I want to make a lead-time ensemble for a given verification time, which involves building strings for requests, but also using datetime, since lead-time-to-verification differences cross day boundaries. A bit of a puzzle (unless you know an example somewhere...), but a fun puzzle if one has the time.
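A minimal sketch of the datetime bookkeeping, assuming a 6-hourly run cadence; the request-string pattern is purely hypothetical:

from datetime import datetime, timedelta

# Verification (valid) time every ensemble member should share.
valid = datetime(2018, 11, 16, 0)

# Lead times to include, e.g. 0-48 h at an assumed 6 h run cadence.
for hours in range(0, 49, 6):
    lead = timedelta(hours=hours)
    init = valid - lead                      # timedelta handles day boundaries
    # Hypothetical request string; the real pattern depends on the server.
    request = f'GFS_Global_0p25deg_{init:%Y%m%d_%H%M}.grib2/f{hours:03d}'
    print(init, '->', request)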

@dopplershift
Member

@brianmapes The only help I can offer is that if you're willing to live with the time span of data available on thredds.ucar.edu, the "Full Collection (Reference / Forecast Time) Dataset" from the Grib Collection should let you query for data from a given valid/verification time. We're getting off topic on this issue, so if we need more discussion, let's take it to a new discussion on this repo.
