recommended way to get a whole dataset into xarray #258

Open
rabernat opened this issue Nov 16, 2018 · 9 comments

@rabernat

Sorry for raising an open-ended / more-of-a-question issue. I figured this would be the best place to discuss it.

I have been playing with siphon recently and am really enjoying it! I would now like to use it in the following way.

Let's say we have a thredds server which stores many files from the same dataset, e.g. NARR daily data:
https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.html
This is a very common scenario. Right now siphon lets me open an individual file into xarray via OPeNDAP. Instead, I would like to "open" the whole thing (all the files) lazily, and then do my subsetting from the xarray side. (Or maybe gobble the whole thing up in parallel using dask, but that's a later step.)

Here's what I came up with:

import xarray as xr
from siphon.catalog import TDSCatalog
cat_url = 'https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.xml'
cat = TDSCatalog(cat_url)
dsets = [cds.remote_access(use_xarray=True).reset_coords(drop=True).chunk({'time': 1, 'level': 1})
         for cds in cat.datasets[:30]] # eventually want to use the whole catalog here
ds = xr.auto_combine(dsets)
ds

which gives

<xarray.Dataset>
Dimensions:            (level: 29, nbnds: 2, time: 912, x: 349, y: 277)
Coordinates:
  * level              (level) float32 1000.0 975.0 950.0 925.0 900.0 875.0 ...
  * y                  (y) float32 0.0 32463.0 64926.0 97389.0 129852.0 ...
  * x                  (x) float32 0.0 32463.0 64926.0 97389.0 129852.0 ...
  * time               (time) datetime64[ns] 1979-01-01 1979-01-02 ...
Dimensions without coordinates: nbnds
Data variables:
    Lambert_Conformal  (time) int32 -2147483647 -2147483647 -2147483647 ...
    time_bnds          (time, nbnds) float64 dask.array<shape=(912, 2), chunksize=(1, 2)>
    air                (time, level, y, x) float32 dask.array<shape=(912, 29, 277, 349), chunksize=(1, 1, 277, 349)>

This seems to basically work. I chose the chunks to match the _ChunkSizes attribute of the data variables.
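For reference, a minimal sketch of deriving that chunks argument from the metadata rather than hard-coding it (chunks_from_attr is a made-up helper; whether _ChunkSizes shows up in .attrs or as chunksizes in .encoding depends on the backend):

def chunks_from_attr(ds, name):
    """Build the dict for Dataset.chunk() from a variable's chunk-size metadata."""
    var = ds[name]
    sizes = var.attrs.get('_ChunkSizes', var.encoding.get('chunksizes'))
    if sizes is None:
        return {}                        # no chunking info available
    return dict(zip(var.dims, sizes))

# e.g. ds = ds.chunk(chunks_from_attr(ds, 'air'))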

My questions are:

  • Does this make sense? Am I doing anything obviously stupid here?
  • Will I break the thredds server?
  • If the answer to the above is no, does it make sense to document this sort of workflow and / or provide a way to do it automatically from siphon?

What I have in mind is the ability to virtually open entire datasets from thredds in the same way we do with zarr. I have gotten used to this way of working and really like it. It is nice to just see the whole dataset--all timesteps and variables--from xarray.
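For comparison, the zarr workflow I have in mind looks roughly like this (the store path is just a placeholder):

import xarray as xr

# One call opens the entire store lazily; only metadata is read up front.
ds = xr.open_zarr('path/to/narr-dailies.zarr', consolidated=True)

# Subsetting and reductions stay lazy until .compute() (or plotting) forces them.
clim = ds['air'].sel(level=500.0).mean('time')
result = clim.compute()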

@dopplershift
Member

dopplershift commented Nov 16, 2018

What you've done here looks perfectly sensible to me.

As far as ramifications to the TDS, I'd defer to @lesserwhirls: how much load would we induce on the server by getting metadata from every one of the files at https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.xml? I know getting data across the many files could be intensive, but I'm curious how much burden the client places on the server just by creating the collection.

@rabernat
Author

If anyone wants to play around with this, you can check out the narr_noaa_thredds.ipynb example notebook in this binder: https://github.com/rabernat/pangeo_esgf_demo. In that notebook I not only crawl the metadata but also use dask to pull data in parallel. It works great from my side!

@lesserwhirls
Collaborator

Cracking the metadata is fairly lightweight - it's something you do each time you open an OPeNDAP URL or click the OPeNDAP link in a catalog, so no biggie, really. Being stored as netCDF4 adds a little overhead to getting the metadata that you would not see with netCDF-3 or GRIB, but it should not be much. The server should be just fine here. Ideally the server would aggregate these using an FMRC, but it looks like that's not an option for them.

In terms of connections, it all depends on how they have their server set up (which version of Tomcat they are using, whether they are load balancing, etc.) and how nice you want to be (at the risk of being blocked). The good news is that the connections are not persistent - when an OPeNDAP URL is opened by netCDF-C, a connection to the server is opened, the metadata are read (via the .das), and the connection is closed. You generally don't have much to worry about in terms of bogging down the server just by requesting metadata.

Once a slice is made, a new connection is opened and, once the data come back, that connection is closed. Depending on the size and stride of the slice, each one of these could take some time to fetch from disk. Doing the actual data requests in parallel could be OK, but could open many not-so-short-lived connections when making the slices. Unless these data are stored on a high-performance disk subsystem, you might cause the disks to thrash like crazy, and the performance would be super bad. If the parallel requests max out the number of connections the server can handle, then it's bad times for all. It looks like you are making requests based on the chunk size, and I'd think that would certainly help in terms of keeping the lifespan of each connection down, but it all depends on the total number of simultaneous requests (not sure what Dask looks like here).
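If you do want to be gentle with the server, one client-side option is to cap dask's local worker count so only a few OPeNDAP requests are ever in flight at once; a minimal sketch (the worker count is arbitrary, and ds is the combined dataset from the original post):

# ds = xr.auto_combine(dsets)   # from the original post

# Cap the number of chunks fetched at once so the server only ever sees a
# handful of concurrent OPeNDAP connections from this client.
mean_air = ds['air'].mean(dim='time').compute(scheduler='threads', num_workers=4)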

I'd say if you're happy with how it's working, and NOAA/ESRL/PSD doesn't block your IP address, then game on!

@deeplycloudy
Contributor

+1 to implementing the next step proposed above: consume the whole catalog without the list comprehension.

A further step would be to crawl the full depth starting from a top level catalog. Things are often organized by nested month and day folders, and I’d like to do:

ds = TDSCatalog('top_catalog.xml').remote_access(use_xarray=True)

From there the dream is for the pretty Jupyter representation of the dataset to become a dataset structure and scope discovery engine, with subsetting happening with the familiar xarray syntax. Compare that to clicking through the web interface and/or programmatically querying for what variables and time/space spans are present; that's an especially large activation energy when a dataset is unfamiliar.

Probably also related to #292.
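A rough sketch of what that recursive crawl could look like with the current siphon API (the catalog URL is a placeholder, and the combine/chunk choices are assumptions):

import xarray as xr
from siphon.catalog import TDSCatalog

def crawl_datasets(catalog):
    """Recursively yield every dataset reachable from a TDS catalog."""
    for ds in catalog.datasets.values():
        yield ds
    for ref in catalog.catalog_refs.values():
        # CatalogRef.follow() fetches the nested catalog.xml
        yield from crawl_datasets(ref.follow())

# Usage sketch: open everything lazily and combine along shared coordinates.
# top = TDSCatalog('https://example.com/thredds/catalog/top_catalog.xml')
# dsets = [d.remote_access(use_xarray=True).chunk({'time': 1})
#          for d in crawl_datasets(top)]
# ds = xr.combine_by_coords(dsets)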

@dopplershift
Member

I wonder if intake-thredds can do this already? intake.open_thredds_merged() seems particularly relevant.
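If so, the usage would presumably look something like this (the path glob and whether it matches the NARR file names are my guesses):

import intake  # intake-thredds registers the opener when installed

# Merge every file matching the path pattern into one lazy, dask-backed dataset.
source = intake.open_thredds_merged(
    'https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.xml',
    path=['air.*.nc'],
)
ds = source.to_dask()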

@brianmapes

I'm puzzled by .to_dask() being the key line in intake-thredds, the result being an xarray dataset. But looking now at dask.org, it is vast -- I thought it was just a parallelization thing, unnecessary for "small" users like me, but no: it seems bigger (higher-level) than that, a wrapper for so many things.

Will dask obviate the xarray and siphon layers, with their many cells' worth of steps, just as xarray has superseded explicit calls to netCDF4? That could be nice.

Siphon involves more cells (and calls) than I wish I needed to make. For instance (and specifically for today's workflow), the auto-generated Jupyter notebook from here is rather long, and it eventually fails at plotting (at the end) because it lacks a time selector:
TypeError: Invalid shape (1, 721, 1440) for image data.

Could a one-cell notebook using xarray do the same job?
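As an aside, that TypeError is usually just the length-1 time dimension being left on the array; a minimal illustration of the fix, with synthetic data standing in for the notebook's array:

import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the (1, 721, 1440) array the notebook tried to plot.
data = np.random.rand(1, 721, 1440)

fig, ax = plt.subplots()
ax.imshow(data[0])   # with xarray: da.isel(time=0) or da.squeeze('time')
plt.show()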

@dopplershift
Member

@brianmapes Dask is about making virtualized numpy-compatible arrays that wrap chunks of data. This allows you to have arrays larger than fit in memory. In order to work with that efficiently, operations on those arrays are deferred until their results are needed, at which point the entire graph of computations is executed (to manage data being loaded into and out of memory). It also just so happens that once you have all that, distributed computing falls out of it: hence dask.distributed.
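A tiny illustration of that deferral using plain dask.array (nothing siphon-specific):

import dask.array as da

# A chunked 2-D array; nothing is read into memory or computed yet.
x = da.random.random((20_000, 20_000), chunks=(1_000, 1_000))

# Building the expression only records a task graph.
result = (x + x.T).mean(axis=0)

# The whole graph runs here, chunk by chunk, keeping memory bounded.
values = result.compute()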

None of that obviates siphon or xarray. Siphon is about interacting with a thredds data server: parsing catalogs and speaking the needed protocols. Xarray provides a nice pandas-like interface on top of the netCDF data model, helping give us some of the niceties of netCDF-java's CDM.

Regarding the link, as far as the error is concerned, that's NCAR RDA's server, so you'll have to take it up with them; we don't have any input there. In general, writing generic code to work with all possible data dimensionality is non-trivial, though. I agree Siphon ends up being a bit more verbose, but in this case that's partially caused by the RDA not being set up to use Grib Collections and instead only exposing the individual files. Compare that with using the "Best", "Full Collection", or "Latest" links on Unidata's demonstration TDS. It's possible that this is easier using intake-thredds, but I haven't experimented with it. For siphon, you can open that entire collection with xarray using:

from siphon.catalog import TDSCatalog

cat = TDSCatalog('https://thredds.ucar.edu/thredds/catalog/grib/NCEP/GFS/Global_0p25deg/catalog.xml')
xarray_dataset = cat.datasets['Best GFS Quarter Degree Forecast Time Series'].remote_access(use_xarray=True)

We've done our best to simplify that as much as possible. Concrete suggestions on how to make that even easier (and work generally) are always welcome.

The core of the problem is that at some point somebody has to make and maintain a catalog of the data. TDS provides this directly in the catalog.xml, but you have to know where the server is and the hierarchy of datasets. intake-thredds does this with a YAML file, but it's essentially the same thing.

@brianmapes

Lucid, thanks! I'll try this catalog-level access.
I want to make a lead-time ensemble for a given verification time, which involves building strings for requests, but also using datetime, since lead-time-to-verification differences cross day boundaries. A bit of a puzzle (unless you know an example somewhere...), but a fun puzzle if one has the time.
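A minimal sketch of the datetime bookkeeping, assuming a 6-hourly run cadence; the request-string pattern is purely hypothetical:

from datetime import datetime, timedelta

# Verification (valid) time every ensemble member should share.
valid = datetime(2018, 11, 16, 0)

# Lead times to include, e.g. 0-48 h at an assumed 6 h run cadence.
for hours in range(0, 49, 6):
    lead = timedelta(hours=hours)
    init = valid - lead                      # timedelta handles day boundaries
    # Hypothetical request string; the real pattern depends on the server.
    request = f'GFS_Global_0p25deg_{init:%Y%m%d_%H%M}.grib2/f{hours:03d}'
    print(init, '->', request)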

@dopplershift
Member

@brianmapes The only help I can offer is that if you're willing to live with the time span of data available on thredds.ucar.edu, the "Full Collection (Reference / Forecast Time) Dataset" from the Grib Collection should let you query for data from a given valid/verification time. We're getting off topic on this issue, so if we need more discussion, let's take it to a new discussion on this repo.
