recommended way to get a whole dataset into xarray #258
Comments
What you've done here looks perfectly sensible to me. As far as ramifications for the TDS go, I'd defer to @lesserwhirls: how much load would we induce on the server by getting metadata from every one of the files at https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.xml? I know getting data across the many files could be intensive, but I'm curious how much burden the client would put on the server just by creating the collection.
If anyone wants to play around with this, you can check out the
Cracking the metadata is fairly lightweight - it's something you do each time you open an OPeNDAP URL or click the OPeNDAP link in a catalog, so no biggie, really. Being stored as netCDF-4 adds a little overhead to getting the metadata that you would not see in netCDF-3 or GRIB, but it should not be much. The server should be just fine here. Ideally the server would aggregate these using an FMRC, but it looks like that's not an option for them.

In terms of connections, it all depends on how they have their server set up (which version of Tomcat they are using, whether they are load balancing, etc.) and how nice you want to be (at the risk of being blocked). The good news is that the connections are not persistent - when an OPeNDAP URL is opened by netCDF-C, a connection to the server is opened, the metadata are read (via the .das), and the connection is closed. You generally don't have much to worry about in terms of bogging down the server just by requesting metadata. Once a slice is made, a new connection is opened and, once the data come back, that connection is closed. Depending on the size and stride of the slice, each one of these could take some time to fetch from disk.

Doing the actual data requests in parallel could be OK, but it could open many not-so-short-lived connections when making the slices. Unless these data are stored on a high-performance disk subsystem, you might cause the disks to thrash like crazy, and the performance would be super bad. If the parallel requests max out the number of connections the server can handle, then it's bad times for all. It looks like you are making requests based on the chunk size, and I'd think that would certainly help in terms of keeping the lifespan of each connection down, but it all depends on the total number of simultaneous requests (not sure what Dask looks like here). I'd say if you're happy with how it's working, and NOAA/ESRL/PSD doesn't block your IP address, then game on!
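As a small illustration of that open/slice behavior from the client side (the file URL and variable name below are made up for the example, not taken from the catalog above):

```python
import xarray as xr

# Opening an OPeNDAP URL only reads the metadata (the .das/.dds); the
# connection is short-lived and no data values are transferred yet.
url = ('https://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/NARR/Dailies/'
       'pressure/air.2020.nc')
ds = xr.open_dataset(url)

# Still lazy: this just records the slice we want.
slab = ds['air'].isel(time=0)

# Only now is a data request made; each such request is its own connection.
values = slab.values
```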
+1 to implementing the next step proposed above: consume the whole catalog without the list comprehension. A further step would be to crawl the full depth starting from a top-level catalog. Things are often organized by nested month and day folders, and I'd like to do something along the lines of the sketch below:
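A hedged sketch of such a crawl using siphon's catalog references; the helper, the top-level catalog URL, and the combine/chunk choices are all assumptions, and a polite crawler would also want to throttle itself:

```python
import xarray as xr
from siphon.catalog import TDSCatalog

def crawl_opendap_urls(cat):
    """Recursively yield OPeNDAP URLs from a catalog and all nested catalogs."""
    for ds in cat.datasets.values():
        if 'OPENDAP' in ds.access_urls:
            yield ds.access_urls['OPENDAP']
    for ref in cat.catalog_refs.values():
        # follow() fetches the referenced catalog, e.g. a per-month or
        # per-day folder further down the tree.
        yield from crawl_opendap_urls(ref.follow())

top = TDSCatalog('https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/NARR/catalog.xml')
ds = xr.open_mfdataset(list(crawl_opendap_urls(top)),
                       combine='by_coords', chunks={'time': 1})
```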
From there, the dream is for the pretty Jupyter representation of the dataset to become a discovery engine for the dataset's structure and scope, with subsetting done in the familiar xarray syntax. Compare that to clicking through the web interface and/or programmatically querying for which variables and time/space spans are present; the activation energy is especially large when a dataset is unfamiliar. Probably also related to #292.
I wonder if intake-thredds can do this already?
I'm puzzled by .to_dask() being the key line in intake-thredds, with the result being an xarray dataset. But looking now at dask.org, it is vast -- I thought it was just a parallelization thing, unnecessary for "small" users like me, but no: it seems bigger (higher-level) than that, a wrapper for so many things. Will dask obviate the xarray and siphon layers, with their many cells' worth of steps, just as xarray has superseded explicit calls to netCDF4? That could be nice. Siphon involves more cells (and calls) than I wish I needed to make. For instance (and specifically for today's workflow), there's the auto-generated Jupyter notebook from here. Could a one-cell notebook using xarray do the same job?
@brianmapes Dask is about making virtualized numpy-compatible arrays that wrap chunks of data. This allows you to have arrays larger than fit in memory. In order to work with that efficiently, operations on those arrays are deferred until the computation of those results is needed, at which point the entire graph of computations is executed (to manage data being loaded into and out of memory). It also just so happens that once you have all that, distributed computing falls out of it.

None of that obviates siphon or xarray. Siphon is about interacting with a THREDDS Data Server: parsing catalogs and speaking the needed protocols. Xarray provides a nice pandas-like interface on top of the netCDF data model, helping give us some of the niceties of netCDF-Java's CDM.

Regarding the link, as far as the error is concerned, that's NCAR RDA's server, so you'll have to take it up with them; we don't have any input there. In general, writing generic code to work with all possible data dimensionality is non-trivial, though. I agree Siphon ends up being a bit more verbose, but in this case that's partially caused by the RDA not being set up to use Grib Collections, instead only exposing the individual files. Compare that with using the "Best", "Full Collection", or "Latest" links on Unidata's demonstration TDS. It's possible this is easier using intake-thredds, but I haven't experimented with it. For siphon, you can open that entire collection with xarray using:

```python
cat = TDSCatalog('https://thredds.ucar.edu/thredds/catalog/grib/NCEP/GFS/Global_0p25deg/catalog.xml')
xarray_dataset = cat.datasets['Best GFS Quarter Degree Forecast Time Series'].remote_access(use_xarray=True)
```

We've done our best to simplify that as much as possible. Concrete suggestions on how to make that even easier (and work generally) are always welcome. The core of the problem is that at some point somebody has to make and maintain a catalog of the data. TDS provides this directly in the
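To make the deferred-computation point above concrete, here is a tiny self-contained Dask sketch; the array shapes are arbitrary and this is not tied to siphon or the TDS:

```python
import dask.array as da

# A virtual ~8 GB array described only by a task graph; nothing is allocated yet.
x = da.ones((1000, 1000, 1000), chunks=(1000, 1000, 100))

# Still lazy: this only extends the graph with a reduction.
total = x.sum()

# Only now does the graph execute, chunk by chunk, managing memory as it goes.
result = total.compute()
```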
Lucid, thanks! I'll try this catalog-level access.
@brianmapes The only help I can offer is that if you're willing to live with the time span of data available on thredds.ucar.edu, the "Full Collection (Reference / Forecast Time) Dataset" from the Grib Collection should let you query for data from a given valid/verification time. We're getting off topic on this issue, so if we need more discussion, let's take it to a new discussion on this repo.
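As a hedged illustration of that suggestion (the catalog URL is reused from the earlier comment, the dataset label is the one quoted above, and the selection step is only described, not spelled out):

```python
from siphon.catalog import TDSCatalog

cat = TDSCatalog('https://thredds.ucar.edu/thredds/catalog/grib/NCEP/GFS/Global_0p25deg/catalog.xml')

# The Full Collection dataset exposes both reference (model run) time and
# forecast hour, so a given valid time can then be picked out on the xarray side.
full = cat.datasets['Full Collection (Reference / Forecast Time) Dataset']
ds = full.remote_access(use_xarray=True)
```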
Sorry for raising an open-ended / more-of-a-question issue; I figured this would be the best place to discuss it.
I have been playing with siphon recently and am really enjoying it! I would now like to use it in the following way.
Let's say we have a THREDDS server which stores many files from the same dataset, e.g. NARR daily data:
https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/NARR/Dailies/pressure/catalog.html
This is a very common scenario. Right now siphon lets me open an individual file into xarray via opendap. Instead, I would like to "open" the whole thing (all the files) lazily, and then do my subsetting from the xarray side. (Or maybe gobble the whole thing up in parallel using dask, but that's a later step.)
Here's what I came up with:
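Roughly along these lines; this is a hedged sketch rather than the exact snippet, and the 'OPENDAP' service key, chunk sizes, and combine argument are assumptions:

```python
import xarray as xr
from siphon.catalog import TDSCatalog

cat = TDSCatalog('https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/'
                 'NARR/Dailies/pressure/catalog.xml')

# One OPeNDAP URL per file-level dataset in the catalog (the 'OPENDAP'
# service key is the usual one on a TDS, but may differ on other servers).
urls = [d.access_urls['OPENDAP'] for d in cat.datasets.values()]

# Open everything lazily and line the files up along their coordinates.
# The chunks argument here is illustrative; the idea (see below) is to
# match each variable's advertised _ChunkSizes.
ds = xr.open_mfdataset(urls, combine='by_coords', chunks={'time': 1})
```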
which gives a single lazy xarray dataset spanning all of the files.
This seems to basically work. I chose the chunks to match the _ChunkSizes attribute of the data variables. My questions are:
What I have in mind is the ability to virtually open entire datasets from thredds in the same way we do with zarr. I have gotten used to this way of working and really like it. It is nice to just see the whole dataset--all timesteps and variables--from xarray.
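For comparison, the zarr workflow referred to here is a single lazy open; the store path below is a made-up example:

```python
import xarray as xr

# One call exposes the whole store lazily (every variable and timestep);
# data are only fetched when a subset is actually computed.
ds = xr.open_zarr('path/to/dataset.zarr', consolidated=True)
```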