⭐️ IBTRACS dataset adapter #493

kevinsantana11 · 2024-07-19T14:18:39Z


Associated with #493; also fixes the deployed dataset in the S3 repo. The code changes update the dataset URL in the `gdp1h` function to version 2.01.1 so that the latest version of the dataset is used. The previous URL was "https://noaa-oar-hourly-gdp-pds.s3.amazonaws.com/latest/gdp-v2.01.zarr" and it has been updated to "https://noaa-oar-hourly-gdp-pds.s3.amazonaws.com/latest/gdp-v2.01.1.zarr".

philippemiron · 2024-08-04T04:03:43Z

Is it normal that the tests take 40 minutes?

kevinsantana11 · 2024-08-06T02:12:17Z

> Is it normal that the tests take 40 minutes?

Not really, but sometimes the AOML servers get overloaded and the exponential backoff kicks in, which can make the tests take a really long time if the server can't recover in time.
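For readers unfamiliar with the mechanism, here is a minimal sketch of exponential backoff and why it can stretch a test run. It is not clouddrift's actual download code: `retry_with_backoff`, `total_wait`, `max_retries`, and `base_delay` are hypothetical names, and the injectable `sleep` argument exists only so the example can run without really waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, doubling the wait after each failed attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Waits grow as 1s, 2s, 4s, ... so a struggling server gets
            # progressively more time to recover.
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")


def total_wait(max_retries: int, base_delay: float = 1.0) -> float:
    """Worst-case time spent sleeping if every retry is needed."""
    return sum(base_delay * 2**attempt for attempt in range(max_retries))
```

Under these assumed settings, ten retries with a one-second base delay add up to 1023 seconds of waiting (about 17 minutes) for a single request that keeps failing, so a few such requests could plausibly account for a 40-minute run.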

kevinsantana11 · 2024-10-21T08:48:54Z

@selipot this PR is ready for review. I've also started on the example notebook repo for the dataset:

Cloud-Drift/ibtracs-get-started#1

selipot · 2024-10-23T14:35:37Z

clouddrift/datasets.py

+    xarray.Dataset
+        IBTRACS dataset as a ragged array.
+
+    Standard usage of the dataset.


Why this line? Should we add a simple example ds = ibtracs()?

selipot · 2024-10-23T14:44:43Z

I am not able to generate version v03r09. I get the following error:

from clouddrift.datasets import ibtracs
ds03 = ibtracs(version='v03r09')


https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v03r09/access
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/file_manager.py:211, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    210 try:
--> 211     file = self._cache[self._key]
    212 except KeyError:

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/lru_cache.py:56, in LRUCache.__getitem__(self, key)
     55 with self._lock:
---> 56     value = self._cache[key]
     57     self._cache.move_to_end(key)

KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('/var/folders/fx/qsnv05_94vs9qzp4p0qww8c00000gn/T/clouddrift/ibtracs/IBTrACS.last3years.v03r09.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), 'a22fcbb6-1058-4905-bd5c-fdc4c1e2e8ef']

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
Cell In[10], line 1
----> 1 ds03 = ibtracs(version='v03r09')

File ~/projects.git/clouddrift/clouddrift/datasets.py:404, in ibtracs(version, kind, tmp_path, decode_times)
    370 def ibtracs(
    371     version: _Version = "v04r01",
    372     kind: _Kind = "LAST_3_YEARS",
    373     tmp_path: str = adapters.ibtracs._DEFAULT_FILE_PATH,
    374     decode_times: bool = True,
    375 ) -> xr.Dataset:
    376     """Returns International Best Track Archive for Climate Stewardship (IBTrACS) as a ragged array xarray dataset.
    377 
    378     The function will first look for the ragged array dataset on the local
   (...)
    402     Standard usage of the dataset.
    403     """
--> 404     return _dataset_filecache(
    405         f"ibtracs_ra_{version}_{kind}.nc",
    406         decode_times,
    407         lambda: adapters.ibtracs.to_raggedarray(version, kind, tmp_path),
    408     )

File ~/projects.git/clouddrift/clouddrift/datasets.py:752, in _dataset_filecache(filename, decode_times, get_ds)
    749 os.makedirs(os.path.dirname(fp), exist_ok=True)
    751 if not os.path.exists(fp):
--> 752     ds = get_ds()
    753     if ext == ".nc":
    754         ds.to_netcdf(fp)

File ~/projects.git/clouddrift/clouddrift/datasets.py:407, in ibtracs.<locals>.<lambda>()
    370 def ibtracs(
    371     version: _Version = "v04r01",
    372     kind: _Kind = "LAST_3_YEARS",
    373     tmp_path: str = adapters.ibtracs._DEFAULT_FILE_PATH,
    374     decode_times: bool = True,
    375 ) -> xr.Dataset:
    376     """Returns International Best Track Archive for Climate Stewardship (IBTrACS) as a ragged array xarray dataset.
    377 
    378     The function will first look for the ragged array dataset on the local
   (...)
    402     Standard usage of the dataset.
    403     """
    404     return _dataset_filecache(
    405         f"ibtracs_ra_{version}_{kind}.nc",
    406         decode_times,
--> 407         lambda: adapters.ibtracs.to_raggedarray(version, kind, tmp_path),
    408     )

File ~/projects.git/clouddrift/clouddrift/adapters/ibtracs.py:90, in to_raggedarray(version, kind, tmp_path)
     87 dst_path = os.path.join(tmp_path, filename)
     88 download_with_progress([(src_url, dst_path)])
---> 90 ds = xr.open_dataset(dst_path, engine="netcdf4")
     91 ds = ds.rename_dims({"date_time": "obs"})
     93 vars = list[Hashable]()

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/api.py:611, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    599 decoders = _resolve_decoders_kwargs(
    600     decode_cf,
    601     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    607     decode_coords=decode_coords,
    608 )
    610 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 611 backend_ds = backend.open_dataset(
    612     filename_or_obj,
    613     drop_variables=drop_variables,
    614     **decoders,
    615     **kwargs,
    616 )
    617 ds = _dataset_from_backend_dataset(
    618     backend_ds,
    619     filename_or_obj,
   (...)
    629     **kwargs,
    630 )
    631 return ds

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:649, in NetCDF4BackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, format, clobber, diskless, persist, lock, autoclose)
    628 def open_dataset(  # type: ignore[override]  # allow LSP violation, not supporting **kwargs
    629     self,
    630     filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
   (...)
    646     autoclose=False,
    647 ) -> Dataset:
    648     filename_or_obj = _normalize_path(filename_or_obj)
--> 649     store = NetCDF4DataStore.open(
    650         filename_or_obj,
    651         mode=mode,
    652         format=format,
    653         group=group,
    654         clobber=clobber,
    655         diskless=diskless,
    656         persist=persist,
    657         lock=lock,
    658         autoclose=autoclose,
    659     )
    661     store_entrypoint = StoreBackendEntrypoint()
    662     with close_on_error(store):

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:410, in NetCDF4DataStore.open(cls, filename, mode, format, group, clobber, diskless, persist, lock, lock_maker, autoclose)
    404 kwargs = dict(
    405     clobber=clobber, diskless=diskless, persist=persist, format=format
    406 )
    407 manager = CachingFileManager(
    408     netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
    409 )
--> 410 return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:357, in NetCDF4DataStore.__init__(self, manager, group, mode, lock, autoclose)
    355 self._group = group
    356 self._mode = mode
--> 357 self.format = self.ds.data_model
    358 self._filename = self.ds.filepath()
    359 self.is_remote = is_remote_uri(self._filename)

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:419, in NetCDF4DataStore.ds(self)
    417 @property
    418 def ds(self):
--> 419     return self._acquire()

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:413, in NetCDF4DataStore._acquire(self, needs_lock)
    412 def _acquire(self, needs_lock=True):
--> 413     with self._manager.acquire_context(needs_lock) as root:
    414         ds = _nc4_require_group(root, self._group, self._mode)
    415     return ds

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/contextlib.py:137, in _GeneratorContextManager.__enter__(self)
    135 del self.args, self.kwds, self.func
    136 try:
--> 137     return next(self.gen)
    138 except StopIteration:
    139     raise RuntimeError("generator didn't yield") from None

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/file_manager.py:199, in CachingFileManager.acquire_context(self, needs_lock)
    196 @contextlib.contextmanager
    197 def acquire_context(self, needs_lock=True):
    198     """Context manager for acquiring a file."""
--> 199     file, cached = self._acquire_with_cache_info(needs_lock)
    200     try:
    201         yield file

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/file_manager.py:217, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    215     kwargs = kwargs.copy()
    216     kwargs["mode"] = self._mode
--> 217 file = self._opener(*self._args, **kwargs)
    218 if self._mode == "w":
    219     # ensure file doesn't get overridden when opened again
    220     self._mode = "a"

File src/netCDF4/_netCDF4.pyx:2470, in netCDF4._netCDF4.Dataset.__init__()

File src/netCDF4/_netCDF4.pyx:2107, in netCDF4._netCDF4._ensure_nc_success()

OSError: [Errno -51] NetCDF: Unknown file format: '/var/folders/fx/qsnv05_94vs9qzp4p0qww8c00000gn/T/clouddrift/ibtracs/IBTrACS.last3years.v03r09.nc'
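
The thread does not establish the cause of this `OSError`. One plausible explanation, offered here only as an assumption, is that the file saved under that name is not a netCDF file at all (for example, an HTML error page returned for a URL that does not exist for this version). Checking the file's leading bytes can confirm or rule this out. `sniff_format` is a hypothetical helper, not part of clouddrift or xarray.

```python
# Leading bytes of the formats the netCDF4 library can open.
_HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # netCDF-4 files are HDF5 containers
_CLASSIC_MAGIC = (b"CDF\x01", b"CDF\x02", b"CDF\x05")  # netCDF-3 variants


def sniff_format(path: str) -> str:
    """Classify a file by its first bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(_HDF5_MAGIC):
        return "netcdf4/hdf5"
    if head.startswith(_CLASSIC_MAGIC):
        return "netcdf3"
    # A web server's error page typically starts with an HTML tag.
    if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
        return "html"
    return "unknown"
```

Running it on the path reported in the error message would show whether the download step, rather than `xr.open_dataset`, is where things went wrong.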

selipot · 2024-10-23T15:14:38Z

clouddrift/datasets.py

+
+    Parameters
+    ----------
+    version : "v03r09", "v04r00", "v04r01" (default)


drop support of version 3

selipot · 2024-10-23T15:20:06Z

clouddrift/datasets.py

+    ----------
+    version : "v03r09", "v04r00", "v04r01" (default)
+        Specify the dataset version to retrieve. Default to the latest version.
+    kind: "ACTIVE", "ALL", "EP", "NA", "NI", "SA", "SI", "SP", "WP", "SINCE_1980", "LAST_3_YEARS" (default)


Does using "ACTIVE" or "LAST_3_YEARS" re-generate the ragged array or not? I think it should. So maybe disable caching for this dataset?
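
One way to address this question, sketched here as an assumption rather than what the PR implements, is to keep the file cache for fixed subsets but treat a cached copy of a rolling subset as stale once it passes some age. `should_regenerate`, `_ROLLING_KINDS`, and `max_age_seconds` are hypothetical names; the set of rolling kinds is inferred from the names only.

```python
import os
import time
from typing import Optional

# Subsets assumed to change upstream as new storms are added
# (inferred from their names; the thread does not confirm this list).
_ROLLING_KINDS = {"ACTIVE", "LAST_3_YEARS"}


def should_regenerate(
    kind: str,
    cache_path: str,
    max_age_seconds: float = 24 * 3600,
    now: Optional[float] = None,
) -> bool:
    """Decide whether the cached ragged array should be rebuilt."""
    if not os.path.exists(cache_path):
        return True  # nothing cached yet
    if kind not in _ROLLING_KINDS:
        return False  # fixed subsets can be served from the cache
    now = time.time() if now is None else now
    age = now - os.path.getmtime(cache_path)
    return age > max_age_seconds
```

Setting `max_age_seconds=0` for the rolling kinds would amount to disabling the cache for them, which is the option raised above.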

selipot · 2024-11-15T14:05:17Z

@kevinsantana11 what else do we need to do to merge?

kevinsantana11 · 2025-01-08T15:15:12Z

Add docstring to the to_raggedarray function, move function to top

kevinsantana11 · 2025-01-16T04:03:15Z

Found some issues when comparing the ragged and un-ragged dataset. Still need to do further investigation but I've added a test that should pass once the issue is fixed.

kevinsantana11 · 2025-01-16T22:22:37Z

> Found some issues when comparing the ragged and un-ragged dataset. Still need to do further investigation but I've added a test that should pass once the issue is fixed.

On further investigation, it turns out there is no discrepancy between the ragged and original datasets. Last night I noticed that the data variables differed in length, but I later realized that the original dataset's data variables span the dataset's entire observation datetime range. When the variables are compared up to the trimmed length of the original array, the test passes, since the ragged and original data arrays contain the same data.

I added this validation to the tests, along with a check that the remainder of each variable contains only NaN values.
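
The comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the test added in the PR: `matches_up_to_trimmed_length` is a hypothetical name, and the sketch assumes float-valued variables where padding is represented as NaN.

```python
import numpy as np


def matches_up_to_trimmed_length(
    original_row: np.ndarray, ragged_row: np.ndarray
) -> bool:
    """Compare one storm's padded row against its ragged counterpart."""
    n = len(ragged_row)  # number of real observations for this storm
    head, tail = original_row[:n], original_row[n:]
    # The first n values must agree (NaNs in matching positions count as equal).
    same_data = np.array_equal(head, ragged_row, equal_nan=True)
    # Everything past the trimmed length must be padding.
    only_padding = bool(np.all(np.isnan(tail)))
    return same_data and only_padding
```

For example, a padded row `[1.0, 2.0, nan, nan]` matches the ragged row `[1.0, 2.0]`, but not `[1.0]`, because the remainder would then contain a non-NaN value.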

kevinsantana11 changed the title ~~⭐️ IBTRACS Dataset~~ ⭐️ IBTRACS dataset adapter Jul 19, 2024

kevinsantana11 force-pushed the ibtracs branch from 3166863 to 6c92b9b Compare August 1, 2024 04:05

kevinsantana11 force-pushed the ibtracs branch from 3d68f63 to 470b5e1 Compare August 9, 2024 04:10

kevinsantana11 force-pushed the ibtracs branch from 25ce09a to bc000cc Compare August 24, 2024 04:37

kevinsantana11 marked this pull request as ready for review October 21, 2024 08:48

philippemiron requested a review from selipot October 21, 2024 14:10

selipot reviewed Oct 23, 2024


kevinsantana11 added 2 commits January 15, 2025 19:36

ibtracs adapter, fix merge conflicts

acc5139

fix imports

b814b06

kevinsantana11 force-pushed the ibtracs branch from 8f1bfb1 to b814b06 Compare January 16, 2025 03:48

kevinsantana11 added 2 commits January 15, 2025 19:49

fmt

829a27d

test entire data array

f4cc541

kevinsantana11 added 2 commits January 16, 2025 14:12

Check up to trimmed length

aff7823

fmt

2417ce1
