generated from NOAA-GFDL/template-repository
commit 629bb4a
Showing 52 changed files with 5,763 additions and 0 deletions.
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file records the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: de08cdb371102642c7904f804427fe18 | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
@@ -0,0 +1,38 @@ | ||
Background | ||
========== | ||
|
||
The catalog builder project is a “python community package ecosystem” that allows you to generate data catalogs compatible with intake-esm. It is available as a Conda package.
|
||
See our `GitHub repository here <https://github.com/aradhakrishnanGFDL/CatalogBuilder>`_.
Our contributing guidelines and code of conduct are documented in the GitHub repository. We welcome your contributions.
|
||
Brief overview on data catalogs | ||
------------------------------- | ||
|
||
Data catalogs enable "data discoverability" regardless of the data format (zarr, netcdf). We acknowledge the different community collaborations (`Pangeo/ESGF Cloud Data working group <https://pangeo-data.github.io/pangeo-cmip6-cloud/>`_) that led us to explore this further. | ||
|
||
Data catalogs in this project have three components: the catalog specification, the catalog itself, and the Intake-ESM API. The specification and the catalog are generated by the catalog builder API, and the Intake-ESM API makes use of both. Read more about `Intake-ESM here <https://intake-esm.readthedocs.io/en/stable/>`_.
|
||
Catalog Specification | ||
|
||
- Describes what we expect to find inside the catalog and how to open the “datasets”/objects
- Provides metadata about the catalog | ||
- Identifies how multiple files can be aggregated into a single “dataset” | ||
- Support for extensible metadata | ||
- Single JSON file | ||
|
||
Catalogs | ||
|
||
- Tells us more about the data collection | ||
- Path to the files (objects), and associated metadata. | ||
- CSV file | ||
- User-defined granularity | ||
|
||
Intake-ESM API | ||
|
||
- Opens possibilities to QUERY and ANALYZE | ||
- Provides a pythonic way to “query” for information in the catalogs | ||
- Loads the results in an xarray dataset object | ||
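
To make the three components concrete, the sketch below writes a tiny catalog (CSV) and a matching catalog specification (JSON) using only the Python standard library. The JSON field names follow the ESM collection specification used by Intake-ESM. The file names, the catalog id, the reduced column set, and the example path are all hypothetical, and this is not how the catalog builder itself is implemented.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical output location and catalog name, for illustration only.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "demo_catalog.csv"
json_path = out_dir / "demo_catalog.json"

# Catalog: one row per file (object), holding its metadata and its path.
columns = ["source_id", "experiment_id", "realm", "frequency", "variable_id", "path"]
rows = [
    {
        "source_id": "ESM4",
        "experiment_id": "ESM4_1pctCO2_D1",
        "realm": "atmos",
        "frequency": "monthly",
        "variable_id": "evap",
        "path": "/hypothetical/path/atmos.000101-000512.evap.nc",
    }
]
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)

# Catalog specification: names the metadata columns, says how to open the
# assets, and says how multiple files are aggregated into one dataset.
spec = {
    "esmcat_version": "0.1.0",
    "id": "demo_catalog",
    "description": "Illustrative catalog specification",
    "catalog_file": str(csv_path),
    "attributes": [{"column_name": c} for c in columns if c != "path"],
    "assets": {"column_name": "path", "format": "netcdf"},
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": ["source_id", "experiment_id", "realm", "frequency"],
        "aggregations": [{"type": "union", "attribute_name": "variable_id"}],
    },
}
json_path.write_text(json.dumps(spec, indent=2))
```

The single JSON file points at the CSV through ``catalog_file``, which is what lets the Intake-ESM API open both with one path.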
|
||
|
||
|
@@ -0,0 +1,187 @@ | ||
Generating data catalogs | ||
======================== | ||
|
||
There are a few ways to use the catalog builder. | ||
|
||
Installation | ||
------------ | ||
|
||
Recommended approach: Install as a `conda package <https://anaconda.org/NOAA-GFDL/catalogbuilder>`_ | ||
|
||
.. code-block:: console

   conda install catalogbuilder -c noaa-gfdl

Alternatively, you may clone the `git repository <https://github.com/NOAA-GFDL/CatalogBuilder>`_
and create your conda environment using the `environment.yml <https://github.com/NOAA-GFDL/CatalogBuilder/blob/main/environment.yml>`_ in the git repository.
|
||
.. code-block:: console

   git clone https://github.com/NOAA-GFDL/CatalogBuilder
   conda env create -f environment_intake.yml

Expected output
---------------
|
||
A JSON catalog specification file and a CSV catalog are written to the specified output directory with the specified name.
|
||
Using conda package | ||
------------------- | ||
|
||
**1. Install the package using conda:** | ||
|
||
.. code-block:: console

   conda install catalogbuilder -c noaa-gfdl

If you are trying these steps from GFDL, you will likely need some additional configuration to get this to work. See below.
|
||
Add these to your ~/.condarc file | ||
|
||
whitelist_channels: | ||
- noaa-gfdl | ||
- conda-forge | ||
- anaconda | ||
channels: | ||
- noaa-gfdl | ||
- conda-forge | ||
- anaconda | ||
|
||
(You can also try ``conda config --add channels noaa-gfdl`` followed by ``conda config --append channels conda-forge``.)
|
||
If you encounter an error such as "ChecksumMismatchError: Conda detected a mismatch between the expected..", do the following:
|
||
conda config --add pkgs_dirs /local2/home/conda/pkgs | ||
conda config --add envs_dirs /local2/home/conda/envs | ||
|
||
**2. Add conda environment's site packages to PATH** | ||
|
||
See example below. | ||
|
||
.. code-block:: console

   setenv PATH ${PATH}:${CONDA_PREFIX}/lib/python3.1/site-packages/scripts/

**3. Call the builder**
|
||
Catalogs are generated by the following command: *gen_intake_gfdl.py <INPUT_PATH> <OUTPUT_PATH>* | ||
|
||
The output path argument should end with the desired output filename WITHOUT a file ending. See example below.
|
||
.. code-block:: console

   gen_intake_gfdl.py /archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/gfdl.ncrc5-deploy-prod-openmp/pp $HOME/catalog

This would create a catalog.csv and catalog.json in the user's home directory.
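
As a small illustration of the naming rule, the sketch below derives the two output file names from an extension-less output path. The helper name ``expected_outputs`` and the example path are hypothetical; the builder performs its own handling internally, and this only mirrors the behaviour described above.

```python
from pathlib import Path


def expected_outputs(output_path):
    """Return the (csv, json) paths implied by an extension-less OUTPUT_PATH."""
    base = Path(output_path).expanduser()
    # The docs ask for a name WITHOUT a file ending, so reject obvious mistakes.
    if base.suffix in {".csv", ".json"}:
        raise ValueError("OUTPUT_PATH should not include a file ending")
    return Path(f"{base}.csv"), Path(f"{base}.json")


# Hypothetical output path, for illustration only.
csv_file, json_file = expected_outputs("/home/user/catalog")
print(csv_file.name, json_file.name)  # catalog.csv catalog.json
```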
|
||
.. image:: _static/ezgif-4-786144c287.gif | ||
:width: 1000px | ||
:alt: Catalog generation demonstration | ||
|
||
See `Flags`_ here. | ||
|
||
Using a configuration file | ||
-------------------------- | ||
|
||
We recommend using a configuration file to provide input to the catalog builder. This is necessary and useful if you want to work with datasets and directories that do *not quite* follow the GFDL post-processed directory layout.
|
||
`Here <https://github.com/NOAA-GFDL/CatalogBuilder/blob/main/catalogbuilder/tests/config-cfname.yaml>`_ is an example configuration file. | ||
|
||
Catalog headers (column names) are set with the ``headerlist`` variable. The ``output_path_template`` variable describes the expected directory structure of the input data.
|
||
.. code-block:: yaml

   #Catalog Headers
   headerlist: ["activity_id", "institution_id", "source_id", "experiment_id",
                "frequency", "realm", "table_id",
                "member_id", "grid_label", "variable_id",
                "time_range", "chunk_freq","platform","dimensions","cell_methods","standard_name","path"]

The headerlist gives the expected column names in your catalog/CSV file. This is usually determined by the users in conjunction
with the ESM collection specification standards and the appropriate workflows.
|
||
.. code-block:: yaml

   #Directory structure information
   output_path_template: ['NA','NA','source_id','NA','experiment_id','platform','custom_pp','realm','cell_methods','frequency','chunk_freq']

For a directory structure like /archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/gfdl.ncrc5-deploy-prod-openmp/pp
the output_path_template is set as above. We use NA for positions that do not match any of the expected headerlist values (CSV columns); otherwise we
simply specify the associated header name in the appropriate place. For example, the third directory in the PP path above is the model (source_id), so the third list value in output_path_template is set to 'source_id'. We make sure this is a valid value in headerlist as well. The fourth directory is am5f3b1r0, which does not map to an existing header value, so we simply set NA in output_path_template for the fourth value.
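
The positional mapping described above can be sketched in a few lines of Python. The helper ``map_path_to_headers`` is hypothetical and is not the builder's own code, and the directories below ``pp`` in the example path (``atmos/ts/monthly/5yr``) are assumed for illustration.

```python
def map_path_to_headers(path, template):
    """Pair each directory in `path` with the header at the same position.

    Positions marked 'NA' in the template are skipped.
    """
    parts = [p for p in path.split("/") if p]
    return {
        header: part
        for header, part in zip(template, parts)
        if header != "NA"
    }


output_path_template = ['NA', 'NA', 'source_id', 'NA', 'experiment_id', 'platform',
                        'custom_pp', 'realm', 'cell_methods', 'frequency', 'chunk_freq']

# Everything after .../pp is an assumed, illustrative continuation of the path.
example_dir = ("/archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/"
               "gfdl.ncrc5-deploy-prod-openmp/pp/atmos/ts/monthly/5yr")

mapped = map_path_to_headers(example_dir, output_path_template)
print(mapped["source_id"])      # am5
print(mapped["experiment_id"])  # c96L65_am5f3b1r0_pdclim1850F
```

Because the mapping is purely positional, a template entry in the wrong slot would silently label a directory with the wrong header, which is why each non-NA value should also appear in headerlist.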
|
||
.. code-block:: yaml

   #Filename information
   output_file_template: ['realm','temporal_subset','variable_id']
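
The filename template works the same way as the directory template, but on the parts of the file name. The sketch below assumes a dot-separated file name with a ``.nc`` ending; both the example file name and the helper ``map_filename_to_headers`` are hypothetical and only illustrate the idea.

```python
def map_filename_to_headers(filename, template):
    """Split a dot-separated file name and label each part using `template`."""
    # Drop a trailing ".nc" ending, if present, before splitting.
    stem = filename[:-3] if filename.endswith(".nc") else filename
    return dict(zip(template, stem.split(".")))


output_file_template = ['realm', 'temporal_subset', 'variable_id']

# Hypothetical file name, for illustration only.
info = map_filename_to_headers("atmos.000101-000512.evap.nc", output_file_template)
print(info["variable_id"])  # evap
```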
.. code-block:: yaml

   #Input directory and output info
   input_path: "/archive/am5/am5/am5f7b10r0/c96L65_am5f7b10r0_amip/gfdl.ncrc5-deploy-prod-openmp/pp/"
   # Name of the output CSV and JSON, without a file ending. This can be an absolute or a relative path.
   output_path: "/home/a1r/github/noaa-gfdl/catalogs/c96L65_am5f7b10r0_amip"

From a Python script
--------------------
Do you have a python script or a notebook where you could also include steps to generate a data catalog? | ||
See example `here <https://github.com/NOAA-GFDL/CatalogBuilder/blob/main/catalogbuilder/scripts/gen_intake_gfdl_runner_config.py>`_ | ||
|
||
Here is another example | ||
|
||
.. code-block:: python

   #!/usr/bin/env python
   # TODO: test after the conda package is published and make changes as needed
   import sys

   from catalogbuilder.scripts import gen_intake_gfdl

   input_path = "/archive/am5/am5/am5f3b1r0/c96L65_am5f3b1r0_pdclim1850F/gfdl.ncrc5-deploy-prod-openmp/pp"
   output_path = "test"
   try:
       gen_intake_gfdl.create_catalog(input_path, output_path)
   except Exception:
       sys.exit("Exception occurred calling gen_intake_gfdl.create_catalog")

From Jupyter Notebook
---------------------
|
||
Refer to this `notebook <https://github.com/NOAA-GFDL/CatalogBuilder/blob/main/catalogbuilder/scripts/gen_intake_gfdl_notebook.ipynb>`_ to see how you can generate catalogs from a Jupyter Notebook.
|
||
|
||
.. image:: _static/catalog_generation.png | ||
:alt: Screenshot of a notebook showing catalog generation | ||
|
||
|
||
Using FRE-CLI (GFDL only) | ||
------------------------- | ||
|
||
**1. Activate conda environment** | ||
|
||
.. code-block:: console

   conda activate /nbhome/fms/conda/envs/fre-cli

**2. Call the builder**
|
||
Catalogs are generated by the following command: *fre catalog buildcatalog <INPUT_PATH> <OUTPUT_PATH>* | ||
|
||
(OUTPUT_PATH should end with the desired output filename WITHOUT a file ending) See example below. | ||
|
||
.. code-block:: console

   fre catalog buildcatalog --overwrite /archive/path_to_data_dir ~/output

See `Flags`_ here.
|
||
See `Fre-CLI Documentation here <https://noaa-gfdl.github.io/fre-cli/>`_ | ||
|
||
|
||
Flags | ||
_____ | ||
|
||
.. Reference `Flags`_.

- ``--overwrite`` - Overwrite an existing catalog at the given output path
- ``--append`` - Append (without headerlist) to an existing catalog at the given output path
@@ -0,0 +1,30 @@ | ||
.. Catalog Builder documentation master file, created by
   sphinx-quickstart on Wed Feb 14 00:31:23 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to Catalog Builder's documentation!
===========================================
|
||
|
||
The Catalog Builder API collects the building blocks necessary to build a data catalog, which can then be ingested in climate analysis scripts and workflows using intake-esm and xarray.
|
||
At this time, the builder has been tested on POSIX file systems, S3, and GFDL post-processed output (select simulations and components). This repository has unit tests (pytest), which run in GitHub Actions whenever a PR is opened or a push is made.
|
||
See our `GitHub repository <https://github.com/aradhakrishnanGFDL/CatalogBuilder>`_ here.
|
||
.. toctree::
   :maxdepth: 2
   :caption: Contents:

   background
   generation
   usage
   presentation
|
||
Indices and tables | ||
================== | ||
|
||
* :ref:`genindex` | ||
* :ref:`modindex` | ||
* :ref:`search` |
@@ -0,0 +1,8 @@ | ||
Presentation links | ||
================== | ||
`March 12th GFDL webinar <https://github.com/aradhakrishnanGFDL/CatalogBuilder/blob/main/doc/_static/data%20catalog%20webinar%20slides.pdf>`_ | ||
|
||
`Introduction to community-developed data exploration for earth system and beyond <https://github.com/NOAA-GFDL/CatalogBuilder/blob/47-intro-docs/doc/_static/Introduction%20to%20catalogs%20and%20intake-esm%20.pdf>`_ | ||
|
||
`How to use the Catalog Builder <https://github.com/NOAA-GFDL/CatalogBuilder/blob/AddingSlides/doc/_static/How-To-Use-The-Catalog-Builder.pdf>`_ | ||
|
@@ -0,0 +1,77 @@ | ||
Using data catalogs | ||
=================== | ||
|
||
Catalogs provide a level of indexing that can greatly speed up data discovery, so usability is a priority. All catalogs generated by Catalog Builder are accompanied by an Intake-ESM compatible JSON file.
|
||
Example notebooks | ||
------------------ | ||
|
||
We are collecting examples that use the Intake-ESM API with the catalogs generated by our catalog builder `here <https://github.com/aradhakrishnanGFDL/canopy-cats>`_. Please open an issue and contribute! | ||
|
||
Community examples | ||
|
||
- `An example from the MDTF <https://nbviewer.org/github/wrongkindofdoctor/MDTF-diagnostics/blob/refactor_pp/diagnostics/example_multicase/example_multirun_demo.ipynb>`_ | ||
`GitHub reference <https://github.com/wrongkindofdoctor/MDTF-diagnostics/blob/refactor_pp/diagnostics/example_multicase/example_multirun_demo.ipynb>`_ | ||
|
||
- `An example using a GFDL experiment (dev only) <https://nbviewer.org/github/aradhakrishnanGFDL/canopy-cats/blob/main/notebooks/om_example.ipynb>`_ | ||
`GitHub reference <https://github.com/aradhakrishnanGFDL/canopy-cats/blob/main/notebooks/om_example.ipynb>`_ | ||
|
||
- `Generate Intake-ESM example that uses our CMIP6 collection in the AWS cloud <https://github.com/aradhakrishnanGFDL/gfdl-aws-analysis>`_ | ||
- `Examples from DKRZ <https://easy.gems.dkrz.de/Processing/Intake/index.html>`_ | ||
- `Pangeo CMIP6 gallery <https://gallery.pangeo.io/repos/pangeo-gallery/cmip6/intake_ESM_example.html>`_ | ||
- `CIMES Intern projects <https://github.com/MackenzieBlanusa/OHC_CMIP6>`_ | ||
- `Student work <https://github.com/aradhakrishnanGFDL/AGU-rmonge/>`_ | ||
|
||
|
||
How to ingest using Intake-ESM | ||
------------------------------ | ||
|
||
**Import the packages your Python analysis needs. Only intake and intake-esm are necessary for data exploration with the intake-esm package.**
|
||
.. code-block:: python

   import xarray as xr
   import intake
   import intake_esm
   import matplotlib
   from matplotlib import pyplot as plt
   %matplotlib inline

**Set collection file variable (col_url) to JSON path**
|
||
We must provide Intake-ESM with a path to an ESM compatible collection file (JSON). This JSON establishes a link to the generated catalog. | ||
|
||
.. code-block:: python

   col_url = "<path-to-JSON>"
   # E.g.: col_url = "cats/gfdl_test1.json"
   # The template we use for current testing and for MDTF is here:
   # https://github.com/aradhakrishnanGFDL/CatalogBuilder/blob/main/cats/gfdl_template.json
   col = intake.open_esm_datastore(col_url)

**Set search parameters**
|
||
Search parameters can be set to find specific files. Here, we search for a file using keys such as the experiment name and modeling realm. | ||
|
||
.. code-block:: python

   expname_filter = ['ESM4_1pctCO2_D1']
   modeling_realm = 'atmos'
   model_filter = 'ESM4'
   variable_id_filter = "evap"
   ens_filter = "r1i1p1f1"
   frequency = "monthly"
   chunk_freq = "5yr"

**Search the catalog**
|
||
Now, we execute our query: | ||
|
||
.. code-block:: python

   cat = col.search(experiment_id=expname_filter, frequency=frequency, modeling_realm=modeling_realm,
                    source_id=model_filter, variable_id=variable_id_filter)
   cat.df["path"][0]

Intake will return the path to the file(s) that match these search parameters.
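
Conceptually, a search like the one above is a row filter over the catalog CSV: a row is kept when every queried column matches the requested value, or one of the requested values. The standard-library sketch below illustrates that idea only; it is not Intake-ESM's implementation, and the catalog rows, paths, and the helper ``search_rows`` are hypothetical.

```python
import csv
import io

# A tiny in-memory catalog; all rows and paths are hypothetical.
catalog_csv = """source_id,experiment_id,modeling_realm,frequency,variable_id,path
ESM4,ESM4_1pctCO2_D1,atmos,monthly,evap,/hypothetical/atmos.000101-000512.evap.nc
ESM4,ESM4_1pctCO2_D1,atmos,monthly,precip,/hypothetical/atmos.000101-000512.precip.nc
ESM4,ESM4_historical_D1,ocean,monthly,tos,/hypothetical/ocean.000101-000512.tos.nc
"""


def search_rows(rows, **query):
    """Keep rows whose column value equals, or is listed in, each query value."""
    def matches(row):
        for column, wanted in query.items():
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if row[column] not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


rows = list(csv.DictReader(io.StringIO(catalog_csv)))
hits = search_rows(rows, experiment_id=["ESM4_1pctCO2_D1"],
                   modeling_realm="atmos", variable_id="evap")
print(hits[0]["path"])
```

This is also why the keyword names passed to the search must match the column names in the catalog: a query on a column that the CSV does not contain has nothing to filter on.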