Merge pull request #18 from aodn/YmlLinter
(Feat) add YAML and JSON linter on github actions
lbesnard authored Jun 4, 2024
2 parents 79e3a13 + 16842cf commit df41e58
Showing 51 changed files with 6,773 additions and 754 deletions.
1 change: 0 additions & 1 deletion .github/workflows/build.yml
@@ -81,4 +81,3 @@ jobs:
- name: Verify build
run: |
pip install dist/*.whl
23 changes: 23 additions & 0 deletions .github/workflows/pre-commit.yml
@@ -0,0 +1,23 @@
name: Pre-commit

on: [push, pull_request]

jobs:
  pre-commit:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10.14'

      - name: Install pre-commit
        run: pip install pre-commit

      - name: Run pre-commit
        run: |
          pre-commit run --all-files
13 changes: 13 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,13 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: check-json
        exclude: aodn_cloud_optimised/config/dataset/dataset_template.json
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
      - id: black
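
For a sense of what the new `check-yaml` and `check-json` hooks enforce, here is a minimal parse-only validation sketch in Python (illustrative, not part of this change; it assumes PyYAML is available and that configs live under `aodn_cloud_optimised/config`):

```python
# Illustrative only: parse-check YAML/JSON files, similar in spirit to the
# check-yaml and check-json pre-commit hooks. Paths are assumptions.
import json
from pathlib import Path

import yaml  # provided by PyYAML

EXCLUDED = {Path("aodn_cloud_optimised/config/dataset/dataset_template.json")}


def lint_configs(root: str = "aodn_cloud_optimised/config") -> list[str]:
    errors = []
    for path in Path(root).rglob("*"):
        if path in EXCLUDED or path.suffix not in {".yaml", ".yml", ".json"}:
            continue
        try:
            text = path.read_text()
            yaml.safe_load(text) if path.suffix in {".yaml", ".yml"} else json.loads(text)
        except (yaml.YAMLError, json.JSONDecodeError) as exc:
            errors.append(f"{path}: {exc}")
    return errors


if __name__ == "__main__":
    for err in lint_configs():
        print(err)
```
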
25 changes: 18 additions & 7 deletions README.md
@@ -5,12 +5,23 @@ A tool to convert IMOS NetCDF files and CSV into Cloud Optimised format (Zarr/Pa


# Installation
## Users
Requirements:
* python >= 3.10.14

```bash
curl -s https://raw.githubusercontent.com/aodn/aodn_cloud_optimised/main/install.sh | bash
```

## Development
Requirements:
* Mamba from miniforge3: https://github.com/conda-forge/miniforge

```bash
mamba env create --file=environment.yml
mamba activate CloudOptimisedParquet

pip install -e . # development (edit mode)
pip install . # prod
poetry install
```
# Requirements
AWS SSO to push files to S3
@@ -36,7 +47,7 @@ AWS SSO to push files to S3
| Create AWS OpenData Registry Yaml | Done |
| Config file JSON validation against schema | Done |
| Create polygon variable to facilitate geometry queries | Done |

## Zarr Features
| Feature | Status | Comment |
|------------------------------------------------------------------------|--------|------------------------------------------------------------------------------------|
@@ -96,11 +107,11 @@ See [documentation](README_add_new_dataset.md) to learn how to add a new dataset

# Notebooks

Notebooks exist under
https://github.com/aodn/architecturereview/blob/main/cloud-optimised/cloud-optimised-team/parquet/notebooks/
Notebooks exist under
https://github.com/aodn/aodn_cloud_optimised/blob/main/notebooks/

For each new dataset, it is a good practice to use the provided template ```cloud-optimised/cloud-optimised-team/parquet/notebooks/template.ipynb```
For each new dataset, it is a good practice to use the provided template ```notebooks/template.ipynb```
and create a new notebook.

These notebooks use a common library of python functions to help with creating the geo-spatial filters:
```cloud-optimised/cloud-optimised-team/parquet/notebooks/parquet_queries.py```
```notebooks/parquet_queries.py```
26 changes: 12 additions & 14 deletions README_add_new_dataset.md
@@ -1,5 +1,5 @@
This module aims to be generic enough so that adding a new IMOS dataset is driven through a JSON config file.
For more complicated datasets, such as Argo, it is also possible to create a specific handler which would
inherit with ```super()``` all of the methods of the ```aodn_cloud_optimised.lib.GenericParquetHandler.GenericHandler``` class.
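
As a rough illustration of that pattern, a dataset-specific handler might look like the sketch below; the class name and the extra method are hypothetical, only the import path comes from the text above:

```python
# Hypothetical sketch only: ArgoHandler and tweak_argo_specifics are made up
# to show the inheritance pattern; the import path is from the text above.
from aodn_cloud_optimised.lib.GenericParquetHandler import GenericHandler


class ArgoHandler(GenericHandler):
    """Dataset-specific handler that reuses all generic behaviour."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # inherit the generic set-up unchanged

    def tweak_argo_specifics(self, ds):
        # Hypothetical extra step for this dataset; everything else is
        # delegated to the generic parquet handler.
        return ds
```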

The main choice left to create a cloud optimised dataset with this module is to decide to either use the **Apache Parquet**
@@ -34,7 +34,7 @@ While developing the aodn_cloud_optimised, it became clear that for both zarr an
In this section, we're demonstrating how to create the full schema from a NetCDF file as an example, so that each variable
is defined, with its variable attributes and the type.

The following snippet creates the required schema from a random NetCDF file. ```generate_json_schema_from_s3_netcdf``` will output the schema as a JSON file in a temporary location.

```python
import os
@@ -145,14 +145,14 @@ the parquet dataset, the logs will output the json info to be added into the con

### Global attributes as variables
Some NetCDF global attributes may have to be converted into variables so that users/API can filter the data based on these
values.

In the following example, ```deployment_code``` is a global attribute that we want to have as a variable. It is then added
in the ```gattrs_to_variables```. **However**, this needs to also be present in the schema definition
so that:

```json
...
  "gattrs_to_variables": [
    "deployment_code"
  ],
@@ -165,7 +165,7 @@ so that:
```
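
To make the intent concrete, here is a small illustrative sketch, not the library's actual implementation, of what promoting a global attribute to a per-row variable amounts to (the input file name is made up; xarray is assumed):

```python
# Illustrative only: promote a NetCDF global attribute to a per-row column so
# it can be filtered on, which is what listing it under gattrs_to_variables
# is meant to achieve. The input file name is hypothetical.
import xarray as xr

ds = xr.open_dataset("IMOS_sample.nc")
df = ds.to_dataframe().reset_index()

# Repeat the global attribute value on every row so queries can filter on it.
df["deployment_code"] = ds.attrs.get("deployment_code")
```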

### Filename as variable
The IMOS/AODN data (re)processing is very file-oriented. In order to reprocess data and delete the old matching data,
the original filename is stored as a variable. It is required to add it in the schema definition:

```json
@@ -188,14 +188,14 @@ The following information needs to be added in the relevant sections:
```json
  "partition_keys": [
    "timestamp",
    ...
  ],
  "time_extent": {
    "time": "TIME",
    "partition_timestamp_period": "Q"
  },
  "schema":
    ...
    "timestamp": {
      "type": "int64"
    },
@@ -243,7 +243,7 @@ Force search for existing parquet files to delete when creating new ones. This c
"force_old_pq_del": true
```
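
As a rough illustration of the quarterly setting above (`"partition_timestamp_period": "Q"`), the `timestamp` partition key can be thought of as a per-quarter bucket derived from each record's `TIME`; the snippet below is an assumption about the idea rather than the library's actual code:

```python
# Illustrative only: bucket observation times into quarterly partition values,
# mirroring the idea behind "partition_timestamp_period": "Q".
import pandas as pd

times = pd.to_datetime(["2023-02-14T02:30:00", "2023-11-02T10:00:00"])

# Floor each time to the start of its quarter, then store as int64 epoch seconds.
quarter_start = times.to_period("Q").to_timestamp()
timestamp_key = quarter_start.asi8 // 10**9
print(dict(zip(times.astype(str), timestamp_key)))
```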

### AWS OpenData registry
In order to publicise the dataset on the OpenData Registry, add the following to the config. A ```yaml``` file will be
created/updated alongside the parquet dataset.

@@ -405,8 +405,8 @@ The name of a variable which will be used as a template to create missing variab

```

### Variables to drop
When setting `region` explicitly in the `to_zarr()` method, all variables in the dataset to write must have at least one
dimension in common with the region's dimensions ['TIME'].
We need to remove the variables from the dataset which do not meet this condition (see the sketch at the end of this section):
```json
@@ -417,7 +417,5 @@ We need to remove the variables from the dataset which fall into this condition:

See the same section above, as for parquet.

### AWS OpenData registry
See the same section above, as for parquet.
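
Returning to the "Variables to drop" note above, here is a minimal illustrative sketch (assuming xarray; the file name, store path and slice are made up) of removing data variables that share no dimension with the appended region before calling `to_zarr`:

```python
# Illustrative only: drop data variables with no dimension in common with the
# region being written, since to_zarr(region=...) rejects them. The file name,
# store path and index slice below are hypothetical.
import xarray as xr

ds = xr.open_dataset("new_granule.nc")
region_dims = {"TIME"}

vars_to_drop = [
    name for name, var in ds.data_vars.items()
    if not region_dims.intersection(var.dims)
]

ds.drop_vars(vars_to_drop).to_zarr(
    "path/to/dataset.zarr",
    region={"TIME": slice(1000, 1100)},
    mode="r+",
)
```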


25 changes: 18 additions & 7 deletions aodn_cloud_optimised/bin/aatams_acoustic_tagging.py
@@ -2,18 +2,29 @@
import importlib.resources

from aodn_cloud_optimised.lib.CommonHandler import cloud_optimised_creation_loop
from aodn_cloud_optimised.lib.config import load_variable_from_config, load_dataset_config
from aodn_cloud_optimised.lib.config import (
load_variable_from_config,
load_dataset_config,
)
from aodn_cloud_optimised.lib.s3Tools import s3_ls


def main():
BUCKET_RAW_DEFAULT = load_variable_from_config('BUCKET_RAW_DEFAULT')
obj_ls = s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/AATAMS/acoustic_tagging/', suffix='.csv')
BUCKET_RAW_DEFAULT = load_variable_from_config("BUCKET_RAW_DEFAULT")
obj_ls = s3_ls(BUCKET_RAW_DEFAULT, "IMOS/AATAMS/acoustic_tagging/", suffix=".csv")

dataset_config = load_dataset_config(str(importlib.resources.path("aodn_cloud_optimised.config.dataset", "aatams_acoustic_tagging.json")))
dataset_config = load_dataset_config(
str(
importlib.resources.path(
"aodn_cloud_optimised.config.dataset", "aatams_acoustic_tagging.json"
)
)
)

cloud_optimised_creation_loop(obj_ls,
dataset_config=dataset_config,
)
cloud_optimised_creation_loop(
obj_ls,
dataset_config=dataset_config,
)


if __name__ == "__main__":
39 changes: 24 additions & 15 deletions aodn_cloud_optimised/bin/acorn_gridded_qc_turq.py
@@ -4,30 +4,39 @@
from aodn_cloud_optimised.lib.GenericZarrHandler import GenericHandler
from aodn_cloud_optimised.lib.CommonHandler import cloud_optimised_creation_loop

from aodn_cloud_optimised.lib.config import load_variable_from_config, load_dataset_config
from aodn_cloud_optimised.lib.config import (
load_variable_from_config,
load_dataset_config,
)
from aodn_cloud_optimised.lib.s3Tools import s3_ls


def main():
BUCKET_RAW_DEFAULT = load_variable_from_config('BUCKET_RAW_DEFAULT')
nc_obj_ls = s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ACORN/gridded_1h-avg-current-map_QC/TURQ/2023')

dataset_config = load_dataset_config(str(importlib.resources.path("aodn_cloud_optimised.config.dataset", "acorn_gridded_qc_turq.json")))
BUCKET_RAW_DEFAULT = load_variable_from_config("BUCKET_RAW_DEFAULT")
nc_obj_ls = s3_ls(
BUCKET_RAW_DEFAULT, "IMOS/ACORN/gridded_1h-avg-current-map_QC/TURQ/2023"
)

dataset_config = load_dataset_config(
str(
importlib.resources.path(
"aodn_cloud_optimised.config.dataset", "acorn_gridded_qc_turq.json"
)
)
)

# First zarr creation
cloud_optimised_creation_loop([nc_obj_ls[0]],
dataset_config=dataset_config,
reprocess=True
)
cloud_optimised_creation_loop(
[nc_obj_ls[0]], dataset_config=dataset_config, reprocess=True
)

# append to zarr
cloud_optimised_creation_loop(nc_obj_ls[1:],
dataset_config=dataset_config
)
cloud_optimised_creation_loop(nc_obj_ls[1:], dataset_config=dataset_config)
# rechunking
GenericHandler(input_object_key=nc_obj_ls[0],
dataset_config=dataset_config,
).rechunk()
GenericHandler(
input_object_key=nc_obj_ls[0],
dataset_config=dataset_config,
).rechunk()


if __name__ == "__main__":
21 changes: 14 additions & 7 deletions aodn_cloud_optimised/bin/anfog_to_parquet.py
@@ -2,19 +2,26 @@
import importlib.resources

from aodn_cloud_optimised.lib.CommonHandler import cloud_optimised_creation_loop
from aodn_cloud_optimised.lib.config import load_variable_from_config, load_dataset_config
from aodn_cloud_optimised.lib.config import (
load_variable_from_config,
load_dataset_config,
)
from aodn_cloud_optimised.lib.s3Tools import s3_ls


def main():
BUCKET_RAW_DEFAULT = load_variable_from_config('BUCKET_RAW_DEFAULT')
nc_obj_ls = s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANFOG/slocum_glider')
BUCKET_RAW_DEFAULT = load_variable_from_config("BUCKET_RAW_DEFAULT")
nc_obj_ls = s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANFOG/slocum_glider")

dataset_config = load_dataset_config(str(importlib.resources.path("aodn_cloud_optimised.config.dataset", "anfog_slocum_glider.json")))
dataset_config = load_dataset_config(
str(
importlib.resources.path(
"aodn_cloud_optimised.config.dataset", "anfog_slocum_glider.json"
)
)
)

cloud_optimised_creation_loop(nc_obj_ls,
dataset_config=dataset_config
)
cloud_optimised_creation_loop(nc_obj_ls, dataset_config=dataset_config)


if __name__ == "__main__":
39 changes: 27 additions & 12 deletions aodn_cloud_optimised/bin/anmn_aqualogger_to_parquet.py
@@ -2,26 +2,41 @@
import importlib.resources

from aodn_cloud_optimised.lib.CommonHandler import cloud_optimised_creation_loop
from aodn_cloud_optimised.lib.config import load_variable_from_config, load_dataset_config
from aodn_cloud_optimised.lib.config import (
load_variable_from_config,
load_dataset_config,
)
from aodn_cloud_optimised.lib.s3Tools import s3_ls


def main():

BUCKET_RAW_DEFAULT = load_variable_from_config('BUCKET_RAW_DEFAULT')
BUCKET_RAW_DEFAULT = load_variable_from_config("BUCKET_RAW_DEFAULT")

nc_obj_ls = s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANMN/NSW') + \
s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANMN/PA') + \
s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANMN/QLD') + \
s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANMN/SA') + \
s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANMN/WA')
nc_obj_ls = (
s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANMN/NSW")
+ s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANMN/PA")
+ s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANMN/QLD")
+ s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANMN/SA")
+ s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANMN/WA")
)

# Aqualogger
temperature_logger_ts_fv01_ls = [s for s in nc_obj_ls if ('/Temperature/' in s) and ('FV01' in s)]
dataset_config = load_dataset_config(str(importlib.resources.path("aodn_cloud_optimised.config.dataset", "anmn_temperature_logger_ts_fv01.json")))

cloud_optimised_creation_loop(temperature_logger_ts_fv01_ls,
dataset_config=dataset_config)
temperature_logger_ts_fv01_ls = [
s for s in nc_obj_ls if ("/Temperature/" in s) and ("FV01" in s)
]
dataset_config = load_dataset_config(
str(
importlib.resources.path(
"aodn_cloud_optimised.config.dataset",
"anmn_temperature_logger_ts_fv01.json",
)
)
)

cloud_optimised_creation_loop(
temperature_logger_ts_fv01_ls, dataset_config=dataset_config
)


if __name__ == "__main__":