Merge pull request #18 from aodn/YmlLinter
(Feat) add YAML and JSON linter on github actions
lbesnard authored Jun 4, 2024
2 parents 79e3a13 + 16842cf commit df41e58
Showing 51 changed files with 6,773 additions and 754 deletions.
1 change: 0 additions & 1 deletion .github/workflows/build.yml
@@ -81,4 +81,3 @@ jobs:
- name: Verify build
run: |
pip install dist/*.whl
23 changes: 23 additions & 0 deletions .github/workflows/pre-commit.yml
@@ -0,0 +1,23 @@
name: Pre-commit

on: [push, pull_request]

jobs:
  pre-commit:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10.14'

      - name: Install pre-commit
        run: pip install pre-commit

      - name: Run pre-commit
        run: |
          pre-commit run --all-files
13 changes: 13 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,13 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: check-json
        exclude: aodn_cloud_optimised/config/dataset/dataset_template.json
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
      - id: black
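
For a sense of what the new `check-yaml` and `check-json` hooks enforce, here is a minimal parse-only validation sketch in Python (illustrative, not part of this change; it assumes PyYAML is available and that configs live under `aodn_cloud_optimised/config`):

```python
# Illustrative only: parse-check YAML/JSON files, similar in spirit to the
# check-yaml and check-json pre-commit hooks. Paths are assumptions.
import json
from pathlib import Path

import yaml  # provided by PyYAML

EXCLUDED = {Path("aodn_cloud_optimised/config/dataset/dataset_template.json")}


def lint_configs(root: str = "aodn_cloud_optimised/config") -> list[str]:
    errors = []
    for path in Path(root).rglob("*"):
        if path in EXCLUDED or path.suffix not in {".yaml", ".yml", ".json"}:
            continue
        try:
            text = path.read_text()
            yaml.safe_load(text) if path.suffix in {".yaml", ".yml"} else json.loads(text)
        except (yaml.YAMLError, json.JSONDecodeError) as exc:
            errors.append(f"{path}: {exc}")
    return errors


if __name__ == "__main__":
    for err in lint_configs():
        print(err)
```
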
25 changes: 18 additions & 7 deletions README.md
@@ -5,12 +5,23 @@ A tool to convert IMOS NetCDF files and CSV into Cloud Optimised format (Zarr/Pa


# Installation
## Users
Requirements:
* python >= 3.10.14

```bash
curl -s https://raw.githubusercontent.com/aodn/aodn_cloud_optimised/main/install.sh | bash
```

## Development
Requirements:
* Mamba from miniforge3: https://github.com/conda-forge/miniforge

```bash
mamba env create --file=environment.yml
mamba activate CloudOptimisedParquet

pip install -e . # development (edit mode)
pip install . # prod
poetry install
```
# Requirements
AWS SSO to push files to S3
@@ -36,7 +47,7 @@ AWS SSO to push files to S3
| Create AWS OpenData Registry Yaml | Done |
| Config file JSON validation against schema | Done |
| Create polygon variable to facilitate geometry queries | Done |

## Zarr Features
| Feature | Status | Comment |
|------------------------------------------------------------------------|--------|------------------------------------------------------------------------------------|
@@ -96,11 +107,11 @@ See [documentation](README_add_new_dataset.md) to learn how to add a new dataset

# Notebooks

Notebooks exist under
https://github.com/aodn/architecturereview/blob/main/cloud-optimised/cloud-optimised-team/parquet/notebooks/
Notebooks exist under
https://github.com/aodn/aodn_cloud_optimised/blob/main/notebooks/

For each new dataset, it is a good practice to use the provided template ```cloud-optimised/cloud-optimised-team/parquet/notebooks/template.ipynb```
For each new dataset, it is a good practice to use the provided template ```notebooks/template.ipynb```
and create a new notebook.

These notebooks use a common library of python functions to help with creating the geo-spatial filters:
```cloud-optimised/cloud-optimised-team/parquet/notebooks/parquet_queries.py```
```notebooks/parquet_queries.py```
26 changes: 12 additions & 14 deletions README_add_new_dataset.md
@@ -1,5 +1,5 @@
This module aims to be generic enough so that adding a new IMOS dataset is driven through a JSON config file.
For more complicated datasets, such as Argo, it is also possible to create a specific handler which would
inherit with ```super()``` all of the methods of the ```aodn_cloud_optimised.lib.GenericParquetHandler.GenericHandler``` class.
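
As a rough illustration of that pattern, a dataset-specific handler might look like the sketch below; the class name and the extra method are hypothetical, only the import path comes from the text above:

```python
# Hypothetical sketch only: ArgoHandler and tweak_argo_specifics are made up
# to show the inheritance pattern; the import path is from the text above.
from aodn_cloud_optimised.lib.GenericParquetHandler import GenericHandler


class ArgoHandler(GenericHandler):
    """Dataset-specific handler that reuses all generic behaviour."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # inherit the generic set-up unchanged

    def tweak_argo_specifics(self, ds):
        # Hypothetical extra step for this dataset; everything else is
        # delegated to the generic parquet handler.
        return ds
```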

The main choice left to create a cloud optimised dataset with this module is to decide to either use the **Apache Parquet**
@@ -34,7 +34,7 @@ While developing the aodn_cloud_optimised, it became clear that for both zarr an
In this section, we're demonstrating how to create the full schema from a NetCDF file as an example, so that each variable
is defined, with its variable attributes and the type.

The following snippet creates the required schema from a random NetCDF file. ```generate_json_schema_from_s3_netcdf``` will output the schema as a JSON file in a temporary location.

```python
import os
@@ -145,14 +145,14 @@ the parquet dataset, the logs will output the json info to be added into the con

### Global attributes as variables
Some NetCDF global attributes may have to be converted into variables so that users/API can filter the data based on these
values.

In the following example, ```deployment_code``` is a global attribute that we want to have as a variable. It is then added
in the ```gattrs_to_variables```. **However**, this needs to also be present in the schema definition
so that:

```json
...
  "gattrs_to_variables": [
    "deployment_code"
  ],
@@ -165,7 +165,7 @@ so that:
```
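
To make the intent concrete, here is a small illustrative sketch, not the library's actual implementation, of what promoting a global attribute to a per-row variable amounts to (the input file name is made up; xarray is assumed):

```python
# Illustrative only: promote a NetCDF global attribute to a per-row column so
# it can be filtered on, which is what listing it under gattrs_to_variables
# is meant to achieve. The input file name is hypothetical.
import xarray as xr

ds = xr.open_dataset("IMOS_sample.nc")
df = ds.to_dataframe().reset_index()

# Repeat the global attribute value on every row so queries can filter on it.
df["deployment_code"] = ds.attrs.get("deployment_code")
```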

### Filename as variable
The IMOS/AODN data (re)processing is very file-oriented. In order to reprocess data and delete the old matching data,
the original filename is stored as a variable. It is required to add it in the schema definition:

```json
@@ -188,14 +188,14 @@ The following information needs to be added in the relevant sections:
```json
  "partition_keys": [
    "timestamp",
    ...
  ],
  "time_extent": {
    "time": "TIME",
    "partition_timestamp_period": "Q"
  },
  "schema":
    ...
    "timestamp": {
      "type": "int64"
    },
@@ -243,7 +243,7 @@ Force search for existing parquet files to delete when creating new ones. This c
"force_old_pq_del": true
```
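
As a rough illustration of the quarterly setting above (`"partition_timestamp_period": "Q"`), the `timestamp` partition key can be thought of as a per-quarter bucket derived from each record's `TIME`; the snippet below is an assumption about the idea rather than the library's actual code:

```python
# Illustrative only: bucket observation times into quarterly partition values,
# mirroring the idea behind "partition_timestamp_period": "Q".
import pandas as pd

times = pd.to_datetime(["2023-02-14T02:30:00", "2023-11-02T10:00:00"])

# Floor each time to the start of its quarter, then store as int64 epoch seconds.
quarter_start = times.to_period("Q").to_timestamp()
timestamp_key = quarter_start.asi8 // 10**9
print(dict(zip(times.astype(str), timestamp_key)))
```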

### AWS OpenData registry
In order to publicise the dataset on the OpenData Registry, add the following to the config. A ```yaml``` file will be
created/updated alongside the parquet dataset.

@@ -405,8 +405,8 @@ The name of a variable which will be used as a template to create missing variab

```

### Variables to drop
When setting `region` explicitly in the `to_zarr()` method, all variables in the dataset to write must have at least one
dimension in common with the region's dimensions ['TIME'].
We need to remove the variables from the dataset which do not meet this condition (see the sketch at the end of this section):
```json
@@ -417,7 +417,5 @@ We need to remove the variables from the dataset which fall into this condition:

See the same section above, as for parquet.

### AWS OpenData registry
See the same section above, as for parquet.
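
Returning to the "Variables to drop" note above, here is a minimal illustrative sketch (assuming xarray; the file name, store path and slice are made up) of removing data variables that share no dimension with the appended region before calling `to_zarr`:

```python
# Illustrative only: drop data variables with no dimension in common with the
# region being written, since to_zarr(region=...) rejects them. The file name,
# store path and index slice below are hypothetical.
import xarray as xr

ds = xr.open_dataset("new_granule.nc")
region_dims = {"TIME"}

vars_to_drop = [
    name for name, var in ds.data_vars.items()
    if not region_dims.intersection(var.dims)
]

ds.drop_vars(vars_to_drop).to_zarr(
    "path/to/dataset.zarr",
    region={"TIME": slice(1000, 1100)},
    mode="r+",
)
```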


25 changes: 18 additions & 7 deletions aodn_cloud_optimised/bin/aatams_acoustic_tagging.py
@@ -2,18 +2,29 @@
import importlib.resources

from aodn_cloud_optimised.lib.CommonHandler import cloud_optimised_creation_loop
from aodn_cloud_optimised.lib.config import load_variable_from_config, load_dataset_config
from aodn_cloud_optimised.lib.config import (
load_variable_from_config,
load_dataset_config,
)
from aodn_cloud_optimised.lib.s3Tools import s3_ls


def main():
BUCKET_RAW_DEFAULT = load_variable_from_config('BUCKET_RAW_DEFAULT')
obj_ls = s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/AATAMS/acoustic_tagging/', suffix='.csv')
BUCKET_RAW_DEFAULT = load_variable_from_config("BUCKET_RAW_DEFAULT")
obj_ls = s3_ls(BUCKET_RAW_DEFAULT, "IMOS/AATAMS/acoustic_tagging/", suffix=".csv")

dataset_config = load_dataset_config(str(importlib.resources.path("aodn_cloud_optimised.config.dataset", "aatams_acoustic_tagging.json")))
dataset_config = load_dataset_config(
str(
importlib.resources.path(
"aodn_cloud_optimised.config.dataset", "aatams_acoustic_tagging.json"
)
)
)

cloud_optimised_creation_loop(obj_ls,
dataset_config=dataset_config,
)
cloud_optimised_creation_loop(
obj_ls,
dataset_config=dataset_config,
)


if __name__ == "__main__":
39 changes: 24 additions & 15 deletions aodn_cloud_optimised/bin/acorn_gridded_qc_turq.py
@@ -4,30 +4,39 @@
from aodn_cloud_optimised.lib.GenericZarrHandler import GenericHandler
from aodn_cloud_optimised.lib.CommonHandler import cloud_optimised_creation_loop

from aodn_cloud_optimised.lib.config import load_variable_from_config, load_dataset_config
from aodn_cloud_optimised.lib.config import (
load_variable_from_config,
load_dataset_config,
)
from aodn_cloud_optimised.lib.s3Tools import s3_ls


def main():
BUCKET_RAW_DEFAULT = load_variable_from_config('BUCKET_RAW_DEFAULT')
nc_obj_ls = s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ACORN/gridded_1h-avg-current-map_QC/TURQ/2023')

dataset_config = load_dataset_config(str(importlib.resources.path("aodn_cloud_optimised.config.dataset", "acorn_gridded_qc_turq.json")))
BUCKET_RAW_DEFAULT = load_variable_from_config("BUCKET_RAW_DEFAULT")
nc_obj_ls = s3_ls(
BUCKET_RAW_DEFAULT, "IMOS/ACORN/gridded_1h-avg-current-map_QC/TURQ/2023"
)

dataset_config = load_dataset_config(
str(
importlib.resources.path(
"aodn_cloud_optimised.config.dataset", "acorn_gridded_qc_turq.json"
)
)
)

# First zarr creation
cloud_optimised_creation_loop([nc_obj_ls[0]],
dataset_config=dataset_config,
reprocess=True
)
cloud_optimised_creation_loop(
[nc_obj_ls[0]], dataset_config=dataset_config, reprocess=True
)

# append to zarr
cloud_optimised_creation_loop(nc_obj_ls[1:],
dataset_config=dataset_config
)
cloud_optimised_creation_loop(nc_obj_ls[1:], dataset_config=dataset_config)
# rechunking
GenericHandler(input_object_key=nc_obj_ls[0],
dataset_config=dataset_config,
).rechunk()
GenericHandler(
input_object_key=nc_obj_ls[0],
dataset_config=dataset_config,
).rechunk()


if __name__ == "__main__":
21 changes: 14 additions & 7 deletions aodn_cloud_optimised/bin/anfog_to_parquet.py
@@ -2,19 +2,26 @@
import importlib.resources

from aodn_cloud_optimised.lib.CommonHandler import cloud_optimised_creation_loop
from aodn_cloud_optimised.lib.config import load_variable_from_config, load_dataset_config
from aodn_cloud_optimised.lib.config import (
load_variable_from_config,
load_dataset_config,
)
from aodn_cloud_optimised.lib.s3Tools import s3_ls


def main():
BUCKET_RAW_DEFAULT = load_variable_from_config('BUCKET_RAW_DEFAULT')
nc_obj_ls = s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANFOG/slocum_glider')
BUCKET_RAW_DEFAULT = load_variable_from_config("BUCKET_RAW_DEFAULT")
nc_obj_ls = s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANFOG/slocum_glider")

dataset_config = load_dataset_config(str(importlib.resources.path("aodn_cloud_optimised.config.dataset", "anfog_slocum_glider.json")))
dataset_config = load_dataset_config(
str(
importlib.resources.path(
"aodn_cloud_optimised.config.dataset", "anfog_slocum_glider.json"
)
)
)

cloud_optimised_creation_loop(nc_obj_ls,
dataset_config=dataset_config
)
cloud_optimised_creation_loop(nc_obj_ls, dataset_config=dataset_config)


if __name__ == "__main__":
39 changes: 27 additions & 12 deletions aodn_cloud_optimised/bin/anmn_aqualogger_to_parquet.py
@@ -2,26 +2,41 @@
import importlib.resources

from aodn_cloud_optimised.lib.CommonHandler import cloud_optimised_creation_loop
from aodn_cloud_optimised.lib.config import load_variable_from_config, load_dataset_config
from aodn_cloud_optimised.lib.config import (
load_variable_from_config,
load_dataset_config,
)
from aodn_cloud_optimised.lib.s3Tools import s3_ls


def main():

BUCKET_RAW_DEFAULT = load_variable_from_config('BUCKET_RAW_DEFAULT')
BUCKET_RAW_DEFAULT = load_variable_from_config("BUCKET_RAW_DEFAULT")

nc_obj_ls = s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANMN/NSW') + \
s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANMN/PA') + \
s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANMN/QLD') + \
s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANMN/SA') + \
s3_ls(BUCKET_RAW_DEFAULT, 'IMOS/ANMN/WA')
nc_obj_ls = (
s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANMN/NSW")
+ s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANMN/PA")
+ s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANMN/QLD")
+ s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANMN/SA")
+ s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANMN/WA")
)

# Aqualogger
temperature_logger_ts_fv01_ls = [s for s in nc_obj_ls if ('/Temperature/' in s) and ('FV01' in s)]
dataset_config = load_dataset_config(str(importlib.resources.path("aodn_cloud_optimised.config.dataset", "anmn_temperature_logger_ts_fv01.json")))

cloud_optimised_creation_loop(temperature_logger_ts_fv01_ls,
dataset_config=dataset_config)
temperature_logger_ts_fv01_ls = [
s for s in nc_obj_ls if ("/Temperature/" in s) and ("FV01" in s)
]
dataset_config = load_dataset_config(
str(
importlib.resources.path(
"aodn_cloud_optimised.config.dataset",
"anmn_temperature_logger_ts_fv01.json",
)
)
)

cloud_optimised_creation_loop(
temperature_logger_ts_fv01_ls, dataset_config=dataset_config
)


if __name__ == "__main__":