This project, built using PyTorch and PyTorch-lightning, is designed to train a variety of Neural Network architectures (GNNs, CNNs, Vision Transformers, ...) on various weather forecasting datasets. This is a Work in Progress, intended to share ideas and design concepts with partners.
Developed at Météo-France by DSM/AI Lab and CNRM/GMAP/PREV.
Contributions are welcome (Issues, Pull Requests, ...).
This project is licensed under the APACHE 2.0 license.
This project started as a fork of neural-lam, a project by Joel Oskarsson, see here. Many thanks to Joel for his work!
- Overview
- Features
- Installation
- Usage
- Contributing new features
- Design choices
- Unit tests
- Continuous Integration
- Use any neural network architectures available in mfai
- 1 dataset with samples available on Huggingface : Titan
- 2 training strategies : Scaled Auto-regressive steps, Differential Auto-regressive steps
- 4 losses: Scaled RMSE, Scaled L1, Weighted MSE, Weighted L1
- neural networks as simple torch.nn.Module
- training with PyTorch-lightning
- simple interfaces to easily add a new dataset, neural network, training strategy or loss
- simple command line to launch a training
- config files to change the parameters of your dataset or neural network during training
- experiment tracking with tensorboard and plots of forecasts with matplotlib
- implementation of NamedTensors to track features and dimensions of tensors at each step of the training
See here for details on the available datasets, neural networks, training strategies, losses, and explanation of our NamedTensor.
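To give a feel for what a NamedTensor provides, here is a toy sketch in plain Python. It is not the py4cast implementation: the class name `ToyNamedTensor`, its methods, the feature names, and the 2-D list-of-rows layout are all assumptions made for illustration, whereas the real class tracks the dimensions and features of actual tensors.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ToyNamedTensor:
    """Hypothetical stand-in: rows of values plus dimension and feature names."""

    rows: List[List[float]]      # shape (n_points, n_features)
    dim_names: Sequence[str]     # e.g. ("ngrid", "features")
    feature_names: List[str]     # one name per column of the last dimension

    def __post_init__(self) -> None:
        # Keep metadata and data in sync: one name per feature column.
        for row in self.rows:
            if len(row) != len(self.feature_names):
                raise ValueError("feature_names must match the last dimension")

    def feature(self, name: str) -> List[float]:
        """Select one feature by name instead of by integer index."""
        idx = self.feature_names.index(name)
        return [row[idx] for row in self.rows]

    def concat(self, other: "ToyNamedTensor") -> "ToyNamedTensor":
        """Concatenate along the features dimension, merging the names too."""
        if len(self.rows) != len(other.rows):
            raise ValueError("both tensors must have the same number of rows")
        rows = [a + b for a, b in zip(self.rows, other.rows)]
        names = self.feature_names + other.feature_names
        return ToyNamedTensor(rows, self.dim_names, names)


state = ToyNamedTensor([[280.0, 1.5], [281.0, 2.0]], ("ngrid", "features"), ["t2m", "u10"])
forcing = ToyNamedTensor([[0.3], [0.4]], ("ngrid", "features"), ["solar"])
merged = state.concat(forcing)
print(merged.feature_names)     # ['t2m', 'u10', 'solar']
print(merged.feature("solar"))  # [0.3, 0.4]
```

The point of the design is that the names travel with the data, so a concatenation cannot silently leave the feature list out of date.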
Start by cloning the repository:
git clone https://github.com/meteofrance/py4cast.git
cd py4cast
In order to be able to run the code on different machines, some environment variables can be set.
You may add them in your .bashrc or modify them just before launching an experiment.
- PY4CAST_ROOTDIR: Specify the ROOT DIR for your experiment. It also modifies the CACHE_DIR. This is where the files created during the experiment will be stored.
- PY4CAST_SMEAGOL_PATH: Specify where the smeagol dataset is stored. Only needed if you want to use the smeagol dataset.
- PY4CAST_TITAN_PATH: Specify where the titan dataset is stored. Only needed if you want to use the titan dataset.
Each variable is set with export, for instance:
export PY4CAST_ROOTDIR="/my/dir/"
You MUST export PY4CAST_ROOTDIR to make py4cast work. You can use, for instance, the existing SCRATCH env var:
export PY4CAST_ROOTDIR=$SCRATCH/py4cast
If PY4CAST_ROOTDIR is not exported, py4cast defaults to /scratch/shared/py4cast as its root directory, which leads to exceptions if this directory does not exist or is not writable.
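The fallback behaviour described above can be pictured with the following sketch. The function name `resolve_rootdir` and the explicit checks are assumptions for illustration only; py4cast's own code may resolve and validate the directory differently.

```python
import os
import tempfile
from pathlib import Path

DEFAULT_ROOTDIR = "/scratch/shared/py4cast"  # default quoted in the text above


def resolve_rootdir() -> Path:
    """Return the root directory, failing early if it is missing or read-only."""
    root = Path(os.environ.get("PY4CAST_ROOTDIR", DEFAULT_ROOTDIR))
    if not root.is_dir():
        raise FileNotFoundError(f"PY4CAST_ROOTDIR does not exist: {root}")
    if not os.access(root, os.W_OK):
        raise PermissionError(f"PY4CAST_ROOTDIR is not writable: {root}")
    return root


# Demonstration with a temporary directory standing in for a real root dir.
with tempfile.TemporaryDirectory() as tmp:
    os.environ["PY4CAST_ROOTDIR"] = tmp
    print(resolve_rootdir())
```

Running a check like this before launching a long job makes a missing export show up immediately rather than at the first write.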
When working at Météo-France, you can use either runai + Docker or Conda/Micromamba to set up a working environment. On the AI Lab cluster we recommend runai; on our HPC we recommend Conda.
See the runai repository for installation instructions.
For HPC, see the related doc (doc/install/install_MF.md) to get the right installation settings.
You can install a conda environment, including py4cast in editable mode, using
conda env create --file env.yaml
From an existing conda environment, you can now manually install py4cast in development mode using
conda install conda-build -n py4cast
conda develop .
or
pip install --editable .
If the installation fails because some dependencies are not found or are in conflict, please look at the installation known issues.
Please install the environment using :
micromamba create -f env.yaml
From an existing micromamba environment, you can now manually install py4cast in editable mode using
pip install --editable .
To build the docker image, please use the oci-image-build.sh script.
Météo-France users should export the variable INJECT_MF_CERT to use the Météo-France certificate:
export INJECT_MF_CERT=1
Then, build with the following command
bash ./oci-image-build.sh --runtime docker
By default, the CUDA and pytorch versions are extracted from the env.yaml reference file. Nevertheless, for testing purposes, you can set PY4CAST_CUDA_VERSION and PY4CAST_TORCH_VERSION to override the default versions.
As an alternative to docker, you can use podman to build the image.
To build the podman image, please use the oci-image-build.sh script.
bash ./oci-image-build.sh --runtime podman
By default, the CUDA and pytorch versions are extracted from the env.yaml reference file. Nevertheless, for testing purposes, you can set PY4CAST_CUDA_VERSION and PY4CAST_TORCH_VERSION to override the default versions.
From a previously built docker or podman image, you can convert it to the singularity format.
To convert the previously built image to a Singularity container, first save the image as a tar file:
docker save py4cast:your_tag -o py4cast-your_tag.tar
or with podman:
podman save --format oci-archive py4cast:your_tag -o py4cast-your_tag.tar
Then, build the singularity image with:
singularity build py4cast-your_tag.sif docker-archive://py4cast-your_tag.tar
Please make sure you have enough free disk space to store the .tar and .sif files.
From your py4cast source directory, to run an experiment using the docker image you need to mount in the container:
- The dataset path
- The py4cast sources
- The PY4CAST_ROOTDIR path
Here is an example command to run a "dev_mode" training of the HiLAM model with the TITAN dataset, using all the GPUs:
docker run \
--name hilam-titan \
--rm \
--gpus all \
-v ./${HOME} \
-v <path-to-datasets>/TITAN:/dataset/TITAN \
-v <your_py4cast_root_dir>:<your_py4cast_root_dir> \
-e PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
-e PY4CAST_TITAN_PATH=/dataset/TITAN \
py4cast:<your_tag> \
bash -c " \
pip install -e . && \
python bin/train.py \
--dataset titan \
--model HiLAM \
--dataset_conf config/datasets/titan_full.json \
--dev_mode \
--no_log \
--num_pred_steps_val_test 1 \
--num_input_steps 1 \
"
From your py4cast source directory, to run an experiment using the podman image you need to mount in the container:
- The dataset path
- The py4cast sources
- The PY4CAST_ROOTDIR path
Here is an example command to run a "dev_mode" training of the HiLAM model with the TITAN dataset, using all the GPUs:
podman run \
--name hilam-titan \
--rm \
--device nvidia.com/gpu=all \
--ipc=host \
--network=host \
-v ./${HOME} \
-v <path-to-datasets>/TITAN:/dataset/TITAN \
-v <your_py4cast_root_dir>:<your_py4cast_root_dir> \
-e PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
-e PY4CAST_TITAN_PATH=/dataset/TITAN \
py4cast:<your_tag> \
bash -c " \
pip install -e . && \
python bin/train.py \
--dataset titan \
--model HiLAM \
--dataset_conf config/datasets/titan_full.json \
--dev_mode \
--no_log \
--num_pred_steps_val_test 1 \
--num_input_steps 1 \
"
From your py4cast source directory, to run an experiment using a singularity container you need to mount in the container:
- The dataset path
- The PY4CAST_ROOTDIR path
Here is an example command to run a "dev_mode" training of the HiLAM model with the TITAN dataset:
PY4CAST_TITAN_PATH=/dataset/TITAN \
PY4CAST_ROOTDIR=<your_py4cast_root_dir> \
singularity exec \
--nv \
--bind <path-to-datasets>/TITAN:/dataset/TITAN \
--bind <your_py4cast_root_dir>:<your_py4cast_root_dir> \
py4cast-<your_tag>.sif \
bash -c " \
pip install -e . && \
python bin/train.py \
--dataset titan \
--model HiLAM \
--dataset_conf config/datasets/titan_full.json \
--dev_mode \
--no_log \
--num_pred_steps_val_test 1 \
--num_input_steps 1 \
"
For now this works only for internal Météo-France users.
runai commands must be issued at the root directory of the py4cast project:
- Run an interactive training session
runai gpu_play 4
runai build
runai exec_gpu python bin/train.py --dataset titan --model HiLAM
- Train using sbatch single node multi-GPUs
export RUNAI_GRES="gpu:v100:4"
runai sbatch python bin/train.py --dataset titan --model HiLAM
- Train using sbatch multi nodes multi GPUs
Here we use 2 nodes with 4 GPUs each.
export RUNAI_SLURM_NNODES=2
export RUNAI_GRES="gpu:v100:4"
runai sbatch_multi_node python bin/train.py --dataset titan --model HiLAM
For the rest of the documentation, you must prepend each python command with runai exec_gpu.
Once your micromamba environment is set up, you should:
- activate your environment
conda activate py4cast or micromamba activate nlam
- launch a training
A very simple training can be launched (on your current node)
python bin/train.py --dataset dummy --model HalfUNet --epochs 2
To submit a training as a batch job with sbatch, you will need to create a small sh script.
#!/usr/bin/bash
#SBATCH --partition=ndl
#SBATCH --nodes=1 # Number of GPU nodes required
#SBATCH --gres=gpu:1 # Number of GPUs required per node
#SBATCH --time=05:00:00 # Time limit of your experiment
#SBATCH --ntasks-per-node=1 # Number of tasks per node. This should match the number of GPUs required per node
# Note that other variables could be set (depending on your machine). For example, you may need to set the number of CPUs or the memory used by your experiment.
# On the MF HPC, this is proportional to the number of GPUs required per node. This is not the case on other machines (e.g. the Météo-France AI Lab machine).
source ~/.bashrc # Be sure that all your environment variables are set
conda activate py4cast # Activate your environment (installed by micromamba or conda)
cd $PY4CAST_PATH # Go to Py4CAST (you can either add an environment variable or hard code it here).
# Launch your favorite command.
srun python bin/train.py --model HalfUNet --dataset dummy --epochs 2
Then just launch this script using
sbatch my_tiny_script.sh
NB: you may have some trouble with SSL certificates (for cartopy). You may need to explicitly export the certificate as:
export SSL_CERT_FILE="/opt/softs/certificats/proxy1.pem"
with the proxy path depending on your machine.
As in neural-lam, before training you must first compute the mean and std of each feature.
To compute the stats of the Titan dataset:
python py4cast/datasets/titan/__init__.py
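As a rough picture of what "mean and std of each feature" means, here is a minimal sketch in plain Python. The function names, the dict-based sample layout, and the feature names are assumptions for illustration; the sketch also assumes the statistics are used to standardize inputs, which the text above does not state. It is not the code run by the command above.

```python
from math import sqrt
from typing import Dict, List, Tuple


def feature_stats(samples: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Return {feature: (mean, std)} computed over all samples."""
    stats = {}
    for name in samples[0]:
        values = [s[name] for s in samples]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)  # population variance
        stats[name] = (mean, sqrt(var))
    return stats


def standardize(sample: Dict[str, float], stats: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Scale each feature to zero mean and unit standard deviation."""
    return {k: (v - stats[k][0]) / stats[k][1] for k, v in sample.items()}


samples = [{"t2m": 280.0, "u10": 1.0}, {"t2m": 282.0, "u10": 3.0}]
stats = feature_stats(samples)
print(stats["t2m"])                    # (281.0, 1.0)
print(standardize(samples[0], stats))  # {'t2m': -1.0, 'u10': -1.0}
```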
To train on a dataset with its default settings just pass the name of the dataset (all lowercase) :
python bin/train.py --dataset titan --model HalfUNet
You can override the dataset default configuration file:
python bin/train.py --dataset smeagol --model HalfUNet --dataset_conf config/smeagoldev.json
Details on available datasets.
- Configuring the neural network
To train on a dataset using a network with its default settings, just pass the name of the architecture as shown below:
python bin/train.py --dataset smeagol --model HiLAM
python bin/train.py --dataset smeagol --model HalfUNet
You can override some settings of the model using a json config file (here we increase the number of filters to 128 and use ghost modules):
python bin/train.py --dataset smeagol --model HalfUNet --model_conf config/halfunet128_ghost.json
Details on available neural networks.
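The idea of a json file overriding default settings can be sketched with the standard library alone. The class `ToyUNetSettings`, its field names, and the helper `load_settings` are hypothetical; the project itself relies on dataclasses together with dataclass_json, and the real keys of config/halfunet128_ghost.json are not shown here.

```python
import json
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ToyUNetSettings:
    """Hypothetical settings; field names are illustrative only."""

    num_filters: int = 64
    use_ghost: bool = False


def load_settings(json_text: str) -> ToyUNetSettings:
    """Start from the defaults and override only the keys present in the JSON."""
    overrides = json.loads(json_text)
    known = {f.name for f in fields(ToyUNetSettings)}
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    return replace(ToyUNetSettings(), **overrides)


print(load_settings('{"num_filters": 128, "use_ghost": true}'))
# ToyUNetSettings(num_filters=128, use_ghost=True)
```

Rejecting unknown keys is a design choice of this sketch: a typo in a config file then fails loudly instead of being ignored.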
- Changing the training strategy
You can choose a training strategy using the --strategy STRATEGY_NAME cli argument:
python bin/train.py --dataset smeagol --model HalfUNet --strategy diff_ar
Details on available training strategies.
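The sketch below illustrates one reading of an auto-regressive rollout and of the "differential" variant, on scalar states for brevity. Treating diff_ar as "the network predicts an increment that is added to the previous state" is an assumption drawn from the strategy names, not a description of the py4cast code; `rollout` and its arguments are hypothetical.

```python
from typing import Callable, List


def rollout(step: Callable[[float], float], x0: float, num_pred_steps: int,
            differential: bool = False) -> List[float]:
    """Feed each prediction back as the next input for num_pred_steps steps."""
    state, forecast = x0, []
    for _ in range(num_pred_steps):
        out = step(state)
        # Direct: the network output is the next state.
        # Differential: the network output is an increment added to the state.
        state = state + out if differential else out
        forecast.append(state)
    return forecast


print(rollout(lambda x: 0.5 * x, 8.0, 3))                 # [4.0, 2.0, 1.0]
print(rollout(lambda x: 1.0, 8.0, 3, differential=True))  # [9.0, 10.0, 11.0]
```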
- Other training options:
--seed SEED
random seed (default: 42)
--loss LOSS
Loss function to use (default: mse)
--lr LR
learning rate (default: 0.001)
--val_interval VAL_INTERVAL
Number of epochs training between each validation run (default: 1)
--epochs EPOCHS
upper epoch limit (default: 200)
--profiler PROFILER
Profiler required. Possibilities are ['simple', 'pytorch', 'None']
--batch_size BATCH_SIZE
batch size
--precision PRECISION
Numerical precision to use for model (32/16/bf16) (default: 32)
--limit_train_batches LIMIT_TRAIN_BATCHES
Number of batches to use for training
--num_pred_steps_train NUM_PRED_STEPS_TRAIN
Number of auto-regressive steps/prediction steps during training forward pass
--num_pred_steps_val_test NUM_PRED_STEPS_VAL_TEST
Number of auto-regressive steps/prediction steps during validation and tests
--num_input_steps NUM_INPUT_STEPS
Number of previous timesteps supplied as inputs to the model
--num_inter_steps NUM_INTER_STEPS
Number of model steps between two samples
--no_log
When activated, logs are not stored and models are not saved. Use in dev mode. (default: False)
--mlflow_log
When activated, the MLFlowLogger is used and the model is saved in the MLFlow style (default: False)
--dev_mode
When activated, reduce number of epoch and steps. (default: False)
--load_model_ckpt LOAD_MODEL_CKPT
Path to load model parameters from (default: None)
You can find more details about all the num_X_steps options here.
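To make the roles of num_input_steps and the prediction steps concrete, here is a small sketch that cuts a time series into (inputs, targets) windows. The helper `make_samples` is hypothetical and ignores num_inter_steps; it only illustrates the option descriptions above and is not how py4cast datasets build their samples.

```python
from typing import List, Sequence, Tuple


def make_samples(series: Sequence[float], num_input_steps: int,
                 num_pred_steps: int) -> List[Tuple[List[float], List[float]]]:
    """Split a time series into (inputs, targets) windows."""
    window = num_input_steps + num_pred_steps
    samples = []
    for start in range(len(series) - window + 1):
        inputs = list(series[start:start + num_input_steps])
        targets = list(series[start + num_input_steps:start + window])
        samples.append((inputs, targets))
    return samples


for inputs, targets in make_samples([0, 1, 2, 3, 4], num_input_steps=2, num_pred_steps=2):
    print(inputs, "->", targets)
# [0, 1] -> [2, 3]
# [1, 2] -> [3, 4]
```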
We use TensorBoard to track the experiments. You can launch a TensorBoard server using one of the following commands:
At Météo-France:
runai will handle port forwarding for you.
runai tensorboard --logdir PATH_TO_YOUR_ROOT_PATH
Elsewhere
tensorboard --logdir PATH_TO_YOUR_ROOT_PATH
Then you can access the tensorboard server at the following address: http://YOUR_SERVER_IP:YOUR_PORT/
Optionally, you can use MLflow, in addition to TensorBoard, to track your experiment and log your model. To activate the MLflow logger, simply add the --mlflow_log option on the bin/train.py command line.
Local usage
Without an MLflow server, the logs are stored in your root path, at PY4CAST_ROOTDIR/logs/mlflow.
With a MLFlow server
If you have an MLflow server, you can configure your training environment to push the logs to the remote server. A set of environment variables is available for that.
For example, you can export the following variables in your training environment:
export MLFLOW_TRACKING_URI=https://my.mlflow.server.com/
export MLFLOW_TRACKING_USERNAME=<your-mlflow-user>
export MLFLOW_TRACKING_PASSWORD=<your-mlflow-pwd>
export MLFLOW_EXPERIMENT_NAME=py4cast/unetrpp
Inference is done by running the bin/inference.py script. This script will load a model and run it on a dataset using the training parameters (dataset config, timestep options, ...).
usage: python bin/inference.py [-h] [--model_path MODEL_PATH] [--dataset DATASET] [--infer_steps INFER_STEPS] [--date DATE]
options:
-h, --help show this help message and exit
--model_path MODEL_PATH
Path to the model checkpoint
--date DATE
Date of the sample to infer on. Format:YYYYMMDDHH
--dataset DATASET
Name of the dataset to use (typically the same as has been used for training)
--dataset_conf DATASET_CONF
Name of the dataset config file (json, to change e.g dates, leadtimes, etc)
--infer_steps INFER_STEPS
Number of auto-regressive steps/prediction steps during the inference
--precision PRECISION
floating point precision for the inference (default: 32)
--grib BOOL
Whether the outputs should be saved as grib, needs saving conf.
--saving_conf SAVING_CONF
Name of the config file for write settings (json)
A simple example of inference is shown below:
runai exec_gpu python bin/inference.py --model_path /scratch/shared/py4cast/logs/camp0/poesy/halfunet/sezn_run_dev_12 --date 2021061621 --dataset poesy_infer --infer_steps 2
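The --date argument uses the YYYYMMDDHH format. A minimal sketch of how such a string can be validated and parsed with the standard library is shown below; `parse_run_date` is a hypothetical helper, not part of bin/inference.py.

```python
from datetime import datetime


def parse_run_date(text: str) -> datetime:
    """Parse a YYYYMMDDHH string, rejecting anything that is not 10 digits."""
    # strptime alone is lenient about field widths, so check the shape first.
    if len(text) != 10 or not text.isdigit():
        raise ValueError(f"expected YYYYMMDDHH, got {text!r}")
    return datetime.strptime(text, "%Y%m%d%H")


print(parse_run_date("2021061621"))  # 2021-06-16 21:00:00
```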
You can compare multiple trained models on specific case studies and visualize the forecasts on animated plots with the bin/gif_comparison.py script. See the example GIF at the beginning of the README.
Warnings:
- For now this script only works with models trained with Titan dataset.
- If you want to use AROME as a model, you have to manually download the forecast beforehand.
Usage: gif_comparison.py [-h] --ckpt CKPT --date DATE [--num_pred_steps NUM_PRED_STEPS]
options:
-h, --help show this help message and exit
--ckpt CKPT Paths to the model checkpoint or AROME
--date DATE Date for inference. Format YYYYMMDDHH.
--num_pred_steps NUM_PRED_STEPS
Number of auto-regressive steps/prediction steps.
example: python bin/gif_comparison.py --ckpt AROME --ckpt /.../logs/my_run/epoch=247.ckpt
--date 2023061812 --num_pred_steps 10
The bin/test.py script will compute and save metrics on the validation set, on as many auto-regressive prediction steps as you want.
python bin/test.py PATH_TO_CHECKPOINT --num_pred_steps 24
Once you have executed the test.py script on all the models you want, you can compare them with bin/scores_comparison.py:
python bin/scores_comparison.py --ckpt PATH_TO_CKPT_0 --ckpt PATH_TO_CKPT_1
Warning: For now bin/scores_comparison.py only works with models trained with the Titan dataset.
This page explains how to:
- add a new neural network
- add a new dataset
- contribute to this project following our guidelines
The figure below illustrates the principal components of the Py4cast architecture.
- We define interface contracts between the components of the system using Python ABCs. As long as the Python classes respect the interface contract, they can be used interchangeably in the system and the underlying implementation can be very different. For instance, datasets with any underlying storage (grib2, netcdf, mmap+numpy, ...) and real-time or ahead-of-time concatenation and pre-processing could be used with the same neural network architectures and training strategies.
- Adding a model, a dataset, a loss, a plot, a training strategy, ... should be as simple as creating a new Python class that complies with the interface contract.
- Datasets produce Item objects, collated into ItemBatch, both having NamedTensor attributes.
- Datasets produce tensors with the following dimensions: (batch, timestep, lat, lon, features). Models can flatten or reshape the spatial dimensions in prepare_batch, but the rest of the system expects features to always be the last dimension of the tensors.
- Neural network architectures are Python classes that inherit from both ModelABC and PyTorch's nn.Module. The latter means it is quick to insert a third-party pure PyTorch model in the system (see for instance the code for Lucidrains' Segformer or a U-Net).
- We use dataclasses and dataclass_json to define the settings whenever possible. This allows us to easily serialize and deserialize the settings to/from json files with Schema validation.
- The NamedTensor allows us to keep track of the physical/weather parameters along the features dimension and to pass a single consistent object in the system. It is also a way to factorize common operations on tensors (concat along features dimension, flatten in place, ...) while keeping the dimension and feature names metadata in sync.
- We use PyTorch-lightning to train the models. This allows us to easily scale the training to multiple GPUs and to use the same training loop for all the models. We also use the PyTorch-lightning logging system to log the training metrics and the hyperparameters.
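The interface-contract idea from the design choices above can be sketched with Python's abc module. The names `ToyDatasetABC`, `InMemoryDataset` and `mean_of_first_feature` are hypothetical and much simpler than the project's real ABCs; the sketch only shows how code written against a contract accepts any compliant implementation.

```python
from abc import ABC, abstractmethod
from typing import List


class ToyDatasetABC(ABC):
    """Hypothetical contract: any storage backend works if it implements this."""

    @abstractmethod
    def __len__(self) -> int:
        ...

    @abstractmethod
    def get_item(self, index: int) -> List[float]:
        ...


class InMemoryDataset(ToyDatasetABC):
    """One possible backend; a file-based one would expose the same methods."""

    def __init__(self, items: List[List[float]]) -> None:
        self.items = items

    def __len__(self) -> int:
        return len(self.items)

    def get_item(self, index: int) -> List[float]:
        return self.items[index]


def mean_of_first_feature(dataset: ToyDatasetABC) -> float:
    """Written against the contract, not against a concrete class."""
    total = sum(dataset.get_item(i)[0] for i in range(len(dataset)))
    return total / len(dataset)


print(mean_of_first_feature(InMemoryDataset([[1.0, 5.0], [3.0, 7.0]])))  # 2.0
```

A class that forgets one of the abstract methods cannot be instantiated, so a broken contract is caught when the object is created rather than deep inside a training run.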
- Ideally, we could end up with a simple base class system for the training strategies to allow for easy addition of new strategies.
- The ItemBatch class attributes could be generalized to have multiple inputs, outputs and forcing tensors referenced by name. This would allow for more flexibility in the models and make it possible to plug in metnet-3 and Pangu.
- The distinction between prognostic and diagnostic variables should be made explicit in the system.
- We should probably reshape the GNN outputs back to the (lat, lon) gridded shape as early as possible, to have this as a common/standard output format for all the models. This would simplify the post-processing, plotting, ... We still have if statements in the code to handle the different output shapes of the models.
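The reshaping mentioned in the last point can be pictured as follows, using nested lists in place of tensors. The helper `to_grid` is hypothetical, and the row-major ordering of grid points is an assumption; the actual node ordering used by the GNN models may differ.

```python
from typing import List


def to_grid(flat: List[List[float]], lat: int, lon: int) -> List[List[List[float]]]:
    """Reshape (lat * lon, features) rows into a (lat, lon, features) nested list."""
    if len(flat) != lat * lon:
        raise ValueError(f"expected {lat * lon} grid points, got {len(flat)}")
    # Assumes row-major ordering: all longitudes of the first latitude come first.
    return [flat[r * lon:(r + 1) * lon] for r in range(lat)]


flat = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
print(to_grid(flat, lat=2, lon=3))
# [[[0.0], [1.0], [2.0]], [[3.0], [4.0], [5.0]]]
```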