Add WarmupCosineAnnealingScheduler to NeuralLAM and add --steps
keyword to training script
#7 · +120 −0
This PR implements two features:

A `--steps` argument has been added to the training script, allowing control over the number of training steps before termination. This is implemented using `Trainer(..., max_steps=args.steps, ...)` in pytorch_lightning. If both `max_epochs` and `max_steps` are specified on the `Trainer`, training stops as soon as either limit is reached. By default `--steps` is -1, which means that without specifying this keyword training will never terminate due to `max_steps`.
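For reference, here is a minimal sketch of how such a flag can be wired into a pytorch_lightning `Trainer`. The argument name comes from this PR; the epoch limit and surrounding setup are illustrative assumptions, not the actual NeuralLAM code:

```python
from argparse import ArgumentParser

import pytorch_lightning as pl

parser = ArgumentParser()
parser.add_argument(
    "--steps",
    type=int,
    default=-1,
    help="Maximum number of training steps (-1 = no step-based limit)",
)
args = parser.parse_args()

trainer = pl.Trainer(
    max_epochs=200,        # illustrative value; training stops when either limit is hit
    max_steps=args.steps,  # -1 (the default) disables termination via max_steps
)
```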
A learning rate scheduler has been added to NeuralLAM, following the specifications for GraphCast-style learning rates given at the developer meeting at DMI on the 4th of March. The defaults are chosen such that, when using the default learning rate for Adam (`1e-3`), the linear warmup ranges from epsilon to `1e-3`, followed by cosine annealing from `1e-3` to `1e-6`. Otherwise, the minimum and maximum learning rates range from `min_factor * initial_lr` to `max_factor * initial_lr`.
The signature and defaults of the scheduler are as follows:
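The exact signature is not reproduced here, so the stub below is only a hypothetical sketch of what the interface might look like; the parameter names and defaults are assumptions inferred from the description in this PR:

```python
class WarmupCosineAnnealingScheduler:
    """Hypothetical interface sketch, not the actual code from this PR."""

    def __init__(
        self,
        optimizer,                 # torch.optim.Optimizer, e.g. Adam with lr=1e-3
        warmup_steps: int,         # length of the linear warmup phase
        annealing_steps: int,      # length of the cosine annealing phase
        min_factor: float = 1e-3,  # lowest LR = min_factor * initial_lr (1e-6 with Adam default)
        max_factor: float = 1.0,   # peak LR = max_factor * initial_lr (1e-3 with Adam default)
    ):
        ...
```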
The scheduler has three phases:

**Warmup phase**
In this phase the learning rate is warmed up by increasing it linearly from `min_factor * initial_learning_rate` to `max_factor * initial_learning_rate` (the initial learning rate is the one defined in the optimizer).

**Annealing phase**
In this phase the learning rate is annealed back down to `min_factor * initial_learning_rate` by following half a cosine cycle (cosine annealing).

**Fine-tuning phase**
In this phase the annealed learning rate is used until training terminates.
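As an illustration of the three phases (not the PR's implementation), the same shape can be expressed as a multiplier for `torch.optim.lr_scheduler.LambdaLR`:

```python
import math

import torch

def warmup_cosine_lambda(warmup_steps, annealing_steps, min_factor=1e-3, max_factor=1.0):
    """Return a LambdaLR multiplier implementing warmup -> cosine annealing -> constant."""

    def factor(step):
        if step < warmup_steps:
            # Warmup: linear increase from min_factor to max_factor
            return min_factor + (max_factor - min_factor) * step / max(1, warmup_steps)
        if step < warmup_steps + annealing_steps:
            # Annealing: half a cosine cycle from max_factor back down to min_factor
            progress = (step - warmup_steps) / max(1, annealing_steps)
            return min_factor + 0.5 * (max_factor - min_factor) * (1 + math.cos(math.pi * progress))
        # Fine-tuning: stay at the annealed (minimum) learning rate
        return min_factor

    return factor

model = torch.nn.Linear(4, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=warmup_cosine_lambda(warmup_steps=20, annealing_steps=100)
)
```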
**Learning schedule**
Using the default learning rate for the Adam optimizer (`1e-3`), 20 warmup steps, 100 annealing steps and 150 steps in total produces a schedule that looks as follows:

*(learning-rate schedule plot)*
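As a rough numeric check of that shape (using the `LambdaLR` sketch above rather than the PR's scheduler), stepping for 150 steps starts near `1e-6`, peaks at `1e-3` after the 20 warmup steps, and settles back at `1e-6` once annealing finishes:

```python
# Assumes `optimizer` and `scheduler` from the sketch above.
lrs = []
for _ in range(150):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # normally preceded by a forward/backward pass
    scheduler.step()

print(f"start {lrs[0]:.1e}, peak {max(lrs):.1e}, end {lrs[-1]:.1e}")
# expected roughly: start 1.0e-06, peak 1.0e-03, end 1.0e-06
```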
**Type of change**

**Checklist before requesting a review**
- The branch has been updated against the latest `main` (use `git pull` with the `--rebase` option if possible).

**Checklist for reviewers**
Each PR comes with its own improvements and flaws. The reviewer should check the following:

**Author checklist after completed review**
- A line describing this change has been added to the changelog, reflecting type of change (add section where missing):

**Checklist for assignee**