
Moved calculation of mse to common_step and added logging of mean rmse #5

Open · wants to merge 5 commits into base: dev/gefion

Conversation

@mafdmi mafdmi commented Mar 5, 2025

Describe your changes

Logging RMSE during training, validation, and testing.

Issue Link

< Link to the relevant issue or task. > (e.g. closes #00 or solves #00)

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs, and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@matschreiner

LGTM.
Just added comments about comments.
Maybe it's a question of personal style, but still, here's a nice recap of the chapter on comments from the famous book "Clean Code" by Uncle Bob :)
https://medium.com/codex/clean-code-comments-833e11a706dc

@mafdmi
Author

mafdmi commented Mar 5, 2025

Nice, thanks for the link. Have you added the comments about the comments to this PR? I don't see them :(

@matschreiner

matschreiner commented Mar 5, 2025

@mafdmi Could it be possible to also log the learning rate during training?
I think you can find it on

model.trainer.optimizer.param_groups[0]["lr"]
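The attribute path in this snippet goes through the model, while the final PR code reads it through the trainer (`self.trainer.optimizers[0].param_groups[0]["lr"]`, quoted later in the review). A minimal standalone sketch of where PyTorch keeps the current learning rate; the `Linear` model here is just a placeholder:

```python
import torch

# Placeholder model; any torch.nn.Module works the same way.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# An optimizer stores its hyperparameters per parameter group,
# so the current learning rate can be read back at any time:
current_lr = optimizer.param_groups[0]["lr"]

# In a LightningModule the same lookup goes through the trainer:
#     self.trainer.optimizers[0].param_groups[0]["lr"]
```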

@matschreiner

Nice, thanks for the link. Have you added the comments about the comments to this PR? I don't see them :(

I did, but I don't know why you can't see them. Anyway, I just wrote that the comments were maybe redundant. :)

@mafdmi
Author

mafdmi commented Mar 6, 2025

I've added logging of learning rate, so I think this is ready for final review. You can see the results of my test run at https://localhost:4433/#/experiments/7/runs/6f7eacff595a4515b1892029c368052b/model-metrics

@matschreiner

LGTM! :)

@mafdmi
Author

mafdmi commented Mar 7, 2025

@matschreiner I've fixed the tests, but for some reason two of the workflow jobs haven't been run - they have been waiting for a runner to pick them up since yesterday. Tried to re-run them today: https://github.com/mafdmi/neural-lam/actions/runs/13700752255

@matschreiner

matschreiner commented Mar 10, 2025

Sorry @mafdmi I added a lot of comments.

Nice that we are logging all the metrics now. I had hoped that we could find a common metric to compare models across runs with different variables.

@mafdmi
Author

mafdmi commented Mar 10, 2025

Sorry @mafdmi I added a lot of comments.

Nice that we are logging all the metrics now. I had hoped that we could find a common metric to compare models across runs with different variables.

I don't get it, but I still don't see your comments!

sum_vars=False,
) # (B, pred_steps, d_f)

# Log mean RMSE for first prediction step


Redundant comment?

@@ -283,13 +283,31 @@ def common_step(self, batch):
# prediction: (B, pred_steps, num_grid_nodes, d_f) pred_std: (B,
# pred_steps, num_grid_nodes, d_f) or (d_f,)

return prediction, target_states, pred_std, batch_times
# Calculate MSEs


Redundant comment?

@@ -405,10 +416,12 @@ def test_step(self, batch, batch_idx):
batch_size=batch[0].shape[0],
)

# Store already computed MSEs


:)

@@ -269,7 +269,7 @@ def unroll_prediction(self, init_states, forcing_features, true_states):

def common_step(self, batch):
"""
Predict on single batch batch consists of: init_states: (B, 2,
Predict on single batch consists of: init_states: (B, 2,


I don't think this docstring is clear.


def training_step(self, batch):
"""
Train on single batch
"""
prediction, target, pred_std, _ = self.common_step(batch)
prediction, target, pred_std, _, entry_mses = self.common_step(batch)


I think the common step should have a more descriptive name - it should reflect its function rather than the fact that it is being shared.
Also, it has two responsibilities - prediction with the model and processing of the prediction - which could maybe be factored into separate steps?
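One way to realize that split - all names here are hypothetical, and the batch layout, the `unroll_prediction` signature, and the `(B, pred_steps, num_grid_nodes, d_f)` shapes are taken from the diff comments, not verified against the repo:

```python
import torch


def predict(model, batch):
    """Forward pass only (first responsibility of common_step).

    Assumes the batch unpacking order and the unroll_prediction
    signature shown in the diff; both are assumptions.
    """
    init_states, target_states, forcing_features, batch_times = batch
    prediction, pred_std = model.unroll_prediction(
        init_states, forcing_features, target_states
    )
    return prediction, target_states, pred_std, batch_times


def compute_entry_mses(prediction, target_states):
    """Per-variable squared errors (second responsibility).

    (B, pred_steps, num_grid_nodes, d_f) -> (B, pred_steps, d_f),
    averaging the squared error over the grid-node dimension.
    """
    return torch.mean((prediction - target_states) ** 2, dim=2)
```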

# Logging
train_log_dict = {"train_loss": batch_loss}
state_var_names = self._datastore.get_vars_names(category="state")
train_log_dict |= {


This code is hard to read. Maybe wrap it in a function with a descriptive name? I would imagine that the train_log_dict should be defined in one line, something like

train_log_dict = {"train_loss": batch_loss, "lr": ..., **rmse_dict}
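The suggested one-line shape could be produced by a small helper; `build_train_log_dict` and its argument names are hypothetical, not code from the PR:

```python
def build_train_log_dict(batch_loss, lr, state_var_names, mean_rmse_step_1):
    """Hypothetical helper: assemble the whole training log dict at once."""
    rmse_dict = {
        f"train_rmse_{name}": mean_rmse_step_1[i]
        for i, name in enumerate(state_var_names)
    }
    return {"train_loss": batch_loss, "lr": lr, **rmse_dict}
```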

f"train_rmse_{v}": mean_rmse_ar_step_1[i]
for (i, v) in enumerate(state_var_names)
}
train_log_dict["train_lr"] = self.trainer.optimizers[0].param_groups[0][


Isn't the learning rate only related to training anyway? If so, the "train_" prefix in "train_lr" seems redundant.

@@ -343,6 +366,15 @@ def validation_step(self, batch, batch_idx):
if step <= len(time_step_loss)
}
val_log_dict["val_mean_loss"] = mean_loss


same as above

@@ -352,13 +384,6 @@ def validation_step(self, batch, batch_idx):
)

# Store MSEs


Redundant comment

@mafdmi
Author

mafdmi commented Mar 10, 2025

@matschreiner I've fixed the tests, but for some reason two of the workflow jobs haven't been run - they have been waiting for a runner to pick them up since yesterday. Tried to re-run them today: https://github.com/mafdmi/neural-lam/actions/runs/13700752255

@leifdenby Do you know why the gpu tests fail in above action? Is it okay to merge without those tests passing?
