
Moved calculation of mse to common_step and added logging of mean rmse #5

Open · wants to merge 5 commits into base: dev/gefion

Conversation

@mafdmi mafdmi commented Mar 5, 2025

Describe your changes

Logging RMSE during training, validation, and testing.

Issue Link

< Link to the relevant issue or task. > (e.g. closes #00 or solves #00)

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs, and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@matschreiner

LGTM.
Just added comments about comments.
Maybe it's a question of personal style, but still, here's a nice recap of the chapter on comments from the famous book "Clean Code" by Uncle Bob :)
https://medium.com/codex/clean-code-comments-833e11a706dc

@mafdmi
Author

mafdmi commented Mar 5, 2025

Nice, thanks for the link. Have you added the comments about the comments to this PR? I don't see them :(

@matschreiner

matschreiner commented Mar 5, 2025

@mafdmi Could it be possible to also log the learning rate during training?
I think you can find it on

model.trainer.optimizer.param_groups[0]["lr"]
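The attribute path in this snippet goes through the model, while the final PR code reads it through the trainer (`self.trainer.optimizers[0].param_groups[0]["lr"]`, quoted later in the review). A minimal standalone sketch of where PyTorch keeps the current learning rate; the `Linear` model here is just a placeholder:

```python
import torch

# Placeholder model; any torch.nn.Module works the same way.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# An optimizer stores its hyperparameters per parameter group,
# so the current learning rate can be read back at any time:
current_lr = optimizer.param_groups[0]["lr"]

# In a LightningModule the same lookup goes through the trainer:
#     self.trainer.optimizers[0].param_groups[0]["lr"]
```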

@matschreiner

Nice, thanks for the link. Have you added the comments about the comments to this PR? I don't see them :(

I did, but I don't know why you can't see them. Anyway, I just wrote that the comments were maybe redundant. :)

@mafdmi
Author

mafdmi commented Mar 6, 2025

I've added logging of learning rate, so I think this is ready for final review. You can see the results of my test run at https://localhost:4433/#/experiments/7/runs/6f7eacff595a4515b1892029c368052b/model-metrics

@matschreiner

LGTM! :)

@mafdmi
Author

mafdmi commented Mar 7, 2025

@matschreiner I've fixed the tests, but for some reason two of the workflow jobs haven't been run - they have been waiting for a runner to pick them up since yesterday. Tried to re-run them today: https://github.com/mafdmi/neural-lam/actions/runs/13700752255

@matschreiner

matschreiner commented Mar 10, 2025

Sorry @mafdmi I added a lot of comments.

Nice that we are logging all the metrics now. I had hoped that we could find a common metric to compare models across runs with different variables.

@mafdmi
Author

mafdmi commented Mar 10, 2025

Sorry @mafdmi I added a lot of comments.

Nice that we are logging all the metrics now. I had hoped that we could find a common metric to compare models across runs with different variables.

I don't get it, but I still don't see your comments!

sum_vars=False,
) # (B, pred_steps, d_f)

# Log mean RMSE for first prediction step


Redundant comment?

@@ -283,13 +283,31 @@ def common_step(self, batch):
# prediction: (B, pred_steps, num_grid_nodes, d_f) pred_std: (B,
# pred_steps, num_grid_nodes, d_f) or (d_f,)

return prediction, target_states, pred_std, batch_times
# Calculate MSEs


Redundant comment?

@@ -405,10 +416,12 @@ def test_step(self, batch, batch_idx):
batch_size=batch[0].shape[0],
)

# Store already computed MSEs


:)

@@ -269,7 +269,7 @@ def unroll_prediction(self, init_states, forcing_features, true_states):

def common_step(self, batch):
"""
Predict on single batch batch consists of: init_states: (B, 2,
Predict on single batch consists of: init_states: (B, 2,


I don't think this docstring is clear.


def training_step(self, batch):
"""
Train on single batch
"""
prediction, target, pred_std, _ = self.common_step(batch)
prediction, target, pred_std, _, entry_mses = self.common_step(batch)


I think the common step should have a more descriptive name - it should reflect its function rather than the fact that it is being shared.
Also, it has two responsibilities - prediction with the model and processing of the prediction - which could maybe be factored into separate steps?
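One way to realize that split - all names here are hypothetical, and the batch layout, the `unroll_prediction` signature, and the `(B, pred_steps, num_grid_nodes, d_f)` shapes are taken from the diff comments, not verified against the repo:

```python
import torch


def predict(model, batch):
    """Forward pass only (first responsibility of common_step).

    Assumes the batch unpacking order and the unroll_prediction
    signature shown in the diff; both are assumptions.
    """
    init_states, target_states, forcing_features, batch_times = batch
    prediction, pred_std = model.unroll_prediction(
        init_states, forcing_features, target_states
    )
    return prediction, target_states, pred_std, batch_times


def compute_entry_mses(prediction, target_states):
    """Per-variable squared errors (second responsibility).

    (B, pred_steps, num_grid_nodes, d_f) -> (B, pred_steps, d_f),
    averaging the squared error over the grid-node dimension.
    """
    return torch.mean((prediction - target_states) ** 2, dim=2)
```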

# Logging
train_log_dict = {"train_loss": batch_loss}
state_var_names = self._datastore.get_vars_names(category="state")
train_log_dict |= {


This code is hard to read. Maybe wrap it in a function with a descriptive name? I would imagine that the train_log_dict should be defined in one line, something like

train_log_dict = {"train_loss": batch_loss, "lr": ..., **rmse_dict}
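The suggested one-line shape could be produced by a small helper; `build_train_log_dict` and its argument names are hypothetical, not code from the PR:

```python
def build_train_log_dict(batch_loss, lr, state_var_names, mean_rmse_step_1):
    """Hypothetical helper: assemble the whole training log dict at once."""
    rmse_dict = {
        f"train_rmse_{name}": mean_rmse_step_1[i]
        for i, name in enumerate(state_var_names)
    }
    return {"train_loss": batch_loss, "lr": lr, **rmse_dict}
```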

f"train_rmse_{v}": mean_rmse_ar_step_1[i]
for (i, v) in enumerate(state_var_names)
}
train_log_dict["train_lr"] = self.trainer.optimizers[0].param_groups[0][


Isn't the learning rate only related to training anyway? If so, the "train_" prefix in "train_lr" seems redundant.

@@ -343,6 +366,15 @@ def validation_step(self, batch, batch_idx):
if step <= len(time_step_loss)
}
val_log_dict["val_mean_loss"] = mean_loss


same as above

@@ -352,13 +384,6 @@ def validation_step(self, batch, batch_idx):
)

# Store MSEs


Redundant comment

@mafdmi
Author

mafdmi commented Mar 10, 2025

@matschreiner I've fixed the tests, but for some reason two of the workflow jobs haven't been run - they have been waiting for a runner to pick them up since yesterday. Tried to re-run them today: https://github.com/mafdmi/neural-lam/actions/runs/13700752255

@leifdenby Do you know why the gpu tests fail in above action? Is it okay to merge without those tests passing?
