Moved calculation of mse to common_step and added logging of mean rmse #5
base: dev/gefion
Conversation
LGTM.
Nice, thanks for the link. Have you added replies to the comments in this PR? I don't see them :(
@mafdmi Could it be possible to also log the learning rate during training?
I did, but I don't know why you don't see them. Anyway, I just wrote that the comments were maybe redundant. :)
I've added logging of the learning rate, so I think this is ready for final review. You can see the results of my test run at https://localhost:4433/#/experiments/7/runs/6f7eacff595a4515b1892029c368052b/model-metrics
LGTM! :)
@matschreiner I've fixed the tests, but for some reason two of the workflow jobs haven't been run - they have been waiting for a runner to pick them up since yesterday. I tried to re-run them today: https://github.com/mafdmi/neural-lam/actions/runs/13700752255
Sorry @mafdmi, I added a lot of comments. Nice that we are logging all the metrics now. I had hoped that we could find a common metric to compare models across runs with different variables.
I don't get it, but I still don't see your comments!
neural_lam/models/ar_model.py
Outdated
sum_vars=False,
) # (B, pred_steps, d_f)

# Log mean RMSE for first prediction step
Redundant comment?
neural_lam/models/ar_model.py
Outdated
@@ -283,13 +283,31 @@ def common_step(self, batch):
# prediction: (B, pred_steps, num_grid_nodes, d_f) pred_std: (B,
# pred_steps, num_grid_nodes, d_f) or (d_f,)

return prediction, target_states, pred_std, batch_times
# Calculate MSEs
Redundant comment?
@@ -405,10 +416,12 @@ def test_step(self, batch, batch_idx):
batch_size=batch[0].shape[0],
)

# Store already computed MSEs
:)
@@ -269,7 +269,7 @@ def unroll_prediction(self, init_states, forcing_features, true_states):

def common_step(self, batch):
"""
- Predict on single batch batch consists of: init_states: (B, 2,
+ Predict on single batch consists of: init_states: (B, 2,
I don't think this docstring is clear.
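Picking up on that comment, here is one possible clearer docstring. The shapes beyond the `(B, 2,` fragment visible in the diff, and the batch field names, are assumptions for illustration, not taken from the PR:

```python
def common_step(self, batch):
    """Predict on a single batch and compute per-entry MSEs.

    The batch is a tuple of (names and shapes illustrative):
        init_states: (B, 2, num_grid_nodes, d_f)
        target_states: (B, pred_steps, num_grid_nodes, d_f)
        forcing_features: (B, pred_steps, num_grid_nodes, d_forcing)
        batch_times: (B, pred_steps)

    Returns:
        prediction, target_states, pred_std, batch_times, entry_mses
    """
```

Spelling out each batch element on its own line avoids the "single batch batch consists of" run-on the reviewer flagged.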
def training_step(self, batch):
"""
Train on single batch
"""
- prediction, target, pred_std, _ = self.common_step(batch)
+ prediction, target, pred_std, _, entry_mses = self.common_step(batch)
I think common_step should have a more descriptive name - it should reflect its function rather than the fact that it is being shared.
Also, it has two responsibilities, prediction with the model and processing of the prediction - could these maybe be factored into separate steps?
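As a sketch of the suggested split, prediction and metric computation could live in separate functions. All names here are hypothetical, numpy stands in for torch, and the stand-in "model" just offsets the targets; only the (B, pred_steps, num_grid_nodes, d_f) shape convention is taken from the diff:

```python
import numpy as np

def predict(init_states, forcing, true_states):
    """Stand-in for the model rollout: returns prediction and pred_std."""
    prediction = true_states + 0.1  # pretend the model is off by 0.1 everywhere
    pred_std = np.ones_like(true_states)
    return prediction, pred_std

def compute_entry_mses(prediction, target):
    """Per-variable, per-step MSE, averaged over batch and grid nodes.

    prediction, target: (B, pred_steps, num_grid_nodes, d_f)
    returns: (pred_steps, d_f)
    """
    return ((prediction - target) ** 2).mean(axis=(0, 2))

# Usage: training/validation/test steps would call both in sequence.
B, pred_steps, num_grid_nodes, d_f = 2, 3, 4, 5
target = np.zeros((B, pred_steps, num_grid_nodes, d_f))
prediction, pred_std = predict(None, None, target)
entry_mses = compute_entry_mses(prediction, target)
print(entry_mses.shape)  # (3, 5)
```

Each function then has one responsibility, and the metric helper can be unit-tested without running the model.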
# Logging
train_log_dict = {"train_loss": batch_loss}
state_var_names = self._datastore.get_vars_names(category="state")
train_log_dict |= {
This code is hard to read. Maybe wrap it in a function with a descriptive name? I would imagine that train_log_dict could then be defined in one line, something like
train_log_dict = {"train_loss": batch_loss, "lr": ..., **rmse_dict}
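A minimal sketch of that suggestion. The function name, its arguments, and the variable names "t2m"/"u10" are hypothetical; mean_rmse_step_1 stands in for the mean_rmse_ar_step_1 tensor from the diff:

```python
def build_train_log_dict(batch_loss, lr, mean_rmse_step_1, state_var_names):
    """Assemble the whole training log dict in one expression."""
    rmse_dict = {
        f"train_rmse_{v}": rmse
        for v, rmse in zip(state_var_names, mean_rmse_step_1)
    }
    return {"train_loss": batch_loss, "lr": lr, **rmse_dict}

# Hypothetical usage with two state variables
log = build_train_log_dict(0.5, 1e-3, [0.1, 0.2], ["t2m", "u10"])
print(log)
```

The caller then sees a single, descriptively named call instead of incremental dict mutation.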
f"train_rmse_{v}": mean_rmse_ar_step_1[i]
for (i, v) in enumerate(state_var_names)
}
train_log_dict["train_lr"] = self.trainer.optimizers[0].param_groups[0][
Isn't the learning rate only related to training anyway? The "train_" prefix in "train_lr" seems redundant.
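For reference, the access pattern in the diff reads the learning rate from the optimizer's first param group. A minimal stand-in (no torch required; FakeOptimizer only mimics the param_groups attribute of a real torch.optim.Optimizer) shows the shape of that access; the log key itself is then just a naming choice:

```python
class FakeOptimizer:
    """Stand-in mimicking torch.optim.Optimizer's param_groups attribute."""
    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]

optimizer = FakeOptimizer(1e-3)
current_lr = optimizer.param_groups[0]["lr"]  # same lookup as in the diff
log_dict = {"lr": current_lr}  # or "train_lr"; the prefix is a naming choice
print(log_dict)
```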
@@ -343,6 +366,15 @@ def validation_step(self, batch, batch_idx):
if step <= len(time_step_loss)
}
val_log_dict["val_mean_loss"] = mean_loss
Same as above.
@@ -352,13 +384,6 @@ def validation_step(self, batch, batch_idx):
)

# Store MSEs
Redundant comment
@leifdenby Do you know why the GPU tests fail in the above action? Is it okay to merge without those tests passing?
Describe your changes
Logging RMSE during training, validation, and testing.
Issue Link
< Link to the relevant issue or task. > (e.g. closes #00 or solves #00)
Type of change
Checklist before requesting a review
(use pull with --rebase option if possible).
Checklist for reviewers
Each PR comes with its own improvements and flaws. The reviewer should check the following:
Author checklist after completed review
reflecting type of change (add section where missing):
Checklist for assignee