
Understanding the terminal output during training #28

Closed · saskra opened this issue Nov 2, 2022 · 4 comments

saskra (Contributor) commented Nov 2, 2022

Can anyone help me understand the terminal output during training? I followed these steps: https://github.com/WillBrennan/SemanticSegmentation#training-new-projects and only adjusted minor things.

My output looks like this (with some truncation, line breaks and indentation on my part for clarity):

python train.py --train datasets/my_cat_images_train --val datasets/my_cat_images_val --model-tag segmentation_cat --model-type FCNResNet101
INFO:root:running training on cuda
INFO:root:creating dataset and data loaders
INFO:root:loaded 4114 annotations from datasets/my_cat_images_train
INFO:root:use augmentation: True
INFO:root:categories: ['cat']
INFO:root:loaded 184 annotations from datasets/my_cat_images_val
INFO:root:use augmentation: False
INFO:root:categories: ['cat']
INFO:root:creating dataloaders with 19 workers and a batch-size of 6
INFO:root:creating FCNResNet101 and optimizer with initial lr of 0.0001
INFO:root:creating model with categories: ['cat']
INFO:root:creating trainer and evaluator engines
INFO:root:creating summary writer with tag segmentation_cat
INFO:root:attaching lr scheduler
INFO:root:attaching event driven calls
INFO:root:training...
	INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=2.
	
INFO:root:epoch[1] - iteration[500/685] loss: 0.145
INFO:root:epoch: 1 - current lr: 0.0001
	INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=1.
	INFO:ignite.engine.engine.Engine:Epoch[1] Complete. Time taken: 00:00:09
	INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:00:09
INFO:root:loss: 0.147 precision: 0.833 recall: 0.880 [email protected]: 0.724 [email protected]: 0.748 Epoch: 1 
	INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=1.
	INFO:ignite.engine.engine.Engine:Epoch[1] Complete. Time taken: 00:00:09
	INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:00:09
INFO:root:loss: 0.123 precision: 0.873 recall: 0.911 [email protected]: 0.781 [email protected]: 0.804 Epoch: 1 
	INFO:ignite.engine.engine.Engine:Epoch[1] Complete. Time taken: 00:10:52
	
INFO:root:epoch[2] - iteration[500/685] loss: 0.416
INFO:root:epoch: 2 - current lr: 9e-05
	INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=1.
	INFO:ignite.engine.engine.Engine:Epoch[1] Complete. Time taken: 00:00:09
	INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:00:09
INFO:root:loss: 0.161 precision: 0.854 recall: 0.880 [email protected]: 0.745 [email protected]: 0.765 Epoch: 2 
	INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=1.
	INFO:ignite.engine.engine.Engine:Epoch[1] Complete. Time taken: 00:00:09
	INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:00:09
INFO:root:loss: 0.116 precision: 0.872 recall: 0.922 [email protected]: 0.787 [email protected]: 0.812 Epoch: 2 
	INFO:ignite.engine.engine.Engine:Epoch[2] Complete. Time taken: 00:10:52
	INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:21:45

Here's what I'd like to know about it:

  • The lines in the training output starting with "INFO:root:" are apparently defined in semantic_segmentation/engines.py, but where do the others ("INFO:ignite.engine.engine.Engine:") come from?
  • Two epochs run through as desired, but after the initially correct "max_epochs=2", why is "max_epochs=1" then printed several times?
  • Why is "Epoch[1]" printed several times during the second epoch?
  • Why does it say several times in the middle of an epoch that the epoch or the run is "complete" and that a new run is starting?
  • What does "Time taken: 00:00:09" refer to, given that the real runtime is significantly longer?
  • Why are there intermediate metric results during the epochs, although the following handler should only run once an epoch is finished? And why does only the second set of results per epoch end up in TensorBoard?

    @trainer.on(engine.Events.EPOCH_COMPLETED)

    TensorBoard CSV export:
    Wall time,Step,Value
    1667398091.2036688,1,0.8042078614234924
    1667398743.5209541,2,0.8115814328193665

Some of my questions may stem from the fact that I had only used TensorFlow before, not PyTorch. Thanks in advance if someone can help me understand at least some of these oddities.

WillBrennan (Owner) commented Nov 3, 2022

Hey;

> Two epochs run through as desired, but after the initially correct "max_epochs=2", why is "max_epochs=1" then printed several times?

This happens because we run "one epoch" over each validation dataset at the end of every training epoch. The code also computes the validation metrics on the training set, to make it easier to see whether it's overfitting.
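For illustration, the pattern looks roughly like this in ignite (a minimal sketch; `trainer`, `evaluator`, and the loader names are assumptions mirroring the `@trainer.on(...)` decorator quoted above, not the repository's exact code):

```python
from ignite.engine import Events

@trainer.on(Events.EPOCH_COMPLETED)
def compute_metrics(trainer_engine):
    # Each evaluator.run(...) is a separate one-pass run over a dataset,
    # which ignite reports as its own engine run with max_epochs=1 --
    # hence the repeated "Engine run starting with max_epochs=1." lines.
    evaluator.run(train_loader, max_epochs=1)  # metrics on the training set
    evaluator.run(val_loader, max_epochs=1)    # metrics on the validation set
```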

> What does "Time taken: 00:00:09" refer to, given that the real runtime is significantly longer?

This is also why you're seeing different completion times: the shorter times are how long an "epoch" over one of the validation sets takes.

> The lines in the training output starting with "INFO:root:" are apparently defined in semantic_segmentation/engines.py, but where do the others ("INFO:ignite.engine.engine.Engine:") come from?

These other logging statements come from within ignite itself.

saskra (Contributor, Author) commented Nov 3, 2022

> This happens because we run "one epoch" over each validation dataset at the end of every training epoch. The code also computes the validation metrics on the training set, to make it easier to see whether it's overfitting.
>
> This is also why you're seeing different completion times: the shorter times are how long an "epoch" over one of the validation sets takes.

Thank you for the explanation! So in each epoch the first row of metrics refers to the training dataset and the second to the validation dataset? You could add that to the output to make it more understandable. Isn't it unusual that the result on the validation dataset is better than on the training dataset?

> These other logging statements come from within ignite itself.

Can I suppress these outputs somehow? During longer trainings they just clutter up the terminal window.

WillBrennan (Owner) commented Nov 3, 2022

> You could add that to the output to make it more understandable.

It definitely should be; it'd be a fairly simple change here if you want to make a PR: https://github.com/WillBrennan/SemanticSegmentation/blob/master/semantic_segmentation/engines.py#L60
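A minimal sketch of what that change could look like (the `name` parameter and the metric keys here are hypothetical, not the actual contents of engines.py or of any PR):

```python
import logging

logging.basicConfig(level=logging.INFO)

def log_metrics(name: str, metrics: dict) -> None:
    # Prefix the metrics line with the dataset it was computed on,
    # so the train and validation rows are distinguishable.
    values = " ".join(f"{key}: {value:.3f}" for key, value in metrics.items())
    logging.info("%s - %s", name, values)

log_metrics("train", {"loss": 0.147, "precision": 0.833, "recall": 0.880})
log_metrics("val", {"loss": 0.123, "precision": 0.873, "recall": 0.911})
```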

> Can I suppress these outputs somehow? During longer trainings they just clutter up the terminal window.

Yes, they can be suppressed. Currently this library uses the default logger from the Python standard logging library; we should give the library its own named logger.
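For reference, since the noisy lines are tagged `INFO:ignite.engine.engine.Engine:`, you can raise that logger's level (or its parent's) before training starts; this only affects ignite's messages, so the `INFO:root:` lines still come through:

```python
import logging

# Silence ignite's per-engine INFO chatter; warnings and errors still show.
logging.getLogger("ignite.engine.engine.Engine").setLevel(logging.WARNING)
# Or, more broadly, everything under the "ignite" logger hierarchy:
logging.getLogger("ignite").setLevel(logging.WARNING)
```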

> Isn't it unusual that the result on the validation dataset is better than on the training dataset?

Ah, to clarify: when the metrics are calculated on the training dataset, augmentation is still being applied. This is why the validation dataset scores better.
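If you wanted the two rows to be directly comparable, one option is to compute the training metrics on an augmentation-free copy of the training data (a sketch; `SegmentationDataset` and its `use_augmentation` flag are hypothetical stand-ins for whatever produces the `use augmentation: True/False` lines in the log):

```python
from torch.utils.data import DataLoader
from ignite.engine import Events

# Hypothetical: the same images as the training set, but without
# augmentation, so metrics are measured under the same conditions
# as on the validation set.
train_eval_dataset = SegmentationDataset("datasets/my_cat_images_train",
                                         use_augmentation=False)
train_eval_loader = DataLoader(train_eval_dataset, batch_size=6, num_workers=19)

@trainer.on(Events.EPOCH_COMPLETED)
def compute_train_metrics(trainer_engine):
    evaluator.run(train_eval_loader, max_epochs=1)
```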

saskra (Contributor, Author) commented Nov 3, 2022

> It definitely should be; it'd be a fairly simple change here if you want to make a PR: https://github.com/WillBrennan/SemanticSegmentation/blob/master/semantic_segmentation/engines.py#L60

Here it is: #29

> Ah, to clarify: when the metrics are calculated on the training dataset, augmentation is still being applied. This is why the validation dataset scores better.

Ah, now it makes sense! :-)
