auto3dseg resume training #5754

scarpma · 2022-12-15T10:09:38Z

scarpma
Dec 15, 2022

I started auto3dseg in the default configuration using

python -m monai.apps.auto3dseg AutoRunner run --input='./input.yaml'

After the first training completed, something happened and the second one was interrupted. I would like to resume from where it left off, but it seems to restart the training of all models from scratch. How can I "enable the cache" of the training process so that it only trains the models that have not already been trained?

Thanks

Answered by myron


myron · 2022-12-15T17:54:56Z

myron
Dec 15, 2022
Maintainer

Thank you for the question. Resuming is currently not possible automatically; you have to manually start only the unfinished runs. It should be fixable, though, and issue #5756 was created for it.
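To decide which runs to start manually, it can help to list which algorithm bundle folders already contain training output. The following is a minimal sketch using only the Python standard library, not a MONAI API. It assumes (hypothetically) that bundle folders are named like `segresnet_0`, i.e. a name followed by `_<number>`, and that a bundle counts as started once it holds a non-empty sub-directory whose name begins with `model`. Both conventions should be checked against your own work directory.

```python
import re
from pathlib import Path

# Assumption: bundle folders look like "<algo>_<number>", e.g. "segresnet_0".
BUNDLE_NAME = re.compile(r".+_\d+")


def split_bundles_by_progress(work_dir, model_prefix="model"):
    """Sort bundle folders under work_dir into (started, untouched) name lists.

    Assumption: a bundle is "started" if it contains a non-empty
    sub-directory whose name begins with model_prefix.
    """
    started, untouched = [], []
    for bundle in sorted(Path(work_dir).iterdir()):
        if not bundle.is_dir() or not BUNDLE_NAME.fullmatch(bundle.name):
            continue  # skip files and folders that are not bundles
        has_output = any(
            child.is_dir()
            and child.name.startswith(model_prefix)
            and any(child.iterdir())
            for child in bundle.iterdir()
        )
        (started if has_output else untouched).append(bundle.name)
    return started, untouched
```

Calling `split_bundles_by_progress("./work_dir")` returns two lists of folder names. Note that "started" does not mean "finished": an interrupted run also leaves output behind, so those folders still need to be inspected by hand.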

2 replies

scarpma Dec 15, 2022
Author

Thank you very much! Could you tell me how to start each one individually in the meantime?

pwrightkcl Jan 31, 2024

You can restart individual algorithm bundles using the bash command given in the doc/README.md file inside each algorithm bundle's folder as described here.

There is also a Python method described in issue #5756, but I haven't tried it.

(I know it's risky to pester a dead thread, but it's for posterity.)

SeracFloe · 2022-12-28T23:31:06Z

SeracFloe
Dec 28, 2022

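This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email in person. I will reply to you as soon as possible after my vacation ends.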

0 replies

pwrightkcl · 2024-01-31T11:00:30Z

pwrightkcl
Jan 31, 2024

@myron I'd like to revisit this. I'm also in a situation where my auto3dseg tasks can be interrupted by the job management system on our compute cluster. When this happens, the docker container will rerun from the entrypoint. Auto3dseg will, by default, skip any models and folds that have been partially trained and train only the untouched ones. That means the interrupted fold will be left with fewer epochs run than the others. My current option is to delete the 'model' directory (or equivalent) in the partially-trained algo bundle directory and restart training manually using the bash command. I was wondering if it is possible to get training to resume from the last saved checkpoint instead. I have a feeling this would require a good deal of experience with MONAI and possibly modifying the generated training code for each model, which may involve a different solution for each of the different architectures. If I'm lucky and I'm wrong, maybe there's a standardised way of finding the last checkpoint, working out how many more epochs to run, and resuming from there. I would be grateful for your advice.
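To make the last two steps concrete (finding the last checkpoint and working out how many more epochs to run), here is a rough sketch in plain Python. It is not how any Auto3DSeg bundle actually stores its state. The file naming scheme `epoch_<N>.pt` and the helper names are hypothetical, and real bundles may instead record the epoch inside the checkpoint file, in which case the lookup would have to be adapted per architecture.

```python
import re
from pathlib import Path

# Hypothetical naming scheme: checkpoints are saved as "epoch_<N>.pt".
EPOCH_RE = re.compile(r"epoch_(\d+)")


def find_last_checkpoint(ckpt_dir):
    """Return (path, epoch) of the highest-numbered checkpoint, or (None, 0)."""
    ckpt_dir = Path(ckpt_dir)
    best_path, best_epoch = None, 0
    if not ckpt_dir.is_dir():
        return best_path, best_epoch
    for path in ckpt_dir.glob("*.pt"):
        match = EPOCH_RE.search(path.stem)
        if match and int(match.group(1)) > best_epoch:
            best_path, best_epoch = path, int(match.group(1))
    return best_path, best_epoch


def remaining_epochs(total_epochs, completed_epochs):
    """Epochs still to run; never negative if training overshot the target."""
    return max(total_epochs - completed_epochs, 0)
```

With these two values, a restart script could load the returned checkpoint and pass the remaining epoch count to the training command, assuming the generated training code accepts both.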

0 replies

pwrightkcl · 2024-02-29T15:44:03Z

pwrightkcl
Feb 29, 2024

@scarpma and anyone else landing on this discussion, I have posted some code to resume partially-trained models from their saved checkpoint in #7506.

0 replies
