-
I started Auto3DSeg in the default configuration using python -m monai.apps.auto3dseg AutoRunner run --input='./input.yaml'. After the first training completed, something went wrong and the second one was interrupted. I would like to resume from where it left off, but it seems to restart the training of all models from scratch. How can I "enable the cache" of the training process so that it trains only the models that have not been trained yet? Thanks
-
Thank you for the question. This is currently not possible automatically; you have to manually start only the unfinished runs. It should be fixable, though, and an issue has been created: #5756
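As a rough illustration of "manually start only the unfinished runs", the sketch below lists the run directories that still need training. It is not MONAI code: it assumes each run lives in its own subdirectory of the work directory and that a finished run can be recognised by a marker file. The function name `unfinished_runs` and the marker name `DONE` are hypothetical, so substitute whatever your setup actually writes when a run completes.

```python
from pathlib import Path


def unfinished_runs(work_dir, done_marker="DONE"):
    """Return the names of run subdirectories that lack the completion marker.

    `done_marker` is a hypothetical file name; it is an assumption of this
    sketch, not something Auto3DSeg is known to write.
    """
    root = Path(work_dir)
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and not (d / done_marker).exists()
    )
```

The returned names can then be used to launch training by hand for just those runs.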
-
@myron I'd like to revisit this. I'm also in a situation where my Auto3DSeg tasks can be interrupted by the job management system on our compute cluster. When this happens, the Docker container reruns from the entrypoint. By default, Auto3DSeg then skips any models and folds that have been partially trained and trains only the untouched ones, which leaves the interrupted fold with fewer epochs than the others. My current workaround is to delete the 'model' directory (or equivalent) in the partially trained algo bundle directory and restart training manually using the bash command. I was wondering whether it is possible to resume training from the last saved checkpoint instead. I suspect this would require a good deal of experience with MONAI and possibly modifying the generated training code for each model, which may mean a different solution for each architecture. If I'm lucky and I'm wrong, maybe there is a standardised way of finding the last checkpoint, working out how many more epochs to run, and resuming from there. I would be grateful for your advice.
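To make the "find the last checkpoint and work out how many more epochs to run" idea concrete, here is a minimal sketch using only the standard library. It is not how Auto3DSeg stores its state: it assumes checkpoints are `*.pt` files whose names contain the epoch number (for example `model_epoch12.pt`), which is a hypothetical naming scheme, and the helper names `last_checkpoint` and `remaining_epochs` are made up for illustration. The generated training code of each architecture may record progress quite differently.

```python
import re
from pathlib import Path

# Hypothetical naming convention: "...epoch12.pt", "...epoch_12.pt", "...epoch=12.pt"
EPOCH_RE = re.compile(r"epoch[_=-]?(\d+)")


def last_checkpoint(model_dir):
    """Return (path, epoch) for the checkpoint with the highest epoch, or None."""
    best = None
    for path in Path(model_dir).glob("*.pt"):
        match = EPOCH_RE.search(path.stem)
        if match is None:
            continue  # file name carries no epoch number, so skip it
        epoch = int(match.group(1))
        if best is None or epoch > best[1]:
            best = (path, epoch)
    return best


def remaining_epochs(total_epochs, completed_epochs):
    """Number of epochs still to run, never negative."""
    return max(total_epochs - completed_epochs, 0)
```

With these two pieces, a restart script could load the returned checkpoint and train for `remaining_epochs(total, epoch)` more epochs, provided the training code accepts a starting checkpoint and an epoch count.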
-
@scarpma and anyone else landing on this discussion: I have posted some code in #7506 that resumes partially trained models from their saved checkpoint.