How to start training again from last checkpoint #4488
Replies: 7 comments 10 replies
-
If you're using exp manager, it has two flags for resuming training. Without it, you can try loading your checkpoint using load_from_checkpoint() and then calling trainer.fit() again, but we haven't tested that.
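For the manual route without exp manager, here is a minimal sketch of picking which checkpoint file to resume from. It uses only the standard library. The `find_latest_checkpoint` helper, the directory path, and the assumption that the most recent checkpoint ends in `-last.ckpt` are illustrative and not taken from this thread, so check what your run actually wrote to disk.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the checkpoint to resume from, or None if the directory has none.

    Hypothetical helper: prefers files ending in "-last.ckpt" (an assumed
    naming convention), otherwise falls back to the most recently modified
    ".ckpt" file.
    """
    ckpts = list(Path(ckpt_dir).glob("*.ckpt"))
    if not ckpts:
        return None
    last = [p for p in ckpts if p.name.endswith("-last.ckpt")]
    candidates = last or ckpts
    # Among the candidates, pick the one written most recently.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Hypothetical usage (path and variable names are placeholders):
# ckpt = find_latest_checkpoint("nemo_experiments/my_model/checkpoints")
# trainer.fit(model, ckpt_path=str(ckpt) if ckpt else None)
```

Passing `ckpt_path` to `trainer.fit()` is how recent PyTorch Lightning versions restore the optimizer state and epoch counter along with the weights. Older versions took `resume_from_checkpoint` on the `Trainer` constructor instead, so which one applies depends on your installed version.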
-
Are you talking about this one: create_checkpoint_callback? It is set to True. @titu1994
-
@FatimaArshad-DS see NeMo/nemo/utils/exp_manager.py, lines 197 to 200 and lines 204 to 206 (both at commit 8a172df).
-
Note that you will need to either fix NeMo/nemo/utils/exp_manager.py lines 188 to 189 (at commit 8a172df) or set lines 194 to 195 of the same file so that they point to the specific run you wish to resume.
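The code embeds referenced in this thread did not survive, so the following is only a sketch of what an exp manager config for resuming might look like. It assumes the two resume flags are `resume_if_exists` and `resume_ignore_no_checkpoint`, and that `exp_dir` and `name` are the fields that pin the run. The `build_resume_cfg` helper and the example values are hypothetical.

```python
from typing import Any, Dict


def build_resume_cfg(exp_dir: str, name: str) -> Dict[str, Any]:
    """Build an exp manager config dict that asks for a resume.

    Hypothetical helper; the field names are assumed from NeMo's
    exp manager config, not confirmed by this thread.
    """
    return {
        # Must match the interrupted run, otherwise a fresh run starts.
        "exp_dir": exp_dir,
        "name": name,
        "create_checkpoint_callback": True,
        # Assumed flag: resume from an existing checkpoint if one is found.
        "resume_if_exists": True,
        # Assumed flag: start from scratch instead of failing when no
        # checkpoint exists yet.
        "resume_ignore_no_checkpoint": True,
    }


cfg = build_resume_cfg("nemo_experiments", "my_model")  # placeholder values

# Hypothetical usage with NeMo installed:
# from nemo.utils.exp_manager import exp_manager
# exp_manager(trainer, cfg)
# trainer.fit(model)
```

If training still restarts from the beginning with these set, the first thing to check is whether `exp_dir` and `name` resolve to the directory that actually contains the earlier run's checkpoints.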
-
We also have some documentation here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/core/exp_manager.html
-
You could continue training from the latest checkpoint following the steps:
-
@FatimaArshad-DS Do you have any other questions about this?
-
Hi,
My training got interrupted and I'm trying to restart it from the last checkpoint. However, training starts from the beginning. How do I make it start from the last checkpoint?