- Add a new parameter `NUM_TRAINERS` to specify the number of GPU instances. This only works with the DDP branch of DeepEM.
- `gpu_ids` sets the number of processes in each instance, but does not actually restrict which GPUs are used. If its length is larger than the actual number of GPUs, the training will fail.
- `batch_size` and `num_workers` are now per process/GPU, so you need to scale down the values from the previous training script to match the change.

Not sure the current version is very compatible with spot instance training. Elastic training cannot run entirely on spot instances, and we are not using it now, so any instance loss will cause the entire training to fail. Resuming seems to require a new `rdvz_id`, which for now only happens with an `update parameter` command.
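
To illustrate the parameter notes above, here is a minimal sketch of how the old values might be scaled down. It is not code from this PR. The function `per_process_settings` and all of its argument names are hypothetical. It assumes the old `batch_size` was the effective global batch size you want to preserve, and that the old `num_workers` was the total number of loader workers on one instance.

```python
def per_process_settings(old_batch_size, old_num_workers, gpu_ids,
                         num_trainers, available_gpus):
    """Return (batch_size, num_workers) per process/GPU.

    Hypothetical helper, not part of DeepEM. Assumptions:
      * one process is started per entry in gpu_ids on each instance,
      * num_trainers is the number of instances (NUM_TRAINERS),
      * old_batch_size should stay the effective global batch size,
      * old_num_workers was the total worker count on one instance.
    """
    procs_per_instance = len(gpu_ids)
    if procs_per_instance == 0:
        raise ValueError("gpu_ids must not be empty")
    # The description notes that training fails when gpu_ids is longer
    # than the number of GPUs actually present, so check this up front.
    if procs_per_instance > available_gpus:
        raise ValueError(
            f"gpu_ids has {procs_per_instance} entries but only "
            f"{available_gpus} GPUs are available"
        )

    # Total number of processes across all instances.
    world_size = procs_per_instance * num_trainers
    if old_batch_size % world_size != 0:
        raise ValueError(
            f"old_batch_size={old_batch_size} is not divisible by "
            f"world size {world_size}"
        )

    batch_size = old_batch_size // world_size
    # Loader workers are CPU processes local to one instance, so they are
    # split only across the processes on that instance (at least 1 each).
    num_workers = max(1, old_num_workers // procs_per_instance)
    return batch_size, num_workers


# Two instances with four GPUs each: a global batch of 32 becomes 4 per
# process, and 16 workers per instance become 4 per process.
print(per_process_settings(32, 16, [0, 1, 2, 3], 2, 4))  # (4, 4)
```

If the old script's `batch_size` was already per instance rather than global, divide by `len(gpu_ids)` only.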