DDP support for training loop #110

Open · wants to merge 7 commits into main
Conversation


ranlu commented Jan 16, 2025

Add a new parameter NUM_TRAINERS to specify the number of GPU instances. This only works with the DDP branch of DeepEM. gpu_ids matters for setting the number of processes on each instance, but it does not actually restrict which GPUs are used; if its length is larger than the actual number of GPUs, training will fail.
batch_size and num_workers are now the batch size and worker count per process/GPU, so you need to scale down the values from previous training scripts to match this change; a rough sketch of the scaling is shown below.
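For illustration only (the single-process values of 64 and 8 below are made up; only NUM_TRAINERS and the per-process batch_size/num_workers come from this PR):

```python
# Illustrative sketch: convert an old single-process config to per-process values.
old_batch_size = 64    # global batch size from a previous single-process training script
old_num_workers = 8    # dataloader workers from the previous script
num_trainers = 4       # NUM_TRAINERS: number of GPU processes in this run

# Each DDP process now runs its own dataloader, so divide the old values by the
# number of processes to keep the effective global batch size unchanged.
batch_size = old_batch_size // num_trainers    # 16 per process/GPU
num_workers = old_num_workers // num_trainers  # 2 per process/GPU
```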
I am not sure the current version is very compatible with spot-instance training. Elastic training cannot run entirely on spot instances and we are not using it now, so losing any instance will cause the entire training to fail. Resuming seems to require a new rdzv_id, which currently only happens via an update-parameter command.

ranlu added 4 commits January 16, 2025 16:09
Easier for the trainers to communicate with each other
This is for DDP: for some reason DDP prefers to use IPv6 to communicate, and
disabling IPv6 is simpler than extending the deployment to support it.
Use torchrun to set up the environment and launch the training script
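For context, a minimal sketch of what a torchrun-launched training script typically does (this is not the DeepEM code; RANK, LOCAL_RANK, and WORLD_SIZE are the environment variables torchrun sets for each process it spawns):

```python
# Minimal sketch of a DDP entry point launched by torchrun (illustrative, not DeepEM).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
    torch.cuda.set_device(local_rank)
    # With torchrun, init_process_group picks up rank/world size from the environment.
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(16, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... training loop: each process loads its own shard of the data ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A launch would then look something like `torchrun --nnodes=<NUM_TRAINERS> --nproc_per_node=<len(gpu_ids)> train.py`; the exact mapping of NUM_TRAINERS and gpu_ids to torchrun flags is an assumption here, not taken from this PR.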
ranlu requested a review from torms3 January 16, 2025 23:11
ranlu added 2 commits January 16, 2025 21:28
It does not necessarily run the rank 0 process, but should still be
fine
The task can be idle while waiting for other workers to start
It does not make sense to auto-scale the cluster since the size is fixed