DDP support for training loop #110

Open · wants to merge 7 commits into main
Conversation


ranlu commented Jan 16, 2025

Add a new parameter NUM_TRAINERS to specify the number of GPU instances. This only works with the DDP branch of DeepEM. gpu_ids matters for setting the number of processes on each instance, but it does not actually restrict which GPUs are used; if its length is larger than the actual number of GPUs, training will fail.
batch_size and num_workers are now the batch size and worker count per process/GPU, so you need to scale down the values from previous training scripts to match this change; a rough sketch of the scaling is shown below.
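For illustration only (the single-process values of 64 and 8 below are made up; only NUM_TRAINERS and the per-process batch_size/num_workers come from this PR):

```python
# Illustrative sketch: convert an old single-process config to per-process values.
old_batch_size = 64    # global batch size from a previous single-process training script
old_num_workers = 8    # dataloader workers from the previous script
num_trainers = 4       # NUM_TRAINERS: number of GPU processes in this run

# Each DDP process now runs its own dataloader, so divide the old values by the
# number of processes to keep the effective global batch size unchanged.
batch_size = old_batch_size // num_trainers    # 16 per process/GPU
num_workers = old_num_workers // num_trainers  # 2 per process/GPU
```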
I am not sure the current version is very compatible with spot-instance training. Elastic training cannot run entirely on spot instances and we are not using it now, so losing any instance will cause the entire training to fail. Resuming seems to require a new rdzv_id, which currently only happens via an update-parameter command.

ranlu added 4 commits January 16, 2025 16:09
Easier for the trainers to communicate with each other
This is for DDP: for some reason DDP prefers to use IPv6 to communicate, and
disabling IPv6 is simpler than extending the deployment to support it.
Use torchrun to set up the environment and launch the training script
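For context, a minimal sketch of what a torchrun-launched training script typically does (this is not the DeepEM code; RANK, LOCAL_RANK, and WORLD_SIZE are the environment variables torchrun sets for each process it spawns):

```python
# Minimal sketch of a DDP entry point launched by torchrun (illustrative, not DeepEM).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
    torch.cuda.set_device(local_rank)
    # With torchrun, init_process_group picks up rank/world size from the environment.
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(16, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... training loop: each process loads its own shard of the data ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A launch would then look something like `torchrun --nnodes=<NUM_TRAINERS> --nproc_per_node=<len(gpu_ids)> train.py`; the exact mapping of NUM_TRAINERS and gpu_ids to torchrun flags is an assumption here, not taken from this PR.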
ranlu requested a review from torms3 January 16, 2025 23:11
ranlu added 2 commits January 16, 2025 21:28
It does not necessarily run the rank 0 process, but should still be
fine
The task can be idle while waiting for other workers to start
It does not make sense to auto-scale the cluster since the size is fixed