This guide walks through the steps to submit an elastic training job with Horovod.
-
Build the image for the training environment. You can use the registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 image directly, or build your own image by following the elastic-training-sample-image document.
-
Submit an elastic training job. The example code is from tensorflow2_mnist_elastic.py.
arena submit etjob \
    --name=elastic-training \
    --gpus=1 \
    --workers=3 \
    --max-workers=9 \
    --min-workers=1 \
    --image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
    --working-dir=/examples \
    "horovodrun -np \$((\${workers}*\${gpus})) --min-np \$((\${minWorkers}*\${gpus})) --max-np \$((\${maxWorkers}*\${gpus})) --host-discovery-script /usr/local/bin/discover_hosts.sh python /examples/elastic/tensorflow2_mnist_elastic.py "
Output:
configmap/elastic-training-etjob created
configmap/elastic-training-etjob labeled
trainingjob.kai.alibabacloud.com/elastic-training created
INFO[0000] The Job elastic-training has been submitted successfully
INFO[0000] You can run `arena get elastic-training --type etjob` to check the job status
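The escaped `\$((...))` expressions in the submit command are not expanded by your local shell; they are left for the shell that eventually runs `horovodrun`. The sketch below evaluates the same arithmetic locally. It assumes that `workers`, `minWorkers`, `maxWorkers`, and `gpus` take the values of the corresponding flags, which the guide does not state explicitly.

```shell
#!/bin/sh
# Assumed values, copied from the flags in the submit command.
workers=3      # --workers
minWorkers=1   # --min-workers
maxWorkers=9   # --max-workers
gpus=1         # --gpus (GPUs per worker)

# Same arithmetic as in the quoted horovodrun command:
# one Horovod process per GPU on every worker.
np=$((workers * gpus))
min_np=$((minWorkers * gpus))
max_np=$((maxWorkers * gpus))

echo "horovodrun -np ${np} --min-np ${min_np} --max-np ${max_np}"
```

With these values the job starts with 3 processes and may shrink to 1 or grow to 9, which is the range the later scale-out and scale-in steps operate in.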
-
List your job.
arena list
Output:
NAME              STATUS   TRAINER  AGE  NODE
elastic-training  RUNNING  ETJOB    52s  192.168.0.116
-
Get your job details.
arena get elastic-training
Output:
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
elastic-training  RUNNING  ETJOB    1m   elastic-training-launcher  192.168.0.116
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-0  192.168.0.114
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-1  192.168.0.116
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-2  192.168.0.116
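If you want to check the worker count from a script rather than by eye, one option is to filter the `arena get` output. The sketch below assumes the output has one row per instance with the columns NAME, STATUS, TRAINER, AGE, INSTANCE, NODE, as in the sample above; `count_running_workers` is a hypothetical helper, not an Arena command.

```shell
#!/bin/sh
# Hypothetical helper: reads `arena get` output on stdin and prints the number
# of rows whose STATUS (column 2) is RUNNING and whose INSTANCE (column 5)
# is a worker. The launcher row and the header lines are not counted.
count_running_workers() {
    awk '$2 == "RUNNING" && $5 ~ /-worker-/ { n++ } END { print n + 0 }'
}

# Sample rows taken from the output shown in this guide.
sample='STATUS: RUNNING
NAMESPACE: default
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training RUNNING ETJOB 1m elastic-training-launcher 192.168.0.116
elastic-training RUNNING ETJOB 1m elastic-training-worker-0 192.168.0.114
elastic-training RUNNING ETJOB 1m elastic-training-worker-1 192.168.0.116
elastic-training RUNNING ETJOB 1m elastic-training-worker-2 192.168.0.116'

running=$(printf '%s\n' "$sample" | count_running_workers)
echo "running workers: $running"
```

Against a live cluster the same helper could be fed directly, for example `arena get elastic-training | count_running_workers`.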
-
Check logs.
arena logs elastic-training --tail 10
Output:
Tue Sep 8 08:32:50 2020[1]<stdout>:Step #2170 Loss: 0.021992
Tue Sep 8 08:32:50 2020[0]<stdout>:Step #2180 Loss: 0.000902
Tue Sep 8 08:32:50 2020[1]<stdout>:Step #2180 Loss: 0.023190
Tue Sep 8 08:32:50 2020[2]<stdout>:Step #2180 Loss: 0.013149
Tue Sep 8 08:32:51 2020[0]<stdout>:Step #2190 Loss: 0.029536
Tue Sep 8 08:32:51 2020[2]<stdout>:Step #2190 Loss: 0.017537
Tue Sep 8 08:32:51 2020[1]<stdout>:Step #2190 Loss: 0.018273
Tue Sep 8 08:32:51 2020[2]<stdout>:Step #2200 Loss: 0.038399
Tue Sep 8 08:32:51 2020[0]<stdout>:Step #2200 Loss: 0.007017
Tue Sep 8 08:32:51 2020[1]<stdout>:Step #2200 Loss: 0.017495
-
Scale out your job. The following command adds one worker to the job.
arena scaleout etjob --name="elastic-training" --count=1 --timeout=1m
Output:
configmap/elastic-training-1599548177-scaleout created
configmap/elastic-training-1599548177-scaleout labeled
scaleout.kai.alibabacloud.com/elastic-training-1599548177 created
INFO[0000] The scaleout job elastic-training-1599548177 has been submitted successfully
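The scale-out request is submitted asynchronously, so the new worker may take a moment to appear. A script could poll `arena get` until the expected number of workers is RUNNING. The following is a minimal sketch: `wait_for_workers` is a hypothetical helper, and the column positions are assumed from the sample `arena get` output in this guide.

```shell
#!/bin/sh
# Hypothetical helper: poll `arena get <job>` until exactly <expected> worker
# rows are RUNNING, or give up after <attempts> polls.
wait_for_workers() {
    job=$1
    expected=$2
    attempts=${3:-30}   # number of polls before giving up
    interval=${4:-2}    # seconds to sleep between polls
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Count rows with STATUS RUNNING (column 2) and a worker INSTANCE (column 5).
        running=$(arena get "$job" |
            awk '$2 == "RUNNING" && $5 ~ /-worker-/ { n++ } END { print n + 0 }')
        if [ "$running" -eq "$expected" ]; then
            echo "$job has $running running workers"
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "timed out waiting for $expected workers on $job" >&2
    return 1
}
```

After the scale-out above, a call such as `wait_for_workers elastic-training 4` would return once the fourth worker is RUNNING; after a scale-in, the same helper can wait for the count to drop back to 3.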
-
Get your job details. The new worker (elastic-training-worker-3) is now in the RUNNING state.
arena get elastic-training
Output:
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m

NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
elastic-training  RUNNING  ETJOB    2m   elastic-training-launcher  192.168.0.116
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-0  192.168.0.114
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-1  192.168.0.116
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-2  192.168.0.116
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-3  192.168.0.117
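Horovod picks up added or removed workers through the script passed to `--host-discovery-script` in the submit command. Horovod runs that script repeatedly and expects one `<hostname>:<slots>` line per available host. The contents of `/usr/local/bin/discover_hosts.sh` in the sample image are not shown in this guide, so the following is only an illustrative stand-in; `print_hosts`, the hosts file, and the use of instance names as hostnames are all assumptions.

```shell
#!/bin/sh
# Hypothetical stand-in for a host discovery script. It prints one
# "<hostname>:<slots>" line per worker, the format Horovod expects.
# print_hosts and its arguments are illustrative, not part of Arena or Horovod.
print_hosts() {
    list_file=$1   # file with one worker hostname per line (assumed layout)
    slots=$2       # slots (GPUs) per worker, 1 in this guide
    while IFS= read -r host || [ -n "$host" ]; do
        if [ -n "$host" ]; then
            echo "${host}:${slots}"
        fi
    done < "$list_file"
}

# Demo with the four workers listed after the scale-out.
demo_file=$(mktemp)
printf '%s\n' \
    elastic-training-worker-0 \
    elastic-training-worker-1 \
    elastic-training-worker-2 \
    elastic-training-worker-3 > "$demo_file"
discovered=$(print_hosts "$demo_file" 1)
rm -f "$demo_file"
echo "$discovered"
```

When the set of lines printed by the discovery script changes, Horovod adjusts the running job within the `--min-np` and `--max-np` bounds.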
-
Check logs.
arena logs elastic-training --tail 10
Output:
Tue Sep 8 08:33:33 2020[1]<stdout>:Step #3140 Loss: 0.014412
Tue Sep 8 08:33:33 2020[0]<stdout>:Step #3140 Loss: 0.004425
Tue Sep 8 08:33:33 2020[3]<stdout>:Step #3150 Loss: 0.000513
Tue Sep 8 08:33:33 2020[2]<stdout>:Step #3150 Loss: 0.062282
Tue Sep 8 08:33:33 2020[1]<stdout>:Step #3150 Loss: 0.020650
Tue Sep 8 08:33:33 2020[0]<stdout>:Step #3150 Loss: 0.008056
Tue Sep 8 08:33:34 2020[3]<stdout>:Step #3160 Loss: 0.002170
Tue Sep 8 08:33:34 2020[2]<stdout>:Step #3160 Loss: 0.009676
Tue Sep 8 08:33:34 2020[1]<stdout>:Step #3160 Loss: 0.051425
Tue Sep 8 08:33:34 2020[0]<stdout>:Step #3160 Loss: 0.023769
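The bracketed number before `<stdout>` in each log line appears to identify the training process that wrote it: the logs show `[0]` to `[2]` before the scale-out and `[0]` to `[3]` after it. Counting the distinct tags is therefore one way to confirm how many processes are training. The sketch below assumes that log format; `count_ranks` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: reads log lines on stdin, extracts the "[N]" tag that
# precedes "<stdout>", and prints the number of distinct tags seen.
count_ranks() {
    sed -n 's/.*\[\([0-9][0-9]*\)\]<stdout>.*/\1/p' | sort -u | wc -l | tr -d ' '
}

# Sample lines taken from the log output after the scale-out.
sample='Tue Sep 8 08:33:33 2020[1]<stdout>:Step #3140 Loss: 0.014412
Tue Sep 8 08:33:33 2020[0]<stdout>:Step #3140 Loss: 0.004425
Tue Sep 8 08:33:33 2020[3]<stdout>:Step #3150 Loss: 0.000513
Tue Sep 8 08:33:33 2020[2]<stdout>:Step #3150 Loss: 0.062282
Tue Sep 8 08:33:33 2020[1]<stdout>:Step #3150 Loss: 0.020650'

ranks=$(printf '%s\n' "$sample" | count_ranks)
echo "distinct processes in log: $ranks"
```

Against a live job this could be fed from the logs command, for example `arena logs elastic-training --tail 10 | count_ranks`. A short tail may miss a process that has not logged recently, so a larger `--tail` value gives a more reliable count.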
-
Scale in your job. The following command removes one worker from the job.
arena scalein etjob --name="elastic-training" --count=1 --timeout=1m
Output:
configmap/elastic-training-1599554041-scalein created
configmap/elastic-training-1599554041-scalein labeled
scalein.kai.alibabacloud.com/elastic-training-1599554041 created
INFO[0000] The scalein job elastic-training-1599554041 has been submitted successfully
-
Get your job details. We can see that elastic-training-worker-3 has been removed.
arena get elastic-training
Output:
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 3m

NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
elastic-training  RUNNING  ETJOB    3m   elastic-training-launcher  192.168.0.116
elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-0  192.168.0.114
elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-1  192.168.0.116
elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-2  192.168.0.116
-
Check logs.
arena logs elastic-training --tail 10
Output:
Tue Sep 8 08:34:43 2020[0]<stdout>:Step #5210 Loss: 0.005627
Tue Sep 8 08:34:43 2020[2]<stdout>:Step #5220 Loss: 0.002142
Tue Sep 8 08:34:43 2020[1]<stdout>:Step #5220 Loss: 0.002978
Tue Sep 8 08:34:43 2020[0]<stdout>:Step #5220 Loss: 0.011404
Tue Sep 8 08:34:44 2020[2]<stdout>:Step #5230 Loss: 0.000689
Tue Sep 8 08:34:44 2020[1]<stdout>:Step #5230 Loss: 0.024597
Tue Sep 8 08:34:44 2020[0]<stdout>:Step #5230 Loss: 0.040936
Tue Sep 8 08:34:44 2020[0]<stdout>:Step #5240 Loss: 0.000125
Tue Sep 8 08:34:44 2020[2]<stdout>:Step #5240 Loss: 0.026498
Tue Sep 8 08:34:44 2020[1]<stdout>:Step #5240 Loss: 0.000308