This guide walks through the steps to submit an elastic training job with Horovod.

  1. Build an image for the training environment. You can use the registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 image directly, or build your own image by following the elastic-training-sample-image document.

  2. Submit an elastic training job. The example code comes from tensorflow2_mnist_elastic.py.

    arena submit etjob \
        --name=elastic-training \
        --gpus=1 \
        --workers=3 \
        --max-workers=9 \
        --min-workers=1 \
        --image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
        --working-dir=/examples \
        "horovodrun
        -np \$((\${workers}*\${gpus}))
        --min-np \$((\${minWorkers}*\${gpus}))
        --max-np \$((\${maxWorkers}*\${gpus}))
        --host-discovery-script /usr/local/bin/discover_hosts.sh
        python /examples/elastic/tensorflow2_mnist_elastic.py
        "

    Output:

    configmap/elastic-training-etjob created
    configmap/elastic-training-etjob labeled
    trainingjob.kai.alibabacloud.com/elastic-training created
    INFO[0000] The Job elastic-training has been submitted successfully
    INFO[0000] You can run `arena get elastic-training --type etjob` to check the job status
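
The `horovodrun` arguments in the submit command escape every `$`, so your local shell passes the arithmetic through unevaluated; it is presumably expanded later inside the job, where `workers`, `minWorkers`, `maxWorkers`, and `gpus` are expected to be set. As a rough sketch of what those expressions work out to, here is the same arithmetic with the values from this example filled in by hand (the variable assignments are an assumption made for illustration, not something arena prints):

```shell
#!/bin/sh
# Values taken from the flags in the submit command above.
workers=3
minWorkers=1
maxWorkers=9
gpus=1

# Same arithmetic as the escaped expressions passed to horovodrun.
np=$((workers * gpus))
min_np=$((minWorkers * gpus))
max_np=$((maxWorkers * gpus))

echo "-np ${np} --min-np ${min_np} --max-np ${max_np}"
```

With one GPU per worker, the process counts equal the worker counts, so the job starts with 3 processes and may range between 1 and 9.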
    
  3. List your job.

    arena list

    Output:

    NAME              STATUS   TRAINER  AGE  NODE
    elastic-training  RUNNING  ETJOB    52s  192.168.0.116
    
  4. Get your job details.

    arena get elastic-training

    Output:

    STATUS: RUNNING
    NAMESPACE: default
    PRIORITY: N/A
    TRAINING DURATION: 1m
    
    NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
    elastic-training  RUNNING  ETJOB    1m   elastic-training-launcher  192.168.0.116
    elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-0  192.168.0.114
    elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-1  192.168.0.116
    elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-2  192.168.0.116
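
If you want to check the worker count from a script rather than by eye, one option is to filter the instance table. The sketch below assumes the column layout shown above (NAME, STATUS, TRAINER, AGE, INSTANCE, NODE) and uses a hypothetical helper name, `count_running_workers`; the sample rows are copied from the output above so the snippet runs without a cluster.

```shell
#!/bin/sh
# Count worker instances reported as RUNNING in `arena get` output read from stdin.
# Assumes the column order: NAME STATUS TRAINER AGE INSTANCE NODE.
count_running_workers() {
    awk '$2 == "RUNNING" && $5 ~ /-worker-[0-9]+$/ { n++ } END { print n + 0 }'
}

# Sample rows copied from the output above. In practice you would pipe
# `arena get elastic-training` into the function instead.
sample='NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
elastic-training  RUNNING  ETJOB    1m   elastic-training-launcher  192.168.0.116
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-0  192.168.0.114
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-1  192.168.0.116
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-2  192.168.0.116'

running=$(printf '%s\n' "$sample" | count_running_workers)
echo "running workers: ${running}"
```

The launcher row is skipped because its instance name does not end in `-worker-<n>`, so only the three workers are counted.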
    
  5. Check logs.

    arena logs elastic-training --tail 10

    Output:

    Tue Sep  8 08:32:50 2020[1]<stdout>:Step #2170	Loss: 0.021992
    Tue Sep  8 08:32:50 2020[0]<stdout>:Step #2180	Loss: 0.000902
    Tue Sep  8 08:32:50 2020[1]<stdout>:Step #2180	Loss: 0.023190
    Tue Sep  8 08:32:50 2020[2]<stdout>:Step #2180	Loss: 0.013149
    Tue Sep  8 08:32:51 2020[0]<stdout>:Step #2190	Loss: 0.029536
    Tue Sep  8 08:32:51 2020[2]<stdout>:Step #2190	Loss: 0.017537
    Tue Sep  8 08:32:51 2020[1]<stdout>:Step #2190	Loss: 0.018273
    Tue Sep  8 08:32:51 2020[2]<stdout>:Step #2200	Loss: 0.038399
    Tue Sep  8 08:32:51 2020[0]<stdout>:Step #2200	Loss: 0.007017
    Tue Sep  8 08:32:51 2020[1]<stdout>:Step #2200	Loss: 0.017495
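
Each log line carries a bracketed number before `<stdout>`, which appears to be the Horovod rank of the process that printed it. Listing the distinct ranks in a log tail is a quick way to see how many processes are currently training. The following is a minimal sketch that assumes this prefix format; `ranks_in_logs` is a hypothetical helper name, and the sample lines are copied from the output above.

```shell
#!/bin/sh
# Print the distinct rank numbers found in log lines read from stdin.
# Assumes the "[rank]<stdout>:" prefix format shown in the output above.
ranks_in_logs() {
    sed -n 's/.*\[\([0-9][0-9]*\)]<stdout>:.*/\1/p' | sort -n -u | paste -s -d ' ' -
}

# Sample lines copied from the output above. In practice you would pipe
# `arena logs elastic-training --tail 10` into the function instead.
sample='Tue Sep  8 08:32:50 2020[1]<stdout>:Step #2170 Loss: 0.021992
Tue Sep  8 08:32:50 2020[0]<stdout>:Step #2180 Loss: 0.000902
Tue Sep  8 08:32:50 2020[1]<stdout>:Step #2180 Loss: 0.023190
Tue Sep  8 08:32:50 2020[2]<stdout>:Step #2180 Loss: 0.013149'

ranks=$(printf '%s\n' "$sample" | ranks_in_logs)
echo "ranks seen: ${ranks}"
```

Three distinct ranks (0, 1, 2) match the three workers with one GPU each.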
    
  6. Scale out your job. The following command adds one worker to the job.

    arena scaleout etjob --name="elastic-training" --count=1 --timeout=1m

    Output:

    configmap/elastic-training-1599548177-scaleout created
    configmap/elastic-training-1599548177-scaleout labeled
    scaleout.kai.alibabacloud.com/elastic-training-1599548177 created
    INFO[0000] The scaleout job elastic-training-1599548177 has been submitted successfully
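
Before scaling, it can be useful to confirm that the resulting worker count stays between the `--min-workers` and `--max-workers` values given at submit time. This guide does not say how arena handles a request outside that range, so the sketch below is only a local sanity check; `within_bounds` is a hypothetical helper, not part of arena.

```shell
#!/bin/sh
# Return success if changing the worker count by `delta` keeps it within [min, max].
# Arguments: current delta min max (delta is negative for a scale-in).
within_bounds() {
    target=$(($1 + $2))
    [ "$target" -ge "$3" ] && [ "$target" -le "$4" ]
}

# Values from this walkthrough: 3 workers, --min-workers=1, --max-workers=9.
if within_bounds 3 1 1 9; then
    echo "scaling out by 1 gives 4 workers, within 1..9"
fi
if ! within_bounds 3 7 1 9; then
    echo "scaling out by 7 would give 10 workers, above the maximum of 9"
fi
```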
    
  7. Get your job details. The new worker (elastic-training-worker-3) is now in the RUNNING state.

    arena get elastic-training

    Output:

    STATUS: RUNNING
    NAMESPACE: default
    PRIORITY: N/A
    TRAINING DURATION: 2m
    
    NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
    elastic-training  RUNNING  ETJOB    2m   elastic-training-launcher  192.168.0.116
    elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-0  192.168.0.114
    elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-1  192.168.0.116
    elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-2  192.168.0.116
    elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-3  192.168.0.117
    
  8. Check logs.

    arena logs elastic-training --tail 10

    Output:

    Tue Sep  8 08:33:33 2020[1]<stdout>:Step #3140	Loss: 0.014412
    Tue Sep  8 08:33:33 2020[0]<stdout>:Step #3140	Loss: 0.004425
    Tue Sep  8 08:33:33 2020[3]<stdout>:Step #3150	Loss: 0.000513
    Tue Sep  8 08:33:33 2020[2]<stdout>:Step #3150	Loss: 0.062282
    Tue Sep  8 08:33:33 2020[1]<stdout>:Step #3150	Loss: 0.020650
    Tue Sep  8 08:33:33 2020[0]<stdout>:Step #3150	Loss: 0.008056
    Tue Sep  8 08:33:34 2020[3]<stdout>:Step #3160	Loss: 0.002170
    Tue Sep  8 08:33:34 2020[2]<stdout>:Step #3160	Loss: 0.009676
    Tue Sep  8 08:33:34 2020[1]<stdout>:Step #3160	Loss: 0.051425
    Tue Sep  8 08:33:34 2020[0]<stdout>:Step #3160	Loss: 0.023769
    
  9. Scale in your job. The following command removes one worker from the job.

    arena scalein etjob --name="elastic-training" --count=1 --timeout=1m

    Output:

    configmap/elastic-training-1599554041-scalein created
    configmap/elastic-training-1599554041-scalein labeled
    scalein.kai.alibabacloud.com/elastic-training-1599554041 created
    INFO[0000] The scalein job elastic-training-1599554041 has been submitted successfully
    
  10. Get your job details. We can see that elastic-training-worker-3 has been removed.

    arena get elastic-training

    Output:

    STATUS: RUNNING
    NAMESPACE: default
    PRIORITY: N/A
    TRAINING DURATION: 3m
    
    NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
    elastic-training  RUNNING  ETJOB    3m   elastic-training-launcher  192.168.0.116
    elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-0  192.168.0.114
    elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-1  192.168.0.116
    elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-2  192.168.0.116
    
  11. Check logs.

    arena logs elastic-training --tail 10

    Output:

    Tue Sep  8 08:34:43 2020[0]<stdout>:Step #5210	Loss: 0.005627
    Tue Sep  8 08:34:43 2020[2]<stdout>:Step #5220	Loss: 0.002142
    Tue Sep  8 08:34:43 2020[1]<stdout>:Step #5220	Loss: 0.002978
    Tue Sep  8 08:34:43 2020[0]<stdout>:Step #5220	Loss: 0.011404
    Tue Sep  8 08:34:44 2020[2]<stdout>:Step #5230	Loss: 0.000689
    Tue Sep  8 08:34:44 2020[1]<stdout>:Step #5230	Loss: 0.024597
    Tue Sep  8 08:34:44 2020[0]<stdout>:Step #5230	Loss: 0.040936
    Tue Sep  8 08:34:44 2020[0]<stdout>:Step #5240	Loss: 0.000125
    Tue Sep  8 08:34:44 2020[2]<stdout>:Step #5240	Loss: 0.026498
    Tue Sep  8 08:34:44 2020[1]<stdout>:Step #5240	Loss: 0.000308