NOTE: This tutorial requires TPU v4 quota. You can request this quota using TPUv4.
This page demonstrates how to run a TPU job, such as the MaxText performance benchmark test, on a Slurm cluster with a TPU partition using the hpc-slurm6-tpu-maxtext.yaml blueprint.
To deploy and run this blueprint, you first need to download the dataset to your Cloud Storage (GCS) bucket. Follow the steps described in download dataset to do so. After that, update the blueprint so that the training script uses the dataset from your GCS bucket.
./gcluster create community/examples/hpc-slurm6-tpu-maxtext.yaml --vars project_id=<project-id>
./gcluster deploy slurm6-tpu-v4 --auto-approve
This deploys a Slurm cluster with a TPU partition and a dynamic compute partition. The MaxText benchmark test script is stored in the /opt/apps/scripts/tpu-test directory.
Once the startup script has completed, connect to the login node.
Use the following command to SSH into the login node from Cloud Shell:
gcloud compute ssh slurm6tpuv-login-v6tpu-001 --zone us-central2-b --project <project-id>
You may be prompted to set up SSH. If so, follow the prompts, and if asked for a password, just press [enter], leaving the input blank.
Create a maxtext directory in your home directory and run the MaxText script from it.
mkdir maxtext && cd maxtext
sbatch /opt/apps/scripts/tpu-test/run_maxtext.sh
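If you want to keep track of the job ID, for example to locate the matching slurm-<job-id>.out file later, it can be captured when the job is submitted. The following is a minimal sketch, assuming the usual sbatch message format ("Submitted batch job <id>") or the output of the sbatch `--parsable` flag. The helper name `extract_job_id` is hypothetical and not part of the blueprint.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull the numeric job ID out of sbatch output.
# Handles the default message ("Submitted batch job 42") and the
# --parsable form ("42" or "42;cluster").
extract_job_id() {
  local line="$1"
  line="${line##* }"   # keep only the last space-separated word
  line="${line%%;*}"   # drop an optional ";cluster" suffix
  if [[ "$line" =~ ^[0-9]+$ ]]; then
    printf '%s\n' "$line"
  else
    return 1           # no job ID found, e.g. sbatch printed an error
  fi
}

# Usage on the login node (not executed here):
#   job_id=$(extract_job_id "$(sbatch /opt/apps/scripts/tpu-test/run_maxtext.sh)")
#   echo "Submitted job ${job_id}"
```

Returning a non-zero status when no ID is found lets a calling script stop early instead of continuing with an empty job ID.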
The sbatch command submits a batch script to the tpu partition, which makes Slurm auto-scale up nodes to run the job.
You can refresh the TPU instances page to see that the TPU is being created or has been created.
The TPU nodes will be named something like slurm6tpuv-tpunodeset-0.
When running squeue, observe the job status start as CF (configuring), change to R (running) once the compute VMs have been created, and finally CG (completing) when the job has finished and the nodes are spinning down.
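Rather than rerunning squeue by hand, the state transitions can be followed with a small polling loop. This is only a sketch: the function name `watch_job_state` and the 10-second default interval are assumptions, and it relies on the squeue options `-h` (no header), `-j` (select a job) and `-o %t` (compact state code).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each state change of a job until it leaves the queue.
# Arguments: <job-id> [poll interval in seconds, default 10]
watch_job_state() {
  local job_id="$1" interval="${2:-10}" state last=""
  while true; do
    # Empty output, or an error for a job ID Slurm no longer tracks,
    # is treated as "the job has left the queue".
    state="$(squeue -h -j "$job_id" -o %t 2>/dev/null)" || state=""
    [[ -z "$state" ]] && break
    if [[ "$state" != "$last" ]]; then
      echo "job ${job_id}: ${state}"
      last="$state"
    fi
    sleep "$interval"
  done
  echo "job ${job_id}: no longer in queue"
}

# Usage on the login node (not executed here):
#   watch_job_state 1
```

Only changes are printed, so a long-running job produces one line per state rather than one line per poll.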
The ${HOME}/maxtext/output directory will contain several generated files and directories.
The slurm-<job-id>.out file contains the standard output of the TPU job.
cat slurm-1.out
The output should contain lines like:
completed step: 23, seconds: 4.911, TFLOP/s/device: 34.760, loss: 12.192
completed step: 24, seconds: 4.908, TFLOP/s/device: 34.781, loss: 12.173
The job runs for the number of steps that was provided.
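To get a quick overview of a run, the per-step lines can be aggregated with awk. The sketch below assumes the exact line layout shown in the sample output ("completed step: ..., seconds: ..., TFLOP/s/device: ..., loss: ..."); the function name `summarize_steps` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize "completed step" lines from a Slurm output file.
# Prints the step count, average seconds per step, average TFLOP/s/device,
# and the loss of the last completed step.
summarize_steps() {
  awk '
    /completed step:/ {
      for (i = 1; i < NF; i++) {
        v = $(i + 1)
        sub(/,$/, "", v)                  # strip the trailing comma
        if ($i == "seconds:") secs += v
        if ($i == "TFLOP/s/device:") tflops += v
        if ($i == "loss:") loss = v
      }
      n++
    }
    END {
      if (n == 0) { print "no completed steps found"; exit 1 }
      printf "steps=%d avg_seconds=%.2f avg_tflops_per_device=%.2f last_loss=%s\n",
             n, secs / n, tflops / n, loss
    }
  ' "$1"
}

# Usage (not executed here):
#   summarize_steps slurm-1.out
```

Matching on the field labels rather than on fixed column positions keeps the sketch working if extra fields are appended to each line, although a different label spelling would need a matching change.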
To avoid incurring ongoing charges, destroy the cluster when you are done.
First, return to the Cloud Shell terminal: run exit to close the SSH connection to the login node.
Then run the following command in the Cloud Shell terminal to destroy the cluster:
./gcluster destroy slurm6-tpu-v4 --auto-approve
When complete you should see something like:
Destroy complete! Resources: xx destroyed.
NOTE: If destroy is run before Slurm has shut down the auto-scaled nodes, those nodes will be left behind and destroy may fail. In this case, delete the VMs manually and rerun the destroy command above.
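One way to check for leftover auto-scaled resources is to list TPU nodes and VMs whose names contain the deployment prefix. This is a sketch under stated assumptions: the helper name `list_leftovers` is hypothetical, the default prefix slurm6tpuv and zone us-central2-b are taken from the examples above and may differ in your deployment, and gcloud must be installed and authenticated.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list TPU nodes and VMs whose names contain a prefix.
# Arguments: <project-id> [zone, default us-central2-b] [prefix, default slurm6tpuv]
list_leftovers() {
  local project="$1" zone="${2:-us-central2-b}" prefix="${3:-slurm6tpuv}"
  {
    # TPU VMs in the given zone
    gcloud compute tpus tpu-vm list --project "$project" --zone "$zone" --format "value(name)"
    # Compute Engine VMs in the project (all zones)
    gcloud compute instances list --project "$project" --format "value(name)"
  } | grep -- "$prefix" || true   # no match is not an error
}

# Usage from Cloud Shell (not executed here):
#   list_leftovers <project-id>
```

If the function prints nothing, no resources with that prefix were found, and the destroy command can be rerun.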