diff --git a/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/README.md b/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/README.md
new file mode 100644
index 0000000000..7f0c062080
--- /dev/null
+++ b/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/README.md
@@ -0,0 +1,153 @@
+# A3-Ultra Slurm + Ubuntu + GCS
+
+This reference design creates a Slurm cluster with the following characteristics:
+
+1. Ubuntu 22 Operating System
+1. A static a3-ultragpu-8g partition that uses a reservation.
+1. 3 VPCs (2x CPU, 1x for GPU RDMA networks), with a total of 9 subnetworks
+1. A GCS bucket that is configured with Hierarchical Namespace enabled
+1. Cloud Storage Fuse, configured to utilize Local-SSD storage
+
+## Deployment Instructions
+
+### Build the Cluster Toolkit gcluster binary
+
+Follow the instructions
+[here](https://cloud.google.com/cluster-toolkit/docs/setup/configure-environment).
+
+### (Optional, but recommended) Create a GCS Bucket for storing Terraform state
+
+```bash
+#!/bin/bash
+
+TF_STATE_BUCKET_NAME=
+PROJECT_ID=
+REGION=
+
+gcloud storage buckets create gs://${TF_STATE_BUCKET_NAME} \
+  --project=${PROJECT_ID} \
+  --default-storage-class=STANDARD --location=${REGION} \
+  --uniform-bucket-level-access
+gcloud storage buckets update gs://${TF_STATE_BUCKET_NAME} --versioning
+```
+
+### Create and configure a GCS Bucket
+
+This bucket will be used for input data and checkpoint/restart data, and it
+should be created with Hierarchical Namespace enabled. See
+[here](https://cloud.google.com/storage/docs/hns-overview) for more details.
+
+```bash
+#!/bin/bash
+PROJECT_ID=
+REGION=
+HNS_BUCKET_NAME=
+
+gcloud storage buckets create gs://${HNS_BUCKET_NAME} \
+  --project=${PROJECT_ID} \
+  --location=${REGION} --uniform-bucket-level-access \
+  --enable-hierarchical-namespace
+
+```
+
+### Create/modify the deployment.yaml file with your preferred configuration
+
+For example, set the cluster size and the reservation to be used, as well as
+the name of the bucket that you just created. Below is an example:
+
+```yaml
+---
+terraform_backend_defaults:
+  type: gcs
+  configuration:
+    bucket: TF_STATE_BUCKET_NAME
+
+vars:
+  deployment_name: a3u-gcs
+  project_id:
+  region:
+  zone:
+  a3u_reservation_name:
+  a3u_cluster_size:
+  hns_gcs_bucket: # This bucket must have been previously created
+
+```
+
+### Deploy the cluster
+
+```bash
+#!/bin/bash
+gcluster deploy -d deployment.yaml a3u-slurm-ubuntu-gcs.yaml
+```
+
+## Storage Design Components
+
+On the login and controller nodes, the GCS bucket is mounted at `/gcs` using a
+fairly standard [Cloud Storage Fuse configuration](https://cloud.google.com/storage/docs/cloud-storage-fuse/config-file).
+On the compute nodes, there are two mounts of the same bucket. First, `/gcs` is
+mounted with the following configuration:
+
+```yaml
+file-cache:
+  max-size-mb: -1
+  enable-parallel-downloads: true
+  download-chunk-size-mb: 50
+  parallel-downloads-per-file: 16
+cache-dir: /mnt/localssd
+file-system:
+  dir-mode: "777"
+  file-mode: "777"
+  rename-dir-limit: 20000 # Set to 20000 for hierarchical buckets
+  temp-dir: /mnt/localssd
+  fuse-options: allow_other
+foreground: true
+```
+
+This uses /mnt/localssd as the cache dir (for reads) and temp-dir (for writes).
+It also enables parallel downloads, which is particularly useful for
+checkpoint restarts.
+
+Next, `/gcs-ro` is mounted in a "read-only" mode and optimized for reading
+input (training) data:
+
+```yaml
+file-cache:
+  max-size-mb: -1
+metadata-cache:
+  ttl-secs: 3600 # Decrease if your data changes quickly.
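+  # Note: in Cloud Storage FUSE, this metadata-cache TTL controls how long
+  # stat and type (metadata) entries are cached for the read-only mount.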
+cache-dir: /mnt/localssd
+file-system:
+  dir-mode: "755" # need 5 on dir to enable ls
+  file-mode: "644"
+  temp-dir: /mnt/localssd
+  fuse-options: allow_other
+  kernel-list-cache-ttl-secs: 60
+foreground: true
+```
+
+The local SSDs are used as a file cache, the metadata cache for the data is set
+to 1 hour, and the kernel list-cache TTL is set to 60 seconds. This reduces the
+number of requests sent to GCS and improves data-loading performance.
+
+We suggest using /gcs for checkpoint saving/loading and /gcs-ro for input data
+loading.
+
+## Running Benchmarks with Ramble
+
+To run a series of NCCL test benchmarks on your cluster, you can use the
+script `run-nccl-tests-via-ramble.sh`, which uses
+[ramble](https://github.com/GoogleCloudPlatform/ramble) to automate building
+and running NCCL tests at scales from 2 nodes up to 32 nodes.
+
+Copy the contents of `run-nccl-tests-via-ramble.sh` to your Slurm login or
+controller node, for example:
+
+```bash
+#!/bin/bash
+wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/develop/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/run-nccl-tests-via-ramble.sh
+```
+
+and then launch with `bash run-nccl-tests-via-ramble.sh`. The entire process
+will take ~30 minutes.
diff --git a/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/a3u-slurm-ubuntu-gcs.yaml b/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/a3u-slurm-ubuntu-gcs.yaml
new file mode 100644
index 0000000000..7be9f89a00
--- /dev/null
+++ b/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/a3u-slurm-ubuntu-gcs.yaml
@@ -0,0 +1,615 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+---
+
+blueprint_name: a3u-slurm-ubuntu-gcs
+
+vars:
+  # The following are supplied through the deployment.yaml file.
+  deployment_name: # supply deployment name
+  project_id: # supply project ID
+  region: # supply region
+  zone: # supply zone
+  a3u_cluster_size: # supply cluster size
+  a3u_reservation_name: # supply reservation name
+  hns_gcs_bucket: # Name of HNS enabled GCS bucket
+  # End of variables defined by deployment.yaml. The remainder
+  # of this blueprint need not be modified.
+ + # Image settings + base_image: + project: ubuntu-os-accelerator-images + family: ubuntu-accelerator-2204-amd64-with-nvidia-550 + image_build_machine_type: n2-standard-16 + build_slurm_from_git_ref: 6.8.6 + + # Cluster env settings + # net0 and filestore ranges must not overlap + net0_range: 192.168.0.0/19 + filestore_ip_range: 192.168.32.0/24 + net1_range: 192.168.64.0/18 + rdma_net_range: 192.168.128.0/18 + + # Cluster Settings + local_ssd_mountpoint: /mnt/localssd + instance_image: + project: $(vars.project_id) + family: $(vars.deployment_name)-u22 + disk_size_gb: 200 + nccl_plugin_version: v1.0.2 + + # Here we define a set of startup script runners that are used to configure + # the controller node + controller_runners: + - type: shell + destination: stage_scripts.sh + content: | + #!/bin/bash + SLURM_ROOT=/opt/apps/adm/slurm + PARTITION_NAME=a3ultra + mkdir -m 0755 -p "${SLURM_ROOT}/scripts" + mkdir -p "${SLURM_ROOT}/partition-${PARTITION_NAME}-epilog_slurmd.d" + ln -s "/slurm/scripts/tools/gpu-test" "${SLURM_ROOT}/partition-${PARTITION_NAME}-epilog_slurmd.d/gpu-test.epilog_slurmd" + + # Shared runners between login and controller: + # Configure an enroot config path + shared_runners: + - type: data + destination: /etc/enroot/enroot.conf + content: | + ENROOT_CONFIG_PATH ${HOME}/.enroot + + # Here we define a set of startup script runners that are used to configure + # the A3-Ultra nodes + # Set up enroot, using the local ssds for runtime/cache/data/temp storage. + a3u_runners: + - type: data + destination: /etc/enroot/enroot.conf + content: | + ENROOT_CONFIG_PATH ${HOME}/.enroot + ENROOT_RUNTIME_PATH $(vars.local_ssd_mountpoint)/${UID}/enroot/runtime + ENROOT_CACHE_PATH $(vars.local_ssd_mountpoint)/${UID}/enroot/cache + ENROOT_DATA_PATH $(vars.local_ssd_mountpoint)/${UID}/enroot/data + ENROOT_TEMP_PATH $(vars.local_ssd_mountpoint)/${UID}/enroot + + # Install NCCL Network Plugin + - type: ansible-local + destination: nccl_plugin.yml + content: | + --- + - name: Install NCCL plugin for A3 Ultra series + hosts: all + become: true + tasks: + - name: Add SystemD unit for NCCL plugin installation + ansible.builtin.copy: + dest: /etc/systemd/system/nccl-plugin@.service + mode: 0o0644 + content: | + [Unit] + After=network-online.target + Before=slurmd.service + + [Service] + Type=oneshot + ExecStartPre=/usr/bin/rm -rf /usr/local/gib + ExecStartPre=/usr/bin/mkdir -p /usr/local/gib + ExecStartPre=/snap/bin/gcloud auth configure-docker --quiet us-docker.pkg.dev + ExecStart=/usr/bin/docker run --rm --name nccl-gib-installer --volume /usr/local/gib:/var/lib/gib \ + us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:%i install --install-nccl + + [Install] + WantedBy=slurmd.service + notify: + - Reload SystemD + handlers: + - name: Reload SystemD + ansible.builtin.systemd: + daemon_reload: true + post_tasks: + - name: Enable NCCL plugin SystemD unit + ansible.builtin.service: + name: nccl-plugin@$(vars.nccl_plugin_version).service + state: started + enabled: true + + # Configure Cloud Storage FUSE + - type: ansible-local + destination: gcsfuse.yml + content: | + --- + - name: Create LSSD optimized gcsfuse mount + hosts: all + become: true + tasks: + - name: Create gcsfuse rwx configuration + ansible.builtin.copy: + dest: /etc/gcsfuse-lssd.yml + owner: root + group: root + mode: 0o644 + content: | + file-cache: + max-size-mb: -1 + enable-parallel-downloads: true + download-chunk-size-mb: 50 + parallel-downloads-per-file: 16 + cache-dir: /mnt/localssd + file-system: + dir-mode: "777" + 
file-mode: "777" + rename-dir-limit: 20000 # Set to 20000 for hierarchical buckets + temp-dir: /mnt/localssd + fuse-options: allow_other + foreground: true + + - name: Create gcsfuse read-only configuration for input data + ansible.builtin.copy: + dest: /etc/gcsfuse-ro.yml + owner: root + group: root + mode: 0o644 + content: | + file-cache: + max-size-mb: -1 + metadata-cache: + ttl-secs: 3600 # Decrease if your data changes quickly. + cache-dir: /mnt/localssd + file-system: + dir-mode: "755" # need 5 on dir to enable ls + file-mode: "644" + temp-dir: /mnt/localssd + fuse-options: allow_other + kernel-list-cache-ttl-secs: 60 + foreground: true + + - name: Create gcsfuse systemd service + ansible.builtin.copy: + dest: /etc/systemd/system/gcsfuse-lssd.service + owner: root + group: root + mode: 0o644 + content: | + [Unit] + Description=gcsfuse mount of all buckets + After=local-fs.target + + [Service] + Type=simple + User=root + ExecStartPre=/bin/mkdir -p /gcs + ExecStart=gcsfuse --config-file /etc/gcsfuse-lssd.yml $(vars.hns_gcs_bucket) /gcs + ExecStop=fusermount3 -u /gcs + + [Install] + WantedBy=slurmd.service multi-user.target + + - name: Create read-only gcsfuse systemd service + ansible.builtin.copy: + dest: /etc/systemd/system/gcsfuse-ro.service + owner: root + group: root + mode: 0o644 + content: | + [Unit] + Description=gcsfuse-ro mount + After=local-fs.target + + [Service] + Type=simple + User=root + ExecStartPre=/bin/mkdir -p /gcs-ro + ExecStart=gcsfuse --config-file /etc/gcsfuse-ro.yml $(vars.hns_gcs_bucket) /gcs-ro + ExecStop=fusermount3 -u /gcs-ro + + [Install] + WantedBy=slurmd.service multi-user.target + + post_tasks: + - name: Enable and restart gcsfuse + ansible.builtin.service: + name: gcsfuse-lssd.service + state: restarted + enabled: true + + - name: Enable and restart gcsfuse-ro + ansible.builtin.service: + name: gcsfuse-ro.service + state: restarted + enabled: true + + # Configure Cloud Storage FUSE for login/controller nodes + gcsfuse_runners: + - type: ansible-local + destination: gcsfuse.yml + content: | + --- + - name: Create Standard RWX gcsfuse mount + hosts: localhost + become: true + tasks: + - name: Create gcsfuse configuration + ansible.builtin.copy: + dest: /etc/gcsfuse.yml + owner: root + group: root + mode: 0o644 + content: | + file-system: + dir-mode: "777" + file-mode: "777" + rename-dir-limit: 20000 + fuse-options: allow_other + foreground: true + + - name: Create gcsfuse systemd service + ansible.builtin.copy: + dest: /etc/systemd/system/gcsfuse.service + owner: root + group: root + mode: 0o644 + content: | + [Unit] + Description=gcsfuse mount of all buckets + After=local-fs.target + + [Service] + Type=simple + User=root + ExecStartPre=/bin/mkdir -p /gcs + ExecStart=gcsfuse --config-file /etc/gcsfuse.yml $(vars.hns_gcs_bucket) /gcs + ExecStop=fusermount3 -u /gcs + + [Install] + WantedBy=slurmd.service multi-user.target + + post_tasks: + - name: Enable and restart gcsfuse + ansible.builtin.service: + name: gcsfuse.service + state: restarted + enabled: true + +deployment_groups: +- group: image-env + modules: + - id: slurm-image-network + source: modules/network/vpc + + - id: slurm-build-script + source: modules/scripts/startup-script + settings: + install_ansible: true + docker: + enabled: true + runners: + - type: data + destination: /etc/cluster_toolkit/a3ultra-prod-slurm-image.yaml + source: ../.ghpc/artifacts/expanded_blueprint.yaml + - type: data + destination: /var/tmp/slurm_vars.json + content: | + { + "reboot": false, + "install_cuda": false, + 
"install_gcsfuse": true, + "install_lustre": false, + "install_ompi": true, + "update_kernel": false, + "monitoring_agent": "cloud-ops", + } + - type: shell + destination: install_slurm.sh + content: | + #!/bin/bash + set -e -o pipefail + ansible-pull \ + -U https://github.com/GoogleCloudPlatform/slurm-gcp -C $(vars.build_slurm_from_git_ref) \ + -i localhost, --limit localhost --connection=local \ + -e @/var/tmp/slurm_vars.json \ + ansible/playbook.yml + # this duplicates the ulimits configuration of the HPC VM Image + - type: data + destination: /etc/security/limits.d/99-unlimited.conf + content: | + * - memlock unlimited + * - nproc unlimited + * - stack unlimited + * - nofile 1048576 + * - cpu unlimited + * - rtprio unlimited + - type: data + destination: /etc/systemd/system/slurmd.service.d/file_ulimit.conf + content: | + [Service] + LimitNOFILE=infinity + - type: data + destination: /etc/netplan/60-cloud-mrdma-init.yaml + content: | + network: + ethernets: + primary: + match: + name: enp0s* + driver: gve + dhcp4: true + dhcp4-overrides: + use-domains: true + dhcp6: true + dhcp6-overrides: + use-domains: true + optional: true + secondary: + match: + driver: gve + dhcp4: true + dhcp4-overrides: + use-domains: false + use-dns: false + use-ntp: false + dhcp6: true + dhcp6-overrides: + use-domains: false + use-dns: false + use-ntp: false + optional: true + mrdma_devices: + match: + driver: mlx5_core + dhcp-identifier: mac + dhcp4: true + dhcp4-overrides: + use-domains: true + use-dns: false + use-ntp: false + optional: true + version: 2 + - type: ansible-local + destination: configure_gpu.yml + content: | + --- + - name: Install NVIDIA packages + hosts: all + become: true + vars: + distribution: "{{ ansible_distribution | lower }}{{ ansible_distribution_version | replace('.','') }}" + cuda_repo_url: https://developer.download.nvidia.com/compute/cuda/repos/{{ distribution }}/x86_64/cuda-keyring_1.1-1_all.deb + cuda_repo_filename: /tmp/{{ cuda_repo_url | basename }} + enable_nvidia_dcgm: false + nvidia_packages: + - cuda-toolkit-12-4 + - datacenter-gpu-manager + - libnvidia-nscq-550 + tasks: + - name: Download NVIDIA repository package + ansible.builtin.get_url: + url: "{{ cuda_repo_url }}" + dest: "{{ cuda_repo_filename }}" + - name: Install NVIDIA repository package + ansible.builtin.apt: + deb: "{{ cuda_repo_filename }}" + state: present + - name: Reduce NVIDIA repository priority + ansible.builtin.copy: + dest: /etc/apt/preferences.d/cuda-repository-pin-600 + mode: 0o0644 + owner: root + group: root + content: | + Package: nsight-compute + Pin: origin *ubuntu.com* + Pin-Priority: -1 + + Package: nsight-systems + Pin: origin *ubuntu.com* + Pin-Priority: -1 + + Package: * + Pin: release l=NVIDIA CUDA + Pin-Priority: 400 + - name: Install NVIDIA fabric and CUDA + ansible.builtin.apt: + name: "{{ item }}" + update_cache: true + loop: "{{ nvidia_packages }}" + - name: Freeze NVIDIA fabric and CUDA + ansible.builtin.dpkg_selections: + name: "{{ item }}" + selection: hold + loop: "{{ nvidia_packages }}" + post_tasks: + - name: Disable NVIDIA DCGM by default (enable during boot on GPU nodes) + ansible.builtin.service: + name: nvidia-dcgm.service + state: stopped + enabled: false + - type: ansible-local + destination: install_mellanox_drivers.yml + content: | + --- + - name: Update Netplan and Install Network Utils + hosts: all + become: true + tasks: + - name: Install Linux Modules Extra + ansible.builtin.package: + name: + - ibverbs-utils + state: present + - name: Apply netplan + 
ansible.builtin.command: netplan apply + +- group: image + modules: + - id: slurm-a3ultra-image + source: modules/packer/custom-image + kind: packer + settings: + disk_size: $(vars.disk_size_gb) + machine_type: $(vars.image_build_machine_type) + source_image_family: $(vars.base_image.family) + source_image_project_id: [$(vars.base_image.project)] + image_family: $(vars.instance_image.family) + omit_external_ip: false + use: + - slurm-image-network + - slurm-build-script + +- group: cluster-env + modules: + - id: a3ultra-slurm-net-0 + source: modules/network/vpc + settings: + network_name: $(vars.deployment_name)-net-0 + mtu: 8896 + subnetworks: + - subnet_name: $(vars.deployment_name)-sub-0 + subnet_region: $(vars.region) + subnet_ip: $(vars.net0_range) + + - id: a3ultra-slurm-net-1 + source: modules/network/vpc + settings: + network_name: $(vars.deployment_name)-net-1 + mtu: 8896 + subnetworks: + - subnet_name: $(vars.deployment_name)-sub-1 + subnet_region: $(vars.region) + subnet_ip: $(vars.net1_range) + + - id: a3ultra-slurm-rdma-net + source: modules/network/gpu-rdma-vpc + settings: + network_name: $(vars.deployment_name)-rdma-net + network_profile: https://www.googleapis.com/compute/beta/projects/$(vars.project_id)/global/networkProfiles/$(vars.zone)-vpc-roce + network_routing_mode: REGIONAL + nic_type: MRDMA + subnetworks_template: + name_prefix: $(vars.deployment_name)-mrdma-sub + count: 8 + ip_range: $(vars.rdma_net_range) + region: $(vars.region) + + - id: homefs + source: modules/file-system/filestore + use: + - a3ultra-slurm-net-0 + settings: + filestore_tier: HIGH_SCALE_SSD + size_gb: 10240 + local_mount: /home + reserved_ip_range: $(vars.filestore_ip_range) + deletion_protection: + enabled: true + reason: Avoid data loss + outputs: + - network_storage + +- group: cluster + modules: + - id: a3ultra_startup + source: modules/scripts/startup-script + settings: + local_ssd_filesystem: + mountpoint: $(vars.local_ssd_mountpoint) + permissions: "1777" # must quote numeric filesystem permissions! 
+ docker: + enabled: true + world_writable: true + daemon_config: | + { + "data-root": "$(vars.local_ssd_mountpoint)/docker" + } + runners: $(flatten([vars.a3u_runners])) + + - id: a3_ultra_nodeset + source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset + use: [a3ultra-slurm-net-0, a3ultra_startup] + settings: + bandwidth_tier: gvnic_enabled + machine_type: a3-ultragpu-8g + instance_image_custom: true + enable_public_ips: true + node_count_static: $(vars.a3u_cluster_size) + node_count_dynamic_max: 0 + enable_placement: false + disk_type: hyperdisk-balanced + on_host_maintenance: TERMINATE + reservation_name: $(vars.a3u_reservation_name) + additional_networks: + $(concat( + [{ + network=null, + subnetwork=a3ultra-slurm-net-1.subnetwork_self_link, + subnetwork_project=vars.project_id, + nic_type="GVNIC", + queue_count=null, + network_ip="", + stack_type=null, + access_config=[], + ipv6_access_config=[], + alias_ip_range=[] + }], + a3ultra-slurm-rdma-net.subnetwork_interfaces + )) + + - id: a3_ultra_partition + source: community/modules/compute/schedmd-slurm-gcp-v6-partition + use: + - a3_ultra_nodeset + settings: + exclusive: false + partition_name: a3ultra + is_default: true + partition_conf: + ResumeTimeout: 900 + SuspendTimeout: 600 + OverSubscribe: EXCLUSIVE + + - id: controller_startup + source: modules/scripts/startup-script + settings: + runners: $(flatten([vars.shared_runners, vars.controller_runners, vars.gcsfuse_runners])) + + - id: login_startup + source: modules/scripts/startup-script + settings: + runners: $(flatten([vars.shared_runners, vars.gcsfuse_runners])) + + - id: slurm_login + source: community/modules/scheduler/schedmd-slurm-gcp-v6-login + use: [a3ultra-slurm-net-0] + settings: + instance_image_custom: true + disk_size_gb: 300 + enable_login_public_ips: true + machine_type: n2-standard-8 + + - id: slurm_controller + source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller + use: + - a3ultra-slurm-net-0 + - a3_ultra_partition + - slurm_login + - homefs + settings: + enable_controller_public_ips: true + instance_image_custom: true + disk_type: pd-extreme + disk_size_gb: 300 + machine_type: n2-standard-80 + controller_startup_script: $(controller_startup.startup_script) + login_startup_script: $(login_startup.startup_script) + enable_external_prolog_epilog: true diff --git a/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/deployment.yaml b/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/deployment.yaml new file mode 100644 index 0000000000..d955eda1f4 --- /dev/null +++ b/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/deployment.yaml @@ -0,0 +1,31 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +--- +# If using GCS as a terraform backend (suggested), add the following. If not, +# comment out or remove. +terraform_backend_defaults: + type: gcs + configuration: + bucket: # Name of terraform state bucket. +# End of optional section + +vars: + deployment_name: # Unique name of this Cluster Toolkit Deployment, e.g. 
a3u-gcs + project_id: # Your GCP project name + region: # e.g. europe-west1 + zone: # e.g. europe-west1-b + a3u_reservation_name: # reservation name, e.g. a3u-reservation-00 + a3u_cluster_size: # Number of A3-Ultra nodes in the cluster + hns_gcs_bucket: # This bucket must have been previously created diff --git a/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/run-nccl-tests-via-ramble.sh b/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/run-nccl-tests-via-ramble.sh new file mode 100644 index 0000000000..62061533f3 --- /dev/null +++ b/examples/hypercompute_clusters/a3u-slurm-ubuntu-gcs/run-nccl-tests-via-ramble.sh @@ -0,0 +1,224 @@ +#!/bin/bash +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +set -eu + +trap "printf '\nCaught Ctrl+c. Exiting...\n'; exit" INT + +# Use current unix timestamp as a unique tag +# for jobs submitted +TAG=$(date +%s) +TEST_DIR=nccl-tests-"${TAG}" +SOFTWARE_INSTALL=/opt/apps + +cat <"${TEST_DIR}"/configs/ramble.yaml +# Ramble Configuration for NCCL Tests +ramble: + env_vars: + set: + OMPI_MCA_pml: "^ucx" + OMPI_MCA_btl: "^openib" + OMPI_MCA_btl_tcp_if_include: enp0s19 + + CUDA_VISIBLE_DEVICES: 0,1,2,3,4,5,6,7 + NCCL_NET: gIB + NCCL_SOCKET_IFNAME: enp0s19,enp192s20 + NCCL_CROSS_NIC: 0 + NCCL_NET_GDR_LEVEL: PIX + NCCL_P2P_NET_CHUNKSIZE: 131072 + NCCL_P2P_PCI_CHUNKSIZE: 131072 + NCCL_P2P_NVL_CHUNKSIZE: 524288 + NCCL_NVLS_CHUNKSIZE: 524288 + NCCL_IB_GID_INDEX: 3 + NCCL_IB_ADAPTIVE_ROUTING: 1 + NCCL_IB_QPS_PER_CONNECTION: 4 + NCCL_IB_TC: 52 + NCCL_IB_FIFO_TC: 84 + NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE: /usr/local/gib/configs/guest_config.txtpb + NCCL_TUNER_CONFIG_PATH: /usr/local/gib/configs/tuner_config.txtpb + prepend: + - paths: + LD_LIBRARY_PATH: /usr/local/gib/lib64 + + variables: + mpi_command: srun --mpi=pmix + batch_submit: 'sbatch {execute_experiment}' + processes_per_node: '{gpus_per_node}' + gpus_per_node: '8' + applications: + nccl-tests: + workloads: + '{workload}': + experiments: + '{workload}-{n_nodes}': + variants: + package_manager: spack + variables: + workload: [all-gather, all-reduce, reduce-scatter] + n_nodes: [2, 4, 8, 16, 32] + matrix: + - n_nodes + - workload + + software: + packages: + pmix: + pkg_spec: pmix + mpi: + pkg_spec: openmpi +cuda cuda_arch=90 + cuda: + pkg_spec: cuda@12.4.0 + nccl: + pkg_spec: nccl@2.23.4-1 cuda_arch=90 + nccl-tests: + pkg_spec: nccl-tests cuda_arch=90 + environments: + nccl-tests: + packages: [cuda, mpi, nccl, nccl-tests, pmix] + +EOF + +# Populate slurm sbatch script +cat <"${TEST_DIR}"/configs/execute_experiment.tpl +#!/bin/bash +#SBATCH -J {experiment_name}-"${TAG}" +#SBATCH --output={experiment_run_dir}/slurm-%j.out +#SBATCH -N {n_nodes} +#SBATCH --gpus-per-node=8 +#SBATCH --exclusive +#SBATCH --ntasks-per-node={processes_per_node} + +cd "{experiment_run_dir}" +{command} +EOF + +# Get number of nodes available +N_NODES=$(sinfo -h -o %D) + +# Print available benchmarks +printf "\n--------- Setting up Benchmarks ----------\n" +ramble workspace info --where '{n_nodes} <= 
'"$N_NODES" + +printf "\n------- About to run the following: ------\n\n" +printf "source %s/ramble/env/bin/activate\n" "${SOFTWARE_INSTALL}" +printf ". %s/ramble/share/ramble/setup-env.sh\n" "${SOFTWARE_INSTALL}" +printf ". %s/spack/share/spack/setup-env.sh\n" "${SOFTWARE_INSTALL}" +printf "ramble workspace activate %s\n" "${TEST_DIR}" +printf "ramble workspace setup --where '{n_nodes} <= %s'\n" "${N_NODES}" +printf "ramble on --where '{n_nodes} <= %s' \n" "${N_NODES}" + +# Set up experiments +printf "\n--------- Setting up Benchmarks -------\n" +printf " This may take 20-30 minutes \n" +ramble workspace setup --where '{n_nodes} <= '"${N_NODES}" + +# Submit Experiments to Slurm +printf "\n----------- Running Benchmarks --------\n" +ramble on --where '{n_nodes} <= '"${N_NODES}" + +# Wait for all to be done +# Use the TAG in the slurm jobs +until [[ $(squeue -h -o %j | grep -c "${TAG}") -eq 0 ]]; do + clear + echo "waiting for $(squeue -h -o %j | grep -c "${TAG}") jobs to finish" + squeue + sleep 5 +done + +# Analyze +ramble workspace analyze -f json --where '{n_nodes} <= '"${N_NODES}" + +# Summarize all results in summary.tsv +cd "${TEST_DIR}" +jq -r '["workload","n_nodes","msg_size","busbw"], (.experiments[] as $exp | $exp.CONTEXTS[] as $context | +{ + experiment_name: $exp.name, + workload: $exp.workload_name, + n_nodes: $exp.n_nodes, + Context: $context.name +} + +($context.foms | from_entries ) +| [.workload, .n_nodes, .Size, ."Out of Place Bus Bandwidth"]) +| @tsv' results.latest.json >summary.tsv + +# Print just the 8GB message sizes +printf "\n--- SUMMARY for 8GB Message Sizes --\n" +jq -r '["workload","n_nodes","msg_size","busbw"], (.experiments[] as $exp | $exp.CONTEXTS[] as $context | +{ + experiment_name: $exp.name, + workload: $exp.workload_name, + n_nodes: $exp.n_nodes, + Context: $context.name +} + +($context.foms | from_entries ) +| select(.Size | tonumber > 8000000000) +| [.workload, .n_nodes, .Size, ."Out of Place Bus Bandwidth"]) +| @tsv' results.latest.json +printf "\nFor full results, see \"summary.tsv\"\n" + +printf "\n- To reactivate this ramble workspace, run -\n\n" +printf "source %s/ramble/env/bin/activate\n" "${SOFTWARE_INSTALL}" +printf ". %s/ramble/share/ramble/setup-env.sh\n" "${SOFTWARE_INSTALL}" +printf ". %s/spack/share/spack/setup-env.sh\n" "${SOFTWARE_INSTALL}" +printf "ramble workspace activate %s\n" "${TEST_DIR}" diff --git a/examples/machine-learning/a3-ultragpu-8g/README.md b/examples/machine-learning/a3-ultragpu-8g/README.md new file mode 100644 index 0000000000..dfa3bb17c5 --- /dev/null +++ b/examples/machine-learning/a3-ultragpu-8g/README.md @@ -0,0 +1,16 @@ +# A3 Ultra Blueprints + +For further information on deploying an A3 Ultra cluster with Slurm, please +see: + +[Create A3 Ultra Slurm Cluster](https://cloud.google.com/ai-hypercomputer/docs/create/create-slurm-cluster) + +If you are unable to access these documents, please contact your +[Technical Account Manager (TAM)](https://cloud.google.com/tam). + +## Deploy A3 Ultra compute VM with custom startup-scripts + +Customers can deploy [a3ultra-vm.yaml] blueprint to deploy 2 A3 Ultra VMs. You +can also specify custom startup-scripts to run in the blueprint. 
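+
+A deployment might then look like the following sketch (assuming the `gcluster`
+binary is on your `PATH`, and that you have first filled in `project_id` and the
+VM reservation name inside `a3ultra-vm.yaml`):
+
+```bash
+# Deploy directly from the blueprint; required vars such as project_id are set
+# inside the file rather than in a separate deployment.yaml.
+gcluster deploy a3ultra-vm.yaml
+
+# Tear the deployment down when finished; the deployment folder is named after
+# the blueprint's deployment_name.
+gcluster destroy a3ultra-vm-instance
+```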
+ +[a3ultra-vm.yaml]: ./a3ultra-vm.yaml diff --git a/examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml b/examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml new file mode 100644 index 0000000000..29b08add88 --- /dev/null +++ b/examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml @@ -0,0 +1,451 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +--- +# This blueprint uses private preview functionality in limited availability, +# see README.md for further information + +# This blueprint requires a Cluster Toolkit binary built from a +# release >= 1.44.0 + +blueprint_name: a3ultra-slurm + +vars: + deployment_name: # supply deployment name + project_id: # supply project ID + region: # supply region + zone: # supply zone + a3u_cluster_size: # supply cluster size + a3u_reservation_name: # supply reservation name + # Image settings + base_image: + project: ubuntu-os-accelerator-images + family: ubuntu-accelerator-2204-amd64-with-nvidia-550 + image_build_machine_type: n2-standard-16 + build_slurm_from_git_ref: 6.8.7 + # Cluster env settings + # net0 and filestore ranges must not overlap + net0_range: 192.168.0.0/19 + filestore_ip_range: 192.168.32.0/24 + net1_range: 192.168.64.0/18 + rdma_net_range: 192.168.128.0/18 + # Cluster Settings + local_ssd_mountpoint: /mnt/localssd + instance_image: + project: $(vars.project_id) + family: $(vars.deployment_name)-u22 + disk_size_gb: 200 + nccl_plugin_version: v1.0.2 + +deployment_groups: +- group: image-env + modules: + - id: slurm-image-network + source: modules/network/vpc + + - id: slurm-build-script + source: modules/scripts/startup-script + settings: + install_ansible: true + docker: + enabled: true + runners: + - type: data + destination: /etc/cluster_toolkit/a3ultra-prod-slurm-image.yaml + source: ../.ghpc/artifacts/expanded_blueprint.yaml + - type: data + destination: /var/tmp/slurm_vars.json + content: | + { + "reboot": false, + "install_cuda": false, + "install_gcsfuse": true, + "install_lustre": false, + "install_ompi": true, + "update_kernel": false, + "monitoring_agent": "cloud-ops", + } + - type: shell + destination: install_slurm.sh + content: | + #!/bin/bash + set -e -o pipefail + ansible-pull \ + -U https://github.com/GoogleCloudPlatform/slurm-gcp -C $(vars.build_slurm_from_git_ref) \ + -i localhost, --limit localhost --connection=local \ + -e @/var/tmp/slurm_vars.json \ + ansible/playbook.yml + # this duplicates the ulimits configuration of the HPC VM Image + - type: data + destination: /etc/security/limits.d/99-unlimited.conf + content: | + * - memlock unlimited + * - nproc unlimited + * - stack unlimited + * - nofile 1048576 + * - cpu unlimited + * - rtprio unlimited + - type: data + destination: /etc/systemd/system/slurmd.service.d/file_ulimit.conf + content: | + [Service] + LimitNOFILE=infinity + - type: data + destination: /etc/netplan/60-cloud-mrdma-init.yaml + content: | + network: + ethernets: + primary: + match: + name: enp0s* + driver: gve + dhcp4: 
true + dhcp4-overrides: + use-domains: true + dhcp6: true + dhcp6-overrides: + use-domains: true + optional: true + secondary: + match: + driver: gve + dhcp4: true + dhcp4-overrides: + use-domains: false + use-dns: false + use-ntp: false + dhcp6: true + dhcp6-overrides: + use-domains: false + use-dns: false + use-ntp: false + optional: true + mrdma_devices: + match: + driver: mlx5_core + dhcp-identifier: mac + dhcp4: true + dhcp4-overrides: + use-domains: true + use-dns: false + use-ntp: false + optional: true + version: 2 + - type: ansible-local + destination: configure_gpu.yml + content: | + --- + - name: Install NVIDIA packages + hosts: all + become: true + vars: + distribution: "{{ ansible_distribution | lower }}{{ ansible_distribution_version | replace('.','') }}" + cuda_repo_url: https://developer.download.nvidia.com/compute/cuda/repos/{{ distribution }}/x86_64/cuda-keyring_1.1-1_all.deb + cuda_repo_filename: /tmp/{{ cuda_repo_url | basename }} + enable_nvidia_dcgm: false + nvidia_packages: + - cuda-toolkit-12-4 + - datacenter-gpu-manager + - libnvidia-nscq-550 + tasks: + - name: Download NVIDIA repository package + ansible.builtin.get_url: + url: "{{ cuda_repo_url }}" + dest: "{{ cuda_repo_filename }}" + - name: Install NVIDIA repository package + ansible.builtin.apt: + deb: "{{ cuda_repo_filename }}" + state: present + - name: Reduce NVIDIA repository priority + ansible.builtin.copy: + dest: /etc/apt/preferences.d/cuda-repository-pin-600 + mode: 0o0644 + owner: root + group: root + content: | + Package: nsight-compute + Pin: origin *ubuntu.com* + Pin-Priority: -1 + + Package: nsight-systems + Pin: origin *ubuntu.com* + Pin-Priority: -1 + + Package: * + Pin: release l=NVIDIA CUDA + Pin-Priority: 400 + - name: Install NVIDIA fabric and CUDA + ansible.builtin.apt: + name: "{{ item }}" + update_cache: true + loop: "{{ nvidia_packages }}" + - name: Freeze NVIDIA fabric and CUDA + ansible.builtin.dpkg_selections: + name: "{{ item }}" + selection: hold + loop: "{{ nvidia_packages }}" + post_tasks: + - name: Disable NVIDIA DCGM by default (enable during boot on GPU nodes) + ansible.builtin.service: + name: nvidia-dcgm.service + state: stopped + enabled: false + - type: ansible-local + destination: install_mellanox_drivers.yml + content: | + --- + - name: Update Netplan and Install Network Utils + hosts: all + become: true + tasks: + - name: Install Linux Modules Extra + ansible.builtin.package: + name: + - ibverbs-utils + state: present + - name: Apply netplan + ansible.builtin.command: netplan apply + +- group: image + modules: + - id: slurm-a3ultra-image + source: modules/packer/custom-image + kind: packer + settings: + disk_size: $(vars.disk_size_gb) + machine_type: $(vars.image_build_machine_type) + source_image_family: $(vars.base_image.family) + source_image_project_id: [$(vars.base_image.project)] + image_family: $(vars.instance_image.family) + omit_external_ip: false + use: + - slurm-image-network + - slurm-build-script + +- group: cluster-env + modules: + - id: a3ultra-slurm-net-0 + source: modules/network/vpc + settings: + network_name: $(vars.deployment_name)-net-0 + mtu: 8896 + enable_internal_traffic: false # Setting firewall below instead + subnetworks: + - subnet_name: $(vars.deployment_name)-sub-0 + subnet_region: $(vars.region) + subnet_ip: $(vars.net0_range) + firewall_rules: + - name: $(vars.deployment_name)-internal-0 + ranges: [$(vars.net0_range)] + allow: + - protocol: tcp + - protocol: udp + - protocol: icmp + + - id: a3ultra-slurm-net-1 + source: modules/network/vpc 
+ settings: + network_name: $(vars.deployment_name)-net-1 + mtu: 8896 + enable_internal_traffic: false # Setting firewall below instead + subnetworks: + - subnet_name: $(vars.deployment_name)-sub-1 + subnet_region: $(vars.region) + subnet_ip: $(vars.net1_range) + firewall_rules: + - name: $(vars.deployment_name)-internal-1 + ranges: [$(vars.net1_range)] + allow: + - protocol: tcp + - protocol: udp + - protocol: icmp + + - id: a3ultra-slurm-rdma-net + source: modules/network/gpu-rdma-vpc + settings: + network_name: $(vars.deployment_name)-rdma-net + network_profile: https://www.googleapis.com/compute/beta/projects/$(vars.project_id)/global/networkProfiles/$(vars.zone)-vpc-roce + network_routing_mode: REGIONAL + subnetworks_template: + name_prefix: $(vars.deployment_name)-mrdma-sub + count: 8 + ip_range: $(vars.rdma_net_range) + region: $(vars.region) + firewall_rules: + - name: $(vars.deployment_name)-internal-rdma + ranges: [$(vars.rdma_net_range)] + allow: + - protocol: tcp + - protocol: udp + - protocol: icmp + + - id: homefs + source: modules/file-system/filestore + use: + - a3ultra-slurm-net-0 + settings: + filestore_tier: HIGH_SCALE_SSD + size_gb: 10240 + local_mount: /home + reserved_ip_range: $(vars.filestore_ip_range) + deletion_protection: + enabled: true + reason: Avoid data loss + outputs: + - network_storage + +- group: cluster + modules: + - id: a3ultra_startup + source: modules/scripts/startup-script + settings: + local_ssd_filesystem: + mountpoint: $(vars.local_ssd_mountpoint) + permissions: "1777" # must quote numeric filesystem permissions! + docker: + enabled: true + world_writable: true + daemon_config: | + { + "data-root": "$(vars.local_ssd_mountpoint)/docker" + } + runners: + - type: data + destination: /etc/enroot/enroot.conf + content: | + ENROOT_RUNTIME_PATH $(vars.local_ssd_mountpoint)/${UID}/enroot/runtime + ENROOT_CACHE_PATH $(vars.local_ssd_mountpoint)/${UID}/enroot/cache + ENROOT_DATA_PATH $(vars.local_ssd_mountpoint)/${UID}/enroot/data + ENROOT_TEMP_PATH $(vars.local_ssd_mountpoint)/${UID}/enroot + - type: ansible-local + destination: nccl_plugin.yml + content: | + --- + - name: Install NCCL plugin for A3 Ultra series + hosts: all + become: true + tasks: + - name: Add SystemD unit for NCCL plugin installation + ansible.builtin.copy: + dest: /etc/systemd/system/nccl-plugin@.service + mode: 0o0644 + content: | + [Unit] + After=network-online.target + Before=slurmd.service + + [Service] + Type=oneshot + ExecStartPre=/usr/bin/rm -rf /usr/local/gib + ExecStartPre=/usr/bin/mkdir -p /usr/local/gib + ExecStartPre=/snap/bin/gcloud auth configure-docker --quiet us-docker.pkg.dev + ExecStart=/usr/bin/docker run --rm --name nccl-gib-installer --volume /usr/local/gib:/var/lib/gib \ + us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:%i install --install-nccl + + [Install] + WantedBy=slurmd.service + notify: + - Reload SystemD + handlers: + - name: Reload SystemD + ansible.builtin.systemd: + daemon_reload: true + post_tasks: + - name: Enable NCCL plugin SystemD unit + ansible.builtin.service: + name: nccl-plugin@$(vars.nccl_plugin_version).service + state: started + enabled: true + + - id: a3_ultra_nodeset + source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset + use: [a3ultra-slurm-net-0, a3ultra_startup] + settings: + bandwidth_tier: gvnic_enabled + machine_type: a3-ultragpu-8g + instance_image_custom: true + enable_public_ips: true + node_count_static: $(vars.a3u_cluster_size) + node_count_dynamic_max: 0 + enable_placement: false + disk_type: 
hyperdisk-balanced + on_host_maintenance: TERMINATE + reservation_name: $(vars.a3u_reservation_name) + additional_networks: + $(concat( + [{ + network=null, + subnetwork=a3ultra-slurm-net-1.subnetwork_self_link, + subnetwork_project=vars.project_id, + nic_type="GVNIC", + queue_count=null, + network_ip="", + stack_type=null, + access_config=[], + ipv6_access_config=[], + alias_ip_range=[] + }], + a3ultra-slurm-rdma-net.subnetwork_interfaces + )) + + - id: a3_ultra_partition + source: community/modules/compute/schedmd-slurm-gcp-v6-partition + use: + - a3_ultra_nodeset + settings: + exclusive: false + partition_name: a3ultra + is_default: true + partition_conf: + ResumeTimeout: 900 + SuspendTimeout: 600 + + - id: slurm_login + source: community/modules/scheduler/schedmd-slurm-gcp-v6-login + use: [a3ultra-slurm-net-0] + settings: + instance_image_custom: true + disk_size_gb: 300 + enable_login_public_ips: true + machine_type: n2-standard-8 + + - id: controller_startup + source: modules/scripts/startup-script + settings: + runners: + - type: shell + destination: stage_scripts.sh + content: | + #!/bin/bash + SLURM_ROOT=/opt/apps/adm/slurm + PARTITION_NAME=$(a3_ultra_partition.partitions[0].partition_name) + mkdir -m 0755 -p "${SLURM_ROOT}/scripts" + mkdir -p "${SLURM_ROOT}/partition-${PARTITION_NAME}-epilog_slurmd.d" + ln -s "/slurm/scripts/tools/gpu-test" "${SLURM_ROOT}/partition-${PARTITION_NAME}-epilog_slurmd.d/gpu-test.epilog_slurmd" + + - id: slurm_controller + source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller + use: + - a3ultra-slurm-net-0 + - a3_ultra_partition + - slurm_login + - homefs + settings: + enable_controller_public_ips: true + instance_image_custom: true + disk_type: pd-extreme + disk_size_gb: 300 + machine_type: n2-standard-80 + controller_startup_script: $(controller_startup.startup_script) + enable_external_prolog_epilog: true diff --git a/examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-deployment.yaml b/examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-deployment.yaml new file mode 100644 index 0000000000..6fa29af09e --- /dev/null +++ b/examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-deployment.yaml @@ -0,0 +1,26 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+--- +terraform_backend_defaults: + type: gcs + configuration: + bucket: # supply existing bucket to store Terraform state + +vars: + deployment_name: # supply unique deployment name + project_id: # supply existing project id + region: # supply region with a3-ultragpu-8g capacity in reservation + zone: # supply zone with a3-ultragpu-8g capacity in reservation + a3u_reservation_name: # supply a3-ultragpu-8g reservation name + a3u_cluster_size: # supply a3-ultragpu-8g reservation size diff --git a/examples/machine-learning/a3-ultragpu-8g/a3ultra-vm.yaml b/examples/machine-learning/a3-ultragpu-8g/a3ultra-vm.yaml new file mode 100644 index 0000000000..25d7fd83bf --- /dev/null +++ b/examples/machine-learning/a3-ultragpu-8g/a3ultra-vm.yaml @@ -0,0 +1,151 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +--- + +blueprint_name: a3ultra-vm-instance + +vars: + project_id: # supply project ID + deployment_name: a3ultra-vm-instance + region: europe-west1 + zone: europe-west1-b + instance_image: + project: ubuntu-os-accelerator-images + family: ubuntu-accelerator-2204-amd64-with-nvidia-550 + net0_range: 192.168.0.0/19 + net1_range: 192.168.64.0/18 + filestore_ip_range: 192.168.32.0/24 + rdma_net_range: 192.168.128.0/18 + hostname_prefix: $(vars.deployment_name)-beowulf + +deployment_groups: +- group: primary + modules: + + - id: a3ultra-net-0 + source: modules/network/vpc + settings: + network_name: $(vars.deployment_name)-net-0 + mtu: 8896 + subnetworks: + - subnet_name: $(vars.deployment_name)-sub-0 + subnet_region: $(vars.region) + subnet_ip: $(vars.net0_range) + firewall_rules: + - name: $(vars.deployment_name)-internal-0 + ranges: [$(vars.net0_range)] + allow: + - protocol: tcp + - protocol: udp + - protocol: icmp + + - id: a3ultra-net-1 + source: modules/network/vpc + settings: + network_name: $(vars.deployment_name)-net-1 + mtu: 8896 + subnetworks: + - subnet_name: $(vars.deployment_name)-sub-1 + subnet_region: $(vars.region) + subnet_ip: $(vars.net1_range) + firewall_rules: + - name: $(vars.deployment_name)-internal-1 + ranges: [$(vars.net1_range)] + allow: + - protocol: tcp + - protocol: udp + - protocol: icmp + + - id: a3ultra-rdma-net + source: modules/network/gpu-rdma-vpc + settings: + network_name: $(vars.deployment_name)-rdma-net + network_profile: https://www.googleapis.com/compute/beta/projects/$(vars.project_id)/global/networkProfiles/$(vars.zone)-vpc-roce + network_routing_mode: REGIONAL + subnetworks_template: + name_prefix: $(vars.deployment_name)-mrdma-sub + count: 8 + ip_range: $(vars.rdma_net_range) + region: $(vars.region) + firewall_rules: + - name: $(vars.deployment_name)-internal-rdma + ranges: [$(vars.rdma_net_range)] + allow: + - protocol: tcp + - protocol: udp + - protocol: icmp + + - id: homefs + source: modules/file-system/filestore + use: [a3ultra-net-0] + settings: + filestore_tier: HIGH_SCALE_SSD + size_gb: 10240 + local_mount: /home + reserved_ip_range: $(vars.filestore_ip_range) + outputs: + - network_storage + + - id: startup-script + 
source: modules/scripts/startup-script + settings: + configure_ssh_host_patterns: + - $(vars.hostname_prefix)-* + + - id: a3ultra-vms + source: modules/compute/vm-instance + use: [startup-script, homefs] + settings: + machine_type: a3-ultragpu-8g + instance_count: 2 + name_prefix: $(vars.hostname_prefix) + disk_type: hyperdisk-balanced + automatic_restart: true + on_host_maintenance: TERMINATE + reservation_name: # supply reservation name + network_interfaces: + $(concat( + [{ + network=null, + subnetwork=a3ultra-net-0.subnetwork_self_link, + subnetwork_project=vars.project_id, + nic_type="GVNIC", + queue_count=null, + network_ip=null, + stack_type=null, + access_config=[{nat_ip=null, public_ptr_domain_name=null, network_tier=null}], + ipv6_access_config=[], + alias_ip_range=[] + }, + { + network=null, + subnetwork=a3ultra-net-1.subnetwork_self_link, + subnetwork_project=vars.project_id, + nic_type="GVNIC", + queue_count=null, + network_ip=null, + stack_type=null, + access_config=[{nat_ip=null, public_ptr_domain_name=null, network_tier=null}], + ipv6_access_config=[], + alias_ip_range=[] + }], + a3ultra-rdma-net.subnetwork_interfaces, + )) + + - id: wait-for-vms + source: community/modules/scripts/wait-for-startup + settings: + instance_names: $(a3ultra-vms.name) + timeout: 7200 diff --git a/examples/machine-learning/a3-ultragpu-8g/nccl-tests/README.md b/examples/machine-learning/a3-ultragpu-8g/nccl-tests/README.md new file mode 100644 index 0000000000..3f6dfab5c9 --- /dev/null +++ b/examples/machine-learning/a3-ultragpu-8g/nccl-tests/README.md @@ -0,0 +1,89 @@ +The examples in this directory are used to show how enroot + pyxis can be used +to launch containerized workloads via Slurm. + +Contents: + +* `build-nccl-tests.sh`: A Slurm batch script for building the nccl-tests. +* `run-nccl-tests.sh`: A Slurm batch script for running the nccl-tests + `all_reduce_perf` benchmark. +* `import_container.sh`: Uses enroot to create a squashfs container image. Added + for reference only. enroot import happens within the `build-nccl-tests.sh`. + +# Running NCCL-Tests via Enroot/Pyxis + +In general the workflow to deploy GPUDirect-RDMA-enabled workloads via enroot-pyxis is +the following: + +1. Convert your container into a squashfs based container image +2. Set required environment variables +3. Run your application workload + +## TLDR + +For an end-to-end example, copy the `build-nccl-tests.sh` and +`run-nccl-tests.sh` to your login node. 
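+
+One way to copy them is with `gcloud compute scp` from your workstation (a
+sketch; the login-node name and zone below are placeholders for your own):
+
+```bash
+gcloud compute scp build-nccl-tests.sh run-nccl-tests.sh \
+  <login-node-name>:~/ --zone=<zone>
+```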
+
+And run the following:
+
+```text
+BUILD_JOB=$(sbatch --parsable build-nccl-tests.sh)   # takes ~4 minutes
+sbatch -d afterok:${BUILD_JOB} run-nccl-tests.sh     # takes ~3 minutes
+```
+
+The latter should result in a slurm-XX.out file that contains the results of
+the NCCL `all_gather_perf` benchmark:
+
+```text
+#
+#                                                             out-of-place                       in-place
+#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
+#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
+   268435456       4194304     float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX       0
+   536870912       8388608     float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX       0
+  1073741824      16777216     float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX       0
+  2147483648      33554432     float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX       0
+  4294967296      67108864     float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX       0
+  8589934592     134217728     float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX       0
+# Out of bounds values : 0 OK
+# Avg bus bandwidth    : XXX.XX
+#
+```
+
+For more details, follow the remainder of this README.
+
+## Detailed Instructions
+
+All of the following should be done on the login node of your Slurm cluster,
+from somewhere on the shared Filestore filesystem (typically the user's
+home directory).
+
+### Building NCCL-tests
+
+See `build-nccl-tests.sh` for an example. Within it, you will see that we first
+create a squashfs version of the container we want to launch, using
+`enroot import`. We do this because otherwise we would be pulling the
+(typically more than 10GB) image from the source multiple times (once on each
+node), converting it to sqsh each time, and so on, which would make the job
+launch take longer.
+
+For building the nccl-tests binaries, we use `pyxis` to run the enroot container
+and build the nccl-tests within that container, to ensure the resulting binaries
+are compatible with the container environment.
+
+Both of the above (importing and building) are accomplished by running:
+
+```text
+sbatch build-nccl-tests.sh
+```
+
+### Running your application on a3-ultra instances
+
+For a complete example, run:
+
+```text
+sbatch run-nccl-tests.sh
+```
+
+The output will appear in a `slurm-<job id>.log` file. If the name of your
+a3-ultragpu partition is different from "a3ultra", you will need to modify the
+`build-nccl-tests.sh` and `run-nccl-tests.sh` scripts' `#SBATCH --partition`
+setting. Alternatively, you can run `sbatch -p