rename /gcs-rwx to /gcs
koallison committed Jan 11, 2025
1 parent 52706cd commit 0d394b5
Showing 2 changed files with 33 additions and 12 deletions.
2 changes: 1 addition & 1 deletion examples/machine-learning/a3-megagpu-8g/README.md
@@ -7,4 +7,4 @@ these [instructions].

 # GCSFuse with Local SSD cache

-`slurm-a3mega-gcsfuse-lssd-cluster.yaml` reflects best practices for using GCSFuse for ML workloads. It is configured to mount all available GCS buckets on two mountpoints on a3-mega nodes. The `/gcs-rwx` mountpoint enables parallel downloads, intended for reading/writing checkpoints, logs, application outputs, model serving, or loading large files (e.g. squashfs files). The read-only `/gcs-ro` mountpoint disables parallel downloads and enables the list cache, intended for reading training data. Parallel downloads are not recommended for training workloads; see [GCSFuse documentation](https://cloud.google.com/storage/docs/cloud-storage-fuse/file-caching#parallel-downloads) for details.
+`slurm-a3mega-gcsfuse-lssd-cluster.yaml` reflects best practices for using GCSFuse for ML workloads. It is configured to mount all available GCS buckets on two mountpoints on a3-mega nodes. The `/gcs` mountpoint enables parallel downloads, intended for reading/writing checkpoints, logs, application outputs, model serving, or loading large files (e.g. squashfs files). The read-only `/gcs-ro` mountpoint disables parallel downloads and enables the list cache, intended for reading training data. Parallel downloads are not recommended for training workloads; see [GCSFuse documentation](https://cloud.google.com/storage/docs/cloud-storage-fuse/file-caching#parallel-downloads) for details.
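As a rough illustration of how a job might choose between the two mountpoints described in the README text, here is a minimal Python sketch. It assumes each bucket appears as a top-level directory under the mountpoint (the usual layout when gcsfuse mounts all buckets rather than a single named one). The helper `mounted_path`, the bucket name `my-bucket`, and the object names are hypothetical and not part of the blueprint.

```python
from pathlib import PurePosixPath

# Mountpoints named in the README text after this commit.
RW_MOUNT = PurePosixPath("/gcs")     # parallel downloads: checkpoints, logs, large files
RO_MOUNT = PurePosixPath("/gcs-ro")  # read-only, list cache: training data


def mounted_path(bucket: str, obj: str, *, training_data: bool = False) -> PurePosixPath:
    """Map a bucket and a relative object name to a local path.

    Assumption: buckets show up as directories directly under the mountpoint.
    """
    root = RO_MOUNT if training_data else RW_MOUNT
    return root / bucket / obj


# Hypothetical bucket and object names, for illustration only.
ckpt = mounted_path("my-bucket", "checkpoints/step-1000.pt")
shard = mounted_path("my-bucket", "dataset/shard-0001.tfrecord", training_data=True)
print(ckpt)
print(shard)
```

Routing training reads through `/gcs-ro` and everything else through `/gcs` mirrors the README's guidance that parallel downloads are not recommended for training workloads.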
Original file line number Diff line number Diff line change
@@ -90,7 +90,7 @@ vars:
 tasks:
 - name: Create gcsfuse rwx configuration
 ansible.builtin.copy:
-dest: /etc/gcsfuse-rwx.yml
+dest: /etc/gcsfuse.yml
 owner: root
 group: root
 mode: 0o644
@@ -130,37 +130,58 @@ vars:
 kernel-list-cache-ttl-secs: 60
 foreground: true
-- name: Create gcsfuse@ systemd service
+- name: Create gcsfuse systemd service
+ansible.builtin.copy:
+dest: /etc/systemd/system/gcsfuse.service
+owner: root
+group: root
+mode: 0o644
+content: |
+[Unit]
+Description=gcsfuse mount of all buckets
+After=local-fs.target
+[Service]
+Type=simple
+User=root
+ExecStartPre=/bin/mkdir -p /gcs
+ExecStart=gcsfuse --config-file /etc/gcsfuse.yml /gcs
+ExecStop=fusermount3 -u /gcs
+[Install]
+WantedBy=slurmd.service multi-user.target
+- name: Create gcsfuse-ro systemd service
 ansible.builtin.copy:
-dest: /etc/systemd/system/gcsfuse@.service
+dest: /etc/systemd/system/gcsfuse-ro.service
 owner: root
 group: root
 mode: 0o644
 content: |
 [Unit]
-Description=gcsfuse %i mount of all buckets
+Description=gcsfuse ro mount of all buckets
 After=local-fs.target
 [Service]
 Type=simple
 User=root
-ExecStartPre=/bin/mkdir -p /gcs-%i
-ExecStart=gcsfuse --config-file /etc/gcsfuse-%i.yml /gcs-%i
-ExecStop=fusermount3 -u /gcs-%i
+ExecStartPre=/bin/mkdir -p /gcs-ro
+ExecStart=gcsfuse --config-file /etc/gcsfuse-ro.yml /gcs-ro
+ExecStop=fusermount3 -u /gcs-ro
 [Install]
 WantedBy=slurmd.service multi-user.target
 post_tasks:
-- name: Enable and restart gcsfuse@rwx
+- name: Enable and restart gcsfuse
 ansible.builtin.service:
-name: gcsfuse@rwx.service
+name: gcsfuse.service
 state: restarted
 enabled: true
-- name: Enable and restart gcsfuse@ro
+- name: Enable and restart gcsfuse-ro
 ansible.builtin.service:
-name: gcsfuse@ro.service
+name: gcsfuse-ro.service
 state: restarted
 enabled: true
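The hunk above replaces the single template unit `gcsfuse@.service` with two explicit units. In a template unit, systemd expands `%i` to the instance name, so `gcsfuse@rwx.service` always resolves to `/gcs-rwx`. A plausible reason for the split (the commit does not state one) is that the `/gcs-%i` pattern cannot yield the bare `/gcs` path. The Python sketch below renders both unit bodies from one function to make the naming scheme explicit. `render_unit` is a hypothetical helper for illustration and is not part of the blueprint. The blank lines between sections are added for readability and are not taken from the diff as shown.

```python
def render_unit(suffix: str = "") -> str:
    """Render a gcsfuse systemd unit body.

    suffix "" gives the /gcs unit; suffix "ro" gives the /gcs-ro unit.
    The directives mirror the unit text shown in the diff above.
    """
    tag = f"-{suffix}" if suffix else ""
    mount = f"/gcs{tag}"
    config = f"/etc/gcsfuse{tag}.yml"
    # "gcsfuse mount of all buckets" or "gcsfuse ro mount of all buckets"
    description = " ".join(p for p in ("gcsfuse", suffix, "mount of all buckets") if p)
    return "\n".join([
        "[Unit]",
        f"Description={description}",
        "After=local-fs.target",
        "",
        "[Service]",
        "Type=simple",
        "User=root",
        f"ExecStartPre=/bin/mkdir -p {mount}",
        f"ExecStart=gcsfuse --config-file {config} {mount}",
        f"ExecStop=fusermount3 -u {mount}",
        "",
        "[Install]",
        "WantedBy=slurmd.service multi-user.target",
    ])


print(render_unit())      # body of gcsfuse.service
print(render_unit("ro"))  # body of gcsfuse-ro.service
```

Calling `render_unit("rwx")` reproduces the pre-commit `/gcs-rwx` naming, which shows that only the empty-suffix case needed a unit of its own.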