Add multi-mount parallelstore support #3256

Merged
tpdownes merged 2 commits into GoogleCloudPlatform:develop from the ps-fix-2 branch
Dec 5, 2024

@harshthakkar01 harshthakkar01 commented Nov 13, 2024

This PR:

  • Excludes GPU interfaces in the DAOS config file
  • Adds support for multiple mounts of a single Parallelstore instance (creates a systemd service for each mount)
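
The second item implies one systemd unit per mount point, so each unit needs a distinct name derived from its local mount path. Below is a minimal sketch of one way to derive such names, assuming systemd-style path escaping. The `parallelstore-mount-` prefix and both function names are hypothetical and are not taken from this PR.

```python
def systemd_escape_path(path: str) -> str:
    """Approximate `systemd-escape --path` for simple absolute paths."""
    # Drop empty components so leading, trailing and doubled slashes vanish.
    parts = [p for p in path.split("/") if p]
    if not parts:
        return "-"  # systemd escapes the root directory as "-"
    trimmed = "/".join(parts)
    out = []
    for i, ch in enumerate(trimmed):
        if ch == "/":
            out.append("-")  # path separators become dashes
        elif ch.isascii() and (ch.isalnum() or ch in ":_"):
            out.append(ch)  # characters that are safe as-is
        elif ch == "." and i > 0:
            out.append(ch)  # "." is safe except as the first character
        else:
            # Everything else (including a literal "-") is hex-escaped.
            out.extend(f"\\x{b:02x}" for b in ch.encode("utf-8"))
    return "".join(out)


def unit_name_for_mount(local_mount: str,
                        prefix: str = "parallelstore-mount-") -> str:
    """Hypothetical naming scheme: one service unit per local mount path."""
    return f"{prefix}{systemd_escape_path(local_mount)}.service"


mounts = ["/parallelstore/rwx", "/parallelstore/rwo"]
unit_names = [unit_name_for_mount(m) for m in mounts]

# Distinct mount paths must map to distinct unit names.
assert len(set(unit_names)) == len(mounts)

for mount, unit in zip(mounts, unit_names):
    print(mount, "->", unit)
```

Deriving the unit name from the escaped path keeps the mapping reversible and avoids collisions between mounts such as `/a/b` and `/a-b`, since a literal dash is escaped rather than reused as a separator.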

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@harshthakkar01 harshthakkar01 added the release-improvements Added to release notes under the "Improvements" heading. label Nov 13, 2024
@harshthakkar01 harshthakkar01 force-pushed the ps-fix-2 branch 4 times, most recently from 727d580 to f607f80 Compare November 15, 2024 06:50
@harshthakkar01 harshthakkar01 changed the title Update mount parallelstore script to support multiple parallelstore Add multi-mount parallelstore support Nov 15, 2024

@tpdownes tpdownes left a comment

I would like to know a bit more about what attempting to support multiple Parallelstore instances would involve. In the meantime, the changes I suggest will improve reliability.

@tpdownes tpdownes assigned harshthakkar01 and unassigned tpdownes Nov 18, 2024
@mr0re1 mr0re1 assigned mr0re1 and unassigned harshthakkar01 Nov 19, 2024
@tpdownes tpdownes assigned tpdownes and unassigned mr0re1 Nov 20, 2024
tpdownes added a commit to harshthakkar01/hpc-toolkit that referenced this pull request Dec 3, 2024
TESTED:
- simple Debian and Ubuntu VMs with one NIC

TODO:
- rewrite find command to address 2 gVNIC?
- fix quoting of ignored interfaces
TESTED:
- simple Debian and Ubuntu VMs with one NIC
- a3-megagpu-8g Ubuntu and HPC Rocky 8

tpdownes commented Dec 5, 2024

In addition to the standard tests I tested against this blueprint:

---
blueprint_name: test-ps

vars:
  deployment_name: test-ps
  project_id: hpc-toolkit-gsc
  region: us-central1
  zone: us-central1-c
  parallelstore_ips: "[10.80.175.133,10.80.175.132,10.80.175.130]"

deployment_groups:
- group: primary
  modules:

  - id: network
    source: modules/network/pre-existing-vpc
    settings:
      network_name: a3mega-sys-net
      subnetwork_name: a3mega-sys-subnet

  - id: gpunet
    source: modules/network/pre-existing-vpc
    settings:
      network_name: a3mega-cluster-dev-gpunet-0
      subnetwork_name: a3mega-cluster-dev-gpunet-0-subnet

  - id: parallelstore-rwx
    source: modules/file-system/pre-existing-network-storage
    settings:
      fs_type: daos
      remote_mount: $(vars.parallelstore_ips)
      local_mount: /parallelstore/rwx
      mount_options: disable-caching,thread-count=26,eq-count=13,multi-user

  - id: parallelstore-rwo
    source: modules/file-system/pre-existing-network-storage
    settings:
      fs_type: daos
      remote_mount: $(vars.parallelstore_ips)
      local_mount: /parallelstore/rwo
      mount_options: disable-wb-cache,thread-count=26,eq-count=13,multi-user

  - id: vm
    source: modules/compute/vm-instance
    use:
    - parallelstore-rwo
    - parallelstore-rwx
    settings:
      machine_type: n2-standard-8
      name_prefix: id
      disk_type: pd-ssd
      network_interfaces:
      - network: null
        subnetwork: $(network.subnetwork_self_link)
        subnetwork_project: null
        network_ip: null
        stack_type: null
        access_config: []
        ipv6_access_config: []
        alias_ip_range: []
        queue_count: null
        nic_type: GVNIC
      - network: null
        subnetwork: $(gpunet.subnetwork_self_link)
        subnetwork_project: null
        network_ip: null
        stack_type: null
        access_config: []
        ipv6_access_config: []
        alias_ip_range: []
        queue_count: null
        nic_type: GVNIC

and observed the expected outcome:

exclude_fabric_ifaces: ["lo","eth1"]
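
The line above is consistent with keeping the first NIC (presumably `eth0`) for DAOS traffic and excluding loopback plus the second NIC attached to the GPU network. Below is a minimal sketch of rendering such a line, assuming the rule is "exclude everything except one primary interface". The PR's actual detection logic is not shown in this conversation, and the function name and `primary` parameter are hypothetical.

```python
import json


def render_exclude_fabric_ifaces(interfaces, primary="eth0"):
    """Build an exclude_fabric_ifaces line that keeps only `primary`.

    Assumption: loopback is always excluded, followed by every other
    interface that is not the primary one, in the order given.
    """
    excluded = ["lo"] + [i for i in interfaces if i not in ("lo", primary)]
    # A JSON array of strings is also a valid YAML flow sequence, so the
    # JSON encoder takes care of quoting each interface name.
    return "exclude_fabric_ifaces: " + json.dumps(excluded, separators=(",", ":"))


# Two-NIC VM as in the blueprint above: eth1 sits on the GPU network.
print(render_exclude_fabric_ifaces(["lo", "eth0", "eth1"]))

# Single-NIC VM: only loopback is excluded.
print(render_exclude_fabric_ifaces(["lo", "eth0"]))
```

Delegating the quoting to a serializer rather than concatenating strings by hand is one way to avoid the interface-quoting problem noted in the earlier commit message's TODO list.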

@tpdownes tpdownes self-requested a review December 5, 2024 05:44
@tpdownes tpdownes merged commit c416381 into GoogleCloudPlatform:develop Dec 5, 2024
10 of 57 checks passed
tpdownes added a commit to tpdownes/hpc-toolkit that referenced this pull request Dec 5, 2024
tpdownes added a commit to tpdownes/hpc-toolkit that referenced this pull request Dec 6, 2024
tpdownes added a commit to tpdownes/hpc-toolkit that referenced this pull request Dec 6, 2024
tpdownes added a commit to tpdownes/hpc-toolkit that referenced this pull request Dec 6, 2024
cdunbar13 pushed a commit to cdunbar13/cluster-toolkit that referenced this pull request Dec 18, 2024
TESTED:
- simple Debian and Ubuntu VMs with one NIC
- a3-megagpu-8g Ubuntu and HPC Rocky 8
cdunbar13 pushed a commit to cdunbar13/cluster-toolkit that referenced this pull request Dec 18, 2024
@nick-stroud nick-stroud mentioned this pull request Dec 19, 2024
Labels
release-improvements Added to release notes under the "Improvements" heading.
4 participants