Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow customization of Parallelstore mounts #3144

Merged

Conversation

wiktorn
Copy link
Contributor

@wiktorn wiktorn commented Oct 17, 2024

Add two customization options:

  • daos_agent_config - to allow providing additional configuration in /etc/daos/daos_agent.yml
  • dfuse_environment - to allow providing environment variables for dfuse process

Topics for discussion:

  • whether using templates is proper approach here (it gives a lot of flexibility)
  • support for whitespaces in mount points (currently not supported, not sure if it was supported earlier)
  • handling of /etc/daos/daos_agent.yml - currently anyway everything is commented, and we just uncomment specific lines with sed. The other approach is either to create a new file (as implemnted), or just make sure that everything is commented, and just add configuration at the end. I think this aproach is cleaner in terms of what is expected result / file content.
  • usage of systemctl to start dfuse on first call. I did not implemented as such for now, as I didn't find a good way to fail systemctl start command if dfuse failed to start and this is current behavior
  • Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

TODO

  • align those changes with modules/file-system/pre-existing-network-storage

This PR also addresses bug in startup-script module,

startup-script: Mon Dec 16 08:44:43 -0500 2024 Info [4589]: === start executing runner: early_run_hotfixes.sh-b383 ===
startup-script: /slurm/custom_scripts/controller.d/ghpc_daos_mount.sh: line 373: .//tmp/tmp.g1yoDyA6T5/early_run_hotfixes.sh: No such file or directory
startup-script: Mon Dec 16 08:44:43 -0500 2024 Info [4589]: === early_run_hotfixes.sh-b383 finished with exit_code=127 ===
startup-script: Mon Dec 16 08:44:43 -0500 2024 Error [4589]: === execution of early_run_hotfixes.sh-b383 failed, exiting ===


If re-running the script, using for example: DEBUG=1 google_metadata_script_runner startup

@harshthakkar01 harshthakkar01 self-assigned this Oct 21, 2024
@harshthakkar01
Copy link
Contributor

Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

In this case, if you enable systemd unit, it will run for every restart (this is the current behavior, wdyt ?). For the case, you mentioned, I also observed you need to umount it first.

@harshthakkar01
Copy link
Contributor

usage of systemctl to start dfuse on first call. I did not implemented as such for now, as I didn't find a good way to fail systemctl start command if dfuse failed to start and this is current behavior

Current behavior we dont use systemd unit to mount the instance. We manually mount it using bash and enable the service to be run for every restart.

@harshthakkar01 harshthakkar01 added the release-improvements Added to release notes under the "Improvements" heading. label Nov 12, 2024
@harshthakkar01 harshthakkar01 marked this pull request as ready for review November 12, 2024 19:46
@tpdownes
Copy link
Member

Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

In this case, if you enable systemd unit, it will run for every restart (this is the current behavior, wdyt ?). For the case, you mentioned, I also observed you need to umount it first.

I strongly recommend Restart=on-failure for almost any SystemD unit unless always provides some benefit you can clearly articulate. I believe always isn't allowed for oneshots.

@wiktorn
Copy link
Contributor Author

wiktorn commented Nov 13, 2024

Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

In this case, if you enable systemd unit, it will run for every restart (this is the current behavior, wdyt ?). For the case, you mentioned, I also observed you need to umount it first.

I strongly recommend Restart=on-failure for almost any SystemD unit unless always provides some benefit you can clearly articulate. I believe always isn't allowed for oneshots.

My idea was to run dfuse in foreground and do not make this a oneshoot, so systemd can monitor the process and if it crashes (what may happen), and whatever the reason it will be for it to exit, it systemd will restart it.

The only thing to do is to instead call directly the dfuse binary, call a script, which will call unmount if needed.

@wiktorn
Copy link
Contributor Author

wiktorn commented Nov 13, 2024

usage of systemctl to start dfuse on first call. I did not implemented as such for now, as I didn't find a good way to fail systemctl start command if dfuse failed to start and this is current behavior

Current behavior we dont use systemd unit to mount the instance. We manually mount it using bash and enable the service to be run for every restart.

Using systemd always would simplify the code, as there would only one way to mount the filesystem, whether this is the first startup of the instance, or the next one. The only drawback is that I did not find a way, to fail systemctl enable --now (or similar command) when service did not start successfully.

I can add a check after the call, with some sleeps to check if either service started successfully or the mountpoint is mounted and fail the startup script if those checks fail. WDYT?

@mr0re1
Copy link
Collaborator

mr0re1 commented Nov 15, 2024

/gcbrun

@wiktorn wiktorn marked this pull request as draft November 15, 2024 21:35
@harshthakkar01
Copy link
Contributor

harshthakkar01 commented Nov 15, 2024

You may need to rebase this after #3256 (this one was high priorty). I have discussed with Ivan and he will follow up with on this as I am ooo for 3 weeks from today.

Btw, we agreed on moving forward with adding customization for mounts. Thanks for this PR.

@wiktorn wiktorn force-pushed the template_pstore_mount branch from 0d3506a to 87f6475 Compare November 16, 2024 17:44
@wiktorn wiktorn marked this pull request as ready for review November 16, 2024 17:44
@wiktorn
Copy link
Contributor Author

wiktorn commented Nov 16, 2024

You may need to rebase this after #3256 (this one was high priorty). I have discussed with Ivan and he will follow up with on this as I am ooo for 3 weeks from today.

Btw, we agreed on moving forward with adding customization for mounts. Thanks for this PR.

Thanks for letting me know. For now, I rebased this PR on top of #3256, once it is in, I can skip a few commits from here.

I aligned now pre-existing-network-storage with parallelstore in terms of scripts and APIs and fixed some smaller issues I faced (like installation failing on RHEL 8, but started to work well after dnf clean all 🤷 )

@harshthakkar01
Copy link
Contributor

Please resolve conflicts

@wiktorn wiktorn force-pushed the template_pstore_mount branch from 87f6475 to bcb9a91 Compare December 20, 2024 14:55
@wiktorn
Copy link
Contributor Author

wiktorn commented Dec 20, 2024

Conflicts resolved.

@wiktorn wiktorn force-pushed the template_pstore_mount branch from bcb9a91 to 850c370 Compare December 20, 2024 16:08
@harshthakkar01
Copy link
Contributor

/gcbrun

1 similar comment
@harshthakkar01
Copy link
Contributor

/gcbrun

@wiktorn wiktorn force-pushed the template_pstore_mount branch 2 times, most recently from 1c1a045 to f9a92b2 Compare December 23, 2024 17:27
@wiktorn wiktorn force-pushed the template_pstore_mount branch from f9a92b2 to d76cfe2 Compare December 23, 2024 20:17
@harshthakkar01
Copy link
Contributor

/gcbrun

@harshthakkar01 harshthakkar01 merged commit 2933f6f into GoogleCloudPlatform:develop Dec 24, 2024
11 of 64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-improvements Added to release notes under the "Improvements" heading.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants