How to use io-watchdog with Torque? #2

Open
beginZero opened this issue Jun 17, 2016 · 9 comments

Comments

@beginZero

Is there a way to use io-watchdog with Torque rather than Slurm?

Also, if only one process goes a long time without writing, will io-watchdog report that as a hang?

Thanks in advance :)

@grondo
Owner

grondo commented Jun 18, 2016

It has been a while since I looked at this project, but io-watchdog runs as a standalone server plus an LD_PRELOAD library set in the environment of the process you want to "watch", so it is not limited to Slurm at all. The Slurm support is just a convenience plugin that lets io-watchdog be enabled with a simple --io-watchdog option to Slurm. (Oh, and the --rank option only supports Slurm.)

One way to use it under Torque might be to run the process you want to monitor under the io-watchdog command, which re-execs its argument after setting up the io-watchdog server and the LD_PRELOAD environment variable. However, you might need a wrapper script if you only want to target a single rank.
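
As a rough illustration of that approach, a Torque job script might look something like the sketch below. The #PBS resource values, the run_watched helper, and the IO_WATCHDOG override variable are assumptions made for illustration; only the idea of launching the program under the io-watchdog command comes from this thread.

```shell
#!/bin/sh
#PBS -N io-watchdog-demo
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00

# Hypothetical override so the script can be dry-run on a machine
# where io-watchdog is not installed (e.g. IO_WATCHDOG=echo).
IO_WATCHDOG=${IO_WATCHDOG:-io-watchdog}

# Run the given command under io-watchdog from the submission directory.
run_watched() {
    # Torque sets PBS_O_WORKDIR to the directory qsub was run from;
    # fall back to the current directory outside of a job.
    cd "${PBS_O_WORKDIR:-.}" || return 1
    "$IO_WATCHDOG" "$@"
}
```

The script would end with a line such as run_watched ./my_program (a placeholder program name) and be submitted with qsub.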

io-watchdog monitors a single process at a time, IIRC, so to monitor multiple processes you would run io-watchdog once per process.
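
Running one io-watchdog per process could be sketched as follows. This is an untested illustration: watch_all and the IO_WATCHDOG override are hypothetical names, and each argument is assumed to be a program that takes no arguments of its own.

```shell
#!/bin/sh
# Start one io-watchdog instance per program, in the background,
# then wait for all of them to finish.
watch_all() {
    for prog in "$@"; do
        # IO_WATCHDOG is a hypothetical override, useful for dry runs.
        "${IO_WATCHDOG:-io-watchdog}" "$prog" &
    done
    wait
}
```

For example, watch_all ./producer ./consumer would start two independent watchdogs, one for each placeholder program.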

@beginZero
Author

Good to know. Is there a way to attach io-watchdog to an already running process by process ID, then?

@beginZero
Author

Does io-watchdog depend on LD_PRELOAD, or can it work without it?

@grondo
Owner

grondo commented Jun 18, 2016

> Good to know it. Is there a way to attach io-watchdog to an already running process by process id then?

Not in the current design. io-watchdog depends on LD_PRELOAD so that its "interposer" library can intercept the library calls that cause writes, which means LD_PRELOAD has to be set before the process is started.
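
To illustrate why the variable has to be in place at launch time, here is a minimal sketch of starting a command with a library preloaded. It is not io-watchdog's own code: run_with_preload is a hypothetical helper, and the library path in the usage note is a placeholder.

```shell
#!/bin/sh
# Run a command with an extra shared library preloaded.
# The dynamic linker reads LD_PRELOAD once, when the new program is
# exec'd, so changing it later has no effect on a running process.
run_with_preload() {
    lib=$1
    shift
    # Prepend to any existing LD_PRELOAD value instead of overwriting it.
    env LD_PRELOAD="$lib${LD_PRELOAD:+:$LD_PRELOAD}" "$@"
}
```

A call would look like run_with_preload /path/to/interposer.so ./my_program. The io-watchdog command sets this up for you, so the helper is only meant to show the mechanism.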

@beginZero
Author

That point is clear now. Another question: normally I run an MPI program with the command mpirun -n NUM_PROCS PROGRAM. If I want to use io-watchdog to monitor a single process, it seems I would have to launch that process under io-watchdog while the others are started by mpirun as usual.

Is that possible?

@grondo
Owner

grondo commented Jun 20, 2016

Currently, if you only want to monitor one task in a parallel job, you might have to write a wrapper script, something like this (untested):

#!/bin/sh
if test "$MPIRUN_RANK" = "0"; then
    exec io-watchdog $IO_WATCHDOG_OPTIONS "$@"
fi
exec "$@"

If you save this as io-watchdog-wrapper.sh, you would run it as mpirun -n NUM_PROCS io-watchdog-wrapper.sh PROGRAM
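
Which environment variable carries the rank depends on the MPI launcher, so the test on $MPIRUN_RANK may need adjusting. The sketch below is a hedged variation on the wrapper above that tries several rank variables in turn: MPIRUN_RANK as in the original, then OMPI_COMM_WORLD_RANK (Open MPI), PMI_RANK (MPICH-style launchers), and MV2_COMM_WORLD_RANK (MVAPICH2). That list, the function names, and the IO_WATCHDOG override are assumptions and the code is untested against a real Torque system; check what your mpirun actually exports, for example with mpirun -n 2 env.

```shell
#!/bin/sh
# Print the MPI rank of this process, trying several launcher-specific
# environment variables in order. Returns non-zero if none is set.
detect_rank() {
    for var in MPIRUN_RANK OMPI_COMM_WORLD_RANK PMI_RANK MV2_COMM_WORLD_RANK; do
        # Indirect lookup of the variable named in $var (POSIX sh).
        eval "val=\${$var:-}"
        if [ -n "$val" ]; then
            printf '%s\n' "$val"
            return 0
        fi
    done
    return 1
}

# Exec the command under io-watchdog on the target rank only;
# every other rank (or an undetected rank) execs the command directly.
run_watched_rank() {
    target=$1
    shift
    if [ "$(detect_rank)" = "$target" ]; then
        # IO_WATCHDOG is a hypothetical override for dry runs.
        exec "${IO_WATCHDOG:-io-watchdog}" ${IO_WATCHDOG_OPTIONS:-} "$@"
    fi
    exec "$@"
}
```

A wrapper script built on this would end with run_watched_rank 0 "$@" and be invoked the same way as io-watchdog-wrapper.sh.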

Or we could probably extend monitor_this_rank() in io-watchdog.c fairly easily to use a Torque environment variable, or $MPIRUN_RANK as above. Then you would not need the script and could use io-watchdog directly, like:

mpirun -n NUM_PROCS io-watchdog --rank=0 PROGRAM

I could create a branch with an experimental patch if you are willing to test it. I don't have a Torque system on which to test.

@grondo
Owner

grondo commented Jun 20, 2016

If you try io-watchdog, you may want to build from my rtld_next branch. I found that on recent systems the io-watchdog interposer library can't find libc.so through its old method of globbing. The rtld_next branch falls back to RTLD_NEXT if the glob fails, so it should be more robust.

@grondo
Owner

grondo commented Jun 20, 2016

FYI, I also added an experimental patch on this branch that supports MPIRUN_RANK directly with the io-watchdog --rank option:

https://github.com/grondo/io-watchdog/commits/rtld_next

@beginZero
Author

Thanks a lot for all the suggestions; I really appreciate them. I have been busy with other things recently, but I will let you know if I run into other problems.

Thanks again.
