wrap_rrdesi not running on coadds from a list? #321
for info:
@sbailey @araichoor I took a look last night and it looks like it was just incorrect syntax -- using 1 node also works fine with
@craigwarner-ufastro please confirm: is
@sbailey yes that is correct
Although, a quick look shows that several env vars are created when it's called by
I suggest checking one of these and, if not there, we can exit with an informative error.
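A minimal sketch of the suggested check, assuming the wrapper is Python; the choice of `SLURM_JOB_ID` and the wording of the error message are my illustration, not taken from the thread:

```python
import os
import sys

def require_slurm_allocation():
    """Exit with an informative error if not inside a SLURM allocation.

    SLURM_JOB_ID is set by sbatch/salloc for every process in the job,
    so its absence suggests we are on a login node.
    """
    if "SLURM_JOB_ID" not in os.environ:
        # The recommended-syntax hint below uses illustrative flags only.
        sys.exit(
            "ERROR: wrap_rrdesi must be run inside a SLURM allocation.\n"
            "Recommended syntax: request a node with salloc, then launch "
            "with srun."
        )
```
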
Thanks for clarifying. It looks like
@sbailey Oh - actually if I run
to get an interactive node, and then run
Here is a list of
Versus running
What about
To save you time, these are the additional env vars when running with
Yeah, the mess comes from the fact that someone might get a node with
in which case $SLURM_NPROCS is set for the job and gets inherited by srun if srun is called without -n, but then that gets in the way of telling whether the actual command is prefixed with an srun or not. I suspect that is true for most of those environment variables, since SLURM has a design hierarchy of multiple ways of controlling the options, with different levels overriding others (envvar -> script #SBATCH headers -> sbatch/salloc options -> srun options). Arguably I'm way overthinking this, and using a more obscure envvar like
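A sketch of the distinction being made here, assuming `SLURM_STEP_ID` as the "more obscure" per-step variable (that choice is my assumption; the thread's actual candidate is elided):

```python
import os

def launched_with_srun() -> bool:
    """Heuristic: SLURM_STEP_ID is set per job step by srun itself,
    whereas variables like SLURM_NPROCS are set at allocation time by
    salloc/sbatch and inherited by every process in the job, including
    plain shell commands that were never launched with srun.
    """
    return "SLURM_STEP_ID" in os.environ
```
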
perhaps:

- a conda-forge mpich (uses pmi)
- conda-forge openmpi (uses pmix)
- NERSC slurm (uses cray-pmi):

there may also be a PMI2_ variant to consider (and potentially others)?
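One way to cover all of these variants at once is to scan the environment for any PMI-style prefix. This is a sketch only; the prefix list mirrors the variants named above and may well be incomplete:

```python
import os

# Prefixes used by different process-management interfaces:
# conda-forge mpich -> PMI_*, conda-forge openmpi -> PMIX_*,
# NERSC slurm (cray-pmi) -> PMI_*, plus a possible PMI2_* variant.
PMI_PREFIXES = ("PMI_", "PMIX_", "PMI2_")

def under_mpi_launcher() -> bool:
    """Return True if any PMI-style variable is present, suggesting the
    process was started by an MPI launcher (e.g. mpirun or srun)."""
    return any(k.startswith(PMI_PREFIXES) for k in os.environ)
```
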
@sbailey I tested your
And yes, you're correct that this sets
It looks to me like
@dmargala, what do you think?
@sbailey @dmargala I think I have everything cleared up. The root bug was that if the number of GPUs available was greater than the number of ranks in the communicator, the code made the bad assumption that you wanted to use at least ngpu ranks. So when calling
I fixed this, added informative warning messages where appropriate, and cleaned up the login-node logic that had been copy/pasted from elsewhere. Here are a bunch of example test cases:
The following were all run after getting an interactive node with
The following were all run after getting 2 interactive nodes with
The following were run with an interactive node obtained with
Finally, if MPI is not available:
Look good? Still want me to rename to
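The capping fix described above might look roughly like this sketch; the function name and message wording are mine, not the actual wrap_rrdesi code:

```python
import warnings

def gpus_to_use(n_gpus_available: int, comm_size: int) -> int:
    """Never assume more MPI ranks exist than the communicator has.

    The original bug: with n_gpus_available > comm_size, the code assumed
    at least n_gpus_available ranks were running. The fix is to cap at
    comm_size and emit an informative warning instead.
    """
    if n_gpus_available > comm_size:
        warnings.warn(
            f"{n_gpus_available} GPUs available but only {comm_size} MPI "
            f"rank(s); using {comm_size} GPU(s). Launch at least one rank "
            "per GPU to use them all."
        )
        return comm_size
    return n_gpus_available
```
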
@craigwarner-ufastro those look good! Thanks for handling all the cases including the "recommended syntax" message, which is great for telling users what they should do instead of just that they are doing it wrong. Please go ahead with a pull request. OK to leave as "wrap_rrdesi" without the "_mpi" suffix.
I've not followed all the technical details, but thanks for digging into that, and glad to hear that this issue led to some improvements.
I report here an apparently unexpected behavior of
wrap_rrdesi
, which stops running after a few coadds, with no error, and does not deal with the rest of the coadds from the input file.

Tagging @craigwarner-ufastro (suggestion from @sbailey).
Note that I am a newbie at running redrock with GPUs, so it may very well be that there is no bug and that I am just mis-calling the code.
here is how I proceed:
the
$MYRRDIR/list_coadds.ascii
file has 18 coadds, but the code stops after dealing with the first 5 coadds, with no apparent error. The log file is
$MYRRDIR/redrock.log
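As a generic illustration of how a rank/file mismatch can truncate a run silently (this is my sketch, not wrap_rrdesi's actual distribution logic): if input files are divided across an assumed number of ranks that exceeds the real communicator size, the files assigned to nonexistent ranks are simply never processed, and no error is raised.

```python
def assign_files(files, rank, size):
    """Round-robin assignment of input files to MPI ranks.

    If `size` is the *assumed* rank count rather than the real one,
    files assigned to ranks that do not exist are silently skipped.
    """
    return files[rank::size]
```
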
.