paraSlurm.perl hostname #6

Open
AlisaGU opened this issue Dec 18, 2020 · 22 comments

@AlisaGU

AlisaGU commented Dec 18, 2020

Hi,
I tried to submit the doBlastzChainNet.pl jobs through paraSlurm.perl.
I replaced para with paraSlurm.perl and added jobListTest to the command.

When I ran doBlastzChainNet.pl, an error related to paraSlurm.perl occurred:

You have to execute path/Parasol_LSF_Slurm/paraSlurm.perl on gsubmit0! Not on .

But, the node is actually gsubmit0.

I tried replacing my $hostname = $ENV{'HOSTNAME'}; with my $hostname = `hostname`;.
Another error occurred:

You have to execute /path/Parasol_LSF_Slurm/paraSlurm.perl on gsubmit0! Not on gsubmit0

Finally, I tried replacing my $hostname = $ENV{'HOSTNAME'}; with my $hostname = "gsubmit0";.
Another error occurred:

sh: sbatch: command not found

This is odd, because gsubmit0 is the node from which Slurm jobs are submitted.
I don't know what is causing these problems.

Could you give some tips?
Best regards,
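
Note: one plausible explanation for the empty host name is that bash sets HOSTNAME as a shell variable without necessarily exporting it, so a child process such as Perl may see nothing in $ENV{'HOSTNAME'}. Separately, Perl backticks keep the trailing newline of the command output, so the result of `hostname` would need a chomp before it is compared with "gsubmit0". The sketch below illustrates both effects in plain shell; DEMO_HOST is a hypothetical stand-in for HOSTNAME, and nothing here is taken from paraSlurm.perl itself.

```shell
# DEMO_HOST is a hypothetical stand-in for HOSTNAME.
unset DEMO_HOST
DEMO_HOST="gsubmit0"    # set in this shell, but not exported

# A child process only sees exported variables.
before=$(sh -c 'printf "%s" "${DEMO_HOST:-}"')
export DEMO_HOST
after=$(sh -c 'printf "%s" "${DEMO_HOST:-}"')
echo "child sees before export: '$before'"
echo "child sees after export:  '$after'"

# A trailing newline makes an otherwise identical string compare unequal.
nl='
'
raw="gsubmit0$nl"       # what a command prints, newline included
if [ "$raw" = "gsubmit0" ]; then same=yes; else same=no; fi
trimmed=${raw%"$nl"}    # remove one trailing newline
echo "equal without trimming: $same"
echo "after trimming: '$trimmed'"
```

If this is what happens on gsubmit0, either exporting HOSTNAME before the script starts or trimming the output of `hostname` inside the script should let the check pass.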

@MichaelHiller

Hmm, can you check whether sbatch is available on gsubmit0?
And whether this is in the default $PATH variable?

This 'if' statement is simply a sanity check to make sure people only call the script from the cluster head node, which is where jobs should be submitted.

@AlisaGU
Author

AlisaGU commented Dec 18, 2020

Sure.

-bash-4.2$ sbatch -V   
slurm 19.05.5

The "path" in the error message just stands for a directory; it is not $PATH.
I have added the directory containing paraSlurm.perl to $PATH. Just in case, I also replaced $para with the absolute path to paraSlurm.perl.

@AlisaGU
Author

AlisaGU commented Dec 18, 2020

Actually, it seems that my $hostname = $ENV{'HOSTNAME'}; can't get the host name. But my alternatives didn't work either.

@AlisaGU
Author

AlisaGU commented Dec 18, 2020

Another clue: running paraSlurm.perl push jobListTest jobList -q short -p "--partition=low" directly worked.

@MichaelHiller

What does 'which sbatch' give you?

Also, just comment out this hostname check in the script. It may not be necessary for you.

@AlisaGU
Author

AlisaGU commented Dec 18, 2020

-bash-4.2$ which sbatch
/opt/gridview/slurm/bin/sbatch

# first thing: check if the script is executed on $clusterHeadNode
# my $hostname = $ENV{'HOSTNAME'};
# my $hostname = `hostname`;
# my $hostname = "gsubmit0";
# die "######### ERROR #########: You have to execute $0 on $clusterHeadNode! Not on $hostname.\n" if ($hostname ne $clusterHeadNode);

Is this right?

@MichaelHiller

Yes.
If the jobListTest works, then I don't understand why another jobList does not work.

@AlisaGU
Author

AlisaGU commented Dec 18, 2020

Hmm, I still get the error sbatch: command not found:

Waiting to get ./lockFile.jobListTest   ..... [Takes too long? Did a previous para run died? If so, open a new terminal and   rm -f ./lockFile.jobListTest ] .....  Can't exec "lockfile": No such file or directory at /picb/evolgen/users/gushanshan/software/para_slurm/Parasol_LSF_Slurm/paraSlurm.perl line 313.
got it
sh: sbatch: command not found
######### ERROR ######### in pushSingleJob: sbatch <<< "$( printf '#!/bin/bash
#SBATCH -J jobListTest
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=10000
#SBATCH --time=01:00:00
#SBATCH -o ./.para/jobListTest/1/o.0
#SBATCH -e ./.para/jobListTest/1/o.0
/picb/evolgen/users/gushanshan/software/ucsc_pairwiseGenomeAlignmrent/data/scripts/blastz-run-ucsc -outFormat psl tParts/part000.lst qParts/part000.lst ../DEF {check out exists ../psl/part000.lst/part000.lst_part000.lst.psl }' )" failed with exit code 32512
Command failed:
ssh -x -o 'StrictHostKeyChecking = no' -o 'BatchMode = yes' localhost nice /picb/evolgen/users/gushanshan/probiotics/multiple_whole_genome_alignment/pairwise_combined/ucscTest_norepeat/genomes_newScore/ps128/trackData/299v/run.blastz/doClusterRun.csh

Do you need the complete log?

@MichaelHiller

I can only guess, but maybe sbatch is not in the environment when the perl script starts a new session with the system command.
Can you add a
my $test = which sbatch;
print "$test\n";
exit(0);

at the beginning and see if that returns /opt/gridview/slurm/bin/sbatch ?

@AlisaGU
Author

AlisaGU commented Dec 18, 2020

# The MIT License (MIT)
# Copyright (c) Michael Hiller, 2016

# Version 1.0

# implements parasol functionality for Slurm
# it uses files in a local directory .para/ to keep track of the jobs in a jobList and their status
my $test = which sbatch;
print "$test\n";
exit(0);

use strict;
use warnings;

Like this? Where should I put it: in paraSlurm.perl or in doBlastzChainNet.pl?

error:

Can't locate object method "which" via package "sbatch" (perhaps you forgot to load "sbatch"?) at /picb/evolgen/users/gushanshan/software/para_slurm/Parasol_LSF_Slurm/paraSlurm.perl line 10.

@MichaelHiller

MichaelHiller commented Dec 18, 2020

Please use backticks around `which sbatch`.
I had to quote them here, otherwise GitHub did not show the backticks.

@AlisaGU
Author

AlisaGU commented Dec 18, 2020

Thanks. I added it at the beginning of paraSlurm.perl.

error:

/picb/evolgen/users/gushanshan/software/para_slurm/Parasol_LSF_Slurm/paraSlurm.perl make jobListTest jobList
which: no sbatch in $PATH (I replaced the actual list of directories with $PATH here)

@MichaelHiller

Then add the directory that contains sbatch to $PATH, and put that setting in your .bashrc.
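
Note: commands started from Perl with system() or backticks run in a child shell that inherits the PATH of the Perl process, which can differ from the PATH of an interactive login shell. The sketch below is one way to test whether a command is found under a given PATH; found_with_path is a hypothetical helper, and the Slurm directory is the one reported by `which sbatch` earlier in this thread.

```shell
# Hypothetical helper: succeed if command $2 is found when PATH is exactly $1.
found_with_path() {
  PATH="$1" /bin/sh -c 'command -v "$1" >/dev/null 2>&1' sh "$2"
}

# Directory reported by `which sbatch` in the interactive shell above.
slurm_bin="/opt/gridview/slurm/bin"

if found_with_path "$PATH" sbatch; then
  echo "sbatch is visible with the current PATH"
elif found_with_path "$slurm_bin:$PATH" sbatch; then
  echo "sbatch becomes visible after adding $slurm_bin to PATH"
  echo "e.g. in ~/.bashrc: export PATH=\"$slurm_bin:\$PATH\""
else
  echo "sbatch was not found in $slurm_bin either"
fi
```

Running the same check inside the session that actually launches paraSlurm.perl (for example, the ssh localhost call shown in the log) would show which PATH the script really sees.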

@AlisaGU
Author

AlisaGU commented Dec 18, 2020

That works. After that, I added $jobPrefix .= "#SBATCH -p low\n"; in pushSingleJob.

But, the jobs crashed.

Waiting to get ./lockFile.jobListTest   ..... [Takes too long? Did a previous para run died? If so, open a new terminal and   rm -f ./lockFile.jobListTest ] .....  Can't exec "lockfile": No such file or directory at /picb/evolgen/users/gushanshan/software/para_slurm/Parasol_LSF_Slurm/paraSlurm.perl line 317.
got it
DONE.
4 jobs pushed using parameters: -q short 

Waiting to get ./lockFile.jobListTest   ..... [Takes too long? Did a previous para run died? If so, open a new terminal and   rm -f ./lockFile.jobListTest ] .....  Can't exec "lockfile": No such file or directory at /picb/evolgen/users/gushanshan/software/para_slurm/Parasol_LSF_Slurm/paraSlurm.perl line 317.
got it
WAIT UNTIL jobList is finished ..... 
Waiting to get ./lockFile.jobListTest   ..... [Takes too long? Did a previous para run died? If so, open a new terminal and   rm -f ./lockFile.jobListTest ] .....  Can't exec "lockfile": No such file or directory at /picb/evolgen/users/gushanshan/software/para_slurm/Parasol_LSF_Slurm/paraSlurm.perl line 317.
got it
numJobs: 4              RUN: 0          PEND: 2         DONE: 0         FAILED: 2        (0 of them failed 3 times)     allDone: NO
Waiting to get ./lockFile.jobListTest   ..... [Takes too long? Did a previous para run died? If so, open a new terminal and   rm -f ./lockFile.jobListTest ] .....  Can't exec "lockfile": No such file or directory at /picb/evolgen/users/gushanshan/software/para_slurm/Parasol_LSF_Slurm/paraSlurm.perl line 317.
got it
--> 2 jobs crashed and were pushed again
sleep 60 seconds ...  (waiting 0 sec by now)
Waiting to get ./lockFile.jobListTest   ..... [Takes too long? Did a previous para run died? If so, open a new terminal and   rm -f ./lockFile.jobListTest ] .....  Can't exec "lockfile": No such file or directory at /picb/evolgen/users/gushanshan/software/para_slurm/Parasol_LSF_Slurm/paraSlurm.perl line 317.
got it
numJobs: 4              RUN: 0          PEND: 0         DONE: 0         FAILED: 4        (0 of them failed 3 times)     allDone: NO
Waiting to get ./lockFile.jobListTest   ..... [Takes too long? Did a previous para run died? If so, open a new terminal and   rm -f ./lockFile.jobListTest ] .....  Can't exec "lockfile": No such file or directory at /picb/evolgen/users/gushanshan/software/para_slurm/Parasol_LSF_Slurm/paraSlurm.perl line 317.
got it
--> 4 jobs crashed and were pushed again
sleep 60 seconds ...  (waiting 60 sec by now)

@MichaelHiller

What is in jobListTest? Can you run this outside of Slurm, directly on the command line?

Otherwise the script seems to work. It detects crashed jobs and pushes them again.
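
Note: the log above also repeats Can't exec "lockfile", which suggests the lockfile program is not installed or not on PATH on this node (lockfile is usually shipped with the procmail package). A minimal preflight check, assuming sbatch and lockfile are the two external tools of interest here; missing_tools is a hypothetical helper name.

```shell
# Hypothetical helper: print the names of the given tools that are not on PATH.
missing_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  # Strip the leading space; prints an empty line if nothing is missing.
  printf '%s\n' "${missing# }"
}

# Tools that show up in the error messages in this thread.
echo "missing: $(missing_tools sbatch lockfile)"
```

An empty list after "missing:" means both tools were found in the shell where the check ran.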

@AlisaGU
Author

AlisaGU commented Dec 18, 2020

Hmm, jobListTest doesn't exist. I thought it was just part of the paraSlurm.perl syntax.
Now it looks like I was wrong.

I have read the README.html. However, I still don't know what to do about jobListTest.

@AlisaGU
Author

AlisaGU commented Dec 18, 2020

para.pl make jobListName jobListFile
So, jobListTest is just a name for this jobList.
It seems that I don't need to do anything about it, right?

@MichaelHiller

Sorry, I meant what is in jobList? The test script DieRandomTime.perl is meant to crash randomly to test the resubmitting function.
It is correct that 'jobListTest' is just a name for the list, which allows para to keep track of multiple job lists in the same directory.

Just test a jobList file that contains
ls
ls
ls

--> This should work without any crashes.
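
Note: a minimal sketch of setting up that test, assuming a scratch directory is acceptable; workdir is a hypothetical name, and the paraSlurm.perl calls are left as comments because they only make sense on the submit node.

```shell
# Create a scratch directory and a jobList file with three trivial jobs.
workdir=$(mktemp -d)
printf 'ls\nls\nls\n' > "$workdir/jobList"

echo "created $workdir/jobList:"
cat "$workdir/jobList"

# Then, from inside that directory on the submit node, using the
# commands that appear earlier in this thread:
#   paraSlurm.perl make jobListTest jobList
#   paraSlurm.perl push jobListTest jobList -q short -p "--partition=low"
```

If these three jobs finish as DONE, the submission path itself works and the crashes are more likely caused by the job commands.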

@AlisaGU
Author

AlisaGU commented Dec 20, 2020

Sorry to bother you on the weekend.
The contents of jobList are:

blastz-run-ucsc -outFormat psl tParts/part000.lst qParts/part000.lst ../DEF {check out exists ../psl/part000.lst/part000.lst_part000.lst.psl }
blastz-run-ucsc -outFormat psl tParts/part000.lst qParts/part001.lst ../DEF {check out exists ../psl/part000.lst/part000.lst_part001.lst.psl }
blastz-run-ucsc -outFormat psl tParts/part000.lst qParts/part002.lst ../DEF {check out exists ../psl/part000.lst/part000.lst_part002.lst.psl }
blastz-run-ucsc -outFormat psl tParts/part000.lst qParts/part003.lst ../DEF {check out exists ../psl/part000.lst/part000.lst_part003.lst.psl }

When I ran the first command, blastz-run-ucsc -outFormat psl tParts/part000.lst qParts/part000.lst ../DEF {check out exists ../psl/part000.lst/part000.lst_part000.lst.psl }, directly on the command line, no result was created.

I tried giving the output file as a plain argument, like this:

blastz-run-ucsc -outFormat psl tParts/part000.lst qParts/part000.lst ../DEF ../psl/part000.lst/part000.lst_part000.lst.psl

The output file ../psl/part000.lst/part000.lst_part000.lst.psl was created.

However, when I put {check out exists ../psl/part000.lst/part000.lst_part000.lst.psl } back into the command, no result was created again.
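
Note: as far as I understand Parasol's job list syntax, {check out exists file} is an annotation for the scheduler rather than shell syntax, and the job is meant to run with the clause reduced to the bare file name. Typed directly into bash, the braces and the words check, out, exists are passed to the program as extra arguments. The sbatch script in the log above also contains the clause verbatim. The sketch below shows one way to reduce a job line to a command that can be tried by hand; strip_check_clause is a hypothetical helper, not part of paraSlurm.perl.

```shell
# Hypothetical helper: replace "{check in|out <condition> <file> }" with "<file>".
strip_check_clause() {
  printf '%s\n' "$1" | sed -E 's/\{check (in|out) [^ ]+ ([^ }]+) *\}/\2/g'
}

# First line of the jobList shown above.
job='blastz-run-ucsc -outFormat psl tParts/part000.lst qParts/part000.lst ../DEF {check out exists ../psl/part000.lst/part000.lst_part000.lst.psl }'

strip_check_clause "$job"
```

The printed line matches the variant that did produce an output file when run by hand.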

@AlisaGU
Author

AlisaGU commented Dec 20, 2020

But the point mentioned above does not seem to be the key issue: when I ran the same command, blastz-run-ucsc -outFormat psl tParts/part000.lst qParts/part000.lst ../DEF {check out exists ../psl/part000.lst/part000.lst_part000.lst.psl }, on another server, it failed there as well.

However, when I run the full doBlastzChainNet.pl pipeline with parasol on that other server, it works.

@AlisaGU
Author

AlisaGU commented Dec 20, 2020

Can I delete the getLock() function from paraSlurm.perl?
The answer seems to be no.

@AlisaGU
Author

AlisaGU commented Dec 20, 2020

Hmm, are all steps of doBlastzChainNet.pl submitted to Slurm, or only the blastz jobs?
