Pipeline on GCP fails with "Error: pipeline dependencies not found" #224

Open
amtseng opened this issue Apr 6, 2021 · 9 comments

amtseng commented Apr 6, 2021

Describe the bug

I've submitted a good number (25) of ChIP-seq jobs to Caper, and the jobs begin running, but somehow halfway through, the Caper server dies suddenly. Examining the logs and grepping for "error", I find that all of the job logs (in cromwell-workflow-logs/) contain "Error: pipeline dependencies not found".
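
For reference, this is roughly how I searched the logs (the directory path and file naming are just from my setup):

$ grep -ril "error" cromwell-workflow-logs/
$ grep -i "pipeline dependencies not found" cromwell-workflow-logs/*.log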

I have consulted Issue #172, but I have verified that I activated the encode-chip-seq-pipeline environment both when launching the Caper server and when submitting the jobs. I am also experiencing these issues on GCP, not on macOS, so I felt it was prudent to create a new issue for this.

OS/Platform

  • OS/Platform: Google Cloud
  • Conda version: 4.7.12
  • Pipeline version: I'm not sure how to check this, sorry
  • Caper version: 1.4.2

Caper configuration file

backend=gcp
gcp-prj=gbsc-gcp-lab-kundaje
tmp-dir=/data/tmp_amtseng
singularity-cachedir=/data/singularity_cachedir_amtseng
file-db=/data/caper_db/caper_file_db_amtseng
db-timeout=120000
max-concurrent-tasks=1000
max-concurrent-workflows=50
use-google-cloud-life-sciences=True
gcp-region=us-central1

Input JSON file

Here, I'm showing one of the 25 jobs submitted.

{
  "chip.title": "A549_cJun_FLAG cells untreated",
  "chip.description": "A549_cJun_FLAG cells untreated",

  "chip.pipeline_type": "tf",

  "chip.aligner": "bowtie2",
  "chip.align_only": false,
  "chip.true_rep_only": false,

  "chip.genome_tsv": "https://storage.googleapis.com/encode-pipeline-genome-data/genome_tsv/v3/hg38.tsv",

  "chip.paired_end": false,
  "chip.ctl_paired_end": false,

  "chip.always_use_pooled_ctl": true,

  "chip.align_cpu": 4,
  "chip.call_peak_cpu": 4,

  "chip.fastqs_rep1_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090532.fastq.gz"
  ],
  "chip.fastqs_rep2_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090533.fastq.gz"
  ],
  "chip.fastqs_rep3_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090534.fastq.gz"
  ],

  "chip.ctl_fastqs_rep1_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090601.fastq.gz"
  ],
  "chip.ctl_fastqs_rep2_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090602.fastq.gz"
  ],
  "chip.ctl_fastqs_rep3_R1": [
    "gs://caper_in/amtseng/AP1/fastqs/SRR12090603.fastq.gz"
  ]
}

Troubleshooting result

Unfortunately, because the Caper server dies, I am unable to use caper troubleshoot {jobID} to diagnose.
Instead, I've attached the cromwell log for the job, along with cromwell.out:

workflow.3d1cb136-9b32-4514-9a33-3262d8303d6f.log
cromwell.out

Thanks!


leepc12 commented Apr 6, 2021

I looked at the two files but couldn't find any helpful information for debugging.
It looks like Cromwell got a SIGTERM and gracefully shut itself down.

2021-04-06 19:33:06,677  ERROR - Timed out trying to gracefully stop WorkflowStoreActor. Forcefully stopping it.

Can you upgrade Caper (which includes a Cromwell version upgrade, 52->59) and try again? Please follow the upgrade instructions in Caper's release notes.

$ pip3 install autouri caper --upgrade
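
To confirm the upgrade took effect, something like the following should show the installed versions (pip3 show is standard pip; the -v flag is assumed to print the version in recent Caper releases):

$ pip3 show caper autouri
$ caper -v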


amtseng commented Apr 6, 2021

I'll give that a try and report back. Thanks, Jin!


amtseng commented Apr 7, 2021

I've upgraded Caper/Cromwell (and verified the version update). Running the same 25 jobs, I still get the exact same errors, and the Caper server crashes.

I then tried running just one job. Intriguingly, it succeeded! That suggests to me that either a subset of the jobs is crashing and taking the entire Caper server (and the other jobs) down with it, or simply having too many jobs at a time is causing trouble...

Very strange! Any ideas? In the meantime, I'm going to try running a few more jobs on their own and see how that goes...
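
If it does turn out to be a load problem, one stopgap I might also try is lowering the concurrency limits in my Caper config (these keys are already in my config above; the reduced values are just a guess):

# in ~/.caper/default.conf (lower values are only a guess to lighten the load)
max-concurrent-workflows=10
max-concurrent-tasks=200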


leepc12 commented Apr 7, 2021

How did you run the server? Did you use Caper's shell script to make a server instance?
https://github.com/ENCODE-DCC/caper/tree/master/scripts/gcp_caper_server


amtseng commented Apr 7, 2021

I started the server using this command in a tmux session:

caper server --port 8000 --gcp-loc-dir=gs://caper_out/amtseng/.caper_tmp --gcp-out-dir gs://caper_out/amtseng/


leepc12 commented Apr 7, 2021

That command line looks good as long as your Google user account has enough permissions for GCE, GCS, the Google Life Sciences API, and so on.

Why don't you use a configuration file, ~/.caper/default.conf? You can create a good template for it by running the following:

# this will overwrite the existing conf file. please make a backup if you need one.
$ caper init gcp
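
For reference, the generated ~/.caper/default.conf for the gcp backend would then look roughly like this (the exact set of keys depends on your Caper version; the project ID and bucket paths below are placeholders):

backend=gcp
gcp-prj=YOUR_PROJECT_ID
gcp-out-dir=gs://YOUR_BUCKET/caper_out
gcp-loc-dir=gs://YOUR_BUCKET/caper_tmp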

BTW, I strongly recommend using the above shell script, because ENCODE DCC runs thousands of pipelines without any problem on instances created by that script.

I'm not sure if you have a service account with the correct permission settings. Please use the above script.


amtseng commented Apr 8, 2021

I generated the default configuration file using caper init gcp, specifying only the gcp-prj and gcp-out-dir fields. I then started a Caper server using just caper server in a tmux session.
Caper still crashed, although the logs now contain not only the "pipeline dependencies not found" error but also java.lang.OutOfMemoryError: GC overhead limit exceeded errors.

I've attached cromwell.out and an example workflow log, again.

cromwell.out.txt
workflow.225a8edd-5ee7-45c2-b77f-d5123797d313.log.txt


leepc12 commented Apr 8, 2021

It looks like a Java memory issue?

java.sql.SQLException: java.lang.OutOfMemoryError: GC overhead limit exceeded

That's why I recommend the shell script. That script will create an instance with enough memory, and all Caper settings are configured automatically.
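
For what it's worth, if recreating the instance isn't possible right away, Caper's configuration also exposes a Java heap setting for the server process that could be raised as a stopgap (the key name below is an assumption; check caper server --help for your version):

# in ~/.caper/default.conf (java-heap-server is assumed; verify against your Caper version)
java-heap-server=16G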


amtseng commented Apr 8, 2021

Ah, I'm sorry. I misunderstood which script you were referring to. I'll try creating an instance with create_instance.sh instead of using the pre-existing instance we have in the lab.
