Merge pull request #172 from ENCODE-DCC/dev
v2.2.0
leepc12 authored Jun 13, 2022
2 parents 98a551e + 729eff8 commit c10d05b
Showing 15 changed files with 566 additions and 243 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -111,3 +111,4 @@ src/test_caper_uri/

cromwell.out
dev/
tests/hpc/
2 changes: 1 addition & 1 deletion .isort.cfg
@@ -4,7 +4,7 @@ include_trailing_comma = True
force_grid_wrap = 0
use_parentheses = True
line_length = 88
known_third_party = WDL,autouri,humanfriendly,matplotlib,numpy,pandas,pyhocon,pytest,requests,setuptools,sklearn
known_third_party = WDL,autouri,distutils,humanfriendly,matplotlib,numpy,pandas,pyhocon,pytest,requests,setuptools,sklearn

[mypy-bin]
ignore_errors = True
26 changes: 20 additions & 6 deletions DETAILS.md
@@ -40,13 +40,14 @@ unhold | WF_ID or STR_LABEL |Release hold of workflows on a Cromwell server
list | WF_ID or STR_LABEL | List submitted workflows on a Cromwell server
metadata | WF_ID or STR_LABEL | Retrieve metadata JSONs for workflows
debug, troubleshoot | WF_ID, STR_LABEL or<br>METADATA_JSON_FILE | Analyze the reason for errors
hpc submit | WDL | Submit a Caper leader job to an HPC's job engine
hpc list | | List all Caper leader jobs
hpc abort | JOB_ID | Abort a Caper leader job. This will cascade-kill all of its child jobs.

* `init`: Initialize Caper for a given platform. This command also downloads Cromwell/Womtool JARs so that Caper can work completely offline with local data files. An example follows the platform table below.

**Platform**|**Description**
:--------|:-----
sherlock | Stanford Sherlock cluster (SLURM)
scg | Stanford SCG cluster (SLURM)
gcp | Google Cloud Platform
aws | Amazon Web Services
local | General local computer
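
For example, a minimal sketch (pick the platform that matches your system):
```bash
# writes ~/.caper/default.conf and downloads Cromwell/Womtool JARs for offline use
$ caper init local
```
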
@@ -458,7 +459,6 @@ Example:
```



## How to override Caper's built-in backend

If Caper's built-in backends don't work as expected on your clusters (e.g. due to different resource settings), then you can override the built-in backends with your own configuration file (e.g. `your.backend.conf`). Caper generates a `backend.conf` for its built-in backends in a temporary directory.
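
For example, here is a minimal sketch of `your.backend.conf` overriding one setting of the built-in `slurm` backend. The provider name and fields below follow Cromwell's HOCON backend format but are assumptions; compare with the `backend.conf` that Caper generates before editing:
```
backend {
  providers {
    slurm {
      config {
        # e.g. cap the number of concurrently running jobs for this backend
        concurrent-job-limit = 50
      }
    }
  }
}
```
Assuming your Caper version exposes a `--backend-file` option (check `caper run --help`), you would then pass it as:
```bash
$ caper run [WDL] -i [INPUT_JSON] --backend-file your.backend.conf
```
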
@@ -808,9 +808,6 @@ This file DB is generated in your working directory by default. Its default file
Unless you explicitly define `file-db` in your configuration file `~/.caper/default.conf`, this file DB's name will depend on your input JSON filename. Therefore, you can simply resume a failed workflow with the same command line used for starting a new pipeline.





## Profiling/monitoring resources on Google Cloud

A workflow run with Caper>=1.2.0 on the `gcp` backend has a monitoring log (`monitoring.log`) by default in each task's execution directory. This log file includes useful resource data for an instance, such as used memory, used disk space and total CPU percentage.
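
As a quick illustration, you can inspect a task's monitoring log directly on GCS. The path below is hypothetical; the actual layout depends on your output bucket, workflow name/ID and task name:
```bash
# hypothetical path: gs://[OUT_BUCKET]/[WDL_NAME]/[WORKFLOW_ID]/call-[TASK_NAME]/monitoring.log
$ gsutil cat gs://your-out-bucket/atac/12345678-aaaa-bbbb-cccc-123456789012/call-align/monitoring.log
```
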
@@ -833,3 +830,20 @@ Define task's input file variables to limit analysis on specific tasks and input

Example plots:
- ENCODE ATAC-seq pipeline: [Plot PDF](https://storage.googleapis.com/caper-data/gcp_resource_analysis/example_plot/atac.pdf)


## Singularity and Docker Hub pull limit

If you provide a Docker-based Singularity image (`docker://`), then Caper will locally build a temporary Singularity image (`*.sif`) under `SINGULARITY_CACHEDIR` (defaulting to `~/.singularity/cache` if not defined). However, Singularity will blindly pull from Docker Hub and can quickly reach [a daily pull limit](https://www.docker.com/increase-rate-limits). It's recommended to use Singularity images from `shub://` (Singularity Hub) or `library://` (Sylabs Cloud) instead.
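
If you do rely on `docker://` images, reusing a shared Singularity cache across runs reduces repeated pulls. A minimal sketch (the cache path is an arbitrary placeholder):
```bash
# point Singularity's image cache to a shared, writable location
$ export SINGULARITY_CACHEDIR=/path/to/shared/singularity_cache
$ caper run test.wdl --singularity docker://ubuntu:latest
```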



## How to customize resource parameters for HPCs

Each HPC backend (`slurm`, `sge`, `pbs` and `lsf`) has its own resource parameter, e.g. `slurm-resource-param`. Find it in Caper's configuration file (`~/.caper/default.conf`) and edit it. For example, the default resource parameter for SLURM looks like the following:
```
slurm-resource-param=-n 1 --ntasks-per-node=1 --cpus-per-task=${cpu} ${if defined(memory_mb) then "--mem=" else ""}${memory_mb}${if defined(memory_mb) then "M" else ""} ${if defined(time) then "--time=" else ""}${time*60} ${if defined(gpu) then "--gres=gpu:" else ""}${gpu}
```
This should be a one-liner with WDL expressions allowed in `${}` notation, i.e. Cromwell's built-in resource variables such as `cpu` (number of cores for a task), `memory_mb` (total amount of memory for a task in MB), `time` (walltime for a task in hours) and `gpu` (name of the GPU unit or number of GPUs) inside `${}`. See https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md for WDL syntax. This line will be formatted with actual resource values by Cromwell and then passed to the submission command such as `sbatch` or `qsub`.
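
For illustration only (hypothetical resource values, not actual Caper output): for a task with `cpu=4`, `memory_mb=8000`, `time=24` and no `gpu`, the default SLURM parameter above would expand roughly to the following, which is then appended to the `sbatch` command line:
```
-n 1 --ntasks-per-node=1 --cpus-per-task=4 --mem=8000M --time=1440
```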

Note that Cromwell's implicit type conversion (`WomLong` to `String`) seems to be buggy for `WomLong` type memory variables such as `memory_mb` and `memory_gb`, so be careful about using the `+` operator between `WomLong` and other types (`String`, even `Int`). For example, `${"--mem=" + memory_mb}` will not work since `memory_mb` is of `WomLong` type. Use `${if defined(memory_mb) then "--mem=" else ""}${memory_mb}${if defined(memory_mb) then "mb " else " "}` instead. See https://github.com/broadinstitute/cromwell/issues/4659 for details.
106 changes: 47 additions & 59 deletions README.md
@@ -1,12 +1,9 @@
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![CircleCI](https://circleci.com/gh/ENCODE-DCC/caper.svg?style=svg)](https://circleci.com/gh/ENCODE-DCC/caper)

# Caper

Caper (Cromwell Assisted Pipeline ExecutoR) is a wrapper Python package for [Cromwell](https://github.com/broadinstitute/cromwell/).

## Introduction

Caper wraps Cromwell to run pipelines on multiple platforms like GCP (Google Cloud Platform), AWS (Amazon Web Service) and HPCs like SLURM, SGE, PBS/Torque and LSF. It provides easier way of running Cromwell server/run modes by automatically composing necessary input files for Cromwell. Caper can run each task on a specified environment (Docker, Singularity or Conda). Also, Caper automatically localizes all files (keeping their directory structure) defined in your input JSON and command line according to the specified backend. For example, if your chosen backend is GCP and files in your input JSON are on S3 buckets (or even URLs) then Caper automatically transfers `s3://` and `http(s)://` files to a specified `gs://` bucket directory. Supported URIs are `s3://`, `gs://`, `http(s)://` and local absolute paths. You can use such URIs either in CLI and input JSON. Private URIs are also accessible if you authenticate using cloud platform CLIs like `gcloud auth`, `aws configure` and using `~/.netrc` for URLs.
Caper (Cromwell Assisted Pipeline ExecutoR) is a wrapper Python package for [Cromwell](https://github.com/broadinstitute/cromwell/). Caper wraps Cromwell to run pipelines on multiple platforms like GCP (Google Cloud Platform), AWS (Amazon Web Services) and HPCs like SLURM, SGE, PBS/Torque and LSF. It provides an easier way of running Cromwell server/run modes by automatically composing the necessary input files for Cromwell. Caper can run each task in a specified environment (Docker, Singularity or Conda). Also, Caper automatically localizes all files (keeping their directory structure) defined in your input JSON and command line according to the specified backend. For example, if your chosen backend is GCP and files in your input JSON are on S3 buckets (or even URLs), then Caper automatically transfers `s3://` and `http(s)://` files to a specified `gs://` bucket directory. Supported URIs are `s3://`, `gs://`, `http(s)://` and local absolute paths. You can use such URIs both in the CLI and in the input JSON. Private URIs are also accessible if you authenticate using cloud platform CLIs like `gcloud auth` and `aws configure`, and by using `~/.netrc` for URLs.


## Installation for Google Cloud Platform and AWS
@@ -19,16 +16,15 @@ See [this](scripts/gcp_caper_server/README.md) for details.
See [this](scripts/aws_caper_server/README.md) for details.


## Installation
## Installation for local computers and HPCs

1) Make sure that you have Java (>= 11), Python (>= 3.6) and `pip` installed on your system, then install Caper with `pip`.

```bash
$ pip install pip --upgrade
$ pip install caper
```

2) If you see an error message like `caper: command not found` then add the following line to the bottom of `~/.bashrc` and re-login.
2) If you see an error message like `caper: command not found` after installing, then add the following line to the bottom of `~/.bashrc` and re-login.

```bash
export PATH=$PATH:~/.local/bin
@@ -38,19 +34,19 @@ See [this](scripts/aws_caper_server/README.md) for details.

**Backend**|**Description**
:--------|:-----
local | local computer without cluster engine.
slurm | SLURM cluster.
sge | Sun GridEngine cluster.
pbs | PBS cluster.
lsf | LSF cluster.
sherlock | Stanford Sherlock (based on `slurm` backend).
scg | Stanford SCG (based on `slurm` backend).
local | local computer without a cluster engine
slurm | SLURM (e.g. Stanford Sherlock and SCG)
sge | Sun GridEngine
pbs | PBS cluster
lsf | LSF cluster

> **IMPORTANT**: `sherlock` and `scg` backends have been deprecated. Use the `slurm` backend instead and follow the instruction comments in the configuration file.

```bash
$ caper init [BACKEND]
```

4) Edit `~/.caper/default.conf` and follow instructions in there. **DO NOT LEAVE ANY PARAMETERS UNDEFINED OR CAPER WILL NOT WORK CORRECTLY**
4) Edit `~/.caper/default.conf` and follow the instructions in it. **CAREFULLY READ THE INSTRUCTIONS AND DO NOT LEAVE IMPORTANT PARAMETERS UNDEFINED OR CAPER WILL NOT WORK CORRECTLY**
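
   For example, on a SLURM cluster the most important parameters look roughly like the following. This is a hypothetical excerpt with placeholder values; the generated `~/.caper/default.conf` lists the actual parameters together with instruction comments:
```
backend=slurm

# depending on your cluster's policy, define a partition and/or an account
slurm-partition=YOUR_PARTITION
slurm-account=YOUR_ACCOUNT
```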


## Docker, Singularity and Conda
@@ -59,77 +55,69 @@ For local backends (`local`, `slurm`, `sge`, `pbs` and `lsf`), you can use `--do

> **IMPORTANT**: Docker/Singularity/Conda defined in Caper's configuration file or in the CLI (`--docker`, `--singularity` and `--conda`) will be overridden by those defined in a WDL task's `runtime`. We provide these parameters to define a default/base environment for a pipeline, not to override a WDL task's `runtime`.
For Conda users, make sure that you have installed pipeline's Conda environments before running pipelines. Caper only knows Conda environment's name. You don't need to activate any Conda environment before running a pipeline since Caper will internally run `conda run -n ENV_NAME COMMANDS` for each task.
For Conda users, make sure that you have installed the pipeline's Conda environments before running pipelines. Caper only knows the Conda environment's name. You don't need to activate any Conda environment before running a pipeline since Caper will internally run `conda run -n ENV_NAME TASK_SHELL_SCRIPT` for each task.

Take a look at the following examples:
```bash
$ caper run test.wdl --docker # can be used as a flag too, Caper will find docker image from WDL if defined
$ caper run test.wdl --docker # can be used as a flag too, Caper will find a docker image defined in WDL
$ caper run test.wdl --singularity docker://ubuntu:latest
$ caper hpc submit test.wdl --singularity --leader-job-name test1 # submit to job engine and use singularity defined in WDL
$ caper submit test.wdl --conda your_conda_env_name # a running Caper server is required
```
An environment defined here will be overridden by those defined in a WDL task's `runtime`. Therefore, think of this as a base/default environment for your pipeline. You can define a per-task environment in each WDL task's `runtime`.

For cloud backends (`gcp` and `aws`), you always need to use `--docker` (can be skipped). Caper will automatically try to find a base docker image defined in your WDL. For other pipelines, define a base docker image in Caper's CLI or directly in each WDL task's `runtime`.


## Singularity and Docker Hub pull limit

If you provide a Singularity image based on docker `docker://` then Caper will locally build a temporary Singularity image (`*.sif`) under `SINGULARITY_CACHEDIR` (defaulting to `~/.singularity/cache` if not defined). However, Singularity will blindly pull from DockerHub to quickly reach [a daily pull limit](https://www.docker.com/increase-rate-limits). It's recommended to use Singularity images from `shub://` (Singularity Hub) or `library://` (Sylabs Cloud).


## Important notes for Conda users

Since Caper>=2.0 you don't have to activate Conda environment before running pipelines. Caper will internally run `conda run -n ENV_NAME /bin/bash script.sh`. Just make sure that you correctly installed given pipeline's Conda environment(s).


## Important notes for Stanford HPC (Sherlock and SCG) users
An environment defined here will be overridden by those defined in a WDL task's `runtime`. Therefore, think of this as a base/default environment for your pipeline. You can define per-task docker/singularity images to override those defined in Caper's command line. For example:
```wdl
task my_task {
...
runtime {
docker: "ubuntu:latest"
singularity: "docker://ubuntu:latest"
}
}
```

**DO NOT INSTALL CAPER, CONDA AND PIPELINE'S WDL ON `$SCRATCH` OR `$OAK` STORAGES**. You will see `Segmentation Fault` errors. Install these executables (Caper, Conda, WDL, ...) on `$HOME` OR `$PI_HOME`. You can still use `$OAK` for input data (e.g. FASTQs defined in your input JSON file) but not for outputs, which means that you should not run Caper on `$OAK`. `$SCRATCH` and `$PI_SCRATCH` are okay for both input and output data so run Caper on them. Running Croo to organize outputs into `$OAK` is okay.
For cloud backends (`gcp` and `aws`), Caper will automatically try to find a base docker image defined in your WDL. For other pipelines, define a base docker image in Caper's CLI or directly in each WDL task's `runtime`.


## Running pipelines on HPCs

Use `--singularity` or `--conda` in CLI to run a pipeline inside Singularity image or Conda environment. Most HPCs do not allow docker. For example, submit `caper run ... --singularity` as a leader job (with long walltime and not-very-big resources like 2 cpus and 5GB of RAM). Then Caper's leader job itself will submit its child jobs to the job engine so that both leader and child jobs can be found with `squeue` or `qstat`.
Use `--singularity` or `--conda` in the CLI to run a pipeline inside a Singularity container or a Conda environment. Most HPCs do not allow Docker. For example, `caper hpc submit ... --singularity` will submit the Caper process to the job engine as a leader job. Caper's leader job will then submit its child jobs to the job engine so that both leader and child jobs can be found with `squeue` or `qstat`.

Use `caper hpc list` to list all leader jobs. Use `caper hpc abort JOB_ID` to abort a running leader job. **DO NOT DIRECTLY CANCEL A LEADER JOB WITH A CLUSTER COMMAND LIKE `scancel` OR `qdel`**, since that cancels only the leader job and leaves its child jobs running.

Here are some example command lines to submit Caper as a leader job. Make sure that you have correctly configured Caper with `caper init` and filled in all parameters in the configuration file `~/.caper/default.conf`.

There are extra parameters `--db file --file-db [METADATA_DB_PATH_FOR_CALL_CACHING]` to use call-caching (restarting workflows by re-using previous outputs). If you want to restart a failed workflow then use the same metadata DB path then pipeline will start from where it left off. It will actually start over but will reuse (soft-link) previous outputs.
There is an extra set of parameters `--file-db [METADATA_DB_PATH_FOR_CALL_CACHING]` to use call-caching (restarting workflows by re-using previous outputs). If you want to restart a failed workflow, use the same metadata DB path and the pipeline will start from where it left off. It will actually start over but will reuse (soft-link) previous outputs.

```bash
# make a separate directory for each workflow.
# make a new output directory for a workflow.
$ cd [OUTPUT_DIR]

# Example for Stanford Sherlock
$ sbatch -p [SLURM_PARTITON] -J [WORKFLOW_NAME] --export=ALL --mem 5G -t 4-0 --wrap "caper run [WDL] -i [INPUT_JSON] --singularity --db file --file-db [METADATA_DB_PATH_FOR_CALL_CACHING]"
# Example with Singularity without using call-caching.
$ caper hpc submit [WDL] -i [INPUT_JSON] --singularity --leader-job-name GOOD_NAME1

# Example for Stanford SCG
$ sbatch -A [SLURM_ACCOUNT] -J [WORKFLOW_NAME] --export=ALL --mem 5G -t 4-0 --wrap "caper run [WDL] -i [INPUT_JSON] --singularity --db file --file-db [METADATA_DB_PATH_FOR_CALL_CACHING]"
# Example with Conda and using call-caching (restarting a workflow from where it left off)
# Use the same --file-db PATH for next re-run then Caper will collect and softlink previous outputs.
$ caper hpc submit [WDL] -i [INPUT_JSON] --conda --leader-job-name GOOD_NAME2 --db file --file-db [METADATA_DB_PATH]

# Example for General SLURM cluster
$ sbatch -A [SLURM_ACCOUNT_IF_NEEDED] -p [SLURM_PARTITON_IF_NEEDED] -J [WORKFLOW_NAME] --export=ALL --mem 5G -t 4-0 --wrap "caper run [WDL] -i [INPUT_JSON] --singularity --db file --file-db [METADATA_DB_PATH_FOR_CALL_CACHING]"
# List all leader jobs.
$ caper hpc list

# Example for SGE
$ echo "caper run [WDL] -i [INPUT_JSON] --conda --db file --file-db [METADATA_DB_PATH_FOR_CALL_CACHING]" | qsub -V -N [JOB_NAME] -l h_rt=144:00:00 -l h_vmem=3G
# Check leader job's STDOUT file to monitor workflow's status.
# Example for SLURM
$ tail -f slurm-[JOB_ID].out

# Check status of leader job
$ squeue -u $USER | grep -v [WORKFLOW_NAME]
# Cromwell's log will be written to cromwell.out* in the same directory.
# It will be helpful for monitoring your workflow in detail.
$ ls -l cromwell.out*

# Kill the leader job then Caper will gracefully shutdown to kill its children.
$ scancel [LEADER_JOB_ID]
# Abort a leader job (this will cascade-kill all its child jobs)
# If you directly use a job engine command like scancel or qdel, the child jobs will remain running.
$ caper hpc abort [JOB_ID]
```


## How to customize resource parameters for HPCs

Each HPC backend (`slurm`, `sge`, `pbs` and `lsf`) has its own resource parameter, e.g. `slurm-resource-param`. Find it in Caper's configuration file (`~/.caper/default.conf`) and edit it. For example, the default resource parameter for SLURM looks like the following:
```
slurm-resource-param=-n 1 --ntasks-per-node=1 --cpus-per-task=${cpu} ${if defined(memory_mb) then "--mem=" else ""}${memory_mb}${if defined(memory_mb) then "M" else ""} ${if defined(time) then "--time=" else ""}${time*60} ${if defined(gpu) then "--gres=gpu:" else ""}${gpu}
```
This should be a one-liner with WDL expressions allowed in `${}` notation, i.e. Cromwell's built-in resource variables such as `cpu` (number of cores for a task), `memory_mb` (total amount of memory for a task in MB), `time` (walltime for a task in hours) and `gpu` (name of the GPU unit or number of GPUs) inside `${}`. See https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md for WDL syntax. This line will be formatted with actual resource values by Cromwell and then passed to the submission command such as `sbatch` or `qsub`.

Note that Cromwell's implicit type conversion (`WomLong` to `String`) seems to be buggy for `WomLong` type memory variables such as `memory_mb` and `memory_gb`, so be careful about using the `+` operator between `WomLong` and other types (`String`, even `Int`). For example, `${"--mem=" + memory_mb}` will not work since `memory_mb` is of `WomLong` type. Use `${if defined(memory_mb) then "--mem=" else ""}${memory_mb}${if defined(memory_mb) then "mb " else " "}` instead. See https://github.com/broadinstitute/cromwell/issues/4659 for details.


# DETAILS

See [details](DETAILS.md).

2 changes: 1 addition & 1 deletion caper/__init__.py
@@ -2,4 +2,4 @@
from .caper_runner import CaperRunner

__all__ = ['CaperClient', 'CaperClientSubmit', 'CaperRunner']
__version__ = '2.1.3'
__version__ = '2.2.0'