Hummingbird has a `conf` folder which contains configuration files for all tested pipelines. Configuration files follow the naming convention `<pipeline-name>.conf.json` and contain all the information needed to launch jobs on the cloud. The format of the configuration file is strict, and any unnecessary lines or spaces will cause an error. Each required field is described below.
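For orientation, a top-level skeleton of such a file might look like the following sketch (the section names come from the fields described below; the values are placeholders, not real settings):

```json
{
    "Platform": { "service": "gcp" },
    "Downsample": { },
    "Profiling": { }
}
```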
- `Platform` Specifies information about the cloud computing platform.
  - `service` The cloud computing service. Specify `gcp` for Google Cloud, `aws` for AWS, or `azure` for Azure.
  - aws:
    - `project` The cloud project ID. Make sure the project has access to all needed functionalities and APIs.
    - `regions` The region where the computing resource is hosted.
    - `bucket` The name of the cloud storage bucket where all the log and output files generated by Hummingbird will be stored.
    - `cloudformation_stack_name` The name of the CloudFormation stack from Section 1: Getting Started.
  - gcp:
    - `project` The cloud project ID. Make sure the project has access to all needed functionalities and APIs.
    - `regions` The region where the computing resource is hosted.
    - `bucket` The name of the cloud storage bucket where all the log and output files generated by Hummingbird will be stored.
  - azure:
    - `subscription` The Azure Subscription ID. Make sure the subscription has access to all needed functionalities and APIs.
    - `resource_group` The Azure Resource Group.
    - `location` The location where the computing resource is hosted.
    - `storage_account` The name of the Azure Storage Account in the Resource Group above.
    - `storage_container` The Storage Container in the Storage Account above.
    - `storage_connection_string` The connection string obtained from the Azure Storage Account above.
    - `batch_account` The Azure Batch account in the Resource Group above.
    - `batch_key` The Azure Batch key.
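As a sketch, a `Platform` section for the GCP backend might look like this (the project ID, region, and bucket name are placeholder values):

```json
"Platform": {
    "service": "gcp",
    "project": "my-gcp-project",
    "regions": "us-west1",
    "bucket": "my-hummingbird-bucket"
}
```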
- `Downsample` Options for Hummingbird's downsampling processes.
  - `input` The full gs/s3 path of files to be used as input to Hummingbird. Specify in key-value pair format; the keys will be used for interpolation of values later.
  - `target` The number of reads in the original input files. This number is used for prediction purposes.
  - `output` Path to a directory in your bucket to store output files for downsampling. Do not include the bucket name.
  - `logging` (GCP only) Path to a directory in your bucket to store log files for downsampling. Do not include the bucket name.
  - `fractions` (optional) A list of decimals representing the downsample sizes. The default list is `[0.001, 0.01, 0.1]`, which means the sample will be downsized to 0.1%, 1%, and 10% of its original size.
  - `fullrun` (optional) Defaults to `false`. Set to `true` to run the whole input without downsampling.
  - `index` (optional) Defaults to `false`. Set to `true` to use `samtools` to generate index files for the downsampled input files.
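Putting these fields together, a hypothetical `Downsample` section might read as follows (the input keys, paths, and read count are placeholders chosen for illustration):

```json
"Downsample": {
    "input": {
        "FASTQ1": "gs://my-bucket/inputs/sample_1.fastq.gz",
        "FASTQ2": "gs://my-bucket/inputs/sample_2.fastq.gz"
    },
    "target": 400000000,
    "output": "downsampled/",
    "logging": "logs/downsample/",
    "fractions": [0.001, 0.01, 0.1],
    "fullrun": false,
    "index": false
}
```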
- `Profiling` Options for Hummingbird's profiling processes.
  - `image` The Docker image on which your pipeline will be executed. The AWS backend requires you to build a customized image; see the documentation here.
  - `logging` (GCP only) Path to a directory in your bucket to store log files for profiling. Do not include the bucket name.
  - `result` Path to a directory in your bucket to store result files for profiling. Do not include the bucket name.
  - `thread` A list of numbers representing the number of virtual CPUs on a machine. The default is `[8]`, which causes Hummingbird to test the downsampled inputs on a machine with 8 virtual CPUs.
  - WDL/Cromwell
    - `wdl_file` Path to the workflow WDL file in your bucket to be submitted by Hummingbird to Cromwell. Do not include the bucket name in the path.
    - `backend_conf` Path to the workflow backend configuration file in your bucket to be submitted by Hummingbird to Cromwell. Do not include the bucket name in the path.
    - `json_input` Two-dimensional array containing the input for each Cromwell call. For each value in the `thread` option, Hummingbird requires a list referencing each downsampled file specified in the `size` option. These input files vary based on pipelines.
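To illustrate the two-dimensional layout: assuming two `thread` values and three downsample fractions, a `json_input` array would contain one row per thread value, each listing one input file per downsampled size. The file names below are hypothetical placeholders:

```json
"json_input": [
    ["inputs_0.001.json", "inputs_0.01.json", "inputs_0.1.json"],
    ["inputs_0.001.json", "inputs_0.01.json", "inputs_0.1.json"]
]
```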
  - Command line tool
    - `command` Command directly executed in the image.
    - `input` and/or `input-recursive` Add any additional input resources in key-value pair format.
  - `output` and/or `output-recursive` Path in your bucket to where Hummingbird will output the memory and time profiling results. Specify in key-value pair format; the keys will be used for interpolation of values later. Do not include the bucket name.
  - `force` (optional) Defaults to `false`. Set to `true` to force re-execution of the pipeline even if the result already exists.
  - `tries` (optional) Defaults to 1. Specifies the number of repeated runs for each task; the result is reported as the average of those runs.
  - `disk` (optional) Defaults to 500. The size in GB of the data disk on your instance.
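Combining the options above, a hypothetical `Profiling` section for a command line tool might look like the following sketch. The image name, command, key names, and paths are all placeholders; the `${FASTQ1}` reference assumes an input key of that name was defined in the `Downsample` section:

```json
"Profiling": {
    "image": "biocontainers/samtools",
    "logging": "logs/profiling/",
    "result": "results/",
    "thread": [4, 8],
    "command": "samtools view -c ${FASTQ1}",
    "output": {
        "COUNT": "profiling/count.txt"
    },
    "force": false,
    "tries": 1,
    "disk": 500
}
```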