In this section we will walk you through how to run Hummingbird on Google Cloud for a sample pipeline that uses the BWA aligner (https://github.com/lh3/bwa). We assume that you have already created a project (https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project), installed the Google Cloud SDK, and granted credentials to the SDK. Along with the Google Cloud essentials, you also need to have installed Hummingbird. We will be modifying the bwa.conf.json file located in conf/examples.
1. Get a list of all projects by executing `gcloud projects list`. Make a note of the name of the project in which you want to execute Hummingbird and add it to the `project` field under the `Platform` section.
2. Identify the region in which you want all of the computing resources to be launched. Ideally this is the same region you provided to the gcloud SDK during setup. See https://cloud.google.com/compute/docs/regions-zones for more information about regions and zones.
3. Create a new storage bucket by following the instructions at https://cloud.google.com/storage/docs/creating-buckets#storage-create-bucket-console. You can either create the bucket using the Cloud Storage browser in the Google Cloud Console, or execute `gsutil mb gs://<BUCKET_NAME>` from the command line. If creating the bucket from the command line, provide the `-p` (project name), `-c` (storage class) and `-l` (location) flags to have greater control over the creation of your bucket. Once the bucket is created, add it to the `bucket` field under the `Platform` section. Provide only the bucket name; the full path is not required. A sketch of the resulting `Platform` section appears after this list.
4. For a sample BWA run, we will be using fastq files from one of the Platinum Genomes, which are publicly hosted in the `genomics-public-data/platinum-genomes` Cloud Storage bucket. The two fastq files we will be using are ERR194159_1.fastq.gz and ERR194159_2.fastq.gz. In the `input` field under `Downsample`, add `gs://genomics-public-data/platinum-genomes/fastq/ERR194159_1.fastq.gz` as INPUT_R1 and `gs://genomics-public-data/platinum-genomes/fastq/ERR194159_2.fastq.gz` as INPUT_R2. For your own input files, provide the full Google Cloud bucket path including `gs://`. The inputs need to be specified in key-value format; the key is used to refer to the value later on. For example, in the command line you can refer to the first fastq file as ${INPUT_R1}. `fractions` represents the extent to which the whole input will be downsampled. For example, to downsample to 1% of the input file, specify 0.01 as the fraction. You can keep the values as they are, or tinker with them to get different results.
5. Next, we will add the output and logging bucket names to the configuration file. The output and logging buckets will be created under the bucket created in Step 3, so you only need to provide their paths relative to that bucket. For example, if you created a bucket called bwa-example in Step 3, then created bwa under bwa-example, and then created bwa-logging and bwa-output under bwa-example/bwa, provide bwa/bwa-output in the `output` field and bwa/bwa-logging in the `logging` field. The logging bucket option is specific to GCP and dsub.
6. The `fullrun` field indicates whether the input will be downsampled or not. Keep it as "false" to enable downsampling of the input for the pipeline. Setting this option to "true" causes the entire pipeline to be executed on the whole input. A sketch of the resulting `Downsample` section appears after this list.
7. In the `image` field under `Profiling`, provide the container image that contains the pipeline on which you wish to execute Hummingbird.
8. In the `logging` field, provide a bucket where Hummingbird can write the log files that are generated during the profiling step. This should be different from the logging bucket you provided under the `Downsample` section, and it should be relative to the bucket created in Step 3. The logging bucket option is specific to GCP and dsub.
9. In the `result` field, provide a bucket that will store the profiling results. It should be relative to the bucket created in Step 3.
10. In the `threads` field, provide a list of numbers representing the number of virtual CPUs on a machine. The default is [8]. If you set `fullrun` to true, change it to a higher number, or else the tool might fail because it executes on an instance with insufficient memory.
11. `input-recursive` is where you provide any additional files, located within a directory, that will be needed during execution. For example, if your reference files are under the `references/GRCh37lite` path (relative to the bucket created in Step 3), you can specify it in the `input-recursive` field with a key such as `REF`.
12. In the `command` field, provide the command that is to be executed in the container. Use the keys that were mentioned in the `input` field and the `input-recursive` field (if any).
13. The output file name and path can be specified in the `output` field. It should be relative to the bucket created in Step 3. A sketch of the resulting `Profiling` section appears after this list.
14. Once the configuration file is ready, you can execute Hummingbird by running `hummingbird <path to conf file>`.
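Putting the pieces together, the `Platform` settings from Steps 1-3 might look roughly like the sketch below. This is only an illustration: `<PROJECT_ID>` is a placeholder, only the fields discussed above are shown, and the shipped example contains additional keys (for instance, settings for the cloud service and the region chosen in Step 2), so treat conf/examples/bwa.conf.json as the authoritative reference.

```json
{
  "Platform": {
    "project": "<PROJECT_ID>",
    "bucket": "bwa-example"
  }
}
```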
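The `Downsample` section from Steps 4-6 might then be sketched as follows. The exact shape of each value (for example, whether `fractions` is a list and whether `fullrun` is a JSON boolean or a string) may differ from the shipped example, and bwa/bwa-output and bwa/bwa-logging are simply the example paths used in Step 5.

```json
{
  "Downsample": {
    "input": {
      "INPUT_R1": "gs://genomics-public-data/platinum-genomes/fastq/ERR194159_1.fastq.gz",
      "INPUT_R2": "gs://genomics-public-data/platinum-genomes/fastq/ERR194159_2.fastq.gz"
    },
    "fractions": [0.01],
    "fullrun": false,
    "output": "bwa/bwa-output",
    "logging": "bwa/bwa-logging"
  }
}
```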
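Finally, the `Profiling` section from Steps 7-13 could be sketched as below. The image, logging, result, and output values are placeholders, and the BAM output key together with the exact `bwa mem` command string are hypothetical rather than copied from the shipped example; the point is only that the command references the ${INPUT_R1}, ${INPUT_R2}, and ${REF} keys defined earlier. Check the key names and nesting against conf/examples/bwa.conf.json before running.

```json
{
  "Profiling": {
    "image": "<CONTAINER_IMAGE_WITH_BWA>",
    "logging": "bwa/profiling-logs",
    "result": "bwa/profiling-results",
    "threads": [8],
    "input-recursive": {
      "REF": "references/GRCh37lite"
    },
    "command": "bwa mem ${REF}/<REFERENCE_FASTA> ${INPUT_R1} ${INPUT_R2} > ${BAM}",
    "output": {
      "BAM": "bwa/output/ERR194159.bam"
    }
  }
}
```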