
Section 2: Sample Run on Google Cloud

In this section we walk you through running Hummingbird on Google Cloud for a sample pipeline that uses the BWA aligner (https://github.com/lh3/bwa). We assume you have already created a project (https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project), installed the Google Cloud SDK, and granted credentials to the SDK. In addition to the Google Cloud essentials, you also need to have installed Hummingbird. We will be modifying the `bwa.conf.json` file located in `conf/examples`.

  1. Get a list of all projects by executing `gcloud projects list`. Make a note of the name of the project in which you want to run Hummingbird, and add it to the `project` field under the `Platform` section.
  2. Identify the region in which you want all of the computing resources to be launched. This would ideally be the same region you provided to the gcloud SDK during setup. See https://cloud.google.com/compute/docs/regions-zones for more information about regions and zones.
  3. Create a new storage bucket following the instructions at https://cloud.google.com/storage/docs/creating-buckets#storage-create-bucket-console. You can either create a bucket using the Cloud Storage browser in the Google Cloud Console, or execute `gsutil mb gs://<BUCKET_NAME>` from the command line. If creating the bucket from the command line, provide the `-p` (project name), `-c` (storage class), and `-l` (location) flags for greater control over how your bucket is created. Once the bucket is created, add it to the `bucket` field under the `Platform` section. Provide just the bucket name; the full path is not required.
  4. For a sample BWA run, we will be using FASTQ files from one of the Platinum Genomes, which are publicly hosted in the `genomics-public-data/platinum-genomes` Cloud Storage bucket. The two FASTQ files we will be using are ERR194159_1.fastq.gz and ERR194159_2.fastq.gz. In the `input` field under `Downsample`, add gs://genomics-public-data/platinum-genomes/fastq/ERR194159_1.fastq.gz as `INPUT_R1` and gs://genomics-public-data/platinum-genomes/fastq/ERR194159_2.fastq.gz as `INPUT_R2`. For your own input files, provide the full Google Cloud bucket path, including the gs:// prefix. The inputs are specified in key-value format; the key is used to interpret the value later on. For example, on the command line you can refer to the first FASTQ file as `${INPUT_R1}`.
  5. The `fractions` field represents the extent to which the whole input will be downsampled. For example, to downsample to 1% of the input file, specify 0.01 as the fraction. You can keep the values as is, or experiment with them to get different results.
  6. Next, we add the output and logging bucket names to the configuration file. The output and logging buckets will be created under the bucket created in Step 3, so provide only the path relative to that bucket. For example, if you created a bucket called bwa-example in Step 3 with a folder bwa containing bwa-logging and bwa-output, then provide bwa/bwa-output as the `output` field and bwa/bwa-logging as the `logging` field. The logging bucket option is specific to GCP and dsub.
  7. The `fullrun` field indicates whether the input will be downsampled or not. Keep it as "false" to downsample the input for the pipeline; setting it to "true" executes the entire pipeline on the whole input.
  8. In the `image` field under `Profiling`, provide the container image that contains the pipeline you wish to profile with Hummingbird.
  9. In the `logging` field, provide a bucket where Hummingbird can write the log files generated during the profiling step. This should be different from the logging bucket you provided under the `Downsample` section, and it should be relative to the bucket created in Step 3. The logging bucket option is specific to GCP and dsub.
  10. In the `result` field, provide a bucket that will store the profiling results. It should be relative to the bucket created in Step 3.
  11. In the `threads` field, provide a list of numbers representing the number of virtual CPUs on a machine. The default is [8]. If you set `fullrun` to true, increase this value; otherwise the tool might fail because it executes on an instance with insufficient memory.
  12. In the `input-recursive` field, provide any additional files located within a directory that will be needed during execution. For example, if your reference files are under the references/GRCh37lite path (relative to the bucket created in Step 3), you can specify them in the `input-recursive` field with a key such as REF.
  13. In the `command` field, provide the command to be executed in the container. Use the keys that were specified in the `input` field and the `input-recursive` field (if any).
  14. The output file name and path can be specified in the `output` field. It should be relative to the bucket created in Step 3.
  15. Once the configuration file is ready, execute Hummingbird by running `hummingbird <path to conf file>`.
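
Putting the steps together, a minimal configuration might look like the sketch below. The section and field names are taken from the steps above, but the exact key names and nesting should be verified against the `conf/examples/bwa.conf.json` shipped with Hummingbird; the project, bucket, region, image, paths, and command shown here are assumed placeholders, not values from the original walkthrough.

```json
{
  "Platform": {
    "service": "gcp",
    "project": "my-hummingbird-project",
    "regions": "us-central1",
    "bucket": "bwa-example"
  },
  "Downsample": {
    "input": {
      "INPUT_R1": "gs://genomics-public-data/platinum-genomes/fastq/ERR194159_1.fastq.gz",
      "INPUT_R2": "gs://genomics-public-data/platinum-genomes/fastq/ERR194159_2.fastq.gz"
    },
    "fractions": [0.01],
    "fullrun": false,
    "output": "bwa/bwa-output",
    "logging": "bwa/bwa-logging"
  },
  "Profiling": [
    {
      "image": "my-registry/bwa:latest",
      "logging": "bwa/profiling-logging",
      "result": "bwa/profiling-result",
      "threads": [8],
      "input-recursive": {
        "REF": "references/GRCh37lite"
      },
      "command": "bwa mem ${REF}/GRCh37-lite.fa ${INPUT_R1} ${INPUT_R2} > ${OUTPUT_SAM}",
      "output": {
        "OUTPUT_SAM": "bwa/output/ERR194159.sam"
      }
    }
  ]
}
```

All bucket paths under `Downsample` and `Profiling` are relative to the `bucket` named in `Platform`, as described in Steps 6, 9, 10, and 14.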
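
Steps 1 and 3 above can be sketched as shell commands. The project name, bucket name, storage class, and location below are assumed placeholders; substitute your own values. The `gcloud` and `gsutil` invocations are commented out because they require an authenticated SDK.

```shell
# Assumed placeholder values -- replace with your own project and bucket.
PROJECT_ID="my-hummingbird-project"   # pick one from `gcloud projects list`
BUCKET="bwa-example"

# Step 1: list projects and note the one you want to use.
# gcloud projects list

# Step 3: create the bucket with explicit project, storage class, and location.
# gsutil mb -p "$PROJECT_ID" -c STANDARD -l us-central1 "gs://$BUCKET"

# The value that goes in the bucket field of the config is just the name:
echo "$BUCKET"
```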