Section 3: Editing the Configuration File

Hummingbird has a conf folder which contains configuration files for all tested pipelines. Configuration files follow the naming convention <pipeline-name>.conf.json and contain all the information needed to launch jobs on the cloud. The configuration file must be valid JSON: any unnecessary lines or stray characters will throw an error. We will go through each required field in the configuration file below.
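At the top level, the configuration file is a JSON object with one key per section described below. A minimal sketch of the overall shape (the key names Platform, Downsample, and Profiling follow the section names in this guide; check a sample conf file in the conf folder for the exact spelling your pipeline uses):

```json
{
  "Platform": { },
  "Downsample": { },
  "Profiling": { }
}
```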

  1. Platform Specifies information about the cloud computing platform.

    • service The cloud computing service. Specify gcp for Google Cloud, aws for AWS, or azure for Azure.

    • aws:

      • project The cloud project ID. Make sure the project has access to all needed functionalities and APIs.
      • regions The region where the computing resource is hosted.
      • bucket The name of the cloud storage bucket where all the log and output files generated by Hummingbird will be stored.
      • cloudformation_stack_name The name of the CloudFormation stack from Section 1: Getting Started.
    • gcp:

      • project The cloud project ID. Make sure the project has access to all needed functionalities and APIs.
      • regions The region where the computing resource is hosted.
      • bucket The name of the cloud storage bucket where all the log and output files generated by Hummingbird will be stored.
    • azure:

      • subscription The Azure Subscription ID. Make sure the project has access to all needed functionalities and APIs.
      • resource_group The Azure Resource Group.
      • location The location where the computing resource is hosted.
      • storage_account The name of the Azure storage account in the Resource Group above.
      • storage_container The Storage Container in the Storage Account above.
      • storage_connection_string The connection string obtained from Azure Storage Account above.
      • batch_account Azure Batch account in the Resource Group above.
      • batch_key Azure Batch key.
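As an illustration, a Platform block for Google Cloud might look like the following sketch (the project, region, and bucket values are placeholders, not real resources):

```json
{
  "Platform": {
    "service": "gcp",
    "project": "my-project-id",
    "regions": "us-central1",
    "bucket": "my-hummingbird-bucket"
  }
}
```

For aws, you would change the service value and add cloudformation_stack_name; for azure, use the subscription, resource_group, and storage fields listed above.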
  2. Downsample Options for Hummingbird's downsampling processes.

    • input The full gs/s3 path of files to be used as input to Hummingbird. Specify as key-value pairs; the keys will be used later to interpolate the values.
    • target The number of reads in the original input files. This number will be used for prediction purposes.
    • output Path to a directory in your bucket to store output files for downsample. Do not include the bucket name.
    • logging (GCP only) Path to a directory in your bucket to store log files for downsample. Do not include the bucket name.
    • fractions (optional) A list of decimals representing the downsample size. The default list is [0.0001, 0.01, 0.1], which downsamples the input to 0.01%, 1%, and 10% of its original size.
    • fullrun (optional) Defaults to false. Set to true to run the whole input without downsampling.
    • index (optional) Defaults to false. Set to true to use samtools to generate index files for the downsampled input files.
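A hypothetical Downsample block for a paired-end sample might look like this (the paths, the FASTQ1/FASTQ2 keys, and the read count are placeholder values):

```json
{
  "Downsample": {
    "input": {
      "FASTQ1": "gs://my-hummingbird-bucket/inputs/sample_R1.fastq.gz",
      "FASTQ2": "gs://my-hummingbird-bucket/inputs/sample_R2.fastq.gz"
    },
    "target": 300000000,
    "output": "downsample/output",
    "logging": "downsample/logging",
    "fractions": [0.0001, 0.01, 0.1]
  }
}
```

Note that output and logging omit the bucket name, while the input paths are full gs:// (or s3://) paths.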
  3. Profiling Options for Hummingbird's Profiling processes.

    • image The Docker image on which your pipeline will be executed. The AWS backend requires you to build a customized image; see the documentation.
    • logging (GCP only) Path to a directory in your bucket to store log files for profiling. Do not include the bucket name.
    • result Path to a directory in your bucket to store result files for profiling. Do not include the bucket name.
    • thread A list of numbers representing the number of virtual CPUs on a machine. The default, [8], causes Hummingbird to test the downsampled inputs on a machine with 8 virtual CPUs.
    • WDL/Cromwell
      • wdl_file Path to the workflow wdl_file in your bucket to be submitted by Hummingbird to Cromwell. Do not include the bucket name in the path.
      • backend_conf Path to the workflow backend configuration file in your bucket to be submitted by Hummingbird to Cromwell. Do not include the bucket name in the path.
      • json_input Two-dimensional array containing the input for each Cromwell call. For each value in the thread option, Hummingbird requires a list referencing each downsampled file specified in the size option. These input files vary by pipeline.
    • Command line tool
      • command Command directly executed in the image.
      • input and/or input-recursive Add any additional input resource in key-value pairs format.
    • output and/or output-recursive Path in your bucket to where Hummingbird will output the memory and time profiling results. Specify in key-value pairs format, the keys will be used as interpolation of values later. Do not include the bucket name.
    • force (optional) Defaults to false. Set to true to force re-execution of the pipeline even if a result already exists.
    • tries (optional) Defaults to 1. Specifies the number of repeated runs for each task; the reported result is the average across runs.
    • disk (optional) Defaults to 500. The size in GB of the data disk on your instance.
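Putting the Profiling fields together, a command-line-tool sketch might look like the following (the image name, the command, and the ${...} interpolation placeholders are illustrative assumptions; consult a tested pipeline's conf file for the exact interpolation syntax):

```json
{
  "Profiling": {
    "image": "us.gcr.io/my-project-id/my-aligner:latest",
    "logging": "profiling/logging",
    "result": "profiling/result",
    "thread": [8, 16],
    "command": "aligner --threads ${THREAD} ${FASTQ1} ${FASTQ2} > ${OUTPUT}",
    "output": {
      "OUTPUT": "profiling/output/result.sam"
    },
    "tries": 3,
    "disk": 500
  }
}
```

A WDL/Cromwell configuration would replace command and output with the wdl_file, backend_conf, and json_input fields described above.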