From 155a25c0169b933d331b23a8508412b1013d5048 Mon Sep 17 00:00:00 2001 From: Miller Date: Fri, 19 Jul 2024 12:26:20 -0400 Subject: [PATCH 1/2] :bookmark: bump to release 2.11.0 --- workflows/kfdrc_alignment_wf.cwl | 2 +- workflows/kfdrc_sentieon_alignment_wf.cwl | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/workflows/kfdrc_alignment_wf.cwl b/workflows/kfdrc_alignment_wf.cwl index b50dc3e..321c1b2 100644 --- a/workflows/kfdrc_alignment_wf.cwl +++ b/workflows/kfdrc_alignment_wf.cwl @@ -868,5 +868,5 @@ hints: - WXS - GVCF "sbg:links": -- id: 'https://github.com/kids-first/kf-alignment-workflow/releases/tag/v2.10.0' +- id: 'https://github.com/kids-first/kf-alignment-workflow/releases/tag/v2.11.0' label: github-release diff --git a/workflows/kfdrc_sentieon_alignment_wf.cwl b/workflows/kfdrc_sentieon_alignment_wf.cwl index e0a09d9..3a1e45a 100644 --- a/workflows/kfdrc_sentieon_alignment_wf.cwl +++ b/workflows/kfdrc_sentieon_alignment_wf.cwl @@ -453,5 +453,5 @@ hints: - GVCF - SENTIEON "sbg:links": -- id: 'https://github.com/kids-first/kf-alignment-workflow/releases/tag/v2.10.0' +- id: 'https://github.com/kids-first/kf-alignment-workflow/releases/tag/v2.11.0' label: github-release From 222c147e573f01c082ff15f39f609d0f67b029dc Mon Sep 17 00:00:00 2001 From: Miller Date: Mon, 22 Jul 2024 17:24:33 -0400 Subject: [PATCH 2/2] :books: update readmes :broom: fix typos :broom: fix typo :broom: fix typo :broom: typos --- ...BWA_GATK_ALIGNMENT_GVCF_WORKFLOW_README.md | 299 ++++++++ ...SENTIEON_ALIGNMENT_GVCF_WORKFLOW_README.md | 9 +- readme.md | 428 +----------- workflows/kfdrc_alignment_wf.cwl | 639 ++++++++---------- workflows/kfdrc_sentieon_alignment_wf.cwl | 7 +- 5 files changed, 597 insertions(+), 785 deletions(-) create mode 100644 docs/KFDRC_BWA_GATK_ALIGNMENT_GVCF_WORKFLOW_README.md diff --git a/docs/KFDRC_BWA_GATK_ALIGNMENT_GVCF_WORKFLOW_README.md b/docs/KFDRC_BWA_GATK_ALIGNMENT_GVCF_WORKFLOW_README.md new file mode 100644 index 0000000..06befbd --- /dev/null +++ b/docs/KFDRC_BWA_GATK_ALIGNMENT_GVCF_WORKFLOW_README.md @@ -0,0 +1,299 @@ +# Kids First Data Resource Center BWA-GATK Short Reads Alignment and HaplotypeCaller Workflow + +

+ Kids First repository logo +

The Kids First Data Resource Center (KFDRC) BWA-GATK Short Reads Alignment and
GATK HaplotypeCaller Workflow is a Common Workflow Language (CWL) implementation of
various software used to take reads generated by next generation sequencing
(NGS) technologies and use those reads to generate alignment and, optionally,
variant information.

This workflow is an all-in-one alignment workflow capable of handling any kind
of read input: BAM inputs, PE reads and mates inputs, SE reads inputs, or
any combination of these. The workflow will naively attempt to process these
depending on what you tell it you have provided. The user informs the workflow
of which inputs to process using three boolean inputs: `run_bam_processing`,
`run_pe_reads_processing`, and `run_se_reads_processing`. Providing `true`
values for these as well as their corresponding inputs will result in those
inputs being processed.

In addition to the core alignment functionality, this workflow is also capable
of the following:
- Performing Haplotype Calling to generate a gVCF as well as associated metrics
- Collecting Hybrid-Selection (HS) Sequencing Metrics
- Collecting Whole Genome Sequencing (WGS) Metrics
- Collecting Aggregate Sequencing Metrics: Alignment Summary, Insert Size, Sequencing Artifact, GC Bias, and Quality Score Distribution
- Collecting Sex Chromosome Metrics
- Collecting HLA Genotyping Data

## Pipeline Software and Versions

Below is a breakdown of the software used in the workflow. Please note that this
breakdown is manually updated and, as such, may fall out of date with our workflow.
For the most up-to-date versions of our software, please refer to the Docker
images present in the workflow. A table of the images and their usage can be
found [here](./dockers_bwagatk_alignment.md).

| Step                       | KFDRC BWA-GATK                      |
|----------------------------|-------------------------------------|
| Bam to Read Group (RG) BAM | samtools split                      |
| RG Bam to Fastq            | biobambam2 bamtofastq               |
| Adapter Trimming           | cutadapt                            |
| Fastq to RG Bam            | bwa mem                             |
| Merge RG Bams              | sambamba merge                      |
| Sort Bam                   | sambamba sort                       |
| Mark Duplicates            | samblaster                          |
| BaseRecalibration          | GATK BaseRecalibrator               |
| ApplyRecalibration         | GATK ApplyBQSR                      |
| Gather Recalibrated BAMs   | Picard GatherBamFiles               |
| Bam to Cram                | samtools view                       |
| Metrics                    | Picard                              |
| Sex Metrics                | samtools idxstats                   |
| HLA Genotyping             | T1k                                 |
| Contamination Calculation  | VerifyBamID                         |
| gVCF Calling               | GATK HaplotypeCaller                |
| Gather VCFs                | Picard MergeVcfs                    |
| gVCF Metrics               | Picard CollectVariantCallingMetrics |

- [biobambam2](https://github.com/gt1/biobambam2): [2.0.50](https://github.com/gt1/biobambam2/releases/tag/2.0.50-release-20160705161609)
- [bwa](https://github.com/lh3/bwa): [0.7.17-r1188](https://github.com/lh3/bwa/releases/tag/v0.7.17)
- [Cutadapt](https://github.com/marcelm/cutadapt): [4.6](https://github.com/marcelm/cutadapt/releases/tag/v4.6)
- [GATK](https://github.com/broadinstitute/gatk) Our workflow uses three versions of GATK:
  - HaplotypeCaller: [4.beta.1](https://github.com/broadinstitute/gatk/releases/tag/4.beta.1)
  - BaseRecalibration: [4.0.3.0](https://github.com/broadinstitute/gatk/releases/tag/4.0.3.0)
  - File Indexing: [4.1.7.0](https://github.com/broadinstitute/gatk/releases/tag/4.1.7.0)
- [Picard](https://github.com/broadinstitute/picard): [2.18.9](https://github.com/broadinstitute/picard/releases/tag/2.18.9)
- [Sambamba](https://github.com/biod/sambamba): [0.6.3](https://github.com/biod/sambamba/releases/tag/v0.6.3)
- [samblaster](https://github.com/GregoryFaust/samblaster): [0.1.24](https://github.com/GregoryFaust/samblaster/releases/tag/v.0.1.24)
- [Samtools](https://github.com/samtools/samtools):
  - idxstats, split: [1.9](https://github.com/samtools/samtools/releases/tag/1.9)
  - BAM to CRAM conversion: [1.8](https://github.com/samtools/samtools/releases/tag/1.8)
- [T1k](https://github.com/mourisl/T1K): [1.0.5](https://github.com/mourisl/T1K/releases/tag/v1.0.5)
- [VerifyBamID](https://github.com/Griffan/VerifyBamID): [1.0.2](https://github.com/Griffan/VerifyBamID/releases/tag/1.0.2)

### Alignment

The principal function of this pipeline is the realignment of short reads to a reference genome.

#### Inputting Reads

Read information is passed in through the following inputs:
```yaml
  input_bam_list: "List of input BAM/CRAM/SAM files"
  cram_reference: "If input_bam_list contains CRAM, provide the reference used to generate that CRAM here"
  input_pe_reads_list: "List of input R1 paired end FASTQ reads"
  input_pe_mates_list: "List of input R2 paired end FASTQ reads"
  input_pe_rgs_list: "List of RG strings to use in PE processing"
  input_se_reads_list: "List of input single end FASTQ reads"
  input_se_rgs_list: "List of RG strings to use in SE processing"
```

This workflow accepts reads in any of the following formats:
- Aligned BAM/CRAM/SAM
- Unaligned BAM/CRAM/SAM
- Paired End FASTQ
- Single End FASTQ

Additionally, these files must have appropriate read group information. For
aligned and unaligned BAM/CRAM/SAM files, we assume that information is
contained within the file header. If your files do not have read group (`@RG`)
header information, you **must** add that information to the file header for
the pipeline to work!
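As a purely illustrative sketch (this step is not part of the workflow itself),
one way to add a missing `@RG` line is with `samtools addreplacerg`; the file
names and read group values below are placeholder assumptions:
```bash
# Stamp a new @RG line onto every read in a headerless BAM, then index it.
samtools addreplacerg \
  -r '@RG\tID:sample1_lane1\tLB:library_name\tPL:ILLUMINA\tSM:sample_name' \
  -m overwrite_all \
  -o reads.withrg.bam reads.bam
samtools index reads.withrg.bam
```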

If the user provides CRAM inputs, they must also provide the reference that was
used to generate the CRAM file.

FASTQ files do not have headers; their header information must be provided by
the user at runtime. The inputs that the user provides are handed directly to
the bwa `-R` option. From bwa:
```
-R Complete read group header line. '\t' can be used in STR and will be
   converted to a TAB in the output SAM. The read group ID will be attached to
   every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null]
```

If the user provides multiple single end FASTQs or multiple sets of paired end
FASTQs, they must keep the files and read group lines in the same order.
For example:
```
input_pe_reads_list:
  - sample1_R1.fq
  - sample2_R1.fq
  - sample3_R1.fq
input_pe_mates_list:
  - sample1_R2.fq
  - sample2_R2.fq
  - sample3_R2.fq
input_pe_rgs_list:
  - sample1_rg_line
  - sample2_rg_line
  - sample3_rg_line
```
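To make the hand-off concrete, here is a hypothetical sketch of the kind of
command ultimately issued for the first pair above (the reference path is a
placeholder; the real invocation is constructed inside the workflow):
```bash
# bwa converts the literal '\t' sequences in the -R string into tabs,
# attaching this @RG line to every aligned read.
bwa mem \
  -R '@RG\tID:sample1\tLB:library_name\tPL:ILLUMINA\tSM:sample_name' \
  Homo_sapiens_assembly38.fasta sample1_R1.fq sample1_R2.fq > sample1.sam
```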

##### Activating Reads Processing

Once the user has provided their reads, they must then choose which processing
the pipeline will perform. This choice is controlled by the following
parameters:
```yaml
  run_bam_processing: "BAM/CRAM/SAM processing will be run. Requires: input_bam_list"
  run_pe_reads_processing: "PE reads processing will be run. Requires: input_pe_reads_list, input_pe_mates_list, input_pe_rgs_list"
  run_se_reads_processing: "SE reads processing will be run. Requires: input_se_reads_list, input_se_rgs_list"
```

If the user provides BAM/CRAM/SAMs, they should set `run_bam_processing` to true.
If the user provides paired end FASTQs, they should set `run_pe_reads_processing` to true.
If the user provides single end FASTQs, they should set `run_se_reads_processing` to true.

### Alignment Metrics

After alignment, the pipeline contains a number of optional metrics collection
steps that the user can enable. These tools can be enabled using the following
parameters:
```yaml
  run_hs_metrics: "HsMetrics will be collected. Only recommended for WXS inputs. Requires: wxs_bait_interval_list, wxs_target_interval_list"
  run_wgs_metrics: "WgsMetrics will be collected. Only recommended for WGS inputs. Requires: wgs_coverage_interval_list"
  run_agg_metrics: "AlignmentSummaryMetrics, GcBiasMetrics, InsertSizeMetrics, QualityScoreDistribution, and SequencingArtifactMetrics will be collected."
  run_sex_metrics: "idxstats will be collected and X/Y ratios calculated"
```

In all but the most frugal cases, we recommend running the agg(regate) and sex
metrics. These steps require no additional inputs.

HS and WGS Metrics, however, should only be enabled for appropriate inputs. If
the inputs are WGS, we recommend enabling WGS metrics. If the inputs are WXS,
we recommend enabling HS metrics. Both of these metrics tools require
additional inputs. For WGS, the user must provide a
`wgs_coverage_interval_list`. For WXS, the user must provide both a
`wxs_bait_interval_list` and `wxs_target_interval_list`. While we provide a
`wgs_coverage_interval_list` with the workflow, the `wxs_bait_interval_list`
and `wxs_target_interval_list` are going to be unique to the sequencing
experiment used to generate the reads. These inputs will vary from run to run.
Therefore, the user must provide these intervals.

### Haplotype Calling

Following alignment, this workflow can also perform GATK haplotype calling to
generate a gVCF and metrics files. To generate the gVCF, set
`run_gvcf_processing` to `true` and provide the following optional files:
`dbsnp_vcf`, `contamination_sites_bed`, `contamination_sites_mu`,
`contamination_sites_ud`, `wgs_calling_interval_list`, and
`wgs_evaluation_interval_list`.

These are the relevant gVCF processing inputs and their descriptions:
```yaml
  dbsnp_vcf: "dbSNP vcf file"
  dbsnp_idx: "dbSNP vcf index file"
  contamination_sites_bed: ".bed file for markers used in this analysis, format(chr\tpos-1\tpos\trefAllele\taltAllele)"
  contamination_sites_mu: ".mu matrix file of genotype matrix"
  contamination_sites_ud: ".UD matrix file from SVD result of genotype matrix"
  wgs_calling_interval_list: "WGS interval list used to aid scattering HaplotypeCaller"
  wgs_evaluation_interval_list: "Target intervals to restrict gVCF metric analysis (for VariantCallingMetrics)"
  run_gvcf_processing: "gVCF will be generated. Requires: dbsnp_vcf, contamination_sites_bed, contamination_sites_mu, contamination_sites_ud, wgs_calling_interval_list, wgs_evaluation_interval_list"
```
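For orientation only, a single scattered HaplotypeCaller shard behaves roughly
like the sketch below. It uses current GATK4 syntax rather than the 4.beta.1
build pinned by the workflow, and every path, interval, and the contamination
fraction shown are placeholders:
```bash
# One shard of gVCF calling; the workflow runs many of these over the
# wgs_calling_interval_list and merges the results with Picard MergeVcfs.
gatk HaplotypeCaller \
  -R Homo_sapiens_assembly38.fasta \
  -I sample.recalibrated.bam \
  -L shard-00.interval_list \
  -D Homo_sapiens_assembly38.dbsnp138.vcf \
  -contamination 0.001 \
  -ERC GVCF \
  -O sample.shard-00.g.vcf.gz
```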

### HLA Genotyping

This workflow can also collect HLA genotyping data. To read more on this
feature, please see [our documentation](./T1K_README.md). This feature is
on by default but can be disabled by setting `run_t1k` to `false`. HLA
genotyping also requires the `hla_dna_ref_seqs` and `hla_dna_gene_coords`
inputs. The public app provides default files for these. If you are using
a different reference genome, see the [T1k readme](https://github.com/mourisl/T1K?tab=readme-ov-file#install)
for instructions on creating these files.

## Outputs

Below is the complete list of outputs that can be generated by the pipeline:
```yaml
  cram: "(Re)Aligned Reads File"
  gvcf: "Genomic VCF generated from the realigned alignment file."
  verifybamid_output: "Output from VerifyBamID that is used to calculate contamination."
  cutadapt_stats: "Stats from Cutadapt runs, if run. One or more per read group."
  bqsr_report: "Recalibration report from BQSR."
  gvcf_calling_metrics: "General metrics for gVCF calling quality."
  hs_metrics: "Picard CollectHsMetrics metrics for the analysis of target-capture sequencing experiments."
  wgs_metrics: "Picard CollectWgsMetrics metrics for evaluating the performance of whole genome sequencing experiments."
  alignment_metrics: "Picard CollectAlignmentSummaryMetrics high level metrics about the alignment of reads within a SAM/BAM file."
  gc_bias_detail: "Picard CollectGcBiasMetrics detailed metrics about reads that fall within windows of a certain GC bin on the reference genome."
  gc_bias_summary: "Picard CollectGcBiasMetrics high level metrics that capture how biased the coverage in a certain lane is."
  gc_bias_chart: "Picard CollectGcBiasMetrics plot of GC bias."
  insert_metrics: "Picard CollectInsertSizeMetrics metrics about the insert size distribution of a paired-end library."
  insert_plot: "Picard CollectInsertSizeMetrics insert size distribution plotted."
  artifact_bait_bias_detail_metrics: "Picard CollectSequencingArtifactMetrics bait bias artifacts broken down by context."
  artifact_bait_bias_summary_metrics: "Picard CollectSequencingArtifactMetrics summary analysis of a single bait bias artifact."
  artifact_error_summary_metrics: "Picard CollectSequencingArtifactMetrics summary metrics as a roll up of the context-specific error rates, to provide global error rates per type of base substitution."
  artifact_pre_adapter_detail_metrics: "Picard CollectSequencingArtifactMetrics pre-adapter artifacts broken down by context."
  artifact_pre_adapter_summary_metrics: "Picard CollectSequencingArtifactMetrics summary analysis of a single pre-adapter artifact."
  qual_metrics: "Quality metrics for the realigned CRAM."
  qual_chart: "Visualization of quality metrics."
  idxstats: "samtools idxstats of the realigned reads file."
  xy_ratio: "Text file containing X and Y reads statistics generated from idxstats."
  t1k_genotype_tsv: "HLA genotype results from T1k"
```

## Caveats:
- Duplicates are flagged in a process that is connected to bwa mem. The
  implication of this design decision is that duplicates are flagged only on the
  inputs that are scattered into bwa. Duplicates, therefore, are not being
  flagged at a library level and, for large BAM and FASTQ inputs, duplicates are
  only being detected within a portion of the read group.

## Tips for running:
- For the FASTQ input file lists (PE or SE), make sure the lists are properly
  ordered. The items in the arrays are processed based on their position. These
  lists are dotproduct scattered. This means that the first file in
  `input_pe_reads_list` is run with the first file in `input_pe_mates_list` and
  the first string in `input_pe_rgs_list`. This also means these arrays must be
  the same length or the workflow will fail.
- The input for the `reference_tar` must be a tar file containing the
  reference fasta along with its indexes. The required indexes are
  `[.64.ann,.64.amb,.64.bwt,.64.pac,.64.sa,.dict,.fai]` and are generated by bwa,
  Picard, and Samtools. Additionally, an `.64.alt` index is recommended. The
  following is an example of a complete reference tar input:
```
~ tar tf Homo_sapiens_assembly38.tgz
Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta
Homo_sapiens_assembly38.fasta.64.alt
Homo_sapiens_assembly38.fasta.64.amb
Homo_sapiens_assembly38.fasta.64.ann
Homo_sapiens_assembly38.fasta.64.bwt
Homo_sapiens_assembly38.fasta.64.pac
Homo_sapiens_assembly38.fasta.64.sa
Homo_sapiens_assembly38.fasta.fai
```
- If you are making your own bwa indexes, make sure to use the `-6` flag to
  obtain the `.64` version of the indexes. Indexes that do not match this naming
  schema will cause a failure in certain runner ecosystems (see the sketch below).
- Should you decide to create your own reference indexes and omit the ALT
  index file from the reference, or if its naming structure mismatches the other
  indexes, then your alignments will be equivalent to the results you would
  obtain if you run BWA-MEM with the `-j` option.
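Tying the index tips together, a hypothetical build session might look like the
following (it assumes bwa, samtools, and Picard are on the PATH; exact
invocations vary by tool version):
```bash
# Build the .64-flavored BWA indexes plus the Picard and samtools companions,
# then bundle everything into a reference_tar the workflow can consume.
# (The optional .64.alt file comes from the ALT contig definitions and must
# be added separately.)
bwa index -6 Homo_sapiens_assembly38.fasta
samtools faidx Homo_sapiens_assembly38.fasta
picard CreateSequenceDictionary R=Homo_sapiens_assembly38.fasta O=Homo_sapiens_assembly38.dict
tar czf Homo_sapiens_assembly38.tgz \
  Homo_sapiens_assembly38.dict Homo_sapiens_assembly38.fasta Homo_sapiens_assembly38.fasta.*
```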
- For advanced usage, users can skip the knownsites indexing by providing the
  `knownsites_indexes` input. This file list should contain the indexes for each
  of the files in your knownsites input. Please note this list must be ordered in
  such a way that the position of the index file in the `knownsites_indexes`
  list corresponds with the position of the VCF file in the knownsites list
  that it indexes. Failure to order in this way will result in the pipeline
  failing or generating erroneous files.
- For large BAM inputs, users may encounter a scenario where jobs fail during
  the bamtofastq step. The given error will recommend that users try increasing
  disk space. Increasing the disk space will solve the error for all but the
  largest inputs. For those extremely large inputs with many read groups that
  continue to get this error, it is recommended that users increase the value for
  `bamtofastq_cpu`. Increasing this value will decrease the number of concurrent
  bamtofastq jobs that run on the same instance.

## Basic Info
- [D3b dockerfiles](https://github.com/d3b-center/bixtools)
- Testing Tools:
  - [CAVATICA Platform](https://cavatica.sbgenomics.com/)
  - [Common Workflow Language reference implementation (cwltool)](https://github.com/common-workflow-language/cwltool/)

## References
- KFDRC AWS S3 bucket: s3://kids-first-seq-data/broad-references/
- CAVATICA: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
- Broad Institute Google Cloud: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0
diff --git a/docs/KFDRC_SENTIEON_ALIGNMENT_GVCF_WORKFLOW_README.md b/docs/KFDRC_SENTIEON_ALIGNMENT_GVCF_WORKFLOW_README.md
index fd8cf60..625ae9c 100644
--- a/docs/KFDRC_SENTIEON_ALIGNMENT_GVCF_WORKFLOW_README.md
+++ b/docs/KFDRC_SENTIEON_ALIGNMENT_GVCF_WORKFLOW_README.md
@@ -1,14 +1,14 @@
-# Kids First Data Resource Center Sentieon Short Reads Alignment and gVCF Workflow
+# Kids First Data Resource Center Sentieon Short Reads Alignment and Haplotyper Workflow

Kids First repository logo

-The Kids First Data Resource Center (KFDRC) Short Reads Alignment and GATK +The Kids First Data Resource Center (KFDRC) Sentieon Short Reads Alignment and Haplotyper Workflow is a Common Workflow Language (CWL) implementation of various software used to take reads generated by next generation sequencing (NGS) technologies and use those reads to generate alignment and, optionally, -variant information. This workflow mirrors the approach of our existing [Input Agnostic Alignment Workflow](https://github.com/kids-first/kf-alignment-workflow#input-agnostic-alignment-workflow), +variant information. This workflow mirrors the approach of our existing [BWA-GATK Workflow](./KFDRC_BWA_GATK_ALIGNMENT_GVCF_WORKFLOW_README.md), and the two have been internally benchmarked as functionally equivalent. The key difference between the two workflows is found in the tools used during the alignment process. @@ -62,6 +62,7 @@ that match our existing Picard metrics suite. |----------------------------|-----------------------|-----------------------------------| | Bam to Read Group (RG) BAM | samtools split | samtools split | | RG Bam to Fastq | biobambam2 bamtofastq | biobambam2 bamtofastq | +| Adapter Trimming | cutadapt | cutadapt | | Fastq to RG Bam | bwa mem | Sentieon bwa mem | | Merge RG Bams | sambamba merge | Sentieon ReadWriter | | Sort Bam | sambamba sort | Sentieon ReadWriter | @@ -71,6 +72,8 @@ that match our existing Picard metrics suite. | Gather Recalibrated BAMs | Picard GatherBamFiles | No splitting occurs in Sentieon | | Bam to Cram | samtools view | Sentieon ReadWriter | | Metrics | Picard | Sentieon | +| Sex Metrics | samtools idxstats | samtools idxstats | +| HLA Genotyping | T1k | T1k | ## Sentieon gVCF Creation: Similarities and Differences diff --git a/readme.md b/readme.md index 6c28d0f..1db06d6 100644 --- a/readme.md +++ b/readme.md @@ -1,12 +1,7 @@ -# Kids First Data Resource Center Alignment and GATK HaplotypeCaller Workflows +# Kids First Data Resource Center DNA Short Reads Alignment and Haplotype Calling Workflows -The Kids First Data Resource Center Alignment and Haplotype Calling Workflow (bam/fastq-to-cram, gVCF optional) follows -Broad best practices outlined in [Data pre-processing for variant discovery.](https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery) -It uses bam/fastq input and aligns/re-aligns to a bwa-indexed reference fasta, version hg38. Resultant bam is de-dupped and -base score recalibrated. Contamination is calculated and a gVCF is created optionally using GATK4 vbeta.1-3.5 HaplotypeCaller. Inputs from -this can be used later on for further analysis in joint trio genotyping and subsequent refinement and deNovo variant analysis. If you would like to run this workflow using the cavatica public app, a basic primer on running public apps can be found [here](https://www.notion.so/d3b/Starting-From-Scratch-Running-Cavatica-af5ebb78c38a4f3190e32e67b4ce12bb). - Alternatively, if you'd like to run it locally using `cwltool`, a basic primer on that can be found [here](https://www.notion.so/d3b/Starting-From-Scratch-Running-CWLtool-b8dbbde2dc7742e4aff290b0a878344d) and combined with app-specific info from the readme below. 
- This workflow is the current production workflow, equivalent to this [Cavatica public app](https://cavatica.sbgenomics.com/public/apps#cavatica/apps-publisher/kfdrc-alignment-workflow) and supersedes the [old workflow](https://github.com/kids-first/kf-alignment-workflow/tree/1.0.0) and [public app](https://cavatica.sbgenomics.com/public/apps#kids-first-drc/kids-first-drc-alignment-workflow/kfdrc-alignment-bam2cram2gvcf/); however outputs are considered equivalent. +This repository contains the Common Workflow Language (CWL) pipelines for the +Kids First Data Resource Center's (KFDRC) Alignment and Haplotype Calling Workflows.

Kids First repository logo @@ -15,406 +10,27 @@ this can be used later on for further analysis in joint trio genotyping and subs

-## Input Agnostic Alignment Workflow
-Workflow for the alignment or realignment of input SAMs/BAMs/CRAMs
-(Alignment/Map files, or AMs), PE reads, and/or SE reads; conditionally
-generate gVCF and metrics.
+## KFDRC BWA-GATK Short Reads Alignment and HaplotypeCaller Workflow

-This workflow is a all-in-one workflow for handling any kind of reads inputs:
-BAM inputs, PE reads and mates inputs, SE reads inputs, or any combination of
-these. The workflow will naively attempt to process these depending on what you
-tell it you have provided. The user informs the workflow of which inputs to
-process using three boolean inputs: `run_bam_processing`,
-`run_pe_reads_processing`, and `run_se_reads_processing`. Providing `true`
-values for these as well their corresponding inputs will result in those inputs
-being processed.
+This workflow follows the best practices pipeline outlined by the Broad Institute in [Data pre-processing for variant discovery](https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery). See our documentation for more details.
+- [Documentation](./docs/KFDRC_BWA_GATK_ALIGNMENT_GVCF_WORKFLOW_README.md)
+- [CWL Workflow](./workflows/kfdrc_alignment_wf.cwl)
+- [CAVATICA public app](https://cavatica.sbgenomics.com/public/apps#cavatica/apps-publisher/kfdrc-alignment-workflow)

-The second half of the workflow deals with optional gVCF creation and metrics
-collection. This workflow is capable of collecting the metrics using the
-following boolean flags: `run_hs_metrics`, `run_wgs_metrics`, and
-`run_agg_metrics`. To run these metrics, additional optional inputs must also
-be provided: `wxs_bait_interval_list` and `wxs_target_interval_list` for
-HsMetrics, `wgs_coverage_interval_list` for WgsMetrics. To generate the gVCF,
-set `run_gvcf_processing` to `true` and provide the following optional files:
-`dbsnp_vcf`, `contamination_sites_bed`, `contamination_sites_mu`,
-`contamination_sites_ud`, `wgs_calling_interval_list`, and
-`wgs_evaluation_interval_list`. Additionally, the workflow is capable of
-performing a basic evaluation of the X and Y sex chromosomes using idxstats. To
-activate this feature, set `run_sex_metrics` to `true`; no additonal inputs are
-required. Finally, the workflow is capable of collecting HLA genotype data
-using [T1k](./docs/T1K_README.md). This feature is on by default but can be
-disabled by setting `run_t1k` to `false`. HLA genotyping also requires the
-`hla_dna_ref_seqs` and `hla_dna_gene_coords` inputs. The public app provides
-default files for these.
+## KFDRC Sentieon Short Reads Alignment and Haplotyper Workflow

-### Basic Info
-- dockerfiles: https://github.com/d3b-center/bixtools
-- tested with
-  - Seven Bridges Cavatica Platform: https://cavatica.sbgenomics.com/
-  - cwltool: https://github.com/common-workflow-language/cwltool/releases/tag/3.0.20200324120055
+This workflow is a functionally equivalent implementation of the above, written by the team at Sentieon. See our documentation for more details.
+- [Documentation](./docs/KFDRC_SENTIEON_ALIGNMENT_GVCF_WORKFLOW_README.md) +- [CWL Workflow](./workflows/kfdrc_sentieon_alignment_wf.cwl) +- [CAVATICA public app](https://cavatica.sbgenomics.com/public/apps/cavatica/apps-publisher/kfdrc-sentieon-alignment-workflow) -### References: -- Broad Google Cloud Storage: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0 -- KFDRC AWS S3: s3://kids-first-seq-data/broad-references/ -- CAVATICA: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/ - -### Inputs: -```yaml - # REQUIRED - reference_tar: { type: File, doc: "Tar file containing a reference fasta and, optionally, its complete set of associated indexes (samtools, bwa, and picard)" } - biospecimen_name: { type: string, doc: "String name of biospcimen" } - output_basename: { type: string, doc: "String to use as the base for output filenames" } - knownsites: { type: 'File[]', doc: "List of files containing known polymorphic sites used to exclude regions around known polymorphisms from analysis" } - knownsites_indexes: { type: 'File[]?', doc: "Corresponding indexes for the knownsites. File position in list must match with its corresponding VCF's position in the knownsites file list. For example, if the first file in the knownsites list is 1000G_omni2.5.hg38.vcf.gz then the first item in this list must be 1000G_omni2.5.hg38.vcf.gz.tbi. Optional, but will save time/cost on indexing." } - # REQUIRED for gVCF - dbsnp_vcf: { type: 'File?', doc: "dbSNP vcf file" } - dbsnp_idx: { type: 'File?', doc: "dbSNP vcf index file" } - contamination_sites_bed: { type: 'File?', doc: ".bed file for markers used in this analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele)" } - contamination_sites_mu: { type: 'File?', doc: ".mu matrix file of genotype matrix" } - contamination_sites_ud: { type: 'File?', doc: ".UD matrix file from SVD result of genotype matrix" } - run_gvcf_processing: { type: boolean, doc: "gVCF will be generated. Requires: dbsnp_vcf, contamination_sites_bed, contamination_sites_mu, contamination_sites_ud, wgs_calling_interval_list, wgs_evaluation_interval_list" } - # ADJUST TO FIT INPUT READS TYPE(S) - input_bam_list: { type: 'File[]?', doc: "List of input BAM files" } - input_pe_reads_list: { type: 'File[]?', doc: "List of input R1 paired end fastq reads" } - input_pe_mates_list: { type: 'File[]?', doc: "List of input R2 paired end fastq reads" } - input_pe_rgs_list: { type: 'string[]?', doc: "List of RG strings to use in PE processing" } - input_se_reads_list: { type: 'File[]?', doc: "List of input single end fastq reads" } - input_se_rgs_list: { type: 'string[]?', doc: "List of RG strings to use in SE processing" } - run_bam_processing: { type: boolean, doc: "BAM processing will be run. Requires: input_bam_list" } - run_pe_reads_processing: { type: boolean, doc: "PE reads processing will be run. Requires: input_pe_reads_list, input_pe_mates_list, input_pe_rgs_list" } - cram_reference: { type: 'File?', doc: "If aligning from cram, need to provided reference used to generate that cram" } - run_se_reads_processing: { type: boolean, doc: "SE reads processing will be run. 
Requires: input_se_reads_list, input_se_rgs_list" } - # IF WGS or CREATE gVCF - wgs_calling_interval_list: { type: 'File?', doc: "WGS interval list used to aid scattering Haplotype caller" } - wgs_coverage_interval_list: { type: 'File?', doc: "An interval list file that contains the positions to restrict the wgs metrics assessment" } - wgs_evaluation_interval_list: { type: 'File?', doc: "Target intervals to restrict gVCF metric analysis (for VariantCallingMetrics)" } - # IF WXS - wxs_bait_interval_list: { type: 'File?', doc: "An interval list file that contains the locations of the WXS baits used (for HsMetrics)" } - wxs_target_interval_list: { type: 'File?', doc: "An interval list file that contains the locations of the WXS targets (for HsMetrics)" } - # ADJUST TO GENERATE METRICS - run_hs_metrics: { type: boolean, doc: "HsMetrics will be collected. Only recommended for WXS inputs. Requires: wxs_bait_interval_list, wxs_target_interval_list" } - run_wgs_metrics: { type: boolean, doc: "WgsMetrics will be collected. Only recommended for WGS inputs. Requires: wgs_coverage_interval_list" } - run_agg_metrics: { type: boolean, doc: "AlignmentSummaryMetrics, GcBiasMetrics, InsertSizeMetrics, QualityScoreDistribution, and SequencingArtifactMetrics will be collected. Recommended for both WXS and WGS inputs." } - run_sex_metrics: {type: boolean, doc: "idxstats will be collected and X/Y ratios calculated"} - # ADVANCED - cutadapt_r1_adapter: { type: 'string?', doc: "If read1 reads have an adapter, provide regular 3' adapter sequence here to remove it from read1" } - cutadapt_r2_adapter: { type: 'string?', doc: "If read2 reads have an adapter, provide regular 3' adapter sequence here to remove it from read2" } - cutadapt_min_len: { type: 'int?', doc: "If adapter trimming, discard reads/read-pairs where the read length is less than this value. Set to 0 to turn off" } - cutadapt_quality_base: { type: 'int?', doc: "If adapter trimming, use this value as the base quality score. Defaults to 33 but very old reads might need this value set to 64" } - cutadapt_quality_cutoff: { type: 'string?', doc: "If adapter trimming, remove bases from the 3'/5' that fail to meet this cutoff value. If you specify a single cutoff value, the 3' end of each read is trimmed. If you specify two cutoff values separated by a comma, the first value will be trimmed from the 5' and the second value will be trimmed from the 3'" } - min_alignment_score: { type: 'int?', default: 30, doc: "For BWA MEM, Don't output alignment with score lower than INT. This option only affects output." } - bamtofastq_cpu: { type: 'int?', doc: "CPUs to allocate to bamtofastq" } - run_t1k: { type: 'boolean?', default: true, doc: "Set to false to disable T1k HLA typing" } - hla_dna_ref_seqs: { type: 'File?', doc: "FASTA file containing the HLA allele reference sequences for DNA." } - hla_dna_gene_coords: { type: 'File?', doc: "FASTA file containing the coordinates of the HLA genes for DNA." 
} - t1k_abnormal_unmap_flag: { type: 'boolean?', doc: "Set if the flag in BAM for the unmapped read-pair is nonconcordant" } -``` - -### Outputs: -```yaml - cram: {type: File, outputSource: samtools_bam_to_cram/output, doc: "(Re)Aligned Reads File"} - gvcf: {type: 'File[]?', outputSource: generate_gvcf/gvcf, doc: "Genomic VCF generated from the realigned alignment file."} - verifybamid_output: {type: 'File[]?', outputSource: generate_gvcf/verifybamid_output, doc: "Ouput from VerifyBamID that is used to calculate contamination."} - cutadapt_stats: { type: 'File[]?', outputSource: collect_cutadapt_stats/out_filelist, doc: "Stats from Cutadapt runs, if run. One or more per read group." } - bqsr_report: {type: File, outputSource: gatk_gatherbqsrreports/output, doc: "Recalibration report from BQSR."} - gvcf_calling_metrics: {type: ['null', {type: array, items: {type: array, items: File}}], outputSource: generate_gvcf/gvcf_calling_metrics, doc: "General metrics for gVCF calling quality."} - hs_metrics: {type: 'File[]?', outputSource: picard_collecthsmetrics/output, doc: "Picard CollectHsMetrics metrics for the analysis of target-capture sequencing experiments."} - wgs_metrics: {type: 'File[]?', outputSource: picard_collectwgsmetrics/output, doc: "Picard CollectWgsMetrics metrics for evaluating the performance of whole genome sequencing experiments."} - alignment_metrics: {type: 'File[]?', outputSource: picard_collectalignmentsummarymetrics/output, doc: "Picard CollectAlignmentSummaryMetrics high level metrics about the alignment of reads within a SAM file."} - gc_bias_detail: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/detail, doc: "Picard CollectGcBiasMetrics detailed metrics about reads that fall within windows of a certain GC bin on the reference genome."} - gc_bias_summary: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/summary, doc: "Picard CollectGcBiasMetrics high level metrics that capture how biased the coverage in a certain lane is."} - gc_bias_chart: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/chart, doc: "Picard CollectGcBiasMetrics plot of GC bias."} - insert_metrics: {type: 'File[]?', outputSource: picard_collectinsertsizemetrics/metrics, doc: "Picard CollectInsertSizeMetrics metrics about the insert size distribution of a paired-end library."} - insert_plot: {type: 'File[]?', outputSource: picard_collectinsertsizemetrics/plot, doc: "Picard CollectInsertSizeMetrics insert size distribution plotted."} - artifact_bait_bias_detail_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/bait_bias_detail_metrics, doc: "Picard CollectSequencingArtifactMetrics bait bias artifacts broken down by context."} - artifact_bait_bias_summary_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/bait_bias_summary_metrics, doc: "Picard CollectSequencingArtifactMetrics summary analysis of a single bait bias artifact."} - artifact_error_summary_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/error_summary_metrics, doc: "Picard CollectSequencingArtifactMetrics summary metrics as a roll up of the context-specific error rates, to provide global error rates per type of base substitution."} - artifact_pre_adapter_detail_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/pre_adapter_detail_metrics, doc: "Picard CollectSequencingArtifactMetrics pre-adapter artifacts broken down by context."} - artifact_pre_adapter_summary_metrics: {type: 'File[]?', 
outputSource: picard_collectsequencingartifactmetrics/pre_adapter_summary_metrics, doc: "Picard CollectSequencingArtifactMetrics summary analysis of a single pre-adapter artifact."} - qual_metrics: {type: 'File[]?', outputSource: picard_qualityscoredistribution/metrics, doc: "Quality metrics for the realigned CRAM."} - qual_chart: {type: 'File[]?', outputSource: picard_qualityscoredistribution/chart, doc: "Visualization of quality metrics."} - idxstats: {type: 'File?', outputSource: samtools_idxstats_xy_ratio/output, doc: "samtools idxstats of the realigned BAM file."} - xy_ratio: {type: 'File?', outputSource: samtools_idxstats_xy_ratio/ratio, doc: "Text file containing X and Y reads statistics generated from idxstats."} - t1k_genotype_tsv: {type: 'File?', outputSource: t1k/genotype_tsv, doc: "HLA genotype results from T1k" } -``` - -#### Detailed Input Information: -The pipeline is build to handle three distinct input types: -1. SAMs/BAMs/CRAMs (Alignment/Map files, or AMs) -1. PE Fastqs -1. SE Fastqs - -Additionally, the workflow supports these three in any combination. You can have PE Fastqs and AMs, -PE Fastqs and SE Fastqs, AMs and PE Fastqs and SE Fastqs, etc. Each of these three classes will be -procsessed and aligned separately and the resulting BWA aligned bams will be merged into a final BAM -before performing steps like BQSR and Metrics collection. - -##### Alignment/Map Inputs -The Alignment/Map processing portion of the pipeline is the simplest when it comes to inputs. You may provide -a single Alignment/Map file or many AMs. The input for AMs is a file list. In Cavatica or other GUI interfaces, -simply select the files you wish to process. For command line interfaces such as cwltool, your input -should look like the following. -```json -{ - ..., - "run_pe_reads_processing": false, - "run_se_reads_processing": false, - "run_bam_processing": true, - "input_bam_list": [ - { - "class": "File", - "location": "/path/to/bam1.bam" - }, - { - "class": "File", - "location": "/path/to/bam2.bam" - } - ], - ... -} -``` - -##### SE Fastq Inputs -SE fastq processing requires more input to build the jobs correctly. Rather than providing a single -list you must provide two lists: `input_se_reads_list` and `input_se_rgs_list`. The `input_se_reads_list` -is where you put the files and the `input_se_rgs_list` is where you put your desired BAM @RG headers for -each reads file. These two lists are must be ordered and of equal length. By ordered, that means the -first item of the `input_se_rgs_list` will be used when aligning the first item of the `input_se_reads_list`. -IMPORTANT NOTE: When you are entering the rg names, you need to use a second escape `\` to the tab values `\t` -as seen below. When the string value is read in by a tool such as cwltool it will interpret a `\\t` input -as `\t` and a `\t` as the literal `` value which is not a valid entry for bwa mem. -If you are using Cavatica GUI, however, no extra escape is necessary. The GUI will add an extra -escape to any tab values you enter. - -In Cavatica make sure to double check that everything is in the right order when you enter the inputs. -In command line interfaces such as cwltool, your input should look like the following. 
-```json -{ - ..., - "run_pe_reads_processing": false, - "run_se_reads_processing": true, - "run_bam_processing": false, - "input_se_reads_list": [ - { - "class": "File", - "location": "/path/to/single1.fastq" - }, - { - "class": "File", - "location": "/path/to/single2.fastq" - } - ], - "inputs_se_rgs_list": [ - "@RG\\tID:single1\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name", - "@RG\\tID:single2\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name" - ], - ... -} -``` -Take particular note of how the first item in the rgs list is the metadata for the first item in the fastq list. - -##### PE Fastq Inputs -PE Fastq processing inputs is exactly like SE Fastq processing but requires you to provide the paired mates -files for your input paired reads. Once again, when using Cavatica make sure your inputs are in the correct -order. In command line interfaces such as cwltool, your input should look like the following. -```json -{ - ..., - "run_pe_reads_processing": true, - "run_se_reads_processing": false, - "run_bam_processing": false, - "input_pe_reads_list": [ - { - "class": "File", - "location": "/path/to/sample1_R1.fastq" - }, - { - "class": "File", - "location": "/path/to/sample2_R1fastq" - }, - { - "class": "File", - "location": "/path/to/sample3_R1.fastq" - } - ], - "input_pe_mates_list": [ - { - "class": "File", - "location": "/path/to/sample1_R2.fastq" - }, - { - "class": "File", - "location": "/path/to/sample2_R2.fastq" - }, - { - "class": "File", - "location": "/path/to/sample3_R2.fastq" - } - ], - "inputs_pe_rgs_list": [ - "@RG\\tID:sample1\\tLB:library_name\\tPL:ILLUMINA\tSM:sample_name", - "@RG\\tID:sample2\\tLB:library_name\\tPL:ILLUMINA\tSM:sample_name", - "@RG\\tID:sample3\\tLB:library_name\\tPL:ILLUMINA\tSM:sample_name" - ], - ... -} -``` - -##### Multiple Input Types -As mentioned above, these three input types can be added in any combination. If you wanted to add -all three your command line input would look like the following. -```json -{ - ..., - "run_pe_reads_processing": true, - "run_se_reads_processing": true, - "run_bam_processing": true, - "input_bam_list": [ - { - "class": "File", - "location": "/path/to/bam1.bam" - }, - { - "class": "File", - "location": "/path/to/bam2.bam" - } - ], - "input_se_reads_list": [ - { - "class": "File", - "location": "/path/to/single1.fastq" - }, - { - "class": "File", - "location": "/path/to/single2.fastq" - } - ], - "inputs_se_rgs_list": [ - "@RG\\tID:single1\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name", - "@RG\\tID:single2\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name" - ], - "input_pe_reads_list": [ - { - "class": "File", - "location": "/path/to/sample1_R1.fastq" - }, - { - "class": "File", - "location": "/path/to/sample2_R1fastq" - }, - { - "class": "File", - "location": "/path/to/sample3_R1.fastq" - } - ], - "input_pe_mates_list": [ - { - "class": "File", - "location": "/path/to/sample1_R2.fastq" - }, - { - "class": "File", - "location": "/path/to/sample2_R2.fastq" - }, - { - "class": "File", - "location": "/path/to/sample3_R2.fastq" - } - ], - "inputs_pe_rgs_list": [ - "@RG\\tID:sample1\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name", - "@RG\\tID:sample2\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name", - "@RG\\tID:sample3\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name" - ], - ... -} -``` - -#### Example Runtimes: -1. 120 GB WGS BAM with AggMetrics, WgsMetrics, and gVCF creation: 14 hours & $35 -1. 120 GB WGS BAM only: 11 hours -1. 
4x40 GB WGS FASTQ files with AggMetrics, WgsMetrics, and gVCF creation: 23 hours & $72 -1. 4x40 GB WGS FASTQ files only: 18 hours -1. 4x9 GB WXS FASTQ files with AggMetrics and gVCF creation: 4 hours & $9 -1. 4x9 GB WXS FASTQ files only: 3 hours - -#### Caveats: -1. Duplicates are flagged in a process that is connected to bwa mem. The implication of this design - decision is that duplicates are flagged only on the inputs of that are scattered into bwa. - Duplicates, therefore, are not being flagged at a library level and, for large BAM and FASTQ inputs, - duplicates are only being detected within a portion of the read group. - -#### Tips for running: -1. For the fastq input file lists (PE or SE), make sure the lists are properly ordered. The items in - the arrays are processed based on their position. These lists are dotproduct scattered. This means - that the first file in `input_pe_reads_list` is run with the first file in `input_pe_mates_list` - and the first string in `input_pe_rgs_list`. This also means these arrays must be the same - length or the workflow will fail. -1. The input for the reference_tar must be a tar file containing the reference fasta along with its indexes. - The required indexes are `[.64.ann,.64.amb,.64.bwt,.64.pac,.64.sa,.dict,.fai]` and are generated by bwa, picard, and samtools. - Additionally, an `.64.alt` index is recommended. -1. If you are making your own bwa indexes make sure to use the `-6` flag to obtain the `.64` version of the - indexes. Indexes that do not match this naming schema will cause a failure in certain runner ecosystems. -1. Should you decide to create your own reference indexes and omit the ALT index file from the reference, - or if its naming structure mismatches the other indexes, then your alignments will be equivalent to the results you would - obtain if you run BWA-MEM with the -j option. -1. The following is an example of a complete reference tar input: -``` -~ tar tf Homo_sapiens_assembly38.tgz -Homo_sapiens_assembly38.dict -Homo_sapiens_assembly38.fasta -Homo_sapiens_assembly38.fasta.64.alt -Homo_sapiens_assembly38.fasta.64.amb -Homo_sapiens_assembly38.fasta.64.ann -Homo_sapiens_assembly38.fasta.64.bwt -Homo_sapiens_assembly38.fasta.64.pac -Homo_sapiens_assembly38.fasta.64.sa -Homo_sapiens_assembly38.fasta.fai -``` -1. For advanced usage, you can skip the knownsite indexing by providing the knownsites_indexes input. - This file list should contain the indexes for each of the files in your knownsites input. Please - note this list must be ordered in such a way where the position of the index file in the - knownsites_indexes list must correspond with the position of the VCF file in the knownsites list - that it indexes. In the example input below you can see that the 1000G_omni2.5.hg38.vcf.gz.tbi - file is the fourth item in the knownsites_indexes because the 1000G_omni2.5.hg38.vcf.gz file is the - fourth item in the knownsites list. Failure to order in this way will result in the pipeline - failing or generating erroneous files. -1. Turning off gVCF creation and metrics collection for a minimal successful run. -1. Suggested reference inputs (available from the [Broad Resource Bundle](https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0)): -1. For large BAM inputs, users may encounter a scenario where jobs fail during - the bamtofastq step. The given error will recommend that users try increasing - disk space. Increasing the disk space will solve the error for all but the - largest inputs. 
For those extremely large inputs with many read groups that - continue to get this error, it is recommended that users increase the value for - `bamtofastq_cpu`. Increasing this value will decrease the number of concurrent - bamtofastq jobs that run on the same instance. -```yaml -contamination_sites_bed: Homo_sapiens_assembly38.contam.bed -contamination_sites_mu: Homo_sapiens_assembly38.contam.mu -contamination_sites_ud: Homo_sapiens_assembly38.contam.UD -dbsnp_vcf: Homo_sapiens_assembly38.dbsnp138.vcf -reference_tar: Homo_sapiens_assembly38.tgz -knownsites: - - Homo_sapiens_assembly38.known_indels.vcf.gz - - Mills_and_1000G_gold_standard.indels.hg38.vcf.gz - - 1000G_phase1.snps.high_confidence.hg38.vcf.gz - - 1000G_omni2.5.hg38.vcf.gz -knownsites_indexes: - - Homo_sapiens_assembly38.known_indels.vcf.gz.tbi - - Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi - - 1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi - - 1000G_omni2.5.hg38.vcf.gz.tbi -``` - -## KFDRC GATK HaplotypeCaller CRAM to gVCF Workflow - -This workflow taks a CRAM file, converts it to a BAM, determines a -contamination value, then runs GATK HaplotypeCaller to generate a gVCF, gVCF -calling metrics, and, if no contamination value is provided, the VerifyBAMID -output. Additionally, if the user sets the `run_sex_metrics` input to true, two -additional outputs from samtools idxstats will be provided. - -This workflow is the current production workflow, equivalent to this [Cavatica public app](https://cavatica.sbgenomics.com/public/apps#cavatica/apps-publisher/kfdrc-gatk-haplotypecaller-workflow). - -Read more [here](docs/KFDRC_GATK_HAPLOTYPECALLER_CRAM_TO_GVCF_WORKFLOW_README.md). +## Other workflows +- KFDRC GATK HaplotypeCaller CRAM to gVCF Workflow + - [Documentation](./docs/KFDRC_GATK_HAPLOTYPECALLER_CRAM_TO_GVCF_WORKFLOW_README.md) + - [CWL Workflow](./workflows/kfdrc-gatk-haplotypecaller-wf.cwl) + - [CAVATICA public app](https://cavatica.sbgenomics.com/public/apps/cavatica/apps-publisher/kfdrc-gatk-haplotypecaller-workflow) +- Kids First Data Resource Center Sentieon gVCF Workflow + - [Documentation](./docs/KFDRC_SENTIEON_GVCF_WORKFLOW_README.md) + - [CWL Workflow](./workflows/kfdrc_sentieon_gvcf_wf.cwl) + - [CAVATICA public app](https://cavatica.sbgenomics.com/public/apps/cavatica/apps-publisher/kfdrc_sentieon_gvcf_wf) diff --git a/workflows/kfdrc_alignment_wf.cwl b/workflows/kfdrc_alignment_wf.cwl index 321c1b2..542e9a3 100644 --- a/workflows/kfdrc_alignment_wf.cwl +++ b/workflows/kfdrc_alignment_wf.cwl @@ -2,368 +2,262 @@ cwlVersion: v1.2 class: Workflow id: kfdrc-alignment-workflow label: Kids First DRC Alignment and GATK HaplotypeCaller Workflow -doc: |+ - # Kids First Data Resource Center Alignment and GATK HaplotypeCaller Workflows +doc: | + # Kids First Data Resource Center BWA-GATK Short Reads Alignment and HaplotypeCaller Workflow - The Kids First Data Resource Center Alignment and Haplotype Calling Workflow (bam/fastq-to-cram, gVCF optional) follows - Broad best practices outlined in [Data pre-processing for variant discovery.](https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery) - It uses bam/fastq input and aligns/re-aligns to a bwa-indexed reference fasta, version hg38. Resultant bam is de-dupped and - base score recalibrated. Contamination is calculated and a gVCF is created optionally using GATK4 vbeta.1-3.5 HaplotypeCaller. 
Inputs from - this can be used later on for further analysis in joint trio genotyping and subsequent refinement and deNovo variant analysis. If you would like to run this workflow using the cavatica public app, a basic primer on running public apps can be found [here](https://www.notion.so/d3b/Starting-From-Scratch-Running-Cavatica-af5ebb78c38a4f3190e32e67b4ce12bb). - Alternatively, if you'd like to run it locally using `cwltool`, a basic primer on that can be found [here](https://www.notion.so/d3b/Starting-From-Scratch-Running-CWLtool-b8dbbde2dc7742e4aff290b0a878344d) and combined with app-specific info from the readme below. - This workflow is the current production workflow, equivalent to this [Cavatica public app](https://cavatica.sbgenomics.com/public/apps#cavatica/apps-publisher/kfdrc-alignment-workflow) and supersedes the [old workflow](https://github.com/kids-first/kf-alignment-workflow/tree/1.0.0) and [public app](https://cavatica.sbgenomics.com/public/apps#kids-first-drc/kids-first-drc-alignment-workflow/kfdrc-alignment-bam2cram2gvcf/); however outputs are considered equivalent. - -

- Kids First repository logo -

- + Kids First repository logo

- ## Input Agnostic Alignment Workflow - Workflow for the alignment or realignment of input SAMs/BAMs/CRAMs - (Alignment/Map files, or AMs), PE reads, and/or SE reads; conditionally - generate gVCF and metrics. + The Kids First Data Resource Center (KFDRC) BWA-GATK Short Reads Alignment and + GATK Haplotyper Workflow is a Common Workflow Language (CWL) implementation of + various software used to take reads generated by next generation sequencing + (NGS) technologies and use those reads to generate alignment and, optionally, + variant information. - This workflow is a all-in-one workflow for handling any kind of reads inputs: - BAM inputs, PE reads and mates inputs, SE reads inputs, or any combination of - these. The workflow will naively attempt to process these depending on what you - tell it you have provided. The user informs the workflow of which inputs to - process using three boolean inputs: `run_bam_processing`, + This workflow is an all-in-one alignment workflow capable of handling any kind + of reads inputs: BAM inputs, PE reads and mates inputs, SE reads inputs, or + any combination of these. The workflow will naively attempt to process these + depending on what you tell it you have provided. The user informs the workflow + of which inputs to process using three boolean inputs: `run_bam_processing`, `run_pe_reads_processing`, and `run_se_reads_processing`. Providing `true` values for these as well their corresponding inputs will result in those inputs being processed. - The second half of the workflow deals with optional gVCF creation and metrics - collection. This workflow is capable of collecting the metrics using the - following boolean flags: `run_hs_metrics`, `run_wgs_metrics`, and - `run_agg_metrics`. To run these metrics, additional optional inputs must also - be provided: `wxs_bait_interval_list` and `wxs_target_interval_list` for - HsMetrics, `wgs_coverage_interval_list` for WgsMetrics. To generate the gVCF, - set `run_gvcf_processing` to `true` and provide the following optional files: - `dbsnp_vcf`, `contamination_sites_bed`, `contamination_sites_mu`, - `contamination_sites_ud`, `wgs_calling_interval_list`, and - `wgs_evaluation_interval_list`. Additionally, the workflow is capable of - performing a basic evaluation of the X and Y sex chromosomes using idxstats. To - activate this feature, set `run_sex_metrics` to `true`; no additonal inputs are - required. Finally, the workflow is capable of collecting HLA genotype data - using [T1k](./docs/T1K_README.md). This feature is on by default but can be - disabled by setting `run_t1k` to `false`. HLA genotyping also requires the - `hla_dna_ref_seqs` and `hla_dna_gene_coords` inputs. The public app provides - default files for these. 
-
- ### Basic Info
- - dockerfiles: https://github.com/d3b-center/bixtools
- - tested with
-   - Seven Bridges Cavatica Platform: https://cavatica.sbgenomics.com/
-   - cwltool: https://github.com/common-workflow-language/cwltool/releases/tag/3.0.20200324120055
-
- ### References:
- - Broad Google Cloud Storage: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0
- - KFDRC AWS S3: s3://kids-first-seq-data/broad-references/
- - CAVATICA: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
+ In addition to the core alignment functionality, this workflow is also capable
+ of the following:
+ - Performing Haplotype Calling to generate a gVCF as well as associated metrics
+ - Collecting Hybrid-Selection (HS) Sequencing Metrics
+ - Collecting Whole Genome Sequencing (WGS) Metrics
+ - Collecting Aggregate Sequencing Metrics: Alignment Summary, Insert Size, Sequencing Artifact, GC Bias, and Quality Score Distribution
+ - Collecting Sex Chromosome Metrics
+ - Collecting HLA Genotyping Data
+
+ ## Pipeline Software and Versions
+
+ Below is a breakdown of the software used in the workflow. Please note that this
+ breakdown is manually updated and, as such, may fall out of date with our workflow.
+ For the most up to date versions of our software please refer to the Docker
+ images present in the workflow. A table of the images and their usage can be
+ found [here](./dockers_bwagatk_alignment.md).
+
+ | Step                       | KFDRC BWA-GATK                      |
+ |----------------------------|-------------------------------------|
+ | Bam to Read Group (RG) BAM | samtools split                      |
+ | RG Bam to Fastq            | biobambam2 bamtofastq               |
+ | Adapter Trimming           | cutadapt                            |
+ | Fastq to RG Bam            | bwa mem                             |
+ | Merge RG Bams              | sambamba merge                      |
+ | Sort Bam                   | sambamba sort                       |
+ | Mark Duplicates            | samblaster                          |
+ | BaseRecalibration          | GATK BaseRecalibrator               |
+ | ApplyRecalibration         | GATK ApplyBQSR                      |
+ | Gather Recalibrated BAMs   | Picard GatherBamFiles               |
+ | Bam to Cram                | samtools view                       |
+ | Metrics                    | Picard                              |
+ | Sex Metrics                | samtools idxstats                   |
+ | HLA Genotyping             | T1k                                 |
+ | Contamination Calculation  | VerifyBamID                         |
+ | gVCF Calling               | GATK HaplotypeCaller                |
+ | Gather VCFs                | Picard MergeVcfs                    |
+ | Metrics                    | Picard CollectVariantCallingMetrics |
+
+ - [biobambam2](https://github.com/gt1/biobambam2): [2.0.50](https://github.com/gt1/biobambam2/releases/tag/2.0.50-release-20160705161609)
+ - [bwa](https://github.com/lh3/bwa): [0.7.17-r1188](https://github.com/lh3/bwa/releases/tag/v0.7.17)
+ - [Cutadapt](https://github.com/marcelm/cutadapt): [4.6](https://github.com/marcelm/cutadapt/releases/tag/v4.6)
+ - [GATK](https://github.com/broadinstitute/gatk) Our workflow uses three versions of GATK:
+   - HaplotypeCaller: [4.beta.1](https://github.com/broadinstitute/gatk/releases/tag/4.beta.1)
+   - BaseRecalibration: [4.0.3.0](https://github.com/broadinstitute/gatk/releases/tag/4.0.3.0)
+   - File Indexing: [4.1.7.0](https://github.com/broadinstitute/gatk/releases/tag/4.1.7.0)
+ - [Picard](https://github.com/broadinstitute/picard): [2.18.9](https://github.com/broadinstitute/picard/releases/tag/2.18.9)
+ - [Sambamba](https://github.com/biod/sambamba): [0.6.3](https://github.com/biod/sambamba/releases/tag/v0.6.3)
+ - [samblaster](https://github.com/GregoryFaust/samblaster): [0.1.24](https://github.com/GregoryFaust/samblaster/releases/tag/v.0.1.24)
+ - [Samtools](https://github.com/samtools/samtools):
+   - idxstats, split: [1.9](https://github.com/samtools/samtools/releases/tag/1.9)
+   - BAM to CRAM conversion: [1.8](https://github.com/samtools/samtools/releases/tag/1.8)
+ - [T1k](https://github.com/mourisl/T1K): [1.0.5](https://github.com/mourisl/T1K/releases/tag/v1.0.5)
+ - [VerifyBamID](https://github.com/Griffan/VerifyBamID): [1.0.2](https://github.com/Griffan/VerifyBamID/releases/tag/1.0.2)
+
+ ### Alignment

- ### Inputs:
+ The principal function of this pipeline is the realignment of short reads to a reference genome.
+
+ #### Inputting Reads
+
+ Read information is passed in through the following inputs:
 ```yaml
- # REQUIRED
- reference_tar: { type: File, doc: "Tar file containing a reference fasta and, optionally, its complete set of associated indexes (samtools, bwa, and picard)" }
- biospecimen_name: { type: string, doc: "String name of biospcimen" }
- output_basename: { type: string, doc: "String to use as the base for output filenames" }
- knownsites: { type: 'File[]', doc: "List of files containing known polymorphic sites used to exclude regions around known polymorphisms from analysis" }
- knownsites_indexes: { type: 'File[]?', doc: "Corresponding indexes for the knownsites. File position in list must match with its corresponding VCF's position in the knownsites file list. For example, if the first file in the knownsites list is 1000G_omni2.5.hg38.vcf.gz then the first item in this list must be 1000G_omni2.5.hg38.vcf.gz.tbi. Optional, but will save time/cost on indexing." }
- # REQUIRED for gVCF
- dbsnp_vcf: { type: 'File?', doc: "dbSNP vcf file" }
- dbsnp_idx: { type: 'File?', doc: "dbSNP vcf index file" }
- contamination_sites_bed: { type: 'File?', doc: ".bed file for markers used in this analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele)" }
- contamination_sites_mu: { type: 'File?', doc: ".mu matrix file of genotype matrix" }
- contamination_sites_ud: { type: 'File?', doc: ".UD matrix file from SVD result of genotype matrix" }
- run_gvcf_processing: { type: boolean, doc: "gVCF will be generated. Requires: dbsnp_vcf, contamination_sites_bed, contamination_sites_mu, contamination_sites_ud, wgs_calling_interval_list, wgs_evaluation_interval_list" }
- # ADJUST TO FIT INPUT READS TYPE(S)
- input_bam_list: { type: 'File[]?', doc: "List of input BAM files" }
- input_pe_reads_list: { type: 'File[]?', doc: "List of input R1 paired end fastq reads" }
- input_pe_mates_list: { type: 'File[]?', doc: "List of input R2 paired end fastq reads" }
- input_pe_rgs_list: { type: 'string[]?', doc: "List of RG strings to use in PE processing" }
- input_se_reads_list: { type: 'File[]?', doc: "List of input single end fastq reads" }
- input_se_rgs_list: { type: 'string[]?', doc: "List of RG strings to use in SE processing" }
- run_bam_processing: { type: boolean, doc: "BAM processing will be run. Requires: input_bam_list" }
- run_pe_reads_processing: { type: boolean, doc: "PE reads processing will be run. Requires: input_pe_reads_list, input_pe_mates_list, input_pe_rgs_list" }
- cram_reference: { type: 'File?', doc: "If aligning from cram, need to provided reference used to generate that cram" }
- run_se_reads_processing: { type: boolean, doc: "SE reads processing will be run. Requires: input_se_reads_list, input_se_rgs_list" }
- # IF WGS or CREATE gVCF
- wgs_calling_interval_list: { type: 'File?', doc: "WGS interval list used to aid scattering Haplotype caller" }
- wgs_coverage_interval_list: { type: 'File?', doc: "An interval list file that contains the positions to restrict the wgs metrics assessment" }
- wgs_evaluation_interval_list: { type: 'File?', doc: "Target intervals to restrict gVCF metric analysis (for VariantCallingMetrics)" }
- # IF WXS
- wxs_bait_interval_list: { type: 'File?', doc: "An interval list file that contains the locations of the WXS baits used (for HsMetrics)" }
- wxs_target_interval_list: { type: 'File?', doc: "An interval list file that contains the locations of the WXS targets (for HsMetrics)" }
- # ADJUST TO GENERATE METRICS
- run_hs_metrics: { type: boolean, doc: "HsMetrics will be collected. Only recommended for WXS inputs. Requires: wxs_bait_interval_list, wxs_target_interval_list" }
- run_wgs_metrics: { type: boolean, doc: "WgsMetrics will be collected. Only recommended for WGS inputs. Requires: wgs_coverage_interval_list" }
- run_agg_metrics: { type: boolean, doc: "AlignmentSummaryMetrics, GcBiasMetrics, InsertSizeMetrics, QualityScoreDistribution, and SequencingArtifactMetrics will be collected. Recommended for both WXS and WGS inputs." }
- run_sex_metrics: {type: boolean, doc: "idxstats will be collected and X/Y ratios calculated"}
- # ADVANCED
- cutadapt_r1_adapter: { type: 'string?', doc: "If read1 reads have an adapter, provide regular 3' adapter sequence here to remove it from read1" }
- cutadapt_r2_adapter: { type: 'string?', doc: "If read2 reads have an adapter, provide regular 3' adapter sequence here to remove it from read2" }
- cutadapt_min_len: { type: 'int?', doc: "If adapter trimming, discard reads/read-pairs where the read length is less than this value. Set to 0 to turn off" }
- cutadapt_quality_base: { type: 'int?', doc: "If adapter trimming, use this value as the base quality score. Defaults to 33 but very old reads might need this value set to 64" }
- cutadapt_quality_cutoff: { type: 'string?', doc: "If adapter trimming, remove bases from the 3'/5' that fail to meet this cutoff value. If you specify a single cutoff value, the 3' end of each read is trimmed. If you specify two cutoff values separated by a comma, the first value will be trimmed from the 5' and the second value will be trimmed from the 3'" }
- min_alignment_score: { type: 'int?', default: 30, doc: "For BWA MEM, Don't output alignment with score lower than INT. This option only affects output." }
- bamtofastq_cpu: { type: 'int?', doc: "CPUs to allocate to bamtofastq" }
- run_t1k: { type: 'boolean?', default: true, doc: "Set to false to disable T1k HLA typing" }
- hla_dna_ref_seqs: { type: 'File?', doc: "FASTA file containing the HLA allele reference sequences for DNA." }
- hla_dna_gene_coords: { type: 'File?', doc: "FASTA file containing the coordinates of the HLA genes for DNA." }
- t1k_abnormal_unmap_flag: { type: 'boolean?', doc: "Set if the flag in BAM for the unmapped read-pair is nonconcordant" }
+ input_bam_list: "List of input BAM/CRAM/SAM files"
+ cram_reference: "If input_bam_list contains CRAM, provide the reference used to generate that CRAM here"
+ input_pe_reads_list: "List of input R1 paired end FASTQ reads"
+ input_pe_mates_list: "List of input R2 paired end FASTQ reads"
+ input_pe_rgs_list: "List of RG strings to use in PE processing"
+ input_se_reads_list: "List of input single end FASTQ reads"
+ input_se_rgs_list: "List of RG strings to use in SE processing"
 ```

- ### Outputs:
- ```yaml
- cram: {type: File, outputSource: samtools_bam_to_cram/output, doc: "(Re)Aligned Reads File"}
- gvcf: {type: 'File[]?', outputSource: generate_gvcf/gvcf, doc: "Genomic VCF generated from the realigned alignment file."}
- verifybamid_output: {type: 'File[]?', outputSource: generate_gvcf/verifybamid_output, doc: "Ouput from VerifyBamID that is used to calculate contamination."}
- cutadapt_stats: { type: 'File[]?', outputSource: collect_cutadapt_stats/out_filelist, doc: "Stats from Cutadapt runs, if run. One or more per read group." }
- bqsr_report: {type: File, outputSource: gatk_gatherbqsrreports/output, doc: "Recalibration report from BQSR."}
- gvcf_calling_metrics: {type: ['null', {type: array, items: {type: array, items: File}}], outputSource: generate_gvcf/gvcf_calling_metrics, doc: "General metrics for gVCF calling quality."}
- hs_metrics: {type: 'File[]?', outputSource: picard_collecthsmetrics/output, doc: "Picard CollectHsMetrics metrics for the analysis of target-capture sequencing experiments."}
- wgs_metrics: {type: 'File[]?', outputSource: picard_collectwgsmetrics/output, doc: "Picard CollectWgsMetrics metrics for evaluating the performance of whole genome sequencing experiments."}
- alignment_metrics: {type: 'File[]?', outputSource: picard_collectalignmentsummarymetrics/output, doc: "Picard CollectAlignmentSummaryMetrics high level metrics about the alignment of reads within a SAM file."}
- gc_bias_detail: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/detail, doc: "Picard CollectGcBiasMetrics detailed metrics about reads that fall within windows of a certain GC bin on the reference genome."}
- gc_bias_summary: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/summary, doc: "Picard CollectGcBiasMetrics high level metrics that capture how biased the coverage in a certain lane is."}
- gc_bias_chart: {type: 'File[]?', outputSource: picard_collectgcbiasmetrics/chart, doc: "Picard CollectGcBiasMetrics plot of GC bias."}
- insert_metrics: {type: 'File[]?', outputSource: picard_collectinsertsizemetrics/metrics, doc: "Picard CollectInsertSizeMetrics metrics about the insert size distribution of a paired-end library."}
- insert_plot: {type: 'File[]?', outputSource: picard_collectinsertsizemetrics/plot, doc: "Picard CollectInsertSizeMetrics insert size distribution plotted."}
- artifact_bait_bias_detail_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/bait_bias_detail_metrics, doc: "Picard CollectSequencingArtifactMetrics bait bias artifacts broken down by context."}
- artifact_bait_bias_summary_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/bait_bias_summary_metrics, doc: "Picard CollectSequencingArtifactMetrics summary analysis of a single bait bias artifact."}
- artifact_error_summary_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/error_summary_metrics, doc: "Picard CollectSequencingArtifactMetrics summary metrics as a roll up of the context-specific error rates, to provide global error rates per type of base substitution."}
- artifact_pre_adapter_detail_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/pre_adapter_detail_metrics, doc: "Picard CollectSequencingArtifactMetrics pre-adapter artifacts broken down by context."}
- artifact_pre_adapter_summary_metrics: {type: 'File[]?', outputSource: picard_collectsequencingartifactmetrics/pre_adapter_summary_metrics, doc: "Picard CollectSequencingArtifactMetrics summary analysis of a single pre-adapter artifact."}
- qual_metrics: {type: 'File[]?', outputSource: picard_qualityscoredistribution/metrics, doc: "Quality metrics for the realigned CRAM."}
- qual_chart: {type: 'File[]?', outputSource: picard_qualityscoredistribution/chart, doc: "Visualization of quality metrics."}
- idxstats: {type: 'File?', outputSource: samtools_idxstats_xy_ratio/output, doc: "samtools idxstats of the realigned BAM file."}
- xy_ratio: {type: 'File?', outputSource: samtools_idxstats_xy_ratio/ratio, doc: "Text file containing X and Y reads statistics generated from idxstats."}
- t1k_genotype_tsv: {type: 'File?', outputSource: t1k/genotype_tsv, doc: "HLA genotype results from T1k" }
+ This workflow accepts reads in any of the following formats:
+ - Aligned BAM/CRAM/SAM
+ - Unaligned BAM/CRAM/SAM
+ - Paired End FASTQ
+ - Single End FASTQ
+
+ Additionally, these files must have appropriate read group information. For
+ aligned and unaligned BAM/CRAM/SAM files, we assume that information is
+ contained within the file header. If your files do not have read group (`@RG`)
+ header information, you **must** add that information to the file header for
+ the pipeline to work!
+
+ If the user provides CRAM inputs, they must also provide the reference that was
+ used to generate the CRAM file.
+
+ FASTQ files do not have headers; their header information must be provided by
+ the user at runtime. The inputs that the user provides are handed directly to
+ the bwa `-R` option. From bwa:
+ ```
+ -R Complete read group header line. '\t' can be used in STR and will be
+ converted to a TAB in the output SAM. The read group ID will be attached to
+ every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null]
 ```

- #### Detailed Input Information:
- The pipeline is build to handle three distinct input types:
- 1. SAMs/BAMs/CRAMs (Alignment/Map files, or AMs)
- 1. PE Fastqs
- 1. SE Fastqs
-
- Additionally, the workflow supports these three in any combination. You can have PE Fastqs and AMs,
- PE Fastqs and SE Fastqs, AMs and PE Fastqs and SE Fastqs, etc. Each of these three classes will be
- procsessed and aligned separately and the resulting BWA aligned bams will be merged into a final BAM
- before performing steps like BQSR and Metrics collection.
-
- ##### Alignment/Map Inputs
- The Alignment/Map processing portion of the pipeline is the simplest when it comes to inputs. You may provide
- a single Alignment/Map file or many AMs. The input for AMs is a file list. In Cavatica or other GUI interfaces,
- simply select the files you wish to process. For command line interfaces such as cwltool, your input
- should look like the following.
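+ For example, a read group line for a single end FASTQ might be supplied like
+ so (a minimal sketch; the ID, library, and sample names are illustrative).
+ Note that command line users should double escape the tab values (`\\t`):
+ runners such as cwltool read `\\t` as the two characters `\t`, which is what
+ bwa expects, while a bare `\t` becomes a literal tab, which is not a valid
+ entry for bwa mem:
+ ```yaml
+ input_se_reads_list:
+ - single1.fastq
+ input_se_rgs_list:
+ - "@RG\\tID:single1\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name"
+ ```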
- ```json
- {
-   ...,
-   "run_pe_reads_processing": false,
-   "run_se_reads_processing": false,
-   "run_bam_processing": true,
-   "input_bam_list": [
-     {
-       "class": "File",
-       "location": "/path/to/bam1.bam"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/bam2.bam"
-     }
-   ],
-   ...
- }
+ If the user provides multiple single end FASTQs or multiple sets of paired end
+ FASTQs, they must keep the files and read group lines in the same order.
+ For example:
+ ```
+ input_pe_reads_list:
+ - sample1_R1.fq
+ - sample2_R1.fq
+ - sample3_R1.fq
+ input_pe_mates_list:
+ - sample1_R2.fq
+ - sample2_R2.fq
+ - sample3_R2.fq
+ input_pe_rgs_list:
+ - sample1_rg_line
+ - sample2_rg_line
+ - sample3_rg_line
+ ```
+
+ ##### Activating Reads Processing
+
+ Once the user has provided their reads, they must then choose which processing
+ the pipeline will perform. This choice is controlled by the following
+ parameters:
+ ```yaml
+ run_bam_processing: "BAM/CRAM/SAM processing will be run. Requires: input_bam_list"
+ run_pe_reads_processing: "PE reads processing will be run. Requires: input_pe_reads_list, input_pe_mates_list, input_pe_rgs_list"
+ run_se_reads_processing: "SE reads processing will be run. Requires: input_se_reads_list, input_se_rgs_list"
 ```

- ##### SE Fastq Inputs
- SE fastq processing requires more input to build the jobs correctly. Rather than providing a single
- list you must provide two lists: `input_se_reads_list` and `input_se_rgs_list`. The `input_se_reads_list`
- is where you put the files and the `input_se_rgs_list` is where you put your desired BAM @RG headers for
- each reads file. These two lists are must be ordered and of equal length. By ordered, that means the
- first item of the `input_se_rgs_list` will be used when aligning the first item of the `input_se_reads_list`.
- IMPORTANT NOTE: When you are entering the rg names, you need to use a second escape `\` to the tab values `\t`
- as seen below. When the string value is read in by a tool such as cwltool it will interpret a `\\t` input
- as `\t` and a `\t` as the literal `<tab>` value which is not a valid entry for bwa mem.
- If you are using Cavatica GUI, however, no extra escape is necessary. The GUI will add an extra
- escape to any tab values you enter.
-
- In Cavatica make sure to double check that everything is in the right order when you enter the inputs.
- In command line interfaces such as cwltool, your input should look like the following.
- ```json
- {
-   ...,
-   "run_pe_reads_processing": false,
-   "run_se_reads_processing": true,
-   "run_bam_processing": false,
-   "input_se_reads_list": [
-     {
-       "class": "File",
-       "location": "/path/to/single1.fastq"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/single2.fastq"
-     }
-   ],
-   "inputs_se_rgs_list": [
-     "@RG\\tID:single1\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name",
-     "@RG\\tID:single2\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name"
-   ],
-   ...
- }
+ If the user provides BAM/CRAM/SAMs, they should set `run_bam_processing` to true.
+ If the user provides paired end FASTQs, they should set `run_pe_reads_processing` to true.
+ If the user provides single end FASTQs, they should set `run_se_reads_processing` to true.
+
+ ### Alignment Metrics
+
+ After alignment, the pipeline contains a number of optional metrics collection
+ steps that the user can enable. These tools can be enabled using the following
+ parameters:
+ ```yaml
+ run_hs_metrics: "HsMetrics will be collected. Only recommended for WXS inputs. Requires: wxs_bait_interval_list, wxs_target_interval_list"
+ run_wgs_metrics: "WgsMetrics will be collected. Only recommended for WGS inputs. Requires: wgs_coverage_interval_list"
+ run_agg_metrics: "AlignmentSummaryMetrics, GcBiasMetrics, InsertSizeMetrics, QualityScoreDistribution, and SequencingArtifactMetrics will be collected."
+ run_sex_metrics: "idxstats will be collected and X/Y ratios calculated"
 ```

- Take particular note of how the first item in the rgs list is the metadata for the first item in the fastq list.
-
- ##### PE Fastq Inputs
- PE Fastq processing inputs is exactly like SE Fastq processing but requires you to provide the paired mates
- files for your input paired reads. Once again, when using Cavatica make sure your inputs are in the correct
- order. In command line interfaces such as cwltool, your input should look like the following.
- ```json
- {
-   ...,
-   "run_pe_reads_processing": true,
-   "run_se_reads_processing": false,
-   "run_bam_processing": false,
-   "input_pe_reads_list": [
-     {
-       "class": "File",
-       "location": "/path/to/sample1_R1.fastq"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/sample2_R1fastq"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/sample3_R1.fastq"
-     }
-   ],
-   "input_pe_mates_list": [
-     {
-       "class": "File",
-       "location": "/path/to/sample1_R2.fastq"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/sample2_R2.fastq"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/sample3_R2.fastq"
-     }
-   ],
-   "inputs_pe_rgs_list": [
-     "@RG\\tID:sample1\\tLB:library_name\\tPL:ILLUMINA\tSM:sample_name",
-     "@RG\\tID:sample2\\tLB:library_name\\tPL:ILLUMINA\tSM:sample_name",
-     "@RG\\tID:sample3\\tLB:library_name\\tPL:ILLUMINA\tSM:sample_name"
-   ],
-   ...
- }
+
+ In all but the most frugal cases, we recommend running the agg(regate) and sex
+ metrics. These steps require no additional inputs.
+
+ HS and WGS Metrics, however, should only be enabled for appropriate inputs. If
+ the inputs are WGS, we recommend enabling WGS metrics. If the inputs are WXS,
+ we recommend enabling HS metrics. Both of these metrics tools require
+ additional inputs. For WGS, the user must provide a
+ `wgs_coverage_interval_list`. For WXS, the user must provide both a
+ `wxs_bait_interval_list` and `wxs_target_interval_list`. While we provide a
+ `wgs_coverage_interval_list` with the workflow, the `wxs_bait_interval_list`
+ and `wxs_target_interval_list` are going to be unique to the sequencing
+ experiment used to generate the reads. These inputs will vary from run to run.
+ Therefore, the user must provide these intervals.
+
+ ### Haplotype Calling
+
+ Following alignment, this workflow can also perform GATK haplotype calling to
+ generate a gVCF and metrics files. To generate the gVCF, set
+ `run_gvcf_processing` to `true` and provide the following optional files:
+ `dbsnp_vcf`, `contamination_sites_bed`, `contamination_sites_mu`,
+ `contamination_sites_ud`, `wgs_calling_interval_list`, and
+ `wgs_evaluation_interval_list`.
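+ For example, the gVCF-related portion of a job file might look like the
+ following (a minimal sketch; the file names are illustrative and mirror the
+ Broad hg38 resource bundle):
+ ```yaml
+ run_gvcf_processing: true
+ dbsnp_vcf: Homo_sapiens_assembly38.dbsnp138.vcf
+ contamination_sites_bed: Homo_sapiens_assembly38.contam.bed
+ contamination_sites_mu: Homo_sapiens_assembly38.contam.mu
+ contamination_sites_ud: Homo_sapiens_assembly38.contam.UD
+ wgs_calling_interval_list: wgs_calling_regions.hg38.interval_list
+ wgs_evaluation_interval_list: wgs_evaluation_regions.hg38.interval_list
+ ```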
+
+ These are the relevant gVCF processing inputs and descriptions:
+ ```yaml
+ dbsnp_vcf: "dbSNP vcf file"
+ dbsnp_idx: "dbSNP vcf index file"
+ contamination_sites_bed: ".bed file for markers used in this analysis, format(chr\tpos-1\tpos\trefAllele\taltAllele)"
+ contamination_sites_mu: ".mu matrix file of genotype matrix"
+ contamination_sites_ud: ".UD matrix file from SVD result of genotype matrix"
+ wgs_calling_interval_list: "WGS interval list used to aid scattering Haplotype caller"
+ wgs_evaluation_interval_list: "Target intervals to restrict gVCF metric analysis (for VariantCallingMetrics)"
+ run_gvcf_processing: "gVCF will be generated. Requires: dbsnp_vcf, contamination_sites_bed, contamination_sites_mu, contamination_sites_ud, wgs_calling_interval_list, wgs_evaluation_interval_list"
 ```

- ##### Multiple Input Types
- As mentioned above, these three input types can be added in any combination. If you wanted to add
- all three your command line input would look like the following.
- ```json
- {
-   ...,
-   "run_pe_reads_processing": true,
-   "run_se_reads_processing": true,
-   "run_bam_processing": true,
-   "input_bam_list": [
-     {
-       "class": "File",
-       "location": "/path/to/bam1.bam"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/bam2.bam"
-     }
-   ],
-   "input_se_reads_list": [
-     {
-       "class": "File",
-       "location": "/path/to/single1.fastq"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/single2.fastq"
-     }
-   ],
-   "inputs_se_rgs_list": [
-     "@RG\\tID:single1\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name",
-     "@RG\\tID:single2\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name"
-   ],
-   "input_pe_reads_list": [
-     {
-       "class": "File",
-       "location": "/path/to/sample1_R1.fastq"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/sample2_R1fastq"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/sample3_R1.fastq"
-     }
-   ],
-   "input_pe_mates_list": [
-     {
-       "class": "File",
-       "location": "/path/to/sample1_R2.fastq"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/sample2_R2.fastq"
-     },
-     {
-       "class": "File",
-       "location": "/path/to/sample3_R2.fastq"
-     }
-   ],
-   "inputs_pe_rgs_list": [
-     "@RG\\tID:sample1\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name",
-     "@RG\\tID:sample2\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name",
-     "@RG\\tID:sample3\\tLB:library_name\\tPL:ILLUMINA\\tSM:sample_name"
-   ],
-   ...
- }
+ ### HLA Genotyping
+
+ This workflow can also collect HLA genotyping data. To read more on this
+ feature please see [our documentation](./T1K_README.md). This feature is
+ on by default but can be disabled by setting `run_t1k` to `false`. HLA
+ genotyping also requires the `hla_dna_ref_seqs` and `hla_dna_gene_coords`
+ inputs. The public app provides default files for these. If you are using
+ a different reference genome, see the [T1k readme](https://github.com/mourisl/T1K?tab=readme-ov-file#install)
+ for instructions on creating these files.
+
+ ## Outputs
+
+ Below is the complete list of the outputs that can be generated by the pipeline:
+ ```yaml
+ cram: "(Re)Aligned Reads File"
+ gvcf: "Genomic VCF generated from the realigned alignment file."
+ verifybamid_output: "Output from VerifyBamID that is used to calculate contamination."
+ cutadapt_stats: "Stats from Cutadapt runs, if run. One or more per read group."
+ bqsr_report: "Recalibration report from BQSR."
+ gvcf_calling_metrics: "General metrics for gVCF calling quality."
+ hs_metrics: "Picard CollectHsMetrics metrics for the analysis of target-capture sequencing experiments."
+ wgs_metrics: "Picard CollectWgsMetrics metrics for evaluating the performance of whole genome sequencing experiments."
+ alignment_metrics: "Picard CollectAlignmentSummaryMetrics high level metrics about the alignment of reads within a SAM/BAM file."
+ gc_bias_detail: "Picard CollectGcBiasMetrics detailed metrics about reads that fall within windows of a certain GC bin on the reference genome."
+ gc_bias_summary: "Picard CollectGcBiasMetrics high level metrics that capture how biased the coverage in a certain lane is."
+ gc_bias_chart: "Picard CollectGcBiasMetrics plot of GC bias."
+ insert_metrics: "Picard CollectInsertSizeMetrics metrics about the insert size distribution of a paired-end library."
+ insert_plot: "Picard CollectInsertSizeMetrics insert size distribution plotted."
+ artifact_bait_bias_detail_metrics: "Picard CollectSequencingArtifactMetrics bait bias artifacts broken down by context."
+ artifact_bait_bias_summary_metrics: "Picard CollectSequencingArtifactMetrics summary analysis of a single bait bias artifact."
+ artifact_error_summary_metrics: "Picard CollectSequencingArtifactMetrics summary metrics as a roll up of the context-specific error rates, to provide global error rates per type of base substitution."
+ artifact_pre_adapter_detail_metrics: "Picard CollectSequencingArtifactMetrics pre-adapter artifacts broken down by context."
+ artifact_pre_adapter_summary_metrics: "Picard CollectSequencingArtifactMetrics summary analysis of a single pre-adapter artifact."
+ qual_metrics: "Quality metrics for the realigned CRAM."
+ qual_chart: "Visualization of quality metrics."
+ idxstats: "samtools idxstats of the realigned reads file."
+ xy_ratio: "Text file containing X and Y reads statistics generated from idxstats."
+ t1k_genotype_tsv: "HLA genotype results from T1k"
 ```

- #### Example Runtimes:
- 1. 120 GB WGS BAM with AggMetrics, WgsMetrics, and gVCF creation: 14 hours & $35
- 1. 120 GB WGS BAM only: 11 hours
- 1. 4x40 GB WGS FASTQ files with AggMetrics, WgsMetrics, and gVCF creation: 23 hours & $72
- 1. 4x40 GB WGS FASTQ files only: 18 hours
- 1. 4x9 GB WXS FASTQ files with AggMetrics and gVCF creation: 4 hours & $9
- 1. 4x9 GB WXS FASTQ files only: 3 hours
-
- #### Caveats:
- 1. Duplicates are flagged in a process that is connected to bwa mem. The implication of this design
- decision is that duplicates are flagged only on the inputs of that are scattered into bwa.
- Duplicates, therefore, are not being flagged at a library level and, for large BAM and FASTQ inputs,
- duplicates are only being detected within a portion of the read group.
-
- #### Tips for running:
- 1. For the fastq input file lists (PE or SE), make sure the lists are properly ordered. The items in
- the arrays are processed based on their position. These lists are dotproduct scattered. This means
- that the first file in `input_pe_reads_list` is run with the first file in `input_pe_mates_list`
- and the first string in `input_pe_rgs_list`. This also means these arrays must be the same
- length or the workflow will fail.
- 1. The input for the reference_tar must be a tar file containing the reference fasta along with its indexes.
- The required indexes are `[.64.ann,.64.amb,.64.bwt,.64.pac,.64.sa,.dict,.fai]` and are generated by bwa, picard, and samtools.
- Additionally, an `.64.alt` index is recommended.
- 1. If you are making your own bwa indexes make sure to use the `-6` flag to obtain the `.64` version of the
- indexes. Indexes that do not match this naming schema will cause a failure in certain runner ecosystems.
- 1. Should you decide to create your own reference indexes and omit the ALT index file from the reference,
- or if its naming structure mismatches the other indexes, then your alignments will be equivalent to the results you would
- obtain if you run BWA-MEM with the -j option.
- 1. The following is an example of a complete reference tar input:
+ ## Caveats:
+ - Duplicates are flagged in a process that is connected to bwa mem. The
+ implication of this design decision is that duplicates are flagged only on the
+ inputs that are scattered into bwa. Duplicates, therefore, are not being
+ flagged at a library level and, for large BAM and FASTQ inputs, duplicates are
+ only being detected within a portion of the read group.
+
+ ## Tips for running:
+ - For the FASTQ input file lists (PE or SE), make sure the lists are properly
+ ordered. The items in the arrays are processed based on their position. These
+ lists are dotproduct scattered. This means that the first file in
+ `input_pe_reads_list` is run with the first file in `input_pe_mates_list` and
+ the first string in `input_pe_rgs_list`. This also means these arrays must be
+ the same length or the workflow will fail.
+ - The input for the `reference_tar` must be a tar file containing the
+ reference fasta along with its indexes. The required indexes are
+ `[.64.ann,.64.amb,.64.bwt,.64.pac,.64.sa,.dict,.fai]` and are generated by bwa,
+ Picard, and Samtools. Additionally, an `.64.alt` index is recommended. The
+ following is an example of a complete reference tar input:
 ```
 ~ tar tf Homo_sapiens_assembly38.tgz
 Homo_sapiens_assembly38.dict
@@ -376,41 +270,38 @@ doc: |+
 Homo_sapiens_assembly38.fasta.64.sa
 Homo_sapiens_assembly38.fasta.fai
 ```
- 1. For advanced usage, you can skip the knownsite indexing by providing the knownsites_indexes input.
- This file list should contain the indexes for each of the files in your knownsites input. Please
- note this list must be ordered in such a way where the position of the index file in the
- knownsites_indexes list must correspond with the position of the VCF file in the knownsites list
- that it indexes. In the example input below you can see that the 1000G_omni2.5.hg38.vcf.gz.tbi
- file is the fourth item in the knownsites_indexes because the 1000G_omni2.5.hg38.vcf.gz file is the
- fourth item in the knownsites list. Failure to order in this way will result in the pipeline
- failing or generating erroneous files.
- 1. Turning off gVCF creation and metrics collection for a minimal successful run.
- 1. Suggested reference inputs (available from the [Broad Resource Bundle](https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0)):
- 1. For large BAM inputs, users may encounter a scenario where jobs fail during
- the bamtofastq step. The given error will recommend that users try increasing
- disk space. Increasing the disk space will solve the error for all but the
- largest inputs. For those extremely large inputs with many read groups that
- continue to get this error, it is recommended that users increase the value for
- `bamtofastq_cpu`. Increasing this value will decrease the number of concurrent
- bamtofastq jobs that run on the same instance.
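+ The example below shows properly paired `knownsites` and `knownsites_indexes`
+ inputs (a sketch using files from the Broad hg38 resource bundle; see the
+ knownsites tip below). Note that each index file sits at the same position in
+ its list as the VCF it indexes:
+ ```yaml
+ knownsites:
+ - Homo_sapiens_assembly38.known_indels.vcf.gz
+ - Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
+ - 1000G_phase1.snps.high_confidence.hg38.vcf.gz
+ - 1000G_omni2.5.hg38.vcf.gz
+ knownsites_indexes:
+ - Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
+ - Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
+ - 1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
+ - 1000G_omni2.5.hg38.vcf.gz.tbi
+ ```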
- ```yaml
- contamination_sites_bed: Homo_sapiens_assembly38.contam.bed
- contamination_sites_mu: Homo_sapiens_assembly38.contam.mu
- contamination_sites_ud: Homo_sapiens_assembly38.contam.UD
- dbsnp_vcf: Homo_sapiens_assembly38.dbsnp138.vcf
- reference_tar: Homo_sapiens_assembly38.tgz
- knownsites:
-   - Homo_sapiens_assembly38.known_indels.vcf.gz
-   - Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
-   - 1000G_phase1.snps.high_confidence.hg38.vcf.gz
-   - 1000G_omni2.5.hg38.vcf.gz
- knownsites_indexes:
-   - Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
-   - Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
-   - 1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
-   - 1000G_omni2.5.hg38.vcf.gz.tbi
- ```
+ - If you are making your own bwa indexes make sure to use the `-6` flag to
+ obtain the `.64` version of the indexes. Indexes that do not match this naming
+ schema will cause a failure in certain runner ecosystems.
+ - Should you decide to create your own reference indexes and omit the ALT
+ index file from the reference, or if its naming structure mismatches the other
+ indexes, then your alignments will be equivalent to the results you would
+ obtain if you run BWA-MEM with the `-j` option.
+ - For advanced usage, users can skip the knownsite indexing by providing the
+ `knownsites_indexes` input. This file list should contain the indexes for each
+ of the files in your knownsites input. Please note this list must be ordered in
+ such a way where the position of the index file in the `knownsites_indexes`
+ list must correspond with the position of the VCF file in the knownsites list
+ that it indexes, as shown in the example above. Failure to order in this way
+ will result in the pipeline failing or generating erroneous files.
+ - For large BAM inputs, users may encounter a scenario where jobs fail during
+ the bamtofastq step. The given error will recommend that users try increasing
+ disk space. Increasing the disk space will solve the error for all but the
+ largest inputs. For those extremely large inputs with many read groups that
+ continue to get this error, it is recommended that users increase the value for
+ `bamtofastq_cpu`. Increasing this value will decrease the number of concurrent
+ bamtofastq jobs that run on the same instance.
+
+ ## Basic Info
+ - [D3b dockerfiles](https://github.com/d3b-center/bixtools)
+ - Testing Tools:
+   - [CAVATICA Platform](https://cavatica.sbgenomics.com/)
+   - [Common Workflow Language reference implementation (cwltool)](https://github.com/common-workflow-language/cwltool/)
+
+ ## References
+ - KFDRC AWS S3 bucket: s3://kids-first-seq-data/broad-references/
+ - CAVATICA: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
+ - Broad Institute Google Cloud: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0

 requirements:
 - class: ScatterFeatureRequirement
 - class: MultipleInputFeatureRequirement
diff --git a/workflows/kfdrc_sentieon_alignment_wf.cwl b/workflows/kfdrc_sentieon_alignment_wf.cwl
index 3a1e45a..decd905 100644
--- a/workflows/kfdrc_sentieon_alignment_wf.cwl
+++ b/workflows/kfdrc_sentieon_alignment_wf.cwl
@@ -9,11 +9,11 @@ doc: |
 <p align="center">
   <img src="https://github.com/d3b-center/d3b-research-workflows/raw/master/doc/kfdrc-logo-sm.png" alt="Kids First repository logo" width="660px" />
 </p>

- The Kids First Data Resource Center (KFDRC) Short Reads Alignment and GATK
+ The Kids First Data Resource Center (KFDRC) Sentieon Short Reads Alignment and
 Haplotyper Workflow is a Common Workflow Language (CWL) implementation of
 various software used to take reads generated by next generation sequencing
 (NGS) technologies and use those reads to generate alignment and, optionally,
- variant information. This workflow mirrors the approach of our existing [Input Agnostic Alignment Workflow](https://github.com/kids-first/kf-alignment-workflow#input-agnostic-alignment-workflow),
+ variant information. This workflow mirrors the approach of our existing [BWA-GATK Workflow](./KFDRC_BWA_GATK_ALIGNMENT_GVCF_WORKFLOW_README.md),
 and the two have been internally benchmarked as functionally equivalent. The
 key difference between the two workflows is found in the tools used during the
 alignment process.
@@ -67,6 +67,7 @@ doc: |
 |----------------------------|-----------------------|-----------------------------------|
 | Bam to Read Group (RG) BAM | samtools split        | samtools split                    |
 | RG Bam to Fastq            | biobambam2 bamtofastq | biobambam2 bamtofastq             |
+ | Adapter Trimming           | cutadapt              | cutadapt                          |
 | Fastq to RG Bam            | bwa mem               | Sentieon bwa mem                  |
 | Merge RG Bams              | sambamba merge        | Sentieon ReadWriter               |
 | Sort Bam                   | sambamba sort         | Sentieon ReadWriter               |
@@ -76,6 +77,8 @@ doc: |
 | Gather Recalibrated BAMs   | Picard GatherBamFiles | No splitting occurs in Sentieon   |
 | Bam to Cram                | samtools view         | Sentieon ReadWriter               |
 | Metrics                    | Picard                | Sentieon                          |
+ | Sex Metrics                | samtools idxstats     | samtools idxstats                 |
+ | HLA Genotyping             | T1k                   | T1k                               |

 ## Sentieon gVCF Creation: Similarities and Differences