KFDRC_SENTIEON_ALIGNMENT_GVCF_WORKFLOW_README.md

Kids First Data Resource Center Sentieon Short Reads Alignment and Haplotyper Workflow


The Kids First Data Resource Center (KFDRC) Sentieon Short Reads Alignment and Haplotyper Workflow is a Common Workflow Language (CWL) implementation of software that takes reads generated by next-generation sequencing (NGS) technologies and uses them to generate alignment and, optionally, variant information. This workflow mirrors the approach of our existing BWA-GATK Workflow, and the two have been internally benchmarked as functionally equivalent. The key difference between the two workflows lies in the tools used during the alignment process.

This pipeline was made possible thanks to significant software and support contributions from Sentieon. For more information on our collaborators, check out their website:

Relevant Software and Versions

Input Files

This workflow has one input, sentieon_license, that is not present in our main alignment workflow. Users must provide a license value to run any of the Sentieon tools. We have provided a default value that works exclusively on CAVATICA. If you wish to run this workflow outside of CAVATICA, you will need to provide your own server license.
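As a rough illustration of supplying your own license outside of CAVATICA, the sketch below builds a JSON job file containing the sentieon_license input. Only the input name comes from this README. The `host:port` value, the `build_job` helper, and the idea of merging in other inputs are assumptions made for the example, so check your license details and the linked inputs documentation before relying on it.

```python
import json


def build_job(license_value, other_inputs=None):
    """Return a job-input mapping that includes the sentieon_license value.

    `other_inputs` is a hypothetical dict of the remaining workflow inputs;
    their names are not listed here and must come from the inputs documentation.
    """
    if not license_value or not license_value.strip():
        raise ValueError("sentieon_license must be a non-empty string")
    job = dict(other_inputs or {})
    job["sentieon_license"] = license_value.strip()
    return job


# Placeholder license server address (assumed host:port form, not a real server).
job = build_job("license.example.org:8990")

# Serialize to JSON, a format CWL runners accept for job input files.
job_json = json.dumps(job, indent=2)
print(job_json)
```

The resulting JSON can be saved to a file and passed to a CWL runner alongside the workflow, once the remaining required inputs have been added.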

Otherwise, this workflow uses the same inputs as our existing alignment workflow. For more information see: https://github.com/kids-first/kf-alignment-workflow#inputs

Output Files

This workflow generates outputs identical to our existing alignment workflow. For more information see: https://github.com/kids-first/kf-alignment-workflow#outputs

Sentieon Alignment: Similarities and Differences

The two workflows start identically: both split the input SAMs/BAMs/CRAMs (Alignment/Map files, or AMs) into read group (RG) AMs using samtools split, then convert those RG AMs into FASTQ files using biobambam2 bamtofastq. After FASTQ creation, the two workflows diverge in software usage. Whereas the KFDRC GATK pipeline uses a wide variety of tools (bwa, sambamba, samblaster, GATK, Picard, and samtools) to generate the realigned CRAMs, the KFDRC Sentieon pipeline exclusively uses software implementations from Sentieon, such as their modified version of bwa.

One notable difference in the flow of the pipeline is where duplicate marking is run. In the original workflow, RG BAMs are split if they are too large, and duplicates are then marked on those individual shards rather than on the complete RG BAMs. In this workflow, however, duplicates are marked over the whole RG BAM file. Overall, this results in a slightly higher rate of marked duplicates and slightly lower mean coverage. For more information about the process in the main workflow see https://github.com/kids-first/kf-alignment-workflow#caveats.
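To see why marking duplicates over the whole file can flag more reads than marking them shard by shard, consider the toy sketch below. It is not how samblaster or Sentieon LocusCollector + Dedup identify duplicates: the `(chrom, pos, strand)` key, the read list, and the shard size are all simplifying assumptions chosen to make the effect visible.

```python
from collections import Counter


def count_duplicates(reads):
    """Count duplicates: every read after the first that shares a key."""
    seen = Counter(reads)
    return sum(n - 1 for n in seen.values())


def count_duplicates_sharded(reads, shard_size):
    """Mark duplicates independently within fixed-size shards, then sum."""
    total = 0
    for start in range(0, len(reads), shard_size):
        total += count_duplicates(reads[start:start + shard_size])
    return total


# Hypothetical reads, each reduced to a (chrom, pos, strand) key.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr1", 100, "+"),  # duplicate of the first read, same shard
    ("chr2", 500, "+"),
    ("chr1", 100, "+"),  # duplicate of the first read, but in the second shard
    ("chr2", 500, "+"),  # duplicate of the fourth read, same shard
]

whole = count_duplicates(reads)
sharded = count_duplicates_sharded(reads, shard_size=3)
print(f"whole file: {whole} duplicates, sharded: {sharded} duplicates")
```

In this toy input, the duplicate that falls in a different shard from its original is missed by the sharded pass, which is the kind of effect that leads to a slightly higher duplicate rate when the whole file is processed at once.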

Finally, the metrics collection is done with a series of Sentieon algorithms that match our existing Picard metrics suite.

| Step | KFDRC GATK | KFDRC Sentieon |
|------|------------|----------------|
| Bam to Read Group (RG) BAM | samtools split | samtools split |
| RG Bam to Fastq | biobambam2 bamtofastq | biobambam2 bamtofastq |
| Adapter Trimming | cutadapt | cutadapt |
| Fastq to RG Bam | bwa mem | Sentieon bwa mem |
| Merge RG Bams | sambamba merge | Sentieon ReadWriter |
| Sort Bam | sambamba sort | Sentieon ReadWriter |
| Mark Duplicates | samblaster | Sentieon LocusCollector + Dedup |
| BaseRecalibration | GATK BaseRecalibrator | Sentieon QualCal |
| ApplyRecalibration | GATK ApplyBQSR | Sentieon ReadWriter QualCalFilter |
| Gather Recalibrated BAMs | Picard GatherBamFiles | No splitting occurs in Sentieon |
| Bam to Cram | samtools view | Sentieon ReadWriter |
| Metrics | Picard | Sentieon |
| Sex Metrics | samtools idxstats | samtools idxstats |
| HLA Genotyping | T1k | T1k |

Sentieon gVCF Creation: Similarities and Differences

After the creation of a recalibrated BAM, if the user wishes, a gVCF file and associated metrics are generated. The Sentieon approach is to run Haplotyper on the recalibrated reads. Like base recalibration, these steps are accomplished without scattering and therefore no additional merging steps are required. Metrics collection and contamination estimation are unchanged.

| Step | KFDRC GATK | KFDRC Sentieon |
|------|------------|----------------|
| Contamination Calculation | VerifyBamID | VerifyBamID |
| gVCF Calling | GATK HaplotypeCaller | Sentieon Haplotyper |
| Gather VCFs | Picard MergeVcfs | No splitting occurs in Sentieon |
| Metrics | Picard CollectVariantCallingMetrics | Picard CollectVariantCallingMetrics |

Workflow Troubleshooting

  • Sentieon tools scale up RAM usage to match allocated CPUs. If a task is running into memory issues, they can be solved by EITHER scaling UP the task's allocated RAM or scaling DOWN the task's allocated CPUs.
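The two remedies above can be compared with a small sketch. The helper names, the example allocation, and the step sizes are all hypothetical, and this README does not state how much memory a Sentieon task uses per CPU, so treat the numbers purely as an illustration of how each option raises the RAM available per CPU.

```python
def remediation_options(ram_gb, cpus, ram_step_gb=8, cpu_step=4):
    """Return two alternative allocations: more RAM, or fewer CPUs.

    The step sizes are arbitrary illustration values, not recommendations.
    """
    if ram_gb <= 0 or cpus < 1:
        raise ValueError("ram_gb must be positive and cpus must be at least 1")
    more_ram = {"ram_gb": ram_gb + ram_step_gb, "cpus": cpus}
    fewer_cpus = {"ram_gb": ram_gb, "cpus": max(1, cpus - cpu_step)}
    return more_ram, fewer_cpus


def gb_per_cpu(allocation):
    """RAM available per allocated CPU for a given allocation."""
    return allocation["ram_gb"] / allocation["cpus"]


# Hypothetical task allocation that ran out of memory.
current = {"ram_gb": 32, "cpus": 16}
more_ram, fewer_cpus = remediation_options(**current)

for label, option in [("current", current), ("more RAM", more_ram), ("fewer CPUs", fewer_cpus)]:
    print(f"{label}: {option['ram_gb']} GB, {option['cpus']} CPUs, "
          f"{gb_per_cpu(option):.2f} GB per CPU")
```

Either adjustment increases the memory available per CPU. Which one is preferable depends on whether runtime or instance cost matters more for the task.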

Basic Info

References