Kids First Data Resource Center Sentieon Short Reads Alignment and Haplotyper Workflow

The Kids First Data Resource Center (KFDRC) Sentieon Short Reads Alignment and Haplotyper Workflow is a Common Workflow Language (CWL) implementation of software that takes reads generated by next-generation sequencing (NGS) technologies and uses them to generate alignment and, optionally, variant information. This workflow mirrors the approach of our existing BWA-GATK Workflow, and the two have been internally benchmarked as functionally equivalent. The key difference between the two workflows lies in the tools used during the alignment process.

This pipeline was made possible thanks to significant software and support contributions from Sentieon. For more information on our collaborators, check out their website:

Sentieon: https://www.sentieon.com/

Relevant Software and Versions

Sentieon: 202112.01

Input Files

This workflow has one input, sentieon_license, that is not present in our main alignment workflow. Users must provide a license value to run any of the Sentieon tools. We have provided a default value that works exclusively on CAVATICA. If you wish to run this workflow outside of CAVATICA, you will need to provide your own server license.

Otherwise, this workflow uses identical inputs as our existing alignment workflow. For more information see: https://github.com/kids-first/kf-alignment-workflow#inputs
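
As a rough illustration of supplying sentieon_license when running outside CAVATICA, the sketch below builds a job-inputs document in Python. It assumes the license is passed as a host:port string; the server address shown is a placeholder, and the helper name build_job_inputs is hypothetical. The remaining workflow inputs are documented at the link above and are not reproduced here.

```python
import json

# Placeholder license server address (an assumption for illustration only);
# replace it with your own Sentieon license server outside CAVATICA.
LICENSE_SERVER = "license.example.org:8990"


def build_job_inputs(license_value, other_inputs=None):
    """Return a job-inputs dict with sentieon_license set.

    other_inputs holds the remaining workflow inputs, which are described
    in the main alignment workflow README and left empty in this sketch.
    """
    if not license_value:
        raise ValueError("sentieon_license must be a non-empty value")
    job = dict(other_inputs or {})
    job["sentieon_license"] = license_value
    return job


job = build_job_inputs(LICENSE_SERVER)
# Serialize to JSON; the text can be saved to a file and passed to a CWL runner.
payload = json.dumps(job, indent=2)
print(payload)
```

The printed JSON can be saved to a file and combined with the other required inputs before submitting the job to a CWL runner such as cwltool.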

Output Files

This workflow generates outputs identical to our existing alignment workflow. For more information see: https://github.com/kids-first/kf-alignment-workflow#outputs

Sentieon Alignment: Similarities and Differences

The two workflows start identically: both split the input SAMs/BAMs/CRAMs (Alignment/Map files, or AMs) into read group (RG) AMs using samtools split, then convert those RG AMs into FASTQ files using biobambam2 bamtofastq. After FASTQ creation, the two workflows diverge in software usage. Whereas the KFDRC GATK pipeline uses a wide variety of tools (bwa, sambamba, samblaster, GATK, Picard, and samtools) to generate the realigned CRAMs, the KFDRC Sentieon pipeline exclusively uses software implementations from Sentieon, such as their modified version of bwa.

One notable difference in the flow of the pipeline is where MarkDuplicates is run. In the original workflow, RG BAMs are split if they are too large, and duplicate marking is then run on the individual shards rather than on the complete RG BAMs. In this workflow, however, duplicates are marked over the whole RG BAM file. Overall, this results in a slightly higher rate of marked duplicates and slightly lower mean coverage. For more information about the process in the main workflow see https://github.com/kids-first/kf-alignment-workflow#caveats.
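
To see why marking duplicates over the whole RG BAM can flag more reads than marking them shard by shard, consider the toy model below. It is only a sketch: reads are reduced to a (chromosome, position, strand) key, which is a simplifying assumption and not how samblaster or Sentieon Dedup actually identify duplicates, and the function names and read data are made up for illustration.

```python
def count_duplicates(reads):
    """Count reads whose key was already seen earlier in the same list."""
    seen = set()
    duplicates = 0
    for read in reads:
        if read in seen:
            duplicates += 1
        else:
            seen.add(read)
    return duplicates


def count_duplicates_sharded(reads, shard_size):
    """Split reads into fixed-size shards and mark duplicates per shard."""
    shards = [reads[i:i + shard_size] for i in range(0, len(reads), shard_size)]
    return sum(count_duplicates(shard) for shard in shards)


# Made-up reads, each reduced to a (chromosome, position, strand) key.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 500, "+"),
    ("chr1", 100, "+"),
    ("chr2", 500, "+"),
]

whole = count_duplicates(reads)
sharded = count_duplicates_sharded(reads, shard_size=3)
print(whole, sharded)  # prints "3 2"
```

In this toy data, the third copy of the chr1:100 read falls in a different shard from the first two, so per-shard marking does not see it as a duplicate, while whole-file marking does.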

Finally, the metrics collection is done with a series of Sentieon algorithms that match our existing Picard metrics suite.

Step	KFDRC GATK	KFDRC Sentieon
Bam to Read Group (RG) BAM	samtools split	samtools split
RG Bam to Fastq	biobambam2 bamtofastq	biobambam2 bamtofastq
Adapter Trimming	cutadapt	cutadapt
Fastq to RG Bam	bwa mem	Sentieon bwa mem
Merge RG Bams	sambamba merge	Sentieon ReadWriter
Sort Bam	sambamba sort	Sentieon ReadWriter
Mark Duplicates	samblaster	Sentieon LocusCollector + Dedup
BaseRecalibration	GATK BaseRecalibrator	Sentieon QualCal
ApplyRecalibration	GATK ApplyBQSR	Sentieon ReadWriter QualCalFilter
Gather Recalibrated BAMs	Picard GatherBamFiles	No splitting occurs in Sentieon
Bam to Cram	samtools view	Sentieon ReadWriter
Metrics	Picard	Sentieon
Sex Metrics	samtools idxstats	samtools idxstats
HLA Genotyping	T1k	T1k

Sentieon gVCF Creation: Similarities and Differences

After the creation of a recalibrated BAM, if the user wishes, a gVCF file and associated metrics are generated. The Sentieon approach is to run Haplotyper on the recalibrated reads. Like base recalibration, these steps are accomplished without scattering and therefore no additional merging steps are required. Metrics collection and contamination estimation are unchanged.

Step	KFDRC GATK	KFDRC Sentieon
Contamination Calculation	VerifyBamID	VerifyBamID
gVCF Calling	GATK HaplotypeCaller	Sentieon Haplotyper
Gather VCFs	Picard MergeVcfs	No splitting occurs in Sentieon
Metrics	Picard CollectVariantCallingMetrics	Picard CollectVariantCallingMetrics

Workflow Troubleshooting

Sentieon tools scale up RAM usage to match allocated CPUs. If a task is running into memory issues, these can be resolved by EITHER scaling UP the task's allocated RAM OR scaling DOWN the task's allocated CPUs.
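
The two remedies can be compared with a small calculation. The sketch below treats RAM per allocated CPU as a rough proxy for memory headroom, which is an assumption made for illustration; the function names, the scaling factor, and the example allocation are all hypothetical and are not recommended values.

```python
def memory_per_cpu(ram_gb, cpus):
    """Return the RAM available per allocated CPU, in GB."""
    return ram_gb / cpus


def remedies(ram_gb, cpus, factor=2):
    """Return two alternative allocations that each raise the RAM per CPU.

    The first scales RAM up by factor; the second scales CPUs down by
    factor (never below one CPU).
    """
    more_ram = {"ram_gb": ram_gb * factor, "cpus": cpus}
    fewer_cpus = {"ram_gb": ram_gb, "cpus": max(1, cpus // factor)}
    return more_ram, fewer_cpus


# Hypothetical starting allocation for a task that ran out of memory.
current = {"ram_gb": 32, "cpus": 16}
more_ram, fewer_cpus = remedies(**current)

for label, alloc in [("current", current), ("more RAM", more_ram), ("fewer CPUs", fewer_cpus)]:
    print(label, alloc, memory_per_cpu(**alloc))
```

Either option doubles the RAM per CPU in this example. Scaling CPUs down keeps the instance size unchanged at the cost of a longer runtime, while scaling RAM up may require a larger instance.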

Basic Info

D3b dockerfiles
Testing Tools:
- Seven Bridges CAVATICA Platform
- Common Workflow Language reference implementation (cwltool)

References

KFDRC AWS S3 bucket: s3://kids-first-seq-data/broad-references/
CAVATICA: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
Sentieon: https://support.sentieon.com/manual/DNAseq_usage/dnaseq/
Broad Institute Google Cloud: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0
