The Kids First Data Resource Center (KFDRC) Sentieon Short Reads Alignment and Haplotyper Workflow is a Common Workflow Language (CWL) implementation of various software used to take reads generated by next generation sequencing (NGS) technologies and use those reads to generate alignment and, optionally, variant information. This workflow mirrors the approach of our existing BWA-GATK Workflow, and the two have been internally benchmarked as functionally equivalent. The key difference between the two workflows is found in the tools used during the alignment process.
This pipeline was made possible thanks to significant software and support contributions from Sentieon. For more information on our collaborators, check out their website:
- Sentieon: https://www.sentieon.com/
- Sentieon: 202112.01
This workflow has a unique input, `sentieon_license`, that is not present in our
main alignment workflow. Users must provide a license value to run any of the
Sentieon tools. We have provided a default value that works exclusively on
CAVATICA. If you wish to run this workflow outside of CAVATICA, you will need to
provide your own server license.
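
As a minimal sketch of supplying your own license outside of CAVATICA, the snippet below adds `sentieon_license` to an inputs mapping and prints it as JSON for a job file. Only the input name comes from this workflow. The helper name, the placeholder server address, and the assumption that a server license is written as `HOST:PORT` are illustrative, and all other workflow inputs are omitted.

```python
import json


def with_sentieon_license(inputs: dict, license_server: str) -> dict:
    """Return a copy of `inputs` with the sentieon_license input set.

    Assumption: a server license is written as HOST:PORT.
    """
    host, sep, port = license_server.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected HOST:PORT, got {license_server!r}")
    updated = dict(inputs)  # leave the caller's mapping untouched
    updated["sentieon_license"] = license_server
    return updated


if __name__ == "__main__":
    # Placeholder server address; every other workflow input is omitted here.
    job_inputs = with_sentieon_license({}, "license.example.org:8990")
    print(json.dumps(job_inputs, indent=2))
```

The resulting JSON can be merged with the rest of your inputs before handing the job file to a CWL runner.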
Otherwise, this workflow uses the same inputs as our existing alignment workflow. For more information see: https://github.com/kids-first/kf-alignment-workflow#inputs
This workflow generates outputs identical to our existing alignment workflow. For more information see: https://github.com/kids-first/kf-alignment-workflow#outputs
The two workflows start identically: both split the input SAMs/BAMs/CRAMs (Alignment/Map files, or AMs) into read group (RG) AMs using samtools split, then convert those RG AMs into FASTQ files using biobambam2 bamtofastq. After FASTQ creation, the two workflows diverge in software usage. Whereas the KFDRC GATK pipeline uses a wide variety of tools (bwa, sambamba, samblaster, GATK, Picard, and samtools) to generate the realigned CRAMs, the KFDRC Sentieon pipeline exclusively uses software implementations from Sentieon, such as their modified version of bwa.

One notable difference in the flow of the pipeline is where MarkDuplicates is run. In the original workflow, RG BAMs are split if they are too large, and duplicate marking is then run on the individual shards rather than on the complete RG BAMs. In this workflow, however, duplicates are marked over the whole RG BAM file. Overall, this results in a slightly higher rate of marked duplicates and slightly lower mean coverage. For more information about the process in the main workflow, see https://github.com/kids-first/kf-alignment-workflow#caveats.
Finally, the metrics collection is done with a series of Sentieon algorithms that match our existing Picard metrics suite.
Step | KFDRC GATK | KFDRC Sentieon |
---|---|---|
BAM to Read Group (RG) BAM | samtools split | samtools split |
RG BAM to FASTQ | biobambam2 bamtofastq | biobambam2 bamtofastq |
Adapter Trimming | cutadapt | cutadapt |
FASTQ to RG BAM | bwa mem | Sentieon bwa mem |
Merge RG BAMs | sambamba merge | Sentieon ReadWriter |
Sort BAM | sambamba sort | Sentieon ReadWriter |
Mark Duplicates | samblaster | Sentieon LocusCollector + Dedup |
BaseRecalibration | GATK BaseRecalibrator | Sentieon QualCal |
ApplyRecalibration | GATK ApplyBQSR | Sentieon ReadWriter QualCalFilter |
Gather Recalibrated BAMs | Picard GatherBamFiles | No splitting occurs in Sentieon |
BAM to CRAM | samtools view | Sentieon ReadWriter |
Metrics | Picard | Sentieon |
Sex Metrics | samtools idxstats | samtools idxstats |
HLA Genotyping | T1k | T1k |
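
The first two steps in the table above, which both workflows share, can be sketched as command lines. This is an illustration only: the function names, file names, and output naming are hypothetical, and the arguments the workflow actually passes may differ. CRAM input would additionally need a reference, which is omitted here.

```python
import shlex
from typing import List


def samtools_split_cmd(input_am: str, threads: int = 1) -> List[str]:
    """Build a samtools split command writing one BAM per read group."""
    # "%!" in the -f format string expands to the read group ID.
    return ["samtools", "split", "-f", "%!.bam", "-@", str(threads), input_am]


def bamtofastq_cmd(rg_bam: str, prefix: str) -> List[str]:
    """Build a biobambam2 bamtofastq command for one read group BAM."""
    return [
        "bamtofastq",
        f"filename={rg_bam}",
        f"F={prefix}_1.fq.gz",    # first mates of properly matched pairs
        f"F2={prefix}_2.fq.gz",   # second mates of properly matched pairs
        f"O={prefix}_o1.fq.gz",   # unmatched first mates
        f"O2={prefix}_o2.fq.gz",  # unmatched second mates
        f"S={prefix}_s.fq.gz",    # single-end reads
        "gz=1",                   # gzip-compress the FASTQ output
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so no tools are required.
    print(shlex.join(samtools_split_cmd("sample.bam", threads=4)))
    print(shlex.join(bamtofastq_cmd("rg1.bam", "rg1")))
```

Building the commands as argument lists keeps them safe to pass to `subprocess.run` without shell quoting issues.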
After the creation of a recalibrated BAM, if the user wishes, a gVCF file and associated metrics are generated. The Sentieon approach is to run Haplotyper on the recalibrated reads. Like base recalibration, these steps are accomplished without scattering and therefore no additional merging steps are required. Metrics collection and contamination estimation are unchanged.
Step | KFDRC GATK | KFDRC Sentieon |
---|---|---|
Contamination Calculation | VerifyBamID | VerifyBamID |
gVCF Calling | GATK HaplotypeCaller | Sentieon Haplotyper |
Gather VCFs | Picard MergeVcfs | No splitting occurs in Sentieon |
Metrics | Picard CollectVariantCallingMetrics | Picard CollectVariantCallingMetrics |
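
The unscattered gVCF calling step described above can be sketched as a single Sentieon driver invocation. This is a hedged illustration, not the workflow's actual command: the function name and file names are hypothetical, and the optional `-q` recalibration table is only relevant when recalibration has not already been applied to the input reads.

```python
import shlex
from typing import List, Optional


def haplotyper_gvcf_cmd(
    reference: str,
    bam: str,
    out_gvcf: str,
    threads: int = 1,
    recal_table: Optional[str] = None,
) -> List[str]:
    """Build a sentieon driver command that runs Haplotyper in gVCF mode."""
    cmd = ["sentieon", "driver", "-r", reference, "-t", str(threads), "-i", bam]
    if recal_table is not None:
        # Apply a QualCal recalibration table on the fly.
        cmd += ["-q", recal_table]
    cmd += ["--algo", "Haplotyper", "--emit_mode", "gvcf", out_gvcf]
    return cmd


if __name__ == "__main__":
    # One call covers the whole genome, so there are no shards to merge.
    print(shlex.join(haplotyper_gvcf_cmd("ref.fa", "recal.bam", "sample.g.vcf.gz", threads=16)))
```

Because the whole input is handled in one call, there is no counterpart to the Picard MergeVcfs step.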
- Sentieon tools scale up RAM usage to match allocated CPUs. If a task is running into memory issues, that can be solved by EITHER scaling UP the task's allocated RAM OR scaling DOWN the task's allocated CPUs.
- D3b dockerfiles
- Testing Tools:
- KFDRC AWS S3 bucket: s3://kids-first-seq-data/broad-references/
- CAVATICA: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
- Sentieon: https://support.sentieon.com/manual/DNAseq_usage/dnaseq/
- Broad Institute Google Cloud: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0