These portable WDL workflows use DeepVariant to call variants from WGS read alignments, followed by GLnexus to merge the resulting Genome VCF (gVCF) files for several samples into a Project VCF (pVCF). The wdl/
directory has three nested workflows:
Based on the DeepVariant docs, the sequential workflow to generate gVCF from a given BAM file and genomic range.
+----------------------------------------------------------------------------+
| |
| DeepVariant.wdl |
| |
| +-----------------+ +-----------------+ +------------------------+ |
sample.bam | | | | | | | |
genome.fa -----> make_examples |----> call_variants |----> postprocess_variants |-----> gVCF
range | | | | | | | |
| +-----------------+ +--------^--------+ +------------------------+ |
| | |
| | |
+----------------------------------|-----------------------------------------+
|
DeepVariant Model
make_examples
and call_variants
internally parallelize across CPUs on the machine they run on. The tasks use the docker image published by the DeepVariant team.
To further parallelize WGS calling accross several machines, scatters DeepVariant.wdl across several genomic ranges (typically full-length chromosomes). For each range, fetches a BAM slice using the GA4GH htsget client in samtools 1.7+, given an htsget server endpoint and sample ID. Finally, concatenates the per-range gVCFs to the complete product.
+--------------------------------------------------------------------------------+
| |
| htsget_DeepVariant.wdl |
| |
| +-----------------+ +-------------------+ |
| | | | | range gVCF |
| +---> htsget client |----> DeepVariant.wdl |---+ |
| | | (samtools) | | | | |
| | | | +-------------------+ | |
sample ID | | +-----------------+ | +-------------------+ |
| | +--> | |
ranges -------+---> ... ... ... ---> bcftools concat +-----> sample gVCF
(e.g. | | +--> | |
chr1 | | +-----------------+ | +-------------------+ |
chr2 | | | | +-------------------+ | |
...) | +---> htsget client | | | | |
| | (samtools) |----> DeepVariant.wdl |---+ |
| | | | | range gVCF |
| +------------^----+ +-------------------+ |
| | | |
| | | |
+------------|-------|-----------------------------------------------------------+
| |
sample ID | |
range | | range BAM
| |
+----v------------+
| |
| htsget server |
| |
+-----------------+
By using htsget, the workflow scatters across the ranges without first having to download and slice up a monolithic BAM file.
Scatters htsget_DeepVariant.wdl across several samples to generate an array of gVCF files, then feeds these to GLnexus to merge them into a pVCF.
+-----------------------------------------------------------+
| |
| htsget_DeepVariant_GLnexus.wdl |
| |
| +--------------------------+ |
| | | sample gVCF |
| +---> htsget_DeepVariant.wdl |----+ |
| | | | | |
| | +--------------------------+ | +-----------+ |
| | +----> | |
sample IDs -------+---> ... ... -----> GLnexus +----> project VCF
| | +----> | |
| | +--------------------------+ | +-----------+ |
| | | | | |
| +---> htsget_DeepVariant.wdl |----+ |
| | | sample gVCF |
| +--------------------------+ |
| |
+-----------------------------------------------------------+
Here's an example inputs JSON providing everything required to launch this top-level workflow with dxWDL or Cromwell:
{
"htsget_DeepVariant_GLnexus.accessions": ["NA12878","NA12891","NA12892"],
"htsget_DeepVariant_GLnexus.htsget_endpoint": "https://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37",
"htsget_DeepVariant_GLnexus.ranges": ["12:112204691-112247789","17:41196312-41277500"],
"htsget_DeepVariant_GLnexus.ref_fasta_gz": (REFERENCE GENOME FILE),
"htsget_DeepVariant_GLnexus.model_tar": (DEEPVARIANT MODEL FILES),
"htsget_DeepVariant_GLnexus.output_name": "b37_CEUtrio_ALDH2_BRCA1",
}