Mike Lin edited this page Feb 12, 2018 · 14 revisions

For this exercise, we'll use GLnexus to merge gVCFs representing chromosome 21 from the six public Platinum Genomes.

On your computer

Build GLnexus

Follow the README instructions to build GLnexus. The glnexus_cli executable is statically linked, so it doesn't need to be installed in any particular location. Here, we'll assume you've copied it to a working directory for this exercise.

Additionally, have tabix and bcftools installed.

Download example gVCFs

In your working directory, download and extract example gVCFs we've generated from the Platinum Genomes BAMs on chromosome 21, using DeepVariant 0.5.1. The download is 35MB.

curl -s https://raw.githubusercontent.com/wiki/dnanexus-rnd/GLnexus/data/dv_platinum6_chr21_gvcf.tar | tar xv

Run GLnexus

The glnexus_cli executable consumes the gVCF files, and a three-column BED file giving the genomic ranges to analyze. For exomes, the BED file might contain the exome capture targets with some padding, while for WGS you can just give the full-length chromosomes.

echo -e "chr21\t0\t48129895" > hg19_chr21.bed
./glnexus_cli --config DeepVariant --bed hg19_chr21.bed \
    dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
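For WGS across many chromosomes, or for padded exome targets, the BED file can be generated rather than typed by hand. The following is a minimal sketch and not part of GLnexus: the helper names `fai_to_bed` and `pad_bed` are hypothetical, and the `.fai` input is assumed to be a standard samtools FASTA index (sequence name in column 1, length in column 2). The padding helper clamps starts at zero, but it does not clamp ends to the chromosome length or merge intervals that overlap after padding.

```shell
# Hypothetical helpers (not part of GLnexus) for building the three-column BED file.

# fai_to_bed FAI
# Print one full-length BED interval per sequence listed in a samtools
# FASTA index (column 1 = sequence name, column 2 = sequence length).
fai_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, 0, $2 }' "$1"
}

# pad_bed BED PAD
# Widen each interval by PAD bases on both sides, clamping the start at 0.
# Ends are NOT clamped to the chromosome length and overlaps are NOT merged.
pad_bed() {
    awk -v pad="$2" 'BEGIN { FS = OFS = "\t" }
        { start = $2 - pad; if (start < 0) start = 0; print $1, start, $3 + pad }' "$1"
}

# Example calls (file names are placeholders):
# fai_to_bed reference.fa.fai > genome.bed
# pad_bed exome_targets.bed 100 > exome_targets.padded.bed
```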

This should take just a minute or so. glnexus_cli emits an uncompressed, multi-sample BCF stream to its standard output. We can use bcftools to convert this BCF to bgzip-compressed VCF:

bcftools view dv_platinum6_chr21.bcf | bgzip -c > dv_platinum6_chr21.vcf.gz
zless dv_platinum6_chr21.vcf.gz

You could put glnexus_cli, bcftools, and bgzip in a shell pipeline to automate the format conversion (but see Performance).
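As a rough sketch of such a pipeline, the wrapper below chains the three tools so that no intermediate BCF file is written. The function name `merge_to_vcf` and the `GLNEXUS_CLI` variable are hypothetical conveniences, not part of GLnexus; the flags are the same ones used in the commands above.

```shell
# Hypothetical wrapper (not part of GLnexus): merge gVCFs and write
# bgzip-compressed VCF in a single pipeline.
# Usage: merge_to_vcf CONFIG BED OUT_VCF_GZ GVCF...
# GLNEXUS_CLI (hypothetical variable) may point at the executable;
# it defaults to ./glnexus_cli in the working directory.
merge_to_vcf() {
    config=$1
    bed=$2
    out=$3
    shift 3   # the remaining arguments are the gVCF files
    "${GLNEXUS_CLI:-./glnexus_cli}" --config "$config" --bed "$bed" "$@" \
        | bcftools view - \
        | bgzip -c > "$out"
}

# Example call, equivalent to the two-step commands above:
# merge_to_vcf DeepVariant hg19_chr21.bed dv_platinum6_chr21.vcf.gz \
#     dv_platinum6_chr21_gvcf/*.gvcf.gz
```

By default a shell pipeline reports only the exit status of its last command, so a glnexus_cli failure can go unnoticed; in bash, running `set -o pipefail` beforehand makes the failure propagate.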

To process gVCFs from other variant callers, change the --config flag appropriately; see Configuration.

glnexus_cli leaves behind a subdirectory GLnexus.DB used for external sorting of the gVCF data. You can delete this directory when you're done; glnexus_cli currently has no way to do anything further with it.

Scaling up

For large projects, GLnexus is designed to run a powerful server flat-out, but achieving that requires several tuning tricks; see Performance.

If you have too many gVCFs to enumerate on the command line, you can make a manifest file with one gVCF filename per line. Then pass the filename of this manifest to GLnexus along with the --list flag, instead of the individual gVCF filenames.
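One way to build such a manifest is sketched below. The helper name `make_manifest` and the manifest filename are hypothetical; the only GLnexus-specific part is the --list flag described above.

```shell
# Hypothetical helper (not part of GLnexus): write a manifest with one
# gVCF path per line, sorted for a reproducible order.
# Usage: make_manifest GVCF_DIR MANIFEST
make_manifest() {
    find "$1" -type f -name '*.gvcf.gz' | sort > "$2"
}

# Example, passing the manifest with --list instead of individual filenames:
# make_manifest dv_platinum6_chr21_gvcf gvcfs.txt
# ./glnexus_cli --config DeepVariant --bed hg19_chr21.bed --list gvcfs.txt \
#     > dv_platinum6_chr21.bcf
```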

glnexus_cli does not use index files for the input gVCFs. If you need to process only a few selected genomic ranges, then it may be advantageous to slice your gVCFs beforehand.
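Slicing could be done with tabix, for instance. The sketch below assumes each input gVCF is bgzip-compressed and already has a tabix index next to it (for example from `tabix -p vcf`), and that the ranges file name ends in `.bed` so tabix treats it as BED. The helper name `slice_gvcf` is hypothetical.

```shell
# Hypothetical helper (not part of GLnexus): keep the header plus only the
# gVCF records overlapping the given BED ranges.
# Assumes IN_GVCF is bgzip-compressed with a .tbi index alongside it.
# Usage: slice_gvcf BED IN_GVCF OUT_GVCF
slice_gvcf() {
    # -h keeps the VCF header; -R restricts output to the regions in the file
    tabix -h -R "$1" "$2" | bgzip -c > "$3"
}

# Example (file names are placeholders):
# slice_gvcf targets.bed sample.gvcf.gz sample.sliced.gvcf.gz
```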

On DNAnexus

We also have a DNAnexus platform applet to wrap the open-source executable, which we can use for the same exercise. A copy of this applet resides in the public project GLnexus_Getting_Started along with example gVCFs we've generated from the Platinum Genomes BAMs on chromosome 21, using DeepVariant 0.5.1.

Copy all of these to your own project, then run the applet like so:

echo -e "chr21\t0\t48129895" | dx upload -o hg19_chr21.bed -
dx run GLnexus -i config=DeepVariant -i bed_ranges_to_genotype=hg19_chr21.bed -i output_name=dv_platinum6_chr21 \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12877.chr21.gvcf.gz \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12878.chr21.gvcf.gz \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12891.chr21.gvcf.gz \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12890.chr21.gvcf.gz \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12889.chr21.gvcf.gz \
	-i gvcf=dv_platinum6_chr21_gvcf/NA12892.chr21.gvcf.gz \
	-y --watch

On completion, the job outputs a bgzip-compressed VCF file, which you can view with:

dx cat dv_platinum6_chr21.vcf.gz | zless
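The repeated `-i gvcf=...` inputs in the dx run command above can also be assembled in a loop rather than typed out. This is only a sketch: the function name `run_glnexus_applet` is hypothetical, and it assumes the gVCFs follow the `SAMPLE.chr21.gvcf.gz` naming used in this exercise. The applet inputs are the same ones shown above.

```shell
# Hypothetical helper: build one "-i gvcf=PATH" pair per sample, then launch
# the applet with the same inputs as the hand-written command.
# Usage: run_glnexus_applet FOLDER SAMPLE...
run_glnexus_applet() {
    folder=$1
    shift
    n=$#
    # Rotate through the sample names, replacing each one with an
    # "-i gvcf=FOLDER/SAMPLE.chr21.gvcf.gz" pair appended to the argument list.
    while [ "$n" -gt 0 ]; do
        sample=$1
        shift
        set -- "$@" -i "gvcf=$folder/$sample.chr21.gvcf.gz"
        n=$((n - 1))
    done
    dx run GLnexus -i config=DeepVariant -i bed_ranges_to_genotype=hg19_chr21.bed \
        -i output_name=dv_platinum6_chr21 "$@" -y --watch
}

# Example call:
# run_glnexus_applet dv_platinum6_chr21_gvcf \
#     NA12877 NA12878 NA12889 NA12890 NA12891 NA12892
```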

If you have too many gVCF files to enumerate on the command line, the applet can instead take a file containing a list of gVCF file IDs:

dx find data --brief --folder dv_platinum6_chr21_gvcf/ | dx upload -o platinum6_gvcfs.txt -
dx run GLnexus -i config=DeepVariant -i bed_ranges_to_genotype=hg19_chr21.bed -i output_name=dv_platinum6_chr21 -i gvcf_manifest=platinum6_gvcfs.txt

To process gVCFs from other variant callers, change the config input appropriately; see Configuration.

The source code for this applet is in this repository under cli/dxapplet. Beyond this little applet wrapping the open-source executable, we have a cloud-native framework for GLnexus enabling giant projects to scale out on many compute nodes, and reprocess incrementally as new samples are sequenced. Contact the DNAnexus science team to discuss such requirements. (The open-source version produces identical scientific results.)