Getting Started
For this exercise, we'll use GLnexus to merge gVCFs representing chromosome 21 from the six public Platinum Genomes.
Follow the README instructions to obtain or build GLnexus. The glnexus_cli executable is statically linked, so it doesn't need to be installed in any particular location. Here, we'll assume you've copied it to a working directory for this exercise.
Additionally, have tabix and bcftools installed.
In your working directory, download and extract example gVCFs we've generated from the Platinum Genomes BAMs on chromosome 21, using DeepVariant 0.5.1. The download is 35MB.
curl -s https://raw.githubusercontent.com/wiki/dnanexus-rnd/GLnexus/data/dv_platinum6_chr21_gvcf.tar | tar xv
The glnexus_cli executable consumes the gVCF files and a three-column BED file giving the genomic ranges to analyze. For exomes, the BED file might contain the exome capture targets with some padding, while for WGS you can just give the full-length chromosomes.
echo -e "chr21\t0\t48129895" > hg19_chr21.bed
./glnexus_cli --config DeepVariant --bed hg19_chr21.bed \
dv_platinum6_chr21_gvcf/*.gvcf.gz > dv_platinum6_chr21.bcf
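For WGS runs that span more than one chromosome, a full-length BED file like the one above can be derived from a FASTA index rather than typed by hand. The following is a minimal sketch, assuming a samtools-style .fai index whose first two tab-separated columns are the sequence name and its length. The function name fai_to_bed and the file names in the usage comment are hypothetical.

```shell
# Sketch: turn a FASTA index (.fai) into a three-column BED of full-length sequences.
# Assumption: columns 1 and 2 of the .fai hold the sequence name and its length.
fai_to_bed() {
    local fai="$1"
    # BED intervals are 0-based and half-open, so a whole sequence is [0, length).
    awk 'BEGIN { FS = OFS = "\t" } { print $1, 0, $2 }' "$fai"
}

# Hypothetical usage:
#   fai_to_bed hg19.fa.fai > hg19_full.bed
```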
The glnexus_cli run should take just a minute or so. glnexus_cli emits an uncompressed, multi-sample BCF stream to its standard output. We can use bcftools to convert this BCF to a bgzip-compressed VCF:
bcftools view dv_platinum6_chr21.bcf | bgzip -c > dv_platinum6_chr21.vcf.gz
zless dv_platinum6_chr21.vcf.gz
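A quick sanity check of the merged file can be useful at this point. The sketch below assumes bcftools and tabix are on your PATH (both are listed as prerequisites above); the function name check_merged_vcf is hypothetical. It indexes the bgzip VCF with tabix, then prints the sample names and the number of records.

```shell
# Sketch: index a bgzip-compressed VCF and print a short summary of its contents.
check_merged_vcf() {
    local vcf="$1"
    # Build (or overwrite, -f) a .tbi index so the file can be queried by region later.
    tabix -f -p vcf "$vcf" || return 1
    echo "samples:"
    # bcftools query -l lists one sample name per line.
    bcftools query -l "$vcf" || return 1
    # -H drops the header, leaving one line per record to count.
    echo "records: $(bcftools view -H "$vcf" | wc -l | tr -d ' ')"
}

# Usage with the example output:
#   check_merged_vcf dv_platinum6_chr21.vcf.gz
```

For the example data, the sample list should contain the six Platinum Genomes samples.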
You could put glnexus_cli, bcftools, and bgzip in a shell pipeline to automate the format conversion (but see Performance).
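Such a pipeline might look like the following sketch. It assumes glnexus_cli sits in the current directory and that bcftools and bgzip are on your PATH. The function name glnexus_to_vcf and the GLNEXUS_CLI variable are hypothetical conveniences, not part of GLnexus. bcftools view reads the BCF stream from standard input when given - as the file name.

```shell
# Path to the executable; override GLNEXUS_CLI if it lives elsewhere (hypothetical variable).
GLNEXUS_CLI="${GLNEXUS_CLI:-./glnexus_cli}"

# Sketch: merge gVCFs and write a bgzip-compressed VCF in one step.
# Arguments: BED file, output .vcf.gz, then the gVCF files.
glnexus_to_vcf() {
    local bed="$1" out="$2"
    shift 2
    # Run in a subshell so pipefail does not leak into the caller's shell.
    (
        set -o pipefail
        "$GLNEXUS_CLI" --config DeepVariant --bed "$bed" "$@" \
            | bcftools view - \
            | bgzip -c > "$out"
    )
}

# Usage with the example data:
#   glnexus_to_vcf hg19_chr21.bed dv_platinum6_chr21.vcf.gz dv_platinum6_chr21_gvcf/*.gvcf.gz
```

With pipefail set, a failure in any stage makes the whole function return a non-zero status, so a truncated VCF is less likely to go unnoticed.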
To process gVCFs from other variant callers, change the --config flag appropriately; run glnexus_cli -h to list the available configuration presets. The Configuration page discusses customizing them if needed.
glnexus_cli leaves behind a subdirectory GLnexus.DB used for external sorting of the gVCF data. You can delete this directory when you're done; glnexus_cli currently has no way to do anything further with it.
For large projects, GLnexus is designed to keep a powerful server fully utilized, but achieving that requires several tuning steps described on the Performance page.
If you have too many gVCFs to enumerate on the command line, you can make a manifest file with one gVCF filename per line. Then pass the filename of this manifest to GLnexus along with the --list flag, instead of the individual gVCF filenames.
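As a minimal sketch of the manifest approach, the helper below collects every *.gvcf.gz file under a directory into a manifest with one path per line. The function name make_gvcf_manifest and the manifest file name gvcfs.txt are hypothetical; the --list flag is the one described above.

```shell
# Sketch: write a manifest listing one gVCF path per line.
make_gvcf_manifest() {
    local dir="$1" manifest="$2"
    # Sort in the C locale so the manifest order is reproducible across machines.
    find "$dir" -type f -name '*.gvcf.gz' | LC_ALL=C sort > "$manifest"
}

# Usage with the example data:
#   make_gvcf_manifest dv_platinum6_chr21_gvcf gvcfs.txt
#   ./glnexus_cli --config DeepVariant --bed hg19_chr21.bed --list gvcfs.txt > dv_platinum6_chr21.bcf
```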
glnexus_cli does not use index files for the input gVCFs. If you need to process only a few selected genomic ranges, then it may be advantageous to slice your gVCFs beforehand.
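One way to slice a gVCF beforehand is with tabix, sketched below. This is an illustration rather than a GLnexus-recommended procedure: it assumes tabix and bgzip are on your PATH, that the input is bgzip-compressed, that the regions file has a .bed extension so tabix treats it as 0-based BED, and that the ranges do not overlap one another (overlapping ranges could emit a record more than once). The function name slice_gvcf and the file names in the usage comment are hypothetical. Check that reference blocks at the range boundaries come through as you expect before relying on the result.

```shell
# Sketch: keep only the gVCF records overlapping the ranges in a BED file.
# Arguments: BED file, input .gvcf.gz, output .gvcf.gz.
slice_gvcf() {
    local bed="$1" in_gvcf="$2" out_gvcf="$3"
    # tabix itself needs an index on the input to extract regions; build one if missing.
    [ -e "$in_gvcf.tbi" ] || tabix -p vcf "$in_gvcf" || return 1
    # -h keeps the VCF header; -R restricts output to the regions listed in the file.
    tabix -h -R "$bed" "$in_gvcf" | bgzip -c > "$out_gvcf"
}

# Hypothetical usage:
#   slice_gvcf targets.bed sample.gvcf.gz sample.sliced.gvcf.gz
```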
We also have a DNAnexus platform applet to wrap the open-source executable, which we can use for the same exercise. A copy of this applet resides in the public project GLnexus_Getting_Started along with example gVCFs we've generated from the Platinum Genomes BAMs on chromosome 21, using DeepVariant 0.5.1.
Copy the applet and the example gVCFs to your own project, then run the applet like so:
echo -e "chr21\t0\t48129895" | dx upload -o hg19_chr21.bed -
dx run GLnexus -i config=DeepVariant -i bed_ranges_to_genotype=hg19_chr21.bed -i output_name=dv_platinum6_chr21 \
-i gvcf=dv_platinum6_chr21_gvcf/NA12877.chr21.gvcf.gz \
-i gvcf=dv_platinum6_chr21_gvcf/NA12878.chr21.gvcf.gz \
-i gvcf=dv_platinum6_chr21_gvcf/NA12891.chr21.gvcf.gz \
-i gvcf=dv_platinum6_chr21_gvcf/NA12890.chr21.gvcf.gz \
-i gvcf=dv_platinum6_chr21_gvcf/NA12889.chr21.gvcf.gz \
-i gvcf=dv_platinum6_chr21_gvcf/NA12892.chr21.gvcf.gz \
-y --watch
When the job completes, it outputs a bgzip-compressed VCF file, which you can view with:
dx cat dv_platinum6_chr21.vcf.gz | zless
If you have too many gVCF files to enumerate on the command line, the applet can instead take a file containing a list of gVCF file IDs:
dx find data --brief --folder dv_platinum6_chr21_gvcf/ | dx upload -o platinum6_gvcfs.txt -
dx run GLnexus -i config=DeepVariant -i bed_ranges_to_genotype=hg19_chr21.bed -i output_name=dv_platinum6_chr21 -i gvcf_manifest=platinum6_gvcfs.txt
To process gVCFs from other variant callers, change the config input appropriately.
The source code for this applet is in this repository under cli/dxapplet. Beyond this little applet wrapping the open-source executable, we have a cloud-native framework for GLnexus enabling giant projects to scale out on many compute nodes, and reprocess incrementally as new samples are sequenced. Contact the DNAnexus science team to discuss such requirements. (The open-source version produces identical scientific results.)