Sentieon DNAscope LongRead is a pipeline for alignment and germline variant calling (SNVs, SVs, and indels) from long-read sequence data. The DNAscope LongRead pipeline is able to take advantage of longer read lengths to perform quick and accurate variant calling using specially calibrated machine learning models.
The pipeline will accept as input aligned reads in BAM or CRAM format, or un-aligned reads in FASTQ, uBAM, or uCRAM format. The pipeline will output variants in the VCF (or gVCF) formats and aligned reads in BAM or CRAM formats.
DNAscope LongRead is implemented using the Sentieon software package, which requires a valid license for use. Please contact [email protected] for access to the Sentieon software and an evaluation license.
- Sentieon software package version 202308.01 or higher.
- Python version 3.8 or higher.
- bcftools version 1.10 or higher for SNV and indel calling.
- bedtools for SNV and indel calling.
- samtools version 1.16 or higher for alignment of read data in uBAM or uCRAM format or re-alignment of previously aligned reads.
The sentieon
, python
, bcftools
, bedtools
, and samtools
executables will be accessed through the user's PATH
environment variable.
DNAscope LongRead will call variants present in the sample relative to a high quality reference genome sequence. Besides the reference genome file, a samtools fasta index file (.fai) needs to be present.
We recommend aligning to a reference genome without alternate contigs.
A single command is run to align and call SNVs, indels, and structural variants from PacBio HiFi or ONT reads in the FASTQ format:
sentieon-cli dnascope-longread [-h] \
-r REFERENCE \
--fastq INPUT_FASTQ ... \
--readgroups READGROUP ... \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b DIPLOID_BED] \
[-t NUMBER_THREADS] \
[-g] \
--tech HiFi|ONT \
[--haploid-bed HAPLOID_BED] \
sample.vcf.gz
With FASTQ input, the DNAscope LongRead pipeline requires the following arguments:
-r REFERENCE
: the location of the reference FASTA file. A reference fasta index, ".fai" file, is also required.--fastq INPUT_FASTQ
: the input sample file in FASTQ format. One or more files can be supplied by passing multiple files after the--fastq
argument.--readgroups READGROUP
: readgroup information for the read data. The--readgroups
argument is required if the input data is in the FASTQ format. This argument expects complete readgroup strings and these strings will be passed tominimap2
through the-R
argument. An example argument is--readgroups '@RG\tID:foo\tSM:bar'
.-m MODEL_BUNDLE
: the location of the model bundle. Model bundle files can be found in the sentieon-models repository.--tech HiFi|ONT
: Sequencing technology used to generate the reads. Supported arguments areONT
orHiFi
.sample.vcf.gz
: the location of the output VCF file for SNVs and indels. The pipeline requires the output file end with the suffix, ".vcf.gz". The file path without the suffix will be used as the basename for other output files.
The Sentieon LongRead pipeline accepts the following optional arguments:
-d DBSNP
: the location of the Single Nucleotide Polymorphism database (dbSNP) used to label known variants in VCF (.vcf
) or bgzip compressed VCF (.vcf.gz
) format. Only one file is supported. Supplying this file will annotate variants with their dbSNP refSNP ID numbers. A VCF index file is required.-b DIPLOID_BED
: interval in the reference to restrict diploid variant calling, in BED file format. Supplying this file will limit diploid variant calling to the intervals inside the BED file.--haploid-bed HAPLOID_BED
: interval in the reference to restrict haploid variant calling, in BED file format. Supplying this file will perform haploid variant calling across the intervals inside the BED file.-t NUMBER_THREADS
: number of computing threads that will be used by the software to run parallel processes. The argument is optional; if omitted, the pipeline will use as many threads as the server has.-g
: output variants in the gVCF format, in addition to the VCF output file. The tool will output a bgzip compressed gVCF file with a corresponding index file.-h
: print the command-line help and exit.--dry-run
: print the pipeline commands, but do not actually execute them.
A single command is run to align and call SNVs, indels, and structural variants from PacBio HiFi or ONT reads in the uBAM, uCRAM, BAM, or CRAM formats:
sentieon-cli dnascope-longread [-h] \
-r REFERENCE \
-i SAMPLE_INPUT ... \
--align \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b DIPLOID_BED] \
[-t NUMBER_THREADS] \
[-g] \
--tech HiFi|ONT \
[--haploid-bed HAPLOID_BED] \
[--input_ref INPUT_REF] \
sample.vcf.gz
With uBAM, uCRAM, BAM, or CRAM input, the DNAscope LongRead pipeline requires the following new arguments:
-i SAMPLE_INPUT
: the input input sample file in uBAM or uCRAM format. One or more files can be supplied by passing multiple files after the-i
argument.--align
: re-align the input read data to the reference genome using Sentieon minimap2.
The DNAscope LongRead pipeline accepts the following new optional arguments:
--input_ref INPUT_REF
: a reference fasta used for decoding the input file(s). Required with uCRAM or CRAM input. Can be different from the fasta used with the-r
argument.
A single command is run to call SNVs, indels, and structural variants from PacBio HiFi or ONT reads in the BAM, or CRAM formats:
sentieon-cli dnascope-longread [-h] \
-r REFERENCE \
-i SAMPLE_INPUT ... \
-m MODEL_BUNDLE \
[-d DBSNP] \
[-b DIPLOID_BED] \
[-t NUMBER_THREADS] \
[-g] \
--tech HiFi|ONT \
[--haploid-bed HAPLOID_BED] \
sample.vcf.gz
Not supplying the --align
argument will direct the pipeline to call variants directly from the input reads.
The following files are output when processing FASTQ data or uBAM, uCRAM, BAM, or CRAM files with the --align
argument:
sample.vcf.gz
: SNV and indel variant calls across the regions of the genome as defined in the-b DIPLOID_BED
file.sample.sv.vcf.gz
: structural variant calls from the Sentieon LongReadSV tool.sample_mm2_sorted_fq_*.cram
: aligned and coordinate-sorted reads from the input FASTQ files.sample_mm2_sorted_*.cram
: aligned and coordinate-sorted reads from the input uBAM, uCRAM, BAM, or CRAM files.
With the --haploid-bed HAPLOID_BED
argument, the pipeline will create the following additional output files:
sample.haploid.vcf.gz
: SNV and indel variant calls across the haploid regions of the genome (as defined in the--haploid-bed HAPLOID_BED
file).
The default pipeline is recommended for use with samples from diploid organisms. For samples with both diploid and haploid chromosomes, the -b DIPLOID_BED
option can be used to limit diploid variant calling to diploid chromosomes and the --haploid-bed HAPLOID_BED
argument can be used to perform haploid variant calling across haploid chromosomes. Diploid and haploid variants will be output to separate VCF files.
Diploid and haploid BED files for the human hg38 reference genome (with male samples) can be found in the /data folder in this repository.
Scripts in this repository are made available under the BSD 2-Clause license.
The Python scripts in the sentieon_cli/scripts
folder perform low-level manipulation of intermediate gVCF and VCF files generated by the pipeline. Due to the low-level data handling performed by these scripts, modification of these files by users is discouraged.
Sentieon DNAscope LongRead – A highly Accurate, Fast, and Efficient Pipeline for Germline Variant Calling from PacBio HiFi reads - A preprint describing the DNAscope LongRead pipeline for calling variants from PacBio HiFi data.