Skip to content

3. Results

Philip Reiner Kensche edited this page Aug 14, 2019 · 47 revisions

Output is written to the output directory with the following general structure:

Directory File types
./ _commonBam_wroteQcSummary.txt, _qualitycontrol.json
./fastqc/ if if runFastQC=true: .fastqc.zip, fastq_qcpass_status.txt
./alignment/ .bam, .bai, .md5, .dupmark_metrics.txt
./coverage/ .DepthOfCoverage_Genome.txt, _readCoverage_1kb_windows.txt, _readCoverage_1kb_windows_coveragePlot_${chr}.png
./fingerprinting/ if runFingerprinting=true: .fp
./flagstats/ _flagstats.txt, _extendedFlagstats.txt
./insertsize_distribution/ _insertsize_plot.png, _insertsize_plot.png_qcValues.txt, _insertsizes.txt
./structural_variation/ _DiffChroms.txt, _DiffChroms.png, _DiffChroms.png_qcValues.txt
./correctedGcBias/ ACEseq QC files if runACEseqQc=true: _windows.anno.txt, _sex.txt, _windows.filtered.txt.gz, _windows.filtered.corrected.txt.gz, _qc_gc_corrected.json, _qc_gc_corrected.slim.tsv, _gc_corrected.png
./roddyExecutionStore/ Execution metadata, logs, configurations. Please refer to the Roddy documentation for details

File contents

Many of the following files are procude by the workflow as one for each lane, optionally for each library -- e.g. with tagmentation WGBS data --, and for the final merged BAM. Some files are also per chromosome.

fastqc.zip, fastq_qcpass_status.txt

FastQC output file and a QC-status derived from the FastQC output by fastqcClassify. The status categories are

Category Meaning Formula
PASS high quality reads mean(q1) > 28
WARN reads quality compromised (20 < mean(q1) < = 28) or (PASS and one median below 20)
FAIL low quality reads mean(q1) < = 20

The Phred score is related to the probability of producing a sequencing error, and each sequenced nucleotide is associated to such a value. As a reference, a Phred score of 20 is a probability of 0.01 of having an error and a score of 30 a probability of 0.001. The reported value corresponds to the average of the first quantile of the Phred score in each position. A value above 28 is OK and below or equal to 20 is of bad quality. Anything in between is dubious. Also notice, in case there is one single position where the median score goes below 20 the sample will be flagged as dubious in case it was considered OK.

.bam, .bai, .md5

Position-sorted BAM file with associated BAI index and MD5 sum of the BAM. The workflow does not remove duplicates but only marks them as duplicates. Therefore, unless you do adapter trimming, the BAM contains all reads also contained in the input FASTQs and in principle the full set of reads is recovered in the files, except for the FASTQ comments, which are dropped. See the BamToFastqPlugin for an performant workflow to reconstitute (almost) original FASTQs (without FASTQ comments and of course not in the original read order).

.dupmark_metrics.txt

Standard output of the duplication marking program. Please refer to the documentation of the biobamba, picard or sambamba tool.

.DepthOfCoverage_Genome.txt

A TSV produced by coverageQc with the following header. An important number is that of the "QC bases":

QC Bases

Bases are filtered for quality based on the following per-base and per-read criteria:

  1. Read must be mapped (obvious)
  2. Read must not be marked as duplicate. Note that the duplication marker leaves one of the duplicates unmarked.
  3. Mapping quality > 0 (pretty strong filtering criterion with BWA)
  4. Count only alignment match (M), insert to the reference (I), sequence match (=), and sequence mismatch (X) CIGAR entries. Consequenctly, the following read-bases are not considered based on the CIGAR string: soft-clipped (S), skipped in reference (N), deleted in reference (D), hard-clipped (H), padded (P).
  5. Length of remaining bases >= min length 36 bp. If this is false, the whole read won't be counted!
  6. Average quality score of remaining bases >= 0 (Phred score) (Note: For WES the default cutoff is 25). If this false, the whole read won't be counted!

Note that overall, only reads on chromosomes listed in the CHROM_SIZES_FILE are considered by the workflow.

The "QC bases" are used in many of the statistics reported by coverageQc-tool.

Overview

Some of the following statistical measures are parametrized by one of the following parameters:

  • base quality cutoff: Defined in the variable BASE_QUALITY_CUTOFF. For WGS and WGBS data the current default is 0 (defined in the COWorkflowsBasePlugin). For WES this default is overwritten to be 25 (see analysisExome.xml).
  • minimum read length: Only the default from coverageQc is used, which is 36.

Furthermore, note that in all these values, except the duplicate-count, reads marked as PCR duplicates are ignored and not counted. Also usually bases that are softclipped (unaligning read-ends) are not counted.

Full details (and the only true reference) can be found in coverageQC.d function addRead().

Column Description Format
interval Contig or chromosome name. "all" is the overall value across all considered contigs.
coverage QC bases Coverage based on QC bases and total bases (including all non-{A,T,C,G}-bases). \d+.\d+x
#QC bases/#total bases Number of QC bases "/" number of total bases (including all non-{A,T,C,G}-bases). \d+/\d+
mapq=0 read{1,2} Number of reads with BWA mapping quality 0 for read 1 and 2. \d+
mapq>0,readlength<minlength read{1,2} Number of reads with mapping quality >0 but shorter than the minimum read length \d+
mapq>0,BaseQualityMedian<basequalCutoff read{1,2} Number of reads with mapping quality >0 and base-quality mean < base-quality cutoff. Note this is the mean base quality, not the median! It is calculated using non-softclipped bases only. \d+
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2} Number of reads with mapping quality >0 and base-quality mean >= base-quality cutoff. Note this is the mean, not the median! It is calculated using non-softclipped bases only. \d+
%incorrect PE orientation Number of read pairs with unexpected/incorrect alignment orientation. \d+
#incorrect proper pair Non-duplicate read pairs marked as properly paired and both aligned in reverse orientation. \d+
#duplicates read{1,2} Reads marked as PCR duplicates by e.g. sambamba. \d+
genome_w/o_N coverage QC bases Coverage based on QC bases and {A,T,C,G}-bases. \d+
#QC bases/#total not_N bases Number QC bases "/" number of {A,T,C,G} bases (i.e. excluding 'N's). \d+/\d+

_readCoverage_1kb_windows.txt, _readCoverage_1kb_windows_coveragePlot_${chr}.png.

Three-column TSV produced by genomeCoverage and a streamed through filter_readbins.pl, such that only chromosomes of interest are kept. The columns are the chromosome, the 0-based start index of the window on the chromosome, and the number of reads covering the window. Window size is 1 kb.

The genomeCoverage tool only counts reads not marked as duplicate reads and -- in "countReads" mode -- having a mapping quality of at least 1.

.fp

Only present if fingerprinting was turned on.

TBD

_flagstats.txt

Please refer to the documentation of samtools flagstats for details.

_extendedFlagstats.txt

This file is produced by flags_isizes_PEaberrations.pl.

Value Description
total alignments TBD
non-duplicate, non-secondary, non-supplementary reads TBD
such with mapping quality >=1 TBD
such on regarded chromosomes TBD
such with both reads on regarded chromosomes TBD
such mapping on different chromosomes TBD
proper pairs read 1 TBD

_insertsizes.txt

The files suffixed by _insertsizes.txt are produced by flags_isizes_PEaberrations.pl. It is a TSV file with insert size (column 1) and count (column 2).

Using this as input the script insertsizePlot.R plots the _insertsize_plot.png and writes the estimated distribution parameters for convenience into the _insertsize_plot.png_qcValues.txt file in three rows:

  1. median
  2. standard deviation / median
  3. standard deviation

DiffChroms

The _DiffChroms.txt file is produced by flags_isizes_PEaberrations.pl and is used for producing the , _DiffChroms.png and _DiffChroms.png_qcValues.txt files by chrom_diff.r.

TBD

ACEseq QC

The files in the ./correctGcBias/ directory (_windows.anno.txt, _sex.txt, _windows.filtered.txt.gz, _windows.filtered.corrected.txt.gz, _qc_gc_corrected.json, _qc_gc_corrected.slim.tsv, _gc_corrected.png) are only produced if runACEseqQC was set to true and should be used only with WGS data.

Please refer to the documentation of the ACEseq workflow for extensive information about the contents of these files.

FWHM

The full width half maximum (FWHM) parameter indicates whether there are problems with the coverage over the genome. It is based on the coverage distribution after GC-bias correction. It most likely also affects the quality of the SNV and Indel calls. Severe cases should be considered for exclusion from the study. _gcBias_corrected.png

GC-bias mean slope

Slope of curve fitted through coverage vs. GC-content points. Large values could indicate less sensitivity and specificity SNV and INDEL in some regions of the genome due to lower coverage. This may be caused by library preparation (e.g. NEB bead selection, leading to enrichment of GC-rich regions).

GC-bias absolute curvature

Absolute curvature of curve fitted through coverage vs. GC-content points. Large values could indicate less sensitivity and specificity for SNV and INDEL in some regions of the genome due to lower coverage.

_commonBam_wroteQcSummary.txt

Summary TSV file with contents collected from the other files produced by writeQCsummary.pl. The file only contains a single line for the overall (all chromosomes) statistics.

Column Description
PID patient ID; could also be cell-line, or what other information you encode here
SAMPLE_TYPE usually tumor01, blood2, etc.
RUN_ID Name of the $run/sequence/ directory containing the FASTQs.
LANE Lane identifier.
TOTAL_READ_COUNT (flagstat) from flagstat
%TOTAL_READ_MAPPED_BWA (flagstat) from flagstat
%properly_paired (flagstat) from flagstat
%singletons (flagstat) from flagstat
ALIGNED_READ_COUNT TBD
%DUPLICATES (Picard metrics file) from .dupmark_metrics.txt file
ESTIMATED_LIBRARY_SIZE (Picard metrics file) from .dupmark_metrics.txt file
%PE_reads_on_diff_chromosomes (mapq>0) TBD
%sd_PE_insertsize (mapq>0) TBD
PE_insertsize (mapq>0) TBD
coverage QC bases w/o N from _DepthOfCoverage_Genome.txt
QC bases/ total bases w/o N from _DepthOfCoverage_Genome.txt
coverage QC bases from _DepthOfCoverage_Genome.txt
QC bases/ total bases from _DepthOfCoverage_Genome.txt
mapq=0 read{1,2} from _DepthOfCoverage_Genome.txt
mapq>0,readlength<minlength read{1,2} from _DepthOfCoverage_Genome.txt
mapq>0,BaseQualityMedian<basequalCutoff read{1,2} from _DepthOfCoverage_Genome.txt
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2} from _DepthOfCoverage_Genome.txt
%incorrect PE orientation from _DepthOfCoverage_Genome.txt
#incorrect proper pair from _DepthOfCoverage_Genome.txt
#duplicates read{1,2} (excluded from coverage analysis from _DepthOfCoverage_Genome.txt
ChrX coverage QC bases from _DepthOfCoverage_Genome.txt
ChrY coverage QC bases from _DepthOfCoverage_Genome.txt

_qualitycontrol.json

Summary JSON file with contents collected from the other files. Produced by qcJson.pl. The JSON contains short entries per chromosome and a more extensive entry for the "all" chromosome summing all results per chromosome.

Variable Description
chromosome chromosome identifier
referenceLength length of chromosomes
qcBasesMapped See QC Bases from _DepthOfCoverage_Genome.txt.
coverageQcBases qcBasesMapped / referenceLength
genomeWithoutNReferenceLength the "#total not_N bases" in "#QC bases/#total not_N bases"; same as length not counting 'N's from CHROM_SIZES_FILE
genomeWithoutNQcBasesMapped the "#QC bases" in "#QC bases/#total not_N bases"
genomeWithoutNCoverageQcBases genomeWithoutNQcBasesMapped / genomeWithoutNReferenceLength
insertSizeMedian from _insertsize_plot.png_qcValues.txt-file; line 1.
insertSizeSD Standard deviation of the insert size distribution; from _insertsize_plot.png_qcValues.txt-file; line 3.
insertSizeCV insertSizeSD / insertSizeMedian. From from _insertsize_plot.png_qcValues.txt-file; line 2.
singletons from flagstats
withItselfAndMateMapped from flagstats
withMateMappedToDifferentChr from flagstats
properlyPaired from flagstats
pairedRead{1,2} flagstats read 1, 2
qcFailedReads flagstats qc-failed
pairedInSequencing from flagstats
totalReadCounter from flagstats total
duplicates from flagstats duplicates
totalMappedReadCounter from flagstats mapped
percentageMatesOnDifferentChr from flagstats
withMateMappedToDifferentChrMaq from flagstats mapq >= 5

Whole-Exome Sequencing (WES)

For WES data, a target region file (TARGET_REGION_FILE parameter) is provided to the workflow and reads are kept if at least one base overlap exists to at least one of these regions. The QC statistics are than calculated like before, e.g. with QC bases mapped produced by the coverageQc.d programm on a base level, but only for the pre-selected reads.

Note that the default for BASE_QUALITY_CUTOFF is 25 for the WES workflow, but 0 for the WGS workflow (as defined in the COWorkflowsBasePlugin). This default has historical reasons and you may want to change it.

Clone this wiki locally