Skip to content

3. Results

Philip Reiner Kensche edited this page Apr 29, 2019 · 47 revisions

Output is written to the output directory with the following general structure:

Directory File types
./ _commonBam_wroteQcSummary.txt, _qualitycontrol.json
./fastqc/ if if runFastQC=true: .fastqc.zip, fastq_qcpass_status.txt
./alignment/ .bam, .bai, .md5, .dupmark_metrics.txt
./coverage/ .DepthOfCoverage_Genome.txt, _readCoverage_1kb_windows.txt, _readCoverage_1kb_windows_coveragePlot_${chr}.png
./fingerprinting/ if runFingerprinting=true: .fp
./flagstats/ _flagstats.txt, _extendedFlagstats.txt
./insertsize_distribution/ _insertsize_plot.png, _insertsize_plot.png_qcValues.txt, _insertsizes.txt
./structural_variation/ _DiffChroms.txt, _DiffChroms.png, _DiffChroms.png_qcValues.txt
./correctedGcBias/ ACEseq QC files if runACEseqQc=true: _windows.anno.txt, _sex.txt, _windows.filtered.txt.gz, _windows.filtered.corrected.txt.gz, _qc_gc_corrected.json, _qc_gc_corrected.slim.tsv, _gc_corrected.png
./roddyExecutionStore/ Execution metadata, logs, configurations. Please refer to the Roddy documentation for details

File contents

Many of the following files are procude by the workflow as one for each lane, optionally for each library -- e.g. with tagmentation WGBS data --, and for the final merged BAM. Some files are also per chromosome.

fastqc.zip, fastq_qcpass_status.txt

FastQC output file and a QC-status derived from the FastQC output by fastqcClassify. The status categories are

Category Meaning Formula
PASS high quality reads mean(q1) > 28
WARN reads quality compromised (20 < mean(q1) < = 28) or (PASS and one median below 20)
FAIL low quality reads mean(q1) < = 20

The Phred score is related to the probability of producing a sequencing error, and each sequenced nucleotide is associated to such a value. As a reference, a Phred score of 20 is a probability of 0.01 of having an error and a score of 30 a probability of 0.001. The reported value corresponds to the average of the first quantile of the Phred score in each position. A value above 28 is OK and below or equal to 20 is of bad quality. Anything in between is dubious. Also notice, in case there is one single position where the median score goes below 20 the sample will be flagged as dubious in case it was considered OK.

.bam, .bai, .md5

Position-sorted BAM file with associated BAI index and MD5 sum of the BAM. The workflow does not remove duplicates but only marks them as duplicates. Therefore, unless you do adapter trimming, the BAM contains all reads also contained in the input FASTQs and in principle the full set of reads is recovered in the files, except for the FASTQ comments, which are dropped. See the BamToFastqPlugin for an performant workflow to reconstitute (almost) original FASTQs (without FASTQ comments and of course not in the original read order).

.dupmark_metrics.txt

Standard output of the duplication marking program. Please refer to the documentation of the biobamba, picard or sambamba tool.

.DepthOfCoverage_Genome.txt

A TSV produced by coverageQc with the following header

Column Description Format
interval Contig or chromosome name. "all" is the overall value across all considered contigs.
coverage QC bases Coverage based on QC bases (see below). \d+.\d+x
#QC bases/#total bases Number of QC bases and number of total bases \d+/\d+
mapq=0 read{1,2} TBD \d+
mapq>0,readlength<minlength read{1,2} TBD \d+
mapq>0,BaseQualityMedian<basequalCutoff read{1,2} Note this is the mean, not the median! TBD \d+
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2} Note this is the mean, not the median! TBD \d+
%incorrect PE orientation TBD \d+
#incorrect proper pair TBD \d+
#duplicates read{1,2} (excluded from coverage analysis) TBD \d+
genome_w/o_N coverage QC bases TBD \d+
#QC bases/#total not_N bases Number QC bases and number of bases excluding 'N's. The total number here originates from the ... file. TBD \d+/\d+

QC Bases

Bases are filtered for quality based on the following per-base and per-read criteria:

  1. Read must be mapped (obvious)
  2. Read must not be marked as duplicate.
  3. Mapping quality > 0 (pretty strong filtering criterion with BWA)
  4. Count only alignment match (M), insert to the reference (I), sequence match (=), and sequence mismatch (X) CIGAR entries. Consequenctly, the following read-bases are not considered based on the CIGAR string: soft-clipped (S), skipped in reference (N), deleted in reference (D), hard-clipped (H), padded (P).
  5. Length of remaining bases >= min length 36 bp. If this is false, the whole read won't be counted!
  6. Average quality score of remaining bases >= 0 (Phred score). If this false, the whole read won't be counted!

Note that overall, only reads on chromosomes listed in the CHROM_SIZES_FILE are considered by the workflow.

_readCoverage_1kb_windows.txt, _readCoverage_1kb_windows_coveragePlot_${chr}.png.

Three-column TSV produced by genomeCoverage and a streamed through filter_readbins.pl, such that only chromosomes of interest are kept. The columns are the chromosome, the 0-based start index of the window on the chromosome, and the number of reads covering the window. Window size is 1 kb.

The genomeCoverage tool only counts reads not marked as duplicate reads and -- in "countReads" mode -- having a mapping quality of at least 1.

.fp

Only present if fingerprinting was turned on.

TBD

_flagstats.txt

Please refer to the documentation of samtools flagstats for details.

_extendedFlagstats.txt

This file is produced by flags_isizes_PEaberrations.pl.

Value Description
total alignments TBD
non-duplicate, non-secondary, non-supplementary reads TBD
such with mapping quality >=1 TBD
such on regarded chromosomes TBD
such with both reads on regarded chromosomes TBD
such mapping on different chromosomes TBD
proper pairs read 1 TBD

_insertsizes.txt

The files suffixed by _insertsizes.txt are produced by flags_isizes_PEaberrations.pl. It is a TSV file with insert size (column 1) and count (column 2).

Using this as input the script insertsizePlot.R plots the _insertsize_plot.png and writes the estimated distribution parameters for convenience into the _insertsize_plot.png_qcValues.txt file in three rows:

  1. median
  2. standard deviation / median
  3. standard deviation

DiffChroms

The _DiffChroms.txt file is produced by flags_isizes_PEaberrations.pl and is used for producing the , _DiffChroms.png and _DiffChroms.png_qcValues.txt files by chrom_diff.r.

TBD

ACEseq QC

The files in the ./correctGcBias/ directory (_windows.anno.txt, _sex.txt, _windows.filtered.txt.gz, _windows.filtered.corrected.txt.gz, _qc_gc_corrected.json, _qc_gc_corrected.slim.tsv, _gc_corrected.png) are only produced if runACEseqQC was set to true and should be used only with WGS data.

Please refer to the documentation of the ACEseq workflow for extensive information about the contents of these files.

FWHM

The full width half maximum (FWHM) parameter indicates whether there are problems with the coverage over the genome. It is based on the coverage distribution after GC-bias correction. It most likely also affects the quality of the SNV and Indel calls. Severe cases should be considered for exclusion from the study. _gcBias_corrected.png

GC-bias mean slope

Slope of curve fitted through coverage vs. GC-content points. Large values could indicate less sensitivity and specificity SNV and INDEL in some regions of the genome due to lower coverage. This may be caused by library preparation (e.g. NEB bead selection, leading to enrichment of GC-rich regions).

GC-bias absolute curvature

Absolute curvature of curve fitted through coverage vs. GC-content points. Large values could indicate less sensitivity and specificity for SNV and INDEL in some regions of the genome due to lower coverage.

_commonBam_wroteQcSummary.txt

Summary TSV file with contents collected from the other files produced by writeQCsummary.pl. The file only contains a single line for the overall (all chromosomes) statistics.

Column Description
PID patient ID; could also be cell-line, or what other information you encode here
SAMPLE_TYPE usually tumor01, blood2, etc.
RUN_ID Name of the $run/sequence/ directory containing the FASTQs.
LANE Lane identifier.
TOTAL_READ_COUNT (flagstat) from flagstat
%TOTAL_READ_MAPPED_BWA (flagstat) from flagstat
%properly_paired (flagstat) from flagstat
%singletons (flagstat) from flagstat
ALIGNED_READ_COUNT TBD
%DUPLICATES (Picard metrics file) from .dupmark_metrics.txt file
ESTIMATED_LIBRARY_SIZE (Picard metrics file) from .dupmark_metrics.txt file
%PE_reads_on_diff_chromosomes (mapq>0) TBD
%sd_PE_insertsize (mapq>0) TBD
PE_insertsize (mapq>0) TBD
coverage QC bases w/o N from _DepthOfCoverage_Genome.txt
QC bases/ total bases w/o N from _DepthOfCoverage_Genome.txt
coverage QC bases from _DepthOfCoverage_Genome.txt
QC bases/ total bases from _DepthOfCoverage_Genome.txt
mapq=0 read{1,2} from _DepthOfCoverage_Genome.txt
mapq>0,readlength<minlength read{1,2} from _DepthOfCoverage_Genome.txt
mapq>0,BaseQualityMedian<basequalCutoff read{1,2} from _DepthOfCoverage_Genome.txt
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2} from _DepthOfCoverage_Genome.txt
%incorrect PE orientation from _DepthOfCoverage_Genome.txt
#incorrect proper pair from _DepthOfCoverage_Genome.txt
#duplicates read{1,2} (excluded from coverage analysis from _DepthOfCoverage_Genome.txt
ChrX coverage QC bases from _DepthOfCoverage_Genome.txt
ChrY coverage QC bases from _DepthOfCoverage_Genome.txt

_qualitycontrol.json

Summary JSON file with contents collected from the other files. Produced by qcJson.pl. The JSON contains short entries per chromosome and a more extensive entry for the "all" chromosome summing all results per chromosome.

Variable Description
chromosome chromosome identifier
referenceLength length of chromosomes
qcBasesMapped See QC Bases from _DepthOfCoverage_Genome.txt.
coverageQcBases qcBasesMapped / referenceLength
genomeWithoutNReferenceLength the "#total not_N bases" in "#QC bases/#total not_N bases"; same as length not counting 'N's from CHROM_SIZES_FILE
genomeWithoutNQcBasesMapped the "#QC bases" in "#QC bases/#total not_N bases"
genomeWithoutNCoverageQcBases genomeWithoutNQcBasesMapped / genomeWithoutNReferenceLength
insertSizeMedian from _insertsize_plot.png_qcValues.txt-file; line 1.
insertSizeSD Standard deviation of the insert size distribution; from _insertsize_plot.png_qcValues.txt-file; line 3.
insertSizeCV insertSizeSD / insertSizeMedian. From from _insertsize_plot.png_qcValues.txt-file; line 2.
singletons from flagstats
withItselfAndMateMapped from flagstats
withMateMappedToDifferentChr from flagstats
properlyPaired from flagstats
pairedRead{1,2} flagstats read 1, 2
qcFailedReads flagstats qc-failed
pairedInSequencing from flagstats
totalReadCounter from flagstats total
duplicates from flagstats duplicates
totalMappedReadCounter from flagstats mapped
percentageMatesOnDifferentChr from flagstats
withMateMappedToDifferentChrMaq from flagstats mapq >= 5
Clone this wiki locally