-
Notifications
You must be signed in to change notification settings - Fork 5
3. Results
Output is written to the output directory with the following general structure:
Directory | File types |
---|---|
./ |
_commonBam_wroteQcSummary.txt , _qualitycontrol.json
|
./fastqc/ |
if if runFastQC=true : .fastqc.zip , fastq_qcpass_status.txt
|
./alignment/ |
.bam , .bai , .md5 , .dupmark_metrics.txt
|
./coverage/ |
.DepthOfCoverage_Genome.txt , _readCoverage_1kb_windows.txt , _readCoverage_1kb_windows_coveragePlot_${chr}.png
|
./fingerprinting/ |
if runFingerprinting=true : .fp
|
./flagstats/ |
_flagstats.txt , _extendedFlagstats.txt
|
./insertsize_distribution/ |
_insertsize_plot.png , _insertsize_plot.png_qcValues.txt , _insertsizes.txt
|
./structural_variation/ |
_DiffChroms.txt , _DiffChroms.png , _DiffChroms.png_qcValues.txt
|
./correctedGcBias/ |
ACEseq QC files if runACEseqQc=true : _windows.anno.txt , _sex.txt , _windows.filtered.txt.gz , _windows.filtered.corrected.txt.gz , _qc_gc_corrected.json , _qc_gc_corrected.slim.tsv , _gc_corrected.png
|
./roddyExecutionStore/ |
Execution metadata, logs, configurations. Please refer to the Roddy documentation for details |
./methylationcalling/ |
Only produced for the WGBS workflow |
Many of the following files are produced by the workflow as one for each lane, optionally for each library -- e.g. with tagmentation WGBS data --, and for the final merged BAM. Some files are also per chromosome.
FastQC output file and a QC-status derived from the FastQC output by fastqcClassify. The status categories are
Category | Meaning | Formula |
---|---|---|
PASS | high quality reads | mean(q1) > 28 |
WARN | reads quality compromised | (20 < mean(q1) < = 28) or (PASS and one median below 20) |
FAIL | low quality reads | mean(q1) < = 20 |
The Phred score is related to the probability of producing a sequencing error, and each sequenced nucleotide is associated to such a value. As a reference, a Phred score of 20 is a probability of 0.01 of having an error and a score of 30 a probability of 0.001. The reported value corresponds to the average of the first quantile of the Phred score in each position. A value above 28 is OK and below or equal to 20 is of bad quality. Anything in between is dubious. Also notice, in case there is one single position where the median score goes below 20 the sample will be flagged as dubious in case it was considered OK.
Position-sorted BAM file with associated BAI index and MD5 sum of the BAM. The workflow does not remove duplicates but only marks them as duplicates. Therefore, unless you do adapter trimming, the BAM contains all reads also contained in the input FASTQs and in principle the full set of reads is recovered in the files, except for the FASTQ comments, which are dropped. See the BamToFastqPlugin for an performant workflow to reconstitute (almost) original FASTQs (without FASTQ comments and of course not in the original read order).
Standard output of the duplication marking program. Please refer to the documentation of the biobamba, picard or sambamba tool.
A TSV produced by coverageQc with the following header. An important number is that of the "QC bases":
Bases are filtered for quality based on the following per-base and per-read criteria:
- Read must be mapped (obvious)
- Read must not be marked as duplicate. Note that the duplication marker leaves one of the duplicates unmarked.
- Mapping quality > 0 (pretty strong filtering criterion with BWA)
- Count only alignment match (M), insert to the reference (I), sequence match (=), and sequence mismatch (X) CIGAR entries. Consequenctly, the following read-bases are not considered based on the CIGAR string: soft-clipped (S), skipped in reference (N), deleted in reference (D), hard-clipped (H), padded (P).
- Length of remaining bases >= min length 36 bp. If this is false, the whole read won't be counted!
- Average quality score of remaining bases >= 0 (Phred score) (Note: For WES the default cutoff is 25). If this false, the whole read won't be counted!
Note that overall, only reads on chromosomes listed in the CHROM_SIZES_FILE
are considered by the workflow.
The "QC bases" are used in many of the statistics reported by coverageQc-tool.
Some of the following statistical measures are parametrized by one of the following parameters:
- base quality cutoff: Defined in the variable
BASE_QUALITY_CUTOFF
. For WGS and WGBS data the current default is 0 (defined in the COWorkflowsBasePlugin). For WES this default is overwritten to be 25 (see analysisExome.xml). - minimum read length: Only the default from coverageQc is used, which is 36.
Furthermore, note that in all these values, except the duplicate-count, reads marked as PCR duplicates are ignored and not counted. Also usually bases that are softclipped (unaligning read-ends) are not counted.
Full details (and the only true reference) can be found in coverageQC.d function addRead()
.
Column | Description | Format |
---|---|---|
interval | Contig or chromosome name. "all" is the overall value across all considered contigs. | |
coverage QC bases | Coverage based on QC bases and total bases (including all non-{A,T,C,G}-bases). | \d+.\d+x |
#QC bases/#total bases | Number of QC bases "/" number of total bases (including all non-{A,T,C,G}-bases). | \d+/\d+ |
mapq=0 read{1,2} | Number of reads with BWA mapping quality 0 for read 1 and 2. | \d+ |
mapq>0,readlength<minlength read{1,2} | Number of reads with mapping quality >0 but shorter than the minimum read length | \d+ |
mapq>0,BaseQualityMedian<basequalCutoff read{1,2} | Number of reads with mapping quality >0 and base-quality mean < base-quality cutoff. Note this is the mean base quality, not the median! It is calculated using non-softclipped bases only. | \d+ |
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2} | Number of reads with mapping quality >0 and base-quality mean >= base-quality cutoff. Note this is the mean, not the median! It is calculated using non-softclipped bases only. | \d+ |
%incorrect PE orientation | Number of read pairs with unexpected/incorrect alignment orientation. | \d+ |
#incorrect proper pair | Non-duplicate read pairs marked as properly paired and both aligned in reverse orientation. | \d+ |
#duplicates read{1,2} | Reads marked as PCR duplicates by e.g. sambamba. | \d+ |
genome_w/o_N coverage QC bases | Coverage based on QC bases and {A,T,C,G}-bases. | \d+ |
#QC bases/#total not_N bases | Number QC bases "/" number of {A,T,C,G} bases (i.e. excluding 'N's). | \d+/\d+ |
Three-column TSV produced by genomeCoverage and a streamed through filter_readbins.pl, such that only chromosomes of interest are kept. The columns are the chromosome, the 0-based start index of the window on the chromosome, and the number of reads covering the window. Window size is 1 kb.
The genomeCoverage tool only counts reads not marked as duplicate reads and -- in "countReads" mode -- having a mapping quality of at least 1.
Only present if fingerprinting was turned on.
TBD
Please refer to the documentation of samtools flagstats for details.
This file is produced by flags_isizes_PEaberrations.pl.
Value | Description |
---|---|
total alignments | TBD |
non-duplicate, non-secondary, non-supplementary reads | TBD |
such with mapping quality >=1 | TBD |
such on regarded chromosomes | TBD |
such with both reads on regarded chromosomes | TBD |
such mapping on different chromosomes | TBD |
proper pairs read 1 | TBD |
The files suffixed by _insertsizes.txt
are produced by flags_isizes_PEaberrations.pl. It is a TSV file with insert size (column 1) and count (column 2).
Using this as input the script insertsizePlot.R plots the _insertsize_plot.png
and writes the estimated distribution parameters for convenience into the _insertsize_plot.png_qcValues.txt
file in three rows:
- median
- standard deviation / median
- standard deviation
The _DiffChroms.txt
file is produced by flags_isizes_PEaberrations.pl and is used for producing the , _DiffChroms.png
and _DiffChroms.png_qcValues.txt
files by chrom_diff.r.
TBD
The files in the ./correctGcBias/
directory (_windows.anno.txt
, _sex.txt
, _windows.filtered.txt.gz
, _windows.filtered.corrected.txt.gz
, _qc_gc_corrected.json
, _qc_gc_corrected.slim.tsv
, _gc_corrected.png
) are only produced if runACEseqQC
was set to true and should be used only with WGS data.
Please refer to the documentation of the ACEseq workflow for extensive information about the contents of these files.
The full width half maximum (FWHM) parameter indicates whether there are problems with the coverage over the genome. It is based on the coverage distribution after GC-bias correction. It most likely also affects the quality of the SNV and Indel calls. Severe cases should be considered for exclusion from the study. _gcBias_corrected.png
Slope of curve fitted through coverage vs. GC-content points. Large values could indicate less sensitivity and specificity SNV and INDEL in some regions of the genome due to lower coverage. This may be caused by library preparation (e.g. NEB bead selection, leading to enrichment of GC-rich regions).
Absolute curvature of curve fitted through coverage vs. GC-content points. Large values could indicate less sensitivity and specificity for SNV and INDEL in some regions of the genome due to lower coverage.
Summary TSV file with contents collected from the other files produced by writeQCsummary.pl. The file only contains a single line for the overall (all chromosomes) statistics.
Column | Description |
---|---|
PID | patient ID; could also be cell-line, or what other information you encode here |
SAMPLE_TYPE | usually tumor01, blood2, etc. |
RUN_ID | Name of the $run/sequence/ directory containing the FASTQs. |
LANE | Lane identifier. |
TOTAL_READ_COUNT (flagstat) | from flagstat |
%TOTAL_READ_MAPPED_BWA (flagstat) | from flagstat |
%properly_paired (flagstat) | from flagstat |
%singletons (flagstat) | from flagstat |
ALIGNED_READ_COUNT | TBD |
%DUPLICATES (Picard metrics file) | from .dupmark_metrics.txt file |
ESTIMATED_LIBRARY_SIZE (Picard metrics file) | from .dupmark_metrics.txt file |
%PE_reads_on_diff_chromosomes (mapq>0) | TBD |
%sd_PE_insertsize (mapq>0) | TBD |
PE_insertsize (mapq>0) | TBD |
coverage QC bases w/o N | from _DepthOfCoverage_Genome.txt |
QC bases/ total bases w/o N | from _DepthOfCoverage_Genome.txt |
coverage QC bases | from _DepthOfCoverage_Genome.txt |
QC bases/ total bases | from _DepthOfCoverage_Genome.txt |
mapq=0 read{1,2} | from _DepthOfCoverage_Genome.txt |
mapq>0,readlength<minlength read{1,2} | from _DepthOfCoverage_Genome.txt |
mapq>0,BaseQualityMedian<basequalCutoff read{1,2} | from _DepthOfCoverage_Genome.txt |
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2} | from _DepthOfCoverage_Genome.txt |
%incorrect PE orientation | from _DepthOfCoverage_Genome.txt |
#incorrect proper pair | from _DepthOfCoverage_Genome.txt |
#duplicates read{1,2} (excluded from coverage analysis | from _DepthOfCoverage_Genome.txt |
ChrX coverage QC bases | from _DepthOfCoverage_Genome.txt |
ChrY coverage QC bases | from _DepthOfCoverage_Genome.txt |
Summary JSON file with contents collected from the other files. Produced by qcJson.pl. The JSON contains short entries per chromosome and a more extensive entry for the "all" chromosome summing all results per chromosome.
Variable | Description |
---|---|
chromosome | chromosome identifier |
referenceLength | length of chromosomes |
qcBasesMapped | See QC Bases from _DepthOfCoverage_Genome.txt. |
coverageQcBases |
qcBasesMapped / referenceLength
|
genomeWithoutNReferenceLength | the "#total not_N bases" in "#QC bases/#total not_N bases"; same as length not counting 'N's from CHROM_SIZES_FILE
|
genomeWithoutNQcBasesMapped | the "#QC bases" in "#QC bases/#total not_N bases" |
genomeWithoutNCoverageQcBases |
genomeWithoutNQcBasesMapped / genomeWithoutNReferenceLength
|
insertSizeMedian | from _insertsize_plot.png_qcValues.txt-file; line 1. |
insertSizeSD | Standard deviation of the insert size distribution; from _insertsize_plot.png_qcValues.txt-file; line 3. |
insertSizeCV |
insertSizeSD / insertSizeMedian . From from _insertsize_plot.png_qcValues.txt-file; line 2. |
singletons | from flagstats |
withItselfAndMateMapped | from flagstats |
withMateMappedToDifferentChr | from flagstats |
properlyPaired | from flagstats |
pairedRead{1,2} | flagstats read 1, 2 |
qcFailedReads | flagstats qc-failed |
pairedInSequencing | from flagstats |
totalReadCounter | from flagstats total |
duplicates | from flagstats duplicates |
totalMappedReadCounter | from flagstats mapped |
percentageMatesOnDifferentChr | from flagstats |
withMateMappedToDifferentChrMaq | from flagstats mapq >= 5 |
For WES data, a target region file (TARGET_REGION_FILE
parameter) is provided to the workflow and reads are kept if at least one base overlap exists to at least one of these regions. The QC statistics are than calculated like before, e.g. with QC bases mapped produced by the coverageQc.d programm on a base level, but only for the pre-selected reads.
Note that the default for BASE_QUALITY_CUTOFF
is 25 for the WES workflow, but 0 for the WGS workflow (as defined in the COWorkflowsBasePlugin). This default has historical reasons and you may want to change it.
The WES-QC files are suffixed by target
or variants of it (unfortunately no consistent naming scheme was applied, but changing this would required downstream workflows of researchers to get changed, so this won't happen with version 1 of the plugin anymore).
There is one additional file produced.
The file is produced by resources/analysisTools/exomePipeline/targetcov.pl
that processes the output of coverageBed -d -abam $targetRegionRestrictedBam -b $targetRegionsBed
(bedtools), which reports the reports the depth at each position in each target feature. The targetcov.pl
output is in TSV format with the following columns:
- chromosome
- interval start; 0-based, right-open (i.e.
[0, $intervalLength)
coordinates like in BED format) - interval end; 0-based, right-open
- reads: estimated number of reads covering the region. This is not an exact count, but attempts to count the read-starts in the interval. However, dependent on split-reads (one read with multiple subalignments) and end-to-start arrangements of adjacent alignment positions of different reads the number may deviate from the real number of reads.
- covered bases: number of positions with non-zero coverage
- length: the length of the interval
- ratio: proportion of covered bases (
covered bases / length
) - average: average coverage
TBD