3. Results
Output is written to the output directory with the following general structure:
Directory | File types |
---|---|
./ | _commonBam_wroteQcSummary.txt, _qualitycontrol.json |
./fastqc/ | if runFastQC=true: .fastqc.zip, fastq_qcpass_status.txt |
./alignment/ | .bam, .bai, .md5, .dupmark_metrics.txt |
./coverage/ | .DepthOfCoverage_Genome.txt, _readCoverage_1kb_windows.txt, _readCoverage_1kb_windows_coveragePlot_${chr}.png |
./fingerprinting/ | if runFingerprinting=true: .fp |
./flagstats/ | _flagstats.txt, _extendedFlagstats.txt |
./insertsize_distribution/ | _insertsize_plot.png, _insertsize_plot.png_qcValues.txt, _insertsizes.txt |
./structural_variation/ | _DiffChroms.txt, _DiffChroms.png, _DiffChroms.png_qcValues.txt |
./correctedGcBias/ | ACEseq QC files if runACEseqQc=true: _windows.anno.txt, _sex.txt, _windows.filtered.txt.gz, _windows.filtered.corrected.txt.gz, _qc_gc_corrected.json, _qc_gc_corrected.slim.tsv, _gc_corrected.png |
./roddyExecutionStore/ | Execution metadata, logs, configurations. Please refer to the Roddy documentation for details. |
NOTE: If you are using data produced by the automation system OTP, you will find a somewhat different structure. Specifically, there the alignments and .bai files are in the top-level directory, while the above-mentioned categories are sorted into qualitycontrol/{merged,$laneId,$libraryId} directories.
Many of the following files are produced by the workflow once for each lane, optionally once for each library (e.g. with tagmentation WGBS data), and once for the final merged BAM. Some files are also produced per chromosome.
FastQC output file and a QC-status derived from the FastQC output by fastqcClassify. The status categories are
Category | Meaning | Formula |
---|---|---|
PASS | high quality reads | mean(q1) > 28 |
WARN | reads quality compromised | (20 < mean(q1) <= 28) or (PASS and one median below 20) |
FAIL | low quality reads | mean(q1) <= 20 |
The Phred score is related to the probability of a sequencing error, and each sequenced nucleotide is associated with such a value. As a reference, a Phred score of 20 corresponds to an error probability of 0.01 and a score of 30 to a probability of 0.001. The reported value, mean(q1), is the average over all read positions of the first quartile of the Phred scores at each position. A value above 28 is OK, a value of 20 or below indicates bad quality, and anything in between is dubious. Also note that if the median score drops below 20 at even a single position, a sample that would otherwise be considered OK is flagged as dubious.
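The rules above can be written down as a minimal Python sketch. This is not the code of fastqcClassify; the function names are hypothetical, and the inputs are assumed to be the per-position lower quartile and median Phred scores as reported by FastQC.

```python
def phred_to_error_probability(q):
    """Convert a Phred score Q into an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def classify_fastq_quality(q1_per_position, median_per_position):
    """Apply the PASS/WARN/FAIL rules from the table above.

    q1_per_position: first-quartile Phred score at each read position.
    median_per_position: median Phred score at each read position.
    (Both input names are assumptions made for this illustration.)
    """
    mean_q1 = sum(q1_per_position) / len(q1_per_position)
    if mean_q1 <= 20:
        return "FAIL"
    if mean_q1 <= 28:
        return "WARN"
    # mean(q1) > 28 would be PASS, unless a single median drops below 20.
    if any(median < 20 for median in median_per_position):
        return "WARN"
    return "PASS"
```

For example, `classify_fastq_quality([30, 30], [30, 19])` yields `"WARN"`, because a single position has a median below 20 although mean(q1) exceeds 28.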
Position-sorted BAM file with associated BAI index and MD5 sum of the BAM. The workflow does not remove duplicates but only marks them. Therefore, unless you do adapter trimming, the BAM contains all reads that are in the input FASTQs, and in principle the full set of reads can be recovered from it, except for the FASTQ comments, which are dropped. See the BamToFastqPlugin for a performant workflow to reconstitute (almost) original FASTQs (without FASTQ comments and of course not in the original read order).
Standard output of the duplication marking program. Please refer to the documentation of the biobambam, picard or sambamba tool.
A TSV produced by coverageQc with the following header
Column | Description | Format |
---|---|---|
interval | Contig or chromosome name. "all" is the overall value across all considered contigs. | |
coverage QC bases | Coverage based on QC bases (see below). | \d+.\d+x |
#QC bases/#total bases | Number of QC bases and number of total bases | \d+/\d+ |
mapq=0 read{1,2} | TBD | \d+ |
mapq>0,readlength<minlength read{1,2} | TBD | \d+ |
mapq>0,BaseQualityMedian<basequalCutoff read{1,2} | Note this is the mean, not the median! TBD | \d+ |
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2} | Note this is the mean, not the median! TBD | \d+ |
%incorrect PE orientation | TBD | \d+ |
#incorrect proper pair | TBD | \d+ |
#duplicates read{1,2} (excluded from coverage analysis) | TBD | \d+ |
genome_w/o_N coverage QC bases | TBD | \d+ |
#QC bases/#total not_N bases | Number of QC bases and number of bases excluding 'N's. The total number here originates from the ... file. TBD | \d+/\d+ |
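As an illustration of the "Format" column, the following Python sketch parses the two composite field formats. It assumes that the dot in `\d+.\d+x` is a literal decimal point; the function names and the example values are hypothetical and not taken from a real output file.

```python
import re

# Patterns derived from the "Format" column above.
COVERAGE_RE = re.compile(r"^(\d+\.\d+)x$")
RATIO_RE = re.compile(r"^(\d+)/(\d+)$")


def parse_coverage(field):
    """Parse a coverage field such as '30.52x' into a float."""
    match = COVERAGE_RE.match(field)
    if match is None:
        raise ValueError(f"not a coverage value: {field!r}")
    return float(match.group(1))


def parse_ratio(field):
    """Parse a '#QC bases/#total bases' field into a pair of integers."""
    match = RATIO_RE.match(field)
    if match is None:
        raise ValueError(f"not a ratio: {field!r}")
    return int(match.group(1)), int(match.group(2))
```

With made-up values, `parse_coverage("30.52x")` gives `30.52` and `parse_ratio("100/200")` gives `(100, 200)`.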
Bases are filtered for quality based on the following per-base and per-read criteria:
- Mapped reads (obvious)
- Mapping quality > 0 (pretty strong filtering criterion with BWA)
- Count only alignment match (M), insertion to the reference (I), sequence match (=), and sequence mismatch (X) CIGAR entries. Bases corresponding to these operations are considered further. Thus the following read bases are not considered based on the CIGAR string (see the SAMv1.pdf specification): soft-clipped (S), skipped in reference (N), deleted in reference (D), hard-clipped (H), padded (P)
- Length of remaining bases >= min length 36 bp. If this is false, the whole read won't be counted!
- Average quality score of remaining bases >= 0 (Phred score). If this is false, the whole read won't be counted!
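The filter criteria above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used by coverageQc: the function name and parameters are hypothetical, and the defaults merely mirror the thresholds listed above.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
COUNTED_OPS = set("MI=X")          # operations whose bases are counted
QUERY_CONSUMING_OPS = set("MIS=X")  # operations that consume read bases


def count_qc_bases(cigar, base_qualities, mapq, is_mapped=True,
                   min_length=36, min_mean_quality=0):
    """Return the number of QC bases a single read contributes (0 if filtered)."""
    if not is_mapped or mapq <= 0:
        return 0
    kept_qualities = []
    position = 0  # index into the read's base qualities
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in COUNTED_OPS:
            kept_qualities.extend(base_qualities[position:position + n])
        if op in QUERY_CONSUMING_OPS:
            position += n
    # The read is dropped as a whole if too few bases remain ...
    if not kept_qualities or len(kept_qualities) < min_length:
        return 0
    # ... or if the mean quality of the remaining bases is too low.
    if sum(kept_qualities) / len(kept_qualities) < min_mean_quality:
        return 0
    return len(kept_qualities)
```

For a 50 bp read with CIGAR `5S40M5S`, the soft-clipped ends are ignored and 40 bases are counted; with `30M20S` only 30 bases remain, which is below the minimum length, so the whole read is dropped.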
Three-column TSV produced by genomeCoverage and streamed through filter_readbins.pl, such that only chromosomes of interest are kept. The columns are the chromosome, the 0-based start index of the window on the chromosome, and the number of reads covering the window. Window size is 1 kb.
The genomeCoverage tool only counts reads not marked as duplicate reads and -- in "countReads" mode -- having a mapping quality of at least 1.
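A minimal Python sketch for reading such a window file is shown below. It assumes that the file has no header line; the `Window` type and the `read_windows` function are hypothetical names introduced for this illustration.

```python
import csv
import io
from collections import namedtuple

WINDOW_SIZE = 1000  # 1 kb windows, as stated above

Window = namedtuple("Window", ["chromosome", "start", "end", "reads"])


def read_windows(handle):
    """Yield one Window per line of a three-column coverage TSV."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        chromosome, start, reads = row
        start = int(start)
        # The start is 0-based; the end is computed as an exclusive coordinate.
        yield Window(chromosome, start, start + WINDOW_SIZE, int(reads))


# Usage with made-up example data:
example = io.StringIO("1\t0\t12\n1\t1000\t7\n")
windows = list(read_windows(example))
```

Here `windows[1]` describes the window from 1000 to 2000 on chromosome "1" with 7 reads.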
Only present if fingerprinting was turned on.
TBD
Please refer to the documentation of samtools flagstats for details.
This file is produced by flags_isizes_PEaberrations.pl.
Value | Description |
---|---|
total alignments | TBD |
non-duplicate, non-secondary, non-supplementary reads | TBD |
such with mapping quality >=1 | TBD |
such on regarded chromosomes | TBD |
such with both reads on regarded chromosomes | TBD |
such mapping on different chromosomes | TBD |
proper pairs read 1 | TBD |
The files suffixed by _insertsizes.txt are produced by flags_isizes_PEaberrations.pl. Each is a TSV file with insert size (column 1) and count (column 2).
Using this as input, the script insertsizePlot.R plots the _insertsize_plot.png and writes the estimated distribution parameters for convenience into the _insertsize_plot.png_qcValues.txt file in three rows:
- median
- standard deviation / median
- standard deviation
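To illustrate how three such values can be derived from an insert size histogram, here is a Python sketch using a plain weighted median and a population standard deviation. How insertsizePlot.R actually estimates the parameters (for example rounding, scaling, or outlier handling) is not described here, so its results may differ; the function name is hypothetical.

```python
import math


def insert_size_summary(histogram):
    """Summarize an insert size histogram (dict: insert size -> count).

    Returns (median, sd / median, sd), i.e. the same order as the three rows.
    """
    total = sum(histogram.values())
    # Weighted median: smallest insert size at which the cumulative
    # count reaches half of all observations.
    cumulative = 0
    median = None
    for size in sorted(histogram):
        cumulative += histogram[size]
        if cumulative >= total / 2:
            median = size
            break
    mean = sum(size * count for size, count in histogram.items()) / total
    variance = sum(count * (size - mean) ** 2
                   for size, count in histogram.items()) / total
    sd = math.sqrt(variance)
    return median, sd / median, sd
```

The input dictionary corresponds to the two columns of an _insertsizes.txt file, with column 1 as key and column 2 as value.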
The _DiffChroms.txt file is produced by flags_isizes_PEaberrations.pl and is used by chrom_diff.r to produce the _DiffChroms.png and _DiffChroms.png_qcValues.txt files.
TBD
The files in the ./correctGcBias/ directory (_windows.anno.txt, _sex.txt, _windows.filtered.txt.gz, _windows.filtered.corrected.txt.gz, _qc_gc_corrected.json, _qc_gc_corrected.slim.tsv, _gc_corrected.png) are only produced if runACEseqQC was set to true.
Please refer to the documentation of the ACEseq workflow for extensive information about the contents of these files.
Summary TSV file produced by writeQCsummary.pl, with contents collected from the other files. The file only contains a single line for the overall (all chromosomes) statistics.
Column | Description |
---|---|
PID | patient ID; could also be cell-line, or what other information you encode here |
SAMPLE_TYPE | usually tumor01, blood2, etc. |
RUN_ID | Name of the $run/sequence/ directory containing the FASTQs. |
LANE | Lane identifier. |
TOTAL_READ_COUNT (flagstat) | from flagstat |
%TOTAL_READ_MAPPED_BWA (flagstat) | from flagstat |
%properly_paired (flagstat) | from flagstat |
%singletons (flagstat) | from flagstat |
ALIGNED_READ_COUNT | TBD |
%DUPLICATES (Picard metrics file) | from .dupmark_metrics.txt file |
ESTIMATED_LIBRARY_SIZE (Picard metrics file) | from .dupmark_metrics.txt file |
%PE_reads_on_diff_chromosomes (mapq>0) | TBD |
%sd_PE_insertsize (mapq>0) | TBD |
PE_insertsize (mapq>0) | TBD |
coverage QC bases w/o N | from _DepthOfCoverage_Genome.txt |
QC bases/ total bases w/o N | from _DepthOfCoverage_Genome.txt |
coverage QC bases | from _DepthOfCoverage_Genome.txt |
QC bases/ total bases | from _DepthOfCoverage_Genome.txt |
mapq=0 read{1,2} | from _DepthOfCoverage_Genome.txt |
mapq>0,readlength<minlength read{1,2} | from _DepthOfCoverage_Genome.txt |
mapq>0,BaseQualityMedian<basequalCutoff read{1,2} | from _DepthOfCoverage_Genome.txt |
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2} | from _DepthOfCoverage_Genome.txt |
%incorrect PE orientation | from _DepthOfCoverage_Genome.txt |
#incorrect proper pair | from _DepthOfCoverage_Genome.txt |
#duplicates read{1,2} (excluded from coverage analysis) | from _DepthOfCoverage_Genome.txt |
ChrX coverage QC bases | from _DepthOfCoverage_Genome.txt |
ChrY coverage QC bases | from _DepthOfCoverage_Genome.txt |
Summary JSON file with contents collected from the other files. Produced by qcJson.pl. The JSON contains short entries per chromosome and a more extensive entry for the "all" chromosome summing all results per chromosome.
Variable | Description |
---|---|
chromosome | chromosome identifier |
referenceLength | length of chromosomes |
qcBasesMapped | See QC Bases from _DepthOfCoverage_Genome.txt. |
coverageQcBases | qcBasesMapped / referenceLength |
genomeWithoutNReferenceLength | the "#total not_N bases" in "#QC bases/#total not_N bases"; same as length not counting 'N's from CHROM_SIZES_FILE |
genomeWithoutNQcBasesMapped | the "#QC bases" in "#QC bases/#total not_N bases" |
genomeWithoutNCoverageQcBases | genomeWithoutNQcBasesMapped / genomeWithoutNReferenceLength |
insertSizeMedian | from _insertsize_plot.png_qcValues.txt-file; line 1. |
insertSizeSD | Standard deviation of the insert size distribution; from _insertsize_plot.png_qcValues.txt-file; line 3. |
insertSizeCV | insertSizeSD / insertSizeMedian. From _insertsize_plot.png_qcValues.txt-file; line 2. |
singletons | from flagstats |
withItselfAndMateMapped | from flagstats |
withMateMappedToDifferentChr | from flagstats |
properlyPaired | from flagstats |
pairedRead{1,2} | flagstats read 1, 2 |
qcFailedReads | flagstats qc-failed |
pairedInSequencing | from flagstats |
totalReadCounter | from flagstats total |
duplicates | from flagstats duplicates |
totalMappedReadCounter | from flagstats mapped |
percentageMatesOnDifferentChr | from flagstats |
withMateMappedToDifferentChrMaq | from flagstats mapq >= 5 |
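The derived variables in the table can be recomputed from the primary ones, as in the following Python sketch. The exact nesting of the _qualitycontrol.json file is not described here, so the sketch assumes that a single chromosome entry is available as a flat dictionary keyed by the variable names above. The function name and the example numbers are hypothetical, and the real file may store values such as insertSizeCV in a different scaling.

```python
def derived_qc_values(entry):
    """Recompute the ratio-type variables from one chromosome entry."""
    return {
        "coverageQcBases":
            entry["qcBasesMapped"] / entry["referenceLength"],
        "genomeWithoutNCoverageQcBases":
            entry["genomeWithoutNQcBasesMapped"]
            / entry["genomeWithoutNReferenceLength"],
        "insertSizeCV":
            entry["insertSizeSD"] / entry["insertSizeMedian"],
    }


# Usage with made-up numbers:
example_entry = {
    "qcBasesMapped": 3000,
    "referenceLength": 1000,
    "genomeWithoutNQcBasesMapped": 3000,
    "genomeWithoutNReferenceLength": 750,
    "insertSizeSD": 50,
    "insertSizeMedian": 200,
}
derived = derived_qc_values(example_entry)
```

Comparing such recomputed values with the ones stored in the JSON can serve as a quick consistency check.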