3. Results

Output is written to the output directory with the following general structure:

Directory	File types
`./`	`_commonBam_wroteQcSummary.txt`, `_qualitycontrol.json`
`./fastqc/`	if `runFastQC=true`: `.fastqc.zip`, `fastq_qcpass_status.txt`
`./alignment/`	`.bam`, `.bai`, `.md5`, `.dupmark_metrics.txt`
`./coverage/`	`.DepthOfCoverage_Genome.txt`, `_readCoverage_1kb_windows.txt`, `_readCoverage_1kb_windows_coveragePlot_${chr}.png`
`./fingerprinting/`	if `runFingerprinting=true`: `.fp`
`./flagstats/`	`_flagstats.txt`, `_extendedFlagstats.txt`
`./insertsize_distribution/`	`_insertsize_plot.png`, `_insertsize_plot.png_qcValues.txt`, `_insertsizes.txt`
`./structural_variation/`	`_DiffChroms.txt`, `_DiffChroms.png`, `_DiffChroms.png_qcValues.txt`
`./correctedGcBias/`	ACEseq QC files if `runACEseqQc=true`: `_windows.anno.txt`, `_sex.txt`, `_windows.filtered.txt.gz`, `_windows.filtered.corrected.txt.gz`, `_qc_gc_corrected.json`, `_qc_gc_corrected.slim.tsv`, `_gc_corrected.png`
`./roddyExecutionStore/`	Execution metadata, logs, configurations. Please refer to the Roddy documentation for details

NOTE: If you are using data produced by the automation system OTP, you will find a somewhat different structure. Specifically, the alignment and .bai files are in the top-level directory, while the categories mentioned above are sorted into qualitycontrol/{merged,$laneId,$libraryId} directories.

File contents

Many of the following files are produced by the workflow once for each lane, optionally for each library (e.g. with tagmentation WGBS data), and for the final merged BAM. Some files are also produced per chromosome.

`fastqc.zip`, `fastq_qcpass_status.txt`

FastQC output file and a QC-status derived from the FastQC output by fastqcClassify. The status categories are

Category	Meaning	Formula
PASS	high quality reads	mean(q1) > 28
WARN	reads quality compromised	(20 < mean(q1) <= 28) or (PASS and one median below 20)
FAIL	low quality reads	mean(q1) <= 20

The Phred score is related to the probability of a sequencing error, and each sequenced nucleotide is associated with such a value. For reference, a Phred score of 20 corresponds to an error probability of 0.01 and a score of 30 to a probability of 0.001. The reported value is the average, over all read positions, of the first quantile (q1) of the Phred scores at each position. A value above 28 is considered good, a value of 20 or below is of bad quality, and anything in between is dubious. Note also that if the median score drops below 20 at even a single position, a sample that would otherwise be considered good is flagged as dubious.
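The classification rules above can be sketched in a few lines of Python. This is only an illustration of the table and the paragraph, not the fastqcClassify implementation; the function names and the two input lists (per-position first-quantile and median Phred scores) are assumptions made for the example, and "one median below 20" is read here as "at least one".

```python
from statistics import mean


def phred_to_error_probability(q):
    """Convert a Phred score Q into an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def classify_fastq_quality(q1_per_position, median_per_position):
    """Return PASS, WARN or FAIL following the category table.

    Both arguments are hypothetical inputs: one Phred value per read position.
    """
    mean_q1 = mean(q1_per_position)
    if mean_q1 <= 20:
        return "FAIL"
    if mean_q1 <= 28:
        return "WARN"
    # mean(q1) > 28: downgrade to WARN if any position has a median below 20.
    if any(m < 20 for m in median_per_position):
        return "WARN"
    return "PASS"
```

For example, `classify_fastq_quality([30, 32, 31], [33, 19, 35])` yields `WARN`, because the mean of q1 is above 28 but one median lies below 20.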

`.bam`, `.bai`, `.md5`

Position-sorted BAM file with its associated BAI index and the MD5 sum of the BAM. The workflow does not remove duplicates but only marks them. Therefore, unless you do adapter trimming, the BAM contains all reads from the input FASTQs, and in principle the full set of reads can be recovered from it, except for the FASTQ comments, which are dropped. See the BamToFastqPlugin for a performant workflow to reconstitute (almost) original FASTQs (without FASTQ comments and, of course, not in the original read order).
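If you want to check a copied BAM against its MD5 sum, a minimal Python sketch could look like the following. It assumes that the first whitespace-separated token in the `.md5` file is the hexadecimal digest (as written by `md5sum`); the function names and both paths are placeholders.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bam_matches_md5(bam_path, md5_path):
    """Compare the BAM's digest with the one stored in the .md5 file.

    Assumption: the digest is the first whitespace-separated token.
    """
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(bam_path) == expected
```

Reading in chunks keeps memory use constant, which matters for whole-genome BAM files.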

`.dupmark_metrics.txt`

Standard output of the duplication marking program. Please refer to the documentation of the biobambam, Picard or sambamba tool.

`.DepthOfCoverage_Genome.txt`

A TSV produced by coverageQc with the following header

Column	Description	Format
interval	Contig or chromosome name. "all" is the overall value across all considered contigs.
coverage QC bases	Coverage based on QC bases (see below).	\d+.\d+x
#QC bases/#total bases	Number of QC bases and number of total bases	\d+/\d+
mapq=0 read{1,2}	TBD	\d+
mapq>0,readlength<minlength read{1,2}	TBD	\d+
mapq>0,BaseQualityMedian<basequalCutoff read{1,2}	Note this is the mean, not the median! TBD	\d+
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2}	Note this is the mean, not the median! TBD	\d+
%incorrect PE orientation	TBD	\d+
#incorrect proper pair	TBD	\d+
#duplicates read{1,2} (excluded from coverage analysis)	TBD	\d+
genome_w/o_N coverage QC bases	TBD	\d+
#QC bases/#total not_N bases	Number QC bases and number of bases excluding 'N's. The total number here originates from the ... file. TBD	\d+/\d+
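As a hedged illustration of how the formats in the table can be consumed, the following Python sketch reads the first three columns. It assumes the file starts with a header line using exactly the column names listed above; the function names and the returned dictionary layout are made up for this example.

```python
import csv
import io


def parse_coverage(value):
    """Turn a coverage string such as '30.50x' into a float."""
    return float(value.rstrip("x"))


def parse_ratio(value):
    """Turn a ratio string such as '305/10' into a pair of integers."""
    numerator, denominator = value.split("/")
    return int(numerator), int(denominator)


def read_depth_of_coverage(handle):
    """Map each interval name (including 'all') to its parsed values."""
    rows = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        qc_bases, total_bases = parse_ratio(row["#QC bases/#total bases"])
        rows[row["interval"]] = {
            "coverage": parse_coverage(row["coverage QC bases"]),
            "qc_bases": qc_bases,
            "total_bases": total_bases,
        }
    return rows


# Invented values, only to show the call.
example = io.StringIO(
    "interval\tcoverage QC bases\t#QC bases/#total bases\n"
    "all\t30.50x\t305/10\n"
)
parsed = read_depth_of_coverage(example)
```

With a real file you would pass `open(path, newline="")` instead of the `io.StringIO` object.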

QC Bases

Bases are filtered for quality based on the following per-base and per-read criteria:

Mapped reads (obvious)
Mapping quality > 0 (pretty strong filtering criterion with BWA)
Count only alignment match (M), insert to the reference (I), sequence match (=), and sequence mismatch (X) CIGAR entries. Bases corresponding to these operations are considered further. Thus the following read-bases are not considered based on the CIGAR string (see SAMv1.pdf specification): soft-clipped (S), skipped in reference (N), deleted in reference (D), hard-clipped (H), padded (P)
Length of remaining bases >= min length 36 bp. If this is false, the whole read won't be counted!
Average quality score of remaining bases >= 0 (Phred score). If this is false, the whole read won't be counted!
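The criteria above can be sketched as follows. This is an illustration under assumptions, not the coverageQc code: the function names are hypothetical, the read is given as a CIGAR string plus one Phred value per read base, and duplicate handling is left out.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
COUNTED_OPS = set("MI=X")           # operations whose bases are considered
QUERY_CONSUMING_OPS = set("MIS=X")  # operations that consume read bases


def qc_base_qualities(cigar, base_qualities):
    """Return the qualities of the read bases under M, I, = and X operations."""
    kept = []
    position = 0
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in COUNTED_OPS:
            kept.extend(base_qualities[position:position + length])
        if op in QUERY_CONSUMING_OPS:
            position += length
    return kept


def count_qc_bases(is_mapped, mapq, cigar, base_qualities,
                   min_length=36, min_mean_quality=0):
    """Return the number of QC bases of one read, or 0 if the read is dropped."""
    if not is_mapped or mapq <= 0:
        return 0
    kept = qc_base_qualities(cigar, base_qualities)
    if not kept or len(kept) < min_length:
        return 0  # too few remaining bases: the whole read is not counted
    if sum(kept) / len(kept) < min_mean_quality:
        return 0  # mean quality too low: the whole read is not counted
    return len(kept)
```

Soft-clipped bases (S) advance the position in the read but are not counted, while D, N, H and P consume no read bases at all.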

`_readCoverage_1kb_windows.txt`, `_readCoverage_1kb_windows_coveragePlot_${chr}.png`.

Three-column TSV produced by genomeCoverage and streamed through filter_readbins.pl, such that only chromosomes of interest are kept. The columns are the chromosome, the 0-based start index of the window on the chromosome, and the number of reads covering the window. Window size is 1 kb.

The genomeCoverage tool only counts reads not marked as duplicate reads and -- in "countReads" mode -- having a mapping quality of at least 1.
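A small Python sketch for reading the window file is shown below. It assumes that the file has no header line and that the read count is an integer; the function names are invented for this example.

```python
import csv
from collections import defaultdict

WINDOW_SIZE = 1000  # 1 kb windows, as described above


def read_windows(handle):
    """Return (chromosome, start, end, count) tuples with half-open intervals."""
    windows = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        chromosome, start, count = row
        start = int(start)
        windows.append((chromosome, start, start + WINDOW_SIZE, int(count)))
    return windows


def mean_count_per_chromosome(windows):
    """Average the per-window read counts for each chromosome."""
    sums = defaultdict(lambda: [0, 0])
    for chromosome, _start, _end, count in windows:
        sums[chromosome][0] += count
        sums[chromosome][1] += 1
    return {name: total / n for name, (total, n) in sums.items()}
```

Because the start index is 0-based, a window starting at 0 covers positions 0 to 999, which the half-open interval `(0, 1000)` expresses directly.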

`.fp`

Only present if fingerprinting was turned on.

TBD

`_flagstats.txt`

Please refer to the documentation of `samtools flagstat` for details.

`_extendedFlagstats.txt`

This file is produced by flags_isizes_PEaberrations.pl.

Value	Description
total alignments	TBD
non-duplicate, non-secondary, non-supplementary reads	TBD
such with mapping quality >=1	TBD
such on regarded chromosomes	TBD
such with both reads on regarded chromosomes	TBD
such mapping on different chromosomes	TBD
proper pairs read 1	TBD

`_insertsizes.txt`

Files with the suffix _insertsizes.txt are produced by flags_isizes_PEaberrations.pl. Each is a TSV file with insert size (column 1) and count (column 2).

Using this file as input, the script insertsizePlot.R plots the _insertsize_plot.png and, for convenience, writes the estimated distribution parameters into the _insertsize_plot.png_qcValues.txt file in three rows:

median
standard deviation / median
standard deviation
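The three rows can be read back with a short Python sketch like the one below. It assumes that each row holds a single bare number, which is not confirmed here; the `InsertSizeQc` type and the function name are hypothetical.

```python
from typing import NamedTuple


class InsertSizeQc(NamedTuple):
    median: float
    sd_over_median: float
    sd: float


def parse_insertsize_qc_values(text):
    """Parse the three rows: median, standard deviation / median, standard deviation."""
    values = [float(line) for line in text.splitlines() if line.strip()]
    if len(values) != 3:
        raise ValueError(f"expected 3 rows, found {len(values)}")
    return InsertSizeQc(*values)
```

Using a named tuple keeps the row order explicit, so callers do not need to remember which line holds which parameter.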

DiffChroms

The _DiffChroms.txt file is produced by flags_isizes_PEaberrations.pl and is used by chrom_diff.r to produce the _DiffChroms.png and _DiffChroms.png_qcValues.txt files.

TBD

ACEseq QC

The files in the ./correctedGcBias/ directory (_windows.anno.txt, _sex.txt, _windows.filtered.txt.gz, _windows.filtered.corrected.txt.gz, _qc_gc_corrected.json, _qc_gc_corrected.slim.tsv, _gc_corrected.png) are only produced if runACEseqQc was set to true.

Please refer to the documentation of the ACEseq workflow for extensive information about the contents of these files.

`_commonBam_wroteQcSummary.txt`

Summary TSV file, produced by writeQCsummary.pl, with contents collected from the other files. The file only contains a single line for the overall (all chromosomes) statistics.

Column	Description
PID	patient ID; could also be a cell line, or whatever other information you encode here
SAMPLE_TYPE	usually tumor01, blood2, etc.
RUN_ID	Name of the `$run/sequence/` directory containing the FASTQs.
LANE	Lane identifier.
TOTAL_READ_COUNT (flagstat)	from flagstat
%TOTAL_READ_MAPPED_BWA (flagstat)	from flagstat
%properly_paired (flagstat)	from flagstat
%singletons (flagstat)	from flagstat
ALIGNED_READ_COUNT	TBD
%DUPLICATES (Picard metrics file)	from `.dupmark_metrics.txt` file
ESTIMATED_LIBRARY_SIZE (Picard metrics file)	from `.dupmark_metrics.txt` file
%PE_reads_on_diff_chromosomes (mapq>0)	TBD
%sd_PE_insertsize (mapq>0)	TBD
PE_insertsize (mapq>0)	TBD
coverage QC bases w/o N	from _DepthOfCoverage_Genome.txt
QC bases/ total bases w/o N	from _DepthOfCoverage_Genome.txt
coverage QC bases	from _DepthOfCoverage_Genome.txt
QC bases/ total bases	from _DepthOfCoverage_Genome.txt
mapq=0 read{1,2}	from _DepthOfCoverage_Genome.txt
mapq>0,readlength<minlength read{1,2}	from _DepthOfCoverage_Genome.txt
mapq>0,BaseQualityMedian<basequalCutoff read{1,2}	from _DepthOfCoverage_Genome.txt
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2}	from _DepthOfCoverage_Genome.txt
%incorrect PE orientation	from _DepthOfCoverage_Genome.txt
#incorrect proper pair	from _DepthOfCoverage_Genome.txt
#duplicates read{1,2} (excluded from coverage analysis)	from _DepthOfCoverage_Genome.txt
ChrX coverage QC bases	from _DepthOfCoverage_Genome.txt
ChrY coverage QC bases	from _DepthOfCoverage_Genome.txt

`_qualitycontrol.json`

Summary JSON file with contents collected from the other files. Produced by qcJson.pl. The JSON contains short entries per chromosome and a more extensive entry for the "all" chromosome summing all results per chromosome.

Variable	Description
chromosome	chromosome identifier
referenceLength	length of the chromosome
qcBasesMapped	See QC Bases from _DepthOfCoverage_Genome.txt.
coverageQcBases	`qcBasesMapped` / `referenceLength`
genomeWithoutNReferenceLength	the "#total not_N bases" in "#QC bases/#total not_N bases"; same as length not counting 'N's from `CHROM_SIZES_FILE`
genomeWithoutNQcBasesMapped	the "#QC bases" in "#QC bases/#total not_N bases"
genomeWithoutNCoverageQcBases	`genomeWithoutNQcBasesMapped` / `genomeWithoutNReferenceLength`
insertSizeMedian	from _insertsize_plot.png_qcValues.txt-file; line 1.
insertSizeSD	Standard deviation of the insert size distribution; from _insertsize_plot.png_qcValues.txt-file; line 3.
insertSizeCV	`insertSizeSD` / `insertSizeMedian`. From _insertsize_plot.png_qcValues.txt-file; line 2.
singletons	from flagstats
withItselfAndMateMapped	from flagstats
withMateMappedToDifferentChr	from flagstats
properlyPaired	from flagstats
pairedRead{1,2}	flagstats read 1, 2
qcFailedReads	flagstats qc-failed
pairedInSequencing	from flagstats
totalReadCounter	from flagstats total
duplicates	from flagstats duplicates
totalMappedReadCounter	from flagstats mapped
percentageMatesOnDifferentChr	from flagstats
withMateMappedToDifferentChrMaq	from flagstats mapq >= 5
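The two derived coverage values in the table are simple ratios, which the following Python sketch recomputes for one entry. The exact nesting of the JSON file is not described here, so the sketch takes a single chromosome entry as a plain dictionary; the function name and the example values are invented.

```python
import json


def recompute_coverage(entry):
    """Recompute the derived coverage values of one chromosome entry."""
    return {
        "coverageQcBases":
            entry["qcBasesMapped"] / entry["referenceLength"],
        "genomeWithoutNCoverageQcBases":
            entry["genomeWithoutNQcBasesMapped"]
            / entry["genomeWithoutNReferenceLength"],
    }


# Invented numbers, only to show the arithmetic.
example_entry = json.loads("""
{
  "chromosome": "all",
  "referenceLength": 1000,
  "qcBasesMapped": 30000,
  "genomeWithoutNReferenceLength": 900,
  "genomeWithoutNQcBasesMapped": 27000
}
""")
derived = recompute_coverage(example_entry)
```

Comparing such recomputed values with the stored ones is a cheap sanity check when post-processing the file.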
