A modular tool to aggregate results from bioinformatics analyses across many samples into a single report.

Report generated on 2023-06-22, 12:59 based on data in:
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/picard-stats
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/B/qc/fastqc-raw
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/B/qc/fastq_screen-trimmed
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/project_level_qc/somalier/ancestry
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/A/qc/fastqc-raw
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/verifyBamID
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/A/qc/fastqc-trimmed
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/verifyBamID
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/samtools-stats
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/samtools-stats
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/A/qc/dedup
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/project_level_qc/somalier/relatedness
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/bcftools-stats
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/quac_watch
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/B/qc/dedup
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/project_level_qc/multiqc/configs
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/qualimap/B
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/quac_watch
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/B/qc/fastqc-trimmed
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/qualimap/A
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/bcftools-stats
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/A/qc/fastq_screen-trimmed
- /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/picard-stats

General Statistics
Showing 14/18 rows and 21/51 columns.

Sample Name | % GC | Ins. size | ≥ 15X | ≥ 30X | ≥ 40X | Median cov | Mean cov | % Aligned | M Reads | % Aligned | % Dups | Error rate | % Proper Pairs | Vars | Hom | Het | Ts/Tv | % Dups | % GC | M Seqs | Contamination (S) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 46% | 383 | 0.0% | 0.0% | 0.0% | 0.0X | 0.0X | 100.0% | 0.0 | 100% | 0.52% | 98.7% | 6863 | 1783 | 3878 | 2.23 | 46.296% | ||||
A-1 | 3.2% | ||||||||||||||||||||
A-1-R1 | 7.5% | 47% | 0.0 | ||||||||||||||||||
A-1-R2 | 6.2% | 47% | 0.0 | ||||||||||||||||||
A-2 | 2.7% | ||||||||||||||||||||
A-2-R1 | 6.5% | 47% | 0.0 | ||||||||||||||||||
A-2-R2 | 5.4% | 47% | 0.0 | ||||||||||||||||||
B | 46% | 383 | 0.0% | 0.0% | 0.0% | 0.0X | 0.0X | 100.0% | 0.0 | 100% | 0.52% | 98.7% | 6863 | 1783 | 3878 | 2.23 | 46.296% | ||||
B-1 | 3.2% | ||||||||||||||||||||
B-1-R1 | 7.5% | 47% | 0.0 | ||||||||||||||||||
B-1-R2 | 6.2% | 47% | 0.0 | ||||||||||||||||||
B-2 | 2.7% | ||||||||||||||||||||
B-2-R1 | 6.5% | 47% | 0.0 | ||||||||||||||||||
B-2-R2 | 5.4% | 47% | 0.0 |
QuaC-Watch
This section contains QuaC-Watch results. QuaC-Watch summarizes whether samples have passed the QC thresholds.

Overall QuaC-Watch Summary

Overall QuaC-Watch summary of results from several QC tools.

Sample Name | fastqc | qualimap_overall | qualimap_chromosome_specific | picard | picard_dups | bcftools_stats | variant_per_contig | verifybamid | fastq_screen |
---|---|---|---|---|---|---|---|---|---|
B | fail | fail | fail | fail | pass | fail | fail | fail | fail |
A | fail | fail | fail | fail | pass | fail | fail | fail | fail |

FastQC (trimmed)

Quick summary of FastQC (trimmed) results. See FastQC section below for detailed results.

Sample Name | per_base_sequence_quality | per_tile_sequence_quality | per_sequence_quality_scores | per_base_sequence_content | per_sequence_gc_content | per_base_n_content | sequence_length_distribution | sequence_duplication_levels | overrepresented_sequences | adapter_content |
---|---|---|---|---|---|---|---|---|---|---|
B-1-R1 | pass | fail | pass | fail | fail | pass | warn | pass | fail | pass |
B-1-R2 | pass | fail | pass | fail | warn | warn | warn | pass | warn | pass |
B-2-R1 | pass | fail | pass | fail | fail | pass | warn | pass | fail | pass |
B-2-R2 | pass | fail | pass | fail | warn | warn | warn | pass | warn | pass |
A-1-R1 | pass | fail | pass | fail | fail | pass | warn | pass | fail | pass |
A-1-R2 | pass | fail | pass | fail | warn | warn | warn | pass | warn | pass |
A-2-R1 | pass | fail | pass | fail | fail | pass | warn | pass | fail | pass |
A-2-R2 | pass | fail | pass | fail | warn | warn | warn | pass | warn | pass |

Qualimap - Overall stats

Quick summary of Qualimap results. See QualiMap section below for detailed results.

Sample Name | avg_gc | percentage_aligned | mean_coverage | median_coverage | mean_cov:median_cov | median_insert_size | general_error_rate |
---|---|---|---|---|---|---|---|
B | fail | pass | fail | fail | fail | pass | pass |
A | fail | pass | fail | fail | fail | pass | pass |
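
One way to read the relationship between a per-tool table like the one above and the corresponding column of the overall summary is a simple roll-up, sketched here in Python. The rule used (a tool-level status is "pass" only when every metric passes) is an assumption that happens to agree with the rows shown in this report; it is not taken from the QuaC-Watch source, and the function name `roll_up` is hypothetical.

```python
def roll_up(statuses):
    """Collapse per-metric statuses into one status.

    Assumed rule (not confirmed by QuaC-Watch): "pass" only if every
    metric is "pass", otherwise "fail".
    """
    return "pass" if all(status == "pass" for status in statuses) else "fail"


# Row for sample B, copied from the Qualimap overall stats table above.
qualimap_overall_b = {
    "avg_gc": "fail",
    "percentage_aligned": "pass",
    "mean_coverage": "fail",
    "median_coverage": "fail",
    "mean_cov:median_cov": "fail",
    "median_insert_size": "pass",
    "general_error_rate": "pass",
}

print(roll_up(qualimap_overall_b.values()))
```

Under this assumed rule, a single failing metric is enough to mark the whole tool as failed for that sample, which is why it is worth opening the per-tool table before acting on the overall summary.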

Qualimap - Chromosome stats

Quick summary of chromosome-level coverage info using Qualimap results. See QualiMap section below for detailed results.

Picard

Quick summary of Picard metrics. Note: Picard duplication metrics are reported separately. See Picard section below for detailed results.

Sample Name | PCT_PF_READS_ALIGNED | PF_HQ_ALIGNED_Q20_BASES | PCT_ADAPTER | PCT_CHIMERAS | Q30_BASES | perc_Q30_BASES | PCT_EXC_TOTAL | PCT_15X |
---|---|---|---|---|---|---|---|---|
B | pass | fail | pass | fail | fail | fail | pass | fail |
A | pass | fail | pass | fail | fail | fail | pass | fail |

Picard-dups

Quick summary of Picard duplication metrics. See Picard section below for detailed results.

Sample Name | PERCENT_DUPLICATION |
---|---|
B-1 | pass |
B-2 | pass |
A-1 | pass |
A-2 | pass |

Bcftools stats

Quick summary of Bcftools-stats results. See Bcftools section below for detailed results.

Sample Name | number_of_records | number_of_SNPs | number_of_indels | perc_snps | perc_indels | tstv | heterozygosity_ratio |
---|---|---|---|---|---|---|---|
B | fail | fail | fail | pass | pass | fail | pass |
A | fail | fail | fail | pass | pass | fail | pass |

Variant frequency per contig

Quick summary of the percentage of variants per contig.

Sample Name | chr1 | chr2 | chr3 | chr4 | chr5 | chr6 | chr7 | chr8 | chr9 | chr10 | chr11 | chr12 | chr13 | chr14 | chr15 | chr16 | chr17 | chr18 | chr19 | chr20 | chr21 | chr22 | chrX | chrY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
B | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail |
A | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail |

VerifyBAMID

Quick summary of VerifyBAMID results. See VerifyBAMID section below for detailed results.

Sample Name | Contamination(%) |
---|---|
B | fail |
A | fail |

FastQ Screen (trimmed)

Quick summary of FastQ Screen (trimmed) results. See FastQ Screen section below for detailed results.

Sample Name | %Human | %Mouse | %Rat | %No hits | %Drosophila | %Worm | %Yeast | %Arabidopsis | %Ecoli | %rRNA | %MT | %PhiX | %Lambda | %Vectors | %Adapters |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
B-1-R1_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
B-1-R2_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
B-2-R1_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
B-2-R2_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
A-1-R1_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
A-1-R2_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
A-2-R1_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
A-2-R2_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |

QualiMap

QualiMap is a platform-independent application to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.

Coverage histogram

Distribution of the number of locations in the reference genome with a given depth of coverage.

For a set of DNA or RNA reads mapped to a reference sequence, such as a genome or transcriptome, the depth of coverage at a given base position is the number of high-quality reads that map to the reference at that position (Sims et al. 2014).

Bases of a reference sequence (y-axis) are grouped by their depth of coverage (0×, 1×, …, N×) (x-axis). This plot shows the frequency of coverage depths relative to the reference sequence for each read dataset, which provides an indirect measure of the level and variation of coverage depth in the corresponding sequenced sample.

If reads are randomly distributed across the reference sequence, this plot should resemble a Poisson distribution (Lander & Waterman 1988), with a peak indicating approximate depth of coverage, and more uniform coverage depth being reflected in a narrower spread. The optimal level of coverage depth depends on the aims of the experiment, though it should at minimum be sufficiently high to adequately address the biological question; greater uniformity of coverage is generally desirable, because it increases breadth of coverage for a given depth of coverage, allowing equivalent results to be achieved at a lower sequencing depth (Sampson et al. 2011; Sims et al. 2014). However, it is difficult to achieve uniform coverage depth in practice, due to biases introduced during sample preparation (van Dijk et al. 2014), sequencing (Ross et al. 2013) and read mapping (Sims et al. 2014).

This plot may include a small peak for regions of the reference sequence with zero depth of coverage. Such regions may be absent from the given sample (due to a deletion or structural rearrangement), present in the sample but not successfully sequenced (due to bias in sequencing or preparation), or sequenced but not successfully mapped to the reference (due to the choice of mapping algorithm, the presence of repeat sequences, or mismatches caused by variants or sequencing errors). Related factors cause most datasets to contain some unmapped reads (Sims et al. 2014).
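
To make the Poisson expectation concrete, the sketch below computes the fraction of reference bases expected at each depth when reads land at random. It is an idealised model only, not how QualiMap produces the plot, and the mean depth of 30 is a made-up value chosen for illustration.

```python
import math


def poisson_depth_fraction(depth, mean_coverage):
    """Expected fraction of reference bases at exactly `depth` under a Poisson model."""
    return math.exp(-mean_coverage) * mean_coverage ** depth / math.factorial(depth)


mean_coverage = 30.0  # hypothetical mean depth, for illustration only
depths = range(101)
fractions = [poisson_depth_fraction(d, mean_coverage) for d in depths]

# The peak of the idealised histogram sits at (or next to) the mean depth.
peak_depth = max(depths, key=lambda d: fractions[d])

# Fraction of bases expected to have no coverage at all: exp(-mean_coverage).
zero_fraction = fractions[0]
```

A real coverage histogram that is much wider than this curve, or that has a zero-depth peak far above `zero_fraction`, points to the preparation, sequencing, or mapping biases described above.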

Cumulative genome coverage

Percentage of the reference genome with at least the given depth of coverage.

For a set of DNA or RNA reads mapped to a reference sequence, such as a genome or transcriptome, the depth of coverage at a given base position is the number of high-quality reads that map to the reference at that position, while the breadth of coverage is the fraction of the reference sequence to which reads have been mapped with at least a given depth of coverage (Sims et al. 2014).

Defining coverage breadth in terms of coverage depth is useful, because sequencing experiments typically require a specific minimum depth of coverage over the region of interest (Sims et al. 2014), so the extent of the reference sequence that is amenable to analysis is constrained to lie within regions that have sufficient depth. With inadequate sequencing breadth, it can be difficult to distinguish the absence of a biological feature (such as a gene) from a lack of data (Green 2007).

For increasing coverage depths (1×, 2×, …, N×), QualiMap calculates coverage breadth as the percentage of the reference sequence that is covered by at least that number of reads, then plots coverage breadth (y-axis) against coverage depth (x-axis). This plot shows the relationship between sequencing depth and breadth for each read dataset, which can be used to gauge, for example, the likely effect of a minimum depth filter on the fraction of a genome available for analysis.
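
The breadth calculation described above can be sketched from a per-depth histogram as follows. This is a minimal illustration, not QualiMap's implementation; the function name and the histogram counts are hypothetical.

```python
def coverage_breadth(depth_counts):
    """Return {depth: % of bases covered at least `depth` times} for depth >= 1.

    depth_counts[d] is the number of reference bases with depth exactly d.
    """
    total = sum(depth_counts)
    breadth = {}
    covered = total
    for depth in range(1, len(depth_counts)):
        covered -= depth_counts[depth - 1]  # drop bases whose depth is below `depth`
        breadth[depth] = 100.0 * covered / total
    return breadth


# Made-up counts of bases at depths 0, 1, 2, 3 and 4.
histogram = [10, 20, 40, 20, 10]
breadth = coverage_breadth(histogram)
```

With these counts, 90% of bases have depth of at least 1× but only 30% reach 3×, which is the kind of drop-off a minimum depth filter would impose.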

Insert size histogram

Distribution of estimated insert sizes of mapped reads.

To overcome limitations in the length of DNA or RNA sequencing reads, many sequencing instruments can produce two or more shorter reads from one longer fragment in which the relative position of reads is approximately known, such as paired-end or mate-pair reads (Mardis 2013). Such techniques can extend the reach of sequencing technology, allowing for more accurate placement of reads (Reinert et al. 2015) and better resolution of repeat regions (Reinert et al. 2015), as well as detection of structural variation (Alkan et al. 2011) and chimeric transcripts (Maher et al. 2009).

All these methods assume that the approximate size of an insert is known. (Insert size can be defined as the length in bases of a sequenced DNA or RNA fragment, excluding technical sequences such as adapters, which are typically removed before alignment.) This plot allows for that assumption to be assessed. With the set of mapped fragments for a given sample, QualiMap groups the fragments by insert size, then plots the frequency of mapped fragments (y-axis) over a range of insert sizes (x-axis). In an ideal case, the distribution of fragment sizes for a sequencing library would culminate in a single peak indicating average insert size, with a narrow spread indicating highly consistent fragment lengths.

QualiMap calculates insert sizes as follows: for each fragment in which every read mapped successfully to the same reference sequence, it extracts the insert size from the TLEN field of the leftmost read (see the Qualimap 2 documentation), where the TLEN (or 'observed Template LENgth') field contains 'the number of bases from the leftmost mapped base to the rightmost mapped base' (SAM format specification). Note that because it is defined in terms of alignment to a reference sequence, the value of the TLEN field may differ from the insert size due to factors such as alignment clipping, alignment errors, or structural variation or splicing in a gap between reads from the same fragment.
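
As a rough illustration of the TLEN-based approach, the sketch below tallies insert sizes from plain-text SAM records. It relies on the SAM convention that the leftmost segment of a template carries a positive TLEN; it is a simplification that ignores flags and is not QualiMap's own code. The function name and the example records are made up.

```python
from collections import Counter


def insert_sizes(sam_lines):
    """Tally TLEN from the leftmost read of each fragment in SAM-formatted text."""
    sizes = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        rname, rnext, tlen = fields[2], fields[6], int(fields[8])
        same_reference = rnext == "=" or rnext == rname
        # A positive TLEN marks the leftmost segment, so each fragment is counted once.
        if same_reference and tlen > 0:
            sizes[tlen] += 1
    return sizes


# Hypothetical records: two properly paired fragments on chr1.
records = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r1\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "r2\t99\tchr1\t500\t60\t50M\t=\t750\t300\t*\t*",
    "r2\t147\tchr1\t750\t60\t50M\t=\t500\t-300\t*\t*",
]
sizes = insert_sizes(records)
```

Plotting the resulting counts against insert size gives the same kind of histogram as the plot in this section.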

GC content distribution

Each solid line represents the distribution of GC content of mapped reads for a given sample. The dotted line represents a pre-calculated GC distribution for the reference genome.

GC bias is the difference between the guanine-cytosine content (GC-content) of a set of sequencing reads and the GC-content of the DNA or RNA in the original sample. It is a well-known issue with sequencing systems, and may be introduced by PCR amplification, among other factors (Benjamini & Speed 2012; Ross et al. 2013).

QualiMap calculates the GC-content of individual mapped reads, then groups those reads by their GC-content (1%, 2%, …, 100%), and plots the frequency of mapped reads (y-axis) at each level of GC-content (x-axis). This plot shows the GC-content distribution of mapped reads for each read dataset, which should ideally resemble that of the original sample. It can be useful to display the GC-content distribution of an appropriate reference sequence for comparison, and QualiMap has an option to do this (see the Qualimap 2 documentation).
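
The per-read grouping described above can be sketched as below. The rounding to whole percentages and the treatment of N bases (counted in the read length) are assumptions made for this illustration; QualiMap's exact binning is not specified here, and the function names are hypothetical.

```python
from collections import Counter


def gc_percent(read):
    """GC-content of one read, rounded to a whole percentage."""
    gc = sum(base in "GCgc" for base in read)
    return round(100 * gc / len(read))


def gc_distribution(reads):
    """Number of reads at each GC-content percentage (empty reads are skipped)."""
    return Counter(gc_percent(read) for read in reads if read)


# Made-up reads for illustration.
distribution = gc_distribution(["ACGT", "GGCC", "AACG"])
```

Dividing each count by the number of reads gives the frequencies that would be plotted on the y-axis.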

Picard

Picard is a set of Java command line tools for manipulating high-throughput sequencing data.

Alignment Summary

Please note that Picard's read counts are divided by two for paired-end data.

Mark Duplicates

Number of reads, categorised by duplication state. Pair counts are doubled; see help text for details.

The table in the Picard metrics file contains some columns referring to read pairs and some referring to single reads.

To make the numbers in this plot sum correctly, values referring to pairs are doubled according to the scheme below:

- READS_IN_DUPLICATE_PAIRS = 2 * READ_PAIR_DUPLICATES
- READS_IN_UNIQUE_PAIRS = 2 * (READ_PAIRS_EXAMINED - READ_PAIR_DUPLICATES)
- READS_IN_UNIQUE_UNPAIRED = UNPAIRED_READS_EXAMINED - UNPAIRED_READ_DUPLICATES
- READS_IN_DUPLICATE_PAIRS_OPTICAL = 2 * READ_PAIR_OPTICAL_DUPLICATES
- READS_IN_DUPLICATE_PAIRS_NONOPTICAL = READS_IN_DUPLICATE_PAIRS - READS_IN_DUPLICATE_PAIRS_OPTICAL
- READS_IN_DUPLICATE_UNPAIRED = UNPAIRED_READ_DUPLICATES
- READS_UNMAPPED = UNMAPPED_READS

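
The doubling scheme above can be written out in a few lines of Python. Only the field names and formulas come from the list above; the function name and the metric values in the example are hypothetical.

```python
def duplication_read_counts(metrics):
    """Convert Picard MarkDuplicates pair/read counts into per-read categories."""
    duplicate_pairs = 2 * metrics["READ_PAIR_DUPLICATES"]
    optical = 2 * metrics["READ_PAIR_OPTICAL_DUPLICATES"]
    return {
        "READS_IN_DUPLICATE_PAIRS": duplicate_pairs,
        "READS_IN_UNIQUE_PAIRS": 2
        * (metrics["READ_PAIRS_EXAMINED"] - metrics["READ_PAIR_DUPLICATES"]),
        "READS_IN_UNIQUE_UNPAIRED": metrics["UNPAIRED_READS_EXAMINED"]
        - metrics["UNPAIRED_READ_DUPLICATES"],
        "READS_IN_DUPLICATE_PAIRS_OPTICAL": optical,
        "READS_IN_DUPLICATE_PAIRS_NONOPTICAL": duplicate_pairs - optical,
        "READS_IN_DUPLICATE_UNPAIRED": metrics["UNPAIRED_READ_DUPLICATES"],
        "READS_UNMAPPED": metrics["UNMAPPED_READS"],
    }


# Made-up metrics for illustration only.
example_metrics = {
    "READ_PAIRS_EXAMINED": 1000,
    "READ_PAIR_DUPLICATES": 100,
    "READ_PAIR_OPTICAL_DUPLICATES": 10,
    "UNPAIRED_READS_EXAMINED": 50,
    "UNPAIRED_READ_DUPLICATES": 5,
    "UNMAPPED_READS": 20,
}
counts = duplication_read_counts(example_metrics)
```

Leaving out READS_IN_DUPLICATE_PAIRS (which is split into its optical and non-optical parts), the remaining categories add up to 2 * READ_PAIRS_EXAMINED + UNPAIRED_READS_EXAMINED + UNMAPPED_READS, which is the sense in which the doubled numbers "sum correctly".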

WGS Coverage

The number of bases in the genome territory for each fold coverage. Note that the final 1% of data is hidden to prevent very long tails.

WGS Filtered Bases

For more information about the filtered categories, see the Picard documentation.

Samtools

Samtools is a suite of programs for interacting with high-throughput sequencing data.

Percent Mapped

Alignment metrics from samtools stats; mapped vs. unmapped reads.

For a set of samples that have come from the same multiplexed library, similar numbers of reads for each sample are expected. Large differences in numbers might indicate issues during the library preparation process. Whilst large differences in read numbers may be controlled for in downstream processing (e.g. read count normalisation), you may wish to consider whether the read depths achieved have fallen below recommended levels depending on the applications.

Low alignment rates could indicate contamination of samples (e.g. adapter sequences), low sequencing quality or other artefacts. These can be further investigated in the sequence level QC (e.g. from FastQC).

Alignment metrics

This module parses the output from samtools stats. All numbers in millions.

Bcftools

Bcftools contains utilities for variant calling and manipulating VCFs and BCFs.

Variant Substitution Types

Variant Quality

Indel Distribution

Variant depths

Read depth support distribution for called variants.

FastQC (trimmed)
This section of the report shows FastQC results after adapter trimming.

Sequence Counts

Sequence counts for each sample. Duplicate read counts are an estimate only.

This plot shows the total number of reads, broken down into unique and duplicate if possible (only more recent versions of FastQC give duplicate info).

You can read more about duplicate calculation in the FastQC documentation. A small part has been copied here for convenience:

Only sequences which first appear in the first 100,000 sequences in each file are analysed. This should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level.

The duplication detection requires an exact sequence match over the whole length of the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.
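
The tracking and truncation rules quoted above can be sketched as follows. This is a simplified reading of the description, not FastQC's code: FastQC applies further corrections when estimating the overall duplication level, and the function names and example reads here are hypothetical.

```python
from collections import Counter


def duplicate_counts(reads, track_limit=100_000, max_len=75, trunc_len=50):
    """Count occurrences of sequences first seen within the first `track_limit` reads."""
    counts = Counter()
    for index, read in enumerate(reads):
        # Reads longer than 75 bp are compared on their first 50 bp only.
        key = read[:trunc_len] if len(read) > max_len else read
        if key in counts:
            counts[key] += 1  # already tracked: keep counting to the end of the file
        elif index < track_limit:
            counts[key] = 1  # new sequences are only admitted early in the file
    return counts


def duplicate_percent(counts):
    """Share of tracked reads that repeat an earlier sequence (simplified estimate)."""
    total = sum(counts.values())
    return 100.0 * (total - len(counts)) / total if total else 0.0


# Made-up reads: the first three are 80 bp long and share their first 50 bp.
example_reads = [
    "ACGT" * 20,
    "ACGT" * 20,
    "ACGT" * 12 + "AC" + "T" * 30,
    "TTTT" * 10,
]
example_counts = duplicate_counts(example_reads)
```

The third read differs from the first two only after position 50, so truncation makes it count as a duplicate. This is why long reads with errors towards the 3' end are still grouped together.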

Sequence Quality Histograms

The mean quality value across each base position in the read.

To enable multiple samples to be plotted on the same graph, only the mean quality scores are plotted (unlike the box plots seen in FastQC reports).

Taken from the FastQC help:

The y-axis on the graph shows the quality scores. The higher the score, the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.

Per Sequence Quality Scores

The number of reads with average quality scores. Shows if a subset of reads has poor quality.

From the FastQC help:

The per sequence quality score report allows you to see if a subset of your sequences have universally low quality values. It is often the case that a subset of sequences will have universally poor quality, however these should represent only a small percentage of the total sequences.

Per Base Sequence Content

The proportion of each base position for which each of the four normal DNA bases has been called.

To enable multiple samples to be shown in a single plot, the base composition data is shown as a heatmap. The colours represent the balance between the four bases: an even distribution should give an even muddy brown colour. Hover over the plot to see the percentage of the four bases under the cursor.

To see the data as a line plot, as in the original FastQC graph, click on a sample track.

From the FastQC help:

Per Base Sequence Content plots out the proportion of each base position in a file for which each of the four normal DNA bases has been called.

In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other. The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other.

It's worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichment of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis.

Per Sequence GC Content

The average GC content of reads. A normal random library typically has a roughly normal distribution of GC content.

From the FastQC help:

This module measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content.

In a normal random library you would expect to see a roughly normal distribution of GC content where the central peak corresponds to the overall GC content of the underlying genome. Since we don't know the GC content of the genome the modal GC content is calculated from the observed data and used to build a reference distribution.

An unusually shaped distribution could indicate a contaminated library or some other kinds of biased subset. A normal distribution which is shifted indicates some systematic bias which is independent of base position. If there is a systematic bias which creates a shifted normal distribution then this won't be flagged as an error by the module since it doesn't know what your genome's GC content should be.

Per Base N Content

The percentage of base calls at each position for which an N was called.

From the FastQC help:

If a sequencer is unable to make a base call with sufficient confidence then it will normally substitute an N rather than a conventional base call. This graph shows the percentage of base calls at each position for which an N was called.

It's not unusual to see a very low proportion of Ns appearing in a sequence, especially nearer the end of a sequence. However, if this proportion rises above a few percent it suggests that the analysis pipeline was unable to interpret the data well enough to make valid base calls.
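
The quantity plotted here is simple to reproduce. The sketch below computes, for each read position, the percentage of base calls that are N; the function name and the example reads are made up for illustration.

```python
def n_percent_by_position(reads):
    """Percentage of N calls at each position across a list of reads."""
    longest = max((len(read) for read in reads), default=0)
    n_calls = [0] * longest
    totals = [0] * longest
    for read in reads:
        for position, base in enumerate(read):
            totals[position] += 1  # reads shorter than `position` do not contribute
            if base in "Nn":
                n_calls[position] += 1
    return [100.0 * n / total for n, total in zip(n_calls, totals)]


# Made-up reads of unequal length.
n_profile = n_percent_by_position(["ACGN", "ANGT", "AC"])
```

Because shorter reads drop out of the denominator at later positions, a handful of Ns near the 3' end can produce a visibly higher percentage there than the same number of Ns near the start.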

Sequence Length Distribution

The distribution of fragment sizes (read lengths) found. See the FastQC help.

Sequence Duplication Levels

The relative level of duplication found for every sequence.

From the FastQC help:

In a diverse library most sequences will occur only once in the final set. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (e.g. PCR over-amplification). This graph shows the degree of duplication for every sequence in a library: the relative number of sequences with different degrees of duplication.

Only sequences which first appear in the first 100,000 sequences in each file are analysed. This should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level.

The duplication detection requires an exact sequence match over the whole length of the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.

In a properly diverse library most sequences should fall into the far left of the plot in both the red and blue lines. A general level of enrichment, indicating broad oversequencing in the library, will tend to flatten the lines, lowering the low end and generally raising other categories. More specific enrichments of subsets, or the presence of low complexity contaminants, will tend to produce spikes towards the right of the plot.

Overrepresented sequences

The total amount of overrepresented sequences found in each library.

FastQC calculates and lists overrepresented sequences in FastQ files. It would not be possible to show this for all samples in a MultiQC report, so instead this plot shows the number of sequences categorized as overrepresented.

Sometimes, a single sequence may account for a large number of reads in a dataset. To show this, the bars are split into two: the first shows the overrepresented reads that come from the single most common sequence. The second shows the total count from all remaining overrepresented sequences.

From the FastQC help:

A normal high-throughput library will contain a diverse set of sequences, with no individual sequence making up more than a tiny fraction of the whole. Finding that a single sequence is very overrepresented in the set either means that it is highly biologically significant, or indicates that the library is contaminated, or not as diverse as you expected.

FastQC lists all of the sequences which make up more than 0.1% of the total. To conserve memory only sequences which appear in the first 100,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.

Adapter Content

The cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position.

Note that only samples with ≥ 0.1% adapter contamination are shown.

There may be several lines per sample, as one is shown for each adapter detected in the file.

From the FastQC help:

The plot shows a cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position. Once a sequence has been seen in a read it is counted as being present right through to the end of the read so the percentages you see will only increase as the read length goes on.
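
The cumulative counting rule can be illustrated with a short sketch. It searches for a single full adapter string by exact match, which is a simplification and not necessarily how FastQC matches adapters internally; the function name, the adapter string, and the reads are made up.

```python
def cumulative_adapter_percent(reads, adapter):
    """Percentage of reads in which `adapter` has been seen at or before each position."""
    longest = max((len(read) for read in reads), default=0)
    seen_by_position = [0] * longest
    for read in reads:
        start = read.find(adapter)
        if start != -1:
            # Once seen, the adapter counts as present through to the end of the read.
            for position in range(start, len(read)):
                seen_by_position[position] += 1
    return [100.0 * seen / len(reads) for seen in seen_by_position]


adapter = "AGATCGG"  # hypothetical adapter fragment
reads = [
    "ACGTAGATCGGAAT",  # adapter starts at position 4
    "AGATCGGACGTACG",  # adapter starts at position 0
    "ACGTACGTACGTAC",  # no adapter
    "ACGTACGTAGATCG",  # truncated adapter only, not counted in this sketch
]
profile = cumulative_adapter_percent(reads, adapter)
```

In this example the profile sits at 25% for the first four positions and rises to 50% from position 4 onwards, and it never decreases along the read, matching the behaviour described above for reads of equal length.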

Status Checks

Status for each FastQC section showing whether results seem entirely normal (green), slightly abnormal (orange) or very unusual (red).

FastQC assigns a status for each section of the report. These give a quick evaluation of whether the results of the analysis seem entirely normal (green), slightly abnormal (orange) or very unusual (red).

It is important to stress that although the analysis results appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library. A 'normal' sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries which are biased in particular ways. You should therefore treat the summary evaluations as pointers to where you should concentrate your attention and understand why your library may not look random and diverse.

Specific guidance on how to interpret the output of each module can be found in the relevant report section, or in the FastQC help.

This heatmap summarises all of these statuses for a quick overview. Note that not all FastQC sections have plots in MultiQC reports, but all status checks are shown in this heatmap.

FastQC (raw)

FastQC (raw) is a quality control tool for high throughput sequence data, written by Simon Andrews at the Babraham Institute in Cambridge.

Sequence Counts

Sequence counts for each sample. Duplicate read counts are an estimate only.

This plot shows the total number of reads, broken down into unique and duplicate if possible (only more recent versions of FastQC give duplicate info).

You can read more about duplicate calculation in the FastQC documentation. A small part has been copied here for convenience:

Only sequences which first appear in the first 100,000 sequences in each file are analysed. This should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level.

The duplication detection requires an exact sequence match over the whole length of the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.

Sequence Quality Histograms

The mean quality value across each base position in the read.

To enable multiple samples to be plotted on the same graph, only the mean quality scores are plotted (unlike the box plots seen in FastQC reports).

Taken from the FastQC help:

The y-axis on the graph shows the quality scores. The higher the score, the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.

Per Sequence Quality Scores

The number of reads with average quality scores. Shows if a subset of reads has poor quality.

From the FastQC help:

The per sequence quality score report allows you to see if a subset of your sequences have universally low quality values. It is often the case that a subset of sequences will have universally poor quality, however these should represent only a small percentage of the total sequences.

Per Base Sequence Content

+ Per Base Sequence Content + + + +
+ +The proportion of each base position for which each of the four normal DNA bases has been called.
To enable multiple samples to be shown in a single plot, the base composition data +is shown as a heatmap. The colours represent the balance between the four bases: +an even distribution should give an even muddy brown colour. Hover over the plot +to see the percentage of the four bases under the cursor.
+To see the data as a line plot, as in the original FastQC graph, click on a sample track.
+From the FastQC help:
+Per Base Sequence Content plots out the proportion of each base position in a +file for which each of the four normal DNA bases has been called.
+In a random library you would expect that there would be little to no difference +between the different bases of a sequence run, so the lines in this plot should +run parallel with each other. The relative amount of each base should reflect +the overall amount of these bases in your genome, but in any case they should +not be hugely imbalanced from each other.
+It's worth noting that some types of library will always produce biased sequence +composition, normally at the start of the read. Libraries produced by priming +using random hexamers (including nearly all RNA-Seq libraries) and those which +were fragmented using transposases inherit an intrinsic bias in the positions +at which reads start. This bias does not concern an absolute sequence, but instead +provides enrichment of a number of different K-mers at the 5' end of the reads. +Whilst this is a true technical bias, it isn't something which can be corrected +by trimming and in most cases doesn't seem to adversely affect the downstream +analysis.
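As a rough illustration of what this module measures, the per-position base proportions can be computed as below. This is a minimal sketch with a hypothetical function name (`per_base_content`); it assumes reads are plain strings and that `N` calls are left out of the denominator, which may differ from FastQC's own handling.

```python
from collections import Counter
from typing import Dict, List


def per_base_content(reads: List[str]) -> List[Dict[str, float]]:
    """Return, for each read position, the percentage of A, C, G and T calls."""
    longest = max((len(read) for read in reads), default=0)
    rows = []
    for pos in range(longest):
        # Count the base calls of every read that is long enough to reach `pos`.
        calls = Counter(read[pos] for read in reads if pos < len(read))
        called = sum(calls[base] for base in "ACGT")
        rows.append(
            {base: (100.0 * calls[base] / called if called else 0.0) for base in "ACGT"}
        )
    return rows
```

In a random library the four percentages at each position should stay close to each other along the read; a start-of-read bias shows up as rows near position 0 that differ from the rest.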
+ ++
+ Per Sequence GC Content + + + +
+ +The average GC content of reads. Normal random libraries typically have a + roughly normal distribution of GC content.
From the FastQC help:
+This module measures the GC content across the whole length of each sequence +in a file and compares it to a modelled normal distribution of GC content.
+In a normal random library you would expect to see a roughly normal distribution +of GC content where the central peak corresponds to the overall GC content of +the underlying genome. Since we don't know the GC content of the genome the +modal GC content is calculated from the observed data and used to build a +reference distribution.
+An unusually shaped distribution could indicate a contaminated library or +some other kinds of biased subset. A normal distribution which is shifted +indicates some systematic bias which is independent of base position. If there +is a systematic bias which creates a shifted normal distribution then this won't +be flagged as an error by the module since it doesn't know what your genome's +GC content should be.
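The first half of that description (per-read GC content and the modal value taken from the observed data) can be sketched as follows. The helper names are hypothetical, and the sketch stops short of fitting the reference normal distribution that FastQC compares against.

```python
from collections import Counter
from typing import Iterable, Optional


def gc_percent(read: str) -> int:
    """GC content of one read, as a whole-number percentage."""
    gc = sum(1 for base in read if base in "GC")
    return round(100 * gc / len(read)) if read else 0


def gc_distribution(reads: Iterable[str]) -> Counter:
    """Number of reads observed at each GC percentage."""
    return Counter(gc_percent(read) for read in reads)


def modal_gc(reads: Iterable[str]) -> Optional[int]:
    """Most frequent GC percentage, used as the centre of the reference curve."""
    dist = gc_distribution(reads)
    return max(dist, key=dist.get) if dist else None
```

Because the centre of the reference curve comes from the data itself, a uniformly shifted distribution still looks normal to this kind of check, which is the limitation the help text describes.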
+
+ Per Base N Content + + + +
+ +The percentage of base calls at each position for which an N was called.
From the FastQC help:
+If a sequencer is unable to make a base call with sufficient confidence then it will +normally substitute an N rather than a conventional base call. This graph shows the +percentage of base calls at each position for which an N was called.
It's not unusual to see a very low proportion of Ns appearing in a sequence, especially +nearer the end of a sequence. However, if this proportion rises above a few percent +it suggests that the analysis pipeline was unable to interpret the data well enough to +make valid base calls.
+
+ Sequence Length Distribution + +
+ +The distribution of fragment sizes (read lengths) found. + See the FastQC help
+
+ Sequence Duplication Levels + + + +
+ +The relative level of duplication found for every sequence.
From the FastQC Help:
+In a diverse library most sequences will occur only once in the final set. +A low level of duplication may indicate a very high level of coverage of the +target sequence, but a high level of duplication is more likely to indicate +some kind of enrichment bias (eg PCR over amplification). This graph shows +the degree of duplication for every sequence in a library: the relative +number of sequences with different degrees of duplication.
+Only sequences which first appear in the first 100,000 sequences +in each file are analysed. This should be enough to get a good impression +for the duplication levels in the whole file. Each sequence is tracked to +the end of the file to give a representative count of the overall duplication level.
+The duplication detection requires an exact sequence match over the whole length of +the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.
+In a properly diverse library most sequences should fall into the far left of the +plot in both the red and blue lines. A general level of enrichment, indicating broad +oversequencing in the library will tend to flatten the lines, lowering the low end +and generally raising other categories. More specific enrichments of subsets, or +the presence of low complexity contaminants will tend to produce spikes towards the +right of the plot.
+
+ Overrepresented sequences + + + +
+ +The total amount of overrepresented sequences found in each library.
FastQC calculates and lists overrepresented sequences in FastQ files. It would not be +possible to show this for all samples in a MultiQC report, so instead this plot shows +the number of sequences categorized as overrepresented.
+Sometimes, a single sequence may account for a large number of reads in a dataset. +To show this, the bars are split into two: the first shows the overrepresented reads +that come from the single most common sequence. The second shows the total count +from all remaining overrepresented sequences.
+From the FastQC Help:
+A normal high-throughput library will contain a diverse set of sequences, with no +individual sequence making up more than a tiny fraction of the whole. Finding that a single +sequence is very overrepresented in the set either means that it is highly biologically +significant, or indicates that the library is contaminated, or not as diverse as you expected.
+FastQC lists all of the sequences which make up more than 0.1% of the total. +To conserve memory only sequences which appear in the first 100,000 sequences are tracked +to the end of the file. It is therefore possible that a sequence which is overrepresented +but doesn't appear at the start of the file for some reason could be missed by this module.
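The 0.1% rule and its blind spot can be illustrated with a short sketch. The function name `overrepresented` and its parameters are hypothetical, and the code is a simplification of the behaviour described above rather than FastQC's implementation.

```python
from collections import Counter
from typing import Dict, Iterable


def overrepresented(
    reads: Iterable[str], threshold_pct: float = 0.1, track_limit: int = 100_000
) -> Dict[str, float]:
    """Return tracked sequences that make up more than `threshold_pct` of all reads."""
    counts: Counter = Counter()
    total = 0
    for index, read in enumerate(reads):
        total += 1
        if read in counts:
            counts[read] += 1
        elif index < track_limit:
            # To bound memory, new sequences are only admitted early in the file.
            counts[read] = 1
    return {
        seq: 100.0 * n / total
        for seq, n in counts.items()
        if 100.0 * n / total > threshold_pct
    }
```

Calling it with a small `track_limit` on a file whose dominant sequence only starts late shows the caveat from the help text: that sequence is never tracked and so never reported.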
+
+ Adapter Content + + + +
+ +The cumulative percentage count of the proportion of your + library which has seen each of the adapter sequences at each position.
Note that only samples with ≥ 0.1% adapter contamination are shown.
+There may be several lines per sample, as one is shown for each adapter +detected in the file.
+From the FastQC Help:
+The plot shows a cumulative percentage count of the proportion +of your library which has seen each of the adapter sequences at each position. +Once a sequence has been seen in a read it is counted as being present +right through to the end of the read so the percentages you see will only +increase as the read length goes on.
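The cumulative counting described above can be sketched like this. The function name `adapter_content` is hypothetical and the adapter is matched as one exact substring, which is an assumption made for brevity and not necessarily how FastQC searches for adapters.

```python
from typing import List


def adapter_content(reads: List[str], adapter: str) -> List[float]:
    """Cumulative percentage of reads in which the adapter has been seen, per position."""
    longest = max((len(read) for read in reads), default=0)
    first_seen = [0] * longest
    for read in reads:
        pos = read.find(adapter)
        if pos != -1:
            # Record only the first position at which the adapter appears.
            first_seen[pos] += 1
    cumulative = []
    running = 0
    for count in first_seen:
        # Once seen, a read stays counted through to the end of the read.
        running += count
        cumulative.append(100.0 * running / len(reads))
    return cumulative
```

Because `running` never decreases, the returned percentages can only rise along the read, matching the shape of the lines in the plot.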
+
+ Status Checks + + + +
+ +Status for each FastQC section showing whether results seem entirely normal (green), +slightly abnormal (orange) or very unusual (red).
FastQC assigns a status for each section of the report. +These give a quick evaluation of whether the results of the analysis seem +entirely normal (green), slightly abnormal (orange) or very unusual (red).
+It is important to stress that although the analysis results appear to give a pass/fail result, +these evaluations must be taken in the context of what you expect from your library. +A 'normal' sample as far as FastQC is concerned is random and diverse. +Some experiments may be expected to produce libraries which are biased in particular ways. +You should treat the summary evaluations therefore as pointers to where you should concentrate +your attention and understand why your library may not look random and diverse.
+Specific guidance on how to interpret the output of each module can be found in the relevant +report section, or in the FastQC help.
+In this heatmap, we summarise all of these into a single heatmap for a quick overview. +Note that not all FastQC sections have plots in MultiQC reports, but all status checks +are shown in this heatmap.
+ + + +
FastQ Screen (trimmed)
+FastQ Screen (trimmed) allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect.
+ + + + ++ Mapped Reads + +
+ + + + + + + + + ++ + + +
VerifyBAMID
+VerifyBAMID detects sample contamination and/or sample swaps.
+ + + + + + +The following values provide estimates of sample contamination. Click help for more information.
Please note that FREEMIX is named Contamination (Seq) and CHIPMIX +is named Contamination (S+A) in this MultiQC report.
VerifyBamID provides several pieces of information that help determine +whether the sample is possibly contaminated or swapped, but there is no single +criterion that works for every circumstance. There are a few unmodeled factors +in the estimation of [SELF-IBD]/[BEST-IBD] and [%MIX], so please note that the +MLE estimation may not always exactly match the true amount of contamination. +Here we provide a guideline to flag potentially contaminated/swapped samples:
- Each sample or lane can be checked in this way. + When [CHIPMIX] >> 0.02 and/or [FREEMIX] >> 0.02, meaning 2% or more of + non-reference bases are observed in reference sites, we recommend examining + the data more carefully for the possibility of contamination.
- We recommend checking each lane for the possibility of sample swaps. + When [CHIPMIX] ~ 1 AND [FREEMIX] ~ 0, then it is possible that the sample + is swapped with another sample. When [CHIPMIX] ~ 0 in the .bestSM file, + [CHIP_ID] might actually be the swapped sample. Otherwise, the swapped + sample may not exist in the genotype data you have compared.
- When genotype data is not available but the allele-frequency-based estimate of + [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, then it is possible + that the sample is contaminated with another sample. We recommend using + per-sample data rather than per-lane data for checking this for low coverage + data, because the inference will be more confident when there are a large number + of bases with depth 2 or higher.
+
Copied from the VerifyBAMID documentation - see the link for more details.
Sample Name | Read Group | SNPS | M Reads | Average Depth | Contamination (Seq) | FREELK1 | FREELK0 |
---|---|---|---|---|---|---|---|
A | NA | 100000 | 0.0 | 1.5 X | 46.296% | -72 | -77 |
B | NA | 100000 | 0.0 | 1.5 X | 46.296% | -72 | -77 |
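The VerifyBamID guideline quoted above can be turned into a small screening helper. This is only a sketch: the function name `flag_sample` is hypothetical, values are fractions (0.02 means 2%), the guideline's `>>` is approximated by a plain `>`, and the tolerance used for `~` (`swap_tol`) and the cut-off for a "large" likelihood difference (`min_lk_gain`) are illustrative assumptions, not values from the VerifyBamID documentation.

```python
from typing import List, Optional


def flag_sample(
    freemix: float,
    chipmix: Optional[float] = None,
    freelk1: Optional[float] = None,
    freelk0: Optional[float] = None,
    swap_tol: float = 0.1,      # assumed tolerance for "~ 1" and "~ 0"
    min_lk_gain: float = 1.0,   # assumed cut-off for a "large" FREELK1 - FREELK0
) -> List[str]:
    """Return guideline flags for one sample or lane."""
    if chipmix is not None:
        # Genotype data available: check the swap pattern first, then contamination.
        if chipmix >= 1.0 - swap_tol and freemix <= swap_tol:
            return ["possible sample swap"]
        if chipmix > 0.02 or freemix > 0.02:
            return ["possible contamination"]
        return []
    # No genotype data: rely on the sequence-only estimate and likelihoods.
    if freemix >= 0.03 and freelk1 is not None and freelk0 is not None:
        if freelk1 - freelk0 >= min_lk_gain:
            return ["possible contamination (sequence-only estimate)"]
    return []
```

With the values in the table above (FREEMIX 0.46296, FREELK1 -72, FREELK0 -77, no CHIPMIX), `flag_sample(0.46296, freelk1=-72, freelk0=-77)` returns the sequence-only contamination flag under these assumed cut-offs.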
+ + + +
Somalier
+Somalier calculates genotype :: pedigree correspondence checks from sketches derived from BAM/CRAM or VCF
+ + + + ++ Statistics + +
+ +Various statistics from the somalier report.
Sample Name | Sex | Ancestry | P(Ancestry) | HetVar | NA sites | Sites depth | Allele balance | Allele balance < 0.2, > 0.8 | HetVar X | Mean depth X | Mean depth Y |
---|---|---|---|---|---|---|---|---|---|---|---|
A | unknown | AFR | 0.31 | 1 | 17383 | 7.0 X | 0.6 | 0.00 | 0 | 0.0 X | 0.0 X |
B | unknown | AFR | 0.31 | 1 | 17383 | 7.0 X | 0.6 | 0.00 | 0 | 0.0 X | 0.0 X |
+
+ Relatedness + +
+ +Shared allele rates between sample pairs. +Points are coloured by degree of expected-relatedness: Unrelated, Sib-sib, 0.49, Parent-child.
+
+ Relatedness Heatmap + +
+ +Heatmap displaying relatedness of sample pairs.
+
+ Heterozygosity + + + +
+ +Standard deviation of heterozygous allele balance against mean depth.
A high standard deviation in allele balance suggests contamination.
+
+ Sex + + + +
+ +Predicted sex against scaled depth on X
Higher values of scaled depth on X suggest female; lower values suggest male.
+
+ Ancestry Barplot + + + +
+ +Predicted ancestries of samples.
Shows the percentwise predicted probability of each +ancestry. A sample might contain traces of several ancestries. +If the number of samples is too high, the plot is rendered as a +non-interactive flat image.
+
+ Ancestry PCA + + + +
+ +Principal components of samples against background PCs.
Sample PCs are plotted against background PCs from the +background data supplied to somalier. +Color indicates predicted ancestry of sample. Data points in close +proximity are predicted to be of similar ancestry. Consider whether +the samples cluster as expected.