From 3f64a903ee1b2a2d3ca3e82598e3f878c537eee0 Mon Sep 17 00:00:00 2001
From: Manavalan Gajapathy
Date: Thu, 13 Jul 2023 08:20:23 -0500
Subject: [PATCH 01/12] updates page name

---
 mkdocs.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mkdocs.yaml b/mkdocs.yaml
index d397ef9..b09f5d2 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -2,7 +2,7 @@ site_name: QuaC pipeline
 nav:
   - Home: index.md
   - Installation and Configuration: installation_configuration.md
-  - How to run QuaC: quac_cli.md
+  - QuaC command line interface: quac_cli.md
   - Input/Output: input_output.md
   - QuaC-Watch: quac_watch.md
   - System testing: system_testing.md

From 5b63690127989c271ee420838f798b558c68b2e6 Mon Sep 17 00:00:00 2001
From: Manavalan Gajapathy
Date: Thu, 13 Jul 2023 08:21:20 -0500
Subject: [PATCH 02/12] removes unnecessary text

---
 docs/quac_cli.md | 32 --------------------------------
 1 file changed, 32 deletions(-)

diff --git a/docs/quac_cli.md b/docs/quac_cli.md
index 78ca149..0783236 100644
--- a/docs/quac_cli.md
+++ b/docs/quac_cli.md
@@ -64,38 +64,6 @@ QuaC wrapper options:
                         wrapper's) will be stored (default: data/quac/logs)
 ```
 
-### Useful features
-
-Besides the basic features, wrapper script [`src/run_quac.py`](../src/run_quac.py) offers the following:
-
-- Pass custom snakemake args using option `--extra_args`.
-- Dry-run snakemake using flag `--dryrun`. Note that this is same as `--extra_args='-n'`.
-- Submit snakemake process to Slurm, instead of running it locally, using `--cli_cluster_config`.
-- Submit jobs triggered by snakemake workflow to Slurm using `--snakemake_cluster_config`.
-
-## Minimal example
-
-Minimal example to run the wrapper script, which in turn will execute the QuaC pipeline on-machine: (instead of using a
-SLURM job scheduler on an HPC system for running on a distributed system)
-
-```sh
-# First set up dependencies in the environment.
-### Cheaha users can set them up as follows.
-module reset
-module load Anaconda3/2020.02
-module load Singularity/3.5.2-GCC-5.4.0-2.26
-
-# activate conda env
-conda activate quac
-
-# run CLI/wrapper script
-python src/run_quac.py \
-    --project_name "PROJECT_DUCK" \
-    --projects_path "/path/to/the/projects" \
-    --pedigree "path/to/lake/with/ducks_pedigree_file.ped" \
-    --quac_watch_config "path/to/quac_watch_config.yaml"
-```
-
 ## Example usage
 
 Refer to commands used in [system testing](./system_testing.md) for example usage.

From 16a52c58284ccb36c206676aae900c7117ea6155 Mon Sep 17 00:00:00 2001
From: Manavalan Gajapathy
Date: Thu, 13 Jul 2023 08:21:51 -0500
Subject: [PATCH 03/12] removes broken links to files outside docs dir

---
 docs/quac_watch.md         | 6 ++----
 docs/system_testing.md     | 7 +++----
 docs/visualize_pipeline.md | 2 +-
 3 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/docs/quac_watch.md b/docs/quac_watch.md
index c541a47..4084727 100644
--- a/docs/quac_watch.md
+++ b/docs/quac_watch.md
@@ -11,10 +11,8 @@ and readily highlight samples that need further review.
 We provide pre-defined thresholds for QC metrics as part of the QuaC repo and they need to be supplied via
 `--quac_watch_config`:
 
-* For Genome sequencing -
-  [configs/quac_watch/wgs_quac_watch_config.yaml](../configs/quac_watch/wgs_quac_watch_config.yaml)
-* For Exome sequencing -
-  [configs/quac_watch/exome_quac_watch_config.yaml](../configs/quac_watch/exome_quac_watch_config.yaml)
+* For Genome sequencing - `configs/quac_watch/wgs_quac_watch_config.yaml`
+* For Exome sequencing - `configs/quac_watch/exome_quac_watch_config.yaml`
 
 These thresholds were curated based on
diff --git a/docs/system_testing.md b/docs/system_testing.md
index ea5515c..2425d32 100644
--- a/docs/system_testing.md
+++ b/docs/system_testing.md
@@ -1,10 +1,9 @@
 # System testing
 
 The system testing implemented for this pipeline tests whether the pipeline runs from start to finish without any error.
-This testing uses test datasets present in [`.test/ngs-data/test_project`](../.test/ngs-data/test_project), which
-reflects a test project containing four samples -- Two samples without priorQC data (`no_priorQC`) and two with priorQC
-data (`include_priorQC`). [See .test/README.md](../.test/README.md) for more info on how these test datasets were
-created.
+This testing uses test datasets present in `.test/ngs-data/test_project`, which reflects a test project containing four
+samples -- Two samples without priorQC data (`no_priorQC`) and two with priorQC data (`include_priorQC`). See
+`.test/README.md` for more info on how these test datasets were created.
 
 !!! warning
 
diff --git a/docs/visualize_pipeline.md b/docs/visualize_pipeline.md
index 103ac06..d6b1395 100644
--- a/docs/visualize_pipeline.md
+++ b/docs/visualize_pipeline.md
@@ -1,7 +1,7 @@
 # Visualization of pipeline
 
 [Visualization of the pipeline](https://snakemake.readthedocs.io/en/stable/executing/cli.html#visualization) based on
-the test datasets are available in [directory `./pipeline_visualized/`](../pipeline_visualized/). Commands used to
+the test datasets are available in directory `./pipeline_visualized/`. Commands used to
 create this visualization:
 
 ```sh

From e1375a2af16ae3dc87badd2ac21a9edea64fce40 Mon Sep 17 00:00:00 2001
From: Manavalan Gajapathy
Date: Thu, 13 Jul 2023 08:37:46 -0500
Subject: [PATCH 04/12] updates page name

---
 mkdocs.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mkdocs.yaml b/mkdocs.yaml
index b09f5d2..c4e752a 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -2,7 +2,7 @@ site_name: QuaC pipeline
 nav:
   - Home: index.md
   - Installation and Configuration: installation_configuration.md
-  - QuaC command line interface: quac_cli.md
+  - Command line interface: quac_cli.md
   - Input/Output: input_output.md
   - QuaC-Watch: quac_watch.md
   - System testing: system_testing.md

From 5b5e58a7199e23420974e7c3615065b2fef175e2 Mon Sep 17 00:00:00 2001
From: Manavalan Gajapathy
Date: Thu, 13 Jul 2023 08:38:13 -0500
Subject: [PATCH 05/12] makes required args explicit

---
 src/run_quac.py | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/run_quac.py b/src/run_quac.py
index bd04dfb..ecf51ab 100755
--- a/src/run_quac.py
+++ b/src/run_quac.py
@@ -311,33 +311,33 @@ def create_dirpath(arg):
     )
 
     ############ Args for QuaC workflow ############
-    WORKFLOW = PARSER.add_argument_group("QuaC workflow options")
+    WORKFLOW = PARSER.add_argument_group("QuaC snakemake workflow options")
 
     WORKFLOW.add_argument(
         "--project_name",
-        help="Project name",
-        metavar="",
+        help="Project name. Required.",
+        required=True,
     )
     WORKFLOW.add_argument(
         "--projects_path",
-        help="Path where all projects are hosted. Do not include project name here.",
+        help="Path where all projects are hosted. Do not include project name here. Required.",
         type=lambda x: is_valid_dir(PARSER, x),
-        metavar="",
+        required=True,
     )
     WORKFLOW.add_argument(
         "--pedigree",
-        help="Pedigree filepath. Must correspond to the project supplied via --project_name",
+        help="Pedigree filepath. Must correspond to the project supplied via --project_name. Required.",
         type=lambda x: is_valid_file(PARSER, x),
-        metavar="",
+        required=True,
     )
     WORKFLOW.add_argument(
         "--quac_watch_config",
         help=(
             "YAML config path specifying QC thresholds for QuaC-Watch."
-            " See directory 'configs/quac_watch/' in quac repo for the included config files."
+            " See directory 'configs/quac_watch/' in quac repo for the included config files. Required."
         ),
         type=lambda x: is_valid_file(PARSER, x),
-        metavar="",
+        required=True,
     )
     WORKFLOW.add_argument(
         "--workflow_config",

From 06c603ff4d4b40f4a040d93791bbef183faa1481 Mon Sep 17 00:00:00 2001
From: Manavalan Gajapathy
Date: Thu, 13 Jul 2023 08:56:07 -0500
Subject: [PATCH 06/12] improves link

---
 docs/input_output.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/input_output.md b/docs/input_output.md
index 915382e..2ef89f1 100644
--- a/docs/input_output.md
+++ b/docs/input_output.md
@@ -101,8 +101,8 @@ as input to QuaC.
 ## Output
 
 QuaC results are stored at the path specified via option `--outdir` (default:
-`data/quac/results/test_project/analysis`). Refer to the [system testing's output](./system_testing.md) to learn more
-about the output directory structure.
+`data/quac/results/test_project/analysis`). Refer to the [system testing's
+output](./system_testing.md#expected-output-files) to learn more about the output directory structure.
 
 !!! tip
 

From 08d2b2c7708e72a08678777925d520b8ca5b2a08 Mon Sep 17 00:00:00 2001
From: Manavalan Gajapathy
Date: Thu, 13 Jul 2023 08:56:24 -0500
Subject: [PATCH 07/12] adds example multiqc report

---
 docs/example_output/multiqc_report.html | 11202 ++++++++++++++++++++++
 1 file changed, 11202 insertions(+)
 create mode 100644 docs/example_output/multiqc_report.html

diff --git a/docs/example_output/multiqc_report.html b/docs/example_output/multiqc_report.html
new file mode 100644
index 0000000..0138167
--- /dev/null
+++ b/docs/example_output/multiqc_report.html
@@ -0,0 +1,11202 @@
+MultiQC Report

        Note that additional data was saved in multiqc_report_data when this report was generated.

        If you use plots from MultiQC in a publication or presentation, please cite:

        +
        + MultiQC: Summarize analysis results for multiple tools and samples in a single report
        + Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
        + Bioinformatics (2016)
        + doi: 10.1093/bioinformatics/btw354
        + PMID: 27312411 +
        +

        About MultiQC

        +

        This report was generated using MultiQC, version 1.9

        +

You can see a YouTube video describing how to use MultiQC reports here: https://youtu.be/qPbIlO_KWN0

For more information about MultiQC, including other videos and extensive documentation, please visit http://multiqc.info

You can report bugs, suggest improvements and find the source code for MultiQC on GitHub: https://github.com/ewels/MultiQC

        +

        + +
        + +
        + + +
        + + + +

        + + + + +

        + + + +

        + A modular tool to aggregate results from bioinformatics analyses across many samples into a single report. +

        + + + + + + + + + + + +
        +

Report generated on 2023-06-22, 12:59 based on data in:

• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/picard-stats
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/B/qc/fastqc-raw
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/B/qc/fastq_screen-trimmed
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/project_level_qc/somalier/ancestry
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/A/qc/fastqc-raw
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/verifyBamID
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/A/qc/fastqc-trimmed
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/verifyBamID
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/samtools-stats
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/samtools-stats
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/A/qc/dedup
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/project_level_qc/somalier/relatedness
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/bcftools-stats
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/quac_watch
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/B/qc/dedup
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/project_level_qc/multiqc/configs
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/qualimap/B
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/quac_watch
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/B/qc/fastqc-trimmed
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/qualimap/A
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/bcftools-stats
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/.test/ngs-data/test_project/analysis/A/qc/fastq_screen-trimmed
• /data/project/worthey_lab/projects/experimental_pipelines/mana/pipelines/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/picard-stats
        +

        General Statistics

        + + + + + + + + + + Showing 14/18 rows and 21/51 columns. + +
        +
        + +
Sample Name | % GC | Ins. size | ≥ 15X | ≥ 30X | ≥ 40X | Median cov | Mean cov | % Aligned | M Reads | % Aligned | % Dups | Error rate | % Proper Pairs | Vars | Hom | Het | Ts/Tv | % Dups | % GC | M Seqs | Contamination (S)
        A
        46%
        383
        0.0%
        0.0%
        0.0%
        0.0X
        0.0X
        100.0%
        0.0
        100%
        0.52%
        98.7%
        6863
        1783
        3878
        2.23
        46.296%
        A-1
        3.2%
        A-1-R1
        7.5%
        47%
        0.0
        A-1-R2
        6.2%
        47%
        0.0
        A-2
        2.7%
        A-2-R1
        6.5%
        47%
        0.0
        A-2-R2
        5.4%
        47%
        0.0
        B
        46%
        383
        0.0%
        0.0%
        0.0%
        0.0X
        0.0X
        100.0%
        0.0
        100%
        0.52%
        98.7%
        6863
        1783
        3878
        2.23
        46.296%
        B-1
        3.2%
        B-1-R1
        7.5%
        47%
        0.0
        B-1-R2
        6.2%
        47%
        0.0
        B-2
        2.7%
        B-2-R1
        6.5%
        47%
        0.0
        B-2-R2
        5.4%
        47%
        0.0
        a.chr21.1
        a.chr21.2
        b.chr21.1
        b.chr21.2
        + + +
        + + + + + + +
        +

        QuaC-Watch

        +

This section contains QuaC-Watch results. QuaC-Watch summarizes whether samples have passed the QC thresholds.
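
The pass/fail calls summarized in this section can be pictured with a small sketch. It is only an illustration of range-based threshold checking, not QuaC-Watch's actual implementation: the function name `check_thresholds`, the `min`/`max` layout of the thresholds, and the threshold values themselves are all hypothetical and are not taken from the config files in `configs/quac_watch/`.

```python
def check_thresholds(metrics, thresholds):
    """Return {metric: "pass" | "fail"} for each metric that has a threshold.

    Hypothetical helper: the {"min": ..., "max": ...} layout is an assumption
    made for this sketch, not the schema of the QuaC-Watch YAML configs.
    """
    results = {}
    for name, limits in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric cannot be verified, so it is treated as a failure here.
            results[name] = "fail"
            continue
        low = limits.get("min", float("-inf"))
        high = limits.get("max", float("inf"))
        results[name] = "pass" if low <= value <= high else "fail"
    return results


# Hypothetical thresholds, for illustration only.
thresholds = {
    "percentage_aligned": {"min": 98.0},
    "general_error_rate": {"max": 0.01},
    "median_insert_size": {"min": 300, "max": 500},
}
# Example metric values for one sample.
metrics = {
    "percentage_aligned": 100.0,
    "general_error_rate": 0.0052,
    "median_insert_size": 383,
}
print(check_thresholds(metrics, thresholds))
```

Treating a missing metric as a failure is a design choice of this sketch; a real tool might instead report it as a separate status.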

        + + + + +
        + +

        + Overall QuaC-Watch Summary + +

        + +

        Overall QuaC-Watch summary of results from several QC tools

        + + +
        + + + + + + + + + Showing 2/2 rows and 9/9 columns. + +
        +
        + +
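| Sample Name | fastqc | qualimap_overall | qualimap_chromosome_specific | picard | picard_dups | bcftools_stats | variant_per_contig | verifybamid | fastq_screen |
|---|---|---|---|---|---|---|---|---|---|
| B | fail | fail | fail | fail | pass | fail | fail | fail | fail |
| A | fail | fail | fail | fail | pass | fail | fail | fail | fail |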
        + +
        + +
        +
        + + + + +
        + +

        + FastQC (trimmed) + +

        + +

        Quick summary of FastQC (trimmed) results. See FastQC section below for detailed results.

        + + +
        + + + + + + + + + Showing 8/8 rows and 10/10 columns. + +
        +
        + +
| Sample Name | per_base_sequence_quality | per_tile_sequence_quality | per_sequence_quality_scores | per_base_sequence_content | per_sequence_gc_content | per_base_n_content | sequence_length_distribution | sequence_duplication_levels | overrepresented_sequences | adapter_content |
|---|---|---|---|---|---|---|---|---|---|---|
| B-1-R1 | pass | fail | pass | fail | fail | pass | warn | pass | fail | pass |
| B-1-R2 | pass | fail | pass | fail | warn | warn | warn | pass | warn | pass |
| B-2-R1 | pass | fail | pass | fail | fail | pass | warn | pass | fail | pass |
| B-2-R2 | pass | fail | pass | fail | warn | warn | warn | pass | warn | pass |
| A-1-R1 | pass | fail | pass | fail | fail | pass | warn | pass | fail | pass |
| A-1-R2 | pass | fail | pass | fail | warn | warn | warn | pass | warn | pass |
| A-2-R1 | pass | fail | pass | fail | fail | pass | warn | pass | fail | pass |
| A-2-R2 | pass | fail | pass | fail | warn | warn | warn | pass | warn | pass |
        + +
        + +
        +
        + + + + +
        + +

        + Qualimap - Overall stats + +

        + +

        Quick summary of Qualimap results. See QualiMap section below for detailed results.

        + + +
        + + + + + + + + + Showing 2/2 rows and 7/14 columns. + +
        +
        + +
| Sample Name | avg_gc | percentage_aligned | mean_coverage | median_coverage | mean_cov:median_cov | median_insert_size | general_error_rate |
|---|---|---|---|---|---|---|---|
| B | fail | pass | fail | fail | fail | pass | pass |
| A | fail | pass | fail | fail | fail | pass | pass |
        + +
        + +
        +
        + + + + +
        + +

        + Qualimap - Chromosome stats + +

        + +

Quick summary of chromosome-level coverage info using Qualimap results. See QualiMap section below for detailed results.

        + + +
        +
        + +
        +
        + + + + +
        + +

        + Picard + +

        + +

Quick summary of Picard metrics. Note: Picard duplication metrics are reported separately. See the Picard section below for detailed results.

        + + +
        + + + + + + + + + Showing 2/2 rows and 8/16 columns. + +
        +
        + +
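| Sample Name | PCT_PF_READS_ALIGNED | PF_HQ_ALIGNED_Q20_BASES | PCT_ADAPTER | PCT_CHIMERAS | Q30_BASES | perc_Q30_BASES | PCT_EXC_TOTAL | PCT_15X |
|---|---|---|---|---|---|---|---|---|
| B | pass | fail | pass | fail | fail | fail | pass | fail |
| A | pass | fail | pass | fail | fail | fail | pass | fail |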
        + +
        + +
        +
        + + + + +
        + +

        + Picard-dups + +

        + +

        Quick summary of picard-duplication metrics. See Picard section below for detailed results.

        + + +
        + + + + + + + + + Showing 4/4 rows and 1/2 columns. + +
        +
        + +
| Sample Name | PERCENT_DUPLICATION |
|---|---|
| B-1 | pass |
| B-2 | pass |
| A-1 | pass |
| A-2 | pass |
        + +
        + +
        +
        + + + + +
        + +

        + Bcftools stats + +

        + +

        Quick summary of Bcftools-stats results. See Bcftools section below for detailed results.

        + + +
        + + + + + + + + + Showing 2/2 rows and 7/14 columns. + +
        +
        + +
| Sample Name | number_of_records | number_of_SNPs | number_of_indels | perc_snps | perc_indels | tstv | heterozygosity_ratio |
|---|---|---|---|---|---|---|---|
| B | fail | fail | fail | pass | pass | fail | pass |
| A | fail | fail | fail | pass | pass | fail | pass |
        + +
        + +
        +
        + + + + +
        + +

        + Variant frequency per contig + +

        + +

        Quick summary of %variant per contig results.

        + + +
        + + + + + + + + + Showing 2/2 rows and 24/48 columns. + +
        +
        + +
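| Sample Name | chr1 | chr2 | chr3 | chr4 | chr5 | chr6 | chr7 | chr8 | chr9 | chr10 | chr11 | chr12 | chr13 | chr14 | chr15 | chr16 | chr17 | chr18 | chr19 | chr20 | chr21 | chr22 | chrX | chrY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail |
| A | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail |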
        + +
        + +
        +
        + + + + +
        + +

        + VerifyBAMID + +

        + +

        Quick summary of VerifyBAMID results. See VerifyBAMID section below for detailed results.

        + + +
        + + + + + + + + + Showing 2/2 rows and 1/2 columns. + +
        +
        + +
| Sample Name | Contamination(%) |
|---|---|
| B | fail |
| A | fail |
        + +
        + +
        +
        + + + + +
        + +

        + FastQ Screen (trimmed) + +

        + +

        Quick summary of FastQ Screen (trimmed) results. See FastQ Screen section below for detailed results.

        + + +
        + + + + + + + + + Showing 8/8 rows and 15/30 columns. + +
        +
        + +
| Sample Name | %Human | %Mouse | %Rat | %No hits | %Drosophila | %Worm | %Yeast | %Arabidopsis | %Ecoli | %rRNA | %MT | %PhiX | %Lambda | %Vectors | %Adapters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B-1-R1_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
| B-1-R2_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
| B-2-R1_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
| B-2-R2_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
| A-1-R1_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
| A-1-R2_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
| A-2-R1_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
| A-2-R2_screen | fail | fail | fail | fail | fail | pass | fail | fail | pass | fail | pass | pass | pass | pass | pass |
        + +
        + + +
        + + +
        +
        + + + +
        +

        QualiMap

        +

        QualiMap is a platform-independent application to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.

        + + + + +
        + +

        + Coverage histogram + + + +

        + +

        Distribution of the number of locations in the reference genome with a given depth of coverage.

        + + +
        +

For a set of DNA or RNA reads mapped to a reference sequence, such as a genome or transcriptome, the depth of coverage at a given base position is the number of high-quality reads that map to the reference at that position (Sims et al. 2014).

Bases of a reference sequence (y-axis) are grouped by their depth of coverage (0×, 1×, …, N×) (x-axis). This plot shows the frequency of coverage depths relative to the reference sequence for each read dataset, which provides an indirect measure of the level and variation of coverage depth in the corresponding sequenced sample.

If reads are randomly distributed across the reference sequence, this plot should resemble a Poisson distribution (Lander & Waterman 1988), with a peak indicating approximate depth of coverage, and more uniform coverage depth being reflected in a narrower spread. The optimal level of coverage depth depends on the aims of the experiment, though it should at minimum be sufficiently high to adequately address the biological question; greater uniformity of coverage is generally desirable, because it increases breadth of coverage for a given depth of coverage, allowing equivalent results to be achieved at a lower sequencing depth (Sampson et al. 2011; Sims et al. 2014). However, it is difficult to achieve uniform coverage depth in practice, due to biases introduced during sample preparation (van Dijk et al. 2014), sequencing (Ross et al. 2013) and read mapping (Sims et al. 2014).

This plot may include a small peak for regions of the reference sequence with zero depth of coverage. Such regions may be absent from the given sample (due to a deletion or structural rearrangement), present in the sample but not successfully sequenced (due to bias in sequencing or preparation), or sequenced but not successfully mapped to the reference (due to the choice of mapping algorithm, the presence of repeat sequences, or mismatches caused by variants or sequencing errors). Related factors cause most datasets to contain some unmapped reads (Sims et al. 2014).
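
The grouping described above can be sketched in a few lines of Python. This is a toy illustration rather than QualiMap's own code: `coverage_histogram` is a hypothetical helper, and the per-base depths are made-up values.

```python
from collections import Counter


def coverage_histogram(depths):
    """Map each coverage depth to the number of reference positions with that depth."""
    return dict(sorted(Counter(depths).items()))


# Made-up per-base depths for a 9-base reference; two positions have zero coverage.
depths = [0, 0, 1, 2, 2, 2, 3, 3, 5]
print(coverage_histogram(depths))  # {0: 2, 1: 1, 2: 3, 3: 2, 5: 1}
```

The entry for depth 0 corresponds to the small peak at zero coverage mentioned in the last paragraph.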

        +
        + +
        +
        + +
        +
        + + + + +
        + +

        + Cumulative genome coverage + + + +

        + +

        Percentage of the reference genome with at least the given depth of coverage.

        + + +
        +

For a set of DNA or RNA reads mapped to a reference sequence, such as a genome or transcriptome, the depth of coverage at a given base position is the number of high-quality reads that map to the reference at that position, while the breadth of coverage is the fraction of the reference sequence to which reads have been mapped with at least a given depth of coverage (Sims et al. 2014).

Defining coverage breadth in terms of coverage depth is useful, because sequencing experiments typically require a specific minimum depth of coverage over the region of interest (Sims et al. 2014), so the extent of the reference sequence that is amenable to analysis is constrained to lie within regions that have sufficient depth. With inadequate sequencing breadth, it can be difficult to distinguish the absence of a biological feature (such as a gene) from a lack of data (Green 2007).

For increasing coverage depths (1×, 2×, …, N×), QualiMap calculates coverage breadth as the percentage of the reference sequence that is covered by at least that number of reads, then plots coverage breadth (y-axis) against coverage depth (x-axis). This plot shows the relationship between sequencing depth and breadth for each read dataset, which can be used to gauge, for example, the likely effect of a minimum depth filter on the fraction of a genome available for analysis.
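
As a rough sketch of this calculation (not QualiMap's implementation), coverage breadth at each depth can be computed from per-base depths as below. The helper name `cumulative_coverage` and the depth values are assumptions made for illustration.

```python
def cumulative_coverage(depths, max_depth):
    """Return {d: percentage of positions with depth >= d} for d = 1..max_depth."""
    if not depths:
        raise ValueError("depths must contain at least one position")
    total = len(depths)
    return {
        d: 100.0 * sum(1 for x in depths if x >= d) / total
        for d in range(1, max_depth + 1)
    }


# Made-up per-base depths for a 10-base reference.
depths = [0, 0, 1, 2, 2, 2, 3, 3, 5, 5]
print(cumulative_coverage(depths, max_depth=5))
# {1: 80.0, 2: 70.0, 3: 40.0, 4: 20.0, 5: 20.0}
```

In this toy example, applying a minimum depth filter of 3× would leave 40% of the reference available for analysis.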

        +
        + +
        +
        + +
        +
        + + + + +
        + +

        + Insert size histogram + + + +

        + +

        Distribution of estimated insert sizes of mapped reads.

        + + +
        +

        To overcome limitations in the length of DNA or RNA sequencing reads, +many sequencing instruments can produce two or more shorter reads from +one longer fragment in which the relative position of reads is +approximately known, such as paired-end or mate-pair reads +(Mardis 2013). Such techniques can extend the reach +of sequencing technology, allowing for more accurate placement of reads +(Reinert et al. 2015) and better resolution of repeat +regions (Reinert et al. 2015), as well as detection of +structural variation (Alkan et al. 2011) and chimeric transcripts +(Maher et al. 2009).

        +

        All these methods assume that the approximate size of an insert is known. +(Insert size can be defined as the length in bases of a sequenced DNA or +RNA fragment, excluding technical sequences such as adapters, which are +typically removed before alignment.) This plot allows for that assumption +to be assessed. With the set of mapped fragments for a given sample, QualiMap +groups the fragments by insert size, then plots the frequency of mapped +fragments (y-axis) over a range of insert sizes (x-axis). In an ideal case, +the distribution of fragment sizes for a sequencing library would culminate +in a single peak indicating average insert size, with a narrow spread +indicating highly consistent fragment lengths.

        +

        QualiMap calculates insert sizes as follows: for each fragment in which +every read mapped successfully to the same reference sequence, it +extracts the insert size from the TLEN field of the leftmost read +(see the Qualimap 2 documentation), where the TLEN (or +'observed Template LENgth') field contains 'the number of bases from the +leftmost mapped base to the rightmost mapped base' +(SAM +format specification). Note that because it is defined in terms of +alignment to a reference sequence, the value of the TLEN field may +differ from the insert size due to factors such as alignment clipping, +alignment errors, or structural variation or splicing in a gap between +reads from the same fragment.
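As a rough illustration of the TLEN-based approach described above, the sketch below builds an insert size histogram from SAM text records. It is not QualiMap's implementation: the function name `insert_size_histogram` and the example records are hypothetical, and the only facts relied on are the SAM column layout (FLAG in column 2, RNEXT in column 7, TLEN in column 9) and the SAM convention that the leftmost segment of a template carries a positive TLEN.

```python
from collections import Counter


def insert_size_histogram(sam_lines):
    """Count templates per insert size, using TLEN of the leftmost read."""
    hist = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rnext, tlen = int(fields[1]), fields[6], int(fields[8])
        if flag & 0x4 or flag & 0x8:  # read or mate unmapped
            continue
        if rnext != "=":  # mate mapped to a different reference
            continue
        if tlen > 0:  # positive TLEN marks the leftmost read of the pair
            hist[tlen] += 1
    return hist


def make_record(qname, flag, rnext, tlen):
    """Build a minimal, hypothetical SAM record for the example."""
    return "\t".join(
        [qname, str(flag), "chr1", "100", "60", "50M", rnext, "250", str(tlen), "*", "*"]
    )


records = [
    "@HD\tVN:1.6",
    make_record("r1", 99, "=", 300),
    make_record("r1", 147, "=", -300),
    make_record("r2", 99, "=", 300),
    make_record("r2", 147, "=", -300),
    make_record("r3", 73, "*", 0),     # mate unmapped
    make_record("r4", 97, "chr2", 0),  # mate on another reference
]
sizes = insert_size_histogram(records)
print(dict(sizes))  # {300: 2}
```

Counting only the positive-TLEN read avoids counting each fragment twice.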

        +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + GC content distribution + + + +

        + +

        Each solid line represents the distribution of GC content of mapped reads for a given sample. The dotted line represents a pre-calculated GC distribution for the reference genome.

        + + +
        +

        GC bias is the difference between the guanine-cytosine content +(GC-content) of a set of sequencing reads and the GC-content of the DNA +or RNA in the original sample. It is a well-known issue with sequencing +systems, and may be introduced by PCR amplification, among other factors +(Benjamini +& Speed 2012; Ross et al. 2013).

        +

        QualiMap calculates the GC-content of individual mapped reads, then +groups those reads by their GC-content (1%, 2%, …, 100%), and +plots the frequency of mapped reads (y-axis) at each level of GC-content +(x-axis). This plot shows the GC-content distribution of mapped reads +for each read dataset, which should ideally resemble that of the +original sample. It can be useful to display the GC-content distribution +of an appropriate reference sequence for comparison, and QualiMap has an +option to do this (see the Qualimap 2 documentation).
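The per-read GC-content grouping described above can be sketched as follows. This is a simplified illustration rather than QualiMap's code: the names `gc_percent` and `gc_histogram` are hypothetical, and ignoring `N` bases when computing the percentage is an assumption made for this example.

```python
from collections import Counter


def gc_percent(seq):
    """Return the GC percentage of a read, or None if it has no called bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(reads):
    """Count reads per whole-percent GC bin."""
    hist = Counter()
    for read in reads:
        pct = gc_percent(read)
        if pct is not None:
            hist[round(pct)] += 1
    return hist


gc_hist = gc_histogram(["GGCC", "ATAT", "GCAT", "NNNN"])
print(sorted(gc_hist.items()))  # [(0, 1), (50, 1), (100, 1)]
```

Dividing each bin count by the number of reads gives the frequency plotted on the y-axis.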

        +
        + +
        loading..
        +
        + + +
        + + +
        +
        + + + +
        +

        Picard

        +

        Picard is a set of Java command line tools for manipulating high-throughput sequencing data.

        + + + + +
        + +

        + Alignment Summary + +

        + +

        Please note that Picard's read counts are divided by two for paired-end data.

        + + +
        + + +
        +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Mark Duplicates + + + +

        + +

        Number of reads, categorised by duplication state. Pair counts are doubled - see help text for details.

        + + +
        +

        The table in the Picard metrics file contains some columns referring to +read pairs and some referring to single reads.

        +

        To make the numbers in this plot sum correctly, values referring to pairs are doubled +according to the scheme below:

        +
          +
        • READS_IN_DUPLICATE_PAIRS = 2 * READ_PAIR_DUPLICATES
        • +
        • READS_IN_UNIQUE_PAIRS = 2 * (READ_PAIRS_EXAMINED - READ_PAIR_DUPLICATES)
        • +
        • READS_IN_UNIQUE_UNPAIRED = UNPAIRED_READS_EXAMINED - UNPAIRED_READ_DUPLICATES
        • +
        • READS_IN_DUPLICATE_PAIRS_OPTICAL = 2 * READ_PAIR_OPTICAL_DUPLICATES
        • +
        • READS_IN_DUPLICATE_PAIRS_NONOPTICAL = READS_IN_DUPLICATE_PAIRS - READS_IN_DUPLICATE_PAIRS_OPTICAL
        • +
        • READS_IN_DUPLICATE_UNPAIRED = UNPAIRED_READ_DUPLICATES
        • +
        • READS_UNMAPPED = UNMAPPED_READS
        • +
        +
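The doubling scheme listed above can be written out directly. The sketch below assumes the Picard MarkDuplicates metrics for one sample have already been parsed into a dictionary of integers keyed by column name; the function name `reads_by_duplication_state` and the example numbers are hypothetical.

```python
def reads_by_duplication_state(m):
    """Convert Picard MarkDuplicates metrics to per-read counts (pairs doubled)."""
    dup_pairs = 2 * m["READ_PAIR_DUPLICATES"]
    dup_pairs_optical = 2 * m["READ_PAIR_OPTICAL_DUPLICATES"]
    return {
        "READS_IN_DUPLICATE_PAIRS": dup_pairs,
        "READS_IN_UNIQUE_PAIRS": 2 * (m["READ_PAIRS_EXAMINED"] - m["READ_PAIR_DUPLICATES"]),
        "READS_IN_UNIQUE_UNPAIRED": m["UNPAIRED_READS_EXAMINED"] - m["UNPAIRED_READ_DUPLICATES"],
        "READS_IN_DUPLICATE_PAIRS_OPTICAL": dup_pairs_optical,
        "READS_IN_DUPLICATE_PAIRS_NONOPTICAL": dup_pairs - dup_pairs_optical,
        "READS_IN_DUPLICATE_UNPAIRED": m["UNPAIRED_READ_DUPLICATES"],
        "READS_UNMAPPED": m["UNMAPPED_READS"],
    }


# Hypothetical metrics for a single sample.
metrics = {
    "READ_PAIRS_EXAMINED": 100,
    "READ_PAIR_DUPLICATES": 10,
    "READ_PAIR_OPTICAL_DUPLICATES": 2,
    "UNPAIRED_READS_EXAMINED": 20,
    "UNPAIRED_READ_DUPLICATES": 5,
    "UNMAPPED_READS": 7,
}
read_counts = reads_by_duplication_state(metrics)
print(read_counts["READS_IN_UNIQUE_PAIRS"])  # 180
```

With these values the paired categories sum to 200 reads (2 × 100 pairs examined), which is why the pair-based columns are doubled before plotting.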
        + +
        + + +
        +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + WGS Coverage + +

        + +

        The number of bases in the genome territory for each fold coverage. Note that the final 1% of data is hidden to prevent very long tails.

        + + +
        + + +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + WGS Filtered Bases + +

        + +

        For more information about the filtered categories, see the Picard documentation.

        + + +
        +
        loading..
        +
        + + +
        + + +
        +
        + + + +
        +

        Samtools

        +

        Samtools is a suite of programs for interacting with high-throughput sequencing data.

        + + + + +
        + +

        + Percent Mapped + + + +

        + +

        Alignment metrics from samtools stats; mapped vs. unmapped reads.

        + + +
        +

        For a set of samples that have come from the same multiplexed library, +similar numbers of reads for each sample are expected. Large differences in numbers might +indicate issues during the library preparation process. Whilst large differences in read +numbers may be controlled for in downstream processing (e.g. read count normalisation), +you may wish to consider whether the read depths achieved have fallen below the levels +recommended for your application.

        +

        Low alignment rates could indicate contamination of samples (e.g. adapter sequences), +low sequencing quality or other artefacts. These can be further investigated in the +sequence level QC (e.g. from FastQC).

        +
        + +
        + + +
        +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Alignment metrics + +

        + +

        This module parses the output from samtools stats. All numbers in millions.

        + + +
        +
        loading..
        +
        + + +
        + + +
        +
        + + + +
        +

        Bcftools

        +

        Bcftools contains utilities for variant calling and manipulating VCFs and BCFs.

        + + + + +
        + +

        + Variant Substitution Types + +

        + + + + +
        + + +
        +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Variant Quality + +

        + + + + +
        + + + + +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Indel Distribution + +

        + + + + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Variant depths + +

        + +

        Read depth support distribution for called variants

        + + +
        loading..
        +
        + + +
        + + +
        +
        + + + +
        +

        FastQC (trimmed)

        +

        This section of the report shows FastQC results after adapter trimming.

        + + + + +
        + +

        + Sequence Counts + + + +

        + +

        Sequence counts for each sample. Duplicate read counts are an estimate only.

        + + +
        +

        This plot shows the total number of reads, broken down into unique and duplicate +if possible (only more recent versions of FastQC give duplicate info).

        +

        You can read more about duplicate calculation in the +FastQC documentation. +A small part has been copied here for convenience:

        +

        Only sequences which first appear in the first 100,000 sequences +in each file are analysed. This should be enough to get a good impression +for the duplication levels in the whole file. Each sequence is tracked to +the end of the file to give a representative count of the overall duplication level.

        +

        The duplication detection requires an exact sequence match over the whole length of +the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.
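The sampling and truncation rules quoted above can be illustrated with a short sketch. It follows the description in the text only and is not FastQC's source code; the function name `duplication_counts` and its parameter names are hypothetical.

```python
from collections import Counter


def duplication_counts(reads, track_limit=100_000, long_read=75, keep=50):
    """Count exact-match duplicates for sequences first seen within track_limit reads."""
    counts = Counter()
    for index, seq in enumerate(reads):
        # Reads longer than long_read bases are compared on their first keep bases.
        key = seq[:keep] if len(seq) > long_read else seq
        if key in counts:
            counts[key] += 1      # tracked sequences are counted to the end of the file
        elif index < track_limit:
            counts[key] = 1       # new sequences are only added early in the file
    return counts


reads = ["AAAA", "CCCC", "AAAA", "GGGG", "CCCC", "GGGG"]
seen = duplication_counts(reads, track_limit=3)
print(dict(seen))  # {'AAAA': 2, 'CCCC': 2}
```

In the example, `GGGG` first appears after the tracking window has closed, so it is never counted, which is the reason the reported duplicate counts are an estimate.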

        +
        + +
        + + +
        +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Sequence Quality Histograms + + + +

        + +

        The mean quality value across each base position in the read.

        + + +
        +

        To enable multiple samples to be plotted on the same graph, only the mean quality +scores are plotted (unlike the box plots seen in FastQC reports).

        +

        Taken from the FastQC help:

        +

        The y-axis on the graph shows the quality scores. The higher the score, the better +the base call. The background of the graph divides the y axis into very good quality +calls (green), calls of reasonable quality (orange), and calls of poor quality (red). +The quality of calls on most platforms will degrade as the run progresses, so it is +common to see base calls falling into the orange area towards the end of a read.

        +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Per Sequence Quality Scores + + + +

        + +

        The number of reads with average quality scores. Shows if a subset of reads has poor quality.

        + + +
        +

        From the FastQC help:

        +

        The per sequence quality score report allows you to see if a subset of your +sequences have universally low quality values. It is often the case that a +subset of sequences will have universally poor quality, however these should +represent only a small percentage of the total sequences.

        +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Per Base Sequence Content + + + +

        + +

        The proportion of each base position for which each of the four normal DNA bases has been called.

        + + +
        +

        To enable multiple samples to be shown in a single plot, the base composition data +is shown as a heatmap. The colours represent the balance between the four bases: +an even distribution should give an even muddy brown colour. Hover over the plot +to see the percentage of the four bases under the cursor.

        +

        To see the data as a line plot, as in the original FastQC graph, click on a sample track.

        +

        From the FastQC help:

        +

        Per Base Sequence Content plots out the proportion of each base position in a +file for which each of the four normal DNA bases has been called.

        +

        In a random library you would expect that there would be little to no difference +between the different bases of a sequence run, so the lines in this plot should +run parallel with each other. The relative amount of each base should reflect +the overall amount of these bases in your genome, but in any case they should +not be hugely imbalanced from each other.

        +

        It's worth noting that some types of library will always produce biased sequence +composition, normally at the start of the read. Libraries produced by priming +using random hexamers (including nearly all RNA-Seq libraries) and those which +were fragmented using transposases inherit an intrinsic bias in the positions +at which reads start. This bias does not concern an absolute sequence, but instead +provides enrichment of a number of different K-mers at the 5' end of the reads. +Whilst this is a true technical bias, it isn't something which can be corrected +by trimming and in most cases doesn't seem to adversely affect the downstream +analysis.

        +
        + +
        +
        +
        + + Click a sample row to see a line plot for that dataset. +
        +
        Rollover for sample name
        + +
        + Position: - +
        %T: -
        +
        %C: -
        +
        %A: -
        +
        %G: -
        +
        +
        +
        + +
        +
        +
        +
        + + +
        +
        + + + + +
        + +

        + Per Sequence GC Content + + + +

        + +

        The average GC content of reads. Normal random libraries typically have a + roughly normal distribution of GC content.

        + + +
        +

        From the FastQC help:

        +

        This module measures the GC content across the whole length of each sequence +in a file and compares it to a modelled normal distribution of GC content.

        +

        In a normal random library you would expect to see a roughly normal distribution +of GC content where the central peak corresponds to the overall GC content of +the underlying genome. Since we don't know the GC content of the genome the +modal GC content is calculated from the observed data and used to build a +reference distribution.

        +

        An unusually shaped distribution could indicate a contaminated library or +some other kinds of biased subset. A normal distribution which is shifted +indicates some systematic bias which is independent of base position. If there +is a systematic bias which creates a shifted normal distribution then this won't +be flagged as an error by the module since it doesn't know what your genome's +GC content should be.

        +
        + +
        + + +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Per Base N Content + + + +

        + +

        The percentage of base calls at each position for which an N was called.

        + + +
        +

        From the FastQC help:

        +

        If a sequencer is unable to make a base call with sufficient confidence then it will +normally substitute an N rather than a conventional base call. This graph shows the +percentage of base calls at each position for which an N was called.

        +

        It's not unusual to see a very low proportion of Ns appearing in a sequence, especially +nearer the end of a sequence. However, if this proportion rises above a few percent +it suggests that the analysis pipeline was unable to interpret the data well enough to +make valid base calls.

        +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Sequence Length Distribution + +

        + +

        The distribution of fragment sizes (read lengths) found. + See the FastQC help

        + + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Sequence Duplication Levels + + + +

        + +

        The relative level of duplication found for every sequence.

        + + +
        +

        From the FastQC Help:

        +

        In a diverse library most sequences will occur only once in the final set. +A low level of duplication may indicate a very high level of coverage of the +target sequence, but a high level of duplication is more likely to indicate +some kind of enrichment bias (eg PCR over amplification). This graph shows +the degree of duplication for every sequence in a library: the relative +number of sequences with different degrees of duplication.

        +

        Only sequences which first appear in the first 100,000 sequences +in each file are analysed. This should be enough to get a good impression +for the duplication levels in the whole file. Each sequence is tracked to +the end of the file to give a representative count of the overall duplication level.

        +

        The duplication detection requires an exact sequence match over the whole length of +the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.

        +

        In a properly diverse library most sequences should fall into the far left of the +plot in both the red and blue lines. A general level of enrichment, indicating broad +oversequencing in the library will tend to flatten the lines, lowering the low end +and generally raising other categories. More specific enrichments of subsets, or +the presence of low complexity contaminants will tend to produce spikes towards the +right of the plot.

        +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Overrepresented sequences + + + +

        + +

        The total amount of overrepresented sequences found in each library.

        + + +
        +

        FastQC calculates and lists overrepresented sequences in FastQ files. It would not be +possible to show this for all samples in a MultiQC report, so instead this plot shows +the number of sequences categorized as overrepresented.

        +

        Sometimes, a single sequence may account for a large number of reads in a dataset. +To show this, the bars are split into two: the first shows the overrepresented reads +that come from the single most common sequence. The second shows the total count +from all remaining overrepresented sequences.

        +

        From the FastQC Help:

        +

        A normal high-throughput library will contain a diverse set of sequences, with no +individual sequence making up more than a tiny fraction of the whole. Finding that a single +sequence is very overrepresented in the set either means that it is highly biologically +significant, or indicates that the library is contaminated, or not as diverse as you expected.

        +

        FastQC lists all of the sequences which make up more than 0.1% of the total. +To conserve memory only sequences which appear in the first 100,000 sequences are tracked +to the end of the file. It is therefore possible that a sequence which is overrepresented +but doesn't appear at the start of the file for some reason could be missed by this module.
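A simplified sketch of the 0.1% rule and the two-part bar described above is given below. It counts every read rather than tracking only sequences seen in the first 100,000, so it is an illustration of the threshold and the split, not a reproduction of FastQC or MultiQC behaviour; the function name `overrepresented_summary` and the placeholder read strings are hypothetical.

```python
from collections import Counter


def overrepresented_summary(reads, threshold_pct=0.1):
    """Split overrepresented read counts into the top sequence and the rest."""
    total = len(reads)
    counts = Counter(reads)
    over = {
        seq: n for seq, n in counts.items()
        if 100.0 * n / total > threshold_pct
    }
    if not over:
        return {"top": 0, "remaining": 0}
    top = max(over.values())
    return {"top": top, "remaining": sum(over.values()) - top}


# Placeholder strings stand in for real read sequences.
reads = ["ADAPTER"] * 30 + ["POLYA"] * 20 + [f"READ{i}" for i in range(9950)]
summary = overrepresented_summary(reads)
print(summary)  # {'top': 30, 'remaining': 20}
```

Here two sequences exceed 0.1% of the 10,000 reads: the most common one contributes 30 reads to the first part of the bar and the other contributes 20 to the second.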

        +
        + +
        +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Adapter Content + + + +

        + +

        The cumulative percentage count of the proportion of your + library which has seen each of the adapter sequences at each position.

        + + +
        +

        Note that only samples with ≥ 0.1% adapter contamination are shown.

        +

        There may be several lines per sample, as one is shown for each adapter +detected in the file.

        +

        From the FastQC Help:

        +

        The plot shows a cumulative percentage count of the proportion +of your library which has seen each of the adapter sequences at each position. +Once a sequence has been seen in a read it is counted as being present +right through to the end of the read so the percentages you see will only +increase as the read length goes on.

        +
        + +
        No samples found with any adapter contamination > 0.1%
        + +
        +
        + + + + +
        + +

        + Status Checks + + + +

        + +

        Status for each FastQC section showing whether results seem entirely normal (green), +slightly abnormal (orange) or very unusual (red).

        + + +
        +

        FastQC assigns a status for each section of the report. +These give a quick evaluation of whether the results of the analysis seem +entirely normal (green), slightly abnormal (orange) or very unusual (red).

        +

        It is important to stress that although the analysis results appear to give a pass/fail result, +these evaluations must be taken in the context of what you expect from your library. +A 'normal' sample as far as FastQC is concerned is random and diverse. +Some experiments may be expected to produce libraries which are biased in particular ways. +You should treat the summary evaluations therefore as pointers to where you should concentrate +your attention and understand why your library may not look random and diverse.

        +

        Specific guidance on how to interpret the output of each module can be found in the relevant +report section, or in the FastQC help.

        +

        This heatmap summarises all of these statuses for a quick overview. +Note that not all FastQC sections have plots in MultiQC reports, but all status checks +are shown in this heatmap.

        +
        + +
        + +
        loading..
        +
        + + +
        + + +
        +
        + + + +
        +

        FastQC (raw)

        +

        FastQC (raw) is a quality control tool for high throughput sequence data, written by Simon Andrews at the Babraham Institute in Cambridge.

        + + + + +
        + +

        + Sequence Counts + + + +

        + +

        Sequence counts for each sample. Duplicate read counts are an estimate only.

        + + +
        +

        This plot shows the total number of reads, broken down into unique and duplicate +if possible (only more recent versions of FastQC give duplicate info).

        +

        You can read more about duplicate calculation in the +FastQC documentation. +A small part has been copied here for convenience:

        +

        Only sequences which first appear in the first 100,000 sequences +in each file are analysed. This should be enough to get a good impression +for the duplication levels in the whole file. Each sequence is tracked to +the end of the file to give a representative count of the overall duplication level.

        +

        The duplication detection requires an exact sequence match over the whole length of +the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.

        +
        + +
        + + +
        +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Sequence Quality Histograms + + + +

        + +

        The mean quality value across each base position in the read.

        + + +
        +

        To enable multiple samples to be plotted on the same graph, only the mean quality +scores are plotted (unlike the box plots seen in FastQC reports).

        +

        Taken from the FastQC help:

        +

        The y-axis on the graph shows the quality scores. The higher the score, the better +the base call. The background of the graph divides the y axis into very good quality +calls (green), calls of reasonable quality (orange), and calls of poor quality (red). +The quality of calls on most platforms will degrade as the run progresses, so it is +common to see base calls falling into the orange area towards the end of a read.

        +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Per Sequence Quality Scores + + + +

        + +

        The number of reads with average quality scores. Shows if a subset of reads has poor quality.

        + + +
        +

        From the FastQC help:

        +

        The per sequence quality score report allows you to see if a subset of your +sequences have universally low quality values. It is often the case that a +subset of sequences will have universally poor quality, however these should +represent only a small percentage of the total sequences.

        +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Per Base Sequence Content + + + +

        + +

        The proportion of each base position for which each of the four normal DNA bases has been called.

        + + +
        +

        To enable multiple samples to be shown in a single plot, the base composition data +is shown as a heatmap. The colours represent the balance between the four bases: +an even distribution should give an even muddy brown colour. Hover over the plot +to see the percentage of the four bases under the cursor.

        +

        To see the data as a line plot, as in the original FastQC graph, click on a sample track.

        +

        From the FastQC help:

        +

        Per Base Sequence Content plots out the proportion of each base position in a +file for which each of the four normal DNA bases has been called.

        +

        In a random library you would expect that there would be little to no difference +between the different bases of a sequence run, so the lines in this plot should +run parallel with each other. The relative amount of each base should reflect +the overall amount of these bases in your genome, but in any case they should +not be hugely imbalanced from each other.

        +

        It's worth noting that some types of library will always produce biased sequence +composition, normally at the start of the read. Libraries produced by priming +using random hexamers (including nearly all RNA-Seq libraries) and those which +were fragmented using transposases inherit an intrinsic bias in the positions +at which reads start. This bias does not concern an absolute sequence, but instead +provides enrichment of a number of different K-mers at the 5' end of the reads. +Whilst this is a true technical bias, it isn't something which can be corrected +by trimming and in most cases doesn't seem to adversely affect the downstream +analysis.

        +
        + +
        +
        +
        + + Click a sample row to see a line plot for that dataset. +
        +
        Rollover for sample name
        + +
        + Position: - +
        %T: -
        +
        %C: -
        +
        %A: -
        +
        %G: -
        +
        +
        +
        + +
        +
        +
        +
        + + +
        +
        + + + + +
        + +

        + Per Sequence GC Content + + + +

        + +

        The average GC content of reads. Normal random libraries typically have a + roughly normal distribution of GC content.

        + + +
        +

        From the FastQC help:

        +

        This module measures the GC content across the whole length of each sequence +in a file and compares it to a modelled normal distribution of GC content.

        +

        In a normal random library you would expect to see a roughly normal distribution +of GC content where the central peak corresponds to the overall GC content of +the underlying genome. Since we don't know the GC content of the genome the +modal GC content is calculated from the observed data and used to build a +reference distribution.

        +

        An unusually shaped distribution could indicate a contaminated library or +some other kinds of biased subset. A normal distribution which is shifted +indicates some systematic bias which is independent of base position. If there +is a systematic bias which creates a shifted normal distribution then this won't +be flagged as an error by the module since it doesn't know what your genome's +GC content should be.

        +
        + +
        + + +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Per Base N Content + + + +

        + +

        The percentage of base calls at each position for which an N was called.

        + + +
        +

        From the FastQC help:

        +

        If a sequencer is unable to make a base call with sufficient confidence then it will +normally substitute an N rather than a conventional base call. This graph shows the +percentage of base calls at each position for which an N was called.

        +

        It's not unusual to see a very low proportion of Ns appearing in a sequence, especially +nearer the end of a sequence. However, if this proportion rises above a few percent +it suggests that the analysis pipeline was unable to interpret the data well enough to +make valid base calls.

        +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Sequence Length Distribution + +

        + +

        The distribution of fragment sizes (read lengths) found. + See the FastQC help

        + + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Sequence Duplication Levels + + + +

        + +

        The relative level of duplication found for every sequence.

        + + +
        +

        From the FastQC Help:

        +

        In a diverse library most sequences will occur only once in the final set. +A low level of duplication may indicate a very high level of coverage of the +target sequence, but a high level of duplication is more likely to indicate +some kind of enrichment bias (eg PCR over amplification). This graph shows +the degree of duplication for every sequence in a library: the relative +number of sequences with different degrees of duplication.

        +

        Only sequences which first appear in the first 100,000 sequences +in each file are analysed. This should be enough to get a good impression +for the duplication levels in the whole file. Each sequence is tracked to +the end of the file to give a representative count of the overall duplication level.

        +

        The duplication detection requires an exact sequence match over the whole length of +the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.

        +

        In a properly diverse library most sequences should fall into the far left of the +plot in both the red and blue lines. A general level of enrichment, indicating broad +oversequencing in the library will tend to flatten the lines, lowering the low end +and generally raising other categories. More specific enrichments of subsets, or +the presence of low complexity contaminants will tend to produce spikes towards the +right of the plot.

        +
        + +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Overrepresented sequences + + + +

        + +

        The total amount of overrepresented sequences found in each library.

        + + +
        +

        FastQC calculates and lists overrepresented sequences in FastQ files. It would not be +possible to show this for all samples in a MultiQC report, so instead this plot shows +the number of sequences categorized as overrepresented.

        +

        Sometimes, a single sequence may account for a large number of reads in a dataset. +To show this, the bars are split into two: the first shows the overrepresented reads +that come from the single most common sequence. The second shows the total count +from all remaining overrepresented sequences.

        +

        From the FastQC Help:

        +

        A normal high-throughput library will contain a diverse set of sequences, with no +individual sequence making up more than a tiny fraction of the whole. Finding that a single +sequence is very overrepresented in the set either means that it is highly biologically +significant, or indicates that the library is contaminated, or not as diverse as you expected.

        +

        FastQC lists all of the sequences which make up more than 0.1% of the total. +To conserve memory only sequences which appear in the first 100,000 sequences are tracked +to the end of the file. It is therefore possible that a sequence which is overrepresented +but doesn't appear at the start of the file for some reason could be missed by this module.

        +
        + +
        +
        loading..
        +
        + +
        +
        + + + + +
        + +

        + Adapter Content + + + +

        + +

        The cumulative percentage count of the proportion of your + library which has seen each of the adapter sequences at each position.

        + + +
        +

        Note that only samples with ≥ 0.1% adapter contamination are shown.

        +

        There may be several lines per sample, as one is shown for each adapter +detected in the file.

        +

        From the FastQC Help:

        +

        The plot shows a cumulative percentage count of the proportion +of your library which has seen each of the adapter sequences at each position. +Once a sequence has been seen in a read it is counted as being present +right through to the end of the read so the percentages you see will only +increase as the read length goes on.

        +
        + +
        +
        + +
        +
        + + + + +
        + +

        + Status Checks + + + +

        + +

        Status for each FastQC section showing whether results seem entirely normal (green), +slightly abnormal (orange) or very unusual (red).

        + + +
        +

        FastQC assigns a status for each section of the report. +These give a quick evaluation of whether the results of the analysis seem +entirely normal (green), slightly abnormal (orange) or very unusual (red).

        +

        It is important to stress that although the analysis results appear to give a pass/fail result, +these evaluations must be taken in the context of what you expect from your library. +A 'normal' sample as far as FastQC is concerned is random and diverse. +Some experiments may be expected to produce libraries which are biased in particular ways. +You should treat the summary evaluations therefore as pointers to where you should concentrate +your attention and understand why your library may not look random and diverse.

        +

        Specific guidance on how to interpret the output of each module can be found in the relevant +report section, or in the FastQC help.

        +

        In this heatmap, we summarise all of these into a single heatmap for a quick overview. +Note that not all FastQC sections have plots in MultiQC reports, but all status checks +are shown in this heatmap.

        +
        + +
        + +
        +
        + + +
        + + +
        +
        + + + +
        +

        FastQ Screen (trimmed)

        +

        FastQ Screen (trimmed) allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect.

        + + + + +
        + +

        + Mapped Reads + +

        + + + + +
        +
        + + + +
        + + +
        +
        + + + +
        +

        VerifyBAMID

        +

        VerifyBAMID detects sample contamination and/or sample swaps.

        + + + + + + +
        + +

        The following values provide estimates of sample contamination. Click help for more information.

        + + +
        +

        Please note that FREEMIX is named Contamination (Seq) and CHIPMIX +is named Contamination (S+A) in this MultiQC report.

        +

        VerifyBamID provides a series of information that is informative to determine +whether the sample is possibly contaminated or swapped, but there is no single +criterion that works for every circumstance. There are a few unmodeled factors +in the estimation of [SELF-IBD]/[BEST-IBD] and [%MIX], so please note that the +MLE estimation may not always exactly match to the true amount of contamination. +Here we provide a guideline to flag potentially contaminated/swapped samples:

        +
          +
        • Each sample or lane can be checked in this way. + When [CHIPMIX] >> 0.02 and/or [FREEMIX] >> 0.02, meaning 2% or more of + non-reference bases are observed in reference sites, we recommend to examine + the data more carefully for the possibility of contamination.
        • +
        • We recommend to check each lane for the possibility of sample swaps. + When [CHIPMIX] ~ 1 AND [FREEMIX] ~ 0, then it is possible that the sample + is swapped with another sample. When [CHIPMIX] ~ 0 in .bestSM file, + [CHIP_ID] might be actually the swapped sample. Otherwise, the swapped + sample may not exist in the genotype data you have compared.
        • +
        • When genotype data is not available but allele-frequency-based estimates of + [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, then it is possible + that the sample is contaminated with other sample. We recommend to use + per-sample data rather than per-lane data for checking this for low coverage + data, because the inference will be more confident when there are large number + of bases with depth 2 or higher.
        • +
        +

        Copied from the VerifyBAMID documentation - see the link for more details.
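The guideline above could be turned into a simple screening helper along these lines. This is a sketch under stated assumptions, not part of VerifyBAMID or QuaC: the function `flag_sample` is hypothetical, and the cutoffs `swap_tol` (how close to 1 or 0 counts as "~") and `min_lk_gain` (what counts as a "large" [FREELK1]-[FREELK0]) are made-up defaults that would need tuning for real data.

```python
def flag_sample(freemix, chipmix=None, freelk1=None, freelk0=None,
                mix_cutoff=0.02, swap_tol=0.05, min_lk_gain=10.0):
    """Return a list of warning flags for one sample or lane.

    `chipmix` is None when no genotype data is available.
    All thresholds except 0.02 and 0.03 are illustrative assumptions.
    """
    flags = []
    if chipmix is not None:
        if chipmix >= 1 - swap_tol and freemix <= swap_tol:
            # CHIPMIX ~ 1 and FREEMIX ~ 0: sample may be swapped
            flags.append("possible swap")
        elif chipmix > mix_cutoff or freemix > mix_cutoff:
            flags.append("possible contamination")
    elif (freemix >= 0.03 and freelk1 is not None and freelk0 is not None
          and freelk1 - freelk0 >= min_lk_gain):
        # No genotype data: rely on FREEMIX and the likelihood difference
        flags.append("possible contamination (no genotype data)")
    return flags
```

A flag here is only a prompt to examine the data more carefully, in line with the guideline's own wording.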

        +
        + +
        + + + + + + + + + Showing 2/2 rows and 7/12 columns. + +
        +
        + +
        | Sample Name | Read Group | SNPS | M Reads | Average Depth | Contamination (Seq) | FREELK1 | FREELK0 |
        |---|---|---|---|---|---|---|---|
        | A | NA | 100000 | 0.0 | 1.5 X | 46.296% | -72 | -77 |
        | B | NA | 100000 | 0.0 | 1.5 X | 46.296% | -72 | -77 |
        + +
        + + +
        + + +
        +
        + + + +
        +

        Somalier

        +

        Somalier calculates genotype :: pedigree correspondence checks from sketches derived from BAM/CRAM or VCF

        + + + + +
        + +

        + Statistics + +

        + +

        Various statistics from the somalier report.

        + + +
        + + + + + + + + + Showing 2/2 rows and 11/26 columns. + +
        +
        + +
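        | Sample Name | Sex | Ancestry | P(Ancestry) | HetVar | NA sites | Sites depth | Allele balance | Allele balance < 0.2, > 0.8 | HetVar X | Mean depth X | Mean depth Y |
        |---|---|---|---|---|---|---|---|---|---|---|---|
        | A | unknown | AFR | 0.31 | 1 | 17383 | 7.0 X | 0.6 | 0.00 | 0 | 0.0 X | 0.0 X |
        | B | unknown | AFR | 0.31 | 1 | 17383 | 7.0 X | 0.6 | 0.00 | 0 | 0.0 X | 0.0 X |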
        + +
        + +
        +
        + + + + +
        + +

        + Relatedness + +

        + +

        Shared allele rates between sample pairs. +Points are coloured by degree of expected-relatedness: Unrelated, Sib-sib, 0.49, Parent-child.

        + + +
        +
        + +
        +
        + + + + +
        + +

        + Relatedness Heatmap + +

        + +

        Heatmap displaying relatedness of sample pairs.

        + + +
        + +
        +
        + +
        +
        + + + + +
        + +

        + Heterozygosity + + + +

        + +

        Standard deviation of heterozygous allele balance against mean depth.

        + + +
        +

        A high standard deviation in allele balance suggests contamination.

        +
        + +
        +
        + +
        +
        + + + + +
        + +

        + Sex + + + +

        + +

        Predicted sex against scaled depth on X

        + + +
        +

        Higher values of scaled depth on X suggest female; low values suggest male.

        +
        + +
        +
        + +
        +
        + + + + +
        + +

        + Ancestry Barplot + + + +

        + +

        Predicted ancestries of samples.

        + + +
        +

        Shows the percentwise predicted probability of each +ancestry. A sample might contain traces of several ancestries. +If the number of samples is too high, the plot is rendered as a +non-interactive flat image.

        +
        + +
        + + +
        +
        +
        + +
        +
        + + + + +
        + +

        + Ancestry PCA + + + +

        + +

        Principal components of samples against background PCs.

        + + +
        +

        Sample PCs are plotted against background PCs from the +background data supplied to somalier. +Color indicates predicted ancestry of sample. Data points in close +proximity are predicted to be of similar ancestry. Consider whether +the samples cluster as expected.

        +
        + +
        +
        + + +
        + + +
        + + + + +
        + + + + + + + + + + + + + + + + From e87e61aa7a2e2d94d079bb566a071399103422a3 Mon Sep 17 00:00:00 2001 From: Manavalan Gajapathy Date: Thu, 13 Jul 2023 12:47:53 -0500 Subject: [PATCH 08/12] highlights multiqc output --- docs/system_testing.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/system_testing.md b/docs/system_testing.md index 2425d32..eb84a59 100644 --- a/docs/system_testing.md +++ b/docs/system_testing.md @@ -114,7 +114,7 @@ data/quac/results/test_project_2samples_wgs-include_priorQC/ │   │   └── ... │   ├── multiqc_final_pass │   │   ├── ... - │   │   └── A_multiqc.html + │   │   └── A_multiqc.html <--- Sample-level multiqc output file │   ├── multiqc_initial_pass │   │   ├── ... │   │   └── A_multiqc.html @@ -143,7 +143,7 @@ data/quac/results/test_project_2samples_wgs-include_priorQC/ │   │   └── aggregated_rename_configs.tsv │   ├── multiqc_report_data │   │   └── ... - │   └── multiqc_report.html + │   └── multiqc_report.html <--- Project-level multiqc output file └── somalier ├── ancestry │   └── ... From 1801364e9ec03aa190dbd8944f5012c3af44268f Mon Sep 17 00:00:00 2001 From: Manavalan Gajapathy Date: Thu, 13 Jul 2023 13:44:00 -0500 Subject: [PATCH 09/12] adds faq on snakemake locks --- docs/faq.md | 10 ++++++++++ mkdocs.yaml | 1 + 2 files changed, 11 insertions(+) create mode 100644 docs/faq.md diff --git a/docs/faq.md b/docs/faq.md new file mode 100644 index 0000000..326f273 --- /dev/null +++ b/docs/faq.md @@ -0,0 +1,10 @@ +# FAQ + +## QuaC workflow failed due to `Error: Directory cannot be locked` + +See [snakemake docs +here](https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-does-snakemake-lock-the-working-directory) on +why snakemake locks the working directory. `Error: Directory cannot be locked` might happen when the parent snakemake +process gets killed unexpectedly before completion. It is recommended to investigate why it got killed before proceeding +to the next step. 
If you want to remove the lock (ie. unlock it), run `src/run_quac.py` with -e='--unlock'. + diff --git a/mkdocs.yaml b/mkdocs.yaml index c4e752a..7fc8853 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -7,6 +7,7 @@ nav: - QuaC-Watch: quac_watch.md - System testing: system_testing.md - Pipeline visualization: visualize_pipeline.md + - Frequently Asked Questions: faq.md - Changelog: Changelog.md - QC review: - Sample QC review system: sample_qc_review_system.md From 887f07389c2dc49e7256c718a9042f5300de726b Mon Sep 17 00:00:00 2001 From: Manavalan Gajapathy Date: Sun, 16 Jul 2023 11:14:51 -0500 Subject: [PATCH 10/12] Update docs/faq.md Co-authored-by: Brandon M Wilk --- docs/faq.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/faq.md b/docs/faq.md index 326f273..d280e02 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -6,5 +6,6 @@ See [snakemake docs here](https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-does-snakemake-lock-the-working-directory) on why snakemake locks the working directory. `Error: Directory cannot be locked` might happen when the parent snakemake process gets killed unexpectedly before completion. It is recommended to investigate why it got killed before proceeding -to the next step. If you want to remove the lock (ie. unlock it), run `src/run_quac.py` with -e='--unlock'. +to the next step. If you want to remove the lock (ie. unlock it), add `-e='--unlock'` to your original run `src/run_quac.py` command. +Once that has completed you can run the original command again and the pipeline will pick up from it's last state. 
From 6c60de885da0ddc3de31f162583c5c7f362723e3 Mon Sep 17 00:00:00 2001 From: Manavalan Gajapathy Date: Sun, 16 Jul 2023 11:27:03 -0500 Subject: [PATCH 11/12] updates changelog --- docs/Changelog.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs/Changelog.md b/docs/Changelog.md index c989ad0..91d9680 100644 --- a/docs/Changelog.md +++ b/docs/Changelog.md @@ -12,6 +12,11 @@ YYYY-MM-DD John Doe ``` --- +2023-07-16 Manavalan Gajapathy + +* Updates doc based on users feedback. + + 2023-06-30 Manavalan Gajapathy * Merges `joss_manuscript` to the `master` branch to bring it up to date. From 9118d18a2f0af8dc9f208b86dbb9646ed7bad216 Mon Sep 17 00:00:00 2001 From: Manavalan Gajapathy Date: Sun, 16 Jul 2023 11:27:44 -0500 Subject: [PATCH 12/12] line wrapping --- docs/faq.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/faq.md b/docs/faq.md index d280e02..501c7c0 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -6,6 +6,7 @@ See [snakemake docs here](https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-does-snakemake-lock-the-working-directory) on why snakemake locks the working directory. `Error: Directory cannot be locked` might happen when the parent snakemake process gets killed unexpectedly before completion. It is recommended to investigate why it got killed before proceeding -to the next step. If you want to remove the lock (ie. unlock it), add `-e='--unlock'` to your original run `src/run_quac.py` command. -Once that has completed you can run the original command again and the pipeline will pick up from it's last state. +to the next step. If you want to remove the lock (ie. unlock it), add `-e='--unlock'` to your original run +`src/run_quac.py` command. Once that has completed you can run the original command again and the pipeline will pick up +from it's last state.
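
The unlock-then-rerun sequence described in the FAQ can be sketched as below. The helper `unlock_then_rerun` is hypothetical and only builds the two argument lists; the option names and paths in the usage note are placeholders taken from the wrapper's documented options, and it assumes the original command does not already pass `-e`/`--extra_args`.

```python
import shlex


def unlock_then_rerun(original_cmd):
    """Return [unlock_command, rerun_command] as argument lists.

    The first list is the original command plus -e=--unlock (what the
    shell passes when you type -e='--unlock'); the second is the
    unchanged original command, to be run once unlocking has completed.
    """
    args = shlex.split(original_cmd)
    unlock = args + ["-e=--unlock"]
    return [unlock, args]
```

For example, `unlock_then_rerun('python src/run_quac.py --project_name "PROJECT_DUCK" --projects_path "/path/to/the/projects"')` yields the two commands in order; each could then be passed to `subprocess.run(cmd, check=True)`, after investigating why the earlier run was killed.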