Skip to content

phac-nml/genomic-sequencing-test-datasets

Repository files navigation

Title: genomic-sequencing-test-datasets

Purpose: This series of test datasets have been prepared emulating real Illumina and Nanopore sequencing data. These datasets are designed for use in:

a) Evaluating software changes prior to implementation

b) Proficiency testing of personnel performing NGS analysis of SARS-CoV-2

Note: These test datasets are focused solely on the analysis and interpretation of sequence data and excludes contextual data.

For information and test datasets on training and quality control of contextual data, please see here. This resource contains "Gold Standard", "Common Errors", and a "Scenarios dataset".

Resources

Sequencing Test Datasets

SARS-CoV-2

  • Location of in-house training set: 10.5281/zenodo.13748104 soon-to-be-released

    • Datasets included:
      • Illumina
        • Training Set A

          • Contains dehosted Illumina paired-end sequencing data generated using the Freed primer scheme for 25 samples and 3 controls
        • Training Set B

          • Contains dehosted Illumina paired-end sequencing data generated using the Freed primer scheme for 25 samples and 5 controls
        • Training Set C

          • Contains dehosted Illumina paired-end sequencing data generated using the freed_V2_nml primer scheme for 92 samples and 5 controls
        • Training Set D

          • Contains dehosted Illumina paired-end sequencing data generated using the freed_V2_nml primer scheme for 92 samples and 7 controls
        • Permission to use these datasets was provided by the Roy Romanow Provincial Laboratory (Saskatchewan). It should not be used for applications outside those outlined in this document without the express written consent of this group.

          Authors: Anna Majer, Shari Tyson, Grace Seo, Philip Mabon, Elsie Grudeski, Rhiannon Huzarewich, Russell Mandes, Anneliese Landgraff, Jennifer Tanner, Natalie Knox, Morag Graham, Gary Van Domselaar, Ryan McDonald, Keith MacKenzie, Meredith Faires, Kara Loos, Stefani Kary, Rachel DePaulo, Alanna Senecal, Amanda Lang, Jessica Minion, Nathalie Bastien, Yan Li, Timothy Booth, Darian Hole, Madison Chapel, Kirsten Biggar, CanCOGeN's metadata curation team, Public Health Agency of Canada CanCOGeN team

      • Nanopore
        • Contains dehosted nanopore fastq files (and possibly fast5 files?) generated using the freed_V2_nml primer scheme for 93 samples and 3 controls
  • CDC public dataset repo git link / publication

    • What's included:
      • QC failures
      • host contamination
      • outbreak data
      • samples generated using different wet-lab approaches
      • datasets across different lineages

Recipes - how datasets are created

  • The Illumina Training Set A is composed of clinical samples from Saskatchewan sequenced by the National Microbiology Laboratory in April 2021 on a MiSeq instrument with the freed primer scheme.

    - A run with clean negative controls was selected for use and a subset of samples representing various Ct values and final genome completeness scores.
    
    - Clinical positive control sample was also included
    
    - An artificial mixture sample was produced by combining reads from two existing samples to ensure that at least one sample would generate an EXCESS_AMBIGUITY flag
    
  • Illumina Training Set B was created by combining Training Set A with a series of additional negative controls, some of which will fail QC checks.

  • Illumina Training Set C is composed of clinical samples from Saskatchewan sequenced by the National Microbiology Laboratory in October 2022 on a MiSeq instrument with the freed_V2_nml primer scheme. This dataset also contains the same 2 negative contols found in Training Set A, as well as 3 additional controls: (i) a no templale control, (ii) a SARS-CoV-2 Alpha variant positive control, and (iii) a SARS-CoV-2 Delta variant positive control.

  • Illumina Training Set D was created by combining the samples from Training Set C with the 4 negative contols found in Training Set B, as well as 3 additional controls: (i) a no templale control, (ii) a SARS-CoV-2 Alpha variant positive control, and (iii) a SARS-CoV-2 Delta variant positive control (these 3 additional controls are the same 3 additional controls fouund in Training set C).

  • The Nanopore Training Set was generated by the Roy Romanow Provincial Laboratory in Saskatchewan on an Mk1c device in January 2022. - Run was selected to try and closely match the Illumina run in terms of composition, that is clean negative controls, a variety of Ct values, and final genome completeness scores.

Since Illumina Training Sets A and B were generated during major SARS-CoV-2 variant waves, the lineage composition of these datasets are homogeneous. The Nanopore dataset and Illumina Training Sets C and D contain samples with a variety of SARS-CoV-2 Omicron sublineages.

Assessing operator proficiency:

- Data interpretation criteria must address the below considerations
    - Do Negative (negative-ctrl-#) controls pass specified internal QC metrics? If you detect warnings, what are the reasons and how will this impact the run?
    - Does the positive control (positive-ctrl-1) pass specified internal QC metrics?
    - Based on the above, is the run valid and can interpretations be made about the samples included? Provide rationale
    - How many samples meet minimum quality standards? List which do and do not.
    - Which (if any) samples display excess ambiguity? Should the data from these samples be excluded?

Guide for producing HTML output comparing QC results of Illumina runs

SIGNAL_output_comparison_v2.Rmd can be used to compare SIGNAL pipeline results generated by two different users on the same test Illumina data set.

This script requires the following seven files (from both analyses) as input:

  • SIGNAL results file:
    • .qc.csv
  • Ncov-tools quality control files:
    • _tree.nwk
    • _ambiguous_position_report.tsv
    • _mixture_report.tsv
    • _ncov_watch_variants.tsv
    • _negative_control_report.tsv
    • _summary_qc.tsv

The current version of the script requires that the arrangement of these files adhere to a specific directory structure. For example:

Training_Set_A
├── User_1_SIGNAL_A
│   └── SIGNAL_analyses
│       ├── ncov-tools
│       │   ├── qc_analysis
│       │   │   └── nml_tree.nwk
│       │   └── qc_reports
│       │       ├── nml_ambiguous_position_report.tsv
│       │       ├── nml_mixture_report.tsv
│       │       ├── nml_ncov_watch_variants.tsv
│       │       ├── nml_negative_control_report.tsv
│       │       └── nml_summary_qc.tsv
│       └── nml.qc.csv
└── User_2_SIGNAL_A
    └── SIGNAL_analyses
        ├── ncov-tools
        │   ├── qc_analysis
        │   │   └── nml_tree.nwk
        │   └── qc_reports
        │       ├── nml_ambiguous_position_report.tsv
        │       ├── nml_mixture_report.tsv
        │       ├── nml_ncov_watch_variants.tsv
        │       ├── nml_negative_control_report.tsv
        │       └── nml_summary_qc.tsv
        └── nml.qc.csv

Configuration file

A configuration (.yaml) file is used to identify the SIGNAL results directories that are to be compared, and to load the necessary R packages for the R markdown script.

The condfiguration file requires the following:

Base_path - absolute path to the directory that contains the Data_folder (see below)

Data_folder - parent directory of per-user analyses folders; Name_of_folder_one and Name_of_folder_two (see below)

Name_of_folder_one - Parent directory for user 1 SIGNAL analyses on a particular Illumina dataset

Name_of_folder_two - Parent directory for user 2 SIGNAL analyses on a particular Illumina dataset

Using the directory structure example above, the configuration file entries would be as follows:

Base_path: <ABSOLUTE/PATH/TO/Training_Set_A> 
    - e.g. "C:/Zenodo_data/SARS_CoV_2/Illumina" 
Data_folder: "Training_Set_A"
Name_of_folder_one: "User_1_SIGNAL_A"
Name_of_folder_two: "User_2_SIGNAL_A"

The following R packages must also be listed in the configuration file as follows:

Packages:
  - dplyr
  - data.table
  - readxl
  - tidyverse
  - stringr
  - R.utils
  - writexl
  - lubridate
  - stringi
  - ggplot2
  - gridExtra
  - tinytex
  - kableExtra
  - DT
  - extraoperators
  - hablar
  - ape
  - phylogram
  - dendextend
  - data.tree
  - here

renv LOCK file usage

A LOCK file has been created to track R version and R package versions used with the script. To load the R package versions used during the creation of this script, run renv::restore().

Script usage

Once the configuration file has been created, knit the script in RStudio to generate HTML output.

HTML Output

Contains a detailed comparison of per-user SIGNAL/ncov-tools output.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •