- Added new LOAD_SAMPLESHEET subworkflow to centralize samplesheet processing
- Updated tags to prevent inappropriate S3 auto-cleanup
- Testing infrastructure
  - Split up the tests in `End-to-end MGS workflow test` so that they can be run in parallel on GitHub Actions.
  - Implemented an end-to-end test that checks whether the RUN workflow produces the correct output. The correct output for the test has been saved in `test-data/gold-standard-results` so that users can diff their test output against it to see where their pipeline might be failing.
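
One way to compare a test run against the saved results in `test-data/gold-standard-results` is a small script like the following. This is a minimal sketch rather than part of the pipeline: the function names are hypothetical, and it assumes that files should match exactly once gzip compression is removed.

```python
import gzip
from pathlib import Path


def read_payload(path: Path) -> bytes:
    """Return file contents, decompressing .gz files so that gzip
    metadata (e.g. timestamps) does not cause spurious differences."""
    data = path.read_bytes()
    return gzip.decompress(data) if path.suffix == ".gz" else data


def compare_to_gold(output_dir: str, gold_dir: str) -> dict:
    """Compare every file under gold_dir with its counterpart in output_dir.

    Returns relative paths that are missing from output_dir or whose
    contents differ from the gold-standard copy.
    """
    out_root, gold_root = Path(output_dir), Path(gold_dir)
    report = {"missing": [], "different": []}
    gold_files = sorted(p for p in gold_root.rglob("*") if p.is_file())
    for gold_file in gold_files:
        rel = gold_file.relative_to(gold_root)
        candidate = out_root / rel
        if not candidate.is_file():
            report["missing"].append(rel.as_posix())
        elif read_payload(candidate) != read_payload(gold_file):
            report["different"].append(rel.as_posix())
    return report
```

For example, `compare_to_gold("output", "test-data/gold-standard-results")` would list the files to inspect, where `output` stands in for wherever the test run wrote its results.
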
- Began development of single-end read processing (still in progress)
- Restructured RAW, CLEAN, QC, TAXONOMY, and PROFILE workflows to handle both single-end and paired-end reads
- Added new FASTP_SINGLE, TRUNCATE_CONCAT_SINGLE, BBDUK_SINGLE, CONCAT_GROUP_SINGLE, SUBSET_READS_SINGLE and SUBSET_READS_SINGLE_TARGET processes to handle single-end reads
- Created separate end-to-end test workflow for single-end processing (which will be removed once single-end processing is fully integrated)
- Modified samplesheet handling to support both single-end and paired-end data
- Updated generate_samplesheet.sh to handle single-end data with --single_end flag
- Added read_type.config to handle single-end vs paired-end settings (set automatically based on samplesheet format)
- Created run_dev_se.config and run_dev_se.nf for single-end development testing (which will be removed once single-end processing is fully integrated)
- Added single-end samplesheet to test-data
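
The automatic choice between single-end and paired-end settings could work along the following lines. This is only an illustrative sketch: the column names `fastq_1` and `fastq_2` are assumptions borrowed from nf-core-style samplesheets, and the pipeline's actual header and detection logic may differ.

```python
import csv
import io


def detect_read_type(samplesheet_text: str) -> str:
    """Classify a samplesheet as single-end or paired-end from its header.

    Assumption: paired-end sheets have both "fastq_1" and "fastq_2"
    columns, while single-end sheets have only "fastq_1".
    """
    reader = csv.reader(io.StringIO(samplesheet_text))
    header = [col.strip() for col in next(reader, [])]
    if "fastq_1" in header and "fastq_2" in header:
        return "paired_end"
    if "fastq_1" in header:
        return "single_end"
    raise ValueError(f"Unrecognized samplesheet header: {header}")
```
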
- Changes to default read filtering:
  - Relaxed FASTP quality filtering (`--cut_mean_quality` and `--average_qual` reduced from 25 to 20).
  - Relaxed BBDUK viral filtering (switched from 3 21-mers to 1 24-mer).
- Overhauled BLAST validation functionality:
  - BLAST now runs on forward and reverse reads independently
  - BLAST output filtering no longer assumes specific filename suffixes
  - Paired BLAST output includes more information
  - RUN_VALIDATION can now directly take in FASTA files instead of a virus read DB
  - Fixed issues with publishing BLAST output under new Nextflow version
- Implemented nf-test for end-to-end testing of pipeline functionality
  - Implemented test suite in `tests/main.nf.test`
  - Reconfigured INDEX workflow to enable generation of miniature index directories for testing
  - Added GitHub Actions workflow in `.github/workflows/end-to-end.yml`
  - Pull requests will now fail if any of INDEX, RUN, or RUN_VALIDATION crashes when run on test data.
  - Generated first version of new, curated test dataset for testing RUN workflow. Samplesheet and config file are available in `test-data`. The previous test dataset in `test` has been removed.
- Implemented S3 auto-cleanup:
  - Added tags to published files to facilitate S3 auto-cleanup
  - Added S3 lifecycle configuration file to `ref`, along with a script in `bin` to add it to an S3 bucket
- Minor changes
  - Added logic to check whether the `grouping` variable in `nextflow.config` matches the input samplesheet; if it doesn't, the code throws an error.
  - Externalized resource specifications to `resources.config`, removing hardcoded CPU/memory values
  - Renamed `index-params.json` to `params-index.json` to avoid clash with GitHub Actions
  - Removed redundant subsetting statement from TAXONOMY workflow.
  - Added --group_across_illumina_lanes option to generate_samplesheet
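
The consistency check between the `grouping` setting and the samplesheet might look roughly like this. It is a sketch under stated assumptions, not the pipeline's own code: it supposes that grouped samplesheets carry a column named `group` (a hypothetical name) and that `grouping` is a boolean.

```python
import csv
import io


def check_grouping(samplesheet_text: str, grouping: bool) -> None:
    """Raise ValueError if the grouping setting disagrees with the samplesheet.

    Assumption: a samplesheet intended for grouping has a "group" column.
    """
    header = next(csv.reader(io.StringIO(samplesheet_text)), [])
    has_group = "group" in [col.strip() for col in header]
    if grouping and not has_group:
        raise ValueError("grouping is enabled but the samplesheet has no 'group' column")
    if has_group and not grouping:
        raise ValueError("samplesheet has a 'group' column but grouping is disabled")
```

Failing early in this way surfaces a mismatched configuration before any reads are processed.
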
- Enabled extraction of BBDuk-subset putatively-host-viral raw reads for downstream chimera detection.
- Added back viral read fields accidentally being discarded by COLLAPSE_VIRUS_READS.
- Reintroduced user-specified sample grouping and concatenation (e.g. across sequencing lanes) for deduplication in PROFILE and EXTRACT_VIRAL_READS.
- Generalised pipeline to detect viruses infecting arbitrary host taxa (not just human-infecting viruses) as specified by `ref/host-taxa.tsv` and config parameters.
- Configured index workflow to enable hard-exclusion of specific virus taxa (primarily phages) from being marked as infecting host taxa of interest.
- Updated pipeline output code to match changes made in latest Nextflow update (24.10.0).
- Created a new script `bin/analyze-pipeline.py` to analyze pipeline structure and identify unused workflows and modules.
- Cleaned up unused workflows and modules made obsolete in this and previous updates.
- Moved module scripts from `bin` to module directories.
- Modified trace filepath to be predictable across runs.
- Removed addParams calls when importing dependencies (deprecated in latest Nextflow update).
- Switched from nt to core_nt for BLAST validation.
- Reconfigured QC subworkflow to run FASTQC and MultiQC on each pair of input files separately (fixes bug arising from allowing arbitrary filenames for forward and reverse read files).
- Created a new output directory for log files, called `logging`.
- Added the Nextflow trace file to the `logging` directory; it can be used to understand CPU and memory usage and other information such as runtime. After running the pipeline, `plot-timeline-script.R` can be used to generate a summary plot of the runtime for each process in the pipeline.
- Removed CONCAT_GZIPPED.
- Replaced the sample input format with something more similar to nf-core, called `samplesheet.csv`. This new input file can be generated using the script `generate_samplesheet.sh`.
- Now run deduplication on paired-end reads using clumpify in the taxonomic workflow.
- Fragment length analysis and deduplication analysis.
  - BBtools: Extract the fragment length as well as the number of duplicates from the taxonomic workflow and add them to `hv_hits_putative_collapsed.tsv.gz`.
  - Bowtie2: Conduct a duplication analysis on the aligned reads, then add the number of duplicates and fragment length to `hv_hits_putative_collapsed.tsv.gz`.
- Added validation workflow for post-hoc BLAST validation of putative HV reads.
- Fixed subsetReads to run on all reads when the number of reads per sample is below the set threshold.
- Clarifications to documentation (in README and elsewhere)
- Re-added "joined" status marker to reads output by join_fastq.py
- Restructured run workflow to improve computational efficiency, especially on large datasets
- Added preliminary BBDuk masking step to HV identification phase
- Added read subsampling to profiling phase
- Deleted ribodepletion and deduplication from preprocessing phase
- Added riboseparation to profiling phase
- Restructured profiling phase output
- Added `addcounts` and `passes` flags to deduplication in HV identification phase
- Parallelized key bottlenecks in index workflow
- Added custom suffix specification for raw read files
- Assorted bug fixes
- Added specific container versions to `containers.config`
- Added version & time tracking to workflows
- Added index reference files (params, version) to run output
- Minor changes to default config files
- Major refactor
- Start of changelog