
Adding single-read functionality to RAW and CLEAN #80

Merged: 108 commits into dev from single-read-raw-clean (Dec 20, 2024)

Conversation

@simonleandergrimm (Collaborator) commented on Oct 28, 2024

This PR adds support for single-read (single-end) sequencing data to the RAW and CLEAN stages of the pipeline while maintaining existing paired-end functionality. This allows the pipeline to process both single-end and paired-end sequencing data using the same workflow infrastructure.

Key Changes

  • Amended generate_samplesheet.sh so it can also take in single-read data.
  • Added the run_dev_se.nf workflow, which will host single-read development until every step of the pipeline supports single-end reads; at that point we can replace run.nf with run_dev_se.nf.
  • Added a read_type parameter ("single_end" or "paired_end") in run_dev_se.config to control pipeline behavior.
  • Split the FASTP process into FASTP_SINGLE and FASTP_PAIRED variants.
  • Split the TRUNCATE_CONCAT process into _SINGLE and _PAIRED variants.
  • Edited the RAW, CLEAN, QC, and HV_SCREEN subworkflows to select either the single-end or paired-end variant of each process (see the sketch after this list).
    • HV_SCREEN was only edited because it would otherwise fail to identify the FASTP process correctly.
  • Updated bin/summarize-multiqc-single.R so it takes a read-type argument that triggers if/else branches throughout the script to adjust data processing accordingly.
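
A minimal sketch of what that per-read-type selection could look like in a DSL2 subworkflow is below; the process signatures, channel names, and error handling are assumptions for illustration, not the code merged in this PR.

```
// Hypothetical sketch: choose the FASTP variant based on params.read_type.
// FASTP_SINGLE/FASTP_PAIRED are assumed to take a reads channel plus an adapter
// file and to emit cleaned reads; the real processes may differ.
workflow CLEAN {
    take:
        reads_ch        // tuple(sample_id, reads): one FASTQ for single-end, two for paired-end
        adapter_path    // adapter file passed to fastp
    main:
        if (params.read_type == "single_end") {
            cleaned_ch = FASTP_SINGLE(reads_ch, adapter_path)
        } else if (params.read_type == "paired_end") {
            cleaned_ch = FASTP_PAIRED(reads_ch, adapter_path)
        } else {
            error "Unrecognized read_type: ${params.read_type}"
        }
    emit:
        reads = cleaned_ch
}
```

Because params.read_type is fixed for a run, the branch is resolved once when the workflow is composed rather than per sample.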

Testing

I added test directories with example data for both single-end and paired-end cases:

  • test-single-read/ - Contains single-end test data and configuration
  • test-paired-end/ - Contains paired-end test data and configuration

I validated the pipeline changes in this notebook: https://data.securebio.org/simons-notebook/posts/2024-10-24-mgs-single-read-eval/

Pushed commit messages (truncated in the PR timeline):

  • …on. Renamed Multiqc to not be confusing regarding its naming as "Single"
  • … paired-end read version of SUMMARIZE_MULTIQC
  • … pair information for single read data. Also dropped some code which combines values across read pairs, for single read data.

I dropped the renaming of tab_tsv to tab_tsv_2 for paired-end data, so I didn't have to create two different versions of the combine step at the end of the subscript:
```
 tab <- tab_json %>% inner_join(tab_tsv, by="sample")
```
  • …ead data, as I instead amended the existing script to be able to handle both single-read and paired-end data.
  • Revert: "…ad version. Renamed Multiqc to not be confusing regarding its naming as "Single"" (this reverts commit 01ea0c5).
@willbradshaw (Contributor) left a comment:

Just a few minor changes here, otherwise looks good!

Review threads (resolved) on: configs/read_type.config, subworkflows/local/loadSampleSheet/main.nf, workflows/run_dev_se.nf, workflows/run.nf
@willbradshaw (Contributor): @simonleandergrimm @harmonbhasin Since we're getting close to merging, this would be a good time to update the CHANGELOG.

@simonleandergrimm (Collaborator, Author): Aside from Will getting back to my comment and me editing the changelog, this is good to go in.

```
@@ -51,7 +37,7 @@ workflow RUN_DEV_SE {
     // Publish results
     params_str = JsonOutput.prettyPrint(JsonOutput.toJson(params))
     params_ch = Channel.of(params_str).collectFile(name: "run-params.json")
-    time_ch = Channel.of(start_time_str + "\n").collectFile(name: "time.txt")
+    time_ch = Channel.of(params.start_time_str + "\n").collectFile(name: "time.txt")
```
@willbradshaw (Contributor): How is this getting into params? Seems naively like it would be simpler to just emit it from LOAD_SAMPLESHEET.

@harmonbhasin (Collaborator): @simonleandergrimm what Will is referring to here is `params.start_time_str`. Previously this was in the main workflow like this: `start_time = new Date(); start_time_str = start_time.format("YYYY-MM-dd HH:mm:ss z (Z)")`. Either move this back to the main workflow (in which case you can use start_time_str), or keep it in LOAD_SAMPLESHEET (and make sure to emit it).
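
For reference, here is a minimal sketch of the second option (computing the timestamp inside LOAD_SAMPLESHEET and emitting it); the workflow body and names below are illustrative assumptions, not the code actually merged:

```
// Hypothetical sketch: LOAD_SAMPLESHEET records the run start time and emits it
// as a value channel alongside the parsed samplesheet.
workflow LOAD_SAMPLESHEET {
    take:
        samplesheet_path
    main:
        start_time_str = new Date().format("YYYY-MM-dd HH:mm:ss z (Z)")
        start_time_ch  = Channel.value(start_time_str)
        samplesheet_ch = Channel
            .fromPath(samplesheet_path)
            .splitCsv(header: true)
    emit:
        samplesheet = samplesheet_ch
        start_time  = start_time_ch
}

// The caller could then build time_ch from the emitted value:
// time_ch = LOAD_SAMPLESHEET.out.start_time
//     .map { it + "\n" }
//     .collectFile(name: "time.txt")
```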

@simonleandergrimm (Collaborator, Author): I know what Will meant. I changed the code; let me know if the new time_ch creation logic looks fine.

Base automatically changed from harmon_fix_gh_actions_test to dev December 17, 2024 18:27
@simonleandergrimm deleted the single-read-raw-clean branch December 18, 2024 13:22
@simonleandergrimm restored the single-read-raw-clean branch December 18, 2024 13:28
@harmonbhasin (Collaborator) left a comment:
@simonleandergrimm outside of the change Will requested, this looks good to me. Make that change and you should be good to go!


@simonleandergrimm (Collaborator, Author): @willbradshaw Let me know if this CHANGELOG edit looks good to you. If so, I will create the same for the other PR.

# v2.5.3
- Added support for single-end read processing:
    - Restructured RAW and CLEAN workflows to handle both single-end and paired-end reads
    - Added new FASTP_SINGLE process alongside existing FASTP_PAIRED
    - Added new TRUNCATE_CONCAT_SINGLE process alongside existing TRUNCATE_CONCAT_PAIRED
    - Added single-end logic to QC and RAW subworkflows
    - Created separate end-to-end test workflow for single-end processing (which will be removed once single-end processing is fully integrated)
- Improved samplesheet handling:
    - Added new LOAD_SAMPLESHEET subworkflow to centralize samplesheet processing
    - Modified samplesheet handling to support both single-end and paired-end data
    - Updated generate_samplesheet.sh to handle single-end data with --single_end flag
- Configuration updates:
    - Added read_type.config to handle single-end vs paired-end settings (set automatically based on samplesheet format)
    - Created run_dev_se.config and run_dev_se.nf for single-end and paired-end development testing (which will be removed once single-end processing is fully integrated)
    - Added single-end samplesheet to test-data

@willbradshaw (Contributor): I'm not a huge fan of those CHANGELOG changes because they imply that single-end read processing is more complete than it actually is. I would also keep all the single-end updates under one section:

# v2.5.3 (in progress)
- Added new LOAD_SAMPLESHEET subworkflow to centralize samplesheet processing 
- Began development of single-end read processing (still in progress)
    - Restructured RAW, CLEAN, and QC workflows to handle both single-end and paired-end reads
    - Added new FASTP_SINGLE and TRUNCATE_CONCAT_SINGLE processes to handle single-end reads
    - Created separate end-to-end test workflow for single-end processing (which will be removed once single-end processing is fully integrated)
    - Modified samplesheet handling to support both single-end and paired-end data
    - Updated generate_samplesheet.sh to handle single-end data with --single_end flag
    - Added read_type.config to handle single-end vs paired-end settings (set automatically based on samplesheet format)
    - Created run_dev_se.config and run_dev_se.nf for single-end development testing (which will be removed once single-end processing is fully integrated)
    - Added single-end samplesheet to test-data

@simonleandergrimm (Collaborator, Author): Edited CHANGELOG to incorporate your changes.

@harmonbhasin (Collaborator) left a comment:
Looks good to me!

@willbradshaw merged commit 3a4c2f1 into dev on Dec 20, 2024 (4 checks passed).
@willbradshaw deleted the single-read-raw-clean branch December 20, 2024 13:55