
Adding single-read functionality to RAW and CLEAN #80

Merged: 108 commits into dev from single-read-raw-clean (Dec 20, 2024)

Conversation

@simonleandergrimm (Collaborator) commented on Oct 28, 2024

This PR adds support for single-read (single-end) sequencing data to the RAW and CLEAN stages of the pipeline while maintaining existing paired-end functionality. This allows the pipeline to process both single-end and paired-end sequencing data using the same workflow infrastructure.

Key Changes

  • Amended generate_samplesheet.sh so it can also take in single-read data.
  • Added the run_dev_se.nf workflow, which will host single-read development until every step of the pipeline supports single-end reads; at that point we can replace run.nf with run_dev_se.nf.
  • Added a read_type parameter ("single_end" or "paired_end") in run_dev_se.config to control pipeline behavior.
  • Split the FASTP process into FASTP_SINGLE and FASTP_PAIRED variants.
  • Split the TRUNCATE_CONCAT process into _SINGLE and _PAIRED variants.
  • Edited the RAW, CLEAN, QC, and HV_SCREEN subworkflows to select either the single-end or paired-end variant of each process (see the sketch after this list).
    • HV_SCREEN was only edited because it would otherwise fail to identify the FASTP process correctly.
  • Updated bin/summarize-multiqc-single.R so it takes a read-type argument that triggers if/else branches throughout the script to adjust data processing accordingly.
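
A minimal sketch of what that per-read-type selection could look like in a DSL2 subworkflow is below; the process signatures, channel names, and error handling are assumptions for illustration, not the code merged in this PR.

```
// Hypothetical sketch: choose the FASTP variant based on params.read_type.
// FASTP_SINGLE/FASTP_PAIRED are assumed to take a reads channel plus an adapter
// file and to emit cleaned reads; the real processes may differ.
workflow CLEAN {
    take:
        reads_ch        // tuple(sample_id, reads): one FASTQ for single-end, two for paired-end
        adapter_path    // adapter file passed to fastp
    main:
        if (params.read_type == "single_end") {
            cleaned_ch = FASTP_SINGLE(reads_ch, adapter_path)
        } else if (params.read_type == "paired_end") {
            cleaned_ch = FASTP_PAIRED(reads_ch, adapter_path)
        } else {
            error "Unrecognized read_type: ${params.read_type}"
        }
    emit:
        reads = cleaned_ch
}
```

Because params.read_type is fixed for a run, the branch is resolved once when the workflow is composed rather than per sample.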

Testing

I added test directories with example data for both single-end and paired-end cases:

  • test-single-read/ - Contains single-end test data and configuration
  • test-paired-end/ - Contains paired-end test data and configuration

I validated the pipeline changes in this notebook: https://data.securebio.org/simons-notebook/posts/2024-10-24-mgs-single-read-eval/

Pushed commit messages (truncated in the PR timeline):

  • …on. Renamed Multiqc to not be confusing regarding its naming as "Single"
  • … paired-end read version of SUMMARIZE_MULTIQC
  • … pair information for single read data. Also dropped some code which combines values across read pairs, for single read data.

I dropped the renaming of tab_tsv to tab_tsv_2 for paired-end data, so I didn't have to create two different versions of the combine step at the end of the subscript:
```
 tab <- tab_json %>% inner_join(tab_tsv, by="sample")
```
  • …ead data, as I instead amended the existing script to be able to handle both single-read and paired-end data.
  • Revert: "…ad version. Renamed Multiqc to not be confusing regarding its naming as "Single"" (this reverts commit 01ea0c5).
@willbradshaw (Contributor) left a comment:

Just a few minor changes here, otherwise looks good!

Review threads (resolved) on: configs/read_type.config, subworkflows/local/loadSampleSheet/main.nf, workflows/run_dev_se.nf, workflows/run.nf
@willbradshaw (Contributor): @simonleandergrimm @harmonbhasin Since we're getting close to merging, this would be a good time to update the CHANGELOG.

@simonleandergrimm (Collaborator, Author): Aside from Will getting back to my comment and me editing the changelog, this is good to go in.

```
@@ -51,7 +37,7 @@ workflow RUN_DEV_SE {
     // Publish results
     params_str = JsonOutput.prettyPrint(JsonOutput.toJson(params))
     params_ch = Channel.of(params_str).collectFile(name: "run-params.json")
-    time_ch = Channel.of(start_time_str + "\n").collectFile(name: "time.txt")
+    time_ch = Channel.of(params.start_time_str + "\n").collectFile(name: "time.txt")
```
@willbradshaw (Contributor): How is this getting into params? Seems naively like it would be simpler to just emit it from LOAD_SAMPLESHEET.

@harmonbhasin (Collaborator): @simonleandergrimm what Will is referring to here is `params.start_time_str`. Previously this was in the main workflow like this: `start_time = new Date(); start_time_str = start_time.format("YYYY-MM-dd HH:mm:ss z (Z)")`. Either move this back to the main workflow (in which case you can use start_time_str), or keep it in LOAD_SAMPLESHEET (and make sure to emit it).
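
For reference, here is a minimal sketch of the second option (computing the timestamp inside LOAD_SAMPLESHEET and emitting it); the workflow body and names below are illustrative assumptions, not the code actually merged:

```
// Hypothetical sketch: LOAD_SAMPLESHEET records the run start time and emits it
// as a value channel alongside the parsed samplesheet.
workflow LOAD_SAMPLESHEET {
    take:
        samplesheet_path
    main:
        start_time_str = new Date().format("YYYY-MM-dd HH:mm:ss z (Z)")
        start_time_ch  = Channel.value(start_time_str)
        samplesheet_ch = Channel
            .fromPath(samplesheet_path)
            .splitCsv(header: true)
    emit:
        samplesheet = samplesheet_ch
        start_time  = start_time_ch
}

// The caller could then build time_ch from the emitted value:
// time_ch = LOAD_SAMPLESHEET.out.start_time
//     .map { it + "\n" }
//     .collectFile(name: "time.txt")
```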

@simonleandergrimm (Collaborator, Author): I know what Will meant. I changed the code; let me know if the new time_ch creation logic looks fine.

Base automatically changed from harmon_fix_gh_actions_test to dev December 17, 2024 18:27
@simonleandergrimm deleted the single-read-raw-clean branch December 18, 2024 13:22
@simonleandergrimm restored the single-read-raw-clean branch December 18, 2024 13:28
@harmonbhasin (Collaborator) left a comment:
@simonleandergrimm outside of the change Will requested, this looks good to me. Make that change and you should be good to go!


@simonleandergrimm (Collaborator, Author): @willbradshaw Let me know if this CHANGELOG edit looks good to you. If so, I will create the same for the other PR.

# v2.5.3
- Added support for single-end read processing:
    - Restructured RAW and CLEAN workflows to handle both single-end and paired-end reads
    - Added new FASTP_SINGLE process alongside existing FASTP_PAIRED
    - Added new TRUNCATE_CONCAT_SINGLE process alongside existing TRUNCATE_CONCAT_PAIRED
    - Added single-end logic to QC and RAW subworkflows
    - Created separate end-to-end test workflow for single-end processing (which will be removed once single-end processing is fully integrated)
- Improved samplesheet handling:
    - Added new LOAD_SAMPLESHEET subworkflow to centralize samplesheet processing
    - Modified samplesheet handling to support both single-end and paired-end data
    - Updated generate_samplesheet.sh to handle single-end data with --single_end flag
- Configuration updates:
    - Added read_type.config to handle single-end vs paired-end settings (set automatically based on samplesheet format)
    - Created run_dev_se.config and run_dev_se.nf for single-end and paired-end development testing (which will be removed once single-end processing is fully integrated)
    - Added single-end samplesheet to test-data

@willbradshaw (Contributor): I'm not a huge fan of those CHANGELOG changes because they imply that single-end read processing is more complete than it actually is. I would also keep all the single-end updates under one section:

# v2.5.3 (in progress)
- Added new LOAD_SAMPLESHEET subworkflow to centralize samplesheet processing 
- Began development of single-end read processing (still in progress)
    - Restructured RAW, CLEAN, and QC workflows to handle both single-end and paired-end reads
    - Added new FASTP_SINGLE and TRUNCATE_CONCAT_SINGLE processes to handle single-end reads
    - Created separate end-to-end test workflow for single-end processing (which will be removed once single-end processing is fully integrated)
    - Modified samplesheet handling to support both single-end and paired-end data
    - Updated generate_samplesheet.sh to handle single-end data with --single_end flag
    - Added read_type.config to handle single-end vs paired-end settings (set automatically based on samplesheet format)
    - Created run_dev_se.config and run_dev_se.nf for single-end development testing (which will be removed once single-end processing is fully integrated)
    - Added single-end samplesheet to test-data

@simonleandergrimm (Collaborator, Author): Edited CHANGELOG to incorporate your changes.

@harmonbhasin (Collaborator) left a comment:
Looks good to me!

@willbradshaw merged commit 3a4c2f1 into dev on Dec 20, 2024 (4 checks passed).
@willbradshaw deleted the single-read-raw-clean branch December 20, 2024 13:55