
Update nf-core modules with new nftests #420

Draft · wants to merge 4 commits into dev from update_modules_nftests

Conversation

nschcolnicov

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool, have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/differentialabundance branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nf-test test main.nf.test -profile test,docker).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@nf-core-bot
Member

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.0.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.


github-actions bot commented Jan 15, 2025

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit ae94a44

| ✅ 300 tests passed       |
| ❔   6 tests were ignored |
| ❗   4 tests had warnings |

❗ Test warnings:

  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2025-01-17 18:45:00

@nschcolnicov
Author

Need to update all of the modules from this closed PR once they are merged in the modules repo PR.

@nschcolnicov force-pushed the update_modules_nftests branch from 6c463c1 to ae94a44 on January 17, 2025 18:43
@nschcolnicov
Author

I'm looking into possible reasons why results differ between the CI tests and my local environment, specifically in the GSEA module. I saw that the default value for the random seed parameter "params.gsea_rnd_seed" is "timestamp". @pinin4fjords, why are we using this value instead of a fixed integer? Wouldn't it make results non-reproducible?

Comment on lines +79 to +115
# Rename files so that they can be properly referenced by the output channels
# Function to rename files based on the given pattern
rename_files() {
    local pattern=\$1
    local exclude_patterns=\$2
    local extension=\$3

    # Find files matching the pattern but not matching the exclusion patterns
    find . -type f -name "\$pattern" | while read -r file; do
        # Exclude files based on the provided exclusion patterns
        if ! echo "\$file" | grep -qE "\$exclude_patterns"; then
            # Rename the file by adding the prefix "gene_sets_"
            mv "\$file" "\$(dirname "\$file")/gene_sets_\$(basename "\$file")"
        fi
    done
}

# Pattern and exclusion for .tsv files
tsv_pattern="*.tsv"
tsv_exclude="gene_set_size|gsea_report|ranked_gene_list"

# Pattern and exclusion for .html files
html_pattern="*.html"
html_exclude="gsea_report|heat_map_corr_plot|index|pos_snapshot|neg_snapshot"

# Pattern and exclusion for .png files
png_pattern="*.png"
png_exclude="butterfly|enplot|global_es_histogram|gset_rnd_es_dist|heat_map|pvalues_vs_nes_plot|ranked_list_corr"

# Rename .tsv files
rename_files "\$tsv_pattern" "\$tsv_exclude" ".tsv"

# Rename .html files
rename_files "\$html_pattern" "\$html_exclude" ".html"

# Rename .png files
rename_files "\$png_pattern" "\$png_exclude" ".png"
Member

@pinin4fjords commented Jan 20, 2025


Suggested change
# Rename files so that they can be properly referenced by the output channels
# Function to rename files based on the given pattern
rename_files() {
    local pattern=\$1
    local exclude_patterns=\$2
    local extension=\$3
    # Find files matching the pattern but not matching the exclusion patterns
    find . -type f -name "\$pattern" | while read -r file; do
        # Exclude files based on the provided exclusion patterns
        if ! echo "\$file" | grep -qE "\$exclude_patterns"; then
            # Rename the file by adding the prefix "gene_sets_"
            mv "\$file" "\$(dirname "\$file")/gene_sets_\$(basename "\$file")"
        fi
    done
}
# Pattern and exclusion for .tsv files
tsv_pattern="*.tsv"
tsv_exclude="gene_set_size|gsea_report|ranked_gene_list"
# Pattern and exclusion for .html files
html_pattern="*.html"
html_exclude="gsea_report|heat_map_corr_plot|index|pos_snapshot|neg_snapshot"
# Pattern and exclusion for .png files
png_pattern="*.png"
png_exclude="butterfly|enplot|global_es_histogram|gset_rnd_es_dist|heat_map|pvalues_vs_nes_plot|ranked_list_corr"
# Rename .tsv files
rename_files "\$tsv_pattern" "\$tsv_exclude" ".tsv"
# Rename .html files
rename_files "\$html_pattern" "\$html_exclude" ".html"
# Rename .png files
rename_files "\$png_pattern" "\$png_exclude" ".png"
# Prefix gene_set files
for e in \
    tsv:gene_set_size\|gsea_report\|ranked_gene_list \
    html:gsea_report\|heat_map_corr_plot\|index\|pos_snapshot\|neg_snapshot \
    png:butterfly\|enplot\|global_es_histogram\|gset_rnd_es_dist\|heat_map\|pvalues_vs_nes_plot\|ranked_list_corr
do
    IFS=: read -r ext exclude <<<"\$e"
    find . -type f -name "*.\$ext" | grep -Ev "\$exclude" | while IFS= read -r f
    do
        mv "\$f" "\$(dirname "\$f")/gene_sets_\$(basename "\$f")"
    done
done

Could have been done a bit more concisely, along these lines? Not tested
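Stripped of the Nextflow `\$` escaping, the consolidated loop can be exercised as plain bash against a throwaway directory. This is just a sketch to check the exclusion logic; the file names below are invented for illustration and are not from the pipeline's actual output:

```shell
#!/usr/bin/env bash
# Sketch: standalone version of the consolidated rename loop, run against
# dummy files. Names like pathway_a.tsv are made up for illustration only.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"
touch pathway_a.tsv gsea_report_x.tsv summary_d.html index.html plot_b.png enplot_c.png

for e in \
    'tsv:gene_set_size|gsea_report|ranked_gene_list' \
    'html:gsea_report|heat_map_corr_plot|index|pos_snapshot|neg_snapshot' \
    'png:butterfly|enplot|global_es_histogram|gset_rnd_es_dist|heat_map|pvalues_vs_nes_plot|ranked_list_corr'
do
    IFS=: read -r ext exclude <<<"$e"
    # Prefix every matching file that is not on the exclusion list
    find . -type f -name "*.$ext" | grep -Ev "$exclude" | while IFS= read -r f; do
        mv "$f" "$(dirname "$f")/gene_sets_$(basename "$f")"
    done
done
```

Files on an exclusion list (e.g. `gsea_report_x.tsv`, `index.html`, `enplot_c.png`) keep their names; everything else gains the `gene_sets_` prefix.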

Member


Even though longer, I would argue that the previous version is a lot more readable. But not the hill I'm going to die on.

Member

@pinin4fjords left a comment


Seems to be a lot of work here, nice. But I have a couple of concerns that may need feeding back through nf-core/modules:

  • We shouldn't unconditionally set the random seed on tools that use it. That's changing the author's designed default behaviour. We can set seeds for reproducibility in testing, but that should be the limit.
  • We should only pin the main dependency. Anything else gives us headaches down the line. There is upcoming functionality in nf-core to use Conda lockfiles for reproducibility, but that is not what environment.ymls are for (unless you have a very good reason for those I'm not aware of).

@@ -2,4 +2,9 @@ channels:
   - conda-forge
   - bioconda
 dependencies:
-  - conda-forge::coreutils=8.30
+  - conda-forge::coreutils=9.5
Member


We should not be pinning this many dependencies in the environment.yml; this is not a lockfile.

@@ -2,4 +2,9 @@ channels:
   - conda-forge
   - bioconda
 dependencies:
-  - conda-forge::coreutils=8.30
+  - conda-forge::coreutils=9.5
Member


Again, too many deps pinned
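A minimal environment.yml along the lines being asked for here would pin only the main dependency and let everything else resolve freely. This is a sketch only; the bioconda package name and version shown are illustrative, not taken from this PR:

```yaml
channels:
  - conda-forge
  - bioconda
dependencies:
  # only the main tool is pinned; transitive deps resolve freely
  - bioconda::gsea=4.3.2
```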

def chip_command = chip ? "-chip $chip -collapse true" : ''
def VERSION = '4.3.2' // WARN: Version information not provided by tool on CLI. Please update this string when bumping container versions.
if (!(args ==~ /.*-rnd_seed.*/)) {args += " -rnd_seed 10"}
Member


We should not change the default behaviour of underlying software unless we have to.

Set this for tests, but not here.
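One way to scope the seed to tests only would be an ext.args override in the test config, rather than in the module script. A sketch, assuming the process selector is `GSEA_GSEA` (the selector name is an assumption, not taken from this PR):

```groovy
// conf/test.config (sketch): fix the GSEA seed for tests only,
// leaving the tool's default behaviour untouched in normal runs
process {
    withName: 'GSEA_GSEA' {
        ext.args = '-rnd_seed 10'
    }
}
```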


cat <<-END_VERSIONS > versions.yml
"${task.process}":
r-base: \$(echo \$(R --version 2>&1) | sed 's/^.*R version //; s/ .*\$//')
Member


We should actually change this, and the template, to not report the R version. This breaks every time Bioconda updates its R (which I only came to appreciate later on).

In general we should only pin the main dependency, and only report the version of that dependency.

@grst
Member

grst commented Jan 20, 2025

We shouldn't unconditionally set the random seed on tools that use it. That's changing the author's designed default behaviour. We can set seeds for reproducibility in testing, but that should be the limit.

I would argue that in most of the cases it's not "designed default behavior" but rather neglect. If we care about reproducibility of the pipeline we should set the seed if a tool doesn't provide stable output by itself.

Agree with the rest.

@pinin4fjords
Member

I would argue that in most of the cases it's not "designed default behavior" but rather neglect. If we care about reproducibility of the pipeline we should set the seed if a tool doesn't provide stable output by itself.

Sorry, I really strongly disagree with that. The lack of reproducibility is 'real' in tools like this. Setting the seed makes things reproducible, but it's only an illusion. The fact is that there's a simulation going on which means you should get a slightly different answer each time. For example, if a user gets a 'significant' result one time that goes away next time they run, that's a genuine reflection on the reliability of the result, not a consequence of neglect.

What if someone actually wanted to examine the impact of the simulations on reliability? It's not for us to fix the seed and make that harder.

In short, this is a hill I will die on ;-)

@nschcolnicov
Author

I would argue that in most of the cases it's not "designed default behavior" but rather neglect. If we care about reproducibility of the pipeline we should set the seed if a tool doesn't provide stable output by itself.

Sorry, I really strongly disagree with that. The lack of reproducibility is 'real' in tools like this. Setting the seed makes things reproducible, but it's only an illusion. The fact is that there's a simulation going on which means you should get a slightly different answer each time. For example, if a user gets a 'significant' result one time that goes away next time they run, that's a genuine reflection on the reliability of the result, not a consequence of neglect.

What if someone actually wanted to examine the impact of the simulations on reliability? It's not for us to fix the seed and make that harder.

In short, this is a hill I will die on ;-)

Thank you for your comments @grst @pinin4fjords. Regarding the seed debate, it is worth noting that the behavior of the module within the pipeline hasn't changed. While it is true that, in the context of the module, this line will set a default seed whenever the user doesn't set a seed value through ext.args, in the context of the pipeline nothing has changed, since the random seed value is already set by default in modules.config via gsea_rnd_seed to the value timestamp.
This issue should be created in the modules repository, and we can update the pipeline once that has been resolved.

Following up on this debate, I think that having a random seed set by default whenever the user disregards this argument makes sense and will ensure reproducibility, which is the behavior I think most people are looking for. If they are looking for more flexibility, the seed value can easily be modified by the user. In any case, I think it would be best to leave the module as is and update the documentation.

Let me know your thoughts!

@pinin4fjords
Member

Let me know your thoughts!

See above ;-). As I say, I strongly think we shouldn't be setting random seeds as default.

Setting the seed does not make this tool reproducible in a meaningful sense. To be a little bit reductive, it's like asking someone to throw a die but telling them to throw a 6. As such, it's appropriate to set in testing but not as a matter of course. It's actively misleading if people keep getting the same answer, even with small numbers of simulations, purely because they're setting a random seed. It's just wrong.

It's not about flexibility: if people are expecting consistent results from a method based on random simulations, then I'm afraid they don't understand how that method works. A user doesn't expect to have to do something to an nf-core module to restore what should be its default behaviour.

Could you just quickly take that out at the module level please? Then we can change it here before merge.

@grst
Member

grst commented Jan 21, 2025

This discussion probably doesn't lead anywhere, but I would still like to throw in that for me, this is about provenance and documentation of how one arrives at a certain result.

Say I publish a table with DE results and write that I used nextflow run nf-core/differentialabundance -r X.X.X with a bunch of parameters. If there's a fixed seed, someone can run this themselves and get exactly the same table. Without the seed, they will get a different result, so I could just as well have made the table up (or tweaked it a little bit to make my favorite gene significant).

EDIT: A solution that would satisfy both would be to choose a random seed and report it. But this would add additional complexity to the pipeline code.

@pinin4fjords
Member

pinin4fjords commented Jan 21, 2025

This discussion probably doesn't lead anywhere, but I would still like to throw in that for me, this is about provenance and documentation of how one arrives at a certain result.

Say I publish a table with DE results and write that I used nextflow run nf-core/differentialabundance -r X.X.X with a bunch of parameters. If there's a fixed seed, someone can run this themselves and get exactly the same table. Without the seed, they will get a different result, so I could just as well have made the table up (or tweaked it a little bit to make my favorite gene significant).

Yep, you absolutely should set the random seed in that situation (and state the seed set). But you should do it actively, and it shouldn't be the default.

@grst
Member

grst commented Jan 21, 2025

Yep, you absolutely should set the random seed in that situation (and state the seed set). But you should do it actively, and it shouldn't be the default.

So I propose that the pipeline gets a global seed parameter that, if enabled, sets a seed to all applicable methods. I can open a separate ticket for that.
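Such an opt-in global seed could look something like the following hypothetical sketch; the parameter name and process selector are invented, not part of this PR:

```groovy
// nextflow.config (sketch): an opt-in global seed, unset by default
params.seed = null

process {
    withName: 'GSEA_GSEA' {
        // only pass a seed when the user explicitly asked for one
        ext.args = { params.seed != null ? "-rnd_seed ${params.seed}" : '' }
    }
}
```

With `params.seed` left at null, the tool's own default behaviour is preserved; setting `--seed 10` on the command line would pin the seed everywhere the selector applies.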
