-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update nf-core modules with new nftests #420
base: dev
Are you sure you want to change the base?
Conversation
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.0.2. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
|
38189f9
to
051ee08
Compare
Need to update all of the modules from this closed PR, once they get merged in the modules repo PR |
6c463c1
to
ae94a44
Compare
I'm looking into possible reasons why results differ between running it in CI tests and my local environment, specifically in the GSEA module. I saw that the default value for the random seed parameter "params.gsea_rnd_seed" is "timestamp", @pinin4fjords why are we using this value instead of a fixed integer? Wouldn't it make results non-reproducible? |
# Rename files so that they can be properly referenced by the output channels | ||
# Function to rename files based on the given pattern | ||
rename_files() { | ||
local pattern=\$1 | ||
local exclude_patterns=\$2 | ||
local extension=\$3 | ||
|
||
# Find files matching the pattern but not matching the exclusion patterns | ||
find . -type f -name "\$pattern" | while read -r file; do | ||
# Exclude files based on the provided exclusion patterns | ||
if ! echo "\$file" | grep -qE "\$exclude_patterns"; then | ||
# Rename the file by adding the prefix "gene_sets_" | ||
mv "\$file" "\$(dirname "\$file")/gene_sets_\$(basename "\$file")" | ||
fi | ||
done | ||
} | ||
|
||
# Pattern and exclusion for .tsv files | ||
tsv_pattern="*.tsv" | ||
tsv_exclude="gene_set_size|gsea_report|ranked_gene_list" | ||
|
||
# Pattern and exclusion for .html files | ||
html_pattern="*.html" | ||
html_exclude="gsea_report|heat_map_corr_plot|index|pos_snapshot|neg_snapshot" | ||
|
||
# Pattern and exclusion for .png files | ||
png_pattern="*.png" | ||
png_exclude="butterfly|enplot|global_es_histogram|gset_rnd_es_dist|heat_map|pvalues_vs_nes_plot|ranked_list_corr" | ||
|
||
# Rename .tsv files | ||
rename_files "\$tsv_pattern" "\$tsv_exclude" ".tsv" | ||
|
||
# Rename .html files | ||
rename_files "\$html_pattern" "\$html_exclude" ".html" | ||
|
||
# Rename .png files | ||
rename_files "\$png_pattern" "\$png_exclude" ".png" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Rename files so that they can be properly referenced by the output channels | |
# Function to rename files based on the given pattern | |
rename_files() { | |
local pattern=\$1 | |
local exclude_patterns=\$2 | |
local extension=\$3 | |
# Find files matching the pattern but not matching the exclusion patterns | |
find . -type f -name "\$pattern" | while read -r file; do | |
# Exclude files based on the provided exclusion patterns | |
if ! echo "\$file" | grep -qE "\$exclude_patterns"; then | |
# Rename the file by adding the prefix "gene_sets_" | |
mv "\$file" "\$(dirname "\$file")/gene_sets_\$(basename "\$file")" | |
fi | |
done | |
} | |
# Pattern and exclusion for .tsv files | |
tsv_pattern="*.tsv" | |
tsv_exclude="gene_set_size|gsea_report|ranked_gene_list" | |
# Pattern and exclusion for .html files | |
html_pattern="*.html" | |
html_exclude="gsea_report|heat_map_corr_plot|index|pos_snapshot|neg_snapshot" | |
# Pattern and exclusion for .png files | |
png_pattern="*.png" | |
png_exclude="butterfly|enplot|global_es_histogram|gset_rnd_es_dist|heat_map|pvalues_vs_nes_plot|ranked_list_corr" | |
# Rename .tsv files | |
rename_files "\$tsv_pattern" "\$tsv_exclude" ".tsv" | |
# Rename .html files | |
rename_files "\$html_pattern" "\$html_exclude" ".html" | |
# Rename .png files | |
rename_files "\$png_pattern" "\$png_exclude" ".png" | |
# Prefix gene_set files | |
for e in \ | |
tsv:gene_set_size\|gsea_report\|ranked_gene_list \ | |
html:gsea_report\|heat_map_corr_plot\|index\|pos_snapshot\|neg_snapshot \ | |
png:butterfly\|enplot\|global_es_histogram\|gset_rnd_es_dist\|heat_map\|pvalues_vs_nes_plot\|ranked_list_corr | |
do | |
IFS=: read -r ext exclude <<<"\$e" | |
find . -type f -name "*.\$ext" | grep -Ev "\$exclude" | while IFS= read -r f | |
do | |
mv "$\f" "\$(dirname "\$f")/gene_sets_\$(basename "$\f")" | |
done | |
done |
Could have been done a bit more concisely, along these lines? Not tested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though longer, I would argue that the previous version is a lot more readable. But not the hill I'm going to die on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to be a lot of work here, nice. But I have a couple of concerns that may need feeding back through nf-core/modules:
- We shouldn't unconditionally set the random seed on tools that use it. That's changing the author's designed default behaviour. We can set seeds for reproducibility in testing, but that should be the limit.
- We should only pin the main dependency. Anything else gives us headaches down the line. There is upcoming functionality in nf-core to use Conda lockfiles for reproducibility, but that is not what environment.ymls are for (unless you have a very good reason for those I'm not aware of).
@@ -2,4 +2,9 @@ channels: | |||
- conda-forge | |||
- bioconda | |||
dependencies: | |||
- conda-forge::coreutils=8.30 | |||
- conda-forge::coreutils=9.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not be pinning thise many dependencies in the environment.yml, this is not a lockfile.
@@ -2,4 +2,9 @@ channels: | |||
- conda-forge | |||
- bioconda | |||
dependencies: | |||
- conda-forge::coreutils=8.30 | |||
- conda-forge::coreutils=9.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, too many deps pinned
def chip_command = chip ? "-chip $chip -collapse true" : '' | ||
def VERSION = '4.3.2' // WARN: Version information not provided by tool on CLI. Please update this string when bumping container versions. | ||
if (!(args ==~ /.*-rnd_seed.*/)) {args += " -rnd_seed 10"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not change the default behaviour of underlying software unless we have to.
Set this for tests, but not here.
|
||
cat <<-END_VERSIONS > versions.yml | ||
"${task.process}": | ||
r-base: \$(echo \$(R --version 2>&1) | sed 's/^.*R version //; s/ .*\$//') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should actually change this, and the template, to not report the r version. This breaks every time Bioconda updates its R (which I only came to appreciate later on).
In general we should only pin the main dependency, and only report the version of that dependency.
I would argue that in most of the cases it's not "designed default behavior" but rather neglect. If we care about reproducibility of the pipeline we should set the seed if a tool doesn't provide stable output by itself. Agree with the rest. |
Sorry, I really strongly disagree with that. The lack of reproducibility is 'real' in tools like this. Setting the seed makes things reproducible, but it's only an illusion. The fact is that there's a simulation going on which means you should get a slightly different answer each time. For example, if a user gets a 'significant' result one time that goes away next time they run, that's a genuine reflection on the reliability of the result, not a consequence of neglect. What if someone actually wanted to examine the impact of the simulations on reliability? It's not for us to fix the seed and make that harder. In short, this is a hill I will die on ;-) |
Thank you for your comments @grst @pinin4fjords. Regarding the seed debate, I think it is worth noting that the behavior of the module within the pipeline hasn't changed. While it is true that in the context of the module, this line will set a default seed whenever the doesn't set a seed value through the Following up on this debate, I think that having a random seed being set by default whenever the user disregards this argument makes sense, and will ensure reproducibility, which is the behavior that I think most people are looking for. If they are looking for more flexibility, the seed value can easily be modified by the user. In any case, I think it would be best to leave the module as is, and update the documentation Let me know your thoughts! |
See above ;-). As I say, I strongly think we shouldn't be setting random seeds as default. Setting the seeds does not make this tool reproducible in a meaningful sense. To be a little bit reductive, it's like you're asking someone to throw a dice but telling them to throw a 6. As such it's appropriate to set in testing but not as a matter of course. It's actively misleading if people keep getting the same answer, even with small numbers of simulations, purely because they're setting a random seed. It's just wrong. It's not about flexibility- if people are expecting consistent results from a method based on random simulations, then I'm afraid they don't understand how that method works. A user doesn't expect to have to do something to an nf-core module to restore what should be its default behaviour. Could you just quickly take that out at the module level please? Then we can change it here before merge. |
This discussion probably doesn't lead anywhere, but I would still like to throw in that for me, this is about provenance and documentation of how one arrives at a certain result. If I publish a table with DE results and write that I used EDIT: A solution that would satistfy both would be to choose a random seed and report it. But this would add additional complexity to the pipeline code. |
Yep, you absolutely should set the random seed in that situation (and state the seed set). But you should do it actively, and it shouldn't be the default. |
So I propose that the pipeline gets a global |
PR checklist
nf-core lint
).nf-test test main.nf.test -profile test,docker
).nextflow run . -profile debug,test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).