Add Workflow Run RO-crate format #39

Open
wants to merge 53 commits into
base: master
from

Conversation

famosab

@famosab famosab commented Dec 18, 2024

We worked on a first version of the plugin which is able to render valid RO-crates for any workflow run.

Happy to receive feedback to get this finished up :)

Continues #19 and #33.

famosab and others added 16 commits November 18, 2024 15:45
add encodingFormat for nextflow.config
feat: add wrroc to valid formats
* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>
* feat: add README to create

* feat: ignore vscode

* fix: make getIntermediateOutputFiles work again (#18) (#19)

* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>

* feat: add README to json

* feat: check first if readme exists

* Add readme to hasPart

Signed-off-by: fbartusch <[email protected]>

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>
* Add getEncodingFormat function that return the encoding format for a file
* handle YAML files manually

Signed-off-by: fbartusch <[email protected]>
* main workflow complies (more or less) with ComputationalWorkflow profile version 1.0
  (if set in manifest add license, url, version, description, ...)
* Correct value vor ActionStatus

Signed-off-by: fbartusch <[email protected]>
* start with metaYaml imports

* merge dev-wrroc into metaYaml (#23)

* add encodingFormat for nextflow.config

* add encodingFormat for main.nf

* feat: add wrroc to valid formats

* fix: make getIntermediateOutputFiles work again (#18)

* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>

* feat: add README to crate (#14)

* feat: add README to create

* feat: ignore vscode

* fix: make getIntermediateOutputFiles work again (#18) (#19)

* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>

* feat: add README to json

* feat: check first if readme exists

* Add readme to hasPart

Signed-off-by: fbartusch <[email protected]>

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>

* WIP

* only add from meta if meta exists

* remove usage from ext args

* add module name to id

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>

@famosab
Author

famosab commented Dec 18, 2024

ro-crate-metadata.json
This was created using the plugin and this pipeline: https://github.com/famosab/wrrocmetatest

@bentsherman bentsherman changed the base branch from master to workflow-run-crate December 18, 2024 15:46
@bentsherman bentsherman changed the base branch from workflow-run-crate to master December 18, 2024 15:47
@fbartusch

Found some new problems with copying input files to the crate:

  1. We iterate over the params: params.each { name, value ->
    In nf-prov/demo I have a parameter:

    name: genome
    value: null
    

    This crashes. Checking for null values fixes this problem:

    params.each { name, value ->
      if (!value)
        return
      [...]
    
  2. Ignoring files with http(s) and s3 URLs works, but I started running tests on our HPC cluster, where an NFS mount provides igenomes, and the plugin tries to copy the whole igenomes directory into the crate. I don't think that is feasible.

  3. The validator complains about this:

    "message": "Every Data Entity Directory URI MUST end with `/`",
    "violatingPropertyValue": "file:///home/fbartusch/nf-core/results/demo/1.0.0/multiqc/multiqc_plots"
    

    The RO-Crate specification states that the trailing / is a SHOULD criterion, so the validator may be wrong here. I opened an issue for the validator.
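
Whichever way the SHOULD/MUST question is settled, the third point can be sidestepped by normalizing directory identifiers before they are written. A minimal Python sketch of that idea follows; the plugin itself is written in Groovy, and `directory_id` is a hypothetical helper, not part of nf-prov:

```python
def directory_id(uri: str) -> str:
    """Return a directory identifier that ends with '/'."""
    # Leave identifiers that already end with a slash untouched.
    return uri if uri.endswith("/") else uri + "/"


print(directory_id(
    "file:///home/fbartusch/nf-core/results/demo/1.0.0/multiqc/multiqc_plots"
))
```

Applying this only to entities of type Dataset keeps File identifiers unchanged.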

@bentsherman
Member

@fbartusch thanks for the comments, I have fixed those first two issues

@simleo is there another entity type that might be appropriate for these intermediate outputs? I tried a few based on what is allowed for CreateAction object/result and the validator accepts if they are of type CreativeWork

@simleo

simleo commented Jan 20, 2025

> @simleo is there another entity type that might be appropriate for these intermediate outputs? I tried a few based on what is allowed for CreateAction object/result and the validator accepts if they are of type CreativeWork

I think CreativeWork is OK.

> I suppose we could also remove the formal parameters whose values are null.

Agree.
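
Dropping null-valued formal parameters could look roughly like the Python sketch below. It is an illustration only: the plugin is written in Groovy, and `formal_parameters` is a hypothetical name. The `#param/<name>` identifier pattern is taken from the crates shown in this thread. Note that the `if (!value)` check proposed earlier relies on Groovy truthiness, which also treats `false`, `0` and empty strings as false, so an explicit null test is used here:

```python
def formal_parameters(params: dict) -> list[dict]:
    """Build FormalParameter entities, skipping parameters whose value is None."""
    entities = []
    for name, value in params.items():
        # Test for None explicitly: a plain truthiness check would also drop
        # legitimate values such as False, 0 or an empty string.
        if value is None:
            continue
        entities.append({
            "@id": f"#param/{name}",
            "@type": "FormalParameter",
            "name": name,
        })
    return entities


params = {"genome": None, "save_reference": False, "outdir": "results"}
print([e["@id"] for e in formal_parameters(params)])
```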

@simleo

simleo commented Jan 20, 2025

> • use git URL + commit hash instead of copying the pipeline scripts

I ran Famke's pipeline again with the latest version of the plugin and I could not find these items anywhere in the RO-Crate metadata.

@elichad

elichad commented Jan 20, 2025

@bentsherman I spoke to @stain (co-lead of RO-Crate) about this.

First, he clarified that entities of type File are not necessarily data entities - so that's a validator bug as @simleo already identified.

Second, the Five Safes Crate profile has some guidance on representing files that are referenced but not actually in the crate (focused on the sensitive data reasons). This suggests using type DigitalDocument. I think I overlooked this recommendation when making the crate in my previous comment.

@bentsherman
Member

> I ran Famke's pipeline again with the latest version of the plugin and I could not find these items anywhere in the RO-Crate metadata.

@simleo did you download the pipeline manually and run it from the local path? I think their README suggests this, but the best practice here is to run directly from the canonical repo:

nextflow run famosab/wrrocmetatest # ...

See my comments here. This is the only way for Nextflow to know the repo URL and commit hash, and add it to the crate.

@bentsherman
Member

@elichad thanks for the suggestion. Since CreativeWork is valid and DigitalDocument seems to be recommended mainly for sensitive data, I'm inclined to leave it as CreativeWork for now, and perhaps go back to File and Dataset once the validator bug is fixed.

@bentsherman
Member

@famosab @fbartusch I think we are just about ready to merge. I have tested with your test pipeline. If you'd like to give it one more round of testing with other pipelines, please do; if everything looks good from your side, I think we can merge in the next few days.

@simleo

simleo commented Jan 22, 2025

@bentsherman thanks for the tip: I ran the workflow as nextflow run famosab/wrrocmetatest ... and now the ComputationalWorkflow has codeRepository, version and url. Since the license is not set in nextflow.config, however, the RO-Crate metadata ends up having a data entity with an @id of null:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        ...
        {
            "@id": null
        },
        ...
    ],
    ...
},
...
{
    "@id": null,
    "@type": "CreativeWork"
}

In such cases, no entity at all should be added to the crate instead.
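
The guard described here might be sketched as follows in Python. This is not the plugin's Groovy code; `add_license` and the way the license is attached to the root entity are assumptions made for illustration:

```python
def add_license(root: dict, graph: list, license_id) -> None:
    """Attach a license to the root entity only when one is configured."""
    if not license_id:
        # Nothing set in the manifest: emit no reference and no entity,
        # so no entity with "@id": null ends up in the crate.
        return
    root["license"] = {"@id": license_id}
    graph.append({"@id": license_id, "@type": "CreativeWork"})


root = {"@id": "./", "@type": "Dataset"}
graph = [root]
add_license(root, graph, None)   # license missing: graph is left unchanged
print(len(graph))
```

The same guard applies wherever an optional manifest field is turned into an entity reference.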

@fbartusch

I'm currently running tests against nf-core pipelines with some scripts I wrote for testing plugins.
It will take some time; I hope to have some results this afternoon.

@simleo

simleo commented Jan 23, 2025

I've tried again (with the current version of the plugin) to run Famke's pipeline locally:

nextflow run main.nf -profile docker --input testsheet.csv --outdir results -c testdata.config

with local files in testsheet.csv:

sample,fastq_1,fastq_2
test,/home/simleo/repos/wrrocmetatest/read1.fq.gz,/home/simleo/repos/wrrocmetatest/read2.fq.gz

The resulting crate is not even readable by ro-crate-py because it has absolute ids in it (/home/simleo/repos/wrrocmetatest/read{1,2}.fq.gz). This violates the spec in File Data Entity:

> @id MUST be either a URI Path relative to the RO Crate root, or an absolute URI

I see three possible ways to fix this:

  1. Copy the files into the crate and add them with their relative path
  2. Prepend file:// to the ids, making them absolute URIs
  3. Add them as CreativeWork as done for intermediates

However, with options 2 and 3 the crate consumer has no way to reconstruct the two input files.
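
For illustration, option 2 amounts to something like the following Python sketch (the plugin is Groovy; `to_file_uri` is a hypothetical helper and only handles POSIX-style absolute paths):

```python
from urllib.parse import quote


def to_file_uri(path: str) -> str:
    """Turn an absolute POSIX path into an absolute file:// URI."""
    if not path.startswith("/"):
        raise ValueError(f"not an absolute path: {path!r}")
    # quote() percent-encodes characters such as spaces but keeps '/' intact.
    return "file://" + quote(path)


print(to_file_uri("/home/simleo/repos/wrrocmetatest/read1.fq.gz"))
```

The resulting identifier satisfies the "absolute URI" branch of the spec, but it still points at a location that exists only on the machine where the run took place.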

@fbartusch

Most of the nf-core pipelines are currently failing with the plugin :(
nextflow.log says:

Jan-23 17:09:49.075 [main] DEBUG nextflow.Session - Failed to invoke observer completion handler: nextflow.prov.ProvObserver@6965f207
java.lang.NullPointerException: null
        at nextflow.prov.WrrocRenderer.getModuleId(WrrocRenderer.groovy:774)
        at nextflow.prov.WrrocRenderer.access$2(WrrocRenderer.groovy)
        at nextflow.prov.WrrocRenderer$_render_closure15.doCall(WrrocRenderer.groovy:331)
        at nextflow.prov.WrrocRenderer$_render_closure15.call(WrrocRenderer.groovy)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3661)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3646)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3692)
        at nextflow.prov.WrrocRenderer.render(WrrocRenderer.groovy:325)
        at nextflow.prov.ProvObserver.onFlowComplete(ProvObserver.groovy:121)
        at nextflow.Session.notifyFlowComplete(Session.groovy:1155)
        at nextflow.Session.shutdown0(Session.groovy:749)
        at nextflow.Session.destroy(Session.groovy:694)
        at nextflow.script.ScriptRunner.shutdown(ScriptRunner.groovy:260)
        at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:146)
        at nextflow.cli.CmdRun.run(CmdRun.groovy:376)
        at nextflow.cli.Launcher.run(Launcher.groovy:503)
        at nextflow.cli.Launcher.main(Launcher.groovy:658)

I'm now checking why it fails, using the nf-core bamtofastq pipeline, as it is the fastest (and simplest?) pipeline that fails. That should make debugging easier.

@bentsherman
Member

@simleo this is why I recommend using the original HTTP URLs:

sample,fastq_1,fastq_2
test,https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/denbi-mg-course/read1.fq.gz,https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/denbi-mg-course/read2.fq.gz

But it is unavoidable that some users will be using local input files and we'll need to handle that gracefully. As a first iteration I'm inclined to warn about such input files and maybe make them CreativeWork if they aren't included in the crate. I will try a few things

@bentsherman
Member

@simleo I ended up taking the absolute URI approach. That made the resulting crate valid. We can encourage the use of remote URIs as a best practice.

In summary, only input files that are (1) specified directly by a param, (2) local, and (3) not a directory, will be copied into the crate. All of these restrictions are designed to prevent explosive data transfers from directories, remote data, and file globs.
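
As a rough illustration of these three restrictions, here is a Python sketch of such a predicate. It is not the plugin's Groovy implementation; `should_copy_into_crate`, the set of glob characters, and the treatment of every non-`file` URI scheme as remote are all assumptions:

```python
import os
from urllib.parse import urlparse

GLOB_CHARS = set("*?[]{}")  # assumed set of glob metacharacters


def should_copy_into_crate(value) -> bool:
    """Decide whether a parameter value names a local file worth copying."""
    if not isinstance(value, str) or not value:
        return False
    parsed = urlparse(value)
    if parsed.scheme not in ("", "file"):
        return False  # remote input (http, https, s3, ...)
    if GLOB_CHARS & set(value):
        return False  # glob pattern, could match arbitrarily many files
    path = parsed.path if parsed.scheme == "file" else value
    # isfile() is False for directories and for paths that do not exist.
    return os.path.isfile(path)


print(should_copy_into_crate("https://example.org/read1.fq.gz"))
```

Only values passed directly as parameters would be run through this check; files listed inside a samplesheet are never considered.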

@fbartusch I ran bamtofastq with test profile and it succeeded. Let me know how the rest of your tests go with the latest revision

@fbartusch

@bentsherman bamtofastq indeed looks good and the validator is happy. I'm now running the tests for the other nf-core pipelines.

@fbartusch

@bentsherman Only one pipeline out of 42 I ran fails because of the plugin: demultiplex revision 1.5.1

Jan-30 10:16:18.415 [main] DEBUG nextflow.Session - Failed to invoke observer completion handler: nextflow.prov.ProvObserver@6e65fc8b
java.lang.NullPointerException: null
        at nextflow.prov.WrrocRenderer.getTaskOutputName(WrrocRenderer.groovy:870)
        at nextflow.prov.WrrocRenderer.access$9(WrrocRenderer.groovy)
        at nextflow.prov.WrrocRenderer$_render_closure22.doCall(WrrocRenderer.groovy:442)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
        at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:279)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
        at groovy.lang.Closure.call(Closure.java:433)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.callClosureForMapEntry(DefaultGroovyMethods.java:6061)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3985)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:4002)
        at nextflow.prov.WrrocRenderer.render(WrrocRenderer.groovy:439)
        at nextflow.prov.ProvObserver.onFlowComplete(ProvObserver.groovy:121)
        at nextflow.Session.notifyFlowComplete(Session.groovy:1155)
        at nextflow.Session.shutdown0(Session.groovy:749)
        at nextflow.Session.destroy(Session.groovy:694)
        at nextflow.script.ScriptRunner.shutdown(ScriptRunner.groovy:260)
        at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:146)
        at nextflow.cli.CmdRun.run(CmdRun.groovy:376)
        at nextflow.cli.Launcher.run(Launcher.groovy:503)
        at nextflow.cli.Launcher.main(Launcher.groovy:658)

However, all the others failed validation (I used the latest commit fa8c6c7, not the PyPI release). I think this is the validator version with the fewest remaining bugs, right, @simleo?
I used grep to get the validator messages from all validation reports for all pipelines and uploaded them here.

Although the list looks very long at first glance, these seem to be just corner cases.
There are three types of messages. For each type I give one example and my educated guess as to what causes it:

  1. "The RO-Crate does not include the Data Entity 'work/tmp/03/b9fce9cd2416a84f3e472fa0606095/all_logs_tabs.txt' as part of its payload"

All of these messages relate to files in the temporary directory work/tmp. Maybe this is just a corner case with the tmp directory the current code misses, because no other regular file from the workdir (like work/3c/52eb4a7b50f0eff9ef603a10d064ac) causes problems.

  2. "FormalParameter MUST have an additionalType"

Example: "violatingEntity": "./#param/genome" for mag pipeline revision 3.0.2.

Thanks to the effective nextflow.config saved in the RO-Crate, I can be 100% sure that this was the parameter value at runtime, and it is the default value:

genome = null

Also an edge case in handling null parameter values?

  3. "RO-Crate file descriptor "ro-crate-metadata.json" is not fully flattened at entity "#param/max_memory/value""

Example: "#param/max_memory/value" is not fully flattened for methylseq pipeline revision 2.6.0.

It looks like this:

{
    "@id": "#param/max_memory/value",
    "@type": "PropertyValue",
    "exampleOfWork": {
        "@id": "#param/max_memory"
    },
    "name": "max_memory",
    "value": {
        "bytes": 6442450944,
        "giga": 6,
        "kilo": 6291456,
        "mega": 6144
    }
},

The effective configuration at runtime is max_memory = '6 GB'. I don't know why it looks so strange in ro-crate-metadata.json. I guess Nextflow takes this value, sees that it is some kind of "file size", and converts it into a list expressing the value in different units?
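
One way to keep such a PropertyValue flat is to pass scalars through and serialize anything structured to a single string. The Python sketch below illustrates the idea only; `flat_value` is hypothetical, and how the Groovy plugin actually resolves this is not shown in this thread:

```python
import json


def flat_value(value):
    """Return a value that is safe to store in PropertyValue.value."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Maps, lists and other structured values would otherwise be embedded
    # as nested JSON objects, which a flattened JSON-LD document must not contain.
    return json.dumps(value, sort_keys=True)


print(flat_value({"bytes": 6442450944, "giga": 6, "kilo": 6291456, "mega": 6144}))
```

Recording the string as it appears in the configuration ('6 GB') would be more readable, if that original form is still available when the crate is rendered.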

One last thing regarding the license.
@bentsherman, you are now using manifest.license for both main.nf and the RO-Crate itself. I thought that the RO-Crate has a license that tells how the RO-Crate (e.g. the contained research data and results) can be used. This can be different from the license under which the Nextflow workflow is published.
@simleo, is this correct?

@simleo

simleo commented Jan 31, 2025

> But all others didn't pass the validator (I used the latest commit fa8c6c7, not the PyPI release)

That's the current development version, so good choice 👍

> I thought that the RO-Crate has a license that tells how the RO-Crate (e.g. the contained research data and results) can be used. This can be different from the license under which the Nextflow workflow is published.

Workflow RO-Crate says:

> The Crate MUST specify a license. The license is assumed to apply to any content of the crate, unless overridden by license on individual File entities.

where the first appearance of "Crate" here means the root data entity. See also Licensing, Access control and copyright.
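
A quick way to check this requirement on a generated crate is to look up the root data entity through the metadata descriptor and read its license. A minimal Python sketch, assuming the usual layout in which the `ro-crate-metadata.json` descriptor points at the root via `about` (`root_license` is a hypothetical helper):

```python
def root_license(graph: list):
    """Return the license of the root data entity, or None if it has none."""
    by_id = {entity.get("@id"): entity for entity in graph}
    descriptor = by_id.get("ro-crate-metadata.json", {})
    # The descriptor's "about" property identifies the root data entity.
    root_id = descriptor.get("about", {}).get("@id", "./")
    return by_id.get(root_id, {}).get("license")


graph = [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset"},
]
print(root_license(graph))  # no license on the root entity
```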

Development

Successfully merging this pull request may close these issues.

Add RO crate format
5 participants