Merge pull request #23 from getwilds/tf-wdl

Adding WDL and Docker Contribution Guides
getwilds · Mar 14, 2024 · f1c4b87
2 parents 6c63584 + f8abdf5
commit f1c4b87
Showing 3 changed files with 119 additions and 0 deletions.
diff --git a/_quarto.yml b/_quarto.yml
@@ -25,6 +25,8 @@ book:
         - codereview-guidelines.qmd
     - packagedocs.qmd
     - maintenance.qmd
+    - wdlconfig.qmd
+    - docker.qmd
     - conventions.qmd
   site-url: https://getwilds.org/guide/
 

diff --git a/docker.qmd b/docker.qmd
@@ -0,0 +1,49 @@
+
+# Docker Configuration Guide {{< iconify mdi docker >}} {#sec-docker}
+
+WILDS WDLs call for a different approach to Docker images than most projects. Normally, a repository is relatively self-contained and needs only one image, which can be linked directly to that repository. WDL pipelines, however, often require a different image for each step, so a single repository can end up needing a long list of Docker images. In addition, our bioinformatics workflows overlap heavily in the images they need, because the same tools are used in different ways from one workflow to the next. To avoid unnecessary image duplication, the [WILDS Docker Library](https://github.com/getwilds/wilds-docker-library) will contain all Dockerfiles and images relevant to WILDS, and all future workflows will refer back to these images.
+
+## Docker Image Guidelines
+
+- Because these Docker images will be used for individual steps within WDL workflows, they should be as minimal as possible in terms of the number of tools installed in each image (1 or 2 max).
+- As a general (but flexible) rule, try to start from as basic a [parent image](https://github.com/getwilds/wilds-docker-library/blob/5b5aa0d936ad71267002c7df64638a16dcea02dc/samtools/Dockerfile_latest#L3) as possible, e.g. `scratch`, `ubuntu`, `python`, `r-base`, etc.
+    - Outside parent images are fine, as long as they are from a VERY trusted source, e.g. [Ubuntu](https://hub.docker.com/_/ubuntu), [Python](https://hub.docker.com/_/python), [Conda](https://hub.docker.com/u/continuumio), [Rocker](https://hub.docker.com/u/rocker), etc.
+- To speed up build and deployment of containers, try to keep image sizes relatively small (a few hundred MB on average, 2GB max).
+    - For this reason, reference data should not be stored in an image unless absolutely necessary.
+    - Unnecessary tools should also be avoided, even if they serve a "just-in-case" functionality.
+
+## Dockerfile Guidelines
+- Every Dockerfile must contain the [labels below](https://github.com/getwilds/wilds-docker-library/blob/5b5aa0d936ad71267002c7df64638a16dcea02dc/bwa/Dockerfile_0.7.17#L6C1-L13C44) at a minimum. These labels tell users where the image came from and where to find the documentation and source code if they have any questions or concerns.
+```
+# Name of the image in question
+LABEL org.opencontainers.image.title="awesomeimage"
+LABEL org.opencontainers.image.description="Short description of awesomeimage and its purpose"
+# Version tag of the image
+LABEL org.opencontainers.image.version="1.0"
+# Author email address
+LABEL org.opencontainers.image.authors="[email protected]"
+# Home page
+LABEL org.opencontainers.image.url=https://hutchdatascience.org/
+# Documentation page
+LABEL org.opencontainers.image.documentation=https://getwilds.org/
+# GitHub repo to link with
+LABEL org.opencontainers.image.source=https://github.com/getwilds/wilds-docker-library
+# License type for the image in question
+LABEL org.opencontainers.image.licenses=MIT
+```
+- When creating a different version of an existing image, use one of the other Dockerfiles as a starting template and modify it as needed.
+    - This will help to ensure that the only thing that has changed between image versions is the version of the tool in question, not any strange formatting/configuration issues.
+- Try to be as specific as possible in terms of [tool versions](https://github.com/getwilds/wilds-docker-library/blob/5b5aa0d936ad71267002c7df64638a16dcea02dc/bwa/Dockerfile_0.7.17#L17C49-L20C64) within the Dockerfile, especially the [parent image](https://github.com/getwilds/wilds-docker-library/blob/5b5aa0d936ad71267002c7df64638a16dcea02dc/bwa/Dockerfile_0.7.17#L3).
+    - If you just specify "latest", a tag that gets updated frequently over time, your image could be completely different the next time you build it, even though it uses the exact same Dockerfile.
+    - On the other hand, specifying "v1.2.3" will always pull the same instance of the tool every time, providing greater reproducibility over time.
+
+## Repository Guidelines
+
+- In terms of the repo organization, each image should have its own directory named after the tool being used in the image. Each version of the image should have its own Dockerfile in that directory following the naming convention of `[IMAGENAME]/Dockerfile_[VERSIONTAG]`.
+    - If formatted correctly, a GitHub Action will [automatically build and upload](https://github.com/getwilds/wilds-docker-library/blob/main/.github/workflows/docker-images.yml) the image to the [WILDS GitHub container registry](https://github.com/orgs/getwilds/packages) upon merging into the `main` branch.
+
+- Before merging your changes to `main` (and therefore uploading a new image to the WILDS package registry), try uploading it to your user-specific package registry using the command below and make sure it works for the WDL task in question.
+```
+docker build --platform linux/amd64 -t ghcr.io/GITHUBUSERNAME/IMAGENAME:VERSIONTAG -f IMAGENAME/Dockerfile_VERSIONTAG --push .
+```
+
+- Upon creation or modification of a pull request in this repo, a GitHub Action will [run a check](https://github.com/getwilds/wilds-docker-library/blob/main/.github/workflows/dockerfile-linting.yml) using a linting tool specific to Dockerfiles called [Hadolint](https://github.com/hadolint/hadolint).
+    - If any major warnings pop up, the check will fail and the user will be unable to merge the branch into `main` until the warning is resolved.
+    - Smaller stylistic issues will still be reported, but they will not restrict you from merging your branch into `main`.
+    - Details about the location and root cause of each warning can be found in the details of the check via the [Actions tab](https://github.com/getwilds/wilds-docker-library/actions) of the repo.
+
+
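The label and version-pinning guidelines in `docker.qmd` above lend themselves to a quick local check before opening a pull request. The following is a minimal sketch, not part of the WILDS tooling: the function names `missing_labels` and `unpinned_parents` are hypothetical, only one label key per `LABEL` line is recognized, and a `FROM` that refers to an earlier build stage would be flagged as unpinned.

```python
import re

# Label keys listed as the required minimum in the guide above.
REQUIRED_LABELS = (
    "org.opencontainers.image.title",
    "org.opencontainers.image.description",
    "org.opencontainers.image.version",
    "org.opencontainers.image.authors",
    "org.opencontainers.image.url",
    "org.opencontainers.image.documentation",
    "org.opencontainers.image.source",
    "org.opencontainers.image.licenses",
)

# Assumption: each LABEL instruction sets a single key, as in the guide's example.
LABEL_KEY = re.compile(r"^\s*LABEL\s+([\w.\-]+)=", re.IGNORECASE)


def missing_labels(dockerfile_text):
    """Return the required label keys that never appear in a LABEL line."""
    found = set()
    for line in dockerfile_text.splitlines():
        match = LABEL_KEY.match(line)
        if match:
            found.add(match.group(1))
    return [key for key in REQUIRED_LABELS if key not in found]


def unpinned_parents(dockerfile_text):
    """Return parent images that have no tag or use the 'latest' tag."""
    flagged = []
    for line in dockerfile_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or tokens[0].upper() != "FROM":
            continue
        # Skip flags such as --platform=linux/amd64.
        args = [t for t in tokens[1:] if not t.startswith("--")]
        if not args:
            continue
        image = args[0]
        # 'scratch' has no tag, and a digest reference is already pinned.
        if image == "scratch" or "@" in image:
            continue
        # Look for a tag only after the last '/', so registry ports are ignored.
        name = image.rsplit("/", 1)[-1]
        tag = name.split(":", 1)[1] if ":" in name else ""
        if tag in ("", "latest"):
            flagged.append(image)
    return flagged
```

Both functions take the Dockerfile contents as a string, so they can be run on `open(path).read()` for any file in the library. An empty list from each means the file passes these two checks; it says nothing about the Hadolint check described above.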
+
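The `[IMAGENAME]/Dockerfile_[VERSIONTAG]` naming convention and the test-upload command from the Repository Guidelines above can be tied together in a small helper. This is a sketch under stated assumptions: `parse_dockerfile_path` and `build_command` are hypothetical names, and the set of characters allowed in image names and tags is a guess rather than a rule taken from the guide.

```python
import re

# Assumed character set for image names and version tags; adjust as needed.
DOCKERFILE_PATH = re.compile(
    r"^(?P<image>[A-Za-z0-9._-]+)/Dockerfile_(?P<tag>[A-Za-z0-9._-]+)$"
)


def parse_dockerfile_path(path):
    """Split 'IMAGENAME/Dockerfile_VERSIONTAG' into (image, tag), or return None."""
    match = DOCKERFILE_PATH.match(path)
    if match is None:
        return None
    return match.group("image"), match.group("tag")


def build_command(path, github_user):
    """Assemble the test-upload command from the guide as an argument list."""
    parsed = parse_dockerfile_path(path)
    if parsed is None:
        raise ValueError(f"{path!r} does not follow IMAGENAME/Dockerfile_VERSIONTAG")
    image, tag = parsed
    # Image references must be lowercase, so the user name is lowercased here.
    target = f"ghcr.io/{github_user.lower()}/{image}:{tag}"
    return [
        "docker", "build",
        "--platform", "linux/amd64",
        "-t", target,
        "-f", path,
        "--push", ".",
    ]
```

The returned list can be passed to `subprocess.run` from the repository root, or joined with spaces to print the command for review before running it.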
diff --git a/wdlconfig.qmd b/wdlconfig.qmd
@@ -0,0 +1,68 @@
+
+# WDL Configuration Guide {{< iconify file-icons wdl >}} {#sec-wdlconfig}
+
+This WILDS WDL configuration guide was inspired by the [BioWDL](https://biowdl.github.io/styleGuidelines.html) and [WARP](https://broadinstitute.github.io/warp/docs/Best_practices/suggested_formats) guidelines and is intended to cater to the pedagogical "proof-of-concept" nature of the WILDS.
+
+## WILDS WDL Philosophy
+
+- The mindset behind WILDS WDLs is for each repository to be a self-contained demonstration of a particular bioinformatic functionality. An ideal use-case would proceed as follows:
+    1. A researcher reviews the repository to decide whether it is relevant to their needs, starting with the README for the overarching purpose of the workflow, but extending to the input json and WDL script itself for specific questions about toolsets, settings, and input/output data types.
+    2. If the workflow seems useful, the researcher clones the repository locally, makes minimal updates to the input json, and executes the code with minimal effort using their favorite WDL executor.
+    3. If the researcher would like to add their own flavor to the workflow, they can fork the repository, customize it as necessary to fit their exact research needs, and even resubmit the changes back to the original repository for consideration and review.
+- To that end, WILDS WDL repositories are relatively minimal and will usually consist of:
+    - a detailed [README](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/README.md) describing the intended functionality and input/output file types
+    - a single [WDL script](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl) containing the workflow as well as the tasks that make up the workflow
+    - an [input json](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram-inputs.json) template providing examples of expected inputs
+    - a [test case](https://github.com/getwilds/fastq-to-cram/tree/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/tests/data) to ensure the workflow is running as expected
+- We believe this minimal setup makes each workflow easier to read and easier to use.
+
+## Structural Guidelines
+
+- [Structs](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L5) should be at the top of the WDL script, followed by the [workflow](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L21) itself, followed by all of its corresponding [tasks](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L80).
+- Tasks should be broken down into operations that are as small as possible.
+    - If a task uses more than one or two command line tools, it should probably be broken up into individual tasks.
+- [Docker containers](https://hutchdatascience.org/WDL_Workflows_Guide/the-first-task.html#docker-images-and-containers) should be assigned to every task to ensure uniform execution, regardless of local context.
+    - Outside of very basic images from very trusted sources, Docker images should be pulled directly from [WILDS' Docker Library](https://github.com/getwilds/wilds-docker-library) whenever possible.
+    - If you think a particular tool should be added to that library, [submit an issue](https://github.com/getwilds/wilds-docker-library/issues) or email us at [email protected].
+- In general, [runtime attributes](https://hutchdatascience.org/WDL_Workflows_Guide/the-first-task.html#runtime-attributes) should be defined whenever possible in order to enable execution on as many [backends](https://hutchdatascience.org/WDL_Workflows_Guide/appendix-backends-and-executors.html) as possible.
+
+## Stylistic Guidelines
+
+- **Indentation**: [braces contents](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L7C23-L11C2), [inputs](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L36C9-L44C31), and [line continuations](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L143C5-L147C67) should all be indented by two spaces (not four).
+- **White Space**: different [input groups](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L36C1-L44C31) and [code blocks](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L95C11-L107C6) should be separated by a single blank line.
+- **Line Breaks**: line breaks should only occur in the following places:
+    - After a [comma](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L37C49-L37C50)
+    - Before the `else` of an `if` statement
+    - [Between inputs](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L36C1-L44C31)
+    - [Opening](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L23C37-L23C38) and [closing](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L45C1-L46C29) braces
+- **Line Character Limit**: lines should be a maximum of 100 characters.
+- **Expression Spacing**: spaces should surround [operators](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L39) to increase clarity and readability.
+- **Naming Conventions**:
+    - [Tasks](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L83C6-L83C24), [workflows](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L23C10-L23C36), and [structs](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L7C8-L7C22) should follow upper camel case (`SuperAwesomeTask`)
+    - Call [aliases](https://github.com/getwilds/wdl-101/blob/b1de97d360b524e1932368c04b6e2dec2c85f134/mutation_calling.wdl#L55C20-L55C31) should follow lower camel case (`superAwesomeCall`)
+    - [Variables](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L28C10-L28C21) should follow lowercase underscore (`super_awesome_variable`)
+- **Descriptive Commenting**:
+    - [Comments](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L82C1-L82C41) should be placed above each task in the workflow describing its function.
+    - Input descriptors should be provided in the [`parameter_meta`](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/fastq-to-cram.wdl#L71C3-L77C4) component.
+- **Command Syntax**:
+    - Command sections within a WDL task should use [Heredoc syntax](https://hutchdatascience.org/WDL_Workflows_Guide/the-first-task.html#referencing-inputs-in-the-command-section) for added clarity in terms of input variables.
+    - Quotation marks around string/file variables are recommended within the command section to avoid confusion involving spaces.
+    - While it is usually not an issue within the context of Cromwell, [file localization](https://hutchdatascience.org/WDL_Workflows_Guide/the-first-task.html#file-localization) is also recommended in order to maximize the utility of the workflow across different WDL executors.
+
+## Repository Guidelines
+
+- As with all repositories, each workflow should include a detailed [README](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/README.md) containing:
+    - Purpose and functionality of the workflow
+    - Basic diagram illustrating the flow of data
+    - Contact information in case issues pop up 
+    - [WILDS Badge](https://github.com/getwilds/badges) at the top describing the development status of the workflow
+- Make sure to include an example input json in the repository for users to modify and easily execute the workflow.
+    - For a skeleton template, try the `inputs` action of [WOMtool](https://cromwell.readthedocs.io/en/stable/WOMtool/#inputs).
+- A GitHub Action executing [WOMtool](https://cromwell.readthedocs.io/en/stable/WOMtool/#validate) `validate` is highly recommended as a [check](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/.github/workflows/womtools-validate.yml) before merging new features into the main branch.
+    - If you're feeling adventurous, try automating an [entire test run](https://github.com/getwilds/fastq-to-cram/blob/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/.github/workflows/cromwell-test-run.yml) using a very small [validation dataset](https://github.com/getwilds/fastq-to-cram/tree/489ccdf0697ab902ca6f775b8e51d4f0603c0c01/tests/data).
+
+## Additional Resources
+
+- [Fred Hutch DaSL WDL 101 Online Course](https://hutchdatascience.org/WDL_Workflows_Guide/introduction-to-wdl.html)
+- [OpenWDL Documentation](https://docs.openwdl.org/en/latest/)
+
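The naming conventions and the 100-character line limit from the Stylistic Guidelines in `wdlconfig.qmd` above can be spot-checked with a few regular expressions. This is a rough sketch, not a WDL parser: the function names are hypothetical, declarations are detected line by line, and text inside a command section that happens to start with `task`, `workflow`, or `struct` would be picked up as well.

```python
import re

# Patterns for the three naming styles named in the guide.
UPPER_CAMEL = re.compile(r"^[A-Z][A-Za-z0-9]*$")          # SuperAwesomeTask
LOWER_CAMEL = re.compile(r"^[a-z][A-Za-z0-9]*$")          # superAwesomeCall
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")  # super_awesome_variable

DECLARATION = re.compile(r"^\s*(task|workflow|struct)\s+(\w+)")
CALL_ALIAS = re.compile(r"\bcall\s+[\w.]+\s+as\s+(\w+)")


def is_snake_case(name):
    """Check a variable name against the lowercase-underscore convention."""
    return SNAKE_CASE.match(name) is not None


def naming_violations(wdl_text):
    """Return (kind, name) pairs for declarations and aliases that break the conventions."""
    violations = []
    for line in wdl_text.splitlines():
        declared = DECLARATION.match(line)
        if declared and not UPPER_CAMEL.match(declared.group(2)):
            violations.append((declared.group(1), declared.group(2)))
        alias = CALL_ALIAS.search(line)
        if alias and not LOWER_CAMEL.match(alias.group(1)):
            violations.append(("alias", alias.group(1)))
    return violations


def long_lines(wdl_text, limit=100):
    """Return the 1-based numbers of lines longer than the limit."""
    return [
        number
        for number, line in enumerate(wdl_text.splitlines(), start=1)
        if len(line) > limit
    ]
```

A check like this complements WOMtool `validate`, which covers syntax rather than style. Variable names are left to `is_snake_case` on a per-name basis because finding every declaration reliably would need a real parser.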