Updated usage and installation docs
willbradshaw committed Jan 22, 2025
1 parent 1e83724 commit 4b8d686
Showing 2 changed files with 102 additions and 88 deletions.
32 changes: 30 additions & 2 deletions docs/installation.md
@@ -54,11 +54,11 @@ newgrp docker
docker run hello-world
```
#### 3. Clone this repository
## 3. Clone this repository
Clone this repo into a new directory as normal.
#### 4. Run index/reference workflow
## 4. Run index/reference workflow
> [!TIP]
> If someone else in your organization already uses this pipeline, it's likely they've already run the index workflow and generated an output directory. If this is the case, you can reduce costs and increase reproducibility by using theirs instead of generating your own. If you want to do this, skip this step, and edit `configs/run.config` such that `params.ref_dir` points to `INDEX_DIR/output`.
@@ -83,3 +83,31 @@ nextflow run PATH_TO_REPO_DIR -resume
> You don't need to point `nextflow run` at `main.nf` or any other workflow file; pointing to the directory will cause Nextflow to automatically run `main.nf` from that directory.
Wait for the workflow to run to completion; this is likely to take several hours at least.
## 5. Run the pipeline on test data
To confirm that the pipeline works in your hands, we recommend running it on a small test dataset, such as the one provided at `s3://nao-testing/gold-standard-test/raw/`, before running it on larger input data. To do this with our test dataset, follow the instructions below, or do it yourself according to the directions given [here](./docs/usage.md).
1. Prepare the launch directory:
- Create a clean launch directory outside the repository directory.
- Copy the run workflow config file into the launch directory as a new file named `nextflow.config`.
- Copy the test-data sample sheet from the repository directory to the launch directory.
```
mkdir launch
cd launch
cp REPO_DIR/configs/run.config nextflow.config
cp REPO_DIR/test-data/samplesheet.csv samplesheet.csv
```
2. Edit the config file (`nextflow.config`), as illustrated in the example snippet after these steps:
- Edit `params.ref_dir` to point to the index directory you chose or created above (specifically `PATH_TO_REF_DIR/output`).
- Edit `params.base_dir` to point to where you would like the pipeline to save intermediate and final pipeline outputs.
3. Choose a profile as described [here](./docs/usage.md).
4. Run the pipeline from the launch directory:
```
nextflow run -resume -profile <PROFILE> REPO_DIR
```
Once the pipeline is complete, output and logging files will be available in the `output` subdirectory of the base directory specified in the config file.
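For reference, the step-2 edits to `nextflow.config` might look something like the following. This is a minimal sketch only: the paths shown are placeholders, and the exact layout of `configs/run.config` may differ.
```
params {
    // Placeholder paths -- substitute your own locations
    ref_dir  = "s3://my-bucket/nao-index/output"   // i.e. PATH_TO_REF_DIR/output
    base_dir = "s3://my-bucket/test-run"           // where intermediate and final outputs are written
}
```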
158 changes: 72 additions & 86 deletions docs/usage.md
@@ -1,120 +1,106 @@
# Usage
# Pipeline Usage

To run any of the workflows you must choose a profile, and then have data to run the pipeline on. To help with this process, we provide a small test dataset to run through the `run` workflow.
This page describes the process of running the pipeline's [core workflow](./docs/run.md) on available data.

<!-- TOC start (generated with https://github.com/derlin/bitdowntoc) -->
- [Usage](#usage)
- [Profiles and modes](#profiles-and-modes)
- [Compute resource requirements](#compute-resource-requirements)
- [Running on new data](#running-on-new-data)
- [Tutorial: Running the `run` workflow on a test dataset](#tutorial-running-the-run-workflow-on-a-test-dataset)
- [Setup](#setup)
- [Profile specific instructions](#profile-specific-instructions)
- [`ec2_local`](#ec2_local)
- [`ec2_s3`](#ec2_s3)
- [`batch`](#batch)
<!-- TOC end -->
> [!IMPORTANT]
> Before following the instructions on this page, make sure you have followed the [installation and setup instructions](./docs/installation.md), including running the [index workflow](./docs/index.md) or otherwise having a complete and up-to-date index directory in an accessible location.
## Profiles and modes
> [!IMPORTANT]
> Currently, the pipeline only accepts paired short-read data; single-end and Oxford Nanopore versions are under development but are not ready for general use.
The pipeline has three main workflows: `INDEX`, `RUN`, and `RUN_VALIDATION`. Each of these calls its corresponding subworkflows, which are located in the `subworkflows` directory.
## 1. Preparing input files

The pipeline can be run in multiple ways by modifying various configuration variables specified in `configs/profiles.config`. Currently, three profiles are implemented, all of which assume the workflow is being launched from an AWS EC2 instance:

- `batch (default)`: **Most efficient way to run the pipeline**
- This profile is the default and attempts to run the pipeline with AWS Batch. This is the quickest and most efficient way to run the pipeline, but requires significant additional setup not described in this repo. To set up AWS Batch for this pipeline, follow the instructions [here](./batch.md) (steps 1-3), then modify your config file to point `process.queue` to the name of your Batch job queue.
- `ec2_local`: **Simple and can be relatively fast, but is bottlenecked by your instance's CPU and memory allocations.**
- This profile attempts to run the whole workflow locally on your EC2 instance, storing intermediate and output files on instance-linked block storage. This is simple and can be relatively fast, but is bottlenecked by your instance's CPU and memory allocations; in particular, if you don't use an instance with very high memory, the pipeline is likely to fail when loading the Kraken2 reference DB.
- `ec2_s3`: **Avoids storage issues on your EC2 instance, but is still constrained by your instance's memory allocation.**
- This profile runs the pipeline on your EC2 instance, but attempts to read and write files to a specified S3 directory. This avoids problems caused by insufficient storage on your EC2 instance, but (1) is significantly slower and (2) is still constrained by your instance's memory allocation.
To run the workflow on new data, you need:

To run the pipeline with a specified profile, run `nextflow run PATH_TO_REPO_DIR -profile PROFILE_NAME -resume`. Calling the pipeline without specifying a profile will run the `batch` profile by default. Future example commands in this README will assume you are using Batch; if you want to instead use a different profile, you'll need to modify the commands accordingly.
1. Accessible **raw data** files in Gzipped FASTQ format, named appropriately.
2. A **sample sheet** file specifying the samples to be analyzed, along with paths to the forward and reverse read files for each sample. `bin/generate_samplesheet.sh` (see below) can make this for you.
3. A **config file** in a clean launch directory, pointing to:
- The base directory in which to put the working and output directories (`params.base_dir`).
- The directory containing the outputs of the reference workflow (`params.ref_dir`).
- The sample sheet (`params.sample_sheet`).
- Various other parameter values.

> [!TIP]
> It's highly recommended that you always run `nextflow run` with the `-resume` option enabled. It doesn't do any harm if you haven't run a workflow before, and getting into the habit will help you avoid much sadness when you want to resume it without rerunning all your jobs.
> We recommend starting each Nextflow pipeline run in a clean launch directory, containing only your sample sheet and config file.
### Compute resource requirements
### 1.1. The sample sheet

To run the pipeline as is, you need at least 128GB of memory and 64 cores. This is because we use the whole Kraken2 reference DB, which is large (128GB), and some processes consume 64 cores. Similarly, if you would like to run BLAST, you must have at least 256GB of memory.
The sample sheet must be an uncompressed CSV file with the following fields:

To change the compute resources for a process, you can modify the `resources.config` file. This file specifies the compute resources for each process based on the label of the process. For example, to change the compute resources for the `kraken` process, you can add the following to the `resources.config` file:
- First column: Sample ID
- Second column: Path to FASTQ file 1 which should be the forward read for this sample
- Third column: Path to FASTQ file 2 which should be the reverse read for this sample

If you change the resources, you'll also need to change the index.
The easiest way to generate this file is typically using `dev/generate_samplesheet.sh`. This script takes in a path to a directory containing raw FASTQ files (`dir_path`), along with forward (`forward_suffix`) and reverse (`reverse_suffix`) read suffixes (both of which support regex), and an optional output path (`output_path`). The `--s3` flag indicates that the target directory is specified as an S3 path. As output, the script generates a CSV file (named `samplesheet.csv` by default) which can be used as input for the pipeline.

For example:
```
../bin/generate_samplesheet.sh \
--s3 \
--dir_path s3://nao-restricted/MJ-2024-10-21/raw/ \
--forward_suffix _1 \
--reverse_suffix _2
```
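The resulting sample sheet might look roughly like the sketch below. The bucket, sample names, and column headers are purely illustrative; the exact headers are whatever `generate_samplesheet.sh` emits.
```
sample,fastq_1,fastq_2
sample1,s3://my-bucket/raw/sample1_1.fastq.gz,s3://my-bucket/raw/sample1_2.fastq.gz
sample2,s3://my-bucket/raw/sample2_1.fastq.gz,s3://my-bucket/raw/sample2_2.fastq.gz
```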

## Running on new data
In addition, the script can also add a `group` column that can be used to combine reads from different files that should be processed together (e.g. for deduplication). There are two options for doing this:

To run the workflow on new data, you need:
- Provide a path to a CSV file containing `sample` and `group` headers, specifying the mapping between samples and groups.
- Specify the `--group_across_illumina_lanes` option if the target directory contains data from one or more libraries split across lanes on an Illumina flowcell.
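For the first option, the group-mapping CSV might look like this (the sample and group names are purely illustrative; only the `sample` and `group` headers come from the description above):
```
sample,group
sample1,libraryA
sample2,libraryA
sample3,libraryB
```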

1. Accessible raw data files in Gzipped FASTQ format, named appropriately.
2. A sample sheet file specifying the samples, along with paths to the forward and reverse read files for each sample. `generate_samplesheet.sh` (see below) can make this for you.
3. A config file in a clean launch directory, pointing to:
- The base directory in which to put the working and output directories (`params.base_dir`).
- The directory containing the outputs of the reference workflow (`params.ref_dir`).
- The sample sheet (`params.sample_sheet`).
- Various other parameter values.
Alternatively, the sample sheet can be manually edited to provide the `group` column prior to running the pipeline.

> [!NOTE]
> The samplesheet must have the following format for each row:
> - First column: Sample ID
> - Second column: Path to FASTQ file 1 which should be the forward read for this sample
> - Third column: Path to FASTQ file 2 which should be the reverse read for this sample
>
> The easiest way to get this file is by using the `generate_samplesheet.sh` script. As input, this script takes a path to the raw FASTQ files (`dir_path`), along with forward (`forward_suffix`) and reverse (`reverse_suffix`) read suffixes, both of which support regex, and an optional output path (`output_path`). Those using data from S3 should make sure to pass the `s3` parameter. Those who would like to group samples by some metadata can pass a path to a CSV file with the header `sample,group`, where each row gives a sample name and its group (`group_file`); edit the samplesheet manually after generation (which is easier when a groups CSV isn't readily available); or provide the `--group_across_illumina_lanes` option if a single library was split across lanes of an Illumina flowcell. As output, the script generates a CSV file (named `samplesheet.csv` by default), which can be used as input for the pipeline.
>
> For example:
> ```
> ../bin/generate_samplesheet.sh \
> --s3 \
> --dir_path s3://nao-restricted/MJ-2024-10-21/raw/ \
> --forward_suffix _1 \
> --reverse_suffix _2
> ```
### 1.2. The config file

If running on Batch, a good process for starting the pipeline on a new dataset is as follows:
The config file specifies parameters and other configuration options used by Nextflow in executing the pipeline. To create a config file for your pipeline run, copy `configs/run.config` into your launch directory as a file named `nextflow.config`, then modify the file as follows:
- Make sure `params.mode = "run"`; this instructs the pipeline to execute the [core run workflow](./docs/run.md).
- Edit `params.ref_dir` to point to the directory containing the outputs of the reference workflow.
- Edit `params.sample_sheet` to point to your sample sheet.
- Edit `params.base_dir` to point to the directory in which Nextflow should put the pipeline working and output directories.
- Edit `params.grouping` to specify whether to group samples together for common processing, based on the `group` column in the sample sheet.
- If running on AWS Batch (see below), edit `process.queue` to the name of your Batch job queue.
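Taken together, the edited entries might look something like the following sketch. The paths, queue name, and grouping value are placeholder assumptions, and the actual layout of `configs/run.config` may differ.
```
params {
    mode         = "run"                              // run the core workflow
    ref_dir      = "s3://my-bucket/nao-index/output"  // outputs of the reference/index workflow
    sample_sheet = "samplesheet.csv"                  // sample sheet in the launch directory
    base_dir     = "s3://my-bucket/my-run"            // working and output directories created here
    grouping     = false                              // set to true to use the sample sheet's group column
}

// Only needed when running with the AWS Batch profile (placeholder queue name)
process.queue = "my-batch-job-queue"
```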

1. Process the raw data to have appropriate filenames (see above) and deposit it in an accessible S3 directory.
2. Create a clean launch directory and copy `configs/run.config` to a file named `nextflow.config` in that directory.
3. Create a sample sheet in that launch directory (see above)
4. Edit `nextflow.config` to specify each item in `params` as appropriate, as well as setting `process.queue` to the appropriate Batch queue.
5. Run `nextflow run PATH_TO_REPO_DIR -resume`.
6. Navigate to `{params.base_dir}/output` to view and download output files.
Most other entries in the config file can be left at their default values for most runs. See [here](./docs/config.md) for a full description of config file parameters and their meanings.

## 2. Choosing a profile

## Tutorial: Running the `run` workflow on a test dataset
The pipeline can be run in multiple ways by modifying various configuration variables specified in `configs/profiles.config`. Currently, three profiles are implemented, all of which assume the workflow is being launched from an AWS EC2 instance:

To confirm that the pipeline works in your hands, we provide a small test dataset (`s3://nao-testing/gold-standard-test/raw/`) to run through the `run` workflow. Feel free to use any profile, but we recommend using the `ec2_local` profile as long as [you have the resources](usage.md#compute-resource-requirements) to handle it.
- `batch (default)`: **Most reliable way to run the pipeline**
- This profile is the default and attempts to run the pipeline with AWS Batch. This is the most reliable and convenient way to run the pipeline, but requires significant additional setup (described [here](./batch.md)). Before running the pipeline using this profile, make sure `process.queue` in your config file is pointing to the correct Batch job queue.
- `ec2_local`: **Requires the least setup, but is bottlenecked by your instance's compute, memory and storage.**
- This profile attempts to run the whole pipeline locally on your EC2 instance, storing all files on instance-linked block storage.
- This is simple and can be relatively fast, but requires large CPU, memory and storage allocations. In particular, if you don't use an instance with very high memory, the pipeline is likely to fail when loading the Kraken2 reference DB.
- `ec2_s3`: **Avoids storage issues on your EC2 instance, but is still constrained by local compute and memory.**
- This profile runs the pipeline on your EC2 instance, but attempts to read and write files to a specified S3 directory. This avoids problems arising from insufficient local storage, but (a) is significantly slower and (b) is still constrained by local compute and memory allocations.

When running with any profile, there is some shared setup you need to do before running the workflow. This setup is the same for all profiles and is described below, followed by profile-specific instructions.
To run the pipeline with a specified profile, run `nextflow run PATH_TO_REPO_DIR -profile PROFILE_NAME -resume`. Calling the pipeline without specifying a profile will run the `batch` profile by default. Future example commands in this README will assume you are using Batch; if you want to instead use a different profile, you'll need to modify the commands accordingly.

### Setup
> [!TIP]
> It's highly recommended that you always run `nextflow run` with the `-resume` option enabled. It doesn't do any harm if you haven't run a workflow before, and getting into the habit will help you avoid much sadness when you want to resume it without rerunning all your jobs.
1. Create a new directory outside the repo directory and copy over the run workflow config file as `nextflow.config` in that directory:
### Compute resource requirements

```
mkdir launch
cd launch
cp REPO_DIR/configs/run.config nextflow.config
```
To run the pipeline as is, you need at least 128GB of memory and 64 cores. This is because we use the whole Kraken2 reference DB, which is large (128GB), and some processes consume 64 cores. Similarly, if you would like to run BLAST, you must have at least 256GB of memory.

2. Edit `nextflow.config` to set `params.ref_dir` to the index directory you chose or created above (specifically `PATH_TO_REF_DIR/output`)
3. Set the samplesheet path to the test dataset samplesheet `${projectDir}/test-data/samplesheet.csv`
To change the compute resources for a process, you can modify the `resources.config` file. This file specifies the compute resources for each process based on the label of the process. For example, to change the compute resources for the `kraken` process, you can add the following to the `resources.config` file:
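As a sketch (the label name `kraken` and the resource values here are illustrative assumptions, and the existing structure of `resources.config` may differ):
```
process {
    withLabel: kraken {
        cpus   = 16
        memory = 128.GB
    }
}
```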

### Profile-specific instructions
If you change the resources, you'll also need to change the index.

#### `ec2_local`
## 3. Running the pipeline

4. Within this directory, run `nextflow run -profile ec2_local .. -resume`. Wait for the workflow to finish.
5. Inspect the `output` directory to view the processed output files.
After creating your sample sheet and config files and choosing a profile, navigate to the launch directory containing your config file. You can then run the pipeline as follows:

#### `ec2_s3`
```
nextflow run -resume -profile <PROFILE> PATH/TO/PIPELINE/DIR
```

4. Edit `nextflow.config` to set `params.base_dir` to the S3 directory of your choice.
5. Still within that directory, run `nextflow run -profile ec2_s3 .. -resume`.
6. Wait for the workflow to finish, and inspect the output on S3.
where `PATH/TO/PIPELINE/DIR` is the path from the launch directory to the directory containing the pipeline files from this repository (in particular, `main.nf`).

#### `batch`
> [!TIP]
> If you are running the pipeline with its default profile (`batch`) you can omit the `-profile` declaration and simply write:
>
> ```
> nextflow run -resume PATH/TO/PIPELINE/DIR
> ```
4. Edit `nextflow.config` to set `params.base_dir` to the S3 directory of your choice and `process.queue` to the name of your Batch job queue.
5. Still within that directory, run `nextflow run -profile batch .. -resume` (or simply `nextflow run .. -resume`).
6. Wait for the workflow to finish, and inspect the output on S3.
Once the pipeline has finished, output and logging files will be available in the `output` subdirectory of the base directory specified in the config file.
