
Merge pull request #241 from karel-brinda/docs
Update documentation
karel-brinda authored Nov 27, 2023
2 parents c8407de + 359d90a commit 87237f4
Showing 2 changed files with 132 additions and 86 deletions.
17 changes: 17 additions & 0 deletions .editorconfig
@@ -0,0 +1,17 @@
# EditorConfig is awesome: http://EditorConfig.org

# top-most EditorConfig file
root = true

[*]
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true
max_line_length = 80
charset = utf-8
indent_style = space
indent_size = 4

[*.{yml,yaml}]
indent_style = space
indent_size = 2
201 changes: 115 additions & 86 deletions README.md
@@ -18,59 +18,58 @@ all within only several hours.

<!-- vim-markdown-toc GFM -->

* [Introduction](#introduction)
* [1. Introduction](#1-introduction)
* [Citation](#citation)
* [Requirements](#requirements)
* [Hardware](#hardware)
* [Dependencies](#dependencies)
* [Installation](#installation)
* [Step 1: Install dependencies](#step-1-install-dependencies)
* [Step 2: Clone the repository](#step-2-clone-the-repository)
* [Step 3: Run a simple test](#step-3-run-a-simple-test)
* [Step 4: Download the database](#step-4-download-the-database)
* [Usage](#usage)
* [Step 1: Copy or symlink your queries](#step-1-copy-or-symlink-your-queries)
* [Step 2: Adjust configuration](#step-2-adjust-configuration)
* [Step 3: Clean up intermediate files](#step-3-clean-up-intermediate-files)
* [Step 4: Run the pipeline](#step-4-run-the-pipeline)
* [Step 5: Analyze your results](#step-5-analyze-your-results)
* [Additional information](#additional-information)
* [List of workflow commands](#list-of-workflow-commands)
* [Directories](#directories)
* [Running on a cluster](#running-on-a-cluster)
* [Known limitations](#known-limitations)
* [License](#license)
* [Contacts](#contacts)
* [2. Requirements](#2-requirements)
* [2a) Hardware](#2a-hardware)
* [2b) Dependencies](#2b-dependencies)
* [3. Installation](#3-installation)
* [3a) Step 1: Install dependencies](#3a-step-1-install-dependencies)
* [3b) Step 2: Clone the repository](#3b-step-2-clone-the-repository)
* [3c) Step 3: Run a simple test](#3c-step-3-run-a-simple-test)
* [3d) Step 4: Download the database](#3d-step-4-download-the-database)
* [4. Usage](#4-usage)
* [4a) Step 1: Copy or symlink your queries](#4a-step-1-copy-or-symlink-your-queries)
* [4b) Step 2: Adjust configuration](#4b-step-2-adjust-configuration)
* [4c) Step 3: Clean up intermediate files](#4c-step-3-clean-up-intermediate-files)
* [4d) Step 4: Run the pipeline](#4d-step-4-run-the-pipeline)
* [4e) Step 5: Analyze your results](#4e-step-5-analyze-your-results)
* [5. Additional information](#5-additional-information)
* [5a) List of workflow commands](#5a-list-of-workflow-commands)
* [5b) Directories](#5b-directories)
* [5c) File formats](#5c-file-formats)
* [5d) Running on a cluster](#5d-running-on-a-cluster)
* [5e) Known limitations](#5e-known-limitations)
* [6. License](#6-license)
* [7. Contacts](#7-contacts)

<!-- vim-markdown-toc -->


## Introduction
## 1. Introduction

The central idea behind MOF-Search,
enabling alignment locally at such a large scale,
is
[**phylogenetic compression**](https://brinda.eu/mof)
([paper](https://doi.org/10.1101/2023.04.15.536996)) -
a technique based
on using estimated evolutionary history to guide compression and
search of large genome collections using existing algorithms and
data structures.
The central idea behind MOF-Search, enabling alignment locally at such a large
scale, is [**phylogenetic compression**](https://brinda.eu/mof)
([paper](https://doi.org/10.1101/2023.04.15.536996)) - a technique based on
using estimated evolutionary history to guide compression and search of large
genome collections using existing algorithms and data structures.

In short, input data are reorganized according to the topology
of the estimated phylogenies, which makes data highly locally compressible even
using basic techniques. Existing software packages for compression, indexing,
and search - in this case [XZ](https://tukaani.org/xz/),
In short, input data are reorganized according to the topology of the estimated
phylogenies, which makes data highly locally compressible even using basic
techniques. Existing software packages for compression, indexing, and search -
in this case [XZ](https://tukaani.org/xz/),
[COBS](https://github.com/iqbal-lab-org/cobs), and
[Minimap2](https://github.com/lh3/minimap2) - are then used as low-level tools.
The resulting performance gains come from a wide range of benefits of
phylogenetic compression, including easy parallelization, small memory
requirements, small database size, better memory locality, and better branch
prediction.
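
As a toy illustration of this idea (not the actual MOF-Search implementation):
reordering a collection of assemblies by the left-to-right leaf order of their
phylogeny before compression places related genomes next to each other in the
stream, which a generic compressor such as XZ then exploits. The file
`leaf_order.txt` below is a hypothetical list of assembly paths in that order:

```bash
# Toy sketch only, not MOF-Search's own code. leaf_order.txt is a hypothetical
# file listing assembly FASTA paths in the left-to-right leaf order of the tree;
# concatenating the files in this order makes the stream highly locally
# compressible for a generic compressor such as xz.
cat $(cat leaf_order.txt) | xz -9 -T0 > batch.xz
```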

For more information about phylogenetic compression and the implementation details of MOF-Search, see the [corresponding
paper](https://www.biorxiv.org/content/10.1101/2023.04.15.536996v2) (including its
[supplementary material](https://www.biorxiv.org/content/biorxiv/early/2023/04/18/2023.04.15.536996/DC1/embed/media-1.pdf)
For more information about phylogenetic compression and the implementation
details of MOF-Search, see the [corresponding
paper](https://www.biorxiv.org/content/10.1101/2023.04.15.536996v2) (including
its [supplementary
material](https://www.biorxiv.org/content/biorxiv/early/2023/04/18/2023.04.15.536996/DC1/embed/media-1.pdf))
and visit the [associated website](https://brinda.eu/mof).


@@ -79,20 +78,22 @@ and visit the [associated website](https://brinda.eu/mof).
> K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym. **[Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression.](https://doi.org/10.1101/2023.04.15.536996)** *bioRxiv* 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996

## Requirements
## 2. Requirements

### Hardware
### 2a) Hardware

MOF-Search requires a standard desktop or laptop computer with an \*nix system,
and it can also run on a cluster. The minimal hardware requirements are **12 GB
RAM** and approximately **120 GB of disk space** (102 GB for the database and
a margin for intermediate files).
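
To quickly confirm that a machine meets these limits (a sanity check only; the
commands below are Linux-specific):

```bash
# Available RAM in GiB and free disk space in the working directory
free -g
df -h .
```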


### Dependencies
### 2b) Dependencies

MOF-Search is implemented as a [Snakemake](https://snakemake.github.io)
pipeline, using the Conda system to manage non-standard dependencies. Ensure you have [Conda](https://docs.conda.io/en/latest/miniconda.html) installed with the following packages:
pipeline, using the Conda system to manage non-standard dependencies. Ensure
you have [Conda](https://docs.conda.io/en/latest/miniconda.html) installed with
the following packages:

* [GNU Time](https://www.gnu.org/software/time/) (on Linux present by default; on OS X, install with `brew install gnu-time`).
* [Python](https://www.python.org/) (>=3.7)
@@ -104,32 +105,38 @@ Additionally, MOF-Search uses standard Unix tools like
[cURL](https://curl.se/),
[XZ Utils](https://tukaani.org/xz/), and
[GNU Gzip](https://www.gnu.org/software/gzip/).
These tools are typically included in standard \*nix installations. However, in minimal setups (e.g., virtualization, continuous integration), you might need to install them using the corresponding package managers.
These tools are typically included in standard \*nix installations. However, in
minimal setups (e.g., virtualization, continuous integration), you might need
to install them using the corresponding package managers.
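
For instance, on a minimal Debian/Ubuntu image the tools could be installed
roughly as follows (the package names are distribution-specific assumptions):

```bash
# Debian/Ubuntu example; package names differ on other systems
sudo apt-get update
sudo apt-get install -y curl xz-utils gzip
```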


## Installation
## 3. Installation

### Step 1: Install dependencies
### 3a) Step 1: Install dependencies

Make sure you have Conda and GNU Time installed. On Linux:

```bash
sudo apt-get install conda
```

On OS X (using Homebrew):

```bash
brew install conda
brew install gnu-time
```

Install Python (>=3.7), Snakemake (>=6.2.0), and Mamba (optional but recommended) using Conda:
Install Python (>=3.7), Snakemake (>=6.2.0), and Mamba (optional but
recommended) using Conda:

```bash
conda install -y -c bioconda -c conda-forge \
"python>=3.7" "snakemake>=6.2.0" "mamba>=0.20.0"
```
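
As an optional sanity check (not required by the pipeline), you can verify that
the installed tools are on your `PATH`:

```bash
# Versions should match the requirements above
python --version     # >= 3.7
snakemake --version  # >= 6.2.0
mamba --version      # only if you installed Mamba
```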


### Step 2: Clone the repository
### 3b) Step 2: Clone the repository

Clone the MOF-Search repository from GitHub and navigate into the directory:

@@ -139,26 +146,28 @@ Clone the MOF-Search repository from GitHub and navigate into the directory:
```
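
A typical invocation, assuming the project's public GitHub URL, is:

```bash
# Assumes the public repository URL; adjust if you work from a fork
git clone https://github.com/karel-brinda/mof-search
cd mof-search
```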


### Step 3: Run a simple test
### 3c) Step 3: Run a simple test

Run the following command to ensure the pipeline works for sample queries and 3 batches (this will also install all additional dependencies using Conda):
Run the following command to ensure the pipeline works for sample queries and
3 batches (this will also install all additional dependencies using Conda):

```bash
make test
```

Make sure the test returns 0 (success) and that you see the expected output message:
Make sure the test returns 0 (success) and that you see the expected output
message:

```bash
Success! Test run produced the expected output.
```


### Step 4: Download the database
### 3d) Step 4: Download the database

Download all phylogenetically compressed assemblies and COBS *k*-mer indexes
for the [661k-HQ collection](https://doi.org/10.1371/journal.pbio.3001421) by:

Download all phylogenetically compressed assemblies and COBS *k*-mer
indexes for the [661k-HQ
collection](https://doi.org/10.1371/journal.pbio.3001421) by:
```bash
make download
```
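
Once the download has finished, a rough size check of the two database
directories (about 102 GB in total) can confirm that everything arrived:

```bash
# Rough sanity check of the downloaded database size
du -sh asms/ cobs/
```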
@@ -172,9 +181,9 @@ The downloaded files will be located in the `asms/` and `cobs/` directories.
control.


## Usage
## 4. Usage

### Step 1: Copy or symlink your queries
### 4a) Step 1: Copy or symlink your queries

Remove the default test files or your old files in the `queries/` directory and
copy or symlink (recommended) your query files. The supported input formats are
@@ -186,34 +195,39 @@ merged together.
* All non-`ACGT` characters in your query sequences will be translated to `A`.
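
For example, to symlink a single gzipped FASTQ file into place (the source path
below is hypothetical; the target directory follows the wording of this step):

```bash
# Symlinking (recommended) avoids copying large input files
ln -s /path/to/my_reads.fq.gz queries/
```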


### Step 2: Adjust configuration
### 4b) Step 2: Adjust configuration

Edit the [`config.yaml`](config.yaml) file for your desired search. All available options are
documented directly there.
Edit the [`config.yaml`](config.yaml) file for your desired search. All
available options are documented directly there.

### Step 3: Clean up intermediate files
### 4c) Step 3: Clean up intermediate files

Run `make clean` to clean intermediate files from the previous runs. This includes COBS matching files, alignment files, and various reports.
Run `make clean` to clean intermediate files from the previous runs. This
includes COBS matching files, alignment files, and various reports.

### Step 4: Run the pipeline
### 4d) Step 4: Run the pipeline

Simply run `make`, which will execute Snakemake with the corresponding parameters. If you want to run the pipeline step by step, run `make match` followed by `make map`.
Simply run `make`, which will execute Snakemake with the corresponding
parameters. If you want to run the pipeline step by step, run `make match`
followed by `make map`.
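
The equivalent commands are:

```bash
make          # full pipeline: matching + mapping

# or, step by step:
make match    # k-mer matching against the COBS indexes
make map      # alignment of the candidate matches with Minimap2
```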

### Step 5: Analyze your results
### 4e) Step 5: Analyze your results

Check the output files in `output/`. The `.sam_summary.gz` files contain output alignments in a headerless SAM format. The `.sam_summary.stats` files contain statistics about your computed alignments.
Check the output files in `output/` (for more info about formats, see
[5c) File formats](#5c-file-formats)).

If the results do not correspond to what you expected and you need to re-adjust
your search parameters, go to Step 2. If only the mapping part is affected by the
changes, you proceed more rapidly by manually removing the files in
your search parameters, go to Step 2. If only the mapping part is affected by
the changes, you can proceed more rapidly by manually removing the files in
`intermediate/05_map` and `output/` and directly running `make map`.


## Additional information
## 5. Additional information

### List of workflow commands
### 5a) List of workflow commands

MOF-Search is executed via [GNU Make](https://www.gnu.org/software/make/), which handles all parameters and passes them to Snakemake.
MOF-Search is executed via [GNU Make](https://www.gnu.org/software/make/),
which handles all parameters and passes them to Snakemake.

Here's a list of all implemented commands (to be executed as `make {command}`):

@@ -248,51 +262,66 @@
format Reformat Python and Snakemake files
```

### Directories
### 5b) Directories

* `asms/`, `cobs/` Downloaded assemblies and COBS indexes
* `input/` Queries, to be provided within one or more FASTA/FASTQ files, possibly gzipped (`.fa`)
* `input/` Queries, to be provided within one or more FASTA/FASTQ files,
possibly gzipped (`.fa`)
* `intermediate/` Intermediate files
* `00_queries_preprocessed` Preprocessed queries
* `01_queries_merged` Merged queries
* `02_cobs_decompressed` Decompressed COBS indexes (temporary, used only in the disk mode is used)
* `03_match` COBS matches
* `04_filter` Filtered candidates
* `05_map` Minimap2 alignments
* `00_queries_preprocessed/` Preprocessed queries
* `01_queries_merged/` Merged queries
* `02_cobs_decompressed/` Decompressed COBS indexes (temporary, used only when
the disk mode is used)
* `03_match/` COBS matches
* `04_filter/` Filtered candidates
* `05_map/` Minimap2 alignments
* `logs/` Logs and benchmarks
* `output/` The resulting files (in a headerless SAM format)


### Running on a cluster

Running on a cluster is much faster as the jobs produced by this pipeline are quite light and usually start running as
soon as they are scheduled.
### 5c) File formats

**Input files:** FASTA or FASTQ files, possibly compressed by gzip. The files
are searched for in the `input/` directory, as files with the following
suffixes: `.fa`, `.fasta`, `.fq`, `.fastq` (possibly with `.gz` at the end).

**Output files:**

* `output/{name}.sam_summary.gz`: output alignments in a headerless SAM format
* `output/{name}.sam_summary.stats`: statistics about your computed alignments
in TSV
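
Since the SAM output has no header, it is easiest to inspect with standard
shell tools; for instance (`my_queries` below is a hypothetical query name):

```bash
# Peek at the first few alignments
zcat output/my_queries.sam_summary.gz | head -n 5

# Count alignments per reference genome (SAM column 3)
zcat output/my_queries.sam_summary.gz | cut -f3 | sort | uniq -c | sort -rn | head
```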


### 5d) Running on a cluster

Running on a cluster is much faster as the jobs produced by this pipeline are
quite light and usually start running as soon as they are scheduled.

**For LSF clusters:**

1. Test if the pipeline is working on an LSF cluster: `make cluster_lsf_test`;
2. Configure your queries and run the full pipeline: `make cluster_lsf`.


### Known limitations
### 5e) Known limitations

* **Swapping if the number of queries is too high.** If the number of queries is
too high, the auxiliary Python scripts start to use too much memory, which
may result in swapping. Try to keep the number of queries moderate and
ideally their names short.

* **No support for ambiguous characters in queries.** As the tools used
internally by MOF-Search support only the nucleotide alphabet, all non-ACGT
characters in queries are first converted to A.
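
The effect on a query is the same as in the following sketch (not the
pipeline's actual preprocessing code; `queries.fa` is a hypothetical input):

```bash
# Uppercase the sequence lines, then replace anything that is not A/C/G/T by A,
# leaving FASTA headers untouched
awk '/^>/ {print; next} {print toupper($0)}' queries.fa \
    | sed '/^>/! s/[^ACGT]/A/g' > queries_sanitized.fa
```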


## License
## 6. License

[MIT](https://github.com/karel-brinda/mof-search/blob/master/LICENSE)



## Contacts
## 7. Contacts

* [Karel Brinda](https://brinda.eu) \<[email protected]\>
* [Leandro Lima](https://github.com/leoisl) \<[email protected]\>
