
Merge pull request #241 from karel-brinda/docs
Update documentation
karel-brinda authored Nov 27, 2023
2 parents c8407de + 359d90a commit 87237f4
Showing 2 changed files with 132 additions and 86 deletions.
17 changes: 17 additions & 0 deletions .editorconfig
@@ -0,0 +1,17 @@
# EditorConfig is awesome: http://EditorConfig.org

# top-most EditorConfig file
root = true

[*]
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true
max_line_length = 80
charset = utf-8
indent_style = space
indent_size = 4

[*.{yml,yaml}]
indent_style = space
indent_size = 2
201 changes: 115 additions & 86 deletions README.md
@@ -18,59 +18,58 @@ all within only several hours.

<!-- vim-markdown-toc GFM -->

* [Introduction](#introduction)
* [1. Introduction](#1-introduction)
* [Citation](#citation)
* [Requirements](#requirements)
* [Hardware](#hardware)
* [Dependencies](#dependencies)
* [Installation](#installation)
* [Step 1: Install dependencies](#step-1-install-dependencies)
* [Step 2: Clone the repository](#step-2-clone-the-repository)
* [Step 3: Run a simple test](#step-3-run-a-simple-test)
* [Step 4: Download the database](#step-4-download-the-database)
* [Usage](#usage)
* [Step 1: Copy or symlink your queries](#step-1-copy-or-symlink-your-queries)
* [Step 2: Adjust configuration](#step-2-adjust-configuration)
* [Step 3: Clean up intermediate files](#step-3-clean-up-intermediate-files)
* [Step 4: Run the pipeline](#step-4-run-the-pipeline)
* [Step 5: Analyze your results](#step-5-analyze-your-results)
* [Additional information](#additional-information)
* [List of workflow commands](#list-of-workflow-commands)
* [Directories](#directories)
* [Running on a cluster](#running-on-a-cluster)
* [Known limitations](#known-limitations)
* [License](#license)
* [Contacts](#contacts)
* [2. Requirements](#2-requirements)
* [2a) Hardware](#2a-hardware)
* [2b) Dependencies](#2b-dependencies)
* [3. Installation](#3-installation)
* [3a) Step 1: Install dependencies](#3a-step-1-install-dependencies)
* [3b) Step 2: Clone the repository](#3b-step-2-clone-the-repository)
* [3c) Step 3: Run a simple test](#3c-step-3-run-a-simple-test)
* [3d) Step 4: Download the database](#3d-step-4-download-the-database)
* [4. Usage](#4-usage)
* [4a) Step 1: Copy or symlink your queries](#4a-step-1-copy-or-symlink-your-queries)
* [4b) Step 2: Adjust configuration](#4b-step-2-adjust-configuration)
* [4c) Step 3: Clean up intermediate files](#4c-step-3-clean-up-intermediate-files)
* [4d) Step 4: Run the pipeline](#4d-step-4-run-the-pipeline)
* [4e) Step 5: Analyze your results](#4e-step-5-analyze-your-results)
* [5. Additional information](#5-additional-information)
* [5a) List of workflow commands](#5a-list-of-workflow-commands)
* [5b) Directories](#5b-directories)
* [5c) File formats](#5c-file-formats)
* [5d) Running on a cluster](#5d-running-on-a-cluster)
* [5e) Known limitations](#5e-known-limitations)
* [6. License](#6-license)
* [7. Contacts](#7-contacts)

<!-- vim-markdown-toc -->


## Introduction
## 1. Introduction

The central idea behind MOF-Search,
enabling alignment locally at such a large scale,
is
[**phylogenetic compression**](https://brinda.eu/mof)
([paper](https://doi.org/10.1101/2023.04.15.536996)) -
a technique based
on using estimated evolutionary history to guide compression and
search of large genome collections using existing algorithms and
data structures.
The central idea behind MOF-Search, enabling alignment locally at such a large
scale, is [**phylogenetic compression**](https://brinda.eu/mof)
([paper](https://doi.org/10.1101/2023.04.15.536996)) - a technique based on
using estimated evolutionary history to guide compression and search of large
genome collections using existing algorithms and data structures.

In short, input data are reorganized according to the topology
of the estimated phylogenies, which makes data highly locally compressible even
using basic techniques. Existing software packages for compression, indexing,
and search - in this case [XZ](https://tukaani.org/xz/),
In short, input data are reorganized according to the topology of the estimated
phylogenies, which makes data highly locally compressible even using basic
techniques. Existing software packages for compression, indexing, and search -
in this case [XZ](https://tukaani.org/xz/),
[COBS](https://github.com/iqbal-lab-org/cobs), and
[Minimap2](https://github.com/lh3/minimap2) - are then used as low-level tools.
The resulting performance gains come from a wide range of benefits of
phylogenetic compression, including easy parallelization, small memory
requirements, small database size, better memory locality, and better branch
prediction.
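
As a toy illustration of this idea (not the actual MOF-Search implementation):
reordering a collection of assemblies by the left-to-right leaf order of their
phylogeny before compression places related genomes next to each other in the
stream, which a generic compressor such as XZ then exploits. The file
`leaf_order.txt` below is a hypothetical list of assembly paths in that order:

```bash
# Toy sketch only, not MOF-Search's own code. leaf_order.txt is a hypothetical
# file listing assembly FASTA paths in the left-to-right leaf order of the tree;
# concatenating the files in this order makes the stream highly locally
# compressible for a generic compressor such as xz.
cat $(cat leaf_order.txt) | xz -9 -T0 > batch.xz
```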

For more information about phylogenetic compression and the implementation details of MOF-Search, see the [corresponding
paper](https://www.biorxiv.org/content/10.1101/2023.04.15.536996v2) (including its
[supplementary material](https://www.biorxiv.org/content/biorxiv/early/2023/04/18/2023.04.15.536996/DC1/embed/media-1.pdf)
For more information about phylogenetic compression and the implementation
details of MOF-Search, see the [corresponding
paper](https://www.biorxiv.org/content/10.1101/2023.04.15.536996v2) (including
its [supplementary
material](https://www.biorxiv.org/content/biorxiv/early/2023/04/18/2023.04.15.536996/DC1/embed/media-1.pdf))
and visit the [associated website](https://brinda.eu/mof).


@@ -79,20 +78,22 @@ and visit the [associated website](https://brinda.eu/mof).
> K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym. **[Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression.](https://doi.org/10.1101/2023.04.15.536996)** *bioRxiv* 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996

## Requirements
## 2. Requirements

### Hardware
### 2a) Hardware

MOF-Search requires a standard desktop or laptop computer with an \*nix system,
and it can also run on a cluster. The minimal hardware requirements are **12 GB
RAM** and approximately **120 GB of disk space** (102 GB for the database and
a margin for intermediate files).
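
To quickly confirm that a machine meets these limits (a sanity check only; the
commands below are Linux-specific):

```bash
# Available RAM in GiB and free disk space in the working directory
free -g
df -h .
```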


### Dependencies
### 2b) Dependencies

MOF-Search is implemented as a [Snakemake](https://snakemake.github.io)
pipeline, using the Conda system to manage non-standard dependencies. Ensure you have [Conda](https://docs.conda.io/en/latest/miniconda.html) installed with the following packages:
pipeline, using the Conda system to manage non-standard dependencies. Ensure
you have [Conda](https://docs.conda.io/en/latest/miniconda.html) installed with
the following packages:

* [GNU Time](https://www.gnu.org/software/time/) (on Linux present by default; on OS X, install with `brew install gnu-time`).
* [Python](https://www.python.org/) (>=3.7)
@@ -104,32 +105,38 @@ Additionally, MOF-Search uses standard Unix tools like
[cURL](https://curl.se/),
[XZ Utils](https://tukaani.org/xz/), and
[GNU Gzip](https://www.gnu.org/software/gzip/).
These tools are typically included in standard \*nix installations. However, in minimal setups (e.g., virtualization, continuous integration), you might need to install them using the corresponding package managers.
These tools are typically included in standard \*nix installations. However, in
minimal setups (e.g., virtualization, continuous integration), you might need
to install them using the corresponding package managers.
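
For instance, on a minimal Debian/Ubuntu image the tools could be installed
roughly as follows (the package names are distribution-specific assumptions):

```bash
# Debian/Ubuntu example; package names differ on other systems
sudo apt-get update
sudo apt-get install -y curl xz-utils gzip
```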


## Installation
## 3. Installation

### Step 1: Install dependencies
### 3a) Step 1: Install dependencies

Make sure you have Conda and GNU Time installed. On Linux:

```bash
sudo apt-get install conda
```

On OS X (using Homebrew):

```bash
brew install conda
brew install gnu-time
```

Install Python (>=3.7), Snakemake (>=6.2.0), and Mamba (optional but recommended) using Conda:
Install Python (>=3.7), Snakemake (>=6.2.0), and Mamba (optional but
recommended) using Conda:

```bash
conda install -y -c bioconda -c conda-forge \
"python>=3.7" "snakemake>=6.2.0" "mamba>=0.20.0"
```
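
As an optional sanity check (not required by the pipeline), you can verify that
the installed tools are on your `PATH`:

```bash
# Versions should match the requirements above
python --version     # >= 3.7
snakemake --version  # >= 6.2.0
mamba --version      # only if you installed Mamba
```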


### Step 2: Clone the repository
### 3b) Step 2: Clone the repository

Clone the MOF-Search repository from GitHub and navigate into the directory:

@@ -139,26 +146,28 @@ Clone the MOF-Search repository from GitHub and navigate into the directory:
```
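
A typical invocation, assuming the project's public GitHub URL, is:

```bash
# Assumes the public repository URL; adjust if you work from a fork
git clone https://github.com/karel-brinda/mof-search
cd mof-search
```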


### Step 3: Run a simple test
### 3c) Step 3: Run a simple test

Run the following command to ensure the pipeline works for sample queries and 3 batches (this will also install all additional dependencies using Conda):
Run the following command to ensure the pipeline works for sample queries and
3 batches (this will also install all additional dependencies using Conda):

```bash
make test
```

Make sure the test returns 0 (success) and that you see the expected output message:
Make sure the test returns 0 (success) and that you see the expected output
message:

```bash
Success! Test run produced the expected output.
```


### Step 4: Download the database
### 3d) Step 4: Download the database

Download all phylogenetically compressed assemblies and COBS *k*-mer indexes
for the [661k-HQ collection](https://doi.org/10.1371/journal.pbio.3001421) by:

Download all phylogenetically compressed assemblies and COBS *k*-mer
indexes for the [661k-HQ
collection](https://doi.org/10.1371/journal.pbio.3001421) by:
```bash
make download
```
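
Once the download has finished, a rough size check of the two database
directories (about 102 GB in total) can confirm that everything arrived:

```bash
# Rough sanity check of the downloaded database size
du -sh asms/ cobs/
```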
@@ -172,9 +181,9 @@ The downloaded files will be located in the `asms/` and `cobs/` directories.
control.


## Usage
## 4. Usage

### Step 1: Copy or symlink your queries
### 4a) Step 1: Copy or symlink your queries

Remove the default test files or your old files in the `queries/` directory and
copy or symlink (recommended) your query files. The supported input formats are
@@ -186,34 +195,39 @@ merged together.
* All non-`ACGT` characters in your query sequences will be translated to `A`.
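
For example, to symlink a single gzipped FASTQ file into place (the source path
below is hypothetical; the target directory follows the wording of this step):

```bash
# Symlinking (recommended) avoids copying large input files
ln -s /path/to/my_reads.fq.gz queries/
```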


### Step 2: Adjust configuration
### 4b) Step 2: Adjust configuration

Edit the [`config.yaml`](config.yaml) file for your desired search. All available options are
documented directly there.
Edit the [`config.yaml`](config.yaml) file for your desired search. All
available options are documented directly there.

### Step 3: Clean up intermediate files
### 4c) Step 3: Clean up intermediate files

Run `make clean` to clean intermediate files from the previous runs. This includes COBS matching files, alignment files, and various reports.
Run `make clean` to clean intermediate files from the previous runs. This
includes COBS matching files, alignment files, and various reports.

### Step 4: Run the pipeline
### 4d) Step 4: Run the pipeline

Simply run `make`, which will execute Snakemake with the corresponding parameters. If you want to run the pipeline step by step, run `make match` followed by `make map`.
Simply run `make`, which will execute Snakemake with the corresponding
parameters. If you want to run the pipeline step by step, run `make match`
followed by `make map`.
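
The equivalent commands are:

```bash
make          # full pipeline: matching + mapping

# or, step by step:
make match    # k-mer matching against the COBS indexes
make map      # alignment of the candidate matches with Minimap2
```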

### Step 5: Analyze your results
### 4e) Step 5: Analyze your results

Check the output files in `output/`. The `.sam_summary.gz` files contain output alignments in a headerless SAM format. The `.sam_summary.stats` files contain statistics about your computed alignments.
Check the output files in `output/` (for more info about formats, see
[5c) File formats](#5c-file-formats)).

If the results do not correspond to what you expected and you need to re-adjust
your search parameters, go to Step 2. If only the mapping part is affected by the
changes, you proceed more rapidly by manually removing the files in
your search parameters, go to Step 2. If only the mapping part is affected by
the changes, you can proceed more rapidly by manually removing the files in
`intermediate/05_map` and `output/` and directly running `make map`.


## Additional information
## 5. Additional information

### List of workflow commands
### 5a) List of workflow commands

MOF-Search is executed via [GNU Make](https://www.gnu.org/software/make/), which handles all parameters and passes them to Snakemake.
MOF-Search is executed via [GNU Make](https://www.gnu.org/software/make/),
which handles all parameters and passes them to Snakemake.

Here's a list of all implemented commands (to be executed as `make {command}`):

@@ -248,51 +262,66 @@
format Reformat Python and Snakemake files
```

### Directories
### 5b) Directories

* `asms/`, `cobs/` Downloaded assemblies and COBS indexes
* `input/` Queries, to be provided within one or more FASTA/FASTQ files, possibly gzipped (`.fa`)
* `input/` Queries, to be provided within one or more FASTA/FASTQ files,
possibly gzipped (`.fa`)
* `intermediate/` Intermediate files
* `00_queries_preprocessed` Preprocessed queries
* `01_queries_merged` Merged queries
* `02_cobs_decompressed` Decompressed COBS indexes (temporary, used only in the disk mode is used)
* `03_match` COBS matches
* `04_filter` Filtered candidates
* `05_map` Minimap2 alignments
* `00_queries_preprocessed/` Preprocessed queries
* `01_queries_merged/` Merged queries
* `02_cobs_decompressed/` Decompressed COBS indexes (temporary, used only when
the disk mode is used)
* `03_match/` COBS matches
* `04_filter/` Filtered candidates
* `05_map/` Minimap2 alignments
* `logs/` Logs and benchmarks
* `output/` The resulting files (in a headerless SAM format)


### Running on a cluster

Running on a cluster is much faster as the jobs produced by this pipeline are quite light and usually start running as
soon as they are scheduled.
### 5c) File formats

**Input files:** FASTA or FASTQ files, possibly compressed by gzip. The files
are searched for in the `input/` directory, as files with the following
suffixes: `.fa`, `.fasta`, `.fq`, `.fastq` (possibly with `.gz` at the end).

**Output files:**

* `output/{name}.sam_summary.gz`: output alignments in a headerless SAM format
* `output/{name}.sam_summary.stats`: statistics about your computed alignments
in TSV
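
Since the SAM output has no header, it is easiest to inspect with standard
shell tools; for instance (`my_queries` below is a hypothetical query name):

```bash
# Peek at the first few alignments
zcat output/my_queries.sam_summary.gz | head -n 5

# Count alignments per reference genome (SAM column 3)
zcat output/my_queries.sam_summary.gz | cut -f3 | sort | uniq -c | sort -rn | head
```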


### 5d) Running on a cluster

Running on a cluster is much faster as the jobs produced by this pipeline are
quite light and usually start running as soon as they are scheduled.

**For LSF clusters:**

1. Test if the pipeline is working on an LSF cluster: `make cluster_lsf_test`;
2. Configure your queries and run the full pipeline: `make cluster_lsf`.


### Known limitations
### 5e) Known limitations

* **Swapping if the number of queries is too high.** If the number of queries is
too high, the auxiliary Python scripts start to use too much memory, which
may result in swapping. Try to keep the number of queries moderate and
ideally their names short.

* **No support for ambiguous characters in queries.** As the tools used
internally by MOF-Search support only the nucleotide alphabet, all non-ACGT
characters in queries are first converted to A.
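
The effect on a query is the same as in the following sketch (not the
pipeline's actual preprocessing code; `queries.fa` is a hypothetical input):

```bash
# Uppercase the sequence lines, then replace anything that is not A/C/G/T by A,
# leaving FASTA headers untouched
awk '/^>/ {print; next} {print toupper($0)}' queries.fa \
    | sed '/^>/! s/[^ACGT]/A/g' > queries_sanitized.fa
```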


## License
## 6. License

[MIT](https://github.com/karel-brinda/mof-search/blob/master/LICENSE)



## Contacts
## 7. Contacts

* [Karel Brinda](https://brinda.eu) \<[email protected]\>
* [Leandro Lima](https://github.com/leoisl) \<[email protected]\>
