Merge pull request #241 from karel-brinda/docs
Update documentation
Showing 2 changed files with 132 additions and 86 deletions.
**`.editorconfig`** (new file):

```ini
# EditorConfig is awesome: http://EditorConfig.org

# top-most EditorConfig file
root = true

[*]
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true
max_line_length = 80
charset = utf-8
indent_style = space
indent_size = 4

[*.{yml,yaml}]
indent_style = space
indent_size = 2
```
**`README.md`** (updated):

…all within only several hours.
<!-- vim-markdown-toc GFM -->

* [1. Introduction](#1-introduction)
* [2. Requirements](#2-requirements)
    * [2a) Hardware](#2a-hardware)
    * [2b) Dependencies](#2b-dependencies)
* [3. Installation](#3-installation)
    * [3a) Step 1: Install dependencies](#3a-step-1-install-dependencies)
    * [3b) Step 2: Clone the repository](#3b-step-2-clone-the-repository)
    * [3c) Step 3: Run a simple test](#3c-step-3-run-a-simple-test)
    * [3d) Step 4: Download the database](#3d-step-4-download-the-database)
* [4. Usage](#4-usage)
    * [4a) Step 1: Copy or symlink your queries](#4a-step-1-copy-or-symlink-your-queries)
    * [4b) Step 2: Adjust configuration](#4b-step-2-adjust-configuration)
    * [4c) Step 3: Clean up intermediate files](#4c-step-3-clean-up-intermediate-files)
    * [4d) Step 4: Run the pipeline](#4d-step-4-run-the-pipeline)
    * [4e) Step 5: Analyze your results](#4e-step-5-analyze-your-results)
* [5. Additional information](#5-additional-information)
    * [5a) List of workflow commands](#5a-list-of-workflow-commands)
    * [5b) Directories](#5b-directories)
    * [5c) File formats](#5c-file-formats)
    * [5d) Running on a cluster](#5d-running-on-a-cluster)
    * [5e) Known limitations](#5e-known-limitations)
* [6. License](#6-license)
* [7. Contacts](#7-contacts)

<!-- vim-markdown-toc -->
## 1. Introduction

The central idea behind MOF-Search, enabling alignment locally at such a large
scale, is [**phylogenetic compression**](https://brinda.eu/mof)
([paper](https://doi.org/10.1101/2023.04.15.536996)) - a technique based on
using estimated evolutionary history to guide compression and search of large
genome collections using existing algorithms and data structures.
In short, input data are reorganized according to the topology of the estimated
phylogenies, which makes data highly locally compressible even using basic
techniques. Existing software packages for compression, indexing, and search -
in this case [XZ](https://tukaani.org/xz/),
[COBS](https://github.com/iqbal-lab-org/cobs), and
[Minimap2](https://github.com/lh3/minimap2) - are then used as low-level tools.
The resulting performance gains come from a wide range of benefits of
phylogenetic compression, including easy parallelization, small memory
requirements, small database size, better memory locality, and better branch
prediction.
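The effect of ordering on compressibility can be felt with a tiny shell
experiment. This is a toy sketch, not MOF-Search itself: gzip's 32 kB window
plays the role of a compressor's limited context, and placing similar blocks
next to each other stands in for phylogeny-guided reordering.

```shell
# Two "clades": each is one random 20 kB block repeated 8 times.
A=$(head -c 15000 /dev/urandom | base64 | tr -d '\n' | head -c 20000)
B=$(head -c 15000 /dev/urandom | base64 | tr -d '\n' | head -c 20000)

# Same data, two orderings: clade-grouped vs interleaved.
g=$(printf '%s' "$A$A$A$A$A$A$A$A$B$B$B$B$B$B$B$B" | gzip -9 | wc -c)
i=$(printf '%s' "$A$B$A$B$A$B$A$B$A$B$A$B$A$B$A$B" | gzip -9 | wc -c)

# Grouped: repeats sit 20 kB apart, inside gzip's window -> long matches.
# Interleaved: repeats sit 40 kB apart, outside the window -> no matches.
echo "grouped: $g bytes, interleaved: $i bytes"
```

The grouped ordering compresses to a fraction of the interleaved one, even
though both contain exactly the same data; phylogenetic compression obtains
the same kind of locality at genome-collection scale.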
For more information about phylogenetic compression and the implementation
details of MOF-Search, see the [corresponding
paper](https://www.biorxiv.org/content/10.1101/2023.04.15.536996v2), including
its [supplementary
material](https://www.biorxiv.org/content/biorxiv/early/2023/04/18/2023.04.15.536996/DC1/embed/media-1.pdf),
and visit the [associated website](https://brinda.eu/mof).
> K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym. **[Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression.](https://doi.org/10.1101/2023.04.15.536996)** *bioRxiv* 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996
## 2. Requirements

### 2a) Hardware

MOF-Search requires a standard desktop or laptop computer with a \*nix system,
and it can also run on a cluster. The minimal hardware requirements are **12 GB
RAM** and approximately **120 GB of disk space** (102 GB for the database and
a margin for intermediate files).
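These limits can be checked up front. One possible check on Linux (on OS X,
`sysctl hw.memsize` and `df -h` are the rough equivalents):

```shell
# Total RAM in GB (the pipeline needs >= 12 GB).
awk '/MemTotal/ {printf "RAM: %.1f GB\n", $2 / 1024 / 1024}' /proc/meminfo

# Free disk space where the database will live (~120 GB needed).
df -h .
```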
### 2b) Dependencies

MOF-Search is implemented as a [Snakemake](https://snakemake.github.io)
pipeline, using the Conda system to manage non-standard dependencies. Ensure
you have [Conda](https://docs.conda.io/en/latest/miniconda.html) installed with
the following packages:

* [GNU Time](https://www.gnu.org/software/time/) (on Linux present by default; on OS X, install with `brew install gnu-time`)
* [Python](https://www.python.org/) (>=3.7)
Additionally, MOF-Search uses standard Unix tools like
[cURL](https://curl.se/),
[XZ Utils](https://tukaani.org/xz/), and
[GNU Gzip](https://www.gnu.org/software/gzip/).
These tools are typically included in standard \*nix installations. However, in
minimal setups (e.g., virtualization, continuous integration), you might need
to install them using the corresponding package managers.
## 3. Installation

### 3a) Step 1: Install dependencies

Make sure you have Conda and GNU Time installed. On Linux:

```bash
sudo apt-get install conda
```

On OS X (using Homebrew):

```bash
brew install conda
brew install gnu-time
```

Install Python (>=3.7), Snakemake (>=6.2.0), and Mamba (optional but
recommended) using Conda:

```bash
conda install -y -c bioconda -c conda-forge \
    "python>=3.7" "snakemake>=6.2.0" "mamba>=0.20.0"
```
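A quick sanity check along these lines can confirm that the tools ended up on
`PATH` (the version constraint mirrors the one above; this is a sketch, not
part of the pipeline):

```shell
# Python must be >= 3.7 for the pipeline scripts.
python3 -c 'import sys; assert sys.version_info >= (3, 7), sys.version'

# Report which of the Conda-managed tools are visible on PATH.
for tool in conda mamba snakemake; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $(command -v "$tool")"
    else
        echo "$tool: NOT FOUND"
    fi
done
```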
### 3b) Step 2: Clone the repository

Clone the MOF-Search repository from GitHub and navigate into the directory:
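The commands themselves are elided in this diff view; a typical reconstruction
would be the following (the repository URL is inferred from the license link
further below, so treat it as an assumption):

```shell
# Hypothetical reconstruction of the elided clone commands.
git clone https://github.com/karel-brinda/mof-search
cd mof-search
```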
### 3c) Step 3: Run a simple test

Run the following command to ensure the pipeline works for sample queries and
3 batches (this will also install all additional dependencies using Conda):

```bash
make test
```

Make sure the test returns 0 (success) and that you see the expected output
message:

```
Success! Test run produced the expected output.
```
### 3d) Step 4: Download the database

Download all phylogenetically compressed assemblies and COBS *k*-mer
indexes for the [661k-HQ
collection](https://doi.org/10.1371/journal.pbio.3001421) by:

```bash
make download
```
The downloaded files will be located in the `asms/` and `cobs/` directories.
… control.
## 4. Usage

### 4a) Step 1: Copy or symlink your queries

Remove the default test files or your old files in the `queries/` directory and
copy or symlink (recommended) your query files. The supported input formats are
…
…merged together.

* All non-`ACGT` characters in your query sequences will be translated to `A`.
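For example, with a hypothetical query file name (`queries/` is the directory
named in this step):

```shell
# Remove the default test queries, then symlink your own data (recommended).
rm -f queries/*
ln -s ~/data/my_sample.fastq.gz queries/
ls -l queries/
```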
### 4b) Step 2: Adjust configuration

Edit the [`config.yaml`](config.yaml) file for your desired search. All
available options are documented directly there.
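Configuration changes can also be scripted, which is convenient for repeated
runs. The option name below is purely hypothetical - the real names are
documented in `config.yaml` itself (on OS X, use `sed -i ''` instead of
`sed -i`):

```shell
# Keep a backup, then change a (hypothetical) option non-interactively.
cp config.yaml config.yaml.bak
sed -i 's/^some_option:.*/some_option: new_value/' config.yaml
```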
### 4c) Step 3: Clean up intermediate files

Run `make clean` to clean intermediate files from the previous runs. This
includes COBS matching files, alignment files, and various reports.
### 4d) Step 4: Run the pipeline

Simply run `make`, which will execute Snakemake with the corresponding
parameters. If you want to run the pipeline step by step, run `make match`
followed by `make map`.
### 4e) Step 5: Analyze your results

Check the output files in `output/` (for more info about formats, see
[5c) File formats](#5c-file-formats)).

If the results do not correspond to what you expected and you need to re-adjust
your search parameters, go to Step 2. If only the mapping part is affected by
the changes, you can proceed more rapidly by manually removing the files in
`intermediate/05_map` and `output/` and running `make map` directly.
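Since the output alignments are headerless SAM, standard Unix tools are enough
for a first look - for instance, counting how many alignments each query
received (column 1 of a SAM record is the query name; the file name here is
hypothetical):

```shell
# Count alignments per query, most frequent first.
zcat output/my_sample.sam_summary.gz | cut -f1 | sort | uniq -c | sort -rn | head
```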
## 5. Additional information

### 5a) List of workflow commands

MOF-Search is executed via [GNU Make](https://www.gnu.org/software/make/),
which handles all parameters and passes them to Snakemake.

Here's a list of all implemented commands (to be executed as `make {command}`):

```
…
format           Reformat Python and Snakemake files
```
### 5b) Directories

* `asms/`, `cobs/` Downloaded assemblies and COBS indexes
* `input/` Queries, to be provided within one or more FASTA/FASTQ files,
  possibly gzipped (`.fa`)
* `intermediate/` Intermediate files
    * `00_queries_preprocessed/` Preprocessed queries
    * `01_queries_merged/` Merged queries
    * `02_cobs_decompressed/` Decompressed COBS indexes (temporary, used only
      in the disk mode)
    * `03_match/` COBS matches
    * `04_filter/` Filtered candidates
    * `05_map/` Minimap2 alignments
* `logs/` Logs and benchmarks
* `output/` The resulting files (in a headerless SAM format)
### 5c) File formats

**Input files:** FASTA or FASTQ files, possibly gzipped. The files are
searched for in the `input/` directory, as files with the following suffixes:
`.fa`, `.fasta`, `.fq`, `.fastq` (possibly with `.gz` at the end).

**Output files:**

* `output/{name}.sam_summary.gz`: output alignments in a headerless SAM format
* `output/{name}.sam_summary.stats`: statistics about your computed alignments
  in TSV
### 5d) Running on a cluster

Running on a cluster is much faster, as the jobs produced by this pipeline are
quite light and usually start running as soon as they are scheduled.

**For LSF clusters:**

1. Test whether the pipeline works on an LSF cluster: `make cluster_lsf_test`.
2. Configure your queries and run the full pipeline: `make cluster_lsf`.
### 5e) Known limitations

* **Swapping if the number of queries is too high.** If the number of queries
  is too high, the auxiliary Python scripts start to use too much memory,
  which may result in swapping. Try to keep the number of queries moderate and
  ideally their names short.

* **No support for ambiguous characters in queries.** As the tools used
  internally by MOF-Search support only the nucleotide alphabet, all non-ACGT
  characters in queries are first converted to A.
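The effect on ambiguous IUPAC codes can be reproduced with a one-liner. This
illustrates the behavior described above, not the actual MOF-Search code
(uppercase input assumed):

```shell
# N, R, Y, S, W are ambiguity codes; everything outside {A,C,G,T} becomes A.
printf 'ACGTNRYSW\n' | tr -c 'ACGT\n' 'A'
# prints: ACGTAAAAA
```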
## 6. License

[MIT](https://github.com/karel-brinda/mof-search/blob/master/LICENSE)
## 7. Contacts

* [Karel Brinda](https://brinda.eu) \<[email protected]\>
* [Leandro Lima](https://github.com/leoisl) \<[email protected]\>