Merge branch 'main' of github.com:bluenote-1577/skani into main
bluenote-1577 committed Jun 13, 2023
2 parents 2237fc9 + cc61132 commit d9da3f1
Showing 2 changed files with 55 additions and 41 deletions.
47 changes: 37 additions & 10 deletions CHANGELOG.md
@@ -1,15 +1,42 @@
v0.1.3
### v0.1.3 (pre)released - 2023-05-09

- Fixed a bug where memory was blowing up in dist and triangle when the marker-index was activated.
- For all modes, implemented writing outputs during processing instead of storing all results until the end of the command.
- Changed the marker index hash table population method. Used to overestimate memory usage slightly.
- New help message for marker parameters. Turns out that for small genomes, having more markers may make filtering significantly better.
- Added -i option to sketch so you can sketch individual records in multifastas -- does not work for search yet though, only for sketching.
#### Major
* Fixed a bug where memory was blowing up in `dist` and `triangle` when the marker-index was activated. For big datasets, there could be > 100 GBs of wasted memory.
* skani now outputs intermediate results after processing each batch of 5000 queries. **This means that outputs may no longer be deterministically ordered if there are > 5000 genomes**, but you can sort the output file to get a deterministic order, e.g. ``skani triangle *.fa | sort -k 3 -n > sorted_skani_result.txt``.

v0.1.2
#### Minor
* Changed the marker index hash table population method. Used to overestimate memory usage slightly.
* New help message for marker parameters. Turns out that for small genomes, having more markers may make filtering significantly better.
* Added -i option to sketch so you can sketch individual records in multifastas -- does not work for search yet though, only for sketching.

- Added medium preset.
- Added distance argument in triangle for distance instead of similarity matrices.
- Changed --marker-index option to --no-marker-index, which is a much more sane option.
### v0.1.2 released - 2023-04-28.

Small fixes.

* Added `--medium` pre-set, which is just `-c 70`. Seems to work okay for comparing fragmented genomes.
* **BREAKING**: Changed `--marker-index` to `--no-marker-index` as a more sane option.
* Added `--distance` option to `skani triangle` to output distance matrix (i.e. 100 - ANI) instead of similarity matrix.
* Misc. help message fixes
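
The `--distance` note above defines distance as 100 - ANI. As a rough illustration of that arithmetic only, the sketch below applies it to a few made-up tab-separated rows (file names and ANI values are hypothetical, not real skani output, and the three-column layout is an assumption for the example):

```sh
# Use the C locale so decimal points are parsed the same way everywhere.
export LC_ALL=C

# Hypothetical rows: reference, query, ANI (tab-separated).
ani_rows=$(printf 'a.fa\tb.fa\t99.39\na.fa\tc.fa\t97.25\nc.fa\td.fa\t95.10\n')

# Replace the ANI column with distance = 100 - ANI, rounded to two decimals.
distance_rows=$(printf '%s\n' "$ani_rows" |
  awk -F '\t' '{ printf "%s\t%s\t%.2f\n", $1, $2, 100 - $3 }')

printf '%s\n' "$distance_rows"
```

This only mirrors the stated definition; for actual matrices, the `--distance` option of `skani triangle` is the intended route.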

### v0.1.1 released - 2023-04-09.

Small fixes.

* Made aligned fraction in `triangle mode` a full matrix by default. This is not a symmetric matrix since AF is not symmetric.
* Misc. help message fixes

### v0.1.0 released - 2023-02-07.

We added new experiments on the revised version of our preprint (Extended Data Figs 11-14). We show skani has quite good AF correlation with MUMmer, and that it works decently on simple eukaryotic MAGs, especially with the `--slow` option (see below).

#### Major

* **ANI debiasing added** - skani now uses a debiasing step with a regression model trained on MAGs to give more accurate ANIs. Old version gave robust, but slightly overestimated ANIs, especially around 95-97% range. Debiasing is enabled by default, but can be turned off with ``--no-learned-ani``.
* **More accurate aligned fraction** - chaining algorithm changed to give a more accurate aligned fraction (AF) estimate. The previous version had more variance and underestimated AF for certain assemblies.

#### Minor

* **Small contig/genome defaults made better** - should be more sensitive so that they don't get filtered by default.
* **Repetitive k-mer masking made better** - smarter settings and should work better for eukaryotic genomes; shouldn't affect prokaryotic genomes much.
* **`--fast` and `--slow` mode added** - alias for `-c 200` and `-c 30` respectively.
* **More non x86_64 builds should work** - there was a bug before where skani would be dysfunctional on non x86_64 architectures. It seems to at least build on ARM64 architectures successfully now.
49 changes: 18 additions & 31 deletions README.md
@@ -55,8 +55,9 @@ Note: the binary is compiled with a different set of libraries (musl instead of
See the [Releases](https://github.com/bluenote-1577/skani/releases) page for obtaining specific versions of skani.


#### Option 3: Conda (conda version: 0.1.1 - source version: 0.1.2)

#### Option 3: Conda (source version: 0.1.3)
[![Anaconda-Server Badge](https://anaconda.org/bioconda/skani/badges/version.svg)](https://anaconda.org/bioconda/skani)
[![Anaconda-Server Badge](https://anaconda.org/bioconda/skani/badges/latest_release_date.svg)](https://anaconda.org/bioconda/skani)
```sh
conda install -c bioconda skani
```
@@ -72,16 +73,18 @@ skani dist genome2.fa genome1.fa -t 5
# compare multiple genomes
skani dist -q query1.fa query2.fa -r reference1.fa reference2.fa -o all-to-all_results.txt

# compare individual fasta records (e.g. contigs)
skani dist --qi -q assembly1.fa --ri -r assembly2.fa

# construct database and do memory-efficient search
skani sketch genomes_to_search/* -o database
skani search query1.fa query2.fa ... -d database

# use sketch from "skani sketch" output as drop-in replacement
skani dist database/query.fa.sketch database/ref.fa.sketch

# construct similarity matrix for all genomes in folder
# construct similarity matrix/edge list for all genomes in folder
skani triangle genome_folder/* > skani_ani_matrix.txt
# output an edge list instead of a matrix for big computations
skani triangle genome_folder/* -E > skani_ani_edge_list.txt

# we provide a script in this repository for clustering/visualizing distance matrices.
@@ -107,7 +110,7 @@ For more information about using the specific skani subcommands, see the [guide
See the advanced usage guide linked above for more information about topics such as:

* optimizing sensitivity/speed of skani
* using skani for long-reads
* optimizing skani for long-reads or contigs
* making skani more memory efficient for huge data sets

## Output
@@ -127,43 +130,27 @@ refs/e.coli-EC590.fasta refs/e.coli-K12.fasta 99.39 93.95 93.37 NZ_CP016182.2 Es
- Aligned_fraction_query/reference: fraction of query/reference covered by alignments.
- Ref/Query_name: the id of the first record in the reference/query file.

The order of results is dependent on the command and not guaranteed to be deterministic when > 5000 query genomes are present. `dist` and `search` try to place the highest ANI results first.
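
If a stable order matters, the changelog for v0.1.3 suggests piping results through `sort -k 3 -n`. The sketch below shows why that helps, using two made-up result files with the same rows in different orders (file names, values, and the three-column tab-separated layout are assumptions for illustration, not real skani output):

```sh
# Use the C locale so numeric sorting and tie-breaking are reproducible.
export LC_ALL=C
workdir=$(mktemp -d)
tab=$(printf '\t')

# Two hypothetical runs: same rows, different arrival order.
printf 'a.fa\tb.fa\t99.39\nc.fa\td.fa\t95.10\na.fa\tc.fa\t97.25\n' > "$workdir/run1.txt"
printf 'a.fa\tc.fa\t97.25\na.fa\tb.fa\t99.39\nc.fa\td.fa\t95.10\n' > "$workdir/run2.txt"

# Sort numerically on column 3 (ANI); ties fall back to whole-line comparison.
sort -t "$tab" -k 3 -n "$workdir/run1.txt" > "$workdir/sorted1.txt"
sort -t "$tab" -k 3 -n "$workdir/run2.txt" > "$workdir/sorted2.txt"

# After sorting, both files are byte-identical.
cmp "$workdir/sorted1.txt" "$workdir/sorted2.txt" && echo "identical"
```

Real output may include a header line, which would need to be set aside before sorting; that step is left out of this sketch.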

## Citation

Jim Shaw and Yun William Yu. Fast and robust metagenomic sequence comparison through sparse chaining with skani. bioRxiv (2023). https://doi.org/10.1101/2023.01.18.524587. Submitted.

## Updates

### v0.1.2 released - 2023-04-28.

Small fixes.

* Added `--medium` pre-set, which is just `-c 70`. Seems to work okay for comparing fragmented genomes.
* **BREAKING**: Changed `--marker-index` to `--no-marker-index` as a more sane option.
* Added `--distance` option to `skani triangle` to output distance matrix (i.e. 100 - ANI) instead of similarity matrix.
* Misc. help message fixes

### v0.1.1 released - 2023-04-09.

Small fixes.

* Made aligned fraction in `triangle mode` a full matrix by default. This is not a symmetric matrix since AF is not symmetric.
* Misc. help message fixes

### v0.1.0 released - 2023-02-07.

We added new experiments on the revised version of our preprint (Extended Data Figs 11-14). We show skani has quite good AF correlation with MUMmer, and that it works decently on simple eukaryotic MAGs, especially with the `--slow` option (see below).
### v0.1.3 released - 2023-05-09

#### Major
* Fixed a bug where memory was blowing up in `dist` and `triangle` when the marker-index was activated. For big datasets, there could be > 100 GBs of wasted memory.
* skani now outputs intermediate results after processing each batch of 5000 queries. **This means that outputs may no longer be deterministically ordered if there are > 5000 genomes**, but you can sort the output file to get a deterministic order, e.g. ``skani triangle *.fa | sort -k 3 -n > sorted_skani_result.txt``.

* **ANI debiasing added** - skani now uses a debiasing step with a regression model trained on MAGs to give more accurate ANIs. Old version gave robust, but slightly overestimated ANIs, especially around 95-97% range. Debiasing is enabled by default, but can be turned off with ``--no-learned-ani``.
* **More accurate aligned fraction** - chaining algorithm changed to give a more accurate aligned fraction (AF) estimate. The previous version had more variance and underestimated AF for certain assemblies.
#### Minor
* Changed the marker index hash table population method. Used to overestimate memory usage slightly.
* New help message for marker parameters. Turns out that for small genomes, having more markers may make filtering significantly better.
* Added -i option to sketch so you can sketch individual records in multifastas -- does not work for search yet though, only for sketching.

#### Minor

* **Small contig/genome defaults made better** - should be more sensitive so that they don't get filtered by default.
* **Repetitive k-mer masking made better** - smarter settings and should work better for eukaryotic genomes; shouldn't affect prokaryotic genomes much.
* **`--fast` and `--slow` mode added** - alias for `-c 200` and `-c 30` respectively.
* **More non x86_64 builds should work** - there was a bug before where skani would be dysfunctional on non x86_64 architectures. It seems to at least build on ARM64 architectures successfully now.
See the [CHANGELOG](https://github.com/bluenote-1577/skani/blob/main/CHANGELOG.md) for skani's full version history.

## Feature requests, issues
