From e2380c9f92b6524861ab3f3942391d0b4981303e Mon Sep 17 00:00:00 2001 From: Jim Shaw Date: Tue, 9 May 2023 17:40:56 -0700 Subject: [PATCH 1/7] Update CHANGELOG.md --- CHANGELOG.md | 47 +++++++++++++++++++++++++++++++++++++---------- 1 file changed, 37 insertions(+), 10 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index a369c68..0e24945 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,15 +1,42 @@ -v0.1.3 +### v0.1.3 (pre)released - 2023-05-09 -- Fixed a bug where memory was blowing up in dist and triangle when the marker-index was activated. -- For all modes, implemented writing outputs during processing instead of storing all results until the end of the command. -- Changed the marker index hash table population method. Used to overestimate memory usage slightly. -- New help message for marker parameters. Turns out that for small genomes, having more markers may make filtering significantly better. -- Added -i option to sketch so you can sketch individual records in multifastas -- does not work for search yet though, only for sketching. +#### Major +* Fixed a bug where memory was blowing up in `dist` and `triangle` when the marker-index was activated. For big datasets, there could be > 100 GBs of wasted memory. +* skani now outputs intermediate results after processing each batch of 5000 queries. **This will mean that outputs may no longer be deterministically ordered if there are > 5000 genomes**, but you can sort the output file to get deterministic outputs (`skani triangle *.fa | sort -k 3` will guarantee deterministic output order). -v0.1.2 +#### Minor +* Changed the marker index hash table population method. Used to overestimate memory usage slightly. +* New help message for marker parameters. Turns out that for small genomes, having more markers may make filtering significantly better. +* Added -i option to sketch so you can sketch individual records in multifastas -- does not work for search yet though, only for sketching. -- Added medium preset. 
-- Added distance argument in triangle for distance instead of similarity matrices. -- Changed --marker-index option to --no-marker-index, which is a much more sane option. +### v0.1.2 released - 2023-04-28. +Small fixes. +* Added `--medium` pre-set, which is just `-c 70`. Seems to work okay for comparing fragmented genomes. +* **BREAKING**: Changed `--marker-index` to `--no-marker-index` as a more sane option. +* Added `--distance` option to `skani triangle` to output distance matrix (i.e. 100 - ANI) instead of similarity matrix. +* Misc. help message fixes + +### v0.1.1 released - 2023-04-09. + +Small fixes. + +* Made aligned fraction in `triangle mode` a full matrix by default. This is not a symmetric matrix since AF is not symmetric. +* Misc. help message fixes + +### v0.1.0 released - 2023-02-07. + +We added new experiments on the revised version of our preprint (Extended Data Figs 11-14). We show skani has quite good AF correlation with MUMmer, and that it works decently on simple eukaryotic MAGs, especially with the `--slow` option (see below). + +#### Major + +* **ANI debiasing added** - skani now uses a debiasing step with a regression model trained on MAGs to give more accurate ANIs. Old version gave robust, but slightly overestimated ANIs, especially around 95-97% range. Debiasing is enabled by default, but can be turned off with ``--no-learned-ani``. +* **More accurate aligned fraction** - chaining algorithm changed to give a more accurate aligned fraction (AF) estimate. The previous version had more variance and underestimated AF for certain assemblies. + +#### Minor + +* **Small contig/genome defaults made better** - should be more sensitive so that they don't get filtered by default. +* **Repetitive k-mer masking made better** - smarter settings and should work better for eukaryotic genomes; shouldn't affect prokaryotic genomes much. +* **`--fast` and `--slow` mode added** - alias for `-c 200` and `-c 30` respectively. 
+* **More non x86_64 builds should work** - there was a bug before where skani would be dysfunctional on non x86_64 architectures. It seems to at least build on ARM64 architectures successfully now. From 22b0530e43464cd7b6c509e0cb4bba1d0e4bd72a Mon Sep 17 00:00:00 2001 From: Jim Shaw Date: Tue, 9 May 2023 17:49:52 -0700 Subject: [PATCH 2/7] Update README.md --- README.md | 34 ++++++++-------------------------- 1 file changed, 8 insertions(+), 26 deletions(-) diff --git a/README.md b/README.md index c4b1904..6b402e3 100644 --- a/README.md +++ b/README.md @@ -133,37 +133,19 @@ Jim Shaw and Yun William Yu. Fast and robust metagenomic sequence comparison thr ## Updates -### v0.1.2 released - 2023-04-28. - -Small fixes. - -* Added `--medium` pre-set, which is just `-c 70`. Seems to work okay for comparing fragmented genomes. -* **BREAKING**: Changed `--marker-index` to `--no-marker-index` as a more sane option. -* Added `--distance` option to `skani triangle` to output distance matrix (i.e. 100 - ANI) instead of similarity matrix. -* Misc. help message fixes - -### v0.1.1 released - 2023-04-09. - -Small fixes. - -* Made aligned fraction in `triangle mode` a full matrix by default. This is not a symmetric matrix since AF is not symmetric. -* Misc. help message fixes - -### v0.1.0 released - 2023-02-07. - -We added new experiments on the revised version of our preprint (Extended Data Figs 11-14). We show skani has quite good AF correlation with MUMmer, and that it works decently on simple eukaryotic MAGs, especially with the `--slow` option (see below). +### v0.1.3 (pre)released - 2023-05-09, conda update to follow at a later date #### Major +* Fixed a bug where memory was blowing up in `dist` and `triangle` when the marker-index was activated. For big datasets, there could be > 100 GBs of wasted memory. +* skani now outputs intermediate results after processing each batch of 5000 queries. 
**This will mean that outputs may no longer be deterministically ordered if there are > 5000 genomes**, but you can sort the output file to get deterministic outputs (`skani triangle *.fa | sort -k 3 -n > sorted_skani_result.txt` will guarantee deterministic output order). -* **ANI debiasing added** - skani now uses a debiasing step with a regression model trained on MAGs to give more accurate ANIs. Old version gave robust, but slightly overestimated ANIs, especially around 95-97% range. Debiasing is enabled by default, but can be turned off with ``--no-learned-ani``. -* **More accurate aligned fraction** - chaining algorithm changed to give a more accurate aligned fraction (AF) estimate. The previous version had more variance and underestimated AF for certain assemblies. +#### Minor +* Changed the marker index hash table population method. Used to overestimate memory usage slightly. +* New help message for marker parameters. Turns out that for small genomes, having more markers may make filtering significantly better. +* Added -i option to sketch so you can sketch individual records in multifastas -- does not work for search yet though, only for sketching. -#### Minor -* **Small contig/genome defaults made better** - should be more sensitive so that they don't get filtered by default. -* **Repetitive k-mer masking made better** - smarter settings and should work better for eukaryotic genomes; shouldn't affect prokaryotic genomes much. -* **`--fast` and `--slow` mode added** - alias for `-c 200` and `-c 30` respectively. -* **More non x86_64 builds should work** - there was a bug before where skani would be dysfunctional on non x86_64 architectures. It seems to at least build on ARM64 architectures successfully now. +See the [CHANGELOG](https://github.com/bluenote-1577/skani/blob/main/CHANGELOG.md) for skani's full version history. 
## Feature requests, issues From ada044d5c4f02da5bb345026442a72c7530a4b5b Mon Sep 17 00:00:00 2001 From: Jim Shaw Date: Tue, 9 May 2023 17:50:39 -0700 Subject: [PATCH 3/7] Update CHANGELOG.md --- CHANGELOG.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 0e24945..b2dccb5 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,7 +2,7 @@ #### Major * Fixed a bug where memory was blowing up in `dist` and `triangle` when the marker-index was activated. For big datasets, there could be > 100 GBs of wasted memory. -* skani now outputs intermediate results after processing each batch of 5000 queries. **This will mean that outputs may no longer be deterministically ordered if there are > 5000 genomes**, but you can sort the output file to get deterministic outputs (`skani triangle *.fa | sort -k 3` will guarantee deterministic output order). +* skani now outputs intermediate results after processing each batch of 5000 queries. **This will mean that outputs may no longer be deterministically ordered if there are > 5000 genomes**, but you can sort the output file to get deterministic outputs, i.e. ``skani triangle *.fa | sort -k 3 -n > sorted_skani_result.txt`` will guarantee deterministic output order. #### Minor * Changed the marker index hash table population method. Used to overestimate memory usage slightly. From 77240415fac74baf275997add3f65fca06f018c5 Mon Sep 17 00:00:00 2001 From: Jim Shaw Date: Tue, 9 May 2023 17:53:54 -0700 Subject: [PATCH 4/7] Update README.md --- README.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 6b402e3..a1dce9f 100644 --- a/README.md +++ b/README.md @@ -55,7 +55,7 @@ Note: the binary is compiled with a different set of libraries (musl instead of See the [Releases](https://github.com/bluenote-1577/skani/releases) page for obtaining specific versions of skani. 
-#### Option 3: Conda (conda version: 0.1.1 - source version: 0.1.2) +#### Option 3: Conda (conda version: 0.1.2 - source version: 0.1.3) ```sh conda install -c bioconda skani @@ -79,9 +79,8 @@ skani search query1.fa query2.fa ... -d database # use sketch from "skani sketch" output as drop-in replacement skani dist database/query.fa.sketch database/ref.fa.sketch -# construct similarity matrix for all genomes in folder +# construct similarity matrix/edge list for all genomes in folder skani triangle genome_folder/* > skani_ani_matrix.txt -# output an edge list instead of a matrix for big computations skani triangle genome_folder/* -E > skani_ani_edge_list.txt # we provide a script in this repository for clustering/visualizing distance matrices. @@ -127,6 +126,8 @@ refs/e.coli-EC590.fasta refs/e.coli-K12.fasta 99.39 93.95 93.37 NZ_CP016182.2 Es - Aligned_fraction_query/reference: fraction of query/reference covered by alignments. - Ref/Query_name: the id of the first record in the reference/query file. +The order of results is dependent on the command and not guaranteed to be deterministic when > 5000 query genomes are present. `dist` and `search` try to place the highest ANI results first. + ## Citation Jim Shaw and Yun William Yu. Fast and robust metagenomic sequence comparison through sparse chaining with skani. bioRxiv (2023). https://doi.org/10.1101/2023.01.18.524587. Submitted. @@ -137,7 +138,7 @@ Jim Shaw and Yun William Yu. Fast and robust metagenomic sequence comparison thr #### Major * Fixed a bug where memory was blowing up in `dist` and `triangle` when the marker-index was activated. For big datasets, there could be > 100 GBs of wasted memory. -* skani now outputs intermediate results after processing each batch of 5000 queries. 
**This will mean that outputs may no longer be deterministically ordered if there are > 5000 genomes**, but you can sort the output file to get deterministic outputs (`skani triangle *.fa | sort -k 3 -n > sorted_skani_result.txt` will guarantee deterministic output order). +* skani now outputs intermediate results after processing each batch of 5000 queries. **This will mean that outputs may no longer be deterministically ordered if there are > 5000 genomes**, but you can sort the output file to get deterministic outputs, i.e. ``skani triangle *.fa | sort -k 3 -n > sorted_skani_result.txt`` will guarantee deterministic output order. #### Minor * Changed the marker index hash table population method. Used to overestimate memory usage slightly. From da556f0d94472ccd5e8ca9389e88fe07db080526 Mon Sep 17 00:00:00 2001 From: Jim Shaw Date: Wed, 10 May 2023 17:18:46 -0700 Subject: [PATCH 5/7] Update README.md --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index a1dce9f..c7b107a 100644 --- a/README.md +++ b/README.md @@ -55,8 +55,9 @@ Note: the binary is compiled with a different set of libraries (musl instead of See the [Releases](https://github.com/bluenote-1577/skani/releases) page for obtaining specific versions of skani. 
-#### Option 3: Conda (conda version: 0.1.2 - source version: 0.1.3) - +#### Option 3: Conda (source version: 0.1.3) +[![Anaconda-Server Badge](https://anaconda.org/bioconda/skani/badges/version.svg)](https://anaconda.org/bioconda/skani) +[![Anaconda-Server Badge](https://anaconda.org/bioconda/skani/badges/latest_release_date.svg)](https://anaconda.org/bioconda/skani) ```sh conda install -c bioconda skani ``` From 05b4480c182094b0f1e7b7b071765da0e4fb065c Mon Sep 17 00:00:00 2001 From: Jim Shaw Date: Thu, 18 May 2023 10:58:54 +0900 Subject: [PATCH 6/7] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c7b107a..57ce01a 100644 --- a/README.md +++ b/README.md @@ -135,7 +135,7 @@ Jim Shaw and Yun William Yu. Fast and robust metagenomic sequence comparison thr ## Updates -### v0.1.3 (pre)released - 2023-05-09, conda update to follow at a later date +### v0.1.3 released - 2023-05-09 #### Major * Fixed a bug where memory was blowing up in `dist` and `triangle` when the marker-index was activated. For big datasets, there could be > 100 GBs of wasted memory. From cc6113232db9cf20143626a613d4207385d7f71d Mon Sep 17 00:00:00 2001 From: Jim Shaw Date: Mon, 12 Jun 2023 13:21:27 -0700 Subject: [PATCH 7/7] Update README.md --- README.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 57ce01a..a000606 100644 --- a/README.md +++ b/README.md @@ -73,6 +73,9 @@ skani dist genome2.fa genome1.fa -t 5 # compare multiple genomes skani dist -q query1.fa query2.fa -r reference1.fa reference2.fa -o all-to-all_results.txt +# compare individual fasta records (e.g. contigs) +skani dist --qi -q assembly1.fa --ri -r assembly2.fa + # construct database and do memory-efficient search skani sketch genomes_to_search/* -o database skani search query1.fa query2.fa ... 
-d database @@ -107,7 +110,7 @@ For more information about using the specific skani subcommands, see the [guide See the advanced usage guide linked above for more information about topics such as: * optimizing sensitivity/speed of skani -* using skani for long-reads +* optimizing skani for long-reads or contigs * making skani for memory efficient for huge data sets ## Output