skani is not specifically designed for comparing long reads or small contigs, but it seems to work relatively well for ANI when the reads/contigs are long enough.
- skani cannot classify short reads. Use a taxonomic classifier such as kraken for this.
- skani cannot compare collections of short reads. Use Mash or sourmash for this.
- skani cannot compute AAI for long reads.
For small contigs or long reads, here are some suggestions:
- Make sure to use the `--qi` option for `skani search` or `skani dist` if your contigs/reads are all in one file.
- `skani dist` will be faster than `skani search`, since the bottleneck will be loading genomes into memory.
For parameters:
- The default marker size `-m` is set to 1000, so we take one marker per every 1000 k-mers. A good rule of thumb is that you want at least 20 markers on average, so set `-m` < avg_read_length / 20.
- Set `-c` to lower values, e.g. 60, for noisy long reads (mean identity < 95%). The longer and higher-identity the reads, the higher `-c` can be.
- skani currently loads the entire file into memory instead of processing one read at a time. Consider splitting large sets of reads.
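The marker rule of thumb can be worked out numerically. With one marker per `m` k-mers, a read of length L carries roughly L / m markers, so asking for at least 20 markers bounds `m` from above by L / 20. The sketch below is only illustrative: the helper name `max_marker_spacing` and the read lengths are hypothetical and not part of skani.

```python
from statistics import mean

def max_marker_spacing(read_lengths, min_markers=20):
    """Largest -m that still leaves about `min_markers` markers on an average read."""
    avg_len = mean(read_lengths)
    # One marker per m k-mers gives roughly avg_len / m markers per read,
    # so m must not exceed avg_len / min_markers.
    return max(1, int(avg_len // min_markers))

# Hypothetical read lengths in bases (made up, not real data).
lengths = [8_000, 12_000, 10_000]
print(max_marker_spacing(lengths))
```

Any value of `-m` at or below the printed number keeps the average read at 20 or more markers under this approximation.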
The `--marker-index` option is available for `skani dist` and `skani search`. This loads all marker k-mers into a hash table for constant-time filtering. It is turned off if fewer than 100 query files are input or when using the `--qi` option. Otherwise, it is turned on automatically.
Building the table can take up to a minute (for large databases), and the table itself is ~10 GB for 65000 genomes with default parameters. If memory is an issue, consider raising the `-m` option: the memory used by this table is inversely proportional to `-m`.
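As a rough illustration of the numbers above, the following sketch encodes the automatic on/off rule and scales the quoted ~10 GB figure. It assumes the table grows linearly with the number of genomes, which the text does not state; only the inverse dependence on `-m` comes from the text. Both function names are hypothetical.

```python
def marker_index_auto(n_query_files, uses_qi):
    """Rule from the text: off with fewer than 100 query files or with --qi, else on."""
    return n_query_files >= 100 and not uses_qi

def marker_table_gb(n_genomes, m=1000, ref_gb=10.0, ref_genomes=65_000, ref_m=1000):
    """Back-of-envelope table size, anchored at ~10 GB for 65000 genomes with m = 1000.

    Assumption: size is linear in the number of genomes (not stated in the text)
    and inversely proportional to m (stated in the text).
    """
    return ref_gb * (n_genomes / ref_genomes) * (ref_m / m)

print(marker_index_auto(50, uses_qi=False))
print(marker_index_auto(500, uses_qi=True))
print(marker_table_gb(65_000, m=2000))
```

Treat the output as an order-of-magnitude estimate only; actual memory depends on the genomes in the database.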
If you want skani to run faster, the main parameter to adjust is the `-c` parameter. skani's running time and memory usage are inversely proportional to c, so increasing c by 2x means roughly 2x faster and less memory. As a default, c = 120 for ANI and c = 15 for AAI.
ANI: for genomes of ANI > 95%, c can comfortably be raised to 200 or even 300, but the aligned fraction may get less accurate.
AAI: for genomes of AAI > 65%, c can be raised to 30 and results remain relatively accurate. Results degrade a bit after c gets past 40.
However, decreasing c may not necessarily improve ANI/AAI accuracy for > 85% ANI genomes, since many other default algorithm parameters are designed around these default values. Furthermore, increasing c means that distant genomes will no longer be comparable; see the section on Comparing lower ANI/AAI genomes.
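To make the inverse scaling concrete, here is a small sketch that expresses time and memory relative to the default `c`. It assumes the relationship is exactly inversely proportional, which is an idealization of the statement above; `relative_cost` is a hypothetical helper, not a skani function.

```python
def relative_cost(c, default_c=120):
    """Approximate time/memory relative to the default, assuming cost ~ 1 / c."""
    return default_c / c

# ANI: doubling c from the default of 120 roughly halves the cost.
print(relative_cost(240))
# ANI: lowering c to 30 for distant genomes costs roughly 4x more.
print(relative_cost(30))
# AAI: default c is 15, so c = 30 roughly halves the cost.
print(relative_cost(30, default_c=15))
```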
`skani triangle` should be used for all-to-all comparisons on reasonably sized data sets. However, it loads all genome indices into memory, so RAM may be an issue. If RAM is an issue, consider:
- Pre-sketching using `skani sketch -l list_of_genomes.txt -o sketched_genomes` and running `skani search -d sketched_genomes -l list_of_genomes.txt -o output` to do slower but low-memory all-to-all comparisons.
- Raising the `-c` parameter; see the above section on the `-c` parameter.
- Raising the `-m` parameter for faster screens. It defaults to 1000; 2000 is reasonable for most bacterial genomes, but may lose sensitivity on small genomes such as viruses.
skani focuses on ANI/AAI comparisons for genomes with > 85% ANI and > 60% AAI. To get more accurate results for low ANI/AAI values, one should use a lower value for `c`.
For example, the supplied genome `refs/MN-03.fa` is a Klebsiella pneumoniae genome, and running `skani dist refs/MN-03.fa refs/e.coli-K12.fa` returns nothing because the two genomes do not have a good enough alignment. However, `skani dist refs/MN-03.fa refs/e.coli-K12.fa -c 30` returns an ANI estimate of ~79%.
For distant genomes, the aligned fraction output becomes more accurate as `c` gets smaller. However, decreasing `c` may not necessarily make high ANI calculations more accurate. Nevertheless, I would not recommend ANI comparisons for genomes with < 75% ANI, and advise using skani's AAI method instead, which is tuned for sensitive comparisons by default.
The option `-s` sets an approximate ANI/AAI cutoff. Computations proceed only if the putative ANI (obtained by k-mer max-containment index) is higher than `-s`. By default, this is 0.8 (80%) for ANI and 0.6 (60%) for AAI.
You can use a higher value of `-s` if you're only interested in comparing more similar strains.
This cutoff is only approximate. If the true predicted ANI is greater than `-s` but the putative ANI is smaller than `-s`, the calculation does not proceed. Therefore, if `-s` is set too high, you'll lose sensitivity. The reverse also holds: a putative ANI can be greater than `-s` while the true predicted ANI is less than `-s`, in which case the calculation still proceeds.
We don't recommend a lower value of `-s` unless you know what you're doing, since ANI/AAI calculations under 80%/60% will be bad with default parameters.
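The screening behaviour can be illustrated with a toy decision function. This is a simplified sketch of the rule described above, not skani's implementation; `passes_screen` and the ANI values are invented for illustration.

```python
def passes_screen(putative_ani, s=0.80):
    """Return True if the comparison proceeds: the putative ANI must exceed -s."""
    return putative_ani > s

# (putative ANI, true predicted ANI); values are made up for illustration.
examples = [
    (0.79, 0.82),  # predicted is above the cutoff, but the screen skips it
    (0.81, 0.78),  # predicted is below the cutoff, but it is still computed
    (0.90, 0.91),  # both above the cutoff
]
for putative, predicted in examples:
    action = "computed" if passes_screen(putative) else "skipped"
    print(f"putative={putative:.2f} predicted={predicted:.2f} -> {action}")
```

The first case is the loss of sensitivity that comes with a high `-s`; the second shows why pairs below the cutoff can still appear in the output.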