Skip to content

Tutorial: strain and species level clustering of MAGs with skani triangle

Jim Shaw edited this page Dec 3, 2022 · 22 revisions

In this tutorial, we will go through a basic introduction to clustering MAGs (or genomes) using skani.

Requirements:

  1. skani is installed and the clustermap_triangle.py script is downloaded for visualizing clustering.
  2. Python3 is installed and seaborn, numpy, and scipy are installed. To install, simply run pip install seaborn numpy scipy. This is needed to visualize the clustering.
  3. (Optional) Mash is installed.
  4. (Optional) FastANI is installed.

1. Download the MAGs for clustering

We will be using MAGs from Pasolli et al. (2019), which constructed hundreds of thousands of MAGs from short-read sequencing.

We will work with a specific set of ~200 genomes that have already been binned at the species level. To get these genomes, do

git clone https://github.com/bluenote-1577/skani-test

The genomes are available in skani-test/2328. These genomes were classified as Alistipes ihumii (also the same set of genomes used in our skani paper; figure 1).

2. Performing all-to-all ANI comparisons

Now we will have to calculate ANI between all genomes. This is equivalent to constructing a pairwise similarity matrix. To do this, simply do

skani triangle skani-test/2328/* -t (threads) > skani_matrix.txt
skani triangle skani-test/2328/* -t (threads) --robust > robust_skani_matrix.txt

#optional if you have Mash installed
mash triangle skani-test/2328/* -p (threads) > mash_matrix.txt

#optional if you have FastANI installed
ls skani-test/2328/* > genome_list.txt
fastANI --rl genome_list.txt --ql genome_list.txt --matrix -o fastani -t (threads)

with the (threads) replaced by the number of threads you want to use for computing the ANI similarity matrix.

The output skani_matrix.txt is a lower-triangular phyllip matrix that we can process. With 20 threads, this took my workstation about ~4 seconds. Mash took about ~1 second. FastANI takes about ~200 seconds for me, so I would skip this commputation if you don't have enough computing resources.

The first line uses skani to construct an all-to-all ANI matrix. The second line usees the --robust option, which calculates the ANI by trimming off the bottom 90th and upper 10th quantile nucleotide identity estimates; we found this option gives slightly better clusterings, but it biases the ANI upwards.

3. Visualizing the clustered heatmap

We provide a helper script called clustermap_triangle.py for clustering lower-triangular similarity matrices that Mash and skani output. To obtain this script, either clone the skani repository or do

wget https://raw.githubusercontent.com/bluenote-1577/skani/main/scripts/clustermap_triangle.py

Make sure the python libraries outlined in the Requirements are installed, then specify one of the following commands to visualize the clustered ANI matrix

python clustermap_triangle.py skani_matrix.txt 3
python clustermap_triangle.py robust_skani_matrix.txt 3

#if the Mash ANI matrix was calculated
python clustermap_triangle.py mash_matrix.txt 3

#if the FastANI matrix was calculated
python clustermap_triangle fastani.matrix 3
#

The 3 argument limits the heatmap's minimum ANI to 100 - 3 = 97%. The bar on the top-left of the plot displays 100 - ANI.

To actually obtain clusters by setting, say, a 99% ANI threshold, look into the fcluster function from scipy, which takes a dendrogram (generated by the script) and outputs clusters. The clustermap_triangle.py script can be easily modified to do so, and can be extended for your own analysis.

(Optional) 4. Discussion: FastANI vs Mash vs skani ANI matrices

If you also used Mash or FastANI to generate ANI matrices, you'll see that the clustering for Mash, skani, and FastANI are very different on this data set. The exact distance matrices are also shown in Fig. 1 of our paper.

Why are the clusterings different?

Mash has issues with incompleteness

Mash's ANI estimation is contingent on genomes being complete. Mash's ANI estimation becomes more biased as the genomes become more incomplete. This fact has been noted by others before, and we also give a brief theoretical discussion in our paper. See the very good discussion by dRep on why Mash has issues with incomplete genomes.

FastANI has issues with low N50

It was noted FastANI's original paper that FastANI is most reliable when N50 > 10 kb. This turns out to be an important assumption.

In this data set, the median N50 is 20375 bp. There are plenty of MAGs with N50 < 10,000, and FastANI estimates that these MAGs have lower ANI comparisons than they actually do. On MAGs with higher N50, e.g. long-read assembled MAGs, FastANI seems to perform much better.

What is a good "true" clustering?

We argue in our paper that for high-ANI computations, ANIm (a mummer based ANI method), available in the pyani package seems to perform the best. This is because mummer is a relatively robust alignment method (for high ANI genomes) and is length agnostic. We showed that skani's clustering correlates much better with ANIm's in the paper, which is what we expect from the qualitative outputs. However, ANIm takes about ~400 times longer than skani to run.