Viratax was developed for the ICTV Challenge (https://ictv-vbeg.github.io/ICTV-TaxonomyChallenge/).
See results/results.csv
.
To re-generate results/results.csv
, choose one of the following methods.
Clone the repository and execute ./RUNME.bash
from the scripts/
directory.
git clone https://github.com/rcedgar/viratax.git cd scripts ./RUNME.bash
There is a Dockerfile in the docker/
directory. To build the container, run ./build_container.bash
. To run the container, use ./runit.bash
. The generated results.csv
file will be written to the docker/
directory.
cd viratax/docker ./build_container.bash ./runit.bash
To reproduce the pre-computed results exactly, you need minimap2
2.24-r1122, diamond
v2.0.14 and EMBOSS getorf
v6.6.0.0. For convenience, these binaries are included in the bin/
directory.
Also required are python3
to run scripts (any v3 should work, I used v3.10.12) and wget
to download data.
On a 32-core Intel i9-14900K Linux server, the entire pipeline from downloading data through final classification takes about 10 minutes. If you run on stand-alone Linux, the data will only be downloaded once. The second time RUNME.bash
is executed it will perform classification using the reference data and query sequences previously downloaded. If you use Docker, the file system is not persistent so the query and reference data must be downloaded from scratch each time. RAM requirements are determined by diamond
and minimap2
which are memory-efficient by modern standards; how much is needed depends on how many threads you are running but at a guess 8Gb or 16Gb is probably more than enough.
The pipeline should give essentially identical results if run twice on the same data there are no randomized procedures like boostrapping. Each query sequence is processed independently, so the results for any given query is reproducible regardless of other contigs present.
The Challenge rules specify that ICTV release 39 should be used, so there is no provision here for updating the reference. The process of creating a reference is mostly automated, though some manual curation is needed to deal with a few minor issues such as idiosyncrasies in the VMR spreadsheet (VMR_MSL39_v1
was used here). If there is interest in using a different VMR, please open an issue in the GitHub repository and I'll document the procedure.
Viratax was designed to be a fast, lightweight classifier which hopefully performs reasonably well on the Challenge. It emphasizes speed, simplicity and reproducibility over maximizing classification accuracy, which looks to be a very hard problem both in terms of implementation (phylogeny and trait prediction from contigs is often very difficult and clade-specific) and benchmarking (how can you verify if the classification of a highly diverged contig is correct?).
The idea is to generalize rules of thumb relating sequence identity to taxonomic rank. For example, 90% aa or nt identity is often used as a species threshold. Similarly, 75% aa identity is roughly genus, and 50% aa identity is roughly family. These thresholds are gene- and taxon-specific; e.g. 90% aa identity for species is often applied to RNA virus RdRp while other genes diverge more quickly. To address this, every gene in every reference genome (in practice, every ORF) is aligned against all other reference genes to determine ORF-specific thresholds. ORFs in a query contig are aligned to ORFs from reference genomes, and a consensus is taken over aligned ORFs to determine the classification. The confidence estimate reflects identities of the top hits (higher identity gives higher confidence) and consistency (lower consensus gives lower confidence). A fast nucleotide search is performed first to identify hits which are consistent with a species match, i.e. >90% nt identity with >90% coverage of the contig.