Skip to content

Reference-based alignment and sequence database search of SARS-CoV-2 genomes.

License

GPL-3.0, GPL-3.0 licenses found

Licenses found

GPL-3.0
LICENSE
GPL-3.0
COPYING
Notifications You must be signed in to change notification settings

quadram-institute-bioscience/uvaia

Repository files navigation

Uvaia

Leonardo de Oliveira Martins1 Alison E Mather1,2 Andrew J Page1
1. Quadram Institute Bioscience, Norwich Research Park, NR4 7UQ, UK

2. University of East Anglia, Norfolk, UK

License: GPL v3 DOI

Latest stable version (conda etc.): v2.0.1 (use long options --trim and --nthreads if needed, instead of short option -t)

Current version (source code only): v2.0.2

Preprint: https://www.biorxiv.org/content/10.1101/2023.01.31.526458v1

peerJ article: https://peerj.com/articles/16890/

Video tutorials: https://www.youtube.com/playlist?list=PLPZ2aSS2ApqoU6-FCd2H035uJHnLu9fTL

Introduction

Uvaia is a program for pairwise reference-based alignment, and subsequent search against an aligned database. The alignment uses the promising WFA library implemented by Santiago Marco-Sola, and the database search is based on score distances from my biomcmc-lib library. The first versions used the kseq.h library, by Heng Li, for reading fasta files, but it currently relies on general compression libraries available on biomcmc-lib. In particular all functions should work with XZ compressed files for optimal compression.

Uvaia has been developed to help with SARS-CoV-2 analysis, being used as a quick replacement for civet in 2020/2021. Nowadays it can handle data sets too big for most equivalent software.

Etymology

Uvaia (Eugenia_pyriformis) (also know as uvaieira, uaieira, ubaia e uvalha) is a fruit tree typical of Brazil. Its name comes from the tupi iwa'ya, which means "sour fruit", and if you have good imagination, its pronunciation resembles WFA.

Installation

Conda and Docker/Singularity

Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge

This is the suggested installation route if you have access to a GNU/Linux machine. Otherwise please see the docker/singularity option, or how to compile it from source. After you install miniconda, simply run

# recommended: creates a new environment and installs uvaia there
conda create -n uvaia_env uvaia

Feel free to change the new environment name from uvaia_env to something sensible. Alternatively you can install uvaia in your current environment:

# not recommended, since it installs uvaia on current environment
conda install -c bioconda uvaia

The conda version may not be up-to-date. Notice that the conda version depends on sysroot_linux-64, which may interfere with your global C environment. Thus we recommend it above to install uvaia on a separate environment. You then need to activate this environment before running uvaia (assuming the above environment name uvaia_env):

conda activate uvaia_env

Conda offers a biocontainer (https://quay.io/repository/biocontainers/uvaia?tab=tags) such that you can run uvaia through docker or https://docs.sylabs.io/guides/2.6/user-guide/singularity_and_docker.html. For instance, you can try (check link above for most recent tag):

docker pull quay.io/biocontainers/uvaia:2.0.1--hc308579_0
# run the command "tatajuba -h" using the current directory
docker run -v `pwd`:`pwd` -w `pwd` quay.io/biocontainers/uvaia:2.0.1--hc308579_0 uvaia --help

Compiling from source

Use this option only if you know what you're doing, and if you need the bleeding-edge version.
To install it from source, you should download this repository with git clone --recursive to ensure it also downloads its submodules (see below for a tl;dr). If you forgot to do so, you can update it with

git submodule update --init --recursive

It will compile from the submodules biomcmc-lib and our modified WFA before finally compiling uvaia. Notice that executables for other software it relies on are not generated, only their libraries are used as dependencies.

This sofware uses autotools, which means you can install it with configure + make. You may need to define where you want it installed with configure --prefix=DIR, which is where are your unix-like include/, lib/, and bin/ directories. My favourite is ~/local. Another popular option if you are on a conda environment is --prefix=${CONDA_PREFIX}

Here is an example of its installation, please modify accordingly.

/home/simpson/$ git clone --recursive git@github.com:quadram-institute-bioscience/uvaia.git
/home/simpson/$ cd uvaia && ./autogen.sh
/home/simpson/$ mkdir build && cd build
/home/simpson/$ ../configure --prefix=${HOME}/local
/home/simpson/$ make; make install
/home/simpson/$ make check  # battery of unit and integration tests, not mandatory

Remember that the installation and autogen.sh in particular modify/add local files; therefore updating the repository from github will complain about uncommited changes. You can run git stash (or reinstall from scratch) before git pull. check the directory recipe for having a better idea of how conda/docker install it.

If the compilation is unsuccessful, you can check if all libraries and packages below are installed:

## packages necessary for autotools, o.w. it will complain when you run "autogen.sh": 
/home/simpson/$ apt-get install pkg-config autotools-dev autoconf automake libtool
/home/simpson/$ (cd tatajuba && ./autogen.sh)  ## the parentheses avoid entering the directory afterwards

## C libraries needed or suggested by uvaia (if you do not have root access see conda/mamba below):
/home/simpson/$ sudo apt-get install zlib1g-dev libomp-dev libbz2-dev check liblzma-dev
/home/simpson/$ (cd build && ../configure)  ## etc.  

The libraries rely on pkg-config to find their location, otherwise you'll see a cryptic error message about possibly undefined macro. If you can only install pkg-config through conda then you may need to install the C libraries via conda as well. Or checking and updating your $PKG_CONFIG_PATH environment variable. Thus to install all dependent packages through conda (or mamba, which is faster 😉):

mamba install automake libtool pkg-config make libgcc-ng check zlib xz bzip2 libgomp

Running

The two main programs are:

  • uvaialign, to align your query sequences against a reference genome. Output goes to stdout (your screen).
  • uvaia, to search for aligned queries against an aligned database. Both query and database fasta files can be aligned to same reference sequence with mafft, viralMSA, or uvaialign. Outputs a file with a table of reference sequences which are closest neighbours to each query. It also generates a fasta file which includes these closest reference sequences (and also other, similar sequences).

There are also two experimental programs uvaiaclust and uvaiaball, which are still under development. And we also include the original uvaia program, called uvaia_legacy and used before 2022, which however cannot cope with the huge file sizes we have currently.

We also offer a small reference alignment and other test files in directory data/.

uvaialign

Align query sequences against a reference
The complete syntax is:

 uvaialign  [-h|--help] [-v|--version] [-a|--ambiguity=<double>] -r|--reference=<ref.fa|ref.fa.gz> <seqs.fa|seqs.fa.gz>

  -h, --help                       print a longer help and exit
  -v, --version                    print version and exit
  -a, --ambiguity=<double>         maximum allowed ambiguity for sequence to be excluded (default=0.5)
  -r, --reference=<ref.fa|ref.fa.gz> reference sequence
  <seqs.fa|seqs.fa.gz>             sequences to align

The mandatory arguments are the reference sequence (Wuhan-Hu-1 MN908947.3 usually) and your unaligned sequences in gzip or uncompressed format (xz format not yet available).

uvaia

The main program, for searching for closest neighbours. Help page (called with uvaia --help):

For every query sequence, finds closest neighbours in reference alignment.
Notice that this software is multithreaded (and its performance depends on it)

The complete syntax is:

 uvaia  [-h|--help] [-v|--version] [--acgt] [-k|--keep_resolved] [-x|--exclude_self] [-n|--nbest=<int>] [-t|--trim=<int>] [-A|--ref_ambiguity=<double>] [-a|--query_ambiguity=<double>] [-p|--pool=<int>] -r|--reference=<ref.fa(.gz,.xz)> [-r|--reference=<ref.fa(.gz,.xz)>]... <seqs.fa(.gz,.xz)> [-t|--nthreads=<int>] [-o|--output=<without suffix>]

  -h, --help                       print a longer help and exit
  -v, --version                    print version and exit
  --acgt                           considers only ACGT sites (i.e. unambiguous SNP differences) in query sequences (mismatch-based)
  -k, --keep_resolved              keep more resolved and exclude redundant query seqs (default is to keep all)
  -x, --exclude_self               Exclude reference sequences with same name as a query sequence
  -n, --nbest=<int>                number of best reference sequences per query to store (default=100)
  --trim=<int>                     number of sites to trim from both ends (default=0, suggested for sarscov2=230)
  -A, --ref_ambiguity=<double>     maximum allowed ambiguity for REFERENCE sequence to be excluded (default=0.5)
  -a, --query_ambiguity=<double>   maximum allowed ambiguity for QUERY sequence to be excluded (default=0.5)
  -p, --pool=<int>                 Pool size, i.e. how many reference seqs are queued to be processed in parallel (larger than number of threads, defaults to 64 per thread)
  -r, --reference=<ref.fa(.gz,.xz)> aligned reference sequences (can be several files)
  <seqs.fa(.gz,.xz)>               aligned query sequences
  -t, --nthreads=<int>             suggested number of threads (default is to let system decide; I may not honour your suggestion btw)
  -o, --output=<without suffix>    prefix of xzipped output alignment and table with nearest neighbour sequences

Notice that versions before v2.0.2 had both long options --trim and --nthreads mapped to short option -t (in which case please use the long option).

This is a rewrite of the old uvaia (currently available as uvaia_legacy) which works with very big reference alignments (which cannot fit in memory at once). It uses priority queues (called min_heap internally) to store, for each query sequence, only the best scoring references. It assumes that the database of reference sequences is too large to fit into memory, and thus we read the (possibly compressed) reference alignment in batches, keeping only their names in memory and dumping all sequences that at some point belonged to the set. Thus in the beginning the software saves a lot of reference sequences which may not be very close to the query sequences, and later it updates the output alignment less often, with closer sequences only.

It runs in parallel using the requested number of threads (default is to use all available processors), and it relies on a "compression" of the query sequences into variable sites and common variants. It can also remove redundant sequences, i.e. those identical to or equivalent but with fewer unambiguous sites with another.

The pool size is how many reference genomes are stored in memory at once, and should usually be a multiple of the number of threads (i.e. a large number, between 10 and 1000 times the number of threads let's say). The query alignment file should not be huge, since all query sequences are stored in memory (except for a few optimisations).

output table nn_uvaia.csv.xz or nn_uvaia_acgt.csv.xz

By default uvaia will generate an alignment file named nn_uvaia.aln.xz and a csv table nn_uvaia.csv.xz. The csv table will look like

query,reference,rank,ACGT_matches,text_matches,partial_matches,valid_pair_comparisons,ACGT_matches_unique,valid_ref_sites
England/NORW-3078E97/2021,England/NORW-3061C36/2021,1,14985,14985,14987,14988,14985,29843
England/NORW-3078E97/2021,England/NORW-302EA07/2021,2,14984,14984,14986,14988,14984,29875
England/NORW-3078E97/2021,England/NORW-3034A7D/2021,3,14984,14984,14986,14988,14984,29869
England/NORW-3078E97/2021,England/NORW-3034B26/2021,4,14984,14984,14986,14988,14984,29851
England/NORW-3078E97/2021,England/NORW-306AB26/2021,5,14983,14983,14985,14988,14983,29813
England/NORW-3078E97/2021,England/NORW-31425AC/2022,6,14982,14982,14985,14988,14982,29169

Where the first three columns are the query sequence name, the reference sequence name, and the rank of the reference with regard to the query — i.e. how close they are, such that rank=1 means that the reference is the closest to the query, rank=2 means that it is the second closest to the query etc. The other columns are similarity (match) measures and other numbers used in ranking them. The output is ordered by rank, for each query.

Unlike other distance calculation software, it actually calculates the number of matches between sequences as well as the number of valid pairwise comparisons. A distance can be created from the difference between valid_pair_comparisons (or genome length) and one of the number of matches. As mentioned above, the rank (of how close the reference is to the query sequence) is given by the number of matches etc, in the order given in the table below from the 4th to the 9th column.

column column name description
1 query query sequence name
2 reference reference sequence name (close to query sequence from column 1)
3 rank order of closeness between reference and query, as given by columns 4-9
4 ACGT_matches number of matches between reference and query, considering only ACGT
5 text_matches number of exact (character) matches, thus M-M is a match but M-A is not
6 partial_matches M-A is considered a match since the partially ambiguous M equals {A,C}. However the fully ambiguous N is neglected, as usual
7 valid_pair_comparisons the effective sequence length for the comparison, i.e. sum of sites without gaps or N in any of the two sequences
8 ACGT_matches_unique a 'consensus' between query seqs is created, and this is the number of matches present in the query but not in the consensus (in short, it prefers neighbours farther from the common ancestor of the queries, in case of ties)
9 valid_ref_sites if everything else is the same, then sequences with less gaps and Ns are preferred (caveat is that some sequencing labs artificially impute states, in practice removing all gaps and Ns)

In other words, the rank is given by the number of ACGT_matches, where ties are broken by the number of text_matches, and so forth. The 4th, 6th, and 7th columns above are the most useful for the final user. But you can simply look at their rank, as described below.

output alignment file nn_uvaia.aln.xz or nn_uvaia_acgt.aln.xz

The alignment file will include all reference sequences temporarily assigned as closest to the queries. Thus it may have many more sequences than the strict set of K nearest neighbours. If you want to extract the reference sequences which would compose the set of closest neighbours, you'll need to do something like:

xzcat nn_uvaia.csv.xz | cut -d "," -f 2 | sort | uniq > closest_names.txt
seqkit grep -f closest_names.txt nn_uvaia.aln.xz > closest_names.aln

If you are using a recent version of seqkit (https://github.com/shenwei356/seqkit) which can handle XZ files. You can also use goalign subset (https://github.com/evolbioinfo/goalign) instead of seqkit to extract some sequences.

By using the rank column from the csv file, you can chose a subset of "best" sequences to add. For instance, to get only the closest neighbours to each query sequence, the first command above could be replaced by

# reference sequence with rank = 1 (i.e. closest)
xzcat nn_uvaia.csv.xz | grep -e ",1," | cut -d "," -f 2 | sort | uniq > closest_names.txt

# all reference sequences with rank below 21 (i.e. the 20 closest references) 
xzcat nn_uvaia.csv.xz | gawk -F "," '$3 < 21 {print $2}' | sort | uniq > closest_names.txt

(Notice that the old version of uvaia, now called uvaia_legacy, created the final alignment with only the closest references; this is not possible anymore in one pass due to the very large sequence files).

uvaia reports matches, not distances

The reported number of matches may differ between programs or runs due to how the query sequences are compressed and indexed, however their relative ranks should be preserved. Uvaia reports total numbers of matches, which are measures of similarity. Other programs report distances, which are a measure of dissimilarity.

For instance valid_pair_comparison - partial_matches generates similar distances as snp-dists. This is because snp-dists excludes every non-ACGT, even partially informative sites, and counts only ACGT. Here, sites with a gap or N in one of the sequences are ignored, while partially informative sites (e.g. M or R) are included.

The example sequences below illustrate the difference between them. All 3 sequences have 7 valid sites, since we exclude N and -:

seq1  AAC GTT A--    7 valid sites: 7 x ACGT + 0 partial
seq2  AAC G-T AM-    7 valid sites: 6 x ACGT + 1 partial (M)
seq3  MNC GTT MC-    7 valid sites: 5 x ACGT + 2 partial (M) 

ACGT_matches (seq1,seq2) = 6   partial_matches (seq1,seq2) = 6   valid_pair_comparisons(seq1,seq2) = 6
ACGT_matches (seq1,seq3) = 4   partial_matches (seq1,seq3) = 6   valid_pair_comparisons(seq1,seq3) = 6 
ACGT_matches (seq2,seq3) = 3   partial_matches (seq2,seq3) = 6   valid_pair_comparisons(seq2,seq3) = 6

So, although all sequences have a distance of zero from each other (no mismatches), they are not identical, and these differences are relevant phylogenetically. This highlights how uvaia and other software (in our example, snp-dists) see the sequences differently, remembering that only columns where neither sequence has a - are considered:

      uvaia          snp-dists
seq1  AAC GTT A--    AAC GTT A--
seq2  AAC G-T AM-    AAC G-T A--
seq3  M-C GTT MC-    --C GTT -C-

I hope this example helps seeing why the difference between valid_pair_comparison and partial_matches (from uvaia) is usually close to the number of ACGT mismatches (the default distances produced by snp-dists). They do disagree if for instance seq3 were instead KNC GTT KC-, since the snp-dists distance would be zero (K is neglected, as is M), but uvaia knows that K={G,T} which is incompatible (and thus a mismatch) with A or M={A,C}.

matches are better for phylogenetic analysis

Usually uvaia should be used to find neighbouring sequences which will be used in downstream phylogenetic analysis. Thus it keeps track of these several measures of similarity: in the limit, two very poor sequences with many gaps and very few columns in common (e.g. ------AAA and CCC------) have no mismatches! The program periodically reports partial results, and it is not uncommon for the Highest number of ACGT mismatches to increase as the run progresses. This apparent contradiction is in line with our objective of tracking matches instead of mismatches, since a sequence pair with few mismatches may be due to too many ambiguous calls, and lower quality overall. And thus as the program runs, it finds other sequences with many more valid comparisons, at the cost of a few extra mismatches. Furthermore the reported "highest number of mismatches" is the highest over query samples and neighbour sets, and may be due to one or a few query sequences.

However, if you do want to reproduce other software behaviour and consider only As,Cs,Gs, and Ts, there is an option conveniently invoked as --acgt. With this option, it still tracks the matches but considering partially informative sites (e.g. M or K) together with gaps and Ns. The output table will be a bit different, with two extra columns describing the distance (mismatches) from the reference to the consensus columns of the query, and the distance using the "unique" (polymorphic) columns. If you are confused, don't worry and just use their sum as the "SNP distance". Or the difference between the number of valid comparisons and the number of matches, as usual. If you use the option --acgt, then the generate files will be nn_uvaia_acgt.aln.xz and nn_uvaia_acgt.csv.xz.

Other programs

Some description about the extra software included are provided in the README_extra.md file, but I do not advise you to use them.

License

SPDX-License-Identifier: GPL-3.0-or-later

Copyright (C) 2020-today Leonardo de Oliveira Martins

This is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version (http://www.gnu.org/copyleft/gpl.html).

Tauari uses the library WFA by Santiago Marco-Sola, distributed under the MIT license. It also uses biomcmc-lib by Leonardo de Oliveira Martins under a GPL3.0 license. The bearded goat from Uvaia's logo belongs to Thomas Pennant's Allgemeine Uebersicht der vierfüssigen Thiere (public domain).

Anurag's github stats

About

Reference-based alignment and sequence database search of SARS-CoV-2 genomes.

Topics

Resources

License

GPL-3.0, GPL-3.0 licenses found

Licenses found

GPL-3.0
LICENSE
GPL-3.0
COPYING

Stars

Watchers

Forks

Packages

No packages published

Languages