skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements

Current version: v0.1.1

Introduction

skandiver is a program for identifying mobile genetic elements (prophages, plasmids, transposases, etc.) in assembled whole genome sequences using average nucleotide identity (ANI), genome fragmentation, and evolutionary divergence time. skandiver finds putative mobile genetic elements without gene annotation or training data, and can query large datasets of hundreds of assemblies or more within minutes.

Requirements

  1. skani (version 0.2.1 or higher)
  2. A database of representative genomes (GTDB recommended; see below for setup details)
  3. Python 3 (with the pandas and bio packages)
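As a quick sanity check of the requirements above, the following minimal sketch reports whether the skani executable is on PATH and whether the Python packages can be imported. The function name check_requirements is hypothetical, and the import name Bio for the bio package is an assumption (it is the name Biopython exposes); adjust it if your installation differs. It does not check the skani version or the database.

```python
import importlib.util
import shutil


def check_requirements(tools=("skani",), modules=("pandas", "Bio")):
    """Return a dict mapping each requirement name to True (found) or False."""
    status = {}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        status[tool] = shutil.which(tool) is not None
    for module in modules:
        # find_spec returns None when a top-level module is not installed.
        # "Bio" is an assumed import name for the bio/Biopython packages.
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, found in check_requirements().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as MISSING can be installed by following the setup sections of this README.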

Setting up skani

skandiver uses skani (developed by Jim Shaw, https://github.com/bluenote-1577/skani), a scalable and robust search tool for computing average nucleotide identity between whole genomes. skani can be installed with conda:

conda install -c bioconda skani

Alternatively, a binary version of skani can be downloaded for x86-64 Linux systems via:

wget https://github.com/bluenote-1577/skani/releases/download/latest/skani
chmod +x skani
./skani -h

Setting up database of representative genomes

skani search requires a database of representative genomes to query against. The currently recommended database is the Genome Taxonomy Database (GTDB), which contains >85,000 representative genomes. To set up this database, first ensure that the following requirements are met:

  • skani is installed and in PATH (i.e. typing skani -h works). Visit https://github.com/bluenote-1577/skani for more information on setting up skani.
  • ~120 GB free disk space is available for the uncompressed database and indexing.

First, download the compressed GTDB database and unzip it:

wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/genomic_files_reps/gtdb_genomes_reps_r214.tar.gz
tar -xf gtdb_genomes_reps_r214.tar.gz

The GTDB archive is organized in a special way, so a little work is needed to process the reference genome files inside the extracted folder. Run the following to collect all genome file locations into a file called gtdb_file_names.txt.

find gtdb_genomes_reps_r214/ | grep '\.fna' > gtdb_file_names.txt

Finally, we can construct the indexed database to query against using:

skani sketch -l gtdb_file_names.txt -o gtdb_skani_database_ani -t 20
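The file-collection step above (find piped into grep) can also be done from Python, which helps where find and grep are unavailable. This is a minimal sketch, not part of skandiver: the function names collect_genome_paths and write_file_list are hypothetical, and the suffix list assumes the genomes end in .fna or .fna.gz. Unlike the grep pattern, it matches only on the end of the file name.

```python
from pathlib import Path


def collect_genome_paths(root, suffixes=(".fna", ".fna.gz")):
    """Return sorted file paths under root whose names end in one of the suffixes."""
    root = Path(root)
    return sorted(
        str(path)
        for path in root.rglob("*")  # walk the directory tree recursively
        if path.is_file() and path.name.endswith(tuple(suffixes))
    )


def write_file_list(paths, out_file):
    """Write one path per line, the list format passed to `skani sketch -l`."""
    Path(out_file).write_text("".join(f"{p}\n" for p in paths))
```

For example, write_file_list(collect_genome_paths("gtdb_genomes_reps_r214"), "gtdb_file_names.txt") would produce a list file that can be passed to the skani sketch command shown above.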

Note: the same process works for any directory of representative .fna.gz files. You can create your own custom representative genome database by downloading a set of representative whole genomes from NCBI, Ensembl, RefSeq, etc. Once the directory of genomes is in place, simply run:

find [PATH_TO_REP_DIRECTORY/] | grep '\.fna' > customdb_file_names.txt
skani sketch -l customdb_file_names.txt -o customdb_skani_database_ani -t 20

You have now created a custom database of representative genomes that skandiver/skani can query against.

Installation and quick start

Once the three prerequisites are met (skani, a database of representative genomes, and Python), you are ready to set up skandiver. To begin, clone the skandiver repository and run the setup script:

git clone https://github.com/YoukaiFromAccounting/skandiver
cd skandiver
chmod +x skandiver.sh
bash SETUP.sh

The provided setup script will test your environment for dependencies and download an example data set. You can also install all needed dependencies using the following:

sudo apt-get install python3-pip
pip3 install bio pandas

skandiver is now installed on your system, and can be called using the following command structure:

./skandiver.sh [INPUT_DIRECTORY] [OUTPUT_NAME] [CHUNK_SIZE] [PATH_TO_REPRESENTATIVE_GENOME_DB]
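To illustrate what CHUNK_SIZE controls, here is a minimal sketch of splitting a sequence into fixed-size, non-overlapping fragments. It is an illustration under stated assumptions, not skandiver's actual implementation: the function name fragment_sequence, the fragment ID format, and the 1-based inclusive coordinates are all hypothetical, and skandiver may derive its reported positions differently.

```python
def fragment_sequence(seq_id, sequence, chunk_size=10000):
    """Yield (fragment_id, start, end, subsequence) for non-overlapping windows.

    Coordinates are 1-based and inclusive (an assumption), so a full window
    spans exactly chunk_size bases. The last window may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for offset in range(0, len(sequence), chunk_size):
        piece = sequence[offset:offset + chunk_size]
        start = offset + 1
        end = offset + len(piece)
        # Hypothetical fragment naming scheme: <sequence id>:<start>-<end>
        yield f"{seq_id}:{start}-{end}", start, end, piece


if __name__ == "__main__":
    for frag_id, start, end, piece in fragment_sequence("contig1", "ACGT" * 6, chunk_size=10):
        print(frag_id, len(piece))
```

A smaller chunk size gives finer positional resolution at the cost of more fragments to search; a larger one does the opposite.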

You can test skandiver against a sample whole genome assembly of Acinetobacter baumannii by executing the following command:

./skandiver.sh test_files/abaumannii results 10000 [PATH_TO_REPRESENTATIVE_GENOME_DB]

For example, if you followed the above instructions for setting up the GTDB database of representative genomes in the skandiver directory, you can run:

./skandiver.sh test_files/abaumannii results 10000 gtdb_skani_database_ani

This should output four files: results.txt, resultsskani.txt, resultsskanifiltered.txt, and resultssearch.fna. results.txt contains the summary of potential mobile genetic elements found by skandiver, while resultsskani.txt and resultsskanifiltered.txt contain the skani search results for the query whole genome assembly (resultsskanifiltered.txt keeps only genome matches with greater than 95% average nucleotide identity and greater than 90% align fraction). resultssearch.fna contains the entire fragmented genome assembly used for the skani search.
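The 95% ANI / 90% align fraction filter described above can be sketched as follows. This is an approximation for illustration only: it assumes skani's tab-separated search output with columns named ANI, Align_fraction_ref, and Align_fraction_query, and it assumes the filter applies to the query align fraction; which column skandiver actually uses is not stated here. The function name filter_hits and the sample rows are made up.

```python
import csv
import io


def filter_hits(tsv_text, min_ani=95.0, min_af=90.0, af_column="Align_fraction_query"):
    """Keep rows whose ANI and align fraction both exceed the thresholds."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        # Both values are percentages in the assumed skani output format.
        if float(row["ANI"]) > min_ani and float(row[af_column]) > min_af:
            kept.append(row)
    return kept


if __name__ == "__main__":
    # Made-up rows for illustration; not real skani output.
    sample = (
        "Ref_file\tQuery_file\tANI\tAlign_fraction_ref\tAlign_fraction_query\tRef_name\tQuery_name\n"
        "r1.fna\tq.fna\t99.10\t1.20\t97.50\tref_one\tfrag_1\n"
        "r2.fna\tq.fna\t93.40\t0.90\t98.00\tref_two\tfrag_2\n"
    )
    for hit in filter_hits(sample):
        print(hit["Ref_name"], hit["ANI"], hit["Align_fraction_query"])
```

Passing af_column="Align_fraction_ref" switches the filter to the reference align fraction if that better matches your use case.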

The results file looks like the following for a sample whole genome assembly of Pseudomonas aeruginosa:

GenomeID/AccessionNumber	QuerySpecies	GenomePosition	NumberofHits  TotalDivergence	AverageDivergence  RefSpeciesHits
LFMS01000010.1	Pseudomonas_aeruginosa	46306-56305	2	0.00101	0.000505  Pseudomonas_taiwanensis, Pseudomonas_jinjuensis
LFMS01000011.1	Pseudomonas_aeruginosa	1662427-1672426	8	4954.9287	619.3660875  Stutzerimonas_stutzeri, Cronobacter_muytjensii, Cronobacter_universalis, Pseudomonas_putida, Pseudomonas_mosselii, Pseudomonas_saponiphila, Achromobacter_xylosoxidans
LFMS01000011.1	Pseudomonas_aeruginosa	1672427-1682426	6	1992.8886999999997	332.1481166666666  Stutzerimonas_stutzeri, Pseudomonas_putida, Pseudomonas_mosselii, Pseudomonas_saponiphila, Achromobacter_xylosoxidans

  • GenomeID/AccessionNumber: the unique sequence identifier (accession) of the query sequence.
  • QuerySpecies: the NCBI common name of the query assembly.
  • GenomePosition: the coordinates of the fragment of the whole genome assembly estimated to contain the mobile genetic element.
  • NumberofHits: the number of unique species that the query fragment mapped to with >95% ANI and >90% align fraction (extremely high degree of similarity).
  • TotalDivergence: the total divergence time summed over all species the query fragment mapped to, in millions of years.
  • AverageDivergence: the average divergence time per species the query fragment mapped to, in millions of years (TotalDivergence divided by NumberofHits).
  • RefSpeciesHits: the reference species that the query fragment mapped to.
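The sketch below parses one row of the sample results shown above and checks two relationships visible in that sample: the fragment spans CHUNK_SIZE bases (inclusive coordinates), and AverageDivergence equals TotalDivergence divided by NumberofHits. The function name parse_result_line and the dictionary keys are hypothetical, and the separator handling (a tab, or a run of two or more whitespace characters) is an assumption based on how the sample is laid out.

```python
import re

# Assumed layout: fields separated by a tab or by two or more whitespace characters.
FIELD_SEP = re.compile(r"\s{2,}|\t")


def parse_result_line(line):
    """Split one data row of results.txt into a dict with typed fields."""
    fields = FIELD_SEP.split(line.strip())
    seq_id, species, position, hits, total, average, refs = fields[:7]
    start, end = (int(x) for x in position.split("-"))
    return {
        "seq_id": seq_id,
        "query_species": species,
        "start": start,
        "end": end,
        "hits": int(hits),
        "total_divergence": float(total),      # millions of years
        "average_divergence": float(average),  # millions of years
        "ref_species": [name.strip() for name in refs.split(",")],
    }


if __name__ == "__main__":
    row = parse_result_line(
        "LFMS01000010.1\tPseudomonas_aeruginosa\t46306-56305\t2\t0.00101\t0.000505"
        "  Pseudomonas_taiwanensis, Pseudomonas_jinjuensis"
    )
    print(row["end"] - row["start"] + 1)              # fragment length in bases
    print(row["total_divergence"] / row["hits"])      # compare with average_divergence
```

Sorting parsed rows by average_divergence is one simple way to bring fragments shared across distantly related species to the top.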

As skandiver is considerably faster than gene annotation-based mobile element finders, you can bulk download a large set of whole genome assemblies in .fna or .fasta format (compressed or uncompressed both work) into the [INPUT_DIRECTORY] of skandiver to perform efficient analysis of potential mobile genetic elements in metagenomic data.

Contact

Brian Zhang, [email protected] (Contributing author)
Grace Oualline, [email protected] (Contributing author)

Acknowledgements

We would like to express our gratitude to the following individuals and organizations for their major contributions and support in the development of skandiver:

  • Jim Shaw (https://github.com/bluenote-1577) for the creation and continuous support of skani, a fundamental tool utilized by skandiver for ANI computations, as well as providing valuable guidance regarding the overall quality and usability of skandiver.
  • Yun William Yu (https://github.com/yunwilliamyu) for providing algorithmic support and troubleshooting expertise, greatly improving skandiver's efficiency.

This implementation of skandiver was based on the ideas and software from the following paper:
Shaw, J., & Yu, Y. W. (2023). Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods, 20(11), 1661–1665. https://doi.org/10.1038/s41592-023-02018-3