ECT (Easy Consensus Tree)

by Mateusz Chojnacki, Krzysztof Łukasz, Younginn Park and Daniel Zalewski

Overview

Easy Consensus Tree allows the user to easily construct a whole-proteome consensus tree based on a specified list of species names. It automates the workflow from downloading proteomes and clustering sequences to building individual cluster trees and generating a final consensus tree, ensuring a streamlined and efficient process. The user-friendly setup makes it accessible even for those with minimal bioinformatics experience.

Workflow outline

Many tools are available for building a phylogenetic tree from a multiple sequence alignment (MSA) file or a FASTA file, and many others can cluster a set of sequences. However, there is currently no publicly available software that constructs a phylogenetic tree given only a list of species names or IDs. ECT does exactly that in the following steps:

downloading proteomes from public databases (UniProt Proteomes and NCBI Datasets),
merging the set of proteomes found,
clustering the merged FASTA file using MMseqs2 easy-cluster,
selecting clusters containing at least 3 or 0.3*[number of species] sequences,
building multiple sequence alignments using ClustalW, Muscle or Mafft,
constructing NJ (neighbor-joining) trees using the Biopython package,
constructing a consensus tree using the DendroPy package,
simple visualization of the computed consensus tree using Biopython.
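
The filtering rule in the list above can be read in more than one way. The sketch below assumes it means "keep a cluster only if it has at least max(3, 0.3 * number of species) sequences, rounded up"; the helper name min_cluster_size is hypothetical and is not part of ECT, so treat this as an illustration of the arithmetic rather than the tool's actual implementation.

```shell
# Hypothetical helper (not part of ECT): minimum number of sequences
# a cluster must contain to be kept.
# ASSUMPTION: the rule is interpreted as max(3, ceil(0.3 * n_species)).
min_cluster_size() {
    n_species=$1
    # ceil(0.3 * n) with integer arithmetic: (3 * n + 9) / 10
    scaled=$(( (3 * n_species + 9) / 10 ))
    if [ "$scaled" -gt 3 ]; then
        echo "$scaled"
    else
        echo 3
    fi
}

min_cluster_size 20   # 0.3 * 20 = 6, which is above the floor of 3
```

Under this reading, the fixed floor of 3 dominates for small species lists (up to 10 species), and the proportional term takes over for larger ones.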

Requirements

To run this tool, you need to have conda installed. We recommend using Miniforge, which is a lightweight installer for conda.

Instructions for installing Miniforge

# Run
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
# OR
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
# Then
bash Miniforge3-$(uname)-$(uname -m).sh

Installation

git clone https://github.com/M-Chojnacki6/ECT.git

# Go to the project directory
cd ECT

# Run
conda env create -f environment.yml
# mamba env create -f environment.yml # faster

# Make the main run script executable
chmod +x ect.sh

Usage

Example usage: if the species list, species.txt, is located in the parent directory of the ECT directory, run the following from that parent directory:

./ECT/ect.sh -i species.txt
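
For reference, a minimal input file might be prepared as follows. The options table describes the input as species names or taxonomy IDs in lines; the specific entries here are only illustrative, and mixing names and IDs in a single file is an assumption that should be checked against --help.

```shell
# Write a small example input file: one species name or taxonomy ID per line.
# The entries are illustrative; 9606 is the NCBI taxonomy ID for Homo sapiens.
cat > species.txt <<'EOF'
Mus musculus
Rattus norvegicus
9606
EOF

# Quick sanity check: count the entries in the file.
wc -l < species.txt
```

With this file in the parent directory of ECT, the ./ECT/ect.sh -i species.txt command shown above picks it up.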

To show the most basic help, run:

./ECT/ect.sh -h
# or
./ECT/ect.sh --help

If you have already produced intermediate files but the workflow was interrupted, you can use the -e option to resume from the last saved point. For example, if the workflow stopped after clustering, running:

./ECT/ect.sh -i species.txt -e 3

starts the workflow from the filtering step (using the file species_merged[x]_all_seqs.fasta, the output of MMseqs2 clustering). For a detailed description, use the -h or --help flag.

Options description

A shorter version of the description provided by --help.

short flag	long flag	description
-i	--input	Text file with species names or taxonomy IDs, one per line (default: species.txt)
-p	--minCons	Minimum support threshold for consensus tree construction (default: 0.5)
-s	--msi	MMseqs2 option: list matches above this sequence identity (range 0.0-1.0) (default: 0.3)
-l	--clusterMode	MMseqs2 option: select clustering mode
-v	--covMode	MMseqs2 option: sequence coverage mode
-c	--cov	MMseqs2 option: list matches above this fraction of aligned (covered) residues (default: 0.800)
-m	--msa	Algorithm used for MSA (default: ClustalW)
-d	--description	Show help information for the subscripts that are not skipped
-r	--remove	Text file with species names or taxonomy IDs, one per line, to remove from the local database and from the taxon_library.csv file describing it
-e	--step	Select the step from which to start the script:

0 all steps (default)
1 start with the merging step
2 start with MMseqs2 clustering
3 start with the filtering step
4 start with making the MSA
5 start with constructing NJ trees
6 start with preparing the consensus (final) tree
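
The options above can be combined in a single call. The sketch below wraps such a call in a hypothetical helper, run_ect_custom, which is not part of ECT. The values are examples only, and the aligner name Mafft follows the spelling used in the workflow outline; whether -m accepts exactly that spelling is an assumption to confirm with --help.

```shell
# Hypothetical wrapper (not part of ECT): run the pipeline with
# non-default settings, but only if the script is actually present.
run_ect_custom() {
    script=${1:-./ECT/ect.sh}
    if [ ! -x "$script" ]; then
        echo "ect.sh not found or not executable: $script" >&2
        return 1
    fi
    # -m Mafft : aligner name spelling is an ASSUMPTION, check --help
    # -p 0.7   : stricter consensus support than the default 0.5
    # -s 0.5   : higher MMseqs2 sequence identity than the default 0.3
    "$script" -i species.txt -m Mafft -p 0.7 -s 0.5
}
```

Calling run_ect_custom with no argument uses ./ECT/ect.sh, matching the layout in the usage examples; a different path to the script can be passed as the first argument.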

DISCLAIMER: We cannot guarantee that the resulting trees will accurately reflect the true relationships between species, especially if the provided species are distantly related.

Potential future enhancements

Option for the user to define species for an outgroup to root the tree on
Adding supertrees (fasturec) for paralogous clusters
Cutoff for number of sequences / number of genomes
Type of consensus and cutoff to consensus

References

The UniProt Consortium, UniProt: the Universal Protein Knowledgebase in 2023, Nucleic Acids Research, Volume 51, Issue D1, 6 January 2023, Pages D523–D531, https://doi.org/10.1093/nar/gkac1052
NCBI Datasets, https://github.com/ncbi/datasets
Steinegger, M., Söding, J. Clustering huge protein sequence sets in linear time. Nat Commun 9, 2542 (2018). https://doi.org/10.1038/s41467-018-04964-5
Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3.
Moreno, M. A., Sukumaran, J., and M. T. Holder. 2024. DendroPy 5: a mature Python library for phylogenetic computing. arXiv preprint arXiv:2405.14120. https://doi.org/10.48550/arXiv.2405.14120
M.A. Larkin, G. Blackshields, N.P. Brown, R. Chenna, P.A. McGettigan, H. McWilliam, F. Valentin, I.M. Wallace, A. Wilm, R. Lopez, J.D. Thompson, T.J. Gibson, D.G. Higgins, Clustal W and Clustal X version 2.0, Bioinformatics, Volume 23, Issue 21, November 2007, Pages 2947–2948, https://doi.org/10.1093/bioinformatics/btm404
Kazutaka Katoh, Daron M. Standley, MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability, Molecular Biology and Evolution, Volume 30, Issue 4, April 2013, Pages 772–780, https://doi.org/10.1093/molbev/mst010
Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32(5), 1792-97.
Kumar S, Suleski M, Craig JM, Kasprowicz AE, Sanderford M, Li M, Stecher G, Hedges SB (2022) TimeTree 5: An Expanded Resource for Species Divergence Times. Mol Biol Evol. https://doi.org/10.1093/molbev/msac174
