- MSA Transformer takes a multiple sequence alignment (MSA) as input to reveal conservation patterns, and is trained with a Masked Language Modeling objective that allows it to capture epistasis [1]. Previous research showed that combinations of the MSA Transformer's column attention heads correlate with the Hamming distances between input sequences, suggesting their application in tracing the evolutionary lineages of proteins [2].
- We further found that its embedding tree can be used for phylogenetic reconstruction, based on the hypothesis that the MSA Transformer relies primarily on column-wise conservation information to infer phylogeny. We expect it to complement, rather than replace, classical phylogenetic inference in recovering the evolutionary history of protein families; a minimal sketch of the idea appears after this list.
- Unlike traditional phylogenetic trees, the embedding tree is assumed to capture epistatic effects and is more sensitive to gaps.
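The following is a minimal sketch of the embedding-tree idea, written against the fair-esm API (installed below) and the public esm_msa1b_t12_100M_UR50S checkpoint. The toy MSA, the mean-pooling, and the average-linkage clustering are illustrative assumptions, not the exact MsaPhylo implementation:

# Sketch: per-sequence embeddings from an early MSA Transformer layer,
# turned into a tree by average-linkage (UPGMA-style) clustering.
import esm
import torch
from scipy.cluster.hierarchy import average, to_tree
from scipy.spatial.distance import pdist

# Load the pre-trained 12-layer MSA Transformer.
model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Toy MSA: a list of (name, aligned sequence) pairs of equal length.
msa = [
    ("Seq1", "MKTAYIAKQR-QISFVKSHFSRQLEERLGLIEVQ"),
    ("Seq2", "MKTAYIAKQR-QISFVKSHFSRQLEERLGLIEVQ"),
    ("Seq3", "MKTAHIAKQRGQISFVKSHFSRQLEERLGLVEVQ"),
]
_, _, tokens = batch_converter([msa])  # shape: (1, num_seqs, L+1)

LAYER = 2  # an early layer, as recommended below
with torch.no_grad():
    out = model(tokens, repr_layers=[LAYER])
    # (Passing need_head_weights=True would also return the column
    # attentions whose combinations correlate with Hamming distances [2].)

# Per-sequence embedding: mean over columns; position 0 is the BOS token.
reps = out["representations"][LAYER][0, :, 1:, :]  # (num_seqs, L, 768)
seq_embeddings = reps.mean(dim=1).numpy()          # (num_seqs, 768)

# Build the embedding tree from pairwise Euclidean distances.
linkage_matrix = average(pdist(seq_embeddings, metric="euclidean"))
tree = to_tree(linkage_matrix)
print(tree.get_count(), "leaves in the embedding tree")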
Install MsaPhylo
git clone https://github.com/Cassie818/MsaPhylo.git
cd MsaPhylo
Install PyTorch by following the instructions at https://pytorch.org/, then install the Python dependencies:
pip install fair-esm --quiet
pip install transformers --quiet
pip install pysam --quiet
pip install biopython
pip install ete3
Check the usage information with python MsaPhylo.py -h:
python MsaPhylo.py \
    --i <THE_MSA_FILE> \
    --name <NAME_OF_OUTPUT_FILE> \
    --o <OUTPUT_DIRECTORY> \
    --l <LAYER_OF_THE_MSA_TRANSFORMER>
Examples:
python MsaPhylo.py \
    --i "./data/Pfam/PF00066.fasta" \
    --name 'PF00066' \
    --o "./results/trees/" \
    --l 2
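The run should write a Newick tree to the output directory, which can be inspected with ete3 (installed above). The exact output filename below, PF00066.nwk, is an assumption based on the --name and --o flags, not confirmed behaviour:

# Inspect the resulting embedding tree with ete3; the path is hypothetical.
from ete3 import Tree

tree = Tree("./results/trees/PF00066.nwk")
print(tree.get_ascii(show_internal=False))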
- INPUT MSA FILE (aligned FASTA), e.g.:
>Seq1
-SVNINELDLDLIRPGMKLIIIGRPGSGKSVIIKSLIASKRYIPAAIVISGSEEANHFYKTIFPSCFIYNKFNISIIEKI
HKRQITAKNILGTSWLLLIIDDCMDDSKLFCEKTVMDLFKNGRHWNILVVVASQYVMDLKPVIRATIDGVFLLREPNMTY
KEKMWLNFASIIP-KKEFFILMEKITQDHTALYIDNTIINAHWSDCVKYYKASLNIDELFGCEEYKAYCV----
>Seq2
-SIEIKELDLNYVRPGMKIIVIGRPGSGKSTLIKSLIASKRHIPAAVVISGSEEANHFYKNLFPECFVYNKFNLSLIDRI
HKRQITAKNLLDMSWLLLIIDDCMDDSKLFCDKMVMDLFKNGRHWNILVIVASQYVMDLKPVIRSTLDGVFLLREPNMSY
KEKMWLNFASIIP-KKYFFDLMEEITQDHTALYIDNTAINSHWSDCVKYYKATINVDEPFGCEEYKSYII----
>Seq3
----------TELRPGMKLIVLGKPQRGKSVLIKSIIAAKRHIPAAVVISGSEEANHFYSKLLPNCFVYNKFDADIITRV
KQRQLALKNVDPHSWLMLIFDDCMDNAKMFNHEAVMDLFKNGRHWNVLVIIASQYIMDLNASLRCCIDGIFLFTETSQTC
VDKIYKQFGGNIP-KQTFHTLMEKVTQDHTCLYIDNTTTRQKWEDMVRYYKAPLDADVGFGFKDY---------
>Seq4
----------MSSLPDKSTVLFGESGTGKSTIIDDILFQIKPVGQIIVFCPTDRNNKAYSGRVPLPCIHDKITDEVLRDI
WSRQSALTQVYKNPRLVIIFDDCSSQLNLKKNKVIQDIFYQGRHVFITTLIAIQTDKVLDPEIKKNAFVSIFTEETCASS
------YFERKSNDLDKEAKNRARNASKHQKLAWVRDEKR------FYKLMATKHDDFRFGNPIIWNYCEQIQ-
- In theory, it can handle up to 1,024 protein sequences with an alignment length of up to 1,024 columns, but the actual capacity depends on available memory (see the check after this list).
- To construct the embedding tree, you can specify any layer from 1 to 12; early layers (2 to 5) are recommended.
- The current implementation of the embedding tree does not support bootstrap values, so bootstrapping is performed on the original MSA instead. This functionality is planned for a future update.
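As a pre-flight check, the sketch below uses Biopython (installed above) to verify an alignment against the size limits just described. The limits and the example path come from this README; the helper itself is an illustrative assumption:

# Sanity-check an MSA against the MSA Transformer's nominal capacity.
from Bio import AlignIO

MAX_SEQS, MAX_COLS = 1024, 1024  # nominal limits noted above

alignment = AlignIO.read("./data/Pfam/PF00066.fasta", "fasta")
n_seqs, n_cols = len(alignment), alignment.get_alignment_length()
print(f"{n_seqs} sequences x {n_cols} columns")

if n_seqs > MAX_SEQS or n_cols > MAX_COLS:
    raise ValueError("MSA exceeds the model's nominal capacity; "
                     "subsample sequences or trim columns first.")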
If you are using the MSA Transformer for phylogenetic reconstruction, please consider citing:
@article{chen2024learning,
  title={Learning the Language of Phylogeny with MSA Transformer},
  author={Chen, Ruyi and Foley, Gabriel and Boden, Mikael},
  journal={bioRxiv},
  pages={2024--12},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
Feel free to contact me ([email protected]) if you have any questions about phylogenetic reconstruction. 🌟
[1] Rao, Roshan M., et al. "MSA transformer." International Conference on Machine Learning. PMLR, 2021.
[2] Lupo, Umberto, Damiano Sgarbossa, and Anne-Florence Bitbol. "Protein language models trained on multiple sequence alignments learn phylogenetic relationships." Nature Communications 13.1 (2022): 6298.