Learning the Language of Phylogeny with MSA Transformer

Introduction

  1. MSA Transformer takes a multiple sequence alignment as input to reveal conservation patterns and is trained with a Masked Language Modeling objective to capture epistasis [1]. Previous research showed that combinations of MSA Transformer's column-attention heads correlate with the Hamming distances between input sequences, suggesting their use in tracing the evolutionary lineages of proteins [2].
  2. We further found that its embedding tree can be used for phylogenetic reconstruction, based on the hypothesis that MSA Transformer relies primarily on column-wise conservation information to infer phylogeny (see the sketch after this list). We anticipate it will complement, rather than replace, classical phylogenetic inference in recovering the evolutionary history of protein families.
  3. Unlike traditional phylogenetic trees, the embedding tree is assumed to capture epistatic effects and to be more sensitive to gaps.
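
For intuition, here is a minimal sketch of one way to build such an embedding tree with fair-esm and Biopython: embed the MSA, mean-pool each sequence's per-column representations from an early layer, and run neighbor joining on the pairwise Euclidean distances. The layer choice, mean pooling, distance metric, and tree method here are illustrative assumptions; MsaPhylo.py is the reference implementation.

import torch
import esm  # fair-esm
from Bio import SeqIO
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

# Load the pretrained 12-layer MSA Transformer.
model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

# The MSA Transformer takes one MSA as a list of (label, aligned_sequence) pairs.
msa = [(rec.id, str(rec.seq)) for rec in SeqIO.parse("./data/Pfam/PF00066.fasta", "fasta")]
_, _, tokens = batch_converter([msa])  # shape: (1, num_seqs, seq_len + 1)

LAYER = 2  # assumption: an early layer, per the recommendation under Instructions
with torch.no_grad():
    reps = model(tokens, repr_layers=[LAYER])["representations"][LAYER]

# Drop the start token, then mean-pool over columns: one vector per sequence (an assumption).
seq_emb = reps[0, :, 1:, :].mean(dim=1)

# Pairwise Euclidean distances, fed to neighbor joining
# (Biopython's DistanceMatrix wants a lower-triangular matrix including the zero diagonal).
dist = torch.cdist(seq_emb, seq_emb)
names = [name for name, _ in msa]
lower = [[float(dist[i, j]) for j in range(i + 1)] for i in range(len(names))]
tree = DistanceTreeConstructor().nj(DistanceMatrix(names, lower))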


Install packages

Install MsaPhylo

git clone https://github.com/Cassie818/MsaPhylo.git
cd MsaPhylo

Install PyTorch: https://pytorch.org/

pip install fair-esm --quiet
pip install transformers --quiet
pip install pysam --quiet
pip install biopython
pip install ete3

Usage

Check the usage guidance with: python MsaPhylo.py -h

python MsaPhylo.py \
    --i <MSA_FILE> \
    --name <NAME_OF_OUTPUT_FILE> \
    --o <OUTPUT_DIRECTORY> \
    --l <LAYER_OF_THE_MSA_TRANSFORMER>

Example:

python MsaPhylo.py \
        --i "./data/Pfam/PF00066.fasta" \
        --name 'PF00066' \
        --o "./results/trees/" \
        --l 2

Instructions

  1. Input MSA file (aligned FASTA format):
    >Seq1
    -SVNINELDLDLIRPGMKLIIIGRPGSGKSVIIKSLIASKRYIPAAIVISGSEEANHFYKTIFPSCFIYNKFNISIIEKI
    HKRQITAKNILGTSWLLLIIDDCMDDSKLFCEKTVMDLFKNGRHWNILVVVASQYVMDLKPVIRATIDGVFLLREPNMTY
    KEKMWLNFASIIP-KKEFFILMEKITQDHTALYIDNTIINAHWSDCVKYYKASLNIDELFGCEEYKAYCV----
    >Seq2
    -SIEIKELDLNYVRPGMKIIVIGRPGSGKSTLIKSLIASKRHIPAAVVISGSEEANHFYKNLFPECFVYNKFNLSLIDRI
    HKRQITAKNLLDMSWLLLIIDDCMDDSKLFCDKMVMDLFKNGRHWNILVIVASQYVMDLKPVIRSTLDGVFLLREPNMSY
    KEKMWLNFASIIP-KKYFFDLMEEITQDHTALYIDNTAINSHWSDCVKYYKATINVDEPFGCEEYKSYII----
    >Seq3
    ----------TELRPGMKLIVLGKPQRGKSVLIKSIIAAKRHIPAAVVISGSEEANHFYSKLLPNCFVYNKFDADIITRV
    KQRQLALKNVDPHSWLMLIFDDCMDNAKMFNHEAVMDLFKNGRHWNVLVIIASQYIMDLNASLRCCIDGIFLFTETSQTC
    VDKIYKQFGGNIP-KQTFHTLMEKVTQDHTCLYIDNTTTRQKWEDMVRYYKAPLDADVGFGFKDY---------
    >Seq4
    ----------MSSLPDKSTVLFGESGTGKSTIIDDILFQIKPVGQIIVFCPTDRNNKAYSGRVPLPCIHDKITDEVLRDI
    WSRQSALTQVYKNPRLVIIFDDCSSQLNLKKNKVIQDIFYQGRHVFITTLIAIQTDKVLDPEIKKNAFVSIFTEETCASS
    ------YFERKSNDLDKEAKNRARNASKHQKLAWVRDEKR------FYKLMATKHDDFRFGNPIIWNYCEQIQ-
    
  2. Theoretically, it can handle up to 1,024 protein sequences with an alignment length of up to 1,024 columns, but the practical capacity depends on available memory (see the first sketch after this list).
  3. To construct the embedding tree, you can specify any layer from 1 to 12; early layers (2 to 5) are recommended.
  4. The current implementation does not yet compute bootstrap support values, so bootstrapping is performed on the original MSA instead (see the second sketch after this list). Native support is planned for a future update.
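
Regarding point 2: if an MSA exceeds the 1,024-sequence limit or your memory budget, one workaround (an assumption of this sketch, not a built-in feature of MsaPhylo.py) is to subsample sequences before running the tool. The file name "large_family.fasta" and the sample size of 256 are hypothetical:

import random
from Bio import SeqIO

# Hypothetical oversized MSA; keep any reference/query sequences manually if needed.
records = list(SeqIO.parse("large_family.fasta", "fasta"))
rng = random.Random(0)  # fixed seed for reproducibility
subset = rng.sample(records, k=min(256, len(records)))
SeqIO.write(subset, "large_family_sub.fasta", "fasta")

Regarding point 4: until native support lands, bootstrap replicates can be generated from the original MSA by resampling alignment columns with replacement (the standard phylogenetic bootstrap); each replicate is then run through MsaPhylo.py and support values are summarized over the resulting trees. A minimal Biopython sketch, written for clarity rather than speed:

import random
from Bio import AlignIO

def bootstrap_replicates(msa_path, n_reps=100, seed=0):
    # Resample alignment columns with replacement (standard phylogenetic bootstrap).
    aln = AlignIO.read(msa_path, "fasta")
    rng = random.Random(seed)
    length = aln.get_alignment_length()
    for _ in range(n_reps):
        cols = [rng.randrange(length) for _ in range(length)]
        rep = aln[:, cols[0]:cols[0] + 1]
        for c in cols[1:]:
            rep += aln[:, c:c + 1]  # '+' concatenates alignments column-wise
        yield rep

for i, rep in enumerate(bootstrap_replicates("./data/Pfam/PF00066.fasta", n_reps=10)):
    AlignIO.write(rep, f"PF00066_bs{i}.fasta", "fasta")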
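
Both sketches above assume the aligned FASTA conventions shown in point 1; adapt the paths and parameters to your own data.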

Citation

If you use the MSA Transformer for phylogenetic reconstruction, please consider citing:

@article{chen2024learning,
  title={Learning the Language of Phylogeny with MSA Transformer},
  author={Chen, Ruyi and Foley, Gabriel and Boden, Mikael},
  journal={bioRxiv},
  pages={2024--12},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

Contact

Feel free to contact me ([email protected]) if you have any questions about phylogenetic reconstruction🌟

References

[1] Rao, Roshan M., et al. "MSA transformer." International Conference on Machine Learning. PMLR, 2021.
[2] Lupo, Umberto, Damiano Sgarbossa, and Anne-Florence Bitbol. "Protein language models trained on multiple sequence alignments learn phylogenetic relationships." Nature Communications 13.1 (2022): 6298.