Skip to content

Latest commit

 

History

History
39 lines (24 loc) · 3.48 KB

File metadata and controls

39 lines (24 loc) · 3.48 KB

viral-protein-function-annotation-with-protein-language-model

Accompanying jupyter notebooks and data for figure generation for published manuscript

to run notebooks data must be downloaded from project google cloud platform bucket and stored in the root directory:
Final_Super_Condensed_Annotations-updated_efam.tsv (184.8 Mb) from: https://storage.googleapis.com/viral_protein_family_plm_embeddings/efam/Final_Super_Condensed_Annotations-updated_efam.tsv PHROG_index_downloaded_01232022.csv (4.4 Mb) from: https://storage.googleapis.com/viral_protein_family_plm_embeddings/phrogs/PHROG_index_downloaded_01232022.csv PHROG_index_revised_v4_10292022.csv (4.1 Mb) from: https://storage.googleapis.com/viral_protein_family_plm_embeddings/phrogs/PHROG_index_revised_v4_10292022.csv

(function classifier training only) protbert_bfd_embeddings_phrog/ (38,800 pkl objects) from: https://storage.googleapis.com/viral_protein_family_plm_embeddings/phrogs/protbert_bfd_embeddings_phrog/ + 'phrog_#' for all 38,880 PHROG families

(figure4 only) phrog_familiy_centroid/ (38,880 pkl objects, ~4.1Kb/per) from: https://storage.googleapis.com/viral_protein_family_plm_embeddings/phrogs/phrog_family_centroid/ + 'phrog_#' for all 38,880 PHROG families

(supplemental table3 only) all_sequence_ids_to_vectors_dict.pkl (2.1 Gb) from: https://storage.googleapis.com/viral_protein_family_plm_embeddings/phanns/all_sequence_ids_to_vectors_dict.pkl

To download sequence embeddings generated for this project:

  1. PHROG families- each directory contains 38,800 pkl objects corresponding to the 38,880 viral protein families in PHROGs. To download the pkl objects, use the base url below + 'phrog_#.pkl' for the phrog of interest.
    For example- for the Transformer_BFD embeddings of PHROG 1- https://storage.googleapis.com/viral_protein_family_plm_embeddings/phrogs/protbert_bfd_embeddings_phrog/phrog_1.pkl. pkl object size varies with number of sequences the family.
    Embedding order in the object corresponds to the order of the sequences in the corresponding phrog faa file, which can be downloaded from https://storage.googleapis.com/viral_protein_family_plm_embeddings/phrogs/faa_downloaded_04052022/ with + 'phrog_#.faa'.

Transformer_BFD from: https://storage.googleapis.com/viral_protein_family_plm_embeddings/phrogs/protbert_bfd_embeddings_phrog/
LSTM_Uniref90 from: https://storage.googleapis.com/viral_protein_family_plm_embeddings/phrogs/bepler_dlm_embed_phrog/
LSTM_Uniref90_MT from: https://storage.googleapis.com/viral_protein_family_plm_embeddings/phrogs/bepler_mt_embed_phrog/
Transformer_Uniref90_MT from: https://storage.googleapis.com/viral_protein_family_plm_embeddings/phrogs/proteinbert_embeddings_phrog/

  1. EFAM sequences- downloaded as a single pkl dictionary object with sequence:embedding. All embedding done with Transformer_BFD.

EFAM embeddings (10.2 Gb)- https://storage.googleapis.com/viral_protein_family_plm_embeddings/efam/identifier_to_vector_protbert_bdf_11012022_dict.pkl
EFAM protein fasta used (767.5 Mb)- https://storage.googleapis.com/viral_protein_family_plm_embeddings/efam/dereplicated_filtered_proteins_efam_downloaded_10162022.faa

  1. PHANN sequences- downloaded as a single pkl dictionary object with sequence:embedding. All embedding done with Transformer_BFD.

PHANN embeddings (2.1 Gb)- https://storage.googleapis.com/viral_protein_family_plm_embeddings/phanns/all_sequence_ids_to_vectors_dict.pkl

Citation: DOI