MEX-ArChIPelago

This repository allows reproducing the MEX-ArChIPelago analysis. It contains scripts for sequence generation, model training and testing, and plot generation for ArChIPelago (https://github.com/autosome-ru/ArChIPelago), the arrangement of multiple position weight matrices with ChIP-Seq and machine learning for prediction of transcription factor binding sites.

Before you start

Make sure that you have installed:

Getting started

Clone this repository with git clone https://github.com/autosome-ru/MEX-ArChIPelago/

Then change into the repository directory: cd MEX-ArChIPelago

Install the SARUS PWM scanner into the ./sarus directory; copying the sarus.jar file there should be enough. See also the instructions at https://github.com/autosome-ru/sarus
Note: SARUS is written in Java, so it requires a JRE.
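A quick way to confirm the setup before running the notebooks is to check that the command-line tools this README invokes are on the PATH. The sketch below is illustrative and not part of the repository: the function name check_setup is hypothetical, and the jar location ./sarus/sarus.jar is an assumption based on the instruction above (a downloaded SARUS release may carry a versioned file name).

```python
import shutil
from pathlib import Path

# Tools this README calls explicitly: git (clone), java (SARUS), conda (environment).
REQUIRED_TOOLS = ("git", "java", "conda")


def check_setup(repo_root=".", tools=REQUIRED_TOOLS):
    """Return a dict mapping each requirement name to True (found) or False (missing)."""
    status = {tool: shutil.which(tool) is not None for tool in tools}
    # Assumed jar location; adjust if your SARUS jar has a different file name.
    status["sarus.jar"] = (Path(repo_root) / "sarus" / "sarus.jar").is_file()
    return status


if __name__ == "__main__":
    for name, found in check_setup().items():
        print(f"{name:10s} {'found' if found else 'MISSING'}")
```

Run it from the MEX-ArChIPelago directory; any line reported as MISSING points to a prerequisite to install before continuing.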

Input data organization

(1) Download and unpack the GHT-SELEX and ChIP-Seq peaks and the respective negative sets from ZENODO (doi:10.5281/zenodo.10515307) into an Input_data directory, and move that directory into MEX-ArChIPelago.

(2) Download and unzip the hg38 reference genome: wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz; gunzip hg38.fa.gz

(3) (Optional) Download the top 20 PWMs for the TFs from Codebook Motif Explorer https://mex.autosome.org (Data download) and move them to the best_20_motif_CHS_GHTS directory with GHTS and CHS subfolders (already provided in the repo).

Steps to reproduce the MEX-ArChIPelago analysis

MEX-ArChIPelago was developed and tested on Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-113-generic x86_64). Model training used 100 AMD EPYC 7662 64-Core Processor cores and took approximately 10 hours in total across all transcription factors (TFs).

For a quick DEMO run, use ["GABPA"] instead of TFs_CHS_AFS.

(1) Create the MEX-ArChIPelago environment by running conda env create -f environment.yml and activate it with conda activate MEX_ArChIPelago

(2) The train-test data were generated by using https://github.com/autosome-ru/ibis-challenge. Download and extract the files from ZENODO doi:10.5281/zenodo.10515307 as described above.

(3) Generate sequences from the train-test data splits:

  • 1_Data_preparation_bed_to_fasta.ipynb - takes BED files generated by BIBIS and extracts the corresponding genomic sequences

  • 2_Sequences_scanning_with_SARUS.ipynb - takes the generated FASTA files with train/test genomic sequences (positive and negative examples) and scans them with PWMs using SARUS https://github.com/autosome-ru/sarus
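To make the two steps above concrete, here is a minimal pure-Python sketch of what they compute conceptually: cutting BED intervals out of a genome and taking the best PWM score over both strands of each sequence. It is not the code from the notebooks and not how SARUS is implemented; the function names, the toy genome, and the toy PWM are all hypothetical, and a real run should use the notebooks with hg38.fa and SARUS.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extract_intervals(genome, bed_lines):
    """Yield (name, sequence) per BED line; BED coordinates are 0-based, half-open."""
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        yield f"{chrom}:{start}-{end}", genome[chrom][start:end].upper()


def best_pwm_score(seq, pwm):
    """Best PWM score over all windows on both strands.

    pwm is a list of dicts (one per motif position) mapping A/C/G/T to a score.
    Returns -inf if the sequence is shorter than the motif.
    """
    width = len(pwm)
    best = float("-inf")
    for strand_seq in (seq, reverse_complement(seq)):
        for i in range(len(strand_seq) - width + 1):
            window = strand_seq[i:i + width]
            if "N" in window:
                continue  # windows with unknown bases are skipped in this sketch
            score = sum(column[base] for column, base in zip(pwm, window))
            best = max(best, score)
    return best


# Toy data (hypothetical): a tiny "genome" and a 3-bp PWM that prefers TAC.
toy_genome = {"chr1": "ACGTACGTTTGACGT"}
toy_bed = ["chr1\t2\t8"]
toy_pwm = [
    {"A": -1, "C": -1, "G": -1, "T": 1},
    {"A": 1, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 1, "G": -1, "T": -1},
]

records = list(extract_intervals(toy_genome, toy_bed))
features = {name: best_pwm_score(seq, toy_pwm) for name, seq in records}
```

One best score per PWM per sequence is the kind of feature the next step can feed to a classifier; scanning with several PWMs would simply add more columns per sequence.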

(4) Train PWM-based models and classify sequences containing TFBSs:

  • 3.1_Archipelago_training.ipynb - uses identified motif scores as features for RandomForestClassifier models

  • 3.2_Archipelago_training_with_PWM_sampling.ipynb - same as the previous notebook but allows PWM sampling and saturation probing

  • To estimate the ArChIPelago performance with the auROC and auPRC metrics, we complement the repository with scorer_module.py from the PRROC R package [Grau, J., Grosse, I. & Keilwagen, J. PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics 31, 2595–2597 (2015)]
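For readers unfamiliar with the two metrics, the following sketch shows what they measure on a handful of labeled scores. It is not scorer_module.py and does not reproduce PRROC: auROC is computed here as the pairwise win rate of positives over negatives, and the area under the precision-recall curve is approximated by average precision, a step-wise estimate that can differ from PRROC's interpolated value. Function names and the example data are hypothetical.

```python
def au_roc(labels, scores):
    """auROC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of examples, so this is
    suitable for illustration only.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("both positive and negative examples are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision values at the rank of each positive.

    Tied scores are ordered arbitrarily by the sort, so results with many ties
    depend on input order.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Hypothetical classifier outputs: 1 = sequence with a binding site, 0 = negative.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

roc = au_roc(example_labels, example_scores)            # 8 of 9 pairs ranked correctly
ap = average_precision(example_labels, example_scores)  # (1 + 1 + 3/4) / 3
```

For reported numbers, rely on the notebooks and scorer_module.py so that values are comparable with the published results.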

(5) Collect model performances and visualize results:

  • 4.1_Metrics_data_collector.ipynb - collects the model performances produced by 3.1_Archipelago_training.ipynb

  • 4.2_Violin_plot_data_collector.ipynb - collects the model performances produced by 3.2_Archipelago_training_with_PWM_sampling.ipynb

  • Use R_scripts_for_visualization to reproduce Fig. 3, Fig. S5, and Fig. S6

Citing

Cross-platform DNA motif discovery and benchmarking to explore binding specificities of poorly studied human transcription factors. Ilya E. Vorontsov, Ivan Kozin, Sergey Abramov, Alexandr Boytsov, Arttu Jolma, Mihai Albu, Giovanna Ambrosini, Katerina Faltejskova, Antoni J. Gralak, Nikita Gryzunov, Sachi Inukai, Semyon Kolmykov, Pavel Kravchenko, Judith F. Kribelbauer-Swietek, Kaitlin U. Laverty, Vladimir Nozdrin, Zain M. Patel, Dmitry Penzar, Marie-Luise Plescher, Sara E. Pour, Rozita Razavi, Ally W.H. Yang, Ivan Yevshin, Arsenii Zinkevich, Matthew T. Weirauch, Philipp Bucher, Bart Deplancke, Oriol Fornes, Jan Grau, Ivo Grosse, Fedor A. Kolpakov, The Codebook/GRECO-BIT Consortium, Vsevolod J. Makeev, Timothy R. Hughes, Ivan V. Kulakovskiy. bioRxiv 2024.11.11.619379; https://doi.org/10.1101/2024.11.11.619379

License

ArChIPelago is distributed under WTFPL. If you prefer more standard licenses, feel free to treat WTFPL as CC-BY.

--2024--
