This repository allows reproducing the MEX-ArChIPelago analysis. It contains scripts for sequence generation, model training and testing, and plot generation for ArChIPelago (https://github.com/autosome-ru/ArChIPelago), the arrangement of multiple position weight matrices with ChIP-Seq and machine learning for prediction of transcription factor binding sites.
Make sure that you have installed:
- Python 3.7 (or later) https://www.python.org/
- Optionally: R 4.2.1 (or later) and RStudio https://posit.co/download/rstudio-desktop/
Clone this repository with git clone https://github.com/autosome-ru/MEX-ArChIPelago/
Then change into the MEX-ArChIPelago directory: cd MEX-ArChIPelago
Install the SARUS PWM scanner into the ./sarus directory; copying the sarus.jar file should be enough. See also the instructions at https://github.com/autosome-ru/sarus
Note: SARUS is written in Java, so it requires a JRE.
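Before running the notebooks, it can help to confirm that both prerequisites are in place. The following is a minimal sketch, not part of the repository: the function name check_sarus_setup is hypothetical, and the jar location sarus/sarus.jar is an assumption based on the instruction above.

```python
import shutil
from pathlib import Path


def check_sarus_setup(repo_root=".", jar_relpath="sarus/sarus.jar"):
    """Return a list of problems found; an empty list means the setup looks fine.

    jar_relpath is an assumed location (./sarus/sarus.jar); adjust it if the
    jar was copied elsewhere.
    """
    problems = []
    # SARUS is a Java program, so a java executable must be reachable on PATH.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH (SARUS needs a JRE)")
    # The jar itself is expected inside the repository checkout.
    jar = Path(repo_root) / jar_relpath
    if not jar.is_file():
        problems.append(f"{jar} not found")
    return problems


if __name__ == "__main__":
    for problem in check_sarus_setup():
        print("WARNING:", problem)
```

Run it from the MEX-ArChIPelago directory; no output means that both java and the jar were found.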
(1) Download and unpack the GHT-SELEX and ChIP-Seq peaks and the respective negative sets from ZENODO doi:10.5281/zenodo.10515307 into the Input_data directory, and move that directory into MEX-ArChIPelago
(2) Download and unzip the hg38 reference genome: wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz; gunzip hg38.fa.gz
(3) (Optionally) Download the top 20 PWMs for the TFs from Codebook Motif Explorer https://mex.autosome.org (Data download) and move them to the best_20_motif_CHS_GHTS directory with GHTS and CHS subfolders (already provided in the repo).
MEX-ArChIPelago was developed and tested on Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-113-generic x86_64). The model training process utilized 100 AMD EPYC 7662 64-Core Processor cores, with a total runtime of approximately 10 hours for training across all transcription factors (TFs).
For DEMO tests use ["GABPA"] instead of TFs_CHS_AFS.
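One way to switch between the demo and the full run is a small helper like the sketch below. It is illustrative only: select_tfs and full_tf_list are hypothetical names, and the placeholder entries stand in for the real contents of TFs_CHS_AFS, which are defined in the notebooks.

```python
def select_tfs(all_tfs, demo=False, demo_tfs=("GABPA",)):
    """Return the TF names to process: the demo subset or the full list."""
    return list(demo_tfs) if demo else list(all_tfs)


# Placeholder list for illustration; in the notebooks this would be TFs_CHS_AFS.
full_tf_list = ["TF_A", "TF_B", "GABPA"]

tfs_to_run = select_tfs(full_tf_list, demo=True)
print(tfs_to_run)
```

Setting demo=False would return the full list unchanged, so the same loop over tfs_to_run serves both cases.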
(1) Create the MEX-ArChIPelago environment by running conda env create -f environment.yml and activate it with conda activate MEX_ArChIPelago
(2) The train-test data were generated using https://github.com/autosome-ru/ibis-challenge. Download and extract the files from ZENODO doi:10.5281/zenodo.10515307 as described above.
(3) Generate sequences from the train-test data splits:
- 1_Data_preparation_bed_to_fasta.ipynb: takes bed files generated by BIBIS and extracts genomic sequences
- 2_Sequences_scanning_with_SARUS.ipynb: takes the generated fasta files with genomic sequences for train/test with positive/negative examples and scans them with PWMs using SARUS https://github.com/autosome-ru/sarus
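The two operations in step (3) can be illustrated on toy data. The sketch below is not the notebooks' code: in the actual pipeline SARUS performs the scanning, and all function names here are hypothetical. It assumes the usual BED convention (0-based, half-open intervals) and a PWM given as one dictionary of weights per motif position.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_bed_line(line):
    """Parse the first three BED columns: chromosome, start, end."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def extract_interval(genome, chrom, start, end):
    """Extract a sequence for a BED interval (0-based, half-open).

    genome is a dict mapping chromosome names to sequence strings.
    """
    return genome[chrom][start:end].upper()


def best_hit_score(seq, pwm):
    """Return the best PWM score over all windows on both strands.

    pwm is a list with one dict per motif position mapping A/C/G/T to a
    weight. Windows containing other symbols (e.g. N) are skipped.
    Returns None if no window can be scored.
    """
    width = len(pwm)
    best = None
    for strand_seq in (seq, reverse_complement(seq)):
        for i in range(len(strand_seq) - width + 1):
            window = strand_seq[i:i + width]
            if any(base not in "ACGT" for base in window):
                continue
            score = sum(pwm[j][base] for j, base in enumerate(window))
            if best is None or score > best:
                best = score
    return best


# Toy genome and a toy 2-bp PWM that favors "CG".
toy_genome = {"chrT": "aaacgttt"}
toy_pwm = [
    {"A": -1, "C": 1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 1, "T": -1},
]
chrom, start, end = parse_bed_line("chrT\t2\t6")
sequence = extract_interval(toy_genome, chrom, start, end)
print(sequence, best_hit_score(sequence, toy_pwm))
```

One best-hit score per PWM and sequence gives one column of a feature table, which is the kind of input that step (4) describes as motif scores.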
(4) Train PWM-based models and classify sequences containing TFBSs:
- 3.1_Archipelago_training.ipynb: uses identified motif scores as features for RandomForestClassifier models
- 3.2_Archipelago_training_with_PWM_sampling.ipynb: same as the previous one but allows PWM sampling and saturation probing
- To estimate the ArChIPelago performance with auROC and auPRC metrics, we complement the repository with scorer_module.py from the PRROC R package [Grau, J., Grosse, I. & Keilwagen, J. PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics 31, 2595–2597 (2015)]
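For readers unfamiliar with auROC, the sketch below shows a plain rank-based (Mann-Whitney) formulation. It is not the PRROC implementation and not the code in scorer_module.py; the function name auroc is hypothetical and labels are assumed to be 0 (negative) or 1 (positive).

```python
def auroc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Equals the probability that a random positive scores higher than a
    random negative, with ties counted as one half.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both positive and negative examples are required")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the group of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied items share the average of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        positives_in_group = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_group
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In this toy example, three of the four positive-negative pairs are ranked correctly, so the value is 0.75.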
(5) Collect model performances and visualize results:
- 4.1_Metrics_data_collector.ipynb: the script for collecting model performances from 3.1_Archipelago_training.ipynb
- 4.2_Violin_plot_data_collector.ipynb: the script for collecting model performances from 3.2_Archipelago_training_with_PWM_sampling.ipynb
- Use R_scripts_for_visualization to reproduce Fig. 3, Fig. S5, Fig. S6
Cross-platform DNA motif discovery and benchmarking to explore binding specificities of poorly studied human transcription factors
Ilya E. Vorontsov, Ivan Kozin, Sergey Abramov, Alexandr Boytsov, Arttu Jolma, Mihai Albu, Giovanna Ambrosini, Katerina Faltejskova, Antoni J. Gralak, Nikita Gryzunov, Sachi Inukai, Semyon Kolmykov, Pavel Kravchenko, Judith F. Kribelbauer-Swietek, Kaitlin U. Laverty, Vladimir Nozdrin, Zain M. Patel, Dmitry Penzar, Marie-Luise Plescher, Sara E. Pour, Rozita Razavi, Ally W.H. Yang, Ivan Yevshin, Arsenii Zinkevich, Matthew T. Weirauch, Philipp Bucher, Bart Deplancke, Oriol Fornes, Jan Grau, Ivo Grosse, Fedor A. Kolpakov, The Codebook/GRECO-BIT Consortium, Vsevolod J. Makeev, Timothy R. Hughes, Ivan V. Kulakovskiy;
bioRxiv 2024.11.11.619379; [https://doi.org/10.1101/2024.11.11.619379]
ArChIPelago is distributed under WTFPL. If you prefer more standard licenses, feel free to treat WTFPL as CC-BY.
--2024--