ChromBPNet training
Anusri Pampari edited this page Dec 13, 2022 · 15 revisions
Let's get started by training a ChromBPNet model with a pre-trained bias model on the downloaded and preprocessed tutorial data.
First, download a pre-trained bias model provided with this repo.
mkdir -p ~/bias_model
wget https://mitra.stanford.edu/kundaje/oak/akundaje/anusri/chrombpnet_data/input_files/bias_models/ATAC/ENCSR868FGK_bias_fold_0.h5 -O ~/bias_model/ENCSR868FGK_bias_fold_0.h5
- TODO: Add Notes on how to pick a bias model
Use the pre-trained bias model to train a bias-factorized ChromBPNet model on ENCSR868FGK with the command below:
train_chrombpnet_model.sh \
-i ~/data/downloads/merged.bam \
-t "bam" \
-d "ATAC" \
-g ~/data/downloads/hg38.fa \
-c ~/data/downloads/hg38.chrom.sizes \
-p ~/data/peaks_no_blacklist.bed \
-n ~/data/negatives_data/negatives_with_summit.bed \
-f ~/data/splits/fold_0.json \
-b ~/bias_model/ENCSR868FGK_bias_fold_0.h5 \
-o ~/chrombpnet_model/
The inputs can be changed as follows for custom datasets:
- -i: Path to the input file with filtered reads. Example files for supported types: bam, fragment, tagalign.
- -t: Type of input file. Supported string values: "bam", "fragment", "tagalign".
- -d: Assay type. Supported values: "ATAC" or "DNASE".
- -g: Reference genome FASTA file. Example file for the human reference: hg38.fa.
- -c: Tab-separated file of chromosome names and sizes. Example file for the human reference: hg38.chrom.sizes.
- -p: Input peaks in narrowPeak format. The file must have 10 columns, with values at minimum for chr, start, end and summit (10th column). Every region is centered internally at start + summit. Example file for the ENCSR868FGK dataset: peaks.bed.
- -n: Input nonpeaks (background regions) in narrowPeak format. The file must have 10 columns, with values at minimum for chr, start, end and summit (10th column). Every region is centered internally at start + summit. Example file for the ENCSR868FGK dataset: nonpeaks.bed.
- -f: JSON file giving the split of chromosomes into train, test and valid sets. Example 5-fold JSONs for the human reference: folds.
- -b: Bias model in .h5 format. Bias models are generally transferable across assay types following similar protocols. A repository of pre-trained bias models is available here. Instructions for training a custom bias model are below.
- -o: Output directory path.
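The 10-column requirement for the -p and -n files can be checked before training starts. The function below is a minimal sketch, not part of ChromBPNet: `check_narrowpeak` is a hypothetical helper name, and it assumes the file is tab-separated with non-negative integer start, end and summit values.

```bash
# Hypothetical helper (not part of ChromBPNet): sanity-check a peaks or
# nonpeaks file before passing it to -p or -n.
# Assumes a tab-separated file; reports each offending line and returns
# a non-zero status if any line fails.
check_narrowpeak() {
    awk -F'\t' '
        # Every line needs exactly 10 columns.
        NF != 10 {
            printf "line %d: expected 10 columns, found %d\n", NR, NF
            bad++; next
        }
        # start (column 2) and end (column 3) must be non-negative integers.
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf "line %d: start/end are not non-negative integers\n", NR
            bad++; next
        }
        # summit (column 10) is the offset that is added to start.
        $10 !~ /^[0-9]+$/ {
            printf "line %d: summit (column 10) is not a non-negative integer\n", NR
            bad++; next
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

For example, `check_narrowpeak ~/data/peaks_no_blacklist.bed && echo "peaks file looks OK"` prints the confirmation only when every line passes.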
The output directory ~/chrombpnet_model/ will be populated as follows:
models/
    ...
    chrombpnet.h5
    chrombpnet_nobias.h5 (TF-Model, i.e. the model to predict the bias-corrected accessibility profile)
    ...
logs/
    ...
intermediates/
    ...
evaluation/
    ...
    pwm_from_input.png
    bias_metrics.json
    chrombpnet_metrics.json
    chrombpnet_only_peaks.counts_pearsonr.png
    chrombpnet_only_peaks.profile_jsd.png
    profile_motifs.pdf
    counts_motifs.pdf
    footprints/bias_footprints_score.txt
    footprints/corrected_footprints_score.txt
    ...
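Once training finishes, it can be useful to confirm that the key outputs were actually written. The sketch below is not part of ChromBPNet: `check_chrombpnet_outputs` is a hypothetical helper name, and it assumes the model files sit under models/ and the metrics under evaluation/, as the listing suggests. It checks only a handful of the listed files.

```bash
# Hypothetical helper (not part of ChromBPNet): report which of a few
# expected output files are missing or empty in the output directory.
# The relative paths assume the layout shown in the listing above.
check_chrombpnet_outputs() {
    local outdir=${1%/}   # strip one trailing slash, if present
    local missing=0
    local rel
    for rel in \
        models/chrombpnet.h5 \
        models/chrombpnet_nobias.h5 \
        evaluation/chrombpnet_metrics.json \
        evaluation/profile_motifs.pdf \
        evaluation/footprints/corrected_footprints_score.txt
    do
        # -s is true only for files that exist and are non-empty.
        if [ ! -s "$outdir/$rel" ]; then
            echo "missing or empty: $outdir/$rel"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_chrombpnet_outputs ~/chrombpnet_model/` prints one line per missing file and returns a non-zero status if anything is absent.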
Following are some things to keep in mind when using custom datasets:
- If ~/chrombpnet_model/evaluation/footprints/corrected_footprints_score.txt has values greater than 0.003, the ChromBPNet models are not fully corrected for the bias. This can happen when the bias model transfer has failed. In this case you will see Tn5-like motifs in ~/chrombpnet_model/evaluation/profile_motifs.pdf.
- If the bias model transfer has failed, try a different bias model or train a custom bias model following the instructions here.
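The 0.003 threshold check can be scripted. The sketch below is not part of ChromBPNet: `flag_uncorrected_footprints` is a hypothetical helper name, and the layout of corrected_footprints_score.txt is an assumption. It treats the last whitespace-separated field on each line as the score, so adjust the field if your file is organised differently.

```bash
# Hypothetical helper (not part of ChromBPNet): print every line whose
# score exceeds the threshold (default 0.003) and return non-zero if any
# such line exists.
# ASSUMPTION: the score is the last whitespace-separated field on each
# line; lines whose last field is not numeric are skipped.
flag_uncorrected_footprints() {
    local file=$1
    local threshold=${2:-0.003}
    awk -v t="$threshold" '
        $NF ~ /^-?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/ && ($NF + 0) > (t + 0) {
            print
            n++
        }
        END { exit (n > 0) }
    ' "$file"
}
```

For example, `flag_uncorrected_footprints ~/chrombpnet_model/evaluation/footprints/corrected_footprints_score.txt || echo "bias correction may be incomplete"` lists the offending lines and prints the warning only when at least one score is above the threshold.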