ChromBPNet training

Anusri Pampari edited this page Dec 13, 2022 · 15 revisions

Let's train a ChromBPNet model with a pre-trained bias model on the downloaded and preprocessed tutorial data.

Step 1

Start by downloading a pre-trained bias model provided with this repo here.

mkdir -p ~/bias_model
wget https://mitra.stanford.edu/kundaje/oak/akundaje/anusri/chrombpnet_data/input_files/bias_models/ATAC/ENCSR868FGK_bias_fold_0.h5 -O ~/bias_model/ENCSR868FGK_bias_fold_0.h5
  • TODO: Add Notes on how to pick a bias model
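
A truncated or failed download usually only surfaces later as a model-loading error, so it may help to confirm that the file looks like HDF5 before training. The sketch below is not part of the ChromBPNet repo: `is_hdf5` is a hypothetical helper that assumes the 8-byte HDF5 signature sits at the very start of the file (true for files written without a user block) and that `head -c` and `od` are available on your system.

```shell
# Hypothetical helper (not part of ChromBPNet): succeed if the file starts
# with the 8-byte HDF5 signature \x89 H D F \r \n \x1a \n.
is_hdf5() {
  [ -s "$1" ] || return 1    # file must exist and be non-empty
  sig=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$sig" = "894844460d0a1a0a" ]
}

# Example use after the download above:
# is_hdf5 ~/bias_model/ENCSR868FGK_bias_fold_0.h5 || echo "download looks corrupt" >&2
```

This only checks the file signature; it does not verify that the file holds a usable bias model.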

Step 2

Use the pre-trained bias model to train a bias-factorized ChromBPNet model on ENCSR868FGK using the command below:

train_chrombpnet_model.sh \
  -i ~/data/downloads/merged.bam \
  -t "bam" \
  -d "ATAC" \
  -g ~/data/downloads/hg38.fa \
  -c ~/data/downloads/hg38.chrom.sizes \
  -p ~/data/peaks_no_blacklist.bed \
  -n ~/data/negatives_data/negatives_with_summit.bed \
  -f ~/data/splits/fold_0.json \
  -b ~/bias_model/ENCSR868FGK_bias_fold_0.h5 \
  -o ~/chrombpnet_model/
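
Because training runs for a long time, a quick pre-flight check of the input paths can catch a mistyped path before the run starts. The following is a minimal sketch, not part of the ChromBPNet scripts: `check_inputs` is a hypothetical helper that only tests that each path exists and is non-empty.

```shell
# Hypothetical pre-flight helper (not part of ChromBPNet): print every path
# that is missing or empty, and return non-zero if any was found.
check_inputs() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use with the tutorial paths from the command above:
# check_inputs ~/data/downloads/merged.bam ~/data/downloads/hg38.fa \
#   ~/data/downloads/hg38.chrom.sizes ~/data/peaks_no_blacklist.bed \
#   ~/data/negatives_data/negatives_with_summit.bed \
#   ~/data/splits/fold_0.json ~/bias_model/ENCSR868FGK_bias_fold_0.h5
```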

For custom datasets, change the inputs as follows:

Input Format

  • -i: Path to the input file with filtered reads. Example files for supported types - bam, fragment, tagalign
  • -t: Type of input file. The following strings are supported - "bam", "fragment", "tagalign".
  • -d: Assay type. The following types are supported - "ATAC" or "DNASE"
  • -g: Reference genome fasta file. Example file for the human reference - hg38.fa
  • -c: Tab-separated file of chromosome names and sizes. Example file for the human reference - hg38.chrom.sizes
  • -p: Input peaks in narrowPeak format. The file must have 10 columns, with values at minimum for chr, start, end and summit (10th column). Every region is centered at start + summit internally. Example file with ENCSR868FGK dataset - peaks.bed
  • -n: Input nonpeaks (background regions) in narrowPeak format. The file must have 10 columns, with values at minimum for chr, start, end and summit (10th column). Every region is centered at start + summit internally. Example file with ENCSR868FGK dataset - nonpeaks.bed
  • -f: JSON file listing the split of chromosomes into train, test and valid sets. Example 5-fold JSONs for the human reference - folds
  • -b: Bias model in .h5 format. Bias models are generally transferable across assay types that follow a similar protocol. A repository of pre-trained bias models is available here. Instructions to train a custom bias model are below.
  • -o: Output directory path
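
The column requirements for -p and -n can be checked up front. The sketch below is one possible check, not something shipped with ChromBPNet: `check_narrowpeak` is a hypothetical helper, and it assumes that start, end and the summit offset should all be non-negative integers, which follows from regions being centered at start + summit.

```shell
# Hypothetical helper (not part of ChromBPNet): check that every line of a
# peaks/nonpeaks file has 10 tab-separated columns, with non-negative integer
# start (col 2), end (col 3) and summit offset (col 10).
check_narrowpeak() {
  awk -F'\t' '
    NF != 10 {
      printf "line %d: expected 10 columns, found %d\n", NR, NF
      bad = 1
      next
    }
    $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $10 !~ /^[0-9]+$/ {
      printf "line %d: start, end and summit must be non-negative integers\n", NR
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Example use with the tutorial files:
# check_narrowpeak ~/data/peaks_no_blacklist.bed
# check_narrowpeak ~/data/negatives_data/negatives_with_summit.bed
```

The function prints one message per offending line and returns non-zero if any line fails, so it can gate the training command in a script.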

Output Format

The output directory ~/chrombpnet_model/ will be populated as follows:

models/
	...
	chrombpnet.h5
	chrombpnet_nobias.h5 (TF-Model, i.e., the model that predicts the bias-corrected accessibility profile)
	...
logs/
	...
	
intermediates/
	...

evaluation/
	...
	pwm_from_input.png 
	bias_metrics.json 
	chrombpnet_metrics.json
	chrombpnet_only_peaks.counts_pearsonr.png
	chrombpnet_only_peaks.profile_jsd.png
	profile_motifs.pdf
	counts_motifs.pdf
	footprints/bias_footprints_score.txt
	footprints/corrected_footprints_score.txt
	...

Keep the following in mind when using custom datasets:

  • If ~/chrombpnet_model/evaluation/footprints/corrected_footprints_score.txt has values greater than 0.003, the ChromBPNet model is not fully corrected for bias. This can happen when the bias model transfer has failed, in which case you will see Tn5-like motifs in ~/chrombpnet_model/evaluation/profile_motifs.pdf
  • If the bias model transfer has failed, try a different bias model or train a custom bias model following the instructions here.
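
The 0.003 threshold check on the footprint scores can be scripted. The sketch below rests on an assumption about the file layout that this page does not specify: it treats the last whitespace-separated field of each line as the score. `flag_footprints` is a hypothetical helper, so inspect the file and adjust the parsing if the real layout differs.

```shell
# Hypothetical helper (not part of ChromBPNet). ASSUMPTION: each line of the
# score file ends with a numeric score as its last whitespace-separated field.
# Usage: flag_footprints FILE [THRESHOLD]   (threshold defaults to 0.003)
flag_footprints() {
  awk -v max="${2:-0.003}" '
    # Only consider lines whose last field looks like a number.
    $NF ~ /^-?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/ && $NF + 0 > max + 0 {
      print "above threshold: " $0
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Example use on the tutorial output:
# flag_footprints ~/chrombpnet_model/evaluation/footprints/corrected_footprints_score.txt
```

A non-zero return means at least one score exceeded the threshold, which would point to the bias-transfer problem described above.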