A pipeline for processing ChIP-seq read alignments in BAM format to find allele-specific TF binding (ASB) events. It consists of 5 main parts:
- Variant calling with GATK and PICARD tools. The result is a VCF file with SNV calls in GATK VCF format.
- SNV filtering and peak annotation. Homozygous SNVs, SNVs with fewer than 5 reads on each allele, and SNVs absent from the dbSNP common collection are filtered out of the VCF files obtained in the previous step. The remaining variants are annotated with ChIP-seq peaks from 4 different peak callers (if available in BED format).
- Background Allelic Dosage (BAD) estimation and construction of genome-wide BAD maps. See BABACHI.
- Fitting read count distributions with Negative Binomial mixtures, separately for reference and alternative alleles and for each BAD.
- Statistical evaluation: performing one-tailed tests, aggregating the resulting P-values at the TF and cell-type level using the Mudholkar-George method, then FDR-correcting the aggregated P-values and evaluating the ASB effect size.
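The filtering rule from the second part can be sketched in a few lines of Python. This is an illustration only, not the pipeline's own code: the `Snv` record, the helper names, and the toy data are hypothetical, and the sketch reads the coverage rule as "keep an SNV only if both alleles have at least 5 reads", which is one interpretation of the description above.

```python
from dataclasses import dataclass

MIN_READS_PER_ALLELE = 5  # coverage threshold quoted in the text


@dataclass(frozen=True)
class Snv:
    """Hypothetical, simplified view of one VCF record."""
    chrom: str
    pos: int
    ref: str
    alt: str
    genotype: str  # VCF-style GT field, e.g. "0/1"
    ref_reads: int
    alt_reads: int


def is_heterozygous(genotype: str) -> bool:
    """True when the GT field lists two different alleles."""
    alleles = genotype.replace("|", "/").split("/")
    return len(alleles) == 2 and alleles[0] != alleles[1]


def keep_snv(snv: Snv, dbsnp_common: set) -> bool:
    """Apply the three filters: zygosity, per-allele coverage, dbSNP membership."""
    key = (snv.chrom, snv.pos, snv.ref, snv.alt)
    return (
        is_heterozygous(snv.genotype)
        and snv.ref_reads >= MIN_READS_PER_ALLELE  # assumed: both alleles must pass
        and snv.alt_reads >= MIN_READS_PER_ALLELE
        and key in dbsnp_common
    )


def filter_snvs(snvs, dbsnp_common):
    """Return only the SNVs that pass every filter."""
    return [snv for snv in snvs if keep_snv(snv, dbsnp_common)]


# Toy data, made up for the example.
dbsnp_common = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 500, "G", "A"),
}
calls = [
    Snv("chr1", 1000, "A", "G", "0/1", 12, 9),  # passes all filters
    Snv("chr1", 2000, "C", "T", "1/1", 0, 20),  # homozygous
    Snv("chr2", 500, "G", "A", "0/1", 3, 15),   # too few reference reads
    Snv("chr3", 42, "T", "C", "0/1", 8, 8),     # not in the dbSNP set
]
kept = filter_snvs(calls, dbsnp_common)
print([(snv.chrom, snv.pos) for snv in kept])
```

In the real pipeline the records come from GATK VCF files and the dbSNP common collection is a gzipped VCF, so the in-memory set used here is a simplification.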
- Clone this repository to your machine or server: `git clone https://github.com/autosome-ru/ADASTRA-pipeline/`
- Fill in the paths to the required files (listed below) in the CONFIG.cfg file.
- Run `python3 construct_parameters_python.py`, then install the adastra package with the `pip3 install ./` command.
- Execute `pipline_start.sh <n_tr> <stage>`, where `n_tr` is the maximum number of jobs and `stage` is a flag corresponding to the part of the pipeline you wish to start with (listed in order):
  - `--create-reference`: create normalized genome and index
  - `--snp-call`: GATK SNP calling
  - `--peak-annotation`: peak annotation and filtering
  - `--bad-call`: BAD estimation
  - `--nb-fit`: fit negative binomial distributions
  - `--pvalue-count`: evaluate statistical significance
  - `--aggregate-pvalues`: perform cell-type and TF-level aggregation of p-values
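The stage order listed above can be captured in a small helper that validates a flag and assembles the launch command. This is a sketch under assumptions: it presumes that starting from a stage also runs every later stage, and that the script is invoked through `bash`. Neither is confirmed here, and `stages_from` and `build_command` are hypothetical names.

```python
# Stage flags, in the order given in the list above.
STAGES = [
    "--create-reference",
    "--snp-call",
    "--peak-annotation",
    "--bad-call",
    "--nb-fit",
    "--pvalue-count",
    "--aggregate-pvalues",
]


def stages_from(start_flag):
    """Return the chosen stage and all stages listed after it."""
    if start_flag not in STAGES:
        raise ValueError(f"unknown stage flag: {start_flag!r}")
    return STAGES[STAGES.index(start_flag):]


def build_command(n_tr, start_flag):
    """Assemble the argument list for the start script without running it."""
    if n_tr < 1:
        raise ValueError("n_tr must be a positive number of jobs")
    stages_from(start_flag)  # validates the flag
    # Assumption: the script is launched via bash from the repository root.
    return ["bash", "pipline_start.sh", str(n_tr), start_flag]


print(stages_from("--bad-call"))
print(build_command(4, "--bad-call"))
```

The resulting list can be passed to `subprocess.run` once the paths in CONFIG.cfg are filled in.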
- Java SE 8
- Python >= 3.6
- GATK >= 4.0.12.0
- PICARD
- GNU Parallel
- numpy>=1.19.0
- pandas>=1.1.0
- scipy>=1.5.1
- statsmodels>=0.11.1
To run the pipeline successfully, fill in the path for each of the following entries in the CONFIG.cfg file.
- alignments_path = "/home/user/Alignments/" The directory with .bam files of experiment and control alignments. It should contain one subdirectory per experiment, named after the experiment and holding the corresponding .bam files.
- results_path = "/home/user/DATA/" A directory to save final ASB calls into.
- intervals_path = "/home/user/interval/" A directory with peak calling data. It should contain a subdirectory for every caller (e.g. MACS), each holding zipped bed-like files with peak calls (names are arbitrary but must end with .interval.zip). Peak files from different callers for the same experiment must have the same name.
- master_list_path = "/home/user/PARAMETERS/Master-lines.tsv" A .tsv file with the following required columns (columns with other names are ignored), each row corresponding to a single experiment:
  - '#EXP' - Unique experiment identifier. Must match the name of the folder in alignments_path that holds the .bam file.
  - TF_UNIPROT_ID - TF UniProt name, e.g. Q9GZV8 (or an arbitrary TF identifier).
  - CELLS - Name or identifier of the cell type. Used for grouping BAD maps.
  - READS - Used
  - ALIGNS - Name of the corresponding .bam file without the extension ('.bam').
  - PEAKS - Name of the corresponding peak call files (without .interval.zip) or 'None'.
  - GEO - GSE of the study or 'None'.
  - ENCODE - ENCODE ID of the experiment or 'None'.
  - WG_ENCODE - wgEncode ID of the experiment or 'None'.
  - READS_ALIGNED - Number of aligned reads (or '' if no information is available).
- genome_path = "/home/user/REFERENCE/genome.fasta" Path to the reference genome file.
- dbsnp_vcf_path = "/home/user/REFERENCE/dbsnp_common.vcf.gz" Path to the dbSNP common collection (gzipped).
- repeats_path = "/home/user/repeats" Path to the repeat annotation .bed file.
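Before launching a long run it can help to check that every entry above is present and that the master list has the required columns. The sketch below is not part of the pipeline. It assumes CONFIG.cfg consists of plain `key = "value"` lines as in the examples above, and the function names are hypothetical.

```python
import csv
import io
import re

# Keys and columns taken from the descriptions above.
REQUIRED_KEYS = [
    "alignments_path", "results_path", "intervals_path", "master_list_path",
    "genome_path", "dbsnp_vcf_path", "repeats_path",
]
REQUIRED_COLUMNS = [
    "#EXP", "TF_UNIPROT_ID", "CELLS", "READS", "ALIGNS",
    "PEAKS", "GEO", "ENCODE", "WG_ENCODE", "READS_ALIGNED",
]

# Assumed line format: key = "value" (the quotes are optional here).
_CONFIG_LINE = re.compile(r'^\s*(\w+)\s*=\s*"?(.*?)"?\s*$')


def parse_config(text):
    """Collect key/value pairs; lines in any other shape are skipped."""
    values = {}
    for line in text.splitlines():
        match = _CONFIG_LINE.match(line)
        if match:
            values[match.group(1)] = match.group(2)
    return values


def missing_keys(values):
    """Required keys that are absent or left empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def missing_columns(master_list_text):
    """Required columns that do not appear in the .tsv header row."""
    reader = csv.reader(io.StringIO(master_list_text), delimiter="\t")
    header = next(reader, [])
    return [column for column in REQUIRED_COLUMNS if column not in header]


# Toy inputs, made up for the example.
config_text = (
    'alignments_path = "/home/user/Alignments/"\n'
    'results_path = "/home/user/DATA/"\n'
)
config = parse_config(config_text)
print(missing_keys(config))
print(missing_columns("#EXP\tTF_UNIPROT_ID\tCELLS\tALIGNS\n"))
```

With real files, the text would come from reading CONFIG.cfg and the file named by `master_list_path`; checking that each configured path exists on disk is a natural next step.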