MAnorm - identifying differential binding in chip-seq data using linear normalization on shared peaks
MAnorm3 is my improved version which allows for 2 v 2 comparisons (replicates) using edgeR. I published a paper comparing this method to others: https://www.ncbi.nlm.nih.gov/pubmed/25972895
Original MAnorm can be found here Published in Genome Biology 2012
- StepI: clean and sort input
- StepII: classify common or unique peaks
- StepIII: count peak read
- StepIV: normalize using common peaks
Experience from running on a quad core workstation with 32gb of ram:
- StepI is I/O bound (mostly reading from disk and writing back to disk)
- StepII & StepIII is CPU bound (lots of parallel bedtools)
- Ram usage I've seen with ENCODE datasets is 20gb+/-5gb but I dont actually think all of this is used for processing
- MAnorm - original
- MAnorm2 - retains input/output compatibility w/original but will be faster and will pool replicates
- MAnorm3 - has different input/output, uses edgeR with replicates (2v2 comparisons only currently)
- I think there is something wrong with how the p-values are calculated (see code in MAnorm2.R starting from line 50 for details).
- pval calculation is not optimized (very slow)
- It is faster to use
choose()
and run in parallel
- It is faster to use
- Stirling approximation seems to be done incorrectly
- This calculation is consistant in matlab version (more details in matlab MAnorm than R MAnorm)
- pval are not symmetric (calculations from x vs y do not give the same pvalues as y vs x)
- pval calculation is not optimized (very slow)
- mergeBed command in MAnorm.sh does not actually work (need to sort first)
- Lots of tmp files generated by MAnorm.sh and a lot of steps could be done in parallel
This version uses edgeR for significance testing.
- Requires a replicate (since ENCODE samples have 2 reps)
- Run bedtools in parallel (faster unless you have IO bottleneck)
- Added parameter specify output directory
- MA adjustment is included as an offset to edgeR model
- sortBed before mergeBed command so the merged dataset is actually merged
- Requires input to be gzipped (will add switch for this later)
- Outputs a bunch of plots
- Contact MAnorm author about some of the issues in p-value calculation and merging
- Add error checks and helpful diagnostic messages
- Add switch to check for .gz ending
- Change code so samples with no replicates can run
This version tries to keep the same input/output as original MAnorm but has the following changes:
- I run bedtools in parallel in the background for significant speedup.
- Allows replicates (see 'read in' section of MAnorm2.sh code)
- Concatenates replicates (proper way would be to downsample so replicates have equal weight)
- Not well tested, use MAnorm3 if you have replicates
MAnorm3.sh test_cond1_cond2 \
peaksA.bed.gz \
peaksB.bed.gz \
readsA1.bed.gz \
readsA2.bed.gz \
readsB1.bed.gz \
readsB2.bed.gz \
60 70 55 53
Where 60
70
55
53
correspond to shift lengths (half of estimated fragment length) for readsA1.bed
readsA2.bed
readsB1.bed
readsB2.bed
and readsA1
and readsA2
are replicates used to identify peaks in peaksA
. You must have MAnorm3.sh in your $PATH
and MAnorm3.R must
be in current working directory (you could modify this at the end of MAnorm3.sh). As with before, the output is tab seperated file ending in .xls
Second example:
Usage: MAnorm3.sh test_cond1_cond2 gr_peak.bed er_peak.bed gr_rep1.bed gr_rep2.bed er_rep1.bed er_rep2.bed 120 110 95 100
gr_peak.bed: sample 1 significant peak list
er_peak.bed: sample 2 significant peak list
gr_rep1.bed: sample 1 raw reads rep1
gr_rep2.bed: sample 1 raw reads rep2
er_rep1.bed: sample 2 raw reads rep1
er_rep2.bed: sample 2 raw reads rep2
shifts (1/2 est fragment size):
gr_rep1: 120
gr_rep2: 110
er_rep1: 95
er_rep2: 100
Note that test_cond1_cond2 is the name of the folder that will be created. The script is currently hardcoded to look for a (name)(type1)(type2) layout