Skip to content

Latest commit

 

History

History

MAnorm

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

MAnorm - identifying differential binding in chip-seq data using linear normalization on shared peaks

MAnorm3 is my improved version which allows for 2 v 2 comparisons (replicates) using edgeR. I published a paper comparing this method to others: https://www.ncbi.nlm.nih.gov/pubmed/25972895

Original MAnorm can be found here Published in Genome Biology 2012

Workflow

  • StepI: clean and sort input
  • StepII: classify common or unique peaks
  • StepIII: count peak read
  • StepIV: normalize using common peaks

Experience from running on a quad core workstation with 32gb of ram:

  • StepI is I/O bound (mostly reading from disk and writing back to disk)
  • StepII & StepIII is CPU bound (lots of parallel bedtools)
  • Ram usage I've seen with ENCODE datasets is 20gb+/-5gb but I dont actually think all of this is used for processing

Files

  • MAnorm - original
  • MAnorm2 - retains input/output compatibility w/original but will be faster and will pool replicates
  • MAnorm3 - has different input/output, uses edgeR with replicates (2v2 comparisons only currently)

Problems with MAnorm

  • I think there is something wrong with how the p-values are calculated (see code in MAnorm2.R starting from line 50 for details).
    • pval calculation is not optimized (very slow)
      • It is faster to use choose() and run in parallel
    • Stirling approximation seems to be done incorrectly
      • This calculation is consistant in matlab version (more details in matlab MAnorm than R MAnorm)
    • pval are not symmetric (calculations from x vs y do not give the same pvalues as y vs x)
  • mergeBed command in MAnorm.sh does not actually work (need to sort first)
  • Lots of tmp files generated by MAnorm.sh and a lot of steps could be done in parallel

Changes in MAnorm3

This version uses edgeR for significance testing.

  • Requires a replicate (since ENCODE samples have 2 reps)
  • Run bedtools in parallel (faster unless you have IO bottleneck)
  • Added parameter specify output directory
  • MA adjustment is included as an offset to edgeR model
  • sortBed before mergeBed command so the merged dataset is actually merged
  • Requires input to be gzipped (will add switch for this later)
  • Outputs a bunch of plots

Todo

  • Contact MAnorm author about some of the issues in p-value calculation and merging
  • Add error checks and helpful diagnostic messages
  • Add switch to check for .gz ending
  • Change code so samples with no replicates can run

Changes in MAnorm2

This version tries to keep the same input/output as original MAnorm but has the following changes:

  • I run bedtools in parallel in the background for significant speedup.
  • Allows replicates (see 'read in' section of MAnorm2.sh code)
    • Concatenates replicates (proper way would be to downsample so replicates have equal weight)
    • Not well tested, use MAnorm3 if you have replicates

Example of MAnorm3 run:

MAnorm3.sh test_cond1_cond2 \
  peaksA.bed.gz \
  peaksB.bed.gz \
  readsA1.bed.gz \
  readsA2.bed.gz \
  readsB1.bed.gz \
  readsB2.bed.gz \
  60 70 55 53

Where 60 70 55 53 correspond to shift lengths (half of estimated fragment length) for readsA1.bed readsA2.bed readsB1.bed readsB2.bed and readsA1 and readsA2 are replicates used to identify peaks in peaksA. You must have MAnorm3.sh in your $PATH and MAnorm3.R must be in current working directory (you could modify this at the end of MAnorm3.sh). As with before, the output is tab seperated file ending in .xls

Second example:

Usage: MAnorm3.sh test_cond1_cond2 gr_peak.bed er_peak.bed gr_rep1.bed gr_rep2.bed er_rep1.bed er_rep2.bed 120 110 95 100

gr_peak.bed: sample 1 significant peak list
er_peak.bed: sample 2 significant peak list
gr_rep1.bed: sample 1 raw reads rep1
gr_rep2.bed: sample 1 raw reads rep2
er_rep1.bed: sample 2 raw reads rep1
er_rep2.bed: sample 2 raw reads rep2

shifts (1/2 est fragment size):
gr_rep1: 120
gr_rep2: 110
er_rep1: 95
er_rep2: 100

Note that test_cond1_cond2 is the name of the folder that will be created. The script is currently hardcoded to look for a (name)(type1)(type2) layout