DEPRECATED. See https://github.com/endrebak/SICER2
epic is a software package for finding medium to diffusely enriched domains in chip-seq data. It is a fast, parallel and memory-efficient implementation of the incredibly popular SICER algorithm. By running epic on a set of data ("ChIP") files and control ("Input") files, epic is able to quickly differentially enriched regions.
epic is an improvement over the original SICER by being faster, more memory efficient, multicore, and significantly much easier to install and use.
The MIT-licensed code is available at https://github.com/endrebak/epic
The documentation is currently being worked on.
Table of Contents
For now, please cite the original SICER paper, but include a link to this repo.
epic is available for python2.7 and above. It can be installed from the Python
Package Index with pip install bioepic
or preferably from bioconda with conda install -c bioconda epic
, or by cloning the repo at
https://github.com/endrebak/epic and running python setup.py install
# 0.2.12 (02.08.18)
- Fix bug in bigwig creation (due to new pd behavior)
# 0.2.11 (01.08.18)
- Massive speedup with Cython
- Fix bugs added by 0.2.10
# 0.2.10 (31.07.18)
- Use categoricals for memory efficiency and speed
- Fix bug in epic when using chromsizes
$ pip install bioepic
$ # you only need git clone to get the test data
$ git clone https://github.com/endrebak/epic.git
$ # -t is treatment files, -c input (control) files
$ epic -t epic/examples/test.bed -c epic/examples/control.bed > results.csv
Will be maintained and further updated.
We hope to make further refinements to the actual algorithm and make it even better.
epic accepts several input and ChIP files at the same time and accepts bed files
- both block-zipped and gzipped.
Works on files of any size.
Works on all Python versions 2.7/3+.
epic can use one core per chromosome, which should give a speedup of < ~22-25 (differs by species) by itself. In addition, epic uses the Python science stack, including Pandas, for almost all tasks, which means each core runs heavily optimized C, Fortran and Cython code for further speed gains.
epic streams the data instead of loading it all into memory, which should result in a much smaller memory footprint.
Instead of needing eleven command line arguments to run, epic contains sensible defaults and only needs the files it is to analyze as parameters.
epic can be run from whichever location with files found anywhere on the disk.
Contains many genomes, with updated effective genome sizes. If your genome isn't listed, please request it!
This is a beta release. Please do aggressively report issues, quirks, complaints and anything that just feels slightly off to the issue tracker. Also please ask questions and make docrequests - there is loads of neat stuff I have not documented.
MIT
Python data science stack and a fairly recent version of Pandas (0.17 >=). Python 2.7 or 3+. Various unix tools found on all major distributions.
For the effective genome size script jellyfish2 is required.
See this page for the various helper scripts that are a part of epic.
If you have bam files, these can be converted to bed with the command
bamToBed -i file.bam > file.bed
If you have paired-end data, you can use
bamToBed -bedpe -i paired_end_file.bam > file.bedpe
(Might be slightly out of date.)
usage: epic [-h] --treatment TREATMENT [TREATMENT ...] --control CONTROL
[CONTROL ...] [--number-cores NUMBER_CORES] [--genome GENOME]
[--keep-duplicates] [--window-size WINDOW_SIZE]
[--gaps-allowed GAPS_ALLOWED] [--fragment-size FRAGMENT_SIZE]
[--false-discovery-rate-cutoff FALSE_DISCOVERY_RATE_CUTOFF]
[--effective_genome_fraction EFFECTIVE_GENOME_FRACTION]
[--chromsizes CHROMSIZES] [--store-matrix STORE_MATRIX]
[--bigwig BIGWIG] [--sum-bigwig SUM_BIGWIG] [--bed BED]
[--log LOG] [--outfile OUTFILE] [--version]
Diffuse domain ChIP-Seq caller based on SICER. (Visit github.com/endrebak/epic
for examples and help.)
optional arguments:
-h, --help show this help message and exit
--treatment TREATMENT [TREATMENT ...], -t TREATMENT [TREATMENT ...]
Treatment (pull-down) file(s) in (b/gzipped) bed/bedpe
format.
--control CONTROL [CONTROL ...], -c CONTROL [CONTROL ...]
Control (input) file(s) in (b/gzipped) bed/bedpe
format.
--number-cores NUMBER_CORES, -cpu NUMBER_CORES
Number of cpus to use. Can use at most one per
chromosome. Default: 1.
--genome GENOME, -gn GENOME
Which genome to analyze. Default: hg19.
--keep-duplicates, -k
Keep reads mapping to the same position on the same
strand within a library. Default is to remove all but
the first duplicate.
--window-size WINDOW_SIZE, -w WINDOW_SIZE
Size of the windows to scan the genome. WINDOW_SIZE is
the smallest possible island. Default 200.
--gaps-allowed GAPS_ALLOWED, -g GAPS_ALLOWED
Multiple of window size used to determine the gap
size. Must be an integer. Default: 3.
--fragment-size FRAGMENT_SIZE, -fs FRAGMENT_SIZE
(Single end reads only) Size of the sequenced
fragment. The center of the the fragment will be taken
as half the fragment size. Default 150.
--false-discovery-rate-cutoff FALSE_DISCOVERY_RATE_CUTOFF, -fdr FALSE_DISCOVERY_RATE_CUTOFF
Remove all islands with an FDR below cutoff. Default
0.05.
--effective_genome_fraction EFFECTIVE_GENOME_FRACTION, -egf EFFECTIVE_GENOME_FRACTION
Use a different effective genome fraction than the one
included in epic. The default value depends on the
genome and readlength, but is a number between 0 and
1.
--chromsizes CHROMSIZES, -cs CHROMSIZES
Set the chromosome lengths yourself in a file with two
columns: chromosome names and sizes. Useful to analyze
custom genomes, assemblies or simulated data. Only
chromosomes included in the file will be analyzed.
--store-matrix STORE_MATRIX, -sm STORE_MATRIX
Store the matrix of counts per bin for ChIP and input
to gzipped file <STORE_MATRIX>.
--bigwig BIGWIG, -bw BIGWIG
For each file, store a bigwig of both enriched and
non-enriched regions to folder <BIGWIG>. Requires
different basenames for each file.
--sum-bigwig SUM_BIGWIG, -sbw SUM_BIGWIG
Store two bigwigs - one of ChIP, one of input - to
folder <SUM-BIGWIG>.
--bed BED, -b BED A summary bed file of all regions for display in the
UCSC genome browser or downstream analyses with e.g.
bedtools. The score field is log2(#ChIP/#Input) * 100
capped at a 1000.
--log LOG, -l LOG File to write log messages to.
--outfile OUTFILE, -o OUTFILE
File to write results to. By default sent to stdout.
--version, -v show program's version number and exit
Chongzhi Zang, Dustin E. Schones, Chen Zeng, Kairong Cui, Keji Zhao and Weiqun Peng for the original SICER. Please consider citing their paper (in addition to our eventual paper) if you use epic. And if you use any (helper) scripts in SICER that are not included in epic you should of course cite the SICER paper!
Most of the improvements in epic were possible due to Python Science libraries that were not available when SICER was originally written. Thanks to the Pandas developers!
Endre Bakken Stovner
- Pål Sætrom (algorithmic/theoretical discussions, endless patience)
- Dario Beraldi (argparsing)
- Ryan Dale (bioconda, ideas, genome info script)
- Ryan C. Thompson (bigwig improvements, mypy typing)
- Piotr Balwierz (helping me debug the paired-end mode)
- Keith Siklenka (lending me a subset of some paired-end ChIP-Seq files)
Briefings in Bioinformatics, 2016, 1–14
Answers to some questions no-one has ever asked me.
The wonderful thing about the SICER algorithm is that is very careful about dropping windows with few reads in them. All ChIP-seq callers I know have some preprocessing step where this is done liberally. SICER pools these windows together and gives them a composite score, allowing very long stretches of very diffuse signal to be detected.
MACS2 is great for narrow peaks, but epic performs better on diffuse domains. For medium size domains, such as PolII, our tests indicate that both perform about equally well, but epic uses only a fraction of the time.
SICER contains a great algorithm and is a wonderful piece of software, but advances in the Python data science libraries has made it possible to implement it much more efficiently. Furthermore, SICER was not made to handle the mountains of data we have now; it simply cannot run on very large datasets due to (sensible) restrictions in the original implementation.
Dunno. We do not want to write a methods paper, but rather just include a section about epic in an appropriate biology paper sometime.
It stands for electronic pic [sic] caller or epigenome cartographer, whichever you prefer. Or perhaps it isn't just another bogus bioinformatics acronym. Hope you find the name fitting.
But suggestions for better names accepted. On paper I liked the epi/epic/epigenetics link but now when I hear it it sounds so boastful I cringe. exorcised sounds like a slight on the original software... Mad MACS?
- SICER - great diffuse domain ChIP-Seq caller (which epic is based on.)
- SICERpy - a wrapper around SICER for convenience/parallelism. Stole some good ideas from there.
- csaw - R package. Uses an approach to island finding that complements epic very well. Requires more statistical sophistication and programming skill to use.
- MACS2 - my preferred peak caller.