---
title: "⚖<code>EpiCompare</code>⚖<br>QC and Benchmarking of Epigenomic Datasets"
author: "`r rworkflows::use_badges(add_doi = 'https://doi.org/10.1101/2022.07.22.501149',
add_bioc_release = TRUE,
add_bioc_download_month = TRUE,
add_bioc_download_total = TRUE,
add_bioc_download_rank = TRUE)`"
date: "<h5><i>Updated</i>: `r format(Sys.Date(), '%b-%d-%Y')`</h5>"
output:
github_document
---
```{r, echo=FALSE, include=FALSE}
pkg <- read.dcf("DESCRIPTION", fields = "Package")[1]
title <- read.dcf("DESCRIPTION", fields = "Title")[1]
description <- read.dcf("DESCRIPTION", fields = "Description")[1]
URL <- read.dcf('DESCRIPTION', fields = 'URL')[1]
owner <- tolower(strsplit(URL,"/")[[1]][4])
```
# Introduction
`EpiCompare` is an R package for comparing multiple epigenomic datasets
for quality control and benchmarking purposes. Its main function,
`EpiCompare()`, outputs a report in HTML format consisting of three sections:
1. **General Metrics**: Metrics on peaks (percentage of blacklisted and
non-standard peaks, and peak widths) and fragments (duplication
rate) of samples.
2. **Peak Overlap**: Frequency, percentage, statistical significance of
overlapping and non-overlapping peaks. This also includes Upset,
precision-recall and correlation plots.
3. **Functional Annotation**: Functional annotation (ChromHMM, ChIPseeker
and enrichment analysis) of peaks. Also includes peak enrichment
around Transcription Start Site.
*Note*: Peaks located in blacklisted regions and non-standard chromosomes are
removed from the files prior to analysis.
# Installation
## Standard
To install `EpiCompare` use:
```r
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("EpiCompare")
```
## All dependencies
<details>
<summary>👈 <strong>Details</strong></summary>
Installing all *Imports* and *Suggests* will allow you to use the full functionality of `EpiCompare` right away, without having to stop and install extra dependencies later on.
To install these packages as well, use:
```R
BiocManager::install("EpiCompare", dependencies=TRUE)
```
Note that this will increase installation time,
but it means that you won't have to install any additional R packages later
when using functions that rely on suggested dependencies.
</details>
## Development
<details>
<summary>👈 <strong>Details</strong></summary>
To install the development version of `EpiCompare`, use:
```R
if (!require("remotes")) install.packages("remotes")
remotes::install_github("neurogenomics/EpiCompare")
```
</details>
## Citation
If you use ``r pkg``, please cite:
<!-- Modify this by editing the file: inst/CITATION -->
> `r citation(pkg)$textVersion`
# Documentation
## [EpiCompare website](https://neurogenomics.github.io/EpiCompare)
## [Docker/Singularity container](https://neurogenomics.github.io/EpiCompare/articles/docker)
## [Bioconductor page](https://doi.org/doi:10.18129/B9.bioc.EpiCompare)
### :warning: Note on documentation versioning
The documentation in this README and the [GitHub Pages website](https://neurogenomics.github.io/EpiCompare/)
pertains to the *development* version of `EpiCompare`.
Older versions of `EpiCompare` may have slightly different documentation
(e.g. available functions, parameters). For documentation of older versions of
`EpiCompare`, please see the **Documentation** section of the relevant
version on [Bioconductor](https://doi.org/doi:10.18129/B9.bioc.EpiCompare).
# Usage
Load package and example datasets.
```r
library(EpiCompare)
data("encode_H3K27ac") # example peakfile
data("CnT_H3K27ac") # example peakfile
data("CnR_H3K27ac") # example peakfile
data("CnT_H3K27ac_picard") # example Picard summary output
data("CnR_H3K27ac_picard") # example Picard summary output
```
Prepare input files:
```r
# create named list of peakfiles
peakfiles <- list("CnT"=CnT_H3K27ac,
"CnR"=CnR_H3K27ac)
# set ref file and name
reference <- list("ENCODE_H3K27ac" = encode_H3K27ac)
# create named list of Picard summary
picard_files <- list("CnT"=CnT_H3K27ac_picard,
"CnR"=CnR_H3K27ac_picard)
```
<details>
<summary><strong>👈 Tips on importing user-supplied files</strong></summary>
`EpiCompare::gather_files` is helpful for identifying and importing
peak or Picard files.
```r
# To import BED files as GRanges object
peakfiles <- EpiCompare::gather_files(dir = "path/to/peaks/",
type = "peaks.stringent")
# EpiCompare alternatively accepts paths (to BED files) as input
peakfiles <- list(sample1="/path/to/peaks/file1_peaks.stringent.bed",
sample2="/path/to/peaks/file2_peaks.stringent.bed")
# To import Picard summary output txt file as data frame
picard_files <- EpiCompare::gather_files(dir = "path/to/peaks",
type = "picard")
```
</details>
Run `EpiCompare()`:
```r
EpiCompare::EpiCompare(peakfiles = peakfiles,
genome_build = list(peakfiles="hg19",
reference="hg19"),
genome_build_output = "hg19",
picard_files = picard_files,
reference = reference,
run_all = TRUE,
output_dir = tempdir())
```
#### Required Inputs
These input parameters must be provided:
<details>
<summary>👈 <strong>Details</strong></summary>
- `peakfiles` : Peakfiles you want to analyse. EpiCompare accepts
peakfiles as GRanges object and/or as paths to BED files. Files must
be listed and named using `list()`.
E.g. `list("name1"=peakfile1, "name2"=peakfile2)`.
- `genome_build` : A named list indicating the human genome build used to
generate each of the following inputs:
    - `peakfiles` : Genome build for the `peakfiles` input. Assumes genome build
    is the same for each element in the `peakfiles` list.
    - `reference` : Genome build for the `reference` input.
    - `blacklist` : Genome build for the `blacklist` input. <br>
    E.g. `genome_build = list(peakfiles="hg38", reference="hg19", blacklist="hg19")`
- `genome_build_output` : Genome build to standardise all inputs to. Liftovers
will be performed automatically as needed. Default is "hg19".
- `blacklist` : Peakfile as GRanges object specifying genomic regions
that have anomalous and/or unstructured signals independent of the
cell-line or experiment. For human hg19 and hg38 genome, use
built-in data `data(hg19_blacklist)` and `data(hg38_blacklist)`
respectively. For mouse mm10 genome, use built-in data `data(mm10_blacklist)`.
- `output_dir` : Path to the directory where all
`EpiCompare` outputs will be saved.
</details>
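The required inputs fit together roughly as in the sketch below, which assumes the built-in `hg19_blacklist` data described above. The helper `is_named_list()` and the wrapper `run_required_only()` are hypothetical names introduced for illustration only (they are not part of `EpiCompare`). The helper checks that every list element carries a name, as the parameter descriptions require, and the wrapper does nothing until you call it.

```r
# Hypothetical helper (not part of EpiCompare): TRUE if x is a non-empty
# list in which every element has a non-empty name.
is_named_list <- function(x) {
  is.list(x) && length(x) > 0 &&
    !is.null(names(x)) && all(nzchar(names(x)))
}

# One genome build per supplied input type.
genome_build <- list(peakfiles = "hg19",
                     blacklist = "hg19")
stopifnot(is_named_list(genome_build))

# Hypothetical wrapper around the documented call.
# Requires EpiCompare to be installed; nothing runs until it is called.
run_required_only <- function(peakfiles, output_dir = tempdir()) {
  stopifnot(is_named_list(peakfiles))
  # Load the built-in blacklist into this function's environment.
  data("hg19_blacklist", package = "EpiCompare", envir = environment())
  EpiCompare::EpiCompare(peakfiles = peakfiles,
                         genome_build = genome_build,
                         genome_build_output = "hg19",
                         blacklist = hg19_blacklist,
                         output_dir = output_dir)
}
```

Calling `run_required_only(peakfiles)` with the named list from the **Usage** section would then produce a report in `tempdir()`.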
#### Optional Inputs
The following input files are optional:
<details>
<summary>👈 <strong>Details</strong></summary>
- `picard_files` : A list of summary metrics output from
[Picard](https://broadinstitute.github.io/picard/). *Picard MarkDuplicates*
can be used to identify the duplicate reads amongst the alignment. This tool
generates a summary output, normally with the ending
*.markdup.MarkDuplicates.metrics.txt*. If this input is provided, metrics on
fragments (e.g. mapped fragments and duplication rate) will be included
in the report. Each file must be imported as a data frame; collect the data
frames with `list()` and name them with `names()`. To import Picard duplication
metrics (.txt file) into R as a data frame, use
`picard <- read.table("/path/to/picard/output", header = TRUE, fill = TRUE)`.
- `reference` : Reference peak file(s) used in `stat_plot` and
`chromHMM_plot`. Each file must be a `GRanges` object, listed and named
using `list("reference_name" = GRanges_object)`. If more than one reference
is specified, `EpiCompare` outputs an individual report for each reference.
Note that this can take a while.
</details>
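As a minimal sketch of the Picard import step, the example below writes a toy metrics file and reads it back with the `read.table()` call quoted above. The file contents are made up for illustration: real *MarkDuplicates* output has more columns and header lines, and the sample name `"CnT"` is only a placeholder.

```r
# Toy stand-in for a Picard MarkDuplicates metrics file.
# Values are invented; real output contains more columns.
toy_metrics <- c(
  "## METRICS CLASS picard.sam.DuplicationMetrics",
  "LIBRARY\tREAD_PAIRS_EXAMINED\tPERCENT_DUPLICATION",
  "lib1\t1000\t0.12"
)
picard_path <- tempfile(fileext = ".txt")
writeLines(toy_metrics, picard_path)

# Lines starting with "#" are treated as comments by read.table() by default.
picard <- read.table(picard_path, header = TRUE, fill = TRUE)

# Collect the data frame(s) in a list, then name each element by sample.
picard_files <- list(picard)
names(picard_files) <- "CnT"
```

With several samples, read each file the same way and give every list element the same name as the matching entry in `peakfiles`.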
#### Optional Plots
By default, these plots will not be included in the report unless set to `TRUE`.
To turn on all features at once, simply use the `run_all=TRUE` argument:
<details>
<summary>👈 <strong>Details</strong></summary>
- `upset_plot` : Upset plot of overlapping peaks between samples.
- `stat_plot` : Included only if a `reference` dataset is provided.
The plot shows statistical significance (p/q-values) of sample peaks
that are overlapping/non-overlapping with the `reference` dataset.
- `chromHMM_plot` : ChromHMM annotation of peaks. If a `reference`
dataset is provided, ChromHMM annotation of overlapping and
non-overlapping peaks with the `reference` is also included in the
report.
- `chipseeker_plot` : ChIPseeker annotation of peaks.
- `enrichment_plot` : KEGG pathway and GO enrichment analysis of
peaks.
- `tss_plot` : Peak frequency around (+/- 3000bp) transcription
start sites. Note that it may take a while to generate this plot for
large sample sizes.
- `precision_recall_plot` : Plot showing the precision-recall score across
the peak calling stringency thresholds.
- `corr_plot` : Plot showing the correlation between the quantiles when the
genome is binned at a set size. These quantiles are based on the intensity
of the peak, dependent on the peak caller used (q-value for MACS2).
</details>
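If you only want some of the optional plots rather than `run_all=TRUE`, the individual flags can be set one by one. The sketch below uses a hypothetical helper, `select_plots()` (not part of `EpiCompare`), which turns a vector of wanted plot names into a named list of `TRUE`/`FALSE` arguments and rejects misspelled names.

```r
# Plot flags documented above; none is included unless set to TRUE.
plot_flags <- c("upset_plot", "stat_plot", "chromHMM_plot",
                "chipseeker_plot", "enrichment_plot", "tss_plot",
                "precision_recall_plot", "corr_plot")

# Hypothetical helper (not part of EpiCompare): build a named list of
# logical arguments, TRUE for each wanted plot and FALSE otherwise.
select_plots <- function(wanted) {
  unknown <- setdiff(wanted, plot_flags)
  if (length(unknown) > 0) {
    stop("Unknown plot flag(s): ", paste(unknown, collapse = ", "))
  }
  as.list(setNames(plot_flags %in% wanted, plot_flags))
}

plot_args <- select_plots(c("upset_plot", "chipseeker_plot"))

# The flags can then be appended to the other arguments, e.g.:
# do.call(EpiCompare::EpiCompare,
#         c(list(peakfiles = peakfiles,
#                genome_build = list(peakfiles = "hg19"),
#                output_dir = tempdir()),
#           plot_args))
```

Passing the flags directly, e.g. `EpiCompare(..., upset_plot = TRUE, chipseeker_plot = TRUE)`, is equivalent; the helper is only a convenience for scripted runs.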
#### Other Options
<details>
<summary>👈 <strong>Details</strong></summary>
- `chromHMM_annotation` : Cell-line annotation for ChromHMM. Default
is K562. Options are:
    - "K562" = K-562 cells
    - "Gm12878" = Cellosaurus cell-line GM12878
    - "H1hesc" = H1 Human Embryonic Stem Cell
    - "Hepg2" = Hep G2 cell
    - "Hmec" = Human Mammary Epithelial Cell
    - "Hsmm" = Human Skeletal Muscle Myoblasts
    - "Huvec" = Human Umbilical Vein Endothelial Cells
    - "Nhek" = Normal Human Epidermal Keratinocytes
    - "Nhlf" = Normal Human Lung Fibroblasts
- `interact` : By default, all heatmaps (percentage overlap and
ChromHMM heatmaps) in the report will be interactive. If set to `FALSE`,
all heatmaps will be static. N.B. If `interact=TRUE`, interactive
heatmaps will be saved as HTML files, which may take time for larger
sample sizes.
- `output_filename` : By default, the report is named *EpiCompare.html*.
You can specify the file name of the report here.
- `output_timestamp` : Default is `FALSE`. If `TRUE`, the filename of the
report includes the date.
</details>
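The options above can be collected in a list before the call, as in the minimal sketch below. The chosen values (`"Hepg2"`, the file name `"H3K27ac_benchmark"`) are arbitrary placeholders. Whether `output_filename` expects the `.html` extension is not stated here, so check `?EpiCompare` before relying on it.

```r
# Cell-line options listed above for `chromHMM_annotation`.
chromHMM_cell_lines <- c("K562", "Gm12878", "H1hesc", "Hepg2", "Hmec",
                         "Hsmm", "Huvec", "Nhek", "Nhlf")

# match.arg() raises an error if the value is not one of the options.
chromHMM_annotation <- match.arg("Hepg2", chromHMM_cell_lines)

# Placeholder values; adjust to your own analysis.
other_args <- list(chromHMM_annotation = chromHMM_annotation,
                   interact = FALSE,            # static heatmaps
                   output_filename = "H3K27ac_benchmark",
                   output_timestamp = TRUE)     # add the date to the file name
```

The resulting `other_args` list can be combined with the required inputs and passed to `EpiCompare::EpiCompare()` via `do.call()`.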
#### Outputs
`EpiCompare` outputs the following:
1. **HTML report**: A summary of all analyses, saved in the specified
`output_dir`.
2. **EpiCompare_file**: If `save_output=TRUE`, all plots generated by
`EpiCompare` are saved in the *EpiCompare_file* directory, also within the
specified `output_dir`.
An example report comparing ATAC-seq and DNase-seq can be found
[here](https://neurogenomics.github.io/EpiCompare/articles/example_report).
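To locate the report after a run, you can list the HTML files in `output_dir`. The sketch below uses a hypothetical helper, `find_reports()` (not part of `EpiCompare`), and an empty placeholder file standing in for a real report so that the example is self-contained.

```r
# Hypothetical helper (not part of EpiCompare): list HTML files in a directory.
find_reports <- function(output_dir) {
  list.files(output_dir, pattern = "\\.html$", full.names = TRUE)
}

# Demonstration directory with an empty placeholder instead of a real report.
demo_dir <- tempfile("epicompare_demo_")
dir.create(demo_dir)
file.create(file.path(demo_dir, "EpiCompare.html"))

reports <- find_reports(demo_dir)

# To open the first report in a browser:
# utils::browseURL(reports[1])
```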
## Datasets
`EpiCompare` includes several built-in datasets:
<details>
<summary>👈 <strong>Details</strong></summary>
- `encode_H3K27ac`: Human H3K27ac peak file generated with ChIP-seq using K562
cell-line. Taken from [ENCODE](https://www.encodeproject.org/files/ENCFF044JNJ/)
project. For more information, run `?encode_H3K27ac`.
- `CnT_H3K27ac`: Human H3K27ac peak file generated with CUT&Tag using K562
cell-line from [Kaya-Okur et al., (2019)](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8383507). For more
information, run `?CnT_H3K27ac`.
- `CnR_H3K27ac`: Human H3K27ac peak file generated with CUT&Run using K562
cell-line from [Meers et al., (2019)](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8581604).
For more details, run `?CnR_H3K27ac`.
</details>
## Contact
### [Neurogenomics Lab](https://www.neurogenomics.co.uk/inst/report/EpiCompare.html)
UK Dementia Research Institute
Department of Brain Sciences
Faculty of Medicine
Imperial College London
[GitHub](https://github.com/neurogenomics)
[DockerHub](https://hub.docker.com/orgs/neurogenomicslab)
## Session Info
<details>
<summary>👈 <strong>Details</strong></summary>
```{r Session Info}
utils::sessionInfo()
```
</details>
<hr>