-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathanalysis.Rmd
1837 lines (1438 loc) · 61.8 KB
/
analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: 'chromswitch: A flexible method to detect chromatin state switches - *Analysis*'
author: "Selin Jessa and Claudia L. Kleinman"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
df_print: paged
number_sections: yes
toc: yes
toc_float: yes
code_folding: show
theme: "flatly"
highlight: "default"
html_notebook:
df_print: paged
number_sections: yes
toc: yes
toc_float: yes
---
```{r setup, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(dev = c("png", "pdf"), # Display PNG but also keep PDF
fig.keep = "all",
fig.path = "figures/",
cache = TRUE,
cache.path = "cache/")
```
# Introduction
This document executes all the analysis presented in
*chromswitch: A flexible method for detecting chromatin state switches*, from
downloading the data, to running experiments, to generating the tables and figures
included in the paper. The dropdown in the top right corner of this HTML doument
controls whether code is shown or hidden. The document can be navigated using the menu on the left.
**About the project:**
In this paper, we present an R/Bioconductor package, `chromswitch`, which implements a flexible method to detect chromatin state switches between samples in two biological conditions in a specific genomic region of interest given peaks or chromatin state calls from ChIP-seq data. `chromswitch` is available from Bioconductor 3.6 at [https://bioconductor.org/packages/chromswitch](https://bioconductor.org/packages/chromswitch). We analyze its
performance on a benchmark dataset, study its robustness to changes in tuning parameters, sample size and imbalance, window size, and varying mark profiles; and perform a genome-wide validation.
**How this repository is set up:**
- This HTML file is generated by rendering `analysis.Rmd`, and executes
all the analysis
- `input` contains a few individual data and metadata files needed for the analysis
- `data` stores data which is downloaded during the analysis
- `figures` contains all figures produced in the analysis; the subfolder `heatmaps` stores the figures output by the tool
- `functions` stores a few long functions used (`source()`'d by) the analysis
- `computations` stores the results of computationally-intensive steps of the analysis,
which are saved in the main document after being executed once, and then loaded
for further analysis
- `tables` stores the data behind each figure as TSV files, and tables presented in the supplementary material
**How to re-run the analysis:**
To run the analysis, this R Markdown file `analysis.Rmd` needs to be [knit](https://yihui.name/knitr/). This involves executing every chunk, or piece of code, sequentially. To feasibly handle chunks with computationally-intensive code, we
typically evaluate the chunk once, and then set `eval = FALSE` for that chunk. The results are saved as an `.RData` file in the `computations` folder, and then loaded in a following chunk so that the results can be processed and visualized. To re-run the analysis, including these computationally-intensive steps, *that option needs to be changed back to `eval = TRUE` for every chunk*.
Since the analysis would take multiple days to run, we
recommend running the analysis from the command line, using the `run.sh`
bash script. If HPC resources are available, much of the work can be parallelized
by modifying that script and setting the `n_cores` variable in the next section accordingly.
**Dependencies:**
We used R 3.3.1 and [Bioconductor](http://bioconductor.org/) 3.5. The R packages required are loaded below, see the exact
versions in [Session Info](#sessioninfo). All the data needed for the analysis
is provided in the repository or downloaded in this document.
# Preliminaries
Load general packages:
```{r install_pkg, message = FALSE, warning = FALSE, cache = FALSE}
# Visualization
library(RColorBrewer)
library(ggplot2)
# Install and download custom ggplot2 theme
devtools::install_github("sjessa/ggmin")
library(ggmin)
# Genomics
library(GenomicRanges)
library(rtracklayer)
library(Homo.sapiens)
library(biomaRt)
# General-purpose
library(pROC)
library(xtable)
library(BiocParallel)
library(parallel)
library(magrittr)
library(plyr)
library(purrr)
library(tidyverse)
library(dplyr)
library(stringi)
# Source functions
source("functions/plotting_functions.R")
source("functions/subsampling_functions.R")
source("functions/misc_functions.R")
```
Install `chromswitch` from the [GitHub repository](https://github.com/sjessa/chromswitch):
```{r install_cs, message = FALSE, cache = FALSE}
devtools::install_github("sjessa/[email protected]")
library(chromswitch)
```
**Note on chromswitch version**: `chromswitch` is [now available from
Bioconductor 3.6](https://bioconductor.org/packages/release/bioc/html/chromswitch.html), but for reproducibility, the above command will download
v0.99.0 of `chromswitch`, with which the bulk of the analysis was done: https://github.com/sjessa/chromswitch/releases/tag/v0.99.0.
Set-up for parallel computations:
```{r n_cores}
# For other parallel computations
n_cores <- 1
```
`chromswitch` is implemented in a relatively modular way, with a function for
each step of the algorithm. It also provides two wrapper functions, one for
each feature construction strategy, which runs the whole analysis.
Although the wrapper functions in chromswitch can analyze multiple query regions in parallel,
we will parallelize outside the function in order to catch any errors that
might arise.
```{r safely, cache = FALSE}
safeCallSummary <- purrr::safely(chromswitch::callSummary)
safeCallBinary <- purrr::safely(chromswitch::callBinary)
```
# Download epigenomic data
The data we analyze comes from the Roadmap Epigenomics Program. We evaluate three types of input:
1. H3K4me3 peaks
2. DNase I peaks/DNase I hypersensitive sites
3. ChromHMM segmentation learned on the core histone marks in the Roadmap analysis, filtered to regions assigned the state "Active TSS".
We will work with 23 adult tissue samples from Roadmap. 7 are brain tissues,
and 16 are various other tissues.
```{r metadata, message = FALSE}
# Read in the samples
meta <- read_tsv("input/sample_metadata.tsv")
meta
table(meta$Condition)
```
```{r noclobber}
# Download the file if it doesn't exist
noClobberDownload <- function(url, destpath) {
if(!(file.exists(destpath))) download.file(url, destpath)
}
```
We retrieve the data from this link: [http://egg2.wustl.edu/roadmap/data/byFileType/](http://egg2.wustl.edu/roadmap/data/byFileType/).
```{r pk_k4me3, message = FALSE, eval = FALSE}
urls_k4 <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/",
meta$Sample, "-H3K4me3.narrowPeak.gz")
basenames_k4 <- lapply(urls_k4, basename) %>% unlist()
destpaths_k4 <- paste0("data/H3K4me3_peaks/", basenames_k4)
dl_k4 <- mcMap(noClobberDownload, urls_k4, destpaths_k4, mc.cores = n_cores)
extra_cols <- c("name" = "character", "score" = "integer", "strand" = "character",
"signalValue" = "numeric", "pValue" = "numeric", "qValue" = "numeric",
"peak" = "numeric")
pk_k4me3 <- lapply(destpaths_k4, rtracklayer::import, format = "bed",
extraCols = extra_cols)
names(pk_k4me3) <- meta$Sample
save(pk_k4me3, file = "computations/pk_k4me3.RData")
```
```{r pk_dnase, message = FALSE, eval = FALSE}
urls_dnase <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidatedImputed/narrowPeak/",
meta$Sample,
"-DNase.imputed.narrowPeak.bed.nPk.gz")
basenames_dnase <- lapply(urls_dnase, basename) %>% unlist()
destpaths_dnase <- paste0("data/DNase_peaks/", basenames_dnase)
dl_dnase <- mcMap(noClobberDownload, urls_dnase, destpaths_dnase,
mc.cores = n_cores)
pk_dnase <- lapply(destpaths_dnase, rtracklayer::import, format = "bed")
names(pk_dnase) <- meta$Sample
save(pk_dnase, file = "computations/pk_dnase.RData")
```
To convert the chromatin state segmentations into a format that can be used
by `chromswitch`, we will filter to regions assigned "Active_TSS", the state
associated with active transcription.
```{r read_chromhmm, eval = FALSE}
urls_st <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/",
meta$Sample, "_15_coreMarks_mnemonics.bed.gz")
basenames_st <- lapply(urls_st, basename) %>% unlist()
destpaths_st <- paste0("data/ChromHMM_segmentations/", basenames_st)
dl_st <- mcMap(noClobberDownload, urls_st, destpaths_st, mc.cores = 6)
extra_cols <- c("State" = "character")
chmm_states <- lapply(destpaths_st, rtracklayer::import, format = "bed",
extraCols = extra_cols) %>%
lapply(as.data.frame) %>%
# We will treat regions assigned the state associated with active
# transcription as the epigenomic features in our analysis
lapply(filter, State == "1_TssA") %>%
lapply(makeGRangesFromDataFrame)
names(chmm_states) <- meta$Sample
save(chmm_states, file = "computations/chmm_states.RData")
```
```{r load_epi}
load("computations/pk_k4me3.RData")
load("computations/pk_dnase.RData")
load("computations/chmm_states.RData")
```
# Benchmark `chromswitch` performance
## Prepare benchmarking dataset
We manually assembled a list of 120 genes on which to benchmark `chromswitch`
performance. 60 of these genes exhibit chromatin state switches between brain
and other tissues based on change in H3K4me3 signal (whose enrichment at promoters is associated with active transcription), DNase I hypersensitivity (associated with open chromatin), and expression (the functional consequence of chromatin changes). The other 60 are negative examples, where there was no change
across the tissues in H3K4me3, DNase I, or expression data. These genes are
specified in the file `input/benchmark.tsv`.
To define windows for these genes to use as input for our experiment, we retrieve
a list of genes and coordinates in the human genome from UCSC. This file is
saved in the `data` directory, and we obtained it using the [UCSC Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=602106695_Ii01YSgPob6wraDd1cmsVdOT4eau&clade=mammal&org=&db=hg19&hgta_group=genes&hgta_track=refSeqComposite&hgta_table=ncbiRefSeq&hgta_regionType=genome&position=&hgta_outputType=primaryTable&hgta_outFileName=), with the following selections:
* `group`: Genes and Gene Predictions
* `track`: RefSeq
* Fields: `name`, `chrom`, `strand`, `txStart`, `txEnd`, and `name2`
```{r refseq, message = FALSE}
all_genes <- read_tsv("data/refseq_genes.hg19.tsv",
col_names = c("transcript_id", "chromosome", "strand",
"txStart", "txEnd", "gene_name"), skip = 1)
head(all_genes)
```
Define 5kbp windows around the TSS:
```{r bm, message = FALSE}
benchmark <- read_tsv("input/benchmark.tsv") %>%
left_join(all_genes, by = c("transcript_id", "gene_name")) %>%
# Define window around the TSS
mutate(start = case_when(
strand == "+" ~ as.numeric(txStart) - 2500,
strand == "-" ~ as.numeric(txEnd) - 2500)) %>%
mutate(end = case_when(
strand == "+" ~ as.numeric(txStart) + 2500,
strand == "-" ~ as.numeric(txEnd) + 2500)) %T>%
write_tsv("input/benchmark_coords.5kbp.tsv") %>%
makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>%
keepStandardChromosomes()
```
*Table S1*
```{r table_s1}
benchmark %>%
as.data.frame %>%
dplyr::select(gene_name, transcript_id, chr = seqnames, start, end, switch) %>%
arrange(desc(switch), gene_name) %T>%
write_tsv("tables/tableS2.tsv") %>%
xtable()
```
## Prepare genome-wide windows for validation
In order to later validate `chromswitch` genome-wide, we also clean up the gene list to define 5kbp
windows genome-wide.
```{r refseq_unique}
refseq_unique <- all_genes %>%
# Keep the longest transcript per gene
mutate(length = txEnd - txStart) %>%
group_by(gene_name) %>%
dplyr::slice(which.max(length)) %>%
ungroup() %>%
# Filter out genes where the length of the gene is less than half the
# window size
filter(length >= 2500)
```
```{r get_refseq}
getRefSeq <- function(window) {
d <- window / 2
refseq_unique %>%
# Define window around the TSS
mutate(start = case_when(
strand == "+" ~ as.numeric(txStart) - d,
strand == "-" ~ as.numeric(txEnd) - d)) %>%
mutate(end = case_when(
strand == "+" ~ as.numeric(txStart) + d,
strand == "-" ~ as.numeric(txEnd) + d)) %T>%
write_tsv(paste0("input/benchmark_coords.", window, ".tsv")) %>%
makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>%
keepStandardChromosomes(pruning.mode = "coarse")
}
refseq_5kbp <- getRefSeq(5000)
```
## Run `chromswitch` on benchmark dataset
Here we evaluate each feature matrix construction strategy on each type of input.
*Summary strategy applied to H3K4me3*
```{r run_bm_s_k4, eval = FALSE}
bench_s_k_raw <- mclapply(benchmark, safeCallSummary,
peaks = pk_k4me3,
metadata = meta,
mark = "H3K4me3",
filter = TRUE,
filter_columns = c("qValue", "signalValue"),
filter_thresholds = c(10, 4),
normalize = TRUE,
normalize_columns = c("qValue", "signalValue"),
summarize_columns = c("qValue", "signalValue"),
length = FALSE,
fraction = TRUE,
n = FALSE,
optimal_clusters = TRUE,
heatmap = FALSE,
BPPARAM = SerialParam(), mc.cores = n_cores) %>%
purrr::transpose()
bench_s_k <- bench_s_k_raw$result[map_lgl(bench_s_k_raw$error, is_null)] %>%
bind_rows()
save(bench_s_k, file = "computations/bench_s_k.RData")
```
*Summary strategy applied to DNase I*
```{r run_bm_s_dnase, eval = FALSE}
bench_s_d_raw <- mclapply(benchmark, safeCallSummary,
peaks = pk_dnase,
metadata = meta,
mark = "DNase",
filter = FALSE,
normalize = FALSE,
summarize_columns = NULL,
length = FALSE,
fraction = TRUE,
n = FALSE,
optimal_clusters = TRUE,
heatmap = FALSE,
BPPARAM = SerialParam(),
mc.cores = n_cores) %>%
purrr::transpose()
bench_s_d <- bench_s_d_raw$result[map_lgl(bench_s_d_raw$error, is_null)] %>%
bind_rows()
save(bench_s_d, file = "computations/bench_s_d.RData")
```
*Binary strategy applied to H3K4me3*
```{r run_bm_b_k4, warning = FALSE, eval = FALSE}
bench_b_k_raw <- mclapply(benchmark,
safeCallBinary,
peaks = pk_k4me3,
metadata = meta,
filter = TRUE,
filter_columns = c("qValue","signalValue"),
filter_thresholds = c(10, 4),
reduce = TRUE,
gap = 300,
p = 0.4,
optimal_clusters = TRUE,
n_features = TRUE, mc.cores = n_cores) %>%
purrr::transpose()
bench_b_k <- bench_b_k_raw$result[map_lgl(bench_b_k_raw$error,
is_null)] %>%
bind_rows()
save(bench_b_k, file = "computations/bench_b_k.RData")
```
*Binary strategy applied to DNase I*
```{r run_bm_b_dnase, warning = FALSE, eval = FALSE}
bench_b_d_raw <- mclapply(benchmark,
safeCallBinary,
peaks = pk_dnase,
metadata = meta,
filter = FALSE,
reduce = TRUE,
gap = 300,
p = 0.4,
optimal_clusters = TRUE,
n_features = TRUE, mc.cores = n_cores) %>%
purrr::transpose()
bench_b_d <- bench_b_d_raw$result[map_lgl(bench_b_d_raw$error, is_null)] %>%
bind_rows()
save(bench_b_d, file = "computations/bench_b_d.RData")
```
*Summary strategy applied to ChromHMM*
```{r run_bm_s_chmm, warning = FALSE, eval = FALSE}
bench_s_c_raw <- mclapply(benchmark,
safeCallSummary,
peaks = chmm_states,
metadata = meta,
mark = "State",
filter = FALSE,
normalize = FALSE,
summarize_columns = NULL,
length = FALSE,
fraction = TRUE,
n = FALSE,
heatmap = FALSE,
optimal_clusters = TRUE,
BPPARAM = SerialParam(),
mc.cores = n_cores) %>%
purrr::transpose()
bench_s_c <- bench_s_c_raw$result[map_lgl(bench_s_c_raw$error, is_null)] %>%
bind_rows()
save(bench_s_c, file = "computations/bench_s_c.RData")
```
*Binary strategy applied to ChromHMM*
```{r run_bench_b_chmm, warning = FALSE, eval = FALSE}
bench_b_c_raw <- mclapply(benchmark,
safeCallBinary,
peaks = chmm_states,
metadata = meta,
filter = FALSE,
reduce = FALSE,
p = 0.4,
optimal_clusters = TRUE,
n_features = TRUE,
mc.cores = n_cores) %>%
purrr::transpose()
bench_b_c <- bench_b_c_raw$result[map_lgl(bench_b_c_raw$error,
is_null)] %>%
bind_rows()
save(bench_b_c, file = "computations/bench_b_c.RData")
```
```{r load_bm, cache = FALSE}
load("computations/bench_s_k.RData")
load("computations/bench_s_d.RData")
load("computations/bench_s_c.RData")
load("computations/bench_b_k.RData")
load("computations/bench_b_d.RData")
load("computations/bench_b_c.RData")
```
## Random predictor
We compare with a random predictor by drawing from
from a $Uniform(0, 1)$ distribution, repeating this 10,000 times, and then
averaging the results to assign a score. This is done for each region. It should track with the diagonal
and have an AUROC of around 0.5
```{r random_predictor, eval = FALSE}
#' @param k, the number of query regions to make a prediction for
#' @param N, the number of times to draw
predictRandom <- function(k, N) {
rep(k, N) %>%
mclapply(runif, mc.cores = 2) %>%
data.frame() %>%
rowMeans()
}
# Repeat twice, to get one set of random predictions per method
draws <- replicate(2, predictRandom(120, 10000), simplify = FALSE)
save(draws, file = "computations/draws.RData")
```
```{r load_random}
load("computations/draws.RData")
bench_s_k <- bench_s_k %>%
add_column(Random = draws[[1]], .after = "Consensus")
bench_b_d <- bench_b_d %>%
add_column(Random = draws[[2]], .after = "Consensus")
```
## Shuffled labels
Our method predicts chromatin state switches by computing a score of the similarity
between the known biological condition labels and inferred clusters. We can
assess how these scores look when we shuffle the class labels. We cluster samples based on the feature matrix as usual, but before calculating any of the external validity indices, we randomly shuffle the labels.
```{r shuffle}
predictShuffle <- function(clusters, N) {
shuffle <- function(labels, clusters) {
shuf <- sample(labels)
contingency <- table(shuf, clusters)
data.frame(
ARI_shuffle = mclust::adjustedRandIndex(clusters, shuf),
NMI_shuffle = chromswitch:::NMI(clusters, shuf),
V_measure_shuffle = chromswitch:::vMeasure(contingency)
)
}
# Repeat N times
replicate(N, shuffle(labels = meta$Condition, clust = unlist(clusters)),
simplify = FALSE) %>%
bind_rows() %>%
colMeans()
}
```
```{r shuffle_s, eval = FALSE}
shuffle_s_k <- bench_s_k %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
shuffle_s_d <- bench_s_d %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
shuffle_s_c <- bench_s_c %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
save(shuffle_s_k, shuffle_s_d, shuffle_s_c,
file = "computations/shuffle_s.RData")
```
```{r shuffle_b, eval = FALSE}
shuffle_b_k <- bench_b_k %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
shuffle_b_d <- bench_b_d %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
shuffle_b_c <- bench_b_c %>%
dplyr::select(matches("^E[0-9]{3}")) %>%
t() %>%
as.data.frame %>%
mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>%
as.data.frame() %>%
t() %>%
as.data.frame()
save(shuffle_b_c, shuffle_b_d, shuffle_b_k,
file = "computations/shuffle_b.RData")
```
```{r load_shuf}
load("computations/shuffle_s.RData")
load("computations/shuffle_b.RData")
addShuf <- function(df) {
df %>%
mutate(Consensus_shuffle =
rowMeans(dplyr::select(., ARI_shuffle, NMI_shuffle, V_measure_shuffle)))
}
bench_s_k <- bind_cols(bench_s_k, shuffle_s_k) %>% addShuf()
bench_s_d <- bind_cols(bench_s_d, shuffle_s_d) %>% addShuf()
bench_s_c <- bind_cols(bench_s_c, shuffle_s_c) %>% addShuf()
bench_b_k <- bind_cols(bench_b_k, shuffle_b_k) %>% addShuf()
bench_b_d <- bind_cols(bench_b_d, shuffle_b_d) %>% addShuf()
bench_b_c <- bind_cols(bench_b_c, shuffle_b_c) %>% addShuf()
```
## ROC curves
The problem of a detecting a chromatin state switch in a region can be reframed as the problem of classifying the region as containing a switch between conditions or not. Therefore, our method can be regarded as a binary classifier, and evaluated accordingly. We evaluate the performance of our method on the benchmark set by computing Receiver-Operating Characteristic (ROC) curves based on the consensus score computed by `chromswitch`. ROC curves measure the tradeoff between the true positive rate and the false positive rate as the threshold on the consensus score is varied, as well as the area under the ROC curve (AUROC) which takes on a value of 1 for a perfect classifier and can be interpreted as the probability that `chromswitch` will score a randomly chosen region corresponding to a chromatin state switch higher than a randomly chosen region which does not exhibit a switch.
```{r calcROC = FALSE}
calcROC <- function(scores, method, input, smooth) {
# Compute ROC info using each score as a predictor
roc_out <- scores %>%
dplyr::select(
matches("Consensus|Random")) %>%
map(~ pROC::roc(response = scores$switch,
predictor = as.numeric(.),
smooth = smooth))
# Extract values to plot ROC curves
roc_val <- roc_out %>%
map(`[`, c("sensitivities", "specificities")) %>%
map(as.data.frame) %>%
map2_df(names(roc_out), function(df, index) mutate(df, index = index)) %>%
mutate(TPR = sensitivities,
FPR = 1 - specificities,
Method = method,
Input = input)
return(roc_val)
}
```
```{r benchmark_roc}
benchmark_roc <- bind_rows(
calcROC(bench_s_k, "Summary", "H3K4me3", smooth = TRUE),
calcROC(bench_s_d, "Summary", "DNase I", smooth = TRUE),
calcROC(bench_s_c, "Summary", "ChromHMM", smooth = TRUE),
calcROC(bench_b_k, "Binary", "H3K4me3", smooth = TRUE),
calcROC(bench_b_d, "Binary", "DNase I", smooth = TRUE),
calcROC(bench_b_c, "Binary", "ChromHMM", smooth = TRUE)) %>%
filter(grepl("Consensus|Random", index)) %>%
filter(index != "Random" |
(index == "Random") & (Method == "Binary") & (Input == "DNase I")) %>%
mutate(Input = ifelse(index == "Random", "NA", Input)) %T>%
write_tsv("tables/benchmark_roc.tsv")
roc_gg <- benchmark_roc %>%
makeLabel()
```
*Figure 1b*
```{r fig1b, fig.width = 3.7, fig.height = 4.7}
roc_gg %>%
filter(Input != "DNase I") %>%
filter(index != "Consensus_shuffle") %>%
ggplot(aes(x = FPR, y = TPR, colour = Label, linetype = Label)) +
geom_line(size = 1) +
scale_colour_manual(name = "Strategy", values = pal_strategy) +
scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
theme_min() +
xlab("False Positive Rate") + ylab("True Positive Rate") +
theme(legend.position = "bottom") +
labs(colour = "") + labs(linetype = "") +
guides(colour = guide_legend(ncol = 1))
```
*Figure S2*
```{r figS2, fig.width = 7, fig.height = 5.2}
roc_gg %>%
filter(index != "Random") %>%
ggplot(aes(x = FPR, y = TPR, colour = Label, linetype = Label)) +
geom_line(size = 1) +
scale_colour_manual(name = "Strategy", values = pal_strategy) +
scale_linetype_manual(name = "Strategy", values = linetype_strategy2) +
theme_min() +
# theme(legend.position = "bottom") +
xlab("False Positive Rate") + ylab("True Positive Rate") +
theme(legend.position = "bottom") +
facet_wrap(~ Input) +
labs(colour = "") + labs(linetype = "") +
guides(colour = guide_legend(ncol = 1))
```
## AUROC
```{r auroc, warning = FALSE}
calcAUC <- function(scores, method, input) {
scores %>%
dplyr::select(
matches("Consensus|Random")) %>%
map(~pROC::roc(response = scores$switch, predictor = as.numeric(.))) %>%
map_df("auc") %>%
mutate(Method = method,
Input = input)
}
benchmark_auc <- bind_rows(
calcAUC(bench_s_k, "Summary", "H3K4me3"),
calcAUC(bench_s_d, "Summary", "DNase I"),
calcAUC(bench_s_c, "Summary", "ChromHMM"),
calcAUC(bench_b_k, "Binary", "H3K4me3"),
calcAUC(bench_b_d, "Binary", "DNase I"),
calcAUC(bench_b_c, "Binary", "ChromHMM")) %>%
select(Method, Input, Consensus, Consensus_shuffle, Random)
benchmark_auc
```
*Figure S2*
```{r auc_latex}
# A little manipulation to make the table for the supplementary
benchmark_auc %>%
gather(Stat, Score, Consensus, Consensus_shuffle) %>%
rowwise() %>%
mutate(Method = ifelse(grepl("shuffle", Stat),
paste0(Method, " (shuffled)"), Method)) %>%
ungroup() %>%
dplyr::select(Method, Input, Score) %>%
spread(Input, Score) %T>%
write_tsv("tables/benchmark_auc.tsv") %>%
xtable() # Generate latex code
```
# Robustness experiments
## Tuning parameters
There are two parameters used in the feature construction process for the binary
approach, `gap` which controls whether nearby peaks within samples are connected, and `p` which controls how
much overlap is required to call two peaks the same. To assess the impact
of these parameters, we'll perform a grid search
on various combinations.
Since we want to investigate the effect of varying the tuning parameters,
we select the best threshold on the *Consensus* score on this benchmarking
dataset:
```{r best_threshold, cache = FALSE}
getBest <- function(bench_out) {
best_consensus <- pROC::roc(predictor = bench_out$Consensus,
response = bench_out$switch) %>%
pROC::coords("best",
ret = c("threshold", "sensitivity",
"specificity", "accuracy", "ppv",
as.list = TRUE, drop = TRUE))
best_threshold <- best_consensus["threshold"]
return(best_threshold)
}
best_k <- getBest(bench_b_k)
best_k
```
```{r tuning_grid, eval = FALSE}
grid <- expand.grid(gap = seq(200, 500, by = 50),
p_overlap = seq(0.2, 0.8, by = 0.1))
evaluateParams <- function(gap, p, best_threshold, ...) {
out <- benchmark %>%
mclapply(safeCallBinary, gap = gap, p = p, ...) %>%
purrr::transpose()
ok <- map_lgl(out$error, is_null)
out <- out$result[ok] %>%
bind_rows() %>%
mutate(pred = ifelse(Consensus >= best_threshold, 1, 0))
# Now we can tally the results by comparing the predictions
# to the known classes (switch or not)
stats <- table(benchmark$switch[ok], out$pred)
tn = fn = tp = fp = 0
try(tn <- stats["0", "0"], silent = TRUE)
try(fn <- stats["1", "0"], silent = TRUE)
try(tp <- stats["1", "1"], silent = TRUE)
try(fp <- stats["0", "1"], silent = TRUE)
df <- data.frame(gap, p, tp, tn, fp, fn)
return(list(predictions = out$pred, stats = df))
}
safeEvalParams <- purrr::safely(evaluateParams)
grid_out <- Map(function(gap, p) safeEvalParams(gap, p,
best_threshold = best_k,
peaks = pk_k4me3,
metadata = meta,
filter = TRUE,
filter_columns = c("qValue", "signalValue"),
filter_thresholds = c(10, 4),
reduce = TRUE,
heatmap = FALSE,
BPPARAM = SerialParam(),
optimal_clusters = TRUE,
mc.cores = n_cores),
grid$gap, grid$p_overlap) %>%
purrr::transpose()
save(grid_out, file = "computations/grid.RData")
```
*Figure S4*
```{r figS4_grid_k4me3, fig.width = 9, fig.height = 2.9, fig.keep = "last", fig.show = "last"}
load("computations/grid.RData")
makeGrid(grid_out, "H3K4me3")
load("tables/grid_scores.H3K4me3.RData")
plotGrid(grid_scores)
```
## Sample size and class imbalance
Next, we investigate how sensitive or robust `chromswitch` is to small numbers
of samples and/or to datasets with high class imbalance (*i.e.* when samples
in one condition greatly outnumber samples in the other). To do so, we will
subsample from the samples in the full 23-tissue dataset to create many smaller
datasets with fewer samples and greater class imbalance, and evaluate how
`chromswitch` performs in classifying the benchmark regions using these smaller
datasets.
For the summary feature matrix construction strategy, the computation of the feature vector for each sample is independent of all other samples, so we can save the feature matrices and subsample rows from the matrices. Here
we compute and save the feature matrix for each region:
```{r get_ft_s, eval = FALSE}
getSummaryFeatures <- function(query) {
cols <- c("qValue", "signalValue")
ft_mat <- pk_k4me3 %>%
filterPeaks(columns = cols,
thresholds = c(10, 4)) %>%
normalizePeaks(columns = cols) %>%
retrievePeaks(metadata = meta, region = query) %>%
summarizePeaks(mark = "H3K4me3", cols = cols,
length = FALSE, fraction = TRUE, n = FALSE)
return(ft_mat)
}
safeGetFeatures <- purrr::safely(getSummaryFeatures)
ft_s_k_raw <- benchmark %>%
mclapply(safeGetFeatures, mc.cores = n_cores) %>%
purrr::transpose()
ft_s_k <- ft_s_k_raw$result
save(ft_s_k, ft_s_k_raw,
file = "computations/benchmark_features_s_k.RData")
```
For the binary strategy, the feature matrices are derived from the union of
peaks supplied across samples, and are therefore dependent on the exact set of
samples, so we will only save the local peaks (the peaks in the query region)
for each sample.
```{r get_pk_b, eval = FALSE}
getBinaryPeaks <- function(query) {
pks <- pk_k4me3 %>%
retrievePeaks(metadata = meta, region = query) %>%
reducePeaks(gap = 300)
return(pks)
}
safeGetBinaryPeaks <- purrr::safely(getBinaryPeaks)
pk_b_k_raw <- benchmark %>%
mclapply(safeGetBinaryPeaks, mc.cores = n_cores) %>%
purrr::transpose()
pk_b_k <- pk_b_k_raw$result
# Extract the peaks from the localPeaks object
lpk_b_k <- lapply(pk_b_k, chromswitch:::lpkPeaks)
save(pk_b_k_raw, pk_b_k, lpk_b_k,
file = "computations/benchmark_peaks_b_k.RData")
```
```{r load_saved_ft, eval = TRUE, cache = FALSE}
load("computations/benchmark_features_s_k.RData")
load("computations/benchmark_peaks_b_k.RData")
```
These are the different datasets we'll construct, each with a different
number and proportion of brain & other samples.
```{r subsamp_combos, eval = TRUE}
combos <- data.frame(
n_other = c(2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 16, 16, 10, 12, 15, 16, 16),
n_brain = c(2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7, 2, 3, 5, 4, 3, 4, 5)
)
combos
lvls_combos <- c(unlist(Map(function(x, y)
paste0(x ," brain vs. ", y, " other"),
combos$n_brain, combos$n_other)))
idx_other <- which(meta$Condition == "Other")
idx_brain <- which(meta$Condition == "Brain")
```
Since in some cases it's possible that there are few enough combinations that we
can exhaustively try all of them, we will do so when there are fewer than 100
combinations. Otherwise, we will evaluate 100 randomly subsample ddatasets to for each combination.
How many possible datasets exist for each pair of `n_brain` and `n_other`?
```{r comb}
n_pos <- function(n_other, n_brain, idx_other) choose(length(idx_other), n_other) * choose(length(idx_brain), n_brain)
combos$n_pos <- unlist(Map(function(n_other, n_brain) n_pos(n_other, n_brain, idx_other),
combos$n_other, combos$n_brain))
combos
```
For the code below, the functions that carry out the subsampling experiment
are sourced from `functions/subsampling_functions.R`.
```{r subsamp_fun, cache = TRUE}
safeSubsampleFt <- purrr::safely(subsampleFromFeatures)
safeSubsamplePk <- purrr::safely(subsampleFromPeaks)
```
```{r subsamp_s, eval = FALSE}
subsamp_s_raw <- mcMap(function(n_other, n_brain)
safeSubsampleFt(ft_s_k, n_other, n_brain, idx_other),
combos$n_other, combos$n_brain,
mc.cores = n_cores) %>%
purrr::transpose()
subsamp_s <- subsamp_s_raw$result
save(subsamp_s, subsamp_s_raw, file = "computations/subsamp_s.RData")
```
```{r subsamp_b, eval = FALSE, cache = FALSE}