analysis.Rmd

---
title: 'chromswitch: A flexible method to detect chromatin state switches - *Analysis*'
author: "Selin Jessa and Claudia L. Kleinman"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    df_print: paged
    number_sections: yes
    toc: yes
    toc_float: yes
    code_folding: show
    theme: "flatly"
    highlight: "default"
  html_notebook:
    df_print: paged
    number_sections: yes
    toc: yes
    toc_float: yes
---

```{r setup, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(dev = c("png", "pdf"), # Display PNG but also keep PDF
                      fig.keep = "all",
                      fig.path = "figures/",
                      cache = TRUE,
                      cache.path = "cache/")
```

# Introduction

This document executes all the analysis presented in
*chromswitch: A flexible method for detecting chromatin state switches*, from
downloading the data, to running experiments, to generating the tables and figures
included in the paper. The dropdown in the top right corner of this HTML doument
controls whether code is shown or hidden. The document can be navigated using the menu on the left.

**About the project:**

In this paper, we present an R/Bioconductor package, `chromswitch`, which implements a flexible method to detect chromatin state switches between samples in two biological conditions in a specific genomic region of interest given peaks or chromatin state calls from ChIP-seq data. `chromswitch` is available from Bioconductor 3.6 at [https://bioconductor.org/packages/chromswitch](https://bioconductor.org/packages/chromswitch). We analyze its
performance on a benchmark dataset, study its robustness to changes in tuning parameters, sample size and imbalance, window size, and varying mark profiles; and perform a genome-wide validation.

**How this repository is set up:**

- This HTML file is generated by rendering `analysis.Rmd`, and executes
all the analysis
- `input` contains a few individual data and metadata files needed for the analysis
- `data` stores data which is downloaded during the analysis
- `figures` contains all figures produced in the analysis; the subfolder `heatmaps` stores the figures output by the tool
- `functions` stores a few long functions used (`source()`'d by) the analysis
- `computations` stores the results of computationally-intensive steps of the analysis,
which are saved in the main document after being executed once, and then loaded
for further analysis
- `tables` stores the data behind each figure as TSV files, and tables presented in the supplementary material


**How to re-run the analysis:**

To run the analysis, this R Markdown file `analysis.Rmd` needs to be [knit](https://yihui.name/knitr/). This involves executing every chunk, or piece of code, sequentially. To feasibly handle chunks with computationally-intensive code, we
typically evaluate the chunk once, and then set `eval = FALSE` for that chunk. The results are saved as an `.RData` file in the `computations` folder, and then loaded in a following chunk so that the results can be processed and visualized. To re-run the analysis, including these computationally-intensive steps, *that option needs to be changed back to `eval = TRUE` for every chunk*.

Since the analysis would take multiple days to run, we
recommend running the analysis from the command line, using the `run.sh`
bash script. If HPC resources are available, much of the work can be parallelized
by modifying that script and setting the `n_cores` variable in the next section accordingly.

**Dependencies:**

We used R 3.3.1 and [Bioconductor](http://bioconductor.org/) 3.5. The R packages required are loaded below, see the exact
versions in [Session Info](#sessioninfo). All the data needed for the analysis
is provided in the repository or downloaded in this document.

# Preliminaries

Load general packages:

```{r install_pkg, message = FALSE, warning = FALSE, cache = FALSE}
# Visualization
library(RColorBrewer)
library(ggplot2)

# Install and download custom ggplot2 theme
devtools::install_github("sjessa/ggmin")
library(ggmin) 

# Genomics
library(GenomicRanges)
library(rtracklayer)
library(Homo.sapiens)
library(biomaRt)

# General-purpose
library(pROC)
library(xtable)
library(BiocParallel)
library(parallel)
library(magrittr)
library(plyr)
library(purrr)
library(tidyverse)
library(dplyr)
library(stringi)

# Source functions
source("functions/plotting_functions.R")
source("functions/subsampling_functions.R")
source("functions/misc_functions.R")

```

Install `chromswitch` from the [GitHub repository](https://github.com/sjessa/chromswitch):
```{r install_cs, message = FALSE, cache = FALSE}
devtools::install_github("sjessa/chromswitch@v0.99.0")
library(chromswitch)
```

**Note on chromswitch version**: `chromswitch` is [now available from
Bioconductor 3.6](https://bioconductor.org/packages/release/bioc/html/chromswitch.html), but for reproducibility, the above command will download
v0.99.0 of `chromswitch`, with which the bulk of the analysis was done: https://github.com/sjessa/chromswitch/releases/tag/v0.99.0.

Set-up for parallel computations:

```{r n_cores}
# For other parallel computations
n_cores <- 1
```

`chromswitch` is implemented in a relatively modular way, with a function for
each step of the algorithm. It also provides two wrapper functions, one for
each feature construction strategy, which runs the whole analysis.
Although the wrapper functions in chromswitch can analyze multiple query regions in parallel,
we will parallelize outside the function in order to catch any errors that
might arise.

```{r safely, cache = FALSE}
safeCallSummary <- purrr::safely(chromswitch::callSummary)
safeCallBinary  <- purrr::safely(chromswitch::callBinary)
```

# Download epigenomic data

The data we analyze comes from the Roadmap Epigenomics Program. We evaluate three types of input:

1. H3K4me3 peaks
2. DNase I peaks/DNase I hypersensitive sites
3. ChromHMM segmentation learned on the core histone marks in the Roadmap analysis, filtered to regions assigned the state "Active TSS".

We will work with 23 adult tissue samples from Roadmap. 7 are brain tissues,
and 16 are various other tissues.

```{r metadata, message = FALSE}
# Read in the samples
meta <- read_tsv("input/sample_metadata.tsv")
meta
table(meta$Condition)
```

```{r noclobber}
# Download the file if it doesn't exist
noClobberDownload <- function(url, destpath) {
    if(!(file.exists(destpath))) download.file(url, destpath)
}
```

We retrieve the data from this link: [http://egg2.wustl.edu/roadmap/data/byFileType/](http://egg2.wustl.edu/roadmap/data/byFileType/).
```{r pk_k4me3, message = FALSE, eval = FALSE}

urls_k4 <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/",
                      meta$Sample, "-H3K4me3.narrowPeak.gz")
basenames_k4 <- lapply(urls_k4, basename) %>% unlist()
destpaths_k4 <- paste0("data/H3K4me3_peaks/", basenames_k4)

dl_k4 <- mcMap(noClobberDownload, urls_k4, destpaths_k4, mc.cores = n_cores)

extra_cols <- c("name" = "character", "score" = "integer", "strand" = "character",
                "signalValue" = "numeric", "pValue" = "numeric", "qValue" = "numeric",
                "peak" = "numeric")

pk_k4me3 <- lapply(destpaths_k4, rtracklayer::import, format = "bed",
                   extraCols = extra_cols)
names(pk_k4me3) <- meta$Sample

save(pk_k4me3, file = "computations/pk_k4me3.RData")
```

    
```{r pk_dnase, message = FALSE, eval = FALSE}
urls_dnase <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidatedImputed/narrowPeak/",
               meta$Sample,
               "-DNase.imputed.narrowPeak.bed.nPk.gz")
basenames_dnase <- lapply(urls_dnase, basename) %>% unlist()
destpaths_dnase <- paste0("data/DNase_peaks/", basenames_dnase)

dl_dnase <- mcMap(noClobberDownload, urls_dnase, destpaths_dnase,
                  mc.cores = n_cores)

pk_dnase <- lapply(destpaths_dnase, rtracklayer::import, format = "bed")
names(pk_dnase) <- meta$Sample

save(pk_dnase, file = "computations/pk_dnase.RData")
```

To convert the chromatin state segmentations into a format that can be used
by `chromswitch`, we will filter to regions assigned "Active_TSS", the state
associated with active transcription.

```{r read_chromhmm, eval = FALSE}

urls_st <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/",
                      meta$Sample, "_15_coreMarks_mnemonics.bed.gz")
basenames_st <- lapply(urls_st, basename) %>% unlist()
destpaths_st <- paste0("data/ChromHMM_segmentations/", basenames_st)

dl_st <- mcMap(noClobberDownload, urls_st, destpaths_st, mc.cores = 6)

extra_cols <- c("State" = "character")

chmm_states <- lapply(destpaths_st, rtracklayer::import, format = "bed",
                   extraCols = extra_cols) %>% 
    lapply(as.data.frame) %>% 
    # We will treat regions assigned the state associated with active
    # transcription as the epigenomic features in our analysis
    lapply(filter, State == "1_TssA") %>% 
    lapply(makeGRangesFromDataFrame)

names(chmm_states) <- meta$Sample

save(chmm_states, file = "computations/chmm_states.RData")
```


```{r load_epi}
load("computations/pk_k4me3.RData")
load("computations/pk_dnase.RData")
load("computations/chmm_states.RData")
```


# Benchmark `chromswitch` performance

## Prepare benchmarking dataset

We manually assembled a list of 120 genes on which to benchmark `chromswitch`
performance. 60 of these genes exhibit chromatin state switches between brain
and other tissues based on change in H3K4me3 signal (whose enrichment at promoters is associated with active transcription), DNase I hypersensitivity (associated with open chromatin), and expression (the functional consequence of chromatin changes). The other 60 are negative examples, where there was no change
across the tissues in H3K4me3, DNase I, or expression data. These genes are
specified in the file `input/benchmark.tsv`.

To define windows for these genes to use as input for our experiment, we retrieve
a list of genes and coordinates in the human genome from UCSC. This file is
saved in the `data` directory, and we obtained it using the [UCSC Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=602106695_Ii01YSgPob6wraDd1cmsVdOT4eau&clade=mammal&org=&db=hg19&hgta_group=genes&hgta_track=refSeqComposite&hgta_table=ncbiRefSeq&hgta_regionType=genome&position=&hgta_outputType=primaryTable&hgta_outFileName=), with the following selections:

* `group`: Genes and Gene Predictions
* `track`: RefSeq
* Fields: `name`, `chrom`, `strand`, `txStart`, `txEnd`, and `name2`

```{r refseq, message = FALSE}
all_genes <- read_tsv("data/refseq_genes.hg19.tsv",
                     col_names = c("transcript_id", "chromosome", "strand",
                                   "txStart", "txEnd", "gene_name"), skip = 1)

head(all_genes)
```
 
Define 5kbp windows around the TSS:
```{r bm, message = FALSE}
benchmark <- read_tsv("input/benchmark.tsv") %>% 
    left_join(all_genes, by = c("transcript_id", "gene_name")) %>% 
    # Define window around the TSS
  mutate(start = case_when(
    strand == "+" ~ as.numeric(txStart) - 2500,
    strand == "-" ~ as.numeric(txEnd) - 2500)) %>%
  mutate(end = case_when(
    strand == "+" ~ as.numeric(txStart) + 2500,
    strand == "-" ~ as.numeric(txEnd) + 2500)) %T>% 
    write_tsv("input/benchmark_coords.5kbp.tsv") %>% 
    makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>% 
    keepStandardChromosomes()
```

*Table S1*
```{r table_s1}

benchmark %>%
  as.data.frame %>%
  dplyr::select(gene_name, transcript_id, chr = seqnames, start, end, switch) %>% 
  arrange(desc(switch), gene_name) %T>% 
  write_tsv("tables/tableS2.tsv") %>% 
  xtable()

```

## Prepare genome-wide windows for validation

In order to later validate `chromswitch` genome-wide, we also clean up the gene list to define 5kbp
windows genome-wide.

```{r refseq_unique}

refseq_unique <- all_genes %>% 
      # Keep the longest transcript per gene
      mutate(length = txEnd - txStart) %>%
      group_by(gene_name) %>%
      dplyr::slice(which.max(length)) %>% 
      ungroup() %>% 
      # Filter out genes where the length of the gene is less than half the
      # window size
      filter(length >= 2500)

```

```{r get_refseq}

getRefSeq <- function(window) {
    
  d <- window / 2
    
  refseq_unique %>%
      # Define window around the TSS
      mutate(start = case_when(
          strand == "+" ~ as.numeric(txStart) - d,
          strand == "-" ~ as.numeric(txEnd) - d)) %>%
      mutate(end = case_when(
          strand == "+" ~ as.numeric(txStart) + d,
          strand == "-" ~ as.numeric(txEnd) + d)) %T>% 
      write_tsv(paste0("input/benchmark_coords.", window, ".tsv")) %>% 
      makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>% 
      keepStandardChromosomes(pruning.mode = "coarse")
}

refseq_5kbp <- getRefSeq(5000)

```

## Run `chromswitch` on benchmark dataset

Here we evaluate each feature matrix construction strategy on each type of input.

*Summary strategy applied to H3K4me3*

```{r run_bm_s_k4, eval = FALSE}

bench_s_k_raw <- mclapply(benchmark, safeCallSummary,
                       peaks = pk_k4me3,
                       metadata = meta,
                       mark = "H3K4me3",
                       filter = TRUE,
                       filter_columns = c("qValue", "signalValue"),
                       filter_thresholds = c(10, 4),
                       normalize = TRUE,
                       normalize_columns = c("qValue", "signalValue"),
                       summarize_columns = c("qValue", "signalValue"),
                       length = FALSE,
                       fraction = TRUE,
                       n = FALSE,
                       optimal_clusters = TRUE,
                       heatmap = FALSE,
                       BPPARAM = SerialParam(), mc.cores = n_cores) %>% 
  purrr::transpose()

bench_s_k <- bench_s_k_raw$result[map_lgl(bench_s_k_raw$error, is_null)] %>%
    bind_rows()

save(bench_s_k, file = "computations/bench_s_k.RData")
```

*Summary strategy applied to DNase I*

```{r run_bm_s_dnase, eval = FALSE}

bench_s_d_raw <- mclapply(benchmark, safeCallSummary,
                               peaks = pk_dnase,
                               metadata = meta,
                               mark = "DNase",
                               filter = FALSE,
                               normalize = FALSE,
                               summarize_columns = NULL,
                               length = FALSE,
                               fraction = TRUE,
                               n = FALSE,
                               optimal_clusters = TRUE,
                               heatmap = FALSE,
                               BPPARAM = SerialParam(),
                               mc.cores = n_cores) %>% 
    purrr::transpose()

bench_s_d <- bench_s_d_raw$result[map_lgl(bench_s_d_raw$error, is_null)] %>%
    bind_rows()

save(bench_s_d, file = "computations/bench_s_d.RData")
```

*Binary strategy applied to H3K4me3*

```{r run_bm_b_k4, warning = FALSE, eval = FALSE}
bench_b_k_raw <- mclapply(benchmark,
                          safeCallBinary,
                          peaks = pk_k4me3,
                          metadata = meta,
                          filter = TRUE,
                          filter_columns = c("qValue","signalValue"),
                          filter_thresholds = c(10, 4),
                          reduce = TRUE,
                          gap = 300,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          n_features = TRUE, mc.cores = n_cores) %>% 
  purrr::transpose()

bench_b_k <- bench_b_k_raw$result[map_lgl(bench_b_k_raw$error,
                                                is_null)] %>%
  bind_rows()

save(bench_b_k, file = "computations/bench_b_k.RData")
```

*Binary strategy applied to DNase I*

```{r run_bm_b_dnase, warning = FALSE, eval = FALSE}
bench_b_d_raw <- mclapply(benchmark,
                          safeCallBinary,
                          peaks = pk_dnase,
                          metadata = meta,
                          filter = FALSE,
                          reduce = TRUE,
                          gap = 300,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          n_features = TRUE, mc.cores = n_cores) %>% 
  purrr::transpose()

bench_b_d <- bench_b_d_raw$result[map_lgl(bench_b_d_raw$error, is_null)] %>%
  bind_rows()

save(bench_b_d, file = "computations/bench_b_d.RData")
```

*Summary strategy applied to ChromHMM*

```{r run_bm_s_chmm, warning = FALSE, eval = FALSE}
bench_s_c_raw <- mclapply(benchmark,
                          safeCallSummary,
                          peaks = chmm_states,
                          metadata = meta,
                          mark = "State",
                          filter = FALSE,
                          normalize = FALSE,
                          summarize_columns = NULL,
                          length = FALSE,
                          fraction = TRUE,
                          n = FALSE,
                          heatmap = FALSE,
                          optimal_clusters = TRUE,
                          BPPARAM = SerialParam(),
                          mc.cores = n_cores) %>% 
  purrr::transpose()

bench_s_c <- bench_s_c_raw$result[map_lgl(bench_s_c_raw$error, is_null)] %>%
    bind_rows()

save(bench_s_c, file = "computations/bench_s_c.RData")
```

*Binary strategy applied to ChromHMM*

```{r run_bench_b_chmm, warning = FALSE, eval = FALSE}
bench_b_c_raw <- mclapply(benchmark,
                          safeCallBinary,
                          peaks = chmm_states,
                          metadata = meta,
                          filter = FALSE,
                          reduce = FALSE,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          n_features = TRUE,
                          mc.cores = n_cores) %>% 
  purrr::transpose()

bench_b_c <- bench_b_c_raw$result[map_lgl(bench_b_c_raw$error,
                                          is_null)] %>%
    bind_rows()

save(bench_b_c, file = "computations/bench_b_c.RData")
```


```{r load_bm, cache = FALSE}
load("computations/bench_s_k.RData")
load("computations/bench_s_d.RData")
load("computations/bench_s_c.RData")
load("computations/bench_b_k.RData")
load("computations/bench_b_d.RData")
load("computations/bench_b_c.RData")
```


## Random predictor
We compare with a random predictor by drawing from
from a $Uniform(0, 1)$ distribution, repeating this 10,000 times, and then
averaging the results to assign a score. This is done for each region. It should track with the diagonal
and have an AUROC of around 0.5

```{r random_predictor, eval = FALSE}
#' @param k, the number of query regions to make a prediction for
#' @param N, the number of times to draw
predictRandom <- function(k, N) {
    
    rep(k, N) %>%
        mclapply(runif, mc.cores = 2) %>% 
        data.frame() %>% 
        rowMeans()
}

# Repeat twice, to get one set of random predictions per method
draws <- replicate(2, predictRandom(120, 10000), simplify = FALSE)
save(draws, file = "computations/draws.RData")
```

```{r load_random}
load("computations/draws.RData")

bench_s_k <- bench_s_k %>%
    add_column(Random = draws[[1]], .after = "Consensus")

bench_b_d <- bench_b_d %>%
    add_column(Random = draws[[2]], .after = "Consensus")
```

## Shuffled labels

Our method predicts chromatin state switches by computing a score of the similarity
between the known biological condition labels and inferred clusters. We can
assess how these scores look when we shuffle the class labels. We cluster samples based on the feature matrix as usual, but before calculating any of the external validity indices, we randomly shuffle the labels.

```{r shuffle}
predictShuffle <- function(clusters, N) {
  
  shuffle <- function(labels, clusters) {
    
    shuf <- sample(labels)
    contingency <- table(shuf, clusters)
    
    data.frame(
      ARI_shuffle          = mclust::adjustedRandIndex(clusters, shuf),
      NMI_shuffle          = chromswitch:::NMI(clusters, shuf),
      V_measure_shuffle    = chromswitch:::vMeasure(contingency)
    )
  }
  
  # Repeat N times
  replicate(N, shuffle(labels = meta$Condition, clust = unlist(clusters)),
            simplify = FALSE) %>%
    bind_rows() %>%
    colMeans()
}
```


```{r shuffle_s, eval = FALSE}

shuffle_s_k <- bench_s_k %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_s_d <- bench_s_d %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_s_c <- bench_s_c %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

save(shuffle_s_k, shuffle_s_d, shuffle_s_c,
     file = "computations/shuffle_s.RData")
```

```{r shuffle_b, eval = FALSE}

shuffle_b_k <- bench_b_k %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_b_d <- bench_b_d %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

shuffle_b_c <- bench_b_c %>%
    dplyr::select(matches("^E[0-9]{3}")) %>%
    t() %>%
    as.data.frame %>%
    mclapply(predictShuffle, N = 10000, mc.cores = n_cores) %>% 
    as.data.frame() %>% 
    t() %>% 
    as.data.frame()

save(shuffle_b_c, shuffle_b_d, shuffle_b_k,
     file = "computations/shuffle_b.RData")

```

```{r load_shuf}
load("computations/shuffle_s.RData")
load("computations/shuffle_b.RData")

addShuf <- function(df) {
  df %>% 
    mutate(Consensus_shuffle =
               rowMeans(dplyr::select(., ARI_shuffle, NMI_shuffle, V_measure_shuffle)))
}

bench_s_k <- bind_cols(bench_s_k, shuffle_s_k) %>% addShuf()
bench_s_d <- bind_cols(bench_s_d, shuffle_s_d) %>% addShuf()
bench_s_c <- bind_cols(bench_s_c, shuffle_s_c) %>% addShuf()

bench_b_k <- bind_cols(bench_b_k, shuffle_b_k) %>% addShuf()
bench_b_d <- bind_cols(bench_b_d, shuffle_b_d) %>% addShuf()
bench_b_c <- bind_cols(bench_b_c, shuffle_b_c) %>% addShuf()
```

## ROC curves

The problem of a detecting a chromatin state switch in a region can be reframed as the problem of classifying the region as containing a switch between conditions or not. Therefore, our method can be regarded as a binary classifier, and evaluated accordingly. We evaluate the performance of our method on the benchmark set by computing Receiver-Operating Characteristic (ROC) curves based on the consensus score computed by `chromswitch`. ROC curves measure the tradeoff between the true positive rate and the false positive rate as the threshold on the consensus score is varied, as well as the area under the ROC curve (AUROC) which takes on a value of 1 for a perfect classifier and can be interpreted as the probability that `chromswitch` will score a randomly chosen region corresponding to a chromatin state switch higher than a randomly chosen region which does not exhibit a switch.

```{r calcROC = FALSE}
calcROC <- function(scores, method, input, smooth) {
    
    # Compute ROC info using each score as a predictor
    roc_out <- scores %>% 
        dplyr::select(
        matches("Consensus|Random")) %>% 
        map(~ pROC::roc(response = scores$switch,
                       predictor = as.numeric(.),
                       smooth = smooth))
    
    # Extract values to plot ROC curves
    roc_val <- roc_out %>%
        map(`[`, c("sensitivities", "specificities")) %>%
        map(as.data.frame) %>%
        map2_df(names(roc_out), function(df, index) mutate(df, index = index)) %>%
        mutate(TPR = sensitivities,
               FPR = 1 - specificities,
               Method = method,
               Input = input)

    return(roc_val)
}
```

```{r benchmark_roc}
benchmark_roc <- bind_rows(
  calcROC(bench_s_k, "Summary", "H3K4me3",  smooth = TRUE),
  calcROC(bench_s_d, "Summary", "DNase I",  smooth = TRUE),
  calcROC(bench_s_c, "Summary", "ChromHMM", smooth = TRUE),
  calcROC(bench_b_k, "Binary",  "H3K4me3",  smooth = TRUE),
  calcROC(bench_b_d, "Binary",  "DNase I",  smooth = TRUE),
  calcROC(bench_b_c, "Binary",  "ChromHMM", smooth = TRUE)) %>% 
  filter(grepl("Consensus|Random", index)) %>%
  filter(index != "Random" |
          (index == "Random") & (Method == "Binary") & (Input == "DNase I")) %>% 
  mutate(Input = ifelse(index == "Random", "NA", Input)) %T>% 
  write_tsv("tables/benchmark_roc.tsv")

roc_gg <- benchmark_roc %>% 
  makeLabel()

```

*Figure 1b*
```{r fig1b, fig.width = 3.7, fig.height = 4.7}

roc_gg %>%
    filter(Input != "DNase I") %>%
    filter(index != "Consensus_shuffle") %>%
    ggplot(aes(x = FPR, y = TPR, colour = Label, linetype = Label)) +
    geom_line(size = 1) +
    scale_colour_manual(name = "Strategy", values = pal_strategy) +
    scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
    theme_min() +
    xlab("False Positive Rate") + ylab("True Positive Rate") +
    theme(legend.position = "bottom") +
    labs(colour = "") + labs(linetype = "") +
    guides(colour = guide_legend(ncol = 1))
```

*Figure S2*
```{r figS2, fig.width = 7, fig.height = 5.2}

roc_gg %>% 
    filter(index != "Random") %>% 
    ggplot(aes(x = FPR, y = TPR, colour = Label, linetype = Label)) +
    geom_line(size = 1) +
    scale_colour_manual(name = "Strategy", values = pal_strategy) +
    scale_linetype_manual(name = "Strategy", values = linetype_strategy2) +
    theme_min() +
    # theme(legend.position = "bottom") +
    xlab("False Positive Rate") + ylab("True Positive Rate") +
    theme(legend.position = "bottom") +
    facet_wrap(~ Input) +
    labs(colour = "") + labs(linetype = "") +
    guides(colour = guide_legend(ncol = 1))

```


## AUROC

```{r auroc, warning = FALSE}

calcAUC <- function(scores, method, input) {

    scores %>%
        dplyr::select(
        matches("Consensus|Random")) %>%
        map(~pROC::roc(response = scores$switch, predictor = as.numeric(.))) %>%
        map_df("auc") %>%
        mutate(Method = method,
               Input = input)
}

benchmark_auc <- bind_rows(
  calcAUC(bench_s_k, "Summary", "H3K4me3"),
  calcAUC(bench_s_d, "Summary", "DNase I"),
  calcAUC(bench_s_c, "Summary", "ChromHMM"),
  calcAUC(bench_b_k, "Binary", "H3K4me3"),
  calcAUC(bench_b_d, "Binary", "DNase I"),
  calcAUC(bench_b_c, "Binary", "ChromHMM")) %>% 
  select(Method, Input, Consensus, Consensus_shuffle, Random)

benchmark_auc
```

*Figure S2*
```{r auc_latex}
# A little manipulation to make the table for the supplementary

benchmark_auc %>%
  gather(Stat, Score, Consensus, Consensus_shuffle) %>%
  rowwise() %>%
  mutate(Method = ifelse(grepl("shuffle", Stat),
                         paste0(Method, " (shuffled)"), Method)) %>%
  ungroup() %>% 
  dplyr::select(Method, Input, Score) %>%
  spread(Input, Score) %T>%
  write_tsv("tables/benchmark_auc.tsv") %>% 
  xtable() # Generate latex code
```

# Robustness experiments

## Tuning parameters

There are two parameters used in the feature construction process for the binary
approach, `gap` which controls whether nearby peaks within samples are connected, and `p` which controls how
much overlap is required to call two peaks the same. To assess the impact
of these parameters, we'll perform a grid search
on various combinations.

Since we want to investigate the effect of varying the tuning parameters,
we select the best threshold on the *Consensus* score on this benchmarking
dataset:
```{r best_threshold, cache = FALSE}

getBest <- function(bench_out) {
    
    best_consensus <- pROC::roc(predictor = bench_out$Consensus,
                            response = bench_out$switch) %>%
    pROC::coords("best",
                 ret = c("threshold", "sensitivity",
                         "specificity", "accuracy", "ppv",
                 as.list = TRUE, drop = TRUE))
    
    best_threshold <- best_consensus["threshold"]
    return(best_threshold)
}

best_k <- getBest(bench_b_k)
best_k
```


```{r tuning_grid, eval = FALSE}
grid <- expand.grid(gap = seq(200, 500, by = 50),
                    p_overlap = seq(0.2, 0.8, by = 0.1))

evaluateParams <- function(gap, p, best_threshold, ...) {
    
    out <- benchmark %>% 
        mclapply(safeCallBinary, gap = gap, p = p, ...) %>% 
        purrr::transpose()
    
    ok <- map_lgl(out$error, is_null)
    out <- out$result[ok] %>%
        bind_rows() %>% 
        mutate(pred = ifelse(Consensus >= best_threshold, 1, 0))
    
    # Now we can tally the results by comparing the predictions
    # to the known classes (switch or not)
    stats <- table(benchmark$switch[ok], out$pred)
    
    tn = fn = tp = fp = 0
    try(tn <- stats["0", "0"], silent = TRUE)
    try(fn <- stats["1", "0"], silent = TRUE)
    try(tp <- stats["1", "1"], silent = TRUE)
    try(fp <- stats["0", "1"], silent = TRUE)
    
    df <- data.frame(gap, p, tp, tn, fp, fn)
    return(list(predictions = out$pred, stats = df))
} 

safeEvalParams <- purrr::safely(evaluateParams)
grid_out <- Map(function(gap, p) safeEvalParams(gap, p,
                                                   best_threshold = best_k,
                                                   peaks = pk_k4me3,
                                                   metadata = meta,
                                                   filter = TRUE,
                                                   filter_columns = c("qValue", "signalValue"),
                                                   filter_thresholds = c(10, 4),
                                                   reduce = TRUE,
                                                   heatmap = FALSE,
                                                   BPPARAM = SerialParam(),
                                                   optimal_clusters = TRUE,
                                                   mc.cores = n_cores),
                   grid$gap, grid$p_overlap) %>%
    purrr::transpose()

save(grid_out, file = "computations/grid.RData")
```

*Figure S4*
```{r figS4_grid_k4me3, fig.width = 9, fig.height = 2.9, fig.keep = "last", fig.show = "last"}
load("computations/grid.RData")
makeGrid(grid_out, "H3K4me3")

load("tables/grid_scores.H3K4me3.RData")
plotGrid(grid_scores)
```


## Sample size and class imbalance

Next, we investigate how sensitive or robust `chromswitch` is to small numbers
of samples and/or to datasets with high class imbalance (*i.e.* when samples
in one condition greatly outnumber samples in the other). To do so, we will
subsample from the samples in the full 23-tissue dataset to create many smaller
datasets with fewer samples and greater class imbalance, and evaluate how
`chromswitch` performs in classifying the benchmark regions using these smaller
datasets.

For the summary feature matrix construction strategy, the computation of the feature vector for each sample is independent of all other samples, so we can save the feature matrices and subsample rows from the matrices. Here
we compute and save the feature matrix for each region:

```{r get_ft_s, eval = FALSE}

getSummaryFeatures <- function(query) {
  
  cols <- c("qValue", "signalValue")
  
  ft_mat <- pk_k4me3 %>% 
    filterPeaks(columns = cols,
                thresholds = c(10, 4)) %>% 
    normalizePeaks(columns = cols) %>% 
    retrievePeaks(metadata = meta, region = query) %>% 
    summarizePeaks(mark = "H3K4me3", cols = cols,
                   length = FALSE, fraction = TRUE, n = FALSE)
  
  return(ft_mat)
  
}

safeGetFeatures <- purrr::safely(getSummaryFeatures)

ft_s_k_raw <- benchmark %>% 
  mclapply(safeGetFeatures, mc.cores = n_cores) %>% 
  purrr::transpose()

ft_s_k <- ft_s_k_raw$result
save(ft_s_k, ft_s_k_raw,
     file = "computations/benchmark_features_s_k.RData")
```

For the binary strategy, the feature matrices are derived from the union of
peaks supplied across samples, and are therefore dependent on the exact set of
samples, so we will only save the local peaks (the peaks in the query region)
for each sample.
```{r get_pk_b, eval = FALSE}

getBinaryPeaks <- function(query) {
  
  pks <- pk_k4me3 %>% 
    retrievePeaks(metadata = meta, region = query) %>% 
    reducePeaks(gap = 300)
  
  return(pks)
  
}

safeGetBinaryPeaks <- purrr::safely(getBinaryPeaks)

pk_b_k_raw <- benchmark %>% 
  mclapply(safeGetBinaryPeaks, mc.cores = n_cores) %>% 
  purrr::transpose()
  
pk_b_k  <- pk_b_k_raw$result

# Extract the peaks from the localPeaks object
lpk_b_k <- lapply(pk_b_k, chromswitch:::lpkPeaks)

save(pk_b_k_raw, pk_b_k, lpk_b_k,
     file = "computations/benchmark_peaks_b_k.RData")
```

```{r load_saved_ft, eval = TRUE, cache = FALSE}
load("computations/benchmark_features_s_k.RData")
load("computations/benchmark_peaks_b_k.RData")
```

These are the different datasets we'll construct, each with a different
number and proportion of brain & other samples.
```{r subsamp_combos, eval = TRUE}

combos <- data.frame(
    n_other = c(2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 16, 16, 10, 12, 15, 16, 16),
    n_brain = c(2, 3, 4, 5, 6, 7, 7,  7,  7,  7,  7,  2,  3,  5,  4,  3,  4,  5)
)

combos

lvls_combos <- c(unlist(Map(function(x, y)
    paste0(x ," brain vs. ", y, " other"),
    combos$n_brain, combos$n_other)))

idx_other <- which(meta$Condition == "Other")
idx_brain <- which(meta$Condition == "Brain")

```


Since in some cases it's possible that there are few enough combinations that we
can exhaustively try all of them, we will do so when there are fewer than 100
combinations. Otherwise, we will evaluate 100 randomly subsample ddatasets to for each combination.

How many possible datasets exist for each pair of `n_brain` and `n_other`?

```{r comb}
n_pos <- function(n_other, n_brain, idx_other) choose(length(idx_other), n_other) * choose(length(idx_brain), n_brain)

combos$n_pos <- unlist(Map(function(n_other, n_brain) n_pos(n_other, n_brain, idx_other),
                            combos$n_other, combos$n_brain))
combos
```
For the code below, the functions that carry out the subsampling experiment
are sourced from `functions/subsampling_functions.R`.

```{r subsamp_fun, cache = TRUE}

safeSubsampleFt <- purrr::safely(subsampleFromFeatures)
safeSubsamplePk <- purrr::safely(subsampleFromPeaks)

```


```{r subsamp_s, eval = FALSE}

subsamp_s_raw <- mcMap(function(n_other, n_brain)
    safeSubsampleFt(ft_s_k, n_other, n_brain, idx_other),
    combos$n_other, combos$n_brain,
    mc.cores = n_cores) %>%
    purrr::transpose()

subsamp_s <- subsamp_s_raw$result
save(subsamp_s, subsamp_s_raw, file = "computations/subsamp_s.RData")
```


```{r subsamp_b, eval = FALSE, cache = FALSE}
subsamp_b_raw <- mcMap(function(n_other, n_brain)
  safeSubsamplePk(pk_b_k, n_other, n_brain, idx_other),
    combos$n_other, combos$n_brain,
    mc.cores = n_cores) %>%
    purrr::transpose()

subsamp_b <- subsamp_b_raw$result
save(subsamp_b, subsamp_b_raw, file = "computations/subsamp_b.RData")
```

Now we can process the results, and calculate mean AUC
scores for each combination so we can quantify performance.
```{r prep_subsamp}

load("computations/subsamp_s.RData")
load("computations/subsamp_b.RData")

auc_s <- subsamp_s %>% map_df("auc") %>% 
  mutate(Method = "Summary", Input = "H3K4me3")

auc_b <- subsamp_b %>% map("scores") %>% 
  map_df("auc") %>% 
  mutate(Method = "Binary", Input = "H3K4me3")

# Simplify the facet labels
lvls_combos_new <- c(unlist(Map(function(x, y)
    paste0(x ," vs. ", y),
    combos$n_brain, combos$n_other)))
lvls_combos_new[1] <- "2 brain vs. 2 other"

# Calculate the mean AUC for each combination of n_brain, n_other
mean_auc <- bind_rows(auc_s, auc_b) %>% 
    rowwise() %>% 
    mutate(dataset = ifelse(n_brain == 2 && n_other == 2,
                            "2 brain vs. 2 other",
                            paste0(n_brain, " vs. ", n_other))) %>%
    ungroup() %>% 
    mutate(dataset = factor(dataset, levels = lvls_combos_new)) %>%
    dplyr::select(dataset, iteration, AUC = Consensus, Method, Input) %>%
    group_by(dataset, Method, Input) %>%
    summarise(mean_AUC = paste0("AUC: ", round(mean(AUC), 2))) %T>%
    write_tsv("tables/subsampling_auc.tsv")

roc_s <- subsamp_s %>% map_df("roc") %>%
  mutate(Method = "Summary", Input = "H3K4me3")

roc_b <- subsamp_b %>% map("scores") %>% 
  map_df("roc") %>% 
  mutate(Method = "Binary", Input = "H3K4me3")

roc_df <- bind_rows(roc_s, roc_b) %>% 
  rowwise() %>% 
  mutate(dataset = ifelse(n_brain == 2 && n_other == 2,
                          "2 brain vs. 2 other",
                          paste0(n_brain, " vs. ", n_other))) %>%
  ungroup() %>%
  mutate(dataset = factor(dataset, levels = lvls_combos_new)) %>%
  filter(index == "Consensus") %>% 
  dplyr::select(-index)

# Ensures that the iterations are plot in an order that looks nice
# with the colours of the lines
roc_df$iteration <- factor(roc_df$iteration,
                           levels = sort(unique(roc_df$iteration),
                                         decreasing = TRUE))
```

In the figures below, the ROC curve for every individual subsample dataset is
plot in a shade of gray, while the mean across iterations, for each combination,
is plot in red (summary strategy) or blue (binary strategy).

*Figure S3a*
```{r figS3a_subsamp_summary, fig.width = 8, fig.height = 5}
roc_df %T>%
    write_tsv("tables/subsampling_roc.tsv") %>% 
    filter(Method == "Summary" & Input == "H3K4me3") %>%
    plotSubsamp("red", "Summary", "H3K4me3")
```

*Figure S3b*
```{r figS3b_subsamp_binary, fig.width = 8, fig.height = 5, warn = FALSE}
roc_df %>%
    filter(Method == "Binary" & Input == "H3K4me3") %>%
    plotSubsamp("blue", "Binary", "H3K4me3")
```

## Window size

Since the `chromswitch` algorithm depends on the data in the genomic query window,
which is selected by the user,
we next considered the effect of variations in the window size. We studied how the
consensus score changed as a function of the window size for three genes in our
benchmark dataset, using H3K4me3 peaks.

```{r windows}

window_genes <- c("GRIK2", "CHRM1", "TTYH1")

windows <- refseq_5kbp %>%
	as.data.frame %>%
	filter(gene_name %in% window_genes) %>%
	select(-start, -end)

get_window <- function(window_size, genes) {

	d <- window_size / 2

	genes %>%
        # Define window around the TSS
        mutate(start = case_when(
            strand == "+" ~ as.numeric(txStart) - d,
            strand == "-" ~ as.numeric(txEnd) - d)) %>%
        mutate(end = case_when(
            strand == "+" ~ as.numeric(txStart) + d,
            strand == "-" ~ as.numeric(txEnd) + d)) %>%
   	     mutate(gene_name = paste0(gene_name, "_", window_size/1000, "kbp"))
}

# Window sizes to test, in base pairs
sizes <- c(seq(1000, 10000, by = 1000),
           seq(10000, 50000, by = 10000),
           100000, 250000, 500000)

windows_gr <- lapply(sizes, get_window, windows) %>% bind_rows() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

```


```{r windows_s, eval = FALSE, cache = FALSE}

win_s <- callSummary(windows_gr,
         peaks = pk_k4me3,
         metadata = meta,
         mark = "H3K4me3",
         filter = TRUE,
         filter_columns = c("qValue", "signalValue"),
         filter_thresholds = c(10, 4),
         normalize = TRUE,
         normalize_columns = c("qValue", "signalValue"),
         summarize_columns = c("qValue", "signalValue"),
         length = FALSE,
         fraction = TRUE,
         n = FALSE,
         optimal_clusters = TRUE,
         heatmap = TRUE,
         titles = paste0(windows_gr$gene_name, "_summary"),
         outdir = "figures/heatmaps")

win_s

save(win_s, file = "computations/windowsize_s.RData")

```


```{r windows_b, eval = FALSE, cache = FALSE}

win_b <- callBinary(query = win_to_test,
                          peaks = pk_k4me3,
                          metadata = meta,
                          filter = TRUE,
                          filter_columns = c("qValue","signalValue"),
                          filter_thresholds = c(10, 4),
                          reduce = TRUE,
                          gap = 300,
                          p = 0.4,
                          optimal_clusters = TRUE,
                          heatmap = TRUE,
                          titles = paste0(windows_gr$gene_name, "_binary"),
                          outdir = "figures/heatmaps",
                          n_features = TRUE)

win_b

save(win_b, file = "computations/windowsize_b.RData")

```

*Figure S5*
```{r figS5_windowsize, fig.width = 6, fig.height = 5}

load("computations/windowsize_s.RData")
load("computations/windowsize_b.RData")

windows <- bind_rows(mutate(win_s, method = "Summary"), mutate(win_b, method = "Binary")) %>% 
  select(gene_name, Consensus, method) %>% 
  separate(gene_name, into = c("gene_name", "window_size"), sep = "_") %>% 
  separate(window_size, into = c("window_size", "drop"), sep = "k") %>% 
  select(-drop) %>% 
  mutate(window_size = as.numeric(window_size)) %T>%
  write_tsv("tables/window_size.tsv")

# Plot TTYH1, "zoomed" in in the 0-50kbp range and in the 0-500kbp range
ttyh1_short <- windows %>% 
  filter(gene_name == "TTYH1") %>% 
  filter(window_size <= 50) %>% 
  mutate(range = "50kbp")

ttyh1_long <- windows %>% 
  filter(gene_name == "TTYH1") %>% 
  filter(window_size <= 500) %>% 
  mutate(range = "500kbp")

bind_rows(ttyh1_short, ttyh1_long) %>% 
  mutate(range = factor(range, levels = c("50kbp", "500kbp"))) %>% 
  ggplot(aes(x = window_size, y = Consensus)) +
  geom_point(aes(colour = method)) +
  geom_line(aes(colour = method), alpha = 0.8, size = 0.5) +
  scale_colour_manual(name = "Method", values = pal_methods) +
  xlab("Window Size (kbp)") + ylab("Consensus score") +
  facet_wrap(~ range, ncol = 1, scales = "free_x") +
  ggmin::theme_min() +
  theme(legend.position = c(0.9, 0.33)) +
  theme(strip.background = element_blank(), strip.text.x = element_blank())

```

## Analysis of a dispersed mark

This package was initially designed to facilitate the task of identifying epigenetic changes for challenging datasets with abnormal patterns of enrichment of certain histone modifications, like H3K27me3. To analyze how `chromswitch` performs for a mark with a broad profile of enrichment, we will investigate a few regions where we expect a difference in K27me3 enrichment between human embryonic stem cells (ESC) and differentiated adult tissues.

### Download data

We use ESC samples from the Roadmap Epigenomics Project, compared to the same adult tissues we used above.
```{r esc_meta, warning = FALSE}
esc_meta <- meta %>%
  filter(Condition == "Other") %>%
  mutate(Condition = "Adult") %>%
  bind_rows(read_tsv("input/esc_metadata.tsv"))

esc_meta

k27m_regions <- c(GRanges("chr7:27122134-27298855"), # HOX
                  GRanges("chr16:54125741-57125741"), # IRX3 3 Mbp region
                  GRanges("chr16:54313663-54328454")) # IRX3

mcols(k27m_regions)$id <- c("HOX", "IRX3_3Mbp", "IRX3")
```

Download narrow peaks for H3K27me3:

```{r pk_k27me3, message = FALSE, eval = TRUE}

urls_k27 <- paste0("http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/",
                      esc_meta$Sample, "-H3K27me3.narrowPeak.gz")
basenames_k27 <- lapply(urls_k27, basename) %>% unlist()
destpaths_k27 <- paste0("data/H3K27me3_peaks/", basenames_k27)

dl_k27 <- mcMap(noClobberDownload, urls_k27, destpaths_k27, mc.cores = n_cores)

extra_cols <- c("name" = "character", "score" = "integer", "strand" = "character",
                "signalValue" = "numeric", "pValue" = "numeric", "qValue" = "numeric",
                "peak" = "numeric")

pk_k27me3 <- lapply(destpaths_k27, rtracklayer::import, format = "bed",
                   extraCols = extra_cols)
names(pk_k27me3) <- esc_meta$Sample

save(pk_k27me3, file = "computations/pk_k27me3.RData")
```

### Run `chromswitch` using H3K27me3 peaks


```{r k27me3_s, eval = FALSE}

k27m_s <- callSummary(k27m_regions,
            peaks = pk_k27me3,
            metadata = esc_meta,
            mark = "H3K27me3",
            filter = TRUE,
            filter_columns = c("qValue", "signalValue"),
            filter_thresholds = c(10, 10),
            normalize = TRUE,
            normalize_columns = c("qValue", "signalValue"),
            summarize_columns = c("qValue", "signalValue"),
            length = FALSE,
            fraction = TRUE,
            n = FALSE,
            optimal_clusters = TRUE,
            heatmap = TRUE,
            outdir = "figures/heatmaps",
            titles = paste0(k27m_regions$id, "_summary"))

save(k27m_s, file = "computations/h3k27me3_summary.RData")

```

```{r k27me3_b, eval = FALSE}

k27m_b <- callBinary(query = k27m_regions,
           peaks = pk_k27me3,
           metadata = esc_meta,
           filter = TRUE,
           filter_columns = c("qValue","signalValue"),
           filter_thresholds = c(10, 10),
           reduce = TRUE,
           gap = 300,
           p = 0.4,
           optimal_clusters = TRUE,
           heatmap = TRUE,
           n_features = TRUE,
           outdir = "figures/heatmaps",
           titles = paste0(k27m_regions$id, "_binary"))

save(k27m_b, file = "computations/h3k27me3_binary.RData")

```

Presented in *Figures S6-8*
```{r k27m_results}

load("computations/h3k27me3_summary.RData")
load("computations/h3k27me3_binary.RData")

# By the summary strategy
k27m_s %>% 
  mutate(Method = "Summary") %>% 
  select(region, id, k, Average_Silhouette, Consensus, Method)

# By the binary strategy
k27m_b %>% 
  mutate(Method = "Binary") %>% 
  select(region, id, k, Average_Silhouette, Consensus, Method)

```

# Genome-wide validation

To validate `chromswitch`, we will evaluate whether it's capable of detecting chromatin state switches
that result in brain-specific expression. We start by running `chromswitch`
genome-wide to 5kb windows around TSS. `Chromswitch` makes a simple assessment of how chromatin states
inferred from the data associate with the biological conditions of the samples,
but we will need a more refined set of candidates to reasonably expect an association with
gene expression data. Therefore, as candidates for validation, we select genes where the `chromswitch` finds an open chromatin state in brain tissues
and closed chromatin in other tissues, evaluate whether these correspond to
brain-specific expression.

## Prepare GTEx data

We downloaded data from the [Genotype-Tissue Expression Project](https://gtexportal.org/) database. The file `data/GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_median_rpkm.gct` was downloaded from the *RNA-seq* section at this link: [https://gtexportal.org/home/datasets](https://gtexportal.org/home/datasets) and
contains the mean RPKM per tissue type per gene.

First we assign more concise names to the tissues and filter to a set which is
comparable to our epigenomic dataset.

```{r read_gtex}

keep_tissue <- c("Amygdala", "Anterior cingulate cortex", "Basal ganglia", 
                 "Cerebella hemisphere", "Cerebellum", "Cortex",
                 "Frontal cortex",
                 "Hippocampus", "Hypothalamus", "Substantia nigra", "Aorta",
                 "Sigmoid colon", "Transverse colon", "Esophagus",
                 "Heart (atrial appendage)", "Heart (left ventricle)",
                 "Kidney", "Liver", "Lung", "Skeletal muscle",
                 "Pancreas", "Small intestine", "Stomach")

brain <-  c("Amygdala", "Anterior cingulate cortex", "Basal ganglia", 
            "Cerebella hemisphere", "Cerebellum", "Cortex", "Frontal cortex",
            "Hippocampus", "Hypothalamus", "Substantia nigra")

other <- keep_tissue[!(keep_tissue %in% brain)]

gtex_rpkm <- read_tsv(
      "data/GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_median_rpkm.gct",
      skip = 2) %>%
    dplyr::rename(gene_name = Description) %>%
    unite(col = "unite", sep = "=", Name, gene_name) %>%
    gather(key = "Tissue", value = "mean_RPKM", 2:length(.)) %>% 
    mutate(Tissue = recode(.x = Tissue,
        `Brain - Amygdala` = "Amygdala",
        `Brain - Anterior cingulate cortex (BA24)` = "Anterior cingulate cortex",
        `Brain - Caudate (basal ganglia)` = "Basal ganglia",
        `Brain - Cerebellar Hemisphere` = "Cerebellar hemisphere",
        `Brain - Cerebellum` = "Cerebellum",
        `Brain - Cortex` = "Cortex",
        `Brain - Frontal Cortex (BA9)` = "Frontal cortex",
        `Brain - Hippocampus` = "Hippocampus",
        `Brain - Hypothalamus` = "Hypothalamus",
        `Brain - Nucleus accumbens (basal ganglia)` = "Nucleus accumbens",
        `Brain - Putamen (basal ganglia)` = "Putamen",
        `Brain - Substantia nigra` = "Substantia nigra",
        `Adipose - Subcutaneous` = "Subcutaneous adipose",
        `Adipose - Visceral (Omentum)` = "Omentum",
        `Artery - Aorta` = "Aorta",
        `Artery - Coronary` = "Coronary artery",
        `Artery - Tibial` = "Tibial artery",
        `Bladder` = "Bladder",
        `Breast - Mammary Tissue` = "Breast",
        `Cervix - Ectocervix` = "Ectocervix",
        `Cervix - Endocervix` = "Endocervix", 
        `Colon - Sigmoid` = "Sigmoid colon",
        `Colon - Transverse` = "Transverse colon",
        `Esophagus - Gastroesophageal Junction` = "Esophagus (GEJ)",
        `Esophagus - Mucosa` = "Esophageal mucosa",
        `Esophagus - Muscularis` = "Esophagus (Muscularis)",
        `Fallopian Tube` = "Fallopian tube",
        `Heart - Atrial Appendage` = "Heart (atrial appendage)",
        `Heart - Left Ventricle` = "Heart (left ventricle)",
        `Kidney - Cortex` = "Kidney",
        `Liver` = "Liver",
        `Lung` = "Lung",
        `Muscle - Skeletal` = "Skeletal muscle",
        `Pancreas` = "Pancreas",
        `Prostate` = "Prostate",
        `Skin - Nut Sun Exposed (Suprapubic)` = "Skin",
        `Small Intestine - Terminal Ileum` = "Small intestine",
        `Spleen` = "Spleen",
        `Stomach` = "Stomach",
        `Thyroid` = "Thyroid",
        `Uterus` = "Uterus",
        `Vagina` = "Vagina", .default = "drop")) %>% 
    filter(Tissue %in% keep_tissue) %>% 
    spread(Tissue, mean_RPKM) %>% 
    separate(unite, into = c("Name", "gene_name"), sep = "=")

```

Calculate the log2 fold change of expression between brain tissues and other
tissues and the dataset.
```{r gtex_fc}
gtex_fc <- gtex_rpkm

# For each gene, calculate the median RPKM across the tissues in each group
gtex_fc$median_brain_RPKM <- apply(select_(gtex_rpkm, brain), 1, median)
gtex_fc$median_other_RPKM <- apply(select_(gtex_rpkm, other), 1, median)

# Calculate log2 FC with a pseudocount
gtex_fc <- gtex_fc %>%
    mutate(FC     = (median_brain_RPKM + 1) / (median_other_RPKM + 1),
           log_FC = log(FC, base = 2))
```

## Run `chromswitch` genome wide

### Summary strategy

We evaluate the summary strategy on each of the three input types.

H3K4me3: 
```{r val_s_k, eval = FALSE}

val_s_k_raw <- refseq_5kbp %>% 
  mclapply(safeCallSummary,
           peaks = pk_k4me3,
           metadata = meta,
           mark = "H3K4me3",
           filter = TRUE,
           filter_columns = c("qValue", "signalValue"),
           filter_thresholds = c(10, 4),
           normalize = TRUE,
           normalize_columns = c("qValue", "signalValue"),
           summarize_columns = c("qValue", "signalValue"),
           length = FALSE,
           fraction = TRUE,
           n = FALSE,
           heatmap = FALSE,
           estimate_state = TRUE,
           signal_col = "signalValue",
           test_condition = "Brain", 
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_s_k <- val_s_k_raw$result[map_lgl(val_s_k_raw$error, is_null)] %>% 
  bind_rows()

save(val_s_k, val_s_k_raw, file = "computations/validation_s_k.RData")
```

ChromHMM:
```{r val_s_c, warning = FALSE, eval = FALSE}
val_s_c_raw <- refseq_5kbp %>%
    mclapply(safeCallSummary,
             peaks = chmm_states,
             metadata = meta,
             mark = "State",
             filter = FALSE,
             normalize = FALSE,
             summarize_columns = NULL,
             length = FALSE,
             fraction = TRUE,
             n = FALSE,
             heatmap = FALSE,
             optimal_clusters = TRUE,
             estimate_state = TRUE,
             signal_col = "fraction",
             test_condition = "Brain",
             BPPARAM = SerialParam(),
             mc.cores = n_cores) %>% 
    purrr::transpose()

val_s_c <- val_s_c_raw$result[map_lgl(val_s_c_raw$error, is_null)] %>% 
    bind_rows()

save(val_s_c_raw, val_s_c, file = "computations/validation_s_c.RData")
```

DNase I:
```{r val_s_d, eval = FALSE}

val_s_d_raw <- refseq_5kbp %>%
    mclapply(safeCallSummary,
             peaks = pk_dnase,
             metadata = meta,
             mark = "DNase",
             filter = FALSE,
             normalize = FALSE,
             summarize_columns = NULL,
             length = FALSE,
             fraction = TRUE,
             n = FALSE,
             heatmap = FALSE,
             estimate_state = TRUE,
             signal_col = "fraction",
             test_condition = "Brain",
             optimal_clusters = TRUE,
             BPPARAM = SerialParam(), # Parallelize outside
             mc.cores = n_cores) %>% 
    purrr::transpose() 

val_s_d <- val_s_d_raw$result[map_lgl(val_s_d_raw$error, is_null)] %>%
    bind_rows()

save(val_s_d_raw, val_s_d, file = "computations/validation_s_d.RData")
```

### Binary strategy

H3K4me3:

```{r val_b_h, warning = FALSE, eval = FALSE}

val_b_h_raw <- refseq_5kbp %>% 
  mclapply(safeCallBinary,
           peaks = pk_k4me3,
           metadata = meta,
           filter = TRUE,
           filter_columns = c("qValue", "signalValue"),
           filter_thresholds = c(10, 4),
           reduce = TRUE,
           gap = 300,
           p = 0.4,
           heatmap = FALSE,
           n_features = TRUE,
           estimate_state = TRUE,
           test_condition = "Brain",
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_b_h <- val_b_h_raw$result[map_lgl(val_b_h_raw$error, is_null)] %>% 
  bind_rows()

save(val_b_h, val_b_h_raw, file = "computations/validation_b_h.RData")
```


DNase I:

```{r val_b_d, warning = FALSE, eval = FALSE}

val_b_d_raw <- refseq_5kbp[1] %>% 
  mclapply(safeCallBinary,
           peaks = pk_dnase,
           metadata = meta,
           filter = FALSE,
           reduce = TRUE,
           gap = 300,
           p = 0.4,
           heatmap = FALSE,
           n_features = TRUE,
           estimate_state = TRUE,
           test_condition = "Brain",
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_b_d <- val_b_d_raw$result[map_lgl(val_b_d_raw$error, is_null)] %>% 
  bind_rows()

save(val_b_d, val_b_d_raw, file = "computations/validation_b_d.RData")
```

ChromHMM:

```{r val_b_c, warning = FALSE, eval = FALSE}

val_b_c_raw <- refseq_5kbp %>% 
  mclapply(safeCallBinary,
           peaks = chmm_states,
           metadata = meta,
           filter = FALSE,
           reduce = FALSE,
           gap = 300,
           p = 0.4,
           heatmap = FALSE,
           n_features = TRUE,
           estimate_state = TRUE,
           test_condition = "Brain",
           optimal_clusters = TRUE,
           BPPARAM = SerialParam(),
           mc.cores = n_cores) %>% 
  purrr::transpose()

val_b_c <- val_b_c_raw$result[map_lgl(val_b_c_raw$error, is_null)] %>% 
  bind_rows()

save(val_b_c, val_b_c_raw, file = "computations/validation_b_c.RData")
```

```{r load_val, cache = FALSE}
load("computations/validation_s_k.RData")
load("computations/validation_s_d.RData")
load("computations/validation_s_c.RData")
load("computations/validation_b_h.RData")
val_b_k <- val_b_h # Rename this object for consistency
load("computations/validation_b_d.RData")
load("computations/validation_b_c.RData")
```

## Evaluate results

First, we will plot the mean RPKM across candidate genes and assess how
the distributions differ betwen brain and other tissues.

Define a function to get the candidate switches for each threshold:
```{r getCandidates}

joinGtex <- function(val_df) {
  
  val_df %>% 
    dplyr::select(gene_name, Consensus, state, k) %>% 
    inner_join(gtex_fc, by = "gene_name")
}

thresholdScores <- function(threshold, scores) {
    
    scores %>%
        dplyr::filter(Consensus >= threshold) %>%
        mutate(Threshold = threshold)
}

getCandidates <- function(df, method, input) {
  
  thresholds <- c(0.25, 0.5, 0.75)
  
  df <- df %>% filter(state == "ON")
  df_thresholded <- map_df(thresholds, thresholdScores, df) %>% 
    mutate(Threshold = paste0("Consensus >= ", Threshold),
           Method = method,
           Input = input)
}

```

```{r get_candidates}

val_out <- list(val_s_k, val_s_d, val_s_c,
                val_b_k, val_b_d, val_b_c)

eval_methods <- c(rep("Summary", 3), rep("Binary", 3))
eval_input <- rep(c("H3K4me3", "DNase I", "ChromHMM"), 2)

val_gtex <- val_out %>%
  mclapply(joinGtex, mc.cores = 2)

val_candidates <- val_gtex %>% 
  {mcMap(getCandidates, df = ., method = eval_methods, input = eval_input,
         mc.cores = 2)} %>% 
  bind_rows()

val_candidates_long <- val_candidates %>% 
  gather(Tissue, mean_RPKM, 6:26) %>% 
  filter(Threshold == "Consensus >= 0.75")

# Count candidates
counts <- ddply(.data = val_candidates,
                .(Method, Input, Threshold),
                summarize,
                # Define a string with the number of candidates which we can plot
                n = paste0(unique(Input), "\n(n=", length(gene_name), ")")) %>% 
    filter(Threshold == "Consensus >= 0.75")
```

```{r recode_val}

val_candidates_long_clean <- val_candidates_long %>% 
    mutate(Label = ifelse(Tissue %in% brain, "Brain", "Other")) %>% 
    mutate(Tissue = factor(Tissue, levels = keep_tissue)) %>% 
    makeLabelSimple() %T>%
    write_tsv("tables/validation_boxplot.tsv")

```

*Figure 1c*
```{r fig1c_validation_boxplot_k4me3, fig.height = 3, fig.width = 6}

val_candidates_long_clean %>%
    filter(Tissue %in% keep_tissue) %>%
    filter(Input == "H3K4me3") %>%
    mutate(Method = ifelse(Method == "Summary", "Summary (n=220)", 
                           "Binary (n=125)")) %>% 
    ggplot(aes(x = Tissue, y = mean_RPKM)) +
    geom_boxplot(width = 0.65, outlier.shape = NA, aes(fill = Method),
                 position = position_dodge(width = 0.8)) +
    scale_fill_manual(values = c("Summary (n=220)" = "red",
                                 "Binary (n=125)" = "#2f60af")) +
    facet_wrap(~ Input, ncol = 1) +
    ggmin::theme_min() +
    coord_cartesian(ylim = c(0, 55)) +
    ylab("Mean RPKM") +
    theme(text = element_text(size = 10),
          axis.text.x = element_text(angle = 90, hjust = 1),
	  axis.ticks.x = element_blank(),
          legend.position = c(0.9, 0.7))

```

*Figure S10*
```{r figS10_validation_boxplots, fig.height = 8, fig.width = 8}

val_candidates_long_clean %>% 
    ggplot(aes(x = Tissue, y = mean_RPKM)) +
    geom_boxplot(outlier.shape = NA, aes(fill = Label), width = 0.65) +
    scale_fill_manual(values = pal_strategy) +
    ggmin::theme_min() +
    facet_grid(Input ~ Method) +
    coord_cartesian(ylim = c(0, 55)) +
    ylab("Mean RPKM") +
    theme(text = element_text(size = 10),
          axis.text.x = element_text(angle = 90, hjust = 1, size = 7),
          strip.text.y = element_blank(),
          legend.position = "none") +
    geom_text(data = counts, aes(x = 19, y = 52, label = n), size = 3,
                  colour = "black", inherit.aes = FALSE, parse = FALSE)

```

Second, to quantify validation, we will assess how the difference in expression
between brain and other tissues associates with the consensus score output by
`chromswitch`. To do so, we calculate the log2 fold change of the mean RPKM
across tissues in each group, across candidate switches at each threshold on 
the consensus score.

```{r val_fc}

thresholds <- seq(0, 1, by = 0.1)

getFC <- function(threshold, scores) {

    scores %>%
        dplyr::filter(Consensus >= threshold) %>%
        `[[`("log_FC") %>%
        mean()
}

val_fc <- data.frame(thresholds)

val_fc$Summary_H3K4me3  <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[1]])

val_fc$Summary_DNase    <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[2]])

val_fc$Summary_ChromHMM <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[3]])

val_fc$Binary_H3K4me3   <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[4]])

val_fc$Binary_DNase     <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[5]])

val_fc$Binary_ChromHMM  <- data.frame(thresholds) %>% 
   apply(FUN = getFC, MARGIN = 1, val_gtex[[6]])

save(val_fc, file = "computations/validation_fc.RData")

```


```{r tidy_val_fc}

val_fc_tidy <- val_fc %>% 
  gather(strategy, median_logFC, 2:length(.)) %>% 
  separate(strategy, into = c("Method", "Input"), sep = "_") %>% 
  mutate(Input = recode(.x = Input, DNase = "DNase I")) %>% 
  makeLabelSimple() %T>%
  write_tsv("tables/validation_foldchange.tsv")

```

*Figure 1d*
```{r fig1d_validation_fc, fig.width = 3.7, fig.height = 4.5}

val_fc_tidy %>% 
  filter(Input != "DNase I") %>% 
  ggplot(aes(x = thresholds, y = median_logFC)) +
  stat_smooth(aes(colour = Label, linetype = Label), se = FALSE) +
  scale_colour_manual(name = "Strategy", values = pal_strategy) +
  scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
  ggmin::theme_min() +
  ylab("Mean log2 fold change of expression") + xlab("Threshold") +
  theme(legend.position = "bottom") +
  labs(colour = "") + labs(linetype = "") +
  guides(colour = guide_legend(ncol = 1))

```

*Figure S11*
```{r figS11_validation_fc, fig.width = 7, fig.height = 7.5}

val_fc_tidy %>% 
  ggplot(aes(x = thresholds, y = median_logFC)) +
  stat_smooth(aes(colour = Label, linetype = Label), se = FALSE) +
  scale_colour_manual(name = "Strategy", values = pal_strategy) +
  scale_linetype_manual(name = "Strategy", values = linetype_strategy) +
  ggmin::theme_min() +
  # ggtitle("Median Log FC of expression across candidate genes") +
  ylab("Mean log2 fold change of expression") + xlab("Threshold") +
  theme(legend.position = "bottom") +
  labs(colour = "") + labs(linetype = "") +
  guides(colour = guide_legend(ncol = 1))

```


# Session Info {#sessioninfo}

```{r session, cache = FALSE}
sessionInfo()
```