-
Notifications
You must be signed in to change notification settings - Fork 43
/
Copy pathTCGA_summary.Rmd
305 lines (244 loc) · 15.1 KB
/
TCGA_summary.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
---
title: "TCGA survival analysis"
date: "`r Sys.Date()`"
output:
pdf_document:
toc: no
html_document:
theme: united
toc: no
csl: styles.ref/genomebiology.csl
bibliography: data.TCGA/TCGA.bib
editor_options:
chunk_output_type: console
---
```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
# Set up the environment
library(knitr)
opts_chunk$set(cache.path='cache/', fig.path='img/', cache=F, tidy=T, fig.keep='high', echo=F, dpi=100, warnings=F, message=F, comment=NA, warning=F, results='as.is', fig.width = 10, fig.height = 6) #out.width=700,
library(pander)
panderOptions("table.split.table", Inf)
set.seed(1)
library(dplyr)
options(stringsAsFactors = FALSE)
```
# Settings
```{r settings}
# Settings
#Aggregation of the results generated by the [survival.Rmd](https://github.com/mdozmorov/TCGAsurvival/blob/master/survival.Rmd) file
# Select cancer type
cancer <- "BRCA"
# Select genes of interest
selected_genes <- "MYBL2" # Search and replace the name of the selected_genes
n_max <- 10 # Maximum number of rows to output in tables
p_val_cutoff <- 0.05 # Regular p-value cutoff
p_adj_cutoff <- 0.1 # FDR cutoff
```
# Global analysis of [survival.Rmd](https://github.com/mdozmorov/TCGAsurvival/blob/master/survival.Rmd)
## Analysis 2: Survival effect of `r selected_genes` in all cancers
The bar plot shows the significance of `r selected_genes` expression on survival in a given cancer. The wider (higher) the bar, the more significant survival effect the `r selected_genes` has. See abbreviations of cancer types at [http://www.liuzlab.org/TCGA2STAT/CancerDataChecklist.pdf](http://www.liuzlab.org/TCGA2STAT/CancerDataChecklist.pdf)
```{r echo=FALSE, out.width='60%'}
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis2/", selected_genes, "_all_TCGA_cancers.png"))
```
\pagebreak
The same data in table format. Legend:
- `Cancer`, `Cancer.Name` - cancer abbreviation and description
- `Gene` - gene name for which survival analysis was run
- `p.value` - significance of the survival effect
- `HR`, `HR_left`, `HR_right` - hazard ratio, and left/right confidence interval
- `Min.`, `X1st.Qu.`, `Median`, `Mean`, `X3rd.Qu.`, `Max.` - expression level of the gene in a corresponding cancer
- `Cutoff_type`, `Cutoff_value` - gene expression cutoff best discriminating survival
```{r}
cancers <- openxlsx::read.xlsx("data.TCGA/TCGA_cancers.xlsx")
mtx <- read.table(paste0(selected_genes, ".", cancer, ".Analysis2/global_stats.txt"), sep = "\t", header = TRUE)
mtx <- mtx[order(mtx$p.value), ]
mtx <- mtx %>% mutate(p.adjust = p.adjust(p.value, method = "BH"))
mtx <- left_join(mtx, cancers, by = c("Cancer" = "Acronym"))
mtx <- mtx[, c("Cancer", "Cancer.Name", "p.value", "p.adjust", "HR", "HR_left", "HR_right", "Min.", "X1st.Qu.", "Median", "Mean", "X3rd.Qu.", "Max.", "Cutoff_value")]
colnames(mtx) <- c("Cancer", "Cancer Name", "p-value", "FDR", "Hazard Ratio", "HR left", "HR right", "Min gene expression", "1st Qu", "Median", "Mean", "3rd Quantile", "Max gene expression", "Expression cutoff")
# Save to file
x <- list(mtx)
names(x) <- selected_genes
writexl::write_xlsx(x, paste0(selected_genes, ".", cancer, ".Analysis2/global_stats.xlsx"))
# Modify number of columns in mtx for pdf
mtx <- mtx [,c("Cancer", "Cancer Name", "p-value", "FDR", "Hazard Ratio", "Expression cutoff")]
# DT::datatable(mtx)
pander(mtx[1:min(nrow(mtx), n_max), ])
```
## Analysis 3: Survival effect of `r selected_genes` in specific clinical subtypes, all cancers
Top `r n_max` clinincal subtypes across all cancers where `r selected_genes` significantly (adjusted p-value < `r p_adj_cutoff`) affects survival
```{r}
mtx <- read.table(paste0(selected_genes, ".", cancer, ".Analysis3/global_stats.txt"), sep = "\t", header = TRUE, fill = TRUE)
# Correct for multiple testing in individual cancers
# all_cancers = c("ACC", "BLCA", "PRAD", "CESC", "CHOL", "COAD", "COADREAD", "DLBC", "ESCA", "GBM", "GBMLGG", "HNSC", "KICH", "KIPAN", "KIRC", "KIRP", "LGG", "LIHC", "LUAD", "LUSC", "MESO", "OV", "PAAD", "PCPG", "PRAD", "READ", "SARC", "SKCM", "STAD", "STES", "TGCT", "THCA", "THYM", "UCEC", "UCS", "UVM") # "LAML",
# Get all cancers from the matrix itself
all_cancers <- sapply(mtx$Cancer, function(x) strsplit(x, "-")[[1]][1]) %>% unname %>% unique
all_adjusted_p_values <- c()
for (cancer_selected in all_cancers) {
mtx_subset <- mtx[grepl(paste0("^", cancer_selected, "-"), mtx$Cancer, ignore.case = FALSE), , drop = FALSE]
all_adjusted_p_values <- c(all_adjusted_p_values, p.adjust(mtx_subset$p.value, method = "BH"))
}
mtx$p.adjust <- all_adjusted_p_values
mtx$Subgroup <- mtx$Cancer
mtx$Cancer <- sapply(mtx$Cancer, function(x) strsplit(x, "-")[[1]][1]) %>% unname
mtx <- mtx[with(mtx, order(Cancer, p.value)), ]
mtx <- mtx[, c("Cancer", "Subgroup", "p.value", "p.adjust", "HR", "HR_left", "HR_right", "Min.", "X1st.Qu.", "Median", "Mean", "X3rd.Qu.", "Max.", "Cutoff_value")]
colnames(mtx) <- c("Cancer", "Subgroup", "p-value", "FDR", "Hazard Ratio", "HR left", "HR right", "Min gene expression", "1st Qu", "Median", "Mean", "3rd Quantile", "Max gene expression", "Expression cutoff")
rownames(mtx) <- NULL
#Creating subset of mtx for pdf table
mtx_pdf <- mtx [, c("Cancer", "Subgroup", "p-value", "FDR", "Hazard Ratio", "Expression cutoff")]
mtx_pdf[order(mtx_pdf$FDR), ] %>% head(., n = n_max) %>% pander
```
## Analysis 3: Cancer-specific top clinical subtypes where `r selected_genes` significantly affects survival, for each cancer
Only results for cancers with clinical subtypes where `r selected_genes` significantly (adjusted p-value < `r p_adj_cutoff`) affects survival are shown. The table is sorted by the "Number of significant" column. Cancers with zero Number of significant subtypes are not shown.
```{r}
clinical_subgroups_stats <- list() # List to collect statistics about clinical subgroups
for (cancer_selected in all_cancers) {
clinical_subgroups_stats <- c(clinical_subgroups_stats, list(c(
cancer_selected, # Selected cancer
sum(grepl(paste0("^", cancer_selected, "-"), mtx$Subgroup, ignore.case = FALSE)), # total clinical annotations
sum(grepl(paste0("^", cancer_selected, "-"), mtx$Subgroup, ignore.case = FALSE) & mtx$FDR < p_adj_cutoff) # significant at FDR
)))
}
clinical_subgroups_stats <- do.call(rbind.data.frame, clinical_subgroups_stats)
clinical_subgroups_stats <- clinical_subgroups_stats %>% mutate( Proportion = as.numeric(clinical_subgroups_stats[, 3]) / as.numeric(clinical_subgroups_stats[, 2]) )
colnames(clinical_subgroups_stats) <- c("Cancer", "Number of clinical subgroups", "Number of significant", "Proportion of significant")
clinical_subgroups_stats <- left_join(clinical_subgroups_stats, cancers, by = c("Cancer" = "Acronym"))
clinical_subgroups_stats <- clinical_subgroups_stats[order(as.numeric(clinical_subgroups_stats$`Number of significant`), decreasing = TRUE), ]
clinical_subgroups_stats <- clinical_subgroups_stats[as.numeric(clinical_subgroups_stats$`Number of significant`) > 0, ]
clinical_subgroups_stats$`Proportion of significant` <- round(clinical_subgroups_stats$`Proportion of significant`, digits = 2)
rownames(clinical_subgroups_stats) <- NULL
kable(clinical_subgroups_stats, caption = paste0("Summary of the number of clinical subtypes where the expression of ", selected_genes, " significantly affects survival"))
```
```{r}
# Save to file
x <- list(mtx, clinical_subgroups_stats)
names(x) <- c(selected_genes, paste0(selected_genes, ".summary"))
writexl::write_xlsx(x, paste0(selected_genes, ".", cancer, ".Analysis3/global_stats.xlsx"))
```
```{r results="asis"}
for (cancer_selected in all_cancers) {
mtx_subset <- mtx[grepl(paste0("^", cancer_selected, "-"), mtx$Subgroup, ignore.case = FALSE) & mtx$FDR < p_adj_cutoff, , drop = FALSE]
rownames(mtx_subset) <- NULL
mtx_subset <- mtx_subset[,c("Cancer","Subgroup","p-value","FDR","Hazard Ratio", "Expression cutoff")]
if (nrow(mtx_subset) > 0) {
mtx_subset %>% head(., n=n_max) %>% kable(., caption = cancers$Cancer.Name[cancers$Acronym == cancer_selected]) %>% print()
}
}
```
## Analysis 3: Survival effect of `r selected_genes` in manually defined clinical subtypes, all cancers
```{r results="asis"}
# Add any subcategories to check if the gene affects survival in them in any cancer
clinical_subcategories <- c("race-black or african american") #, "radiationtherapy-no", "pathologyMstage-m0", "pathologyNstage-n0")
for (selected_clinical_subcategory in clinical_subcategories) {
mtx_subset <- mtx[grepl(selected_clinical_subcategory, mtx$Subgroup) & mtx$FDR < p_adj_cutoff, ]
rownames(mtx_subset) <- NULL
mtx_subset <- mtx_subset[,c("Cancer","Subgroup","p-value","FDR","Hazard Ratio", "Expression cutoff")]
kable(mtx_subset, caption = selected_clinical_subcategory) %>% print()
}
```
\pagebreak
# Cancer-specific analysis
## Analysis 2: Survival effect of `r selected_genes` in `r cancer` cancer
```{r echo=FALSE, out.width='60%'}
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis2/", selected_genes, "_", cancer, ".png"))
```
## Analysis 3: Survival effect of `r selected_genes` in specific clinical subtypes, `r cancer` cancer
The table lists clinical subtypes where the expression of `r selected_genes` in `r cancer` most significantly affects survival. The table is sorted by increasing p-values, most significant on top. Description of clinical subtypes can be found at [https://gdc.cancer.gov/about-data/data-harmonization-and-generation/clinical-data-harmonization](https://gdc.cancer.gov/about-data/data-harmonization-and-generation/clinical-data-harmonization)
```{r}
global_stats <- read.table(paste0(selected_genes, ".", cancer, ".Analysis3/global_stats.txt"), sep = "\t", header = TRUE, fill = TRUE)
global_stats <- global_stats[order(global_stats$p.value), ]
global_stats_subset <- global_stats[grepl(cancer, global_stats$Cancer, fixed = TRUE, ignore.case = FALSE), ]
rownames(global_stats_subset) <- NULL
#Creating subset of global_stats for pdf
global_stats_subset_pdf <- global_stats_subset [, c("Cancer", "Gene", "p.value", "HR", "Cutoff_value")]
global_stats_subset_pdf %>% head(., n = n_max) %>% pander
```
Top five corresponding survival plots
```{r echo=FALSE, out.width='60%'}
print(global_stats_subset$Cancer[1])
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis3/", selected_genes, "_", global_stats_subset$Cancer[1], ".png"))
```
```{r echo=FALSE, out.width='60%'}
print(global_stats_subset$Cancer[2])
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis3/", selected_genes, "_", global_stats_subset$Cancer[2], ".png"))
```
```{r echo=FALSE, out.width='60%'}
print(global_stats_subset$Cancer[3])
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis3/", selected_genes, "_", global_stats_subset$Cancer[3], ".png"))
```
```{r echo=FALSE, out.width='60%'}
print(global_stats_subset$Cancer[4])
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis3/", selected_genes, "_", global_stats_subset$Cancer[4], ".png"))
```
```{r echo=FALSE, out.width='60%'}
print(global_stats_subset$Cancer[5])
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis3/", selected_genes, "_", global_stats_subset$Cancer[5], ".png"))
```
```{r}
if (cancer %in% c("BRCA","OV")){
continue_analysis <- TRUE
} else {
continue_analysis <- FALSE
}
opts_chunk$set(eval = continue_analysis, include = continue_analysis)
```
## Analysis 5: Clinical-centric analysis. Selected cancer, selected clinical subcategory, selected_genes expression differences across categories
Expression of `r selected_genes` in selected clinical subcategories. Valid only for BRCA or OV cancer types.
```{r out.height='250px'}
# 'PAM50Call_RNAseq' for PRAD, 'subtype' for OV
if (cancer == "BRCA"){
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis5/", cancer, "_", selected_genes, "_PAM50Call_RNAseq.png"))
}
if (cancer == "OV"){
knitr::include_graphics(paste0(selected_genes, ".", cancer, ".Analysis5/", cancer, "_", selected_genes, "_subtype.png"))
}
```
```{r}
global_stats <- read.table(paste0(selected_genes, ".", cancer, ".Analysis5/global_stats.txt"), sep = "\t", header = TRUE, fill = TRUE)
global_stats <- global_stats[order(global_stats$p.value), c("Cancer", "Gene", "p.value", "HR")]
rownames(global_stats) <- NULL
pander(global_stats)
```
### ANOVA
What are the means of log2-expression per clinical subgroup?
```{r}
load(file = paste0(selected_genes, ".", cancer, ".Analysis5/mtx_to_plot.rda"))
tapply(mtx_to_plot$Gene, mtx_to_plot$Clinical, mean)
```
Is the expression of selected_genes `r selected_genes` significantly different across clinical subgroups? Significant "Pr(>F)" suggests "Yes"
```{r}
# ANOVA
groupdiff <- lm(mtx_to_plot$Gene ~ mtx_to_plot$Clinical)
# summary(groupdiff)
anova(groupdiff)
```
### Tukey HSD (Honest Significant Difference) test
Which pair of clinical categories have significant differences? "p.adj" and confidence intervals that do not cross 0 suggest significant differences in selected_genes expression between the subgroups in the corresponding pairwise comparison.
```{r}
# Tukey HSD
a1 <- aov(mtx_to_plot$Gene ~ mtx_to_plot$Clinical)
# summary(a1)
posthoc <- TukeyHSD(x=a1, 'mtx_to_plot$Clinical', conf.level=0.95)
posthoc <- posthoc$`mtx_to_plot$Clinical`
posthoc[order(posthoc[, "p adj"]), ] %>% pander()
# Differences in means plot
# par(las = 2)
# par(mar = c(5, 8, 5, 8))
# plot(posthoc)
```
Is there a survival difference between clinical subgroups?
```{r}
global_stats <- read.table(paste0(selected_genes, ".", cancer, ".Analysis5/global_stats.txt"), sep = "\t", header = TRUE, fill = TRUE)
global_stats <- global_stats[order(global_stats$p.value), c("Cancer", "Gene", "p.value", "HR")]
rownames(global_stats) <- NULL
pander(global_stats)
```
\pagebreak
# Methods
## Survival analysis of selected_genes expression data from TCGA
Gene expression data summarized as RSEM values were obtained using the `TCGA2STAT` R package v 1.2, along with the corresponding clinical annotations. Data for each of the 34 cancers were obtained separately. The data were log2-transformed and analyzed using Kaplan-Meier curves and Cox proportional hazard model. Each gene of interest was analyzed for its effect on survival by separating patients into high/low expression subgroups. A modified approach from [@Mihaly:2013aa] was used to estimate the best gene expression cutoff that separates high/low expression subgroups with differential survival.
We took advantage of the availability of clinical annotations. To identify if the expression of a gene of interest affects survival in any specific clinical subgroup, subsets of patients annotated with specific clinical annotations were selected (e.g., “males” or “females” in the “gender” clinical annotation). Subgroups with < 40 patients were not considered.
## Differential expression analysis
Samples in the selected cancer cohort were sorted by expression of the selected genes. Differentially expressed genes were detected between samples in the upper 75 percentile of the expression gradient and samples in the lower 25 percentile using `limma` v 3.32.6 R package [@Ritchie:2015aa; @Smyth:2004aa]. P-values were corrected for multiple testing using the False Discovery Rate (FDR) method [@Benjamini:1995aa]. Genes differentially expressed at FDR < 0.01 were selected for further analysis.
# References