-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy pathhcocena_main.Rmd
382 lines (196 loc) · 14.8 KB
/
hcocena_main.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
---
title: "hCoCena main markdown"
author: "Marie Oestreich, Lisa Holsten, Kilian Dahm"
date: "05 01 2024"
output:
html_document: default
pdf_document: default
---
# 0 Introduction
This Markdown contains the base workflow of hCoCena (v1.1.1). For a fully functional analysis, these steps are mandatory unless declared otherwise (flagged with OPTIONAL). Feel free to expand upon this base analysis with the provided satellite functions or your custom scripts. You will find detailed information on the satellite functions as well as intermediate outputs that you can use for your custom analyses in the repository's Wiki.
# 1 Pre-Integration Phase
## 1.1 Load hCoCena package
```{r}
library(hcocena)
```
## 1.2 Create hCoCena-Object
```{r}
init_object()
```
## 1.3 Set-up of the working directory
Please set the following parameters to strings that state the path to your expression data, meta data, etc. For more information on the different files and how to pass them to the function, please refer to the following section of the wiki: https://github.com/MarieOestreich/hCoCena/wiki/General-Backgroud-Information#count-files-annotation-files--reference-files
+ dir_count_data: path of your count files folder, set FALSE if you are NOT loading expression data from files but instead are using existing
data frames in your environment.
+ dir_annotation: path of your annotation files folder, set FALSE if you are NOT loading annotation data from files but instead are using existing
data frames in your environment.
+ dir_reference_files: path of your reference files folder (the folder with corresponding files can be downloaded from the github repo https://github.com/MarieOestreich/hCoCena, they are contained in the folder 'reference_files')
+ dir_output: path to the location in which the save folder for this analyses should be created (e.g., "~/" for your home directory).
```{r}
init_wd(dir_count_data = "PATH_TO_FOLDER_THAT_HOLDS_YOUR_GENE_EXPRESSION_TABLES",
dir_annotation = "PATH_TO_FOLDER_THAT_HOLDS_YOUR_ANNOTATION_TABLES",
dir_reference_files = "PATH_TO_REFERENCE_FILES_FOLDER",
dir_output = "PATH_TO_FOLDER_WHERE_HCOCENA_SHOULD_SAVE_ALL_ANALYSIS_OUTPUTS")
```
Ensure that all directories exist with the following chunk:
```{r check directories}
check_dirs()
```
Please define a folder in which your results shall be saved using the 'name' variable.
The folder is created automatically, if it does not yet exist in the set directory.
```{r, warning = FALSE}
init_save_folder(name = "outputs")
```
## 1.4 Defining layers
For detailed information regarding the structures of the count and annotation data as well as different options for providing data, refer to the function documentation by entering ?hcocena::define_layers into the console.
For detailed information regarding the structures of custom pathway to gene files or databases, refer to the function documentation by entering ?hcocena::set_supp_files into the console.
```{r defining Omics Layers}
define_layers(list(RNA_Seq = c("data_batch_donor.txt", "annotation_seq.txt") ,
Array = c("data.txt", "annotation_array.txt")))
set_supp_files(Tf = "TFcat.txt",
Go = "c5.go.bp.v2023.1.Hs.symbols.gmt",
Hallmark = "h.all.v2023.1.Hs.symbols.gmt",
Kegg = "c2.cp.kegg.v2023.1.Hs.symbols.gmt")
```
## 1.5 Data import
For detailed information regarding the different parameters, enter ?hcocena::read_data into the console.
```{r data import}
read_data(sep_counts = "\t",
sep_anno = "\t",
gene_symbol_col = "SYMBOL",
sample_col = "SampleID",
count_has_rn = F,
anno_has_rn = F)
read_supplementary()
```
## 1.6 Define global settings
For detailed information regarding the different settings, enter ?hcocena::set_global_settings into the console.
```{r global settings}
set_global_settings(organism = "human",
control_keyword = "baseline",
variable_of_interest = "merged",
min_nodes_number_for_network = 15,
min_nodes_number_for_cluster = 15,
range_GFC = 2.0,
layout_algorithm = "layout_with_fr",
data_in_log = T)
```
## OPTIONAL: Data-Based Definition of Top Most Variant Genes
Find inflection points in the ranked variances to filter for the top most variant genes in a data-driven way.
## 1.7 Define layer-specific settings
For detailed information regarding the different settings, enter ?hcocena::set_layer_settings into the console.
```{r layer-specific settings}
set_layer_settings(top_var = c("all", "all"),
min_corr = rep(0.9, length(hcobject[["layers"]])),
range_cutoff_length = rep(100, length(hcobject[["layers"]])),
print_distribution_plots = rep(F, length(hcobject[["layers"]])))
```
## OPTIONAL: Data distribution
There is an option to plot the distribution of counts for each sample to check for outliers or prominent differences between samples. To do so, refer to "Checking data distribution" in the satellite markdown.
## OPTIONAL: PCA
You can visualize your data in a PCA. To do so, please refer to the satellite markdown, section "PCA".
## OPTIONAL: Meta data visualization
You can visualize your meta data using the "Meta data plots" section in the satellite markdown.
## 1.8 Data processing part I
For detailed information on what happens in this step and what parameters can be set, enter ?hcocena::run_expression_analysis_1 into the console.
```{r expression analysis up to cutoff}
run_expression_analysis_1(corr_method = "pearson")
```
## 1.9 Data processing part II
**Choosing the cut-offs:**
Set a correlation cut-off for each of your data sets. To aid this choice, the following plot presents the different cut-off statistics per data set. For more details on cut-offs and the information in this plot as well as the parameters, enter ?hcocena::plot_cutoffs into the console.
```{r fig.height = 8, fig.width = 15}
plot_cutoffs(interactive = T)
```
The order in which the cutoffs are subsequently defined must correspond to the order in which the layers have previously been specified.
```{r choose cutoff}
set_cutoff(cutoff_vector = c(0.974, 0.984))
```
**Checking the scale-free topology**
For each data set, the logged degree distribution and the linear regression are plotted to visualize the preservation of the scale-free topology criterion.
NOTE: Even though biological networks are generally believed to follow a scale-free topology, experience has shown that a lot of transcriptomics data does not follow this principle perfectly. A deviation from the regression line is often observed at higher x-axis values.
```{r plot degree distribution for chosen cutoff, message = F, warning = F}
plot_deg_dist()
```
**Heatmap of top most variant genes and GFC calculation**
This function plots a heatmap for the network genes in each data layer and computes the Group-Fold-Changes for each genes per layer.
```{r, fig.width = 10, fig.height = 7}
run_expression_analysis_2()
```
# 2 Integration Phase
**Integrating the layer-specific networks**
Here, the previously constructed layer-specific networks will be integrated into one network that combines the information. The integration can be based on the union or intersection of layer-specific networks. Edges that are present in several networks with different lengths can be included based on different options. For detailed information, please refer to the Info Pages in the repository's Wiki.
```{r merge networks}
build_integrated_network(mode = "u", multi_edges = "min")
```
# 3 Post-Integration Phase
## 3.1 Module detection
**Clustering the network**
In this step, modules of strong co-expression will be detected in the network and their expression pattern across conditions will be represented in a GFC heatmap. For more information on the available clustering algorithms and setting the clustering resolution, run ?hcocena::cluster_calculation and visit the repository's Wiki pages.
NOTE: You can export your clustering for future use. To do so, please refer to the satellite script, section "Export clustering".
NOTE 2: Instead of clustering your network here, you can alternatively import a clustering model. To do so, please use the import_clusters() function (see satellite markdown for infos).
```{r compute clusters and plot module heatmap}
cluster_calculation(cluster_algo = "cluster_leiden",
no_of_iterations = 2,
resolution = 0.1,
partition_type = "ModularityVertexPartition",
max_cluster_count_per_gene = 1)
plot_cluster_heatmap()
```
## OPTIONAL: Module scores
If you would like to evaluate how well each gene belongs to its asserted cluster, please refer to the satellite markdown, section "Module scores".
## OPTIONAL: Evaluating the different community detection algorithms
If you want to compare the default Leiden clustering to other algorithms, please refer to the section "Evaluating different community detection algorithms" in the satellite markdown.
## 3.2 Plotting the network coloured by module
NOTE: due to issues in the communication between R and Cytoscape, please refer to the satellite markdown, section "Cytoscape" if you chose Cytoscape as the option for the network layout in your global settings.
```{r plot network coloured by cluster, fig.width=10, fig.height=7}
plot_integrated_network()
```
## OPTIONAL: Plotting network coloured by GFC
You can re-plot the network for each condition colouring the nodes according to their GFC in the different conditions. To do so, please refer to the section "Plotting network coloured by GFC" in the satellite markdown.
## 3.3 Database enrichment
Enrichment analysis of the found modules using the GO, KEGG, Hallmark and Reactome databases or based on a user-defined enrichment file. For detailed information on parameter settings, please run ?hcocena::functional_enrichment.
The outputs all have a slot called "enrichment"(find the results at hcobject[["integrated_output"]][["enrichments"]]).
```{r GO profiling, fig.width = 10, fig.height = 7, message = F, warning = F}
functional_enrichment(gene_sets = c("Hallmark", "Go", "Kegg"),
clusters = "all",
top = 5,
padj = "BH",
qval = 0.1)
```
## 3.4 Transcription factor enrichment analyses with ChEA3
Transcription factor enrichment analysis for each module and for the entire network based on ChEA3 (Kennan AB, 2019). For more details, refer to the function documentations with ?hcocena::TF_overrep and ?hcocena::TF_enrich_all. The plots can be found in the save folder, descriptions on how to read them can be found running ?hcocena::plot_TF_enrichment.
```{r , fig.width = 10, fig.height = 7}
TF_overrep_module(clusters = "all", topTF = 5, topTarget = 5)
```
## OPTIONAL: Transcription factor enrichment analysis based on the entire network
The transcription factor enrichment analysis can also be performed network wide. Please refer to the satellite markdown, section "Transcription factor enrichment network".
## OPTIONAL: Hub gene detection
Hub gene detection is available for selected modules or the entire network. Please refer to the satellite markdown, section "Hub gene detection".
## OPTIONAL: Expression of a specific gene set
You can plot the mean expression values per condition for a list of genes as a heatmap.
You find the corresponding function in the satellite markdown under "Visualize specific gene set".
## OPTIONAL: Colour specific gene set within the network
You further have the option of replotting the network and highlighting a particular gene set. To do so, please refer to the satellite markdown, section "Colour specific gene set".
## OPTIONAL: Colour single module within the network
You can plot the network with a module of interest highlighted (colour and node size), using the chunk in the satellite markdown in section "Colour single module".
## OPTIONAL: Correlate numeric or categorical meta data with modules
To see how numeric or categorical meta information correlates to the expression patterns of a module, refer to the satellite markdown, section "Correlate numeric meta data with modules" and "Correlate categorical meta data with modules", respectively.
## OPTIONAL: Module analysis and meta-annotation
You have the option to analyse the genes present in the found modules with regard to shared functionality or enriched gene sets as well as to annotate the groups of samples with meta information from your annotation file. To do so, please refer to the satellite markdown, section "Module analysis and meta-annotation".
## OPTIONAL: Replotting of heatmap with different variable of interest
In case you are uncertain if the "voi" you have initially chosen is the right choice for your analysis, you are given the opportunity to replot the final module heatmap based on a different annotation variable. To do so, please refer to the section "Replotting of heatmap with different variable of interest" in the satellite markdown.
## OPTIONAL: Regrouping samples
If you noticed that the variable of interest ("voi") does not go well with the clustering of the heatmaps returned by "run_expression_analysis_2" or as seen in the PCA, when evaluating different clustering algorithms, you can assign new group labels to your samples based on the data structure rather than meta information. In this case, please refer to the section "Regrouping samples" in the satellite markdown.
## OPTIONAL: Final module heatmap and network
If you regrouped your samples AND/OR conducted any steps in the "Module analysis and meta-annotation"-section, the heatmap will be re-plotted updated with respect to the grouping and annotation information along with the network coloured by modules. If You have neither regrouped your sample nor analysed your modules or samples groups, this chunk will provide the exact same output as the previous one and can be skipped.
```{r create module heatmap, fig.width = 10, fig.height = 7}
plot_cluster_heatmap()
plot_integrated_network(layout = hcobject[["integrated_output"]][["cluster_calc"]][["layout"]],
save = F)
```
# 4. Save essential information
The parameters of the analysis session are written to a text file to enhance reproducibility without keeping and sharing a markdown for every analysis. It documents the name of the files and their location used as count and annotation files, the global settings set in the session, the layer settings set for each dataset as well as the cut-offs and the clustering algorithm used.
The file can be found in the save-folder.
```{r write session info}
write_session_info()
```