-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathtalk04.Rmd
590 lines (402 loc) · 13.3 KB
/
talk04.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
---
title: 'R language basics, part 3: factor'
subtitle: "HUST Bioinformatics course series"
author: "Wei-Hua Chen (CC BY-NC 4.0)"
institute: "HUST, China"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
beamer_presentation:
theme: AnnArbor
colortheme: beaver
fonttheme: structurebold
highlight: tango
includes:
in_header: mystyle.sty
---
```{r include=FALSE}
color_block = function(color) {
function(x, options) sprintf('\\color{%s}\\begin{verbatim}%s\\end{verbatim}',
color, x)
}
## 将错误信息用红色字体显示
knitr::knit_hooks$set(error = color_block('red'))
```
# section 1: TOC
## 前情提要
### data frame and tibble
* declaration & usage
* manipulation (更多相关内容会在介绍 ```dplyr``` 时讲到)
* differences between data.frame and tibble
* advantages of using tibble (更多内容以后会介绍)
* ```with```, ```within```,```attach```, ```detach``` 等的用法
### IO
* read from files of different formats
* write to files
* use GUI to read files (& get the corresponding code)
## 今次预报
1. IO, project management, working environment management
2. factors: R 中最重要的概念之一
3. exercises
# section 2: IO and working enviroment management
## R session 的概念
每个R session 是一个单独的工作空间(work space),包含各自的数据、变量和操作历史。
![two R sessions](images/talk04/r_sessions.png){height=40%}
## R session in RStudio
Each RStudio session is automatically associated with a R session
![R session in RStudio](images/talk04/r_session_in_rstudio.png){heigh=40%}
## start a new RStudio session by creating a new project
1. 右上角的 Project 按钮,在弹出菜单里选 New Project …
![create new project, step 1](images/talk04/rstudio_create_new_project.png){height=40%}
## create a new project, cont.
2. Select: New directory -> New Project in the popup window
![create new project, step 2](images/talk04/rstudio_create_project_2.png){height=40%}
## create a new project, cont.
3. Enter a new directory name, choose its mother directory ...
![create new project, step 3](images/talk04/rstudio_create_new_project_3.png){height=40%}
## 现场演示
演示~~
## working space
当前工作空间,包括所有已装入的数据、包和自制函数
可通过以下代码管理变量
\FontSmall
```{r eval=TRUE}
ls(); ## 显示当前环境下所有变量
rm( x ); ## 删除一个变量
ls();
##rm(list=ls()); ## 删除当前环境下所有变量!!!
```
## variables in working space in RStudio
在RStudio右上角的"Environment"窗口显示了所有当前工作间的变量
![RStudio enviroment window](images/talk04/rstudio_enviroment_window.png){height=40%}
## save and restore work space
\FontSmall
```{r eval=FALSE}
## -- save all loaded variables into an external .RData file
save.image( file = "prj_r_for_bioinformatics_aug3_2019.RData" );
## -- restore ( load ) saved work space
load( file = "prj_r_for_bioinformatics_aug3_2019.RData" );
```
\FontNormal
### Notes
* existing variables will be kept, however, those will the same names will be replaced by loaded variables
* please consider using ```rm( list=ls() )``` to remove all existing variables to have a clean start
* you may need to reload all the packages
## save selected variables
Sometimes you need to transfer processed data to a collaborator ...
\FontSmall
```{r eval=FALSE}
## save selected variables to external
save(city, country, file="1.RData"); ## you can specify directory name
## --
load( "1.RData" );
```
\FontNormal
## close and (re)open a project
close a project is easy:
![Two ways of closing a project](images/talk04/Rstudio_close_a_project.png){height=40%}
however ...
## 退出 projects 时的一些选项 (RStudio)
![Project options](images/talk04/rstudio_project_options.png){height=30%}
### notes
* 退出时保存
* 打开时装入
* 但数据较大时,装入时间可能过长 ...
## open a project
![Open a project](images/talk04/rstudio_open_a_project.png){height=40%}
演示项目的不同打开姿式(1-5)。
## 练习
* 创建一个项目
* 定义一些变量
* 从外部文件装入一些数据
* 保存workspace到 <file_name>.RData
* 退出 project
* 重新打开 project 并恢复 workspace
# section 3: factors
## 什么是factors?
Factor is a data structure used for fields that takes only predefined, finite number of values (categorical data).
Facor 用于限制某个字段(列),只允许其接受某些值
\FontSmall
```{r}
x <- c("single", "married", "married", "single");
str(x);
```
```{r}
## create factor as it is ...
x <- as.factor(x);
## please note the change in the displayed values ...
str(x);
```
```{r}
## create factor from scratch ...
x <- factor( c( "single", "married", "married", "single" ) );
str(x);
```
## factors, cont.
Factors 会限制输入数据的选择范围
\FontSmall
```{r}
str(x);
x[ length(x) + 1 ] <- "widowed";
```
```{r}
x;
```
Use ```levels()``` function to add new factors
```{r}
levels(x) <- c(levels(x), "widowed");
x[ length(x) + 1 ] <- "widowed";
str(x);
```
## factors, cont.
Play around with ```levels()```:
\FontSmall
```{r}
## other ways of assigning factors ...
y <- as.factor( c( "single", "married", "married", "single" ) );
levels( y );
levels(y) <- c("single", "married", "widowed");
str(y);
```
```{r eval=FALSE}
## 这个代码现在就没有问题了
y[ length(y) + 1 ] <- "widowed";
```
\FontNormal
** 注意 ** 用 `as.factor` 创建 factor 时,得到的 levels 按字母表排列;
但是,用 `levels( y )` 方式指定 levels 时,则按照指定的顺序;
## `levels`的顺序决定了排序的顺序
\FontSmall
```{r}
##
y <- as.factor( c( "single", "married", "married", "single" ) );
levels(y);
sort(y);
##
y2 <- y;
levels(y2) <- c("single", "married", "widowed");
sort(y2);
```
## sort data in a meaningful way ...
\FontSmall
```{r}
## Month
x1 <- c("Dec", "Apr", "Jan", "Mar");
sort(x1);
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
sort(y1);
```
## 以出现的顺序为 `factor`
\FontSmall
```{r}
## Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.
f1 <- factor(x1, levels = unique(x1));
f1;
library(forcats); ## just to make sure the codes will run smoothly ...
## you can also use fct_inorder in the forcats package ...
f2 <- x1 %>% factor() %>% fct_inorder()
f2
```
## use factor to clean data
\FontSmall
```{r}
## 假设我有一组性别数据,其写法非常不规整;
gender <- c("f", "m ", "male ","male", "female", "FEMALE", "Male", "f", "m");
## 要求:都改为 Female, Male
gender <- as_factor( gender );
fct_count( gender );
gender <- fct_collapse(
gender,
Female = c("f", "female", "FEMALE"),
Male = c("m ", "m", "male ", "male", "Male")
)
fct_count(gender)
```
## or use `fct_relabel`
\FontSmall
```{r}
gender <- c("f", "m ", "male ","male", "female", "FEMALE", "Male", "f", "m")
gender <- as_factor(gender)
gender <- fct_relabel(gender, ~ ifelse(tolower(substring(., 1, 1)) == "f", "Female", "Male"))
fct_count(gender)
```
## factor 在做图中的应用(真正精髓)
\FontSmall
```{r fig.height=2, fig.width=6}
## 一项 mock 调查结果数据
( responses <- factor( c("Agree", "Agree", "Strongly Agree", "Disagree",
"Disagree", "Agree") ) );
## -- plot the results --
library(ggplot2);
barplot <-
ggplot( data = data.frame( res = responses ), aes( x = res )) +
geom_bar();
```
## factor 在做图中的应用, cont.
```{r fig.height=3, fig.width=8, echo=FALSE}
barplot;
```
\FontNormal
默认情况下, factor 按字母表排序: Agree -> Disagree -> Strong Agree 。ggplot2 也会按factor的排序作图
## 调整 factor 以调整画图顺序
\FontSmall
```{r fig.heigh=2, fig.width=8}
res <- data.frame( res=responses );
## -- 按照同意程度从强->弱 排序
res$res <- factor( res$res, levels = c( "Strongly Agree", "Agree", "Disagree" ), ordered = T );
str(res);
plot2 <-
ggplot( data = res, aes( x = res )) +
geom_bar() +
xlab( "Response" ) + ylab("Count");
```
## 调整 factor 以调整画图顺序, cont.
```{r fig.height=3, fig.width=8, echo=FALSE}
plot2;
```
** 练习 ** 按意程度从 弱->强 排序并作图!!
## ordered factor
通过 ordered 参数,让用户知道 factors 是经过精心排序的
\FontSmall
```{r}
( responses <- factor( c("Agree", "Agree", "Strongly Agree", "Disagree", "Disagree", "Agree"), levels = c( "Disagree", "Agree", "Strongly Agree"), ordered = T ) );
is.ordered( responses );
```
## 通过 factor 改变 值
使用 ```dplyr``` 包的 ```recode()```函数改变 value
\FontSmall
```{r}
( x <- factor( c( "alpha", "beta", "gamma", "theta", "beta", "alpha" ) ) );
## --
library( dplyr );
x <- recode( x, "alpha" = "one", "beta" = "two" );
str(x);
```
## 去除不用的 levels
?什么时候会用到:
\FontSmall
```{r}
mouse.genes <- read.delim( file = "data/talk04/mouse_genes_biomart_sep2018.txt",
sep = "\t", header = T, stringsAsFactors = T );
str(mouse.genes);
```
## 去除不用的 levels, cont.
\FontSmall
```{r fig.width=10, fig.height=4}
mouse.chr_10_12 <- subset( mouse.genes, Chromosome.scaffold.name %in% c( "10", "11", "12" ) );
## plot length distribution --
boxplot( Transcript.length..including.UTRs.and.CDS. ~ Chromosome.scaffold.name,
data = mouse.chr_10_12, las = 2 );
```
\FontNormal
```subset()``` 无法去除不用的 factors ...
## 去除不用的 levels, cont.
\FontSmall
```{r fig.width=10, fig.height=4}
mouse.chr_10_12$Chromosome.scaffold.name <-
droplevels( mouse.chr_10_12$Chromosome.scaffold.name );
levels( mouse.chr_10_12$Chromosome.scaffold.name );
## 再次 plot ...
boxplot( Transcript.length..including.UTRs.and.CDS. ~ Chromosome.scaffold.name,
data = mouse.chr_10_12, las = 2 );
```
## 也可以使用 tibble , 完全不用担心 factor 的问题 ...
\FontSmall
```{r fig.width=10, fig.height=5}
library( readr );
mouse.tibble <- read_delim( file = "data/talk04/mouse_genes_biomart_sep2018.txt",
delim = "\t", quote = "" )
mouse.tibble.chr10_12 <-
mouse.tibble %>% filter( `Chromosome/scaffold name` %in% c( "10", "11", "12" ) );
plot3 <-
ggplot( data = mouse.tibble.chr10_12,
aes( x = `Chromosome/scaffold name`,
y = `Transcript length (including UTRs and CDS)` ) ) +
geom_boxplot() +
coord_flip() +
ylim( 0, 2500 ) ;
## do not use ylim, but remove outliers
p1 <-
ggplot( data = mouse.tibble.chr10_12,
aes( x = `Chromosome/scaffold name`,
y = `Transcript length (including UTRs and CDS)` ) ) +
geom_boxplot() +
coord_flip(ylim = c(0, 5000), clip = "on");
```
## 用 tibble 解决 factor 的问题 , cont.
```{r fig.height=4, fig.width=8, echo=FALSE}
plot3;
```
## 按基因长度 中值 从 大 -> 小 排序
\FontSmall
```{r fig.width=10, fig.height=5}
plot4 <-
ggplot( data = mouse.tibble.chr10_12,
aes( x = reorder( `Chromosome/scaffold name`,
`Transcript length (including UTRs and CDS)`,
median ),
y = `Transcript length (including UTRs and CDS)` ) ) +
geom_boxplot() +
coord_flip() +
ylim( 0, 2500 ) ;
```
\FontNormal
```reorder( vector_with_factor, numeric_value , FUN = mean )``` 的用法
## 按基因长度 中值 从 大 -> 小 排序, cont.
```{r fig.height=3, fig.width=8, echo=FALSE}
plot4;
```
\FontNormal
** 注意 ** ```reorder( `Chromosome/scaffold name`, - `Transcript length (including UTRs and CDS)`, median ) ``` 的作用
## 按基因长度 中值 从 大 -> 小 排序, cont.
** 问题 **
1. 如果要按 小 -> 大 的顺序排序呢? ( ```reorder( `Chromosome/scaffold name`, - `Transcript length (including UTRs and CDS)`, median ) )
2. ```reorder``` 的作用是什么?? 只在 ggplot2 里有用吗??
## use `forcats::fct_reorder` to reorder factors
\FontSmall
```{r fig.width=10, fig.height=4}
ggplot( data = mouse.tibble.chr10_12,
aes( x = fct_reorder( `Chromosome/scaffold name`,
`Transcript length (including UTRs and CDS)`,
median ),
y = `Transcript length (including UTRs and CDS)` ) ) +
geom_boxplot() +
coord_flip() +
ylim( 0, 2500 ) ;
```
## play around with `gss_cat`: General Social Survey
先看一下数据:
\FontSmall
```{r echo=FALSE}
head( gss_cat, n = 15 );
```
## tv hours vs. religion
\FontSmall
```{r fig.width=10, fig.height=4}
relig_summary <- gss_cat %>% group_by(relig) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) + geom_point()
```
# section 4: 练习 & 作业
## 练习 & 作业
- ```Exercises and homework``` 目录下 ```talk04-homework.Rmd``` 文件;
- 完成时间:见钉群的要求
## 小结
### 今次提要
* IO, project management, working environment management
* factor :R 另一个超级重要且难以上手的概念
* 定义
* 操作
* 使用
* 基础和进阶绘图(配合 factor 讲解)
### 下次预告
* data-wrangler: `dplyr`
### important
* all codes are available at Github: https://github.com/evolgeniusteam/R-for-bioinformatics