-
Notifications
You must be signed in to change notification settings - Fork 68
/
Copy pathtalk00_试讲.Rmd
441 lines (301 loc) · 12.9 KB
/
talk00_试讲.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
---
title: "Multi-omics data analysis and visualisation, #1"
subtitle: "HUST Bioinformatics course series"
author: "Wei-Hua Chen (CC BY-NC 4.0)"
institute: "HUST, China"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
beamer_presentation:
theme: AnnArbor
colortheme: beaver
fonttheme: structurebold
highlight: tango
includes:
in_header: mystyle.sty
---
```{r include=FALSE}
color_block = function(color) {
function(x, options) sprintf('\\color{%s}\\begin{verbatim}%s\\end{verbatim}',
color, x)
library(tidyverse);
}
## 将错误信息用红色字体显示
knitr::knit_hooks$set(error = color_block('red'))
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# section 1: TOC
## Course contents
1. 开发平台相关软件安装
2. R基础知识
3. 数据处理
4. 做图
5. 常用生信分析任务与解决方法
注:不一定按上面的顺序来
## 技能树
### 前置条件
| | | |
|---------------------|-------------------------|----|
| 编程? | 不需要 | plus |
| Linux? | 不需要 | plus |
| 数据处理经历? | 不需要 | plus |
| 数据处理需求 | 需要 | |
## Class rules
1. 每次随机点名
2. 课堂随机提问
3. 每人2次无理由缺课机会,占平时成绩的一半
4. 考试、平时成绩各占总成绩的50%
注: 严格按以上进行,如多次表现不好,可能会不及格
# section 2: why choose R?
## R语言简史
1993到2000这段时间R只在小范围内流传。2000年之后开始大爆发,用户数量直线上升。除去R本身的优秀之外,这种爆发与多个因素有关,比如自由软件的兴起,Linux的成熟等等;经济危机也促进大家采用免费的自由软件替代统计领域的传统强者如SPSS、SAS和Matlab等(注:均为收费软件)。
首先,越来越多的学术文章使用R作为分析工具。根据来自著名学术搜索引擎Google Scholar(谷歌学术)的数据,R的流行趋势有以下两个特点:1)在学术领域的市场份额逐年增加,且增势迅猛,2)R是为数不多市场份额增加的统计软件之一。
接下来我们就用R把这个趋势画出来!如下面代码所示,所需代码包括4个部分:装入所需要的包,读取数据,处理数据和作图。运行这段代码,既专业又美观的图片就生成了!
## R的流行性调查
代码
\fontsize{7}{8}\selectfont
```{r warning=FALSE, message=FALSE}
library("ggplot2"); library("reshape2");
dat <- read.csv(file = "data/talk01/chaper01_preface_scholarly_impact_2012.4.9.csv");
cols.subset <- c("Year", "JMP","Minitab","Stata","Statistica","Systat","R");
Subset <- dat[ , cols.subset];
ScholarLong <- melt(Subset, id.vars = "Year");
names(ScholarLong) <- c("Year","Software", "Hits");
plot1 <-
ggplot(ScholarLong, aes(Year, Hits, group=Software)) + #准备
geom_smooth(aes(fill=Software), position="fill", method="loess") + #画图
ggtitle("Market share") + #设置图标题
scale_x_continuous("Year") + # 改变X轴标题
scale_y_continuous("Google Scholar %", labels = NULL ) +
theme(axis.ticks = element_blank(), text = element_text(size=14)) +
guides(fill=guide_legend( title = "Software", reverse = F )) +
geom_text(data = data.frame( Year = 2011, Software = "R", Hits = 0.10 ),
aes(label = Software), hjust = 0, vjust = 0.5);
```
## Market share, result
```{r plot1, fig.width=10, fig.height=5, echo=FALSE, warning=FALSE, message=FALSE }
plot1
```
**注**:这里移除了市场占有率较大的SAS和SPSS
## R的招聘趋势
其次,统计分析相关工作的招聘信息中要求申请者会用R的也越来越多了。根据美国招聘搜索引擎indeed.com的数据,自2005年(此搜索引擎提供的最早数据)起,需要用到R的招聘信息占总体招聘的比例逐年上升,目前仅排在SAS和Matlab之后,处于第3位。而且,除了Stata之外,R是唯一一个占比上升。
同样的,我们用R把这个趋势画出来!
## R job trends
代码
\fontsize{7}{8}\selectfont
```{r warning=FALSE, message=FALSE}
library("ggplot2"); ## 主作图包
##2. -- 读取数据 --
dat <- read.table(file ="data/talk01/chaper01_preface_indeed_com_stats_2015.txt",
header = T, as.is = T);
##3. 处理数据
dat$date <- as.Date(dat$date); ## 把第一列改为日期
#根据job对software进行调整
dat <- transform(dat, software = reorder(software, job));
plot2 <-
ggplot( dat, aes( date, job, group = software, colour = software) ) +
geom_line( size = 0.8 ) +
ggtitle("Job trends (data from indeed.com)") + #设置图标题
xlab("Year") + ylab("%") +
#改变字体大小;要放在theme_grey()后面
theme( text = element_text(size=14) ) +
guides(colour=guide_legend( title = "Tool", reverse = TRUE )) +
scale_colour_brewer(palette="Set1") + #改变默认颜色
geom_text(data = dat[dat$date == "2015-01-01" & dat$software %in% c("R"), ],
aes(label = software), hjust = 0, vjust = 0.5);
```
## R job trends, plot
```{r plot2,echo=FALSE,fig.width=10, fig.height=5}
plot2;
```
## Programming language trends 2016 vs 2015
![Worldwide Google Trends, Dec 2016 vs Dec 2015](images/talk01/google_trends.png){width=70%}
## Popularity by year
![yet another statistics](images/talk01/R_python_and_matlab.png){height=50%}
Data from: http://blog.revolutionanalytics.com/2018/06/pypl-programming-language-trends.html
Accessed on: Sep 24, 2018
## Programming languages for bioinformatics
### Perl或Python
- 强大的文本处理能力(包括序列)
- 不错的运行速度(尤其是Python)
- 强大的生信和统计学扩展包(尤其是Python)
- 方便的并行计算
### R
- 强大的格式数据处理能力(二维表格, dplyr)
- 无以伦比的统计学专业性
- 专业而好看的数据可视化软件(ggplot2)
- 专业的生信扩展包(Bioconductor)
- 超级好用的整合开发环境IDE(RStudio)
## 我用过的 programming languages
\begin{columns}
\begin{column}{0.48\textwidth}
\begin{itemize}
\item - C
\item - \textbf{Perl}
\item - \textbf{R}
\item - \textbf{PHP}
\item - \textbf{Java}
\item - \textbf{MySQL}
\item - HTML
\item - Javascript
\end{itemize}
\href{https://www.evolgenius.info/evolview}{Evolview ver3.0}
cited 95 times in 2018, 75 so far (as of July 19)
\end{column}
\begin{column}{0.48\textwidth}
\begin{figure}
\includegraphics[width=\linewidth]{images/talk01/evolview_showcase3.png}
\caption{.Evolview showcase 3}
\label{fig:evolviewshowcase03}
\end{figure}
\end{column}
\end{columns}
# Section 3: typical applications in metagenomics
## 生信任务1 : 宏基因基因数据展示1 (单样本)
![肠道物种丰度示意图](images/talk06/gmrepo_run_statistics.png){height=60%}
Find more at the [GMrepo database](https://gmrepo.humangut.info/home).
## 什么是宏基因组?
Metagenomics is the study of genetic material recovered directly from environmental samples.
### research contents
- mostly prokaryotes
- unicellular eukaryotes
- viruses
### techniques
- 16S (universially conserved gene in prokaryotes)
- whole genome sequencing (WGS or metagenomics)
## what can metagenomics do?
- identify new species
- reveal species kinds and abundances
- relate species changes to human health and disease
## biomes
![Biomes](images/talk06/ebi_metagenomics_biomes_aug8_2019.png){height=60%}
Screenshot taken from the [EBI metagenomics database](https://www.ebi.ac.uk/metagenomics/) on Aug 8, 2019.
## enviromental microbiome
- soil
- ocean
![The tara oceans expedition](images/talk06/tara_oceans_boat.jpg){height=50%}
## host associated microbiomes
- human body sites
- animals & plants
![NIH human microbiome project](images/talk06/iHMP.png){height=30%}
## why human gut microbiota is important?
![human gut microbita](images/talk06/GutMicrobeInfographic.png){height=60%}
## tasks of human gut microbiota analysis
- identify new species
- find good, bad and commensal microbes
- link microbial variations to human health
- mechanisms
- modulation, intervention and regulation
## typical human gut microbiome data
Species abundances
\FontSmall
```{r}
abu <- read_delim(file = "data/talk06/relative_abundance_for_RUN_ERR1072629_taxonlevel_species.txt",
delim = "\t", quote = "", comment = "#");
nrow(abu);
```
## Species abundances, cont.
\FontSmall
```{r}
abu %>% arrange( desc( relative_abundance ) ) %>% do( head(., n = 10) );
```
## 相对丰度作图要求
1. 按丰度从高到低排序
2. 只取前10个species (保留10行)
3. 将后面的丰度累加在一起,汇总为“Others”分类
## 数据处理
\FontSmall
```{r}
library( tidytidbits );
abu.dat <-
abu %>% arrange( desc( relative_abundance ) ) %>%
lump_rows( scientific_name, relative_abundance, n = 10, other_level = "Others" );
head(abu.dat, n = 11);
```
## 尝试作图
\FontSmall
```{r fig.height=4, fig.width=10}
ggplot(abu.dat, aes(x = scientific_name, y = relative_abundance, fill = scientific_name ) ) +
geom_bar( stat = "identity" ) +
coord_flip()
```
## 调整排列顺序
\FontSmall
```{r fig.height=4, fig.width=10}
## 用 forcat 包的 fct_reorder 函数; 注意它3个参数的意义!!!
ggplot(abu.dat, aes(x = fct_reorder( scientific_name, relative_abundance, .desc = F),
y = relative_abundance, fill = scientific_name ) ) +
geom_bar( stat = "identity" ) +
coord_flip() + xlab("Species") + ylab( "Relative abundance %" )
```
## 更多 forcats 的应用 ...
see here: https://cran.r-project.org/web/packages/forcats/vignettes/forcats.html
更多应用将会在 作图 (ggplot2) 时讲到。
## 生信任务2:宏基因组多样本物种丰度图(多样本)
![Microbiome 3, 28 2015](images/talk09/microbiome_fig.png){height=50%}
## stacked bars , cont.
load data
\FontSmall
```{r}
speabu <-read_tsv( file = "data/talk09/mock_species_abundance.txt" );
head( speabu );
```
## stacked bars , cont.
\FontSmall
```{r fig.width=10, fig.height=4}
ggplot( speabu, aes( x = id, y = abundance, fill = genus ) ) +
geom_bar( stat = "identity", position = "stack", color = "black", width = 0.8 );
```
## 指定Genus展示顺序
factor 的操纵详见第4章。
\FontSmall
```{r fig.height=4, fig.width=10}
speabu$genus <- factor( speabu$genus, levels = rev( c( "Enterobacteriaceae", "Lachnospiraceae", "Bacteroidaceae", "Lactobacillaceae",
"Clostridiaceae", "Ruminococcaceae", "Prevotellaceae", "Erysipelotrichaceae", "Streptococcaceae", "Enterococcaceae", "Other" ) ) );
ggplot( speabu, aes( x = id, y = abundance, fill = genus ) ) +
geom_bar( stat = "identity", position = "stack", color = "black", width = 0.8 );
```
## 按丰度排序
按丰度中值大小排序
\FontSmall
```{r fig.height=4, fig.width=10}
speabu$genus <- reorder( speabu$genus, speabu$abundance, median );
ggplot( speabu, aes( x = id, y = abundance, fill = genus ) ) +
geom_bar( stat = "identity", position = "stack", color = "white", width = 0.8 );
```
## ``` position = "stack" ``` 又是什么??
``` position = "dodge" ``` : plot bars next to each other ...
\FontSmall
``` {r fig.height=4, fig.width=10}
ggplot( speabu, aes( x = id, y = abundance, fill = genus ) ) +
geom_bar( stat = "identity", position = "dodge", color = "white", width = 0.8 );
```
## 显示数值 ...
\FontSmall
``` {r fig.height=4, fig.width=10}
## 先计算显示位置
speabu <- speabu %>% arrange( id, desc( factor( genus ) ) ) %>%
group_by( id ) %>% mutate( ypos = cumsum( abundance ) - abundance / 2 );
## 画图
ggplot( speabu, aes( x = id, y = abundance, fill = genus ) ) +
geom_bar( stat = "identity", position = "stack", color = "black", width = 0.8 ) +
geom_text( aes( y = ypos, label = paste( abundance, "%", sep = "" ) ), color = "white" );
```
## 显示数值 ... , cont.
**要点**
* 使用 ``` ddplyr ``` 的 ``` cumsum() ``` 函数 ...
* 计算位置:当前累加值 - 自身值/2,使数字显示在当前值的中间
* 累加前,要对数据按 factors 进行排序;通过 ``` arrange ``` 函数实现;
# section 4: 写在后面
## 课程教材
- all codes are available at Github: https://github.com/evolgeniusteam/R-for-bioinformatics
## 网站链接、参考文献和扩展阅读
综上所述,R已经是最流行的免费统计分析软件,排名仅在几个传统的分析软件之后,而且大有赶超它们的趋势。学好R,不仅有助于在学术研究领域的发展,对找工作也有不少的帮助。
- R的官方网站: http://www.r-project.org
- R档案综合网络,即CRAN(Comprehensive R Archive Network): http://cran.r-project.org/
- ggplot2: http://ggplot2.org/
- RStudio: http://www.rstudio.com/
- 如何从Google Scholar抓取引用数据: http://librestats.com/2012/04/12/statistical-software-popularity-on-google-scholar/
- indeed招聘趋势: www.indeed.com/jobtrends
- R for data science: https://r4ds.had.co.nz (必读!!)