-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathtalk11.Rmd
703 lines (481 loc) · 16.1 KB
/
talk11.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
---
title: "Linear and nonlinear regression"
subtitle: "HUST Bioinformatics course series"
author: "Wei-Hua Chen (CC BY-NC 4.0)"
institute: "HUST, China"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
beamer_presentation:
theme: AnnArbor
colortheme: beaver
fonttheme: structurebold
highlight: tango
includes:
in_header: mystyle.sty
---x
```{r include=FALSE}
color_block = function(color) {
function(x, options) sprintf('\\color{%s}\\begin{verbatim}%s\\end{verbatim}',
color, x)
}
## 将错误信息用红色字体显示
knitr::knit_hooks$set(error = color_block('red'))
```
# section 1: TOC
## 前情提要
- R basics
- R data wrangler
- R plot
- R string, regular expression
- R parallel computing
## 本次提要
- linear regression
- nonlinear regression
- modeling and prediction
- **K-fold** & **X times** cross-validation
- external validation
# section 2: Linear regression
## what is linear regression?
线性回归是利用数理统计中回归分析,来确定两种或两种以上变量间相互依赖的定量关系的一种统计分析方法
- Y 可以被 一个变量 `X` 解释;一元线性回归
- Y 可以被 `X`, `Z` 等多个变量解释;multivariate linear regression
## 举例
\FontSmall
```{r echo=FALSE, fig.width=10, fig.height=5, warning=FALSE, message=FALSE}
library(tidyverse);
m = lm(Fertility ~ Education, swiss);
eq <- substitute( paste( italic(y), " = ", a + b %.% italic(x), sep = "" ),
list(a = as.vector( format(coef(m)[1], digits = 2) ),
b = as.vector( format(coef(m)[2], digits = 2) )
));
eq <- as.character(as.expression(eq));
swiss %>% ggplot( aes( x = Education, y = Fertility )) +
geom_point() + geom_smooth( method = "lm" ) + ggtitle("SWISS data 1888") +
geom_text( data = NULL,
aes( x = 30, y = 80, label= eq, hjust = 0, vjust = 1),
size = 4, parse = TRUE, inherit.aes=FALSE)
```
## 解释
\FontSmall
```{r message=FALSE, warning=FALSE}
m <- lm(Fertility ~ Education, data = swiss);
summary(m);
```
**lm(\[target variable\] \~ \[predictor variables\], data = \[data source\])**
## 得到 Coefficients
\FontSmall
```{r message=FALSE, warning=FALSE}
coef( m );
```
## other useful functions
\FontSmall
```{r eval=FALSE}
# Other useful functions
coefficients(m) # model coefficients
confint(m, level=0.95) # CIs for model parameters
fitted(m) # predicted values
residuals(m) # residuals
anova(m) # anova table
vcov(m) # covariance matrix for model parameters
influence(m) # regression diagnostics
```
## R-squred a.k.a R2 是怎么来的?
\FontSmall
```{r warning=FALSE, message=FALSE}
library(magrittr);
library(caret);
predictions <- m %>% predict( swiss ); ## or use fitted(m) instead ...
# Model performance
data.frame(
RMSE = RMSE(predictions, swiss$Fertility),
R2 = R2(predictions, swiss$Fertility)
)
```
**RMSE** mean squared error, the smaller the better **R2** R, higher the better \* F-statistic: Higher the better
## R-squred a.k.a R2 是怎么来的? cont.
\FontSmall
```{r fig.width=10, fig.height=5}
data.frame( PREDICITON = predictions, ACTUAL = swiss$Fertility ) %>%
ggplot(aes( PREDICITON, ACTUAL )) + geom_point() +
ggtitle( "SWISS DATA 1888 Fertility" ) +
geom_abline(intercept = 0, slope = 1, colour = "red")
```
## Multivariate linear modeling
`datarium` package
\FontSmall
```{r eval=FALSE}
install.packages("datarium");
```
```{r message=FALSE, warning=FALSE}
library(datarium);
head(marketing);
```
**问题**: 广告投放在哪里对**销售**有帮助??
## multivariate linear modeling , cont.
\FontSmall
```{r message=FALSE, warning=FALSE}
m1 <- lm( sales ~ youtube + facebook + newspaper, data = marketing );
m2 <- lm( sales ~ youtube, data = marketing);
m3 <- lm( sales ~ facebook, data = marketing);
summary(m1);
```
## facebook vs. youtube
\FontSmall
```{r}
coef( m1 );
data.frame( YOUTUBE = summary(m2)$r.squared, FACEBOOK = summary(m3)$r.squared);
```
## predicted vs. actual
\FontSmall
```{r fig.width=10, fig.height=4}
predicted.m1 <- m1 %>% predict( marketing );
data.frame(p = predicted.m1, actual = marketing$sales ) %>%
ggplot( aes(x = p, y = actual) ) + geom_point() +
geom_abline(intercept = 0, slope = 1, colour = "red");
```
## performance evaluation
\FontSmall
```{r}
# Model performance
data.frame(
RMSE = RMSE(predicted.m1, marketing$sales ),
R2 = R2(predicted.m1, marketing$sales )
)
```
## get rid of `newspaper`
\FontSmall
```{r}
m4 <- lm( sales ~ youtube + facebook, data = marketing );
anova(m1, m4);
data.frame(
with_newspapers = summary(m1)$r.squared,
without_newspapers = summary(m4)$r.squared
)
```
## relative importance analysis
\FontSmall
```{r message=FALSE, error=FALSE}
library(relaimpo);
calc.relimp( sales ~ youtube + facebook + newspaper, data = marketing );
```
## interactions
**interactions**考虑因素之间的依赖关系或互作关系,比如,在一平台上投放广告会促进另一个平台上广告的效果,因为两个平台的用户可能是重叠的。他们在两个平台都看到广告时,更可能购买产品。
\FontSmall
```{r}
m5 <- lm( sales ~ youtube + facebook + youtube:facebook, data = marketing );
anova(m4, m5);
data.frame(
no_interactions = summary(m4)$r.squared,
with_interactions = summary(m5)$r.squared
)
```
**结论**:两个平台同时投放广告最好;
## interactions, cont.
\FontSmall
```{r}
## m5 <- lm( sales ~ youtube + facebook + youtube:facebook, data = marketing );
## 上面的 m5 可以直接写为:
m6 <- lm( sales ~ youtube*facebook, data = marketing );
summary(m6);
```
## visualize interactions
\FontSmall
```{r fig.width=6, fig.height=3, message=FALSE, warning=FALSE}
## install.packages("interactions"); 如需要,请安装这个包
library(interactions); ## 装入
sim_slopes(m6, pred = youtube, modx = facebook, jnplot = TRUE)
```
## relative importance analysis including interactions
\FontSmall
```{r message=FALSE, error=FALSE}
library(relaimpo);
calc.relimp( sales ~ youtube*facebook, data = marketing );
```
## **assumptions** of `linear regression`
### 重要信息
1. 任何检验都有基本的**假设**
2. 将检验应用于不符合**假设**的数据是统计学最大的滥用
### **assumptions**
1. Linearity: The relationship between X and the mean of Y is linear.
2. Homoscedasticity: The variance of residual is the same for any value of X.
3. Independence: Observations are independent of each other.
4. Normality: For any fixed value of X, Y is normally distributed.
## `glm` vs. `lm`
\FontSmall
lm(formula, data, …)
glm(formula, family=gaussian, data, …)
**glm**:
1. 当`family=gaussian`时,二者是一样的。
```{r warning=FALSE, message=FALSE}
library(texreg);
m.lm <- lm(am ~ disp + hp, data=mtcars);
m.glm <- glm(am ~ disp + hp, data=mtcars);
screenreg(l = list(m.lm, m.glm))
```
## `glm` 还可用于其它类型数据的分析
1. Logistic regression (family=binomial)
预测的结果(Y)是binary的分类,比如`Yes`, `No`,且只能有两个值;
\FontSmall
```{r warning=FALSE, message=FALSE}
dat <- iris %>% filter( Species %in% c("setosa", "virginica") );
bm <- glm( Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = dat, family = binomial );
data.frame( predicted = bm %>% predict( dat, type = "response" ),
original = dat$Species ) %>% sample_n(6) %>% arrange( original );
```
**注意:**
`predict(., type = "response")` 的意义是什么??
## `glm`的`Poisson regression (family=poisson)`
**Poisson regression** is a special type of regression in which the response variable consists of **count data**.
**Asumptions**:
1. The response variable consists of count data.
2. Observations are independent.
3. The mean and variance of the model are equal.
4. The distribution of counts follows a Poisson distribution.
# section 3: Non-linear regression (`nls`)
## 一元 `nls`
什么是 `nls`?to predict **a target variable** using **a non-linear function** consisting of **parameters** and **one or more independent variables**.
**non-linear least squares**
\FontSmall
```{r echo=FALSE, fig.width=10, fig.height=4, error=FALSE, message=FALSE, warning=FALSE}
set.seed(20160227)
x<-seq(0,50,1)
y<-((runif(1,10,20)*x)/(runif(1,0,10)+x))+rnorm(51,0,1)
#for simple models nls find good starting values for the parameters even if it throw a warning
m<-nls(y~a*x/(b+x), start = list(a=0.1, b=0.1));
#get some estimation of goodness of fit
plot(x,y)
lines(x,predict(m),lty=2,col="red",lwd=3)
```
## non-linear least squares
\FontSmall
```{r eval=FALSE}
## 1. generate data
set.seed(20160227)
x<-seq(0,50,1)
y<-((runif(1,10,20)*x)/(runif(1,0,10)+x))+rnorm(51,0,1)
## 2. fit model using nls
m<-nls(y~a*x/(b+x), start = list(a=0.1, b=0.1));
## 3. show how good is the fitting ...
#get some estimation of goodness of fit
plot(x,y)
lines(x,predict(m),lty=2,col="red",lwd=3)
```
1. `nls( equation, data = data, start = ... )`
2. `y~a*x/(b+x)`
## non-linear least squares, using `drc`
\FontSmall
```{r warning=FALSE, message=FALSE, fig.width=10, fig.height=4}
library(drc); ## Analysis of Dose-Response Curves
m13 <- drm( y ~ x, fct = LL.3() ); ## here LL.3() is a least square function
data.frame( x = x, y = y, fitted = predict(m13) ) %>% ggplot( aes(x=x,y=y) ) +
geom_point() + geom_line( aes(x = x, y = fitted), colour = "red", linetype = 2 );
```
## non-linear functions
### Polynomials
- Linear equation
- Quadratic polynomial
### Concave/Convex curves (no inflection)
- Exponential equation
- Asymptotic equation
- Negative exponential equation
- Power curve equation
- Logarithmic equation
- Rectangular hyperbola
### Sygmoidal curves
- Logistic equation
- Gompertz equation
- Log-logistic equation (Hill equation)
- Weibull-type 1
- Weibull-type 2
### Curves with a maximum
- Brain-Cousens equation
## example: `Exponential equation`
`Exponential decay`
`y = a * (exp(k * X) * k)`
\FontSmall
install `aomisc` package:
```{r eval=FALSE}
install.packages("devtools");
devtools::install_github("onofriAndreaPG/aomisc");
```
```{r using-degradation-aomisc, warning=FALSE, error=FALSE, message=FALSE}
library(aomisc);
data(degradation);
m14 <- drm(Conc ~ Time, fct = DRC.expoDecay(),
data = degradation)
summary(m14)
```
## plot `Exponential decay`
\FontSmall
```{r fig.height=4, fig.width=10}
plot(m14, log="");
```
## `Power curve`
`a * (X^(b - 1) * b)`
\FontSmall
```{r message=FALSE, warning=FALSE, message=FALSE}
data(speciesArea)
m15 <- drm(numSpecies ~ Area, fct = DRC.powerCurve(),
data = speciesArea)
summary(m15)
```
**重点**: 1. `DRC.powerCurve` 作为建模的参数, 来自 `library(aomisc)`
## `Power curve`结果作图
\FontSmall
```{r fig.width=10, fig.height=4}
speciesArea %>% mutate( fitted = predict( m15, speciesArea ) ) %>%
ggplot( aes(x= Area, y= numSpecies) ) +
geom_point() + geom_line( aes(x = Area, y = fitted), colour = "red", linetype = 2 );
```
## how good is the model?
\FontSmall
```{r}
R2( speciesArea$numSpecies, predict( m15, speciesArea ) );
## compare with a linear model
m16 <- lm( numSpecies ~ Area, data = speciesArea );
R2( speciesArea$numSpecies, predict( m16, speciesArea ) );
```
## 数据生成函数
`aomisc` 包和 `drc` 包带了非常多数据生成的函数,其使用示例如下:
\FontSmall
**注** `a * (X^(b - 1) * b)` 需要两个参数, a 和 b
```{r fig.width=10, fig.height=4}
plot(powerCurve.fun(1:100, 5, 0.33),
xlab = "X", ylab = "Y", main = "Power curve (b = 0.33)", type = "l")
```
## `logistic function`
\FontSmall
```{r fig.width=10, fig.height=4}
data(beetGrowth)
m17 <- drm(weightFree ~ DAE, fct = L.3(), data = beetGrowth)
plot(m17, log="", main = "Logistic function")
```
## 多元 non-linear regression
1. `mars` : Multivariate Adaptive Regression Splines (多元自适应回归样条)
2. `machine learning`
推荐一本教程: **`Hands-On Machine Learning with R`**
![](images/talk11/homl-cover.jpeg){height="40%"}
## 多元 non-linear regression, cont.
\FontSmall
Multivariate adaptive regression splines (MARS) provide a convenient approach to capture the nonlinear relationships in the data by assessing cutpoints (knots). The procedure assesses each data point for each predictor as a knot and creates a linear regression model with the candidate feature(s).
![](images/talk11/examples-of-multiple-knots-1.png){height="40%"}
## fit a basic `MARS` model, get data ready...
\FontSmall
```{r error=FALSE, message=FALSE}
library(AmesHousing); ## The Ames Iowa Housing Data
ames <- AmesHousing::make_ames();
head(ames);
```
## fit a basic `MARS` model, fit a basic model
\FontSmall
```{r error=FALSE, warning=FALSE, message=FALSE}
library(earth);
mars1 <- earth( Sale_Price ~ ., data = ames );
print(mars1);
```
## `MARS` model, model selection
\FontSmall
```{r fig.width=10, fig.height=5}
plot(mars1, which = 1);
```
**如何解释?**
## `mars` model with interactions
\FontSmall
```{r}
mars2 <- earth(Sale_Price ~ ., data = ames, degree = 2);
summary(mars2) %>% .$coefficients %>% head(10)
```
## Parameter tuning: cross validation
K fold, N times
![](images/talk11/kfold_ntimes.jpeg){height="50%"}
**注** 1. 每个iteration为随机split; 2. 得到 K \* N 个模型;
## cross validation & additional test
![](images/talk11/kfold_ntimes_additional_test.png){height="50%"}
## Parameter tuning
\FontSmall
```{r}
hyper_grid <- expand.grid(
degree = 1:3, ## number of interaction degrees
nprune = seq(2, 100, length.out = 10) %>% floor() ## number of features to select
)
head(hyper_grid);
```
## Parameter tuning, cont.
\FontSmall
```{r message=FALSE, warning=FALSE}
set.seed(123) # for reproducibility
cv_mars <- train(
x = subset(ames, select = -Sale_Price),
y = ames$Sale_Price,
method = "earth",
metric = "RMSE",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = hyper_grid
)
```
**注** 10 折 cross validation
**注** 非常 time consuming ... 最好能使用 parallel computing ...
## Parameter tuning, show results
\FontSmall
```{r fig.width=10, fig.height=3, message=FALSE, warning=FALSE}
cv_mars$bestTune;
cv_mars$results %>%
filter(nprune == cv_mars$bestTune$nprune, degree == cv_mars$bestTune$degree);
ggplot(cv_mars) ## plot
```
**RMSE** the lower the better
## compare with other methods e.g., `lm`
\FontSmall
```{r error=FALSE, message=FALSE, warning=FALSE}
cv_lm <- train(
Sale_Price ~ .,
data = ames,
method = "lm",
trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3)
);
```
```{r}
print(cv_lm);
```
## variable importance plot (VIP)
\FontSmall
```{r error=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.height=4}
library(vip);
p1 <- vip(cv_mars, num_features = 40, geom = "point", value = "gcv") + ggtitle("Generalized cross-validation")
p2 <- vip(cv_mars, num_features = 40, geom = "point", value = "rss") + ggtitle("Residual Sums of Squares")
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
## `mars` : final thoughts
### advantages
1. First, MARS naturally handles mixed types of predictors (quantitative and qualitative).
2. MARS also requires minimal feature engineering (e.g., feature scaling) and performs automated feature selection.
3. Highly correlated predictors do not impede predictive accuracy as much as they do with OLS models.
### shortcomings
1. typically slower to train
2. Also, although correlated predictors do not necessarily impede model performance, they can make model interpretation difficult.
# section 4: 小结及作业!
## 本次小结
### linear regression
- lm vs. glm
- 一元
- 多元
- 相关函数
- performance evaluation
- interactions
- visualizations
### non-linear regression
- nls
- mars
- earth
- cross validation
- K fold, N times (下次详细讲)
## 下次预告
- Random Forest
- Support Vector Machine
- Deep learning
## 作业
- `Exercises and homework` 目录下 `talk11-homework.Rmd` 文件;
- 完成时间:见钉群的要求
### important
- all codes are available at Github: <https://github.com/evolgeniusteam/R-for-bioinformatics>