talk05.Rmd

---
title: "R for bioinformatics, data wrangler, part 1"
subtitle: "HUST Bioinformatics course series"
author: "Wei-Hua Chen (CC BY-NC 4.0)"
institute: "HUST, China"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  beamer_presentation:
    theme: AnnArbor
    colortheme: beaver
    fonttheme: structurebold
    highlight: tango
    includes:
      in_header: mystyle.sty
---

```{r include=FALSE}
color_block = function(color) {
  function(x, options) sprintf('\\color{%s}\\begin{verbatim}%s\\end{verbatim}',
                               color, x)
}

## 将错误信息用红色字体显示
knitr::knit_hooks$set(error = color_block('red'))
```

# section 1: TOC

## 前情提要

1.  IO, project management, working environment management
2.  factors: R 中最重要的概念之一

-   factors 基本概念
-   factors 操作
-   factors 在做图中的使用
-   ggplot2 和 dplyr 初步


## 今次提要

-   pipe
-   dplyr 、 tidyr (超级强大的数据处理) part 1

# section 2: pipe

## 什么是 pipe ？

-   pipe 就是 `%>%`
-   it comes from the `magrittr` package by **Stefan Milton Bache**
-   Packages in the tidyverse load %\>% for you automatically, so you don't usually load magrittr explicitly.
-   实质是中间值的传递

示例：

\FontSmall

```{r message=FALSE, warning=FALSE}
## 比如：这段代码可以合并为：
library(tidyverse); ## 装入包
library(magrittr);
a <- subset( swiss, Fertility > 20 );
cor.test(a$Fertility, a$Education);

```

## pipe 版本

\FontSmall

```{r}
## -- 新代码 ... 
swiss %>%
  subset(., Fertility > 20) %$%
  cor.test( Education , Fertility );
```

## 是否所有函数都支持 pipe ？

是的。

通常需要用 `.` 指代传递来的数据，并以参数的形式赋予下游函数:

\FontSmall

```{r}
swiss %>% do( head(.,  n = 4 ) );

## 也可以写为
swiss %>% head(.,  n = 4 );
```

## 其它形式的 pipe

`%T>%` : 返回上游的值 （？？？）

\FontSmall

```{r fig.height=4, fig.width=10, message=FALSE, warning=FALSE}
## 示例：res1 是空值 ... 
res1 <- 
  rnorm(100) %>%
    matrix(ncol = 2) %>%
    plot();
```

## %T\>%: 返回上游值 (left-side values)

\FontSmall

```{r fig.height=4, fig.width=10}
## 示例：res2 是 matrix() 内容 ... 
res2 <- 
  rnorm(100) %>%
    matrix(ncol = 2) %T>%
    plot();
```

## %T\>%: 返回上游值 (left-side values), cont.

\FontSmall

```{r}
head(res2);
```

## %\$% ： attach ???

\FontSmall

```{r}
attach( mtcars ); ## note the warning message ... 
cor.test( cyl, mpg ); ## 汽缸数与燃油效率
```

## %\$% ： attach ??? , cont.

\FontSmall

```{r}
detach( mtcars );
with( mtcars, cor.test( cyl, mpg ) );
```

## %\$% ： attach ??? , cont.

\FontSmall

```{r}
mtcars %$% 
  cor.test( cyl, mpg );
```

## 其它 pipe 及注意事项

\FontSmall

```{r}
## 双向 pipe 
mtcars %<>% transform(cyl = cyl * 2);
```

**注**

-   pipe 的使用可以使思路更清晰
-   因此，尽量使用 `%>%` （方向明确），而不使用其它方向不明确的 pipe


# section 3: data wrangler - dplyr

## dplyr

### what is `dplyr` ?

-   the next iteration of plyr,
-   focusing on only data frames (also tibble),
-   row-based manipulation,
-   dplyr is faster and has a more consistent API.

![dplyr logo](images/talk05/dplyr-large.png){height="50%"}

### more to read

-   [dplyr offical page at tidyverse](https://dplyr.tidyverse.org)
-   [R for data science](https://r4ds.had.co.nz)

## dplyr, overview

dplyr provides a consistent set of verbs that help you **solve the most common data manipulation challenges**:

-   [select()](https://dplyr.tidyverse.org/reference/select.html) 选择列，根据列名规则
-   [filter()](https://dplyr.tidyverse.org/reference/filter.html) 按规则过滤行
-   [mutate()](https://dplyr.tidyverse.org/reference/mutate.html) 增加新列，从其它列计算而得 （不改变行数）
-   [summarise()](https://dplyr.tidyverse.org/reference/summarise.html) 将多个值转换为单个值（通过 mean, median, sd 等操作），生成新列 （总行数减少，通常与 `group_by`配合使用 ）
-   [arrange()](https://dplyr.tidyverse.org/reference/arrange.html) 对行进行排序

## dplyr 安装

\FontSmall

```{r eval=FALSE}
# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dplyr:
install.packages("dplyr")
```

\FontNormal

Development version

\FontSmall

```{r eval=FALSE, warning=FALSE, message=FALSE}
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")
```

\FontNormal

[Get the cheatsheet at here](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)

## an example of `dplyr`

get the data ready

```{r echo=FALSE, warning=FALSE, message=FALSE}
library(tidyverse); 
```

\FontSmall

```{r}
mouse.tibble <- read_delim( file = "data/talk04/mouse_genes_biomart_sep2018.txt", 
                            delim = "\t", quote = "" );
```

## 查看 mouse.tibble 的内容

\FontSmall

```{r}
( ttype.stats <- mouse.tibble %>% count( `Transcript type` ) %>% arrange(-n) );
```


## 查看 mouse.tibble 的内容, cont.

\FontSmall

```{r}
( chr.stats <- mouse.tibble %>% count( `Chromosome/scaffold name` ) %>% arrange(-n) );
```

## 分析任务

1.  将染色体限制在常染色体和XY上（去掉未组装的小片段） ; 处理行
2.  将基因类型限制在 protein_coding, miRNA和 lincRNA 这三种；处理行
3.  统计每条染色体上不同类型基因（protein_coding, miRNA, lincRNA）的数量
4.  按染色体（正）、基因数量（倒）进行排序

## 用 `dplyr` 实现

\FontSmall

```{r}
dat <- mouse.tibble %>% 
  ## 1. 
  
  filter( `Chromosome/scaffold name` %in% c( 1:19, "X", "Y" )   ) %>% 
  
  ## 2. 
  filter( `Transcript type` %in% c( "protein_coding", "miRNA", "lincRNA" ) ) %>%
  
  ## change column name ... 
  select( CHR = `Chromosome/scaffold name`, TYPE = `Transcript type`, 
          GENE_ID = `Gene stable ID`, 
          GENE_LEN =  `Transcript length (including UTRs and CDS)`  ) %>%
  
  ## 3. 
  group_by( CHR, TYPE ) %>% 
  summarise( count = n_distinct( GENE_ID ), mean_len = mean( GENE_LEN ) ) %>% 
  
  ## 4. 
  arrange(  CHR  , desc( count ) );
```

## 检查运行结果

\FontSmall

```{r echo=FALSE}
knitr::kable( head( dat, n = 15 ) );
```


## dplyr中其它取行的操作

![dplyr与行相关的操作](images/talk05/dplyr-subset.png){height="70%"}


## 课堂练习：使用常用函数解决问题

先创建一个新 tibble

\FontSmall

```{r}
grades <- tibble( "Name" = c("Weihua Chen", "Mm Hu", "John Doe", "Jane Doe",
                             "Warren Buffet", "Elon Musk", "Jack Ma"),
                  "Occupation" = c("Teacher", "Student", "Teacher", "Student", 
                                   rep( "Entrepreneur", 3 ) ),
                  "English" = sample( 60:100, 7 ),
                  "ComputerScience" = sample(80:90, 7),
                  "Biology" = sample( 50:100, 7),
                  "Bioinformatics" = sample( 40:90, 7)
                  );
grades;
```

## use gather & dplyr functions

Question: 1. 每个人平均成绩是多少？ 2. 哪个人的平均成绩最高？

\FontSmall

```{r}
grades.melted <- grades %>% 
  gather( course, grade, -Name, -Occupation, na.rm = T );

## 检查数据 ... 
knitr::kable( head(grades.melted) );
```

## 成绩分析，cont

\FontSmall

```{r}
grades.melted %>% 
  group_by(Name, Occupation) %>% 
  summarise( avg_grades = mean( grade ), courses_count = n() ) %>% 
  arrange( -avg_grades );

## 显示最终结果
knitr::kable( head( grades.melted ) );
```

## use gather & dplyr functions

问题： 每个人的最强科目是什么？？

\FontSmall

```{r}

## 步骤1： 排序：
grades.melted2 <- 
  grades.melted %>% 
  arrange( Name, -grade );

knitr::kable( head(grades.melted2) );
```

## 最强科目问题，cont.

\FontSmall

```{r}
grades.melted2 %>% 
  group_by(Name) %>% 
  summarise( best_course = first( course ),
             best_grade = first( grade ),
             avg_grades = mean( grade ) ) %>% 
  arrange( -avg_grades );
```

## dplyr::summarise 的其它操作

![dplyr::summarise 可用的操作](images/talk05/dplyr-summarise.png){height="60%"}

## 练习考察

问题1： 每个人的最**差**科目是什么？？

## ** 列的使用！ **，以 `starwars` tibble 为例

\FontSmall

```{r}
head(starwars);
```

**note** 包含87行 13 列，星战部分人物的信息，包括身高、体重、肤色等

用 `?starwars` 获取更多帮助

## dplyr::mutate - 产生新列，不改变行数

![dplyr::mutate](images/talk05/dplyr-mutate.png){height="50%"}

另见下页的例子

## dplyr::select - 取列

目标：

-   取出相关列，用于计算人物的 BMI

\FontSmall

```{r}
stats <- 
  starwars %>% 
  select( name, height, mass ) %>%
  mutate( bmi = mass / ( (height / 100 ) ^ 2 ) ) ;

head(stats);
```

## dplyr::select - 取列, cont.

由于 name, height 和 mass 正好是相邻列，可以用 name:mass 获取：

\FontSmall

```{r}
stats <- 
  starwars %>% 
  select( name:mass ) %>%
  mutate( bmi = mass / ( (height / 100 ) ^ 2 ) ) ;

head(stats);
```

## dplyr::select - 取列, cont.

获取与颜色相关的列: hair_color, skin_color, eye_color

\FontSmall

```{r}
stats2 <- starwars %>% 
  select( name, ends_with("color") );

head(stats2);
```

## dplyr::select - 去除列, cont.

请自行检查以下操作的结果

\FontSmall

```{r eval=FALSE}
head( starwars %>% select( -hair_color, -eye_color )  );
```

## dplyr::select - 其它操作, cont.

![dplyr::select 支持的操作](images/talk05/dplyr-select.png){height="70%"}

## 同时对行列进行操作

任务：从星战中挑选**金发碧眼**的人物

\FontSmall

```{r}
starwars %>% select( name, ends_with("color"), gender, species ) %>% 
  filter( hair_color == "blond" & eye_color == "blue" );
```

## 练习考察

问题2： 从 `starwars` 中选出 18 < bmi < 25  的人物，统计他们的 `homeworld` 分布情况

提示： 1. 需计算 bmi； 2. 用 `count` 函数计算 `homeworld` 的分布

# section 4 : 练习与作业

## 练习 & 作业

-   `Exercises and homework` 目录下 `talk05-homework.Rmd` 文件；

-   完成时间：见钉群的要求

## 小结

### 今次提要

-   pipe
-   dplyr 、 tidyr (超级强大的数据处理) part 1

### 下次预告

-   长宽数据转换
-   dplyr, tidyr 和 forcats 的更多功能与生信操作实例

### important

-   all codes are available at Github: <https://github.com/evolgeniusteam/R-for-bioinformatics>