talk04.Rmd

---
title: 'R language basics, part 3: factor'
subtitle: "HUST Bioinformatics course series"
author: "Wei-Hua Chen (CC BY-NC 4.0)"
institute: "HUST, China"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  beamer_presentation:
    theme: AnnArbor
    colortheme: beaver
    fonttheme: structurebold
    highlight: tango
    includes:
      in_header: mystyle.sty
---

```{r include=FALSE}
color_block = function(color) {
  function(x, options) sprintf('\\color{%s}\\begin{verbatim}%s\\end{verbatim}',
                               color, x)
}

## 将错误信息用红色字体显示
knitr::knit_hooks$set(error = color_block('red'))
```

# section 1: TOC

## 前情提要

### data frame and tibble

* declaration & usage
* manipulation （更多相关内容会在介绍 ```dplyr```  时讲到）
* differences between data.frame and tibble
* advantages of using tibble (更多内容以后会介绍)
* ```with```, ```within```，```attach```, ```detach``` 等的用法

### IO

* read from files of different formats
* write to files
* use GUI to read files (& get the corresponding code)

## 今次预报

1. IO, project management, working environment management
2. factors: R 中最重要的概念之一 
3. exercises

# section 2: IO and working enviroment management

## R session 的概念

每个R session 是一个单独的工作空间（work space），包含各自的数据、变量和操作历史。

![two R sessions](images/talk04/r_sessions.png){height=40%}

## R session in RStudio

Each RStudio session is automatically associated with a R session

![R session in RStudio](images/talk04/r_session_in_rstudio.png){heigh=40%}

## start a new RStudio session by creating a new project

1. 右上角的 Project 按钮，在弹出菜单里选 New Project … 

![create new project, step 1](images/talk04/rstudio_create_new_project.png){height=40%}

## create a new project, cont.

2. Select: New directory -> New Project in the popup window 

![create new project, step 2](images/talk04/rstudio_create_project_2.png){height=40%}

## create a new project, cont.

3. Enter a new directory name, choose its mother directory ... 

![create new project, step 3](images/talk04/rstudio_create_new_project_3.png){height=40%}

## 现场演示

演示~~

## working space

当前工作空间，包括所有已装入的数据、包和自制函数

可通过以下代码管理变量

\FontSmall

```{r eval=TRUE}
ls();  ## 显示当前环境下所有变量
rm( x ); ## 删除一个变量
ls(); 

##rm(list=ls()); ## 删除当前环境下所有变量！！！ 
```

## variables in working space in RStudio

在RStudio右上角的"Environment"窗口显示了所有当前工作间的变量

![RStudio enviroment window](images/talk04/rstudio_enviroment_window.png){height=40%}

## save and restore work space

\FontSmall
```{r eval=FALSE}
## -- save all loaded variables into an external .RData file
save.image( file = "prj_r_for_bioinformatics_aug3_2019.RData" );

## -- restore ( load ) saved work space 
load( file = "prj_r_for_bioinformatics_aug3_2019.RData" );
```

\FontNormal

### Notes

* existing variables will be kept, however, those will the same names will be replaced by loaded variables
* please consider using ```rm( list=ls() )``` to remove all existing variables to have a clean start
* you may need to reload all the packages


## save selected variables 

Sometimes you need to transfer processed data to a collaborator ... 

\FontSmall
```{r eval=FALSE}
## save selected variables to external 
save(city, country, file="1.RData"); ## you can specify directory name

## --
load( "1.RData" );
```

\FontNormal

## close and (re)open a project

close a project is easy:

![Two ways of closing a project](images/talk04/Rstudio_close_a_project.png){height=40%}

however ... 

## 退出 projects 时的一些选项 （RStudio）

![Project options](images/talk04/rstudio_project_options.png){height=30%}

### notes

* 退出时保存
* 打开时装入
* 但数据较大时，装入时间可能过长 ... 

## open a project

![Open a project](images/talk04/rstudio_open_a_project.png){height=40%}

演示项目的不同打开姿式（1-5）。

## 练习

* 创建一个项目
* 定义一些变量
* 从外部文件装入一些数据
* 保存workspace到 <file_name>.RData 
* 退出 project
* 重新打开 project 并恢复 workspace

# section 3: factors

## 什么是factors？

Factor is a data structure used for fields that takes only predefined, finite number of values (categorical data).

Facor 用于限制某个字段（列），只允许其接受某些值

\FontSmall
```{r}
x <- c("single", "married", "married", "single");
str(x);
```

```{r}
## create factor as it is ... 
x <- as.factor(x);

## please note the change in the displayed values ... 
str(x);
```


```{r}
## create factor from scratch ... 
x <- factor( c( "single", "married", "married", "single" ) );
str(x);
```

## factors, cont. 

Factors 会限制输入数据的选择范围

\FontSmall
```{r}
str(x);
x[ length(x) + 1 ] <- "widowed";
```

```{r}
x;
```

Use ```levels()``` function to add new factors 
```{r}
levels(x) <- c(levels(x), "widowed");
x[ length(x) + 1 ] <- "widowed";
str(x);
```

## factors, cont. 

Play around with ```levels()```:

\FontSmall
```{r}
## other ways of assigning factors ... 
y <-  as.factor( c( "single", "married", "married", "single" ) );
levels( y );
levels(y) <- c("single", "married", "widowed");
str(y);
```

```{r eval=FALSE}
## 这个代码现在就没有问题了
y[ length(y) + 1 ] <- "widowed";
```

\FontNormal

** 注意 ** 用 `as.factor` 创建 factor 时，得到的 levels 按字母表排列；

但是，用 `levels( y )` 方式指定 levels 时，则按照指定的顺序；

## `levels`的顺序决定了排序的顺序

\FontSmall
```{r}
##
y <-  as.factor( c( "single", "married", "married", "single" ) );
levels(y);
sort(y);

## 
y2 <- y;
levels(y2) <- c("single", "married", "widowed");
sort(y2);
```

## sort data in a meaningful way ... 

\FontSmall

```{r}
## Month
x1 <- c("Dec", "Apr", "Jan", "Mar");
sort(x1);

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

y1 <- factor(x1, levels = month_levels)
sort(y1);
```

## 以出现的顺序为 `factor`

\FontSmall

```{r}
## Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.
f1 <- factor(x1, levels = unique(x1));
f1;

library(forcats); ## just to make sure the codes will run smoothly ... 
## you can also use fct_inorder in the forcats package ...
f2 <- x1 %>% factor() %>% fct_inorder()
f2
```

## use factor to clean data

\FontSmall

```{r}
## 假设我有一组性别数据，其写法非常不规整；
gender <- c("f", "m ", "male ","male", "female", "FEMALE", "Male", "f", "m");

## 要求：都改为 Female, Male
gender <- as_factor( gender );
fct_count( gender );

gender <- fct_collapse(
  gender,
  Female = c("f", "female", "FEMALE"),
  Male   = c("m ", "m", "male ", "male", "Male")
)
fct_count(gender)
```

## or use `fct_relabel`

\FontSmall

```{r}
gender <- c("f", "m ", "male ","male", "female", "FEMALE", "Male", "f", "m")
gender <- as_factor(gender)
gender <- fct_relabel(gender, ~ ifelse(tolower(substring(., 1, 1)) == "f", "Female", "Male"))

fct_count(gender)
```

## factor 在做图中的应用（真正精髓）

\FontSmall
```{r fig.height=2, fig.width=6}
## 一项 mock 调查结果数据
( responses <- factor( c("Agree", "Agree", "Strongly Agree", "Disagree", 
                         "Disagree", "Agree") ) );

## -- plot the results --
library(ggplot2);
barplot <-
  ggplot( data = data.frame( res = responses ), aes( x = res )) +
  geom_bar();
```

## factor 在做图中的应用, cont. 

```{r fig.height=3, fig.width=8, echo=FALSE}
barplot;
```

\FontNormal

默认情况下， factor 按字母表排序： Agree -> Disagree -> Strong Agree 。ggplot2 也会按factor的排序作图


## 调整 factor 以调整画图顺序

\FontSmall
```{r fig.heigh=2, fig.width=8}
res <- data.frame( res=responses );
## -- 按照同意程度从强->弱 排序
res$res <- factor( res$res, levels = c( "Strongly Agree", "Agree",  "Disagree" ), ordered = T );
str(res);

plot2 <- 
  ggplot( data = res, aes( x = res )) +
  geom_bar() +
  xlab( "Response" ) + ylab("Count");
```

## 调整 factor 以调整画图顺序， cont.

```{r fig.height=3, fig.width=8, echo=FALSE}
plot2;
```


** 练习 ** 按意程度从 弱->强 排序并作图！！ 

## ordered factor 

通过 ordered 参数，让用户知道 factors 是经过精心排序的 

\FontSmall

```{r}
( responses <- factor( c("Agree", "Agree", "Strongly Agree", "Disagree", "Disagree", "Agree"), levels = c( "Disagree", "Agree", "Strongly Agree"), ordered = T ) );
is.ordered( responses );
```

## 通过 factor 改变 值

使用 ```dplyr``` 包的 ```recode()```函数改变 value 

\FontSmall

```{r}
( x <- factor( c( "alpha", "beta", "gamma", "theta", "beta", "alpha" ) ) );

## --
library( dplyr );
x <- recode( x, "alpha" = "one", "beta" = "two" );
str(x);
```

## 去除不用的 levels 

？什么时候会用到： 

\FontSmall

```{r}
mouse.genes <- read.delim( file = "data/talk04/mouse_genes_biomart_sep2018.txt", 
                           sep = "\t", header = T, stringsAsFactors = T );

str(mouse.genes);
```

## 去除不用的 levels, cont.

\FontSmall

```{r fig.width=10, fig.height=4}
mouse.chr_10_12 <- subset( mouse.genes,  Chromosome.scaffold.name %in% c( "10", "11", "12" ) );
## plot length distribution --

boxplot( Transcript.length..including.UTRs.and.CDS. ~ Chromosome.scaffold.name, 
         data = mouse.chr_10_12, las = 2 );
```

\FontNormal

```subset()``` 无法去除不用的 factors ... 

## 去除不用的 levels, cont.

\FontSmall

```{r fig.width=10, fig.height=4}
mouse.chr_10_12$Chromosome.scaffold.name <- 
  droplevels( mouse.chr_10_12$Chromosome.scaffold.name );

levels( mouse.chr_10_12$Chromosome.scaffold.name );

## 再次 plot ... 
boxplot( Transcript.length..including.UTRs.and.CDS. ~ Chromosome.scaffold.name, 
         data = mouse.chr_10_12, las = 2 );
```

## 也可以使用 tibble , 完全不用担心 factor 的问题 ... 

\FontSmall

```{r fig.width=10, fig.height=5}
library( readr );
mouse.tibble <- read_delim( file = "data/talk04/mouse_genes_biomart_sep2018.txt", 
                            delim = "\t", quote = "" )

mouse.tibble.chr10_12 <- 
  mouse.tibble %>% filter( `Chromosome/scaffold name` %in%  c( "10", "11", "12" ) );

plot3 <- 
  ggplot( data = mouse.tibble.chr10_12, 
        aes( x = `Chromosome/scaffold name`, 
             y = `Transcript length (including UTRs and CDS)` ) ) +
  geom_boxplot() + 
  coord_flip() + 
  ylim( 0, 2500 ) ;

## do not use ylim, but remove outliers

p1 <- 
  ggplot( data = mouse.tibble.chr10_12, 
        aes( x = `Chromosome/scaffold name`, 
             y = `Transcript length (including UTRs and CDS)` ) ) +
  geom_boxplot() +
  coord_flip(ylim = c(0, 5000), clip = "on");
```


## 用 tibble 解决 factor 的问题 , cont. 

```{r fig.height=4, fig.width=8, echo=FALSE}
plot3;
```

## 按基因长度 中值 从 大 -> 小 排序

\FontSmall 

```{r fig.width=10, fig.height=5}
plot4 <- 
  ggplot( data = mouse.tibble.chr10_12, 
        aes( x = reorder( `Chromosome/scaffold name`, 
                          `Transcript length (including UTRs and CDS)`, 
                          median ), 
             y = `Transcript length (including UTRs and CDS)` ) ) +
  geom_boxplot() + 
  coord_flip() + 
  ylim( 0, 2500 ) ;
```

\FontNormal 

```reorder( vector_with_factor, numeric_value , FUN = mean  )``` 的用法

## 按基因长度 中值 从 大 -> 小 排序， cont. 

```{r fig.height=3, fig.width=8, echo=FALSE}
plot4;
```

\FontNormal

** 注意 ** ```reorder( `Chromosome/scaffold name`, - `Transcript length (including UTRs and CDS)`, median ) ``` 的作用


## 按基因长度 中值 从 大 -> 小 排序， cont. 

** 问题 ** 

1. 如果要按 小 -> 大 的顺序排序呢？ ( ```reorder( `Chromosome/scaffold name`, - `Transcript length (including UTRs and CDS)`, median )  )
2. ```reorder``` 的作用是什么？？ 只在 ggplot2 里有用吗？？ 

## use `forcats::fct_reorder` to reorder factors 

\FontSmall
```{r fig.width=10, fig.height=4}
  ggplot( data = mouse.tibble.chr10_12,
        aes( x = fct_reorder( `Chromosome/scaffold name`, 
                          `Transcript length (including UTRs and CDS)`, 
                          median ), 
             y = `Transcript length (including UTRs and CDS)` ) ) +
  geom_boxplot() + 
  coord_flip() + 
  ylim( 0, 2500 ) ;
```


## play around with `gss_cat`: General Social Survey

先看一下数据：

\FontSmall

```{r echo=FALSE}
head( gss_cat, n = 15 );
```

## tv hours vs. religion 

\FontSmall

```{r fig.width=10, fig.height=4}
relig_summary <- gss_cat %>% group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) + geom_point()
```


# section 4: 练习 & 作业

## 练习 & 作业
-   ```Exercises and homework``` 目录下 ```talk04-homework.Rmd``` 文件；

- 完成时间：见钉群的要求

## 小结

### 今次提要

* IO, project management, working environment management
* factor ：R 另一个超级重要且难以上手的概念
  * 定义
  * 操作
  * 使用
* 基础和进阶绘图（配合 factor 讲解）

### 下次预告

* data-wrangler: `dplyr`

### important
* all codes are available at Github: https://github.com/evolgeniusteam/R-for-bioinformatics