Add initial content

ehumph committed Apr 7, 2022
1 parent dd837dd commit 33c8df4
Showing 6 changed files with 197 additions and 10 deletions.
28 changes: 19 additions & 9 deletions 01-intro.Rmd
@@ -1,18 +1,28 @@
# Overview

```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```
Being able to understand and apply dimensional reduction techniques is vital for working with the large, complex datasets that are standard in bioinformatic analysis. This lesson offers an in-depth exploration of common dimensional reduction techniques as implemented in R. It can be used on its own or as a supplement to other AnVIL lessons. During this lesson, students will work hands-on with publicly available data in the RStudio environment on the [AnVIL](https://anvilproject.org/) cloud computing resource to apply methods of dimensional reduction, visualize the results, and choose the appropriate number of dimensions for use in downstream analyses.

# Introduction

## Activity Context

## Motivation
**Course Audience**

Graduate Students in Biological Sciences or related fields

**Course Prerequisites**

A layman's understanding of genetics (DNA, genes, trait inheritance)
Some previous exposure to the central dogma of molecular biology

**Class Type**

Lab
Computer-based

**Class Size**

1-50

## Target Audience

The course is intended for ...

## Curriculum

The course covers...
3 changes: 3 additions & 0 deletions 02-learning-objectives.Rmd
@@ -0,0 +1,3 @@
## Learning Objectives

Coming soon!
4 changes: 4 additions & 0 deletions 03-lesson-plan.Rmd
@@ -0,0 +1,4 @@

# Lesson Plan

Coming soon!
9 changes: 9 additions & 0 deletions 04-overview.Rmd
@@ -0,0 +1,9 @@
# Overview

Biological data, especially molecular data, is inherently high-dimensional. For example, expression matrices can have data for tens of thousands of genes, and each gene represents a different dimension of the data. The sample sizes necessary to fit statistical models to such high-dimensional datasets are astronomically large and impractical for most modern studies. High-dimensional data also greatly increases the computational resources and time necessary for any bioinformatic analysis.


# What is dimensionality reduction?

Dimensionality reduction methods allow scientists to get around these issues. They are techniques that reduce the number of input variables in a dataset. These methods are possible for biological data because so many of the variables are correlated. If we go back to our example using expression data: while each gene represents a different dimension, genes involved in similar metabolic pathways will be upregulated or downregulated together, and thus those dimensions are not independent. Dimensional reduction methods can identify these correlations and combine them in such a way as to reduce the number of data dimensions to a workable number while minimizing the information loss. As an added benefit, combining correlated dimensions can result in less noisy data, since the data are averaged across multiple dimensions to provide a better representation of patterns. Dimensional reduction also makes it easier for scientists to visualize data in ways that are comprehensible to the human brain.
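To make this concrete, here is a minimal sketch in R (ahead of the PCA chapter that follows, and not part of the lesson's exercises) that simulates two co-regulated "genes" and shows that a single combined dimension captures almost all of their variation; the variable names are invented for illustration.

```{r}
# Minimal sketch: two simulated, co-regulated "genes" (names are illustrative)
set.seed(7)
pathway_activity <- rnorm(50)                    # shared underlying signal
geneA <- pathway_activity + rnorm(50, sd = 0.1)  # gene A tracks the pathway
geneB <- pathway_activity + rnorm(50, sd = 0.1)  # gene B tracks it too
cor(geneA, geneB)                                # the two dimensions are highly correlated

# A single principal component captures nearly all of the variation
summary(prcomp(cbind(geneA, geneB), scale. = TRUE))
```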

158 changes: 158 additions & 0 deletions 05-student-guide.Rmd
@@ -0,0 +1,158 @@
```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```

# Principal Components Analysis

This chapter contains the student instructions for the Principal Components Analysis activity.

## Before You Start

Make sure you can start an RStudio session on AnVIL. Take a look at our guide [here](https://jhudatascience.org/GDSCN_Book_Statistics_for_Genomics_Differential_Expression/getting-set-up.html#introduction).

### Libraries

We will be using the prcomp() function, which comes with base R, as well as the [factoextra package](https://www.rdocumentation.org/packages/factoextra/versions/1.0.7) for visualization. You can install the factoextra package using the install.packages() command.

```{r}
# Install factoextra first if needed: install.packages("factoextra")
library(factoextra)
```

## Introduction

Principal components analysis (PCA) is one of the oldest and most commonly used dimensional reduction techniques. PCA is an unsupervised machine learning algorithm that combines correlated dimensions into a single new variable. This new variable represents an axis or line in the dataset that describes the maximum amount of variation and becomes the first principal component (PC). After the first PC is created, the algorithm then identifies a new axis that is orthogonal (statistically independent) to the first PC while also describing the maximum amount of previously undescribed variation.

This process continues, with each PC capturing a decreasing amount of variation, until the maximum number of PCs is reached (the maximum number of PCs equals the total number of variables in the original dataset).

By definition, the first PCs will capture most of the heterogeneity (and thus most of the signal) in our high-dimensional dataset, while the variation in later PCs will mostly represent noise. Thus, we can use the first PCs both to examine structure in our dataset and in downstream analyses to reduce computational work and improve the fit of our statistical models.


::: {.fyi}
GOING DEEPER: The math behind PCA

PCA is a method of matrix factorization. Essentially, the PCA approach allows a researcher to break down (or decompose) a large data matrix into smaller constituent parts (similar to the way we can break the number 30 into the factors 5 and 6). Each smaller matrix can be summarized by a vector, called the eigenvector. The variance of the eigenvector is called the eigenvalue. For a large matrix (like most biological datasets), there are many possible ways to decompose the matrix. In our example earlier with the number 30, we could choose to start with the number 5 or the number 6. With PCA, we always start with the eigenvector that has the largest eigenvalue. This eigenvector is what we call the first PC. What is left of the original data matrix is then decomposed further into smaller and smaller pieces, with each subsequent eigenvector chosen because it has the largest possible eigenvalue.

The PC variables created by this method are a linear combination of the original variables.

:::
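If you want to see this matrix-factorization view in action, here is a minimal sketch (not part of the exercises below) that eigen-decomposes the covariance matrix of a small simulated dataset and checks that prcomp() reports the same eigenvalues; the variable names are invented for illustration.

```{r}
# Minimal sketch of the eigen-decomposition behind PCA (simulated data)
set.seed(42)
toy <- cbind(v1 = rnorm(20), v2 = rnorm(20), v3 = rnorm(20))
toy_scaled <- scale(toy)        # center and scale each column

eig <- eigen(cov(toy_scaled))   # eigen-decompose the covariance matrix
eig$values                      # eigenvalues = variance along each eigenvector

# prcomp() reports the same quantities: the squared standard deviations of
# the PCs match the eigenvalues (eigenvectors may differ by a sign flip)
prcomp(toy_scaled)$sdev^2
```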


## Exercise One: Calculating PCs

There are many packages available for R and RStudio that can perform PCA. These packages will calculate PCA in similar ways, although some of their default parameters (such as how many PCs the algorithm will calculate and store) may be different. Be sure to read the manuals for any package you choose to use.

For this example, we will use the iris dataset that is preloaded into R. The statistician R. A. Fisher popularized this dataset for statistical work, although the original data were collected by Edgar Anderson. The dataset contains 4 measurements (petal length, petal width, sepal length, and sepal width) for 150 individuals from 3 different iris species.

Let's start by looking at the first few rows of the dataset.

```{r}
head(iris)
```

Each row represents a single individual flower. The first four columns are _active variables_, or the variables we will use for the principal components analysis. The fifth column is the species of each individual ( _Iris setosa_, _Iris virginica_, or _Iris versicolor_). This is a _supplementary variable_, or additional information about each sample that we will not use in the PCA.

The actual PCA calculation is quite simple. We specify that the calculations are done only on the first four columns of the iris dataset, and we also scale each variable so that the mean is 0 and the standard deviation is 1.

```{r}
# Run PCA on the first four columns, scaling each variable
pca.obj <- prcomp(iris[, 1:4], scale. = TRUE)
str(pca.obj)
```

The prcomp() command creates a list containing multiple matrices and vectors:

* _sdev_, a vector of the standard deviations of each PC
* _rotation_, a matrix of the eigenvectors (loadings) for each variable
* _center_, a vector of the centering value for each variable
* _scale_, a vector of the scaling factor for each variable
* _x_, a matrix of the individual PC values for each flower
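
As a quick check, here is a small sketch peeking at two of these components (it assumes the pca.obj object created above).

```{r}
head(pca.obj$x)    # PC coordinates of the first few flowers
pca.obj$rotation   # loadings of each original variable on each PC
```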


## Exercise Two: Choosing the number of PCs

We have now calculated our PCs. In this dataset, we have a limited number of PCs and could choose to use all of them in downstream analysis, but this becomes impractical with high-dimensional data. The majority of the variation in the dataset will be captured by the first few PCs, so common practice is to identify a cutoff for how much variation a PC must explain in order to be included. If the data have been scaled and centered, it is also common to include only those PCs with an eigenvalue greater than or equal to 1.

For the iris dataset, we can look at this in either table format or using a scree plot. Let's take a look at a table of the eigenvalues and variation captured by each PC first. We'll use a command from the factoextra package.

```{r}
get_eigenvalue(pca.obj)
```

This table tells us the vast majority of the variation is described by just the first two PCs.

This information can also be calculated from the sdev vector of the pca.obj list, so you don't have to use an additional library like factoextra. Instead, you can convert the standard deviations to variances and express each as a proportion of the total.

```{r}
pca.obj$sdev^2 / sum(pca.obj$sdev^2)
```
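
If you would rather apply the eigenvalue cutoff mentioned above programmatically, a minimal sketch (using the pca.obj object created earlier) looks like this.

```{r}
# For scaled data, the eigenvalues are the squared standard deviations of the PCs
eigenvalues <- pca.obj$sdev^2
sum(eigenvalues >= 1)   # how many PCs pass the eigenvalue >= 1 rule of thumb
```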

It may be easier to examine these data visually, which we can do using a scree plot (Jolliffe 2002; Peres-Neto, Jackson, and Somers 2005).

```{r}
fviz_eig(pca.obj, addlabels = TRUE)
```

The PCs are on the x-axis, and the y-axis is the percentage of variance explained by each PC. As you can see, the percentage of explained variance drops off dramatically after the first two PCs.

Going forward, we might choose to use these two variables in our analyses.

## Exercise Three: Correlation between variables and PCs

At some point, we might be interested in knowing which variables are contributing to each PC. Let's first look at this in table form. We first extract the results from our PCA output.

```{r}
var <- get_pca_var(pca.obj)
var
```

The get_pca_var() function creates a list of matrices with useful information about the PCA. For now, we're interested in the contribution matrix.

```{r}
var$contrib
```

This table tells us what percentage of each PC is contributed by each variable. A higher number (or percentage) means a greater contribution by that variable.
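
As a quick way to read this table programmatically, here is a small sketch (using the var object created above) that pulls out the variable contributing most to each PC; note that the contributions within each column sum to 100.

```{r}
top <- apply(var$contrib, 2, which.max)   # row index of the largest contribution per PC
rownames(var$contrib)[top]                # name of the top-contributing variable for each PC
```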


We can also visualize this information using a correlation circle.

```{r}
fviz_pca_var(pca.obj, col.var = "black", repel=T)
```

The size and position of the vectors (arrows) in the correlation circle can tell us quite a lot about the variables.

* Positively correlated variables are grouped together.
* Negatively correlated variables are positioned on opposite sides of the plot origin (in opposing quadrants).
* Vectors that end close to the circle outline contribute strongly to a given PC; shorter vectors contribute very little.
* Vectors that are primarily horizontal contribute mostly to the first PC (the x-axis), while vectors that are more vertical contribute primarily to the second PC (the y-axis).
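
The arrow positions in the correlation circle come from the variable coordinates stored in the PCA results; if you want the underlying numbers, a quick sketch (again using the var object from above) is:

```{r}
var$coord[, 1:2]   # coordinates of each variable on the first two PCs
```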


## Exercise Four: Visualizing structure using PCs

Sometimes it is helpful to plot the PC coordinates of the individual samples in order to identify any population structure in the dataset.

```{r}
fviz_pca_ind(pca.obj, geom.ind = "point", mean.point=F)
```

Here we've created a scatter plot with the first PC (Dim1) on the x-axis and the second PC (Dim2) on the y-axis. We can clearly see the samples split into two clusters. It might be interesting to see whether these clusters correspond with iris species (remember, we have that information in the fifth column of the iris data table).

```{r}
fviz_pca_ind(pca.obj, geom.ind = "point", mean.point = FALSE,
             col.ind = iris$Species,
             addEllipses = TRUE,
             legend.title = "Groups")
```

We used the col.ind option to color the dots by species and the addEllipses option to create an ellipse around the center of each species cluster.
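
If you prefer to work directly with the PC coordinates rather than through factoextra, a minimal sketch of a similar plot using base R graphics is shown below.

```{r}
# Plot the first two PC coordinates stored in pca.obj$x, colored by species
plot(pca.obj$x[, 1], pca.obj$x[, 2],
     col = iris$Species, pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(iris$Species),
       col = 1:3, pch = 19, title = "Groups")
```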

## Recap

```{r}
devtools::session_info()
```

5 changes: 4 additions & 1 deletion _bookdown.yml
@@ -3,7 +3,10 @@ chapter_name: "Chapter "
repo: https://github.com/jhudsl/AnVIL_Template/
rmd_files: ["index.Rmd",
"01-intro.Rmd",
"02-chapter_of_course.Rmd",
"02-learning-objectives.Rmd",
"03-lesson-plan.Rmd",
"04-overview.Rmd",
"05-student-guide.Rmd",
"About.Rmd",
"References.Rmd"]
new_session: yes
