# Overview

```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```

Being able to understand and apply dimensionality reduction techniques is vital for working with the large, complex datasets that are standard in bioinformatic analysis. This lesson offers an in-depth exploration of common dimensionality reduction techniques as implemented in R. It can be used on its own or as a supplement to other AnVIL lessons. During this lesson, students will work hands-on with publicly available data in the RStudio environment on the [AnVIL](https://anvilproject.org/) cloud computing platform to apply dimensionality reduction methods, visualize the results, and choose the appropriate number of dimensions for use in downstream analyses.

# Introduction

## Activity Context

**Course Audience**

Graduate students in biological sciences or related fields

**Course Prerequisites**

A layperson's understanding of genetics (DNA, genes, trait inheritance)
Some previous exposure to the central dogma of molecular biology

**Class Type**

Lab
Computer-based

**Class Size**

1-50

## Motivation

## Target Audience

The course is intended for ...

## Curriculum

The course covers...

## Learning Objectives

Coming soon!

# Lesson Plan

Coming soon!

# Overview

Biological data, especially molecular data, are inherently high-dimensional. For example, expression matrices can contain measurements for tens of thousands of genes, and each gene represents a different dimension of the data. The sample sizes necessary to fit statistical models to such high-dimensional datasets are astronomically large and impractical for most modern studies. High-dimensional data also greatly increase the computational resources and time necessary for any bioinformatic analysis.

# What is dimensionality reduction?

Dimensionality reduction methods allow scientists to get around these issues. They are techniques that reduce the number of input variables in a dataset. These methods work for biological data because so many of the variables are correlated. Returning to our expression data example: while each gene represents a different dimension, genes involved in the same metabolic pathway tend to be upregulated or downregulated together, so those dimensions are not independent. Dimensionality reduction methods can identify these correlations and combine the correlated variables in a way that reduces the number of dimensions to a workable amount while minimizing information loss. As an added benefit, combining correlated dimensions can produce less noisy data, since values are averaged across multiple dimensions to give a better representation of underlying patterns. Dimensionality reduction also makes it easier for scientists to visualize data in ways that are comprehensible to the human brain.

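To make the idea of correlated dimensions concrete, here is a minimal sketch with simulated values (the gene1 and gene2 objects are illustrative, not real expression data): two "genes" that rise and fall together behave, effectively, like a single dimension.

```{r}
# Simulated sketch: gene2 closely tracks gene1, so the two
# dimensions carry nearly the same information.
set.seed(42)
gene1 <- rnorm(100)
gene2 <- gene1 + rnorm(100, sd = 0.2)
cor(gene1, gene2)  # close to 1: highly correlated dimensions
```
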
```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```

# Principal Components Analysis

This chapter contains the student instructions for the Principal Components Analysis activity.

## Before You Start

Make sure you can start an RStudio session on AnVIL. Take a look at our guide [here](https://jhudatascience.org/GDSCN_Book_Statistics_for_Genomics_Differential_Expression/getting-set-up.html#introduction).

### Libraries

We will be using the prcomp() command, which comes with base R, as well as the [factoextra package](https://www.rdocumentation.org/packages/factoextra/versions/1.0.7) for visualization. You can install the factoextra package using the install.packages() command.

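If factoextra is not already installed, you can install it first (shown with eval = FALSE so the chapter does not reinstall it on every build):

```{r, eval = FALSE}
# run once per environment
install.packages("factoextra")
```
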
```{r}
library(factoextra)
```

## Introduction

Principal components analysis (PCA) is one of the oldest and most commonly used dimensionality reduction techniques. PCA is an unsupervised machine learning algorithm that combines correlated dimensions into a single new variable. This new variable represents an axis, or line, through the dataset that describes the maximum amount of variation, and it becomes the first principal component (PC). After the first PC is created, the algorithm identifies a new axis that is orthogonal (statistically independent) to the first PC while also describing the maximum amount of previously undescribed variation.

This process continues, with each PC capturing a decreasing amount of variation, until the maximum number of PCs is reached (the maximum number of PCs equals the total number of variables in the original dataset).

By definition, the first PCs will capture most of the heterogeneity (and thus most of the signal) in a high-dimensional dataset, while the variation in later PCs will mostly represent noise. Thus, we can use the first PCs both to examine structure in our dataset and in downstream analyses, reducing computational work and improving the fit of our statistical models.

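As a quick illustration (a sketch using simulated data, not part of the iris activity below), running prcomp() on two correlated variables shows almost all of the variance landing in the first PC:

```{r}
# Two correlated variables collapse onto one dominant PC.
set.seed(8)
x <- rnorm(50)
y <- 2 * x + rnorm(50, sd = 0.5)  # y closely tracks x
toy.pca <- prcomp(cbind(x, y), scale = TRUE)
summary(toy.pca)  # PC1 holds most of the variance; PC2 is mostly noise
```
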
::: {.fyi}
GOING DEEPER: The math behind PCA

PCA is a method of matrix factorization. Essentially, the PCA approach allows a researcher to break down (or decompose) a large data matrix into smaller constituent parts, similar to the way we can break the number 30 into the factors 5 and 6. Each smaller matrix can be summarized by a vector, called an eigenvector. The variance of the eigenvector is called the eigenvalue. For a large matrix (like most biological datasets), there are many possible ways to decompose the matrix; in our earlier example with the number 30, we could choose to start with the number 5 or the number 6. With PCA, we always start with the eigenvector that has the largest eigenvalue. This eigenvector is what we call the first PC. What is left of the original data matrix is then decomposed further into smaller and smaller pieces, with each subsequent eigenvector chosen because it has the largest possible remaining eigenvalue.

The PC variables created by this method are a linear combination of the original variables.
:::

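We can check this relationship directly on the simulated data from the sketch above: for centered and scaled data, the eigenvalues of the correlation matrix are exactly the variances (squared standard deviations) of the PCs that prcomp() reports.

```{r}
eigen(cor(cbind(x, y)))$values  # eigenvalues, largest first
toy.pca$sdev^2                  # the same values: variances of the PCs
```
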
## Exercise One: Calculating PCs

There are many packages available for R and RStudio that can perform PCA. These packages calculate PCA in similar ways, although some of their default parameters (such as how many PCs the algorithm will calculate and store) may differ. Be sure to read the manual for any package you choose to use.

For this example, we will use the iris dataset that comes preloaded in R. The statistician R.A. Fisher popularized this dataset for statistical work, although the original data were collected by Edgar Anderson. The dataset contains 4 measurements (petal length, petal width, sepal length, and sepal width) for 150 individuals from 3 different iris species.

Let's start by looking at the first few rows of the dataset.

```{r}
head(iris)
```

Each row represents a single individual flower. The first four columns are _active variables_, the variables we will use for the principal components analysis. The fifth column is the species of each individual (_Iris setosa_, _Iris virginica_, or _Iris versicolor_). This is a _supplementary variable_: additional information about each sample that we will not use in the PCA.

The actual PCA calculation is quite simple. We specify that the calculation uses only the first four columns of the iris dataset, and we also scale each variable so that its mean is 0 and its standard deviation is 1.

```{r}
# PCA on the four measurement columns, scaling each variable first
pca.obj <- prcomp(iris[1:4], scale = TRUE)
str(pca.obj)
```

The prcomp() command will create a list containing several vectors and matrices:

- _sdev_, a vector of the standard deviations for each PC
- _rotation_, a matrix of the eigenvectors for each variable
- _center_, a vector of the center value for each variable
- _scale_, a vector of the scaling factor for each variable
- _x_, a matrix of the individual PC values for each flower

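Each piece can be pulled out of the list with the usual $ syntax; for instance, a quick look at two of them:

```{r}
head(pca.obj$x)   # PC coordinates for the first six flowers
pca.obj$rotation  # eigenvectors: each variable's loading on each PC
```
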
## Exercise Two: Choosing the number of PCs

We have now calculated our PCs. In this dataset, we have a limited number of PCs and could choose to use all of them in downstream analysis, but this becomes impractical with high-dimensional data. The majority of the variation in a dataset is captured by the first few PCs, so common practice is to identify a cutoff for how much variation a PC must explain in order to be included. If the data have been scaled and centered, it is also common to include only those PCs with an eigenvalue greater than or equal to 1.

For the iris dataset, we can look at this either in table form or using a scree plot. Let's take a look at a table of the eigenvalues and variation captured by each PC first. We'll use a command from the factoextra package.

```{r}
get_eigenvalue(pca.obj)
```

This table tells us that the vast majority of the variation is described by just the first two PCs.

This information can also be calculated from the _sdev_ vector stored in pca.obj, so you don't have to use an additional library like factoextra. Instead, you can convert the standard deviations to variances and express each as a proportion of the total.

```{r}
# each PC's variance as a proportion of the total variance
pca.obj$sdev^2 / sum(pca.obj$sdev^2)
```

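Because the squared standard deviations are the eigenvalues, the eigenvalue-greater-than-or-equal-to-1 rule mentioned above is a one-line check. Note that for the scaled iris data only PC1 passes this rule, a useful reminder that different cutoffs can disagree:

```{r}
pca.obj$sdev^2            # eigenvalues of each PC
sum(pca.obj$sdev^2 >= 1)  # number of PCs with eigenvalue >= 1
```
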
It may be easier to examine this data visually, which we can do using a scree plot (Jolliffe 2002; Peres-Neto, Jackson, and Somers 2005).

```{r}
fviz_eig(pca.obj, addlabels = TRUE)
```

The PCs are on the x-axis, and the y-axis is the percentage of variance explained by each PC. As you can see, the percentage of explained variance drops off dramatically after the first two PCs.

Going forward, we might choose to use these two variables in our analyses.

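In practice, that means keeping only the first two columns of the score matrix _x_ (the object name below is illustrative):

```{r}
# keep only PC1 and PC2 for downstream work (illustrative name)
iris.pc12 <- pca.obj$x[, 1:2]
head(iris.pc12)
```
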
## Exercise Three: Correlation between variables and PCs

At some point, we might be interested in knowing which variables contribute to each PC. Let's first look at this in table form. We start by extracting the variable results from our PCA output.

```{r}
var <- get_pca_var(pca.obj)
var
```

The get_pca_var() command creates a list of matrices with useful information about the PCA. For now, we're interested in the contribution matrix.

```{r}
var$contrib
```

This table tells us what percentage of each PC is contributed by each variable. A higher percentage means a greater contribution from that variable.

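factoextra can also draw these contributions as a bar chart, one PC at a time; here is a sketch for the first PC:

```{r}
fviz_contrib(pca.obj, choice = "var", axes = 1)  # contributions to PC1
```
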
We can also visualize this information using a correlation circle.

```{r}
fviz_pca_var(pca.obj, col.var = "black", repel = TRUE)
```

The size and position of the vectors (arrows) in the correlation circle can tell us quite a lot about the variables:

- Positively correlated variables are grouped together.
- Negatively correlated variables are positioned on opposite sides of the plot origin (in opposed quadrants).
- Vectors that end close to the circle outline contribute greatly to a given PC; shorter vectors contribute very little.
- Vectors that are primarily horizontal mostly contribute to the first PC (the x-axis); vectors that are more vertical contribute primarily to the second PC (the y-axis).

## Exercise Four: Visualizing structure using PCs

Sometimes it is helpful to plot the PC coordinates of the individual samples in order to identify any population structure in the dataset.

```{r}
fviz_pca_ind(pca.obj, geom.ind = "point", mean.point = FALSE)
```

Here we've created a scatter plot with the first PC (Dim1) on the x-axis and the second PC (Dim2) on the y-axis. We can clearly see the samples split into 2 clusters. It might be interesting to see whether these clusters correspond with iris species (remember, we have that information in the fifth column of the iris data table).

```{r}
fviz_pca_ind(pca.obj, geom.ind = "point", mean.point = FALSE,
             col.ind = iris$Species,  # color points by species
             addEllipses = TRUE,      # ellipse around each group's center
             legend.title = "Groups"
)
```

We used the col.ind option to color the dots by species and the addEllipses option to create an ellipse around the center of each species cluster.

## Recap

```{r}
devtools::session_info()
```