-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPhylogeny.rmd
112 lines (81 loc) · 3.93 KB
/
Phylogeny.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
---
title: "Creating a phylogeny from the 1000 Genome Database"
author: "JReceveur"
date: "Aug 6, 2020"
output:
html_document:
toc: true
toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Creating a Dendrogram from a downloaded VCF file
This document provides an example of creating a dendrogram from a VCF file. The VCF file can be altered using VCFtools to look at a specific area of the genome. The lines starting with ## are what you would expect to see for the output
For a more in depth tutorial of SNPRelate see http://corearray.sourceforge.net/tutorials/SNPRelate/
This document was created using Rmarkdown <http://rmarkdown.rstudio.com>.
**Thanks to Thomas Quiroz Monnens for catching an error in a previous version of this tutorial!
```
#Installing Packages
```{r}
#if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("SNPRelate")
```
#Loading Packages
The packages will have to be loaded each time you open R or Rstudio.
```{r}
library(gdsfmt)
library(SNPRelate)
library(ggplot2)
```
#Loading VCF file
There are two ways that the VCF file can be loaded into Rstudio if it already on your computer. The first is directly specifying the file path and name, the second is using the file.choose option which will open a browser so you can search for the file. The second option is commented out with # (putting a # before any code means it will be treated as text, to run it, just remove the #).
If you want to follow along with the example data, the data (Approx 1.4 MB) is available for download https://github.com/BenbowLab/BenbowLab.github.io/blob/master/VCFExampleData.vcf
```{r import}
vcf.fn <- "VCFExampleData.vcf"
#vcf.fn<- file.choose()
```
#Parsing
The next command will turn the VCF file into a less data intensive form (GDS) for easier computing. If you have loaded an entire genome, expect this command to take an hour or more.
```{r parsing}
snpgdsVCF2GDS(vcf.fn,"data.gds",method ="biallelic.only")
```
#Formatting for IBS matrix
These commands prepare the data so it is formatted correctly to create a Identity by State matrix (fraction of identity by state for each pair of samples).
```{r clustering}
genofile<-snpgdsOpen("data.gds")
set.seed(100)
ibs.hc<-snpgdsHCluster(snpgdsIBS(genofile,num.thread=2, autosome.only=FALSE))
```
#Turn the clustering into a tree file and plotting tree
This step takes the clustering results from before and turns the numerical values into a dendrogram
```{r}
rv <- snpgdsCutTree(ibs.hc)
plot(rv$dendrogram,main="Dendrogram based on IBS")
```
#Dissimilarity matrices
This command creates an dissimilarity matrix between all the samples. If you are looking at the X chromosome, make sure the autosome.only= code is changed to autosome.only=False.
*Note: A previous version of this example incorrectly used the snpgdsIBS command here instead of snpgdsDiss. We apologize for any confusion.
```{r}
dissMatrix = snpgdsDiss(genofile , sample.id=NULL, autosome.only=TRUE,remove.monosnp=TRUE, maf=NaN, missing.rate=NaN, num.thread=2, verbose=TRUE)
```
#Clustering Analysis
This step performs a clustering analysis similar to the IBS analysis above but using dissimilarity. The next line creates a tree file based on dissimilarity rather than relatedness.
```{r}
snpHCluster = snpgdsHCluster(dissMatrix, sample.id=NULL, need.mat=TRUE, hang=0.01)
cutTree = snpgdsCutTree(snpHCluster, z.threshold=15, outlier.n=5, n.perm = 5000, samp.group=NULL,
col.outlier="red", col.list=NULL, pch.outlier=4, pch.list=NULL,label.H=FALSE, label.Z=TRUE,
verbose=TRUE)
cutTree
snpgdsClose(genofile)
```
#Displaying a tree based on dissimilarity
```{r}
snpgdsDrawTree(cutTree, main = "Dendrogram based on dissimilarity",edgePar=list(col=rgb(0.5,0.5,0.5,0.75),t.col="black"),
y.label.kinship=T,leaflab="perpendicular",yaxis.kinship=F)
```
#Session Info
```{r}
sessionInfo()
```