-
Notifications
You must be signed in to change notification settings - Fork 0
1. Calculate population genetics statistics
Clean environment space:
rm(list=ls())
Set working directory:
setwd("/home/user/Desktop/Tutorial")
Load libraries needed:
library("gplots")
library("vegan")
library("genepopedit")
Load genotype data:
file_path_genotypes <- "./data/scallop/scallops_27pops_all2012-2016_genepop.gen"
For datasets of <15,000 loci, use read.table(). Use data.table() to load larger datasets:
GenePopData <- read.table(file_path_genotypes, sep="\t",quote="", stringsAsFactors=FALSE)
Get to know details of your dataset, for example, population names, number of individuals per populaiton, and sample IDs:
# Return population names
PopNames <- genepop_detective(GenePopData, variable="Pops")
PopNames
# Return population counts
PopCounts <- genepop_detective(GenePopData, variable="PopNum")
PopCounts
# Return sample IDs
SampleIDs <- genepop_detective(GenePopData, variable="Inds")
SampleIDs
length(SampleIDs)
Let's calculate genetic diversity estimates using two popular R packages, adegenet
and hierfstat
.
- With
adegenet
:
The package adegenet
performs a wide variety of population genomics analyses. Depending on whether there is individual (genotypes) or population (allele frequencies or counts) data available, an object called "genind" or "genpop", respectively, needs to be created. Further information about this package and these two object types can be found here.
Load required libraries:
library("adegenet")
library("pegas")
library("hierfstat")
Read in genotype data in "genepop" file format (Note: "genpop" != "genepop"):
x <- read.genepop(file_path_genotypes, ncode=3, quiet=FALSE)
With the basic.stats()
function of adegenet
you can obtain summary statistics such as: allelic frequencies, observed heterozygosities, genetic diversities per locus and population, mean observed heterozygosities, mean gene diversities within population Hs, Gene diversities overall Ht and corrected Htp, and Dst, Dstp. Fst and Fstp as well as Fis following Nei (1987) per locus and overall loci.
Let's obtain these estimates:
basicStats <- basic.stats(x)
basicStats
Save results in a txt file:
write.table(basicStats$perloc, file="./analysis/01-basic-stats/Basic_stats_per_locus_scallops.txt",
sep="\t", row.names=FALSE, col.names=TRUE, quote=FALSE)
- With
hierfstat
: Other way to calculate summary statistics is using the functionsummary()
from the R packagehierfstat
. Further information of this package can be found here.
Let's obtain the summary statistics:
SumStats <- summary(x)
SumStats
names(SumStats)
Make a barplot to assess whether Ho and He differ per locus:
barplot(SumStats$Hexp-SumStats$Hobs, main="Heterozygosity: expected-observed",
ylab="Hexp - Hobs")
Question:
- Are the same summary statistics generated by
adegenet
andhierfstat
? If you want to compare Ho and He, which one you would use?
Traditionally, F-statistics are used as estimates of genetic structure of populations. They were first introduced by Sewall Wright in 1951 and later, in 1984, Weir and Cockerham proposed an update of their calculation. If you are not familiarised with these statistics, check out this interesting paper.
Using hierfstat
, calculate global Fst:
fstat(x, fstonly = TRUE)
Now calculate Fst by locus, e.g. Fst (pop/total), Fit (Ind/total), and Fis (ind/pop):
Fst(as.loci(x))
F_stats <- basic.stats(x, diploid = TRUE, digits = 2)
F_stats
names(F_stats)
Questions:
- Are there loci that probably are out of HWE? Which ones?
- Is the population structure among populations high or low? Why?