---
title: "Sloan Neutral Model"
author: "Samuel Barnett"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  github_document:
    toc: true
    toc_depth: 2
    html_preview: false
---
## Introduction
This notebook fits the Sloan neutral model to examine soil bacterial community assembly across the Finger Lakes region.
### Initiate libraries
```{r, message=FALSE, warning=FALSE}
# Packages needed for analysis
library(dplyr)
library(tibble)
library(phyloseq)
library(ape)
library(vegan)
library(FSA)
library(eulerr)
# Packages needed for plotting
library(ggplot2)
library(grid)
library(gridExtra)
```
### Import data
```{r, message=FALSE, warning=FALSE}
# Import bulk soil phyloseq data
bulk.physeq = readRDS("/home/sam/data/fullCyc2_data/bulk_soil_physeq.RDS")
## Check how many reads you have in each of the samples. This will tell you if you need to re-do anything
# Get read counts and make a new dataframe with this data
read_count = data.frame("count" = colSums(otu_table(bulk.physeq))) %>%
  rownames_to_column(var="X.Sample") %>%
  inner_join(data.frame(sample_data(bulk.physeq)), by="X.Sample") %>%
  arrange(-count) %>%
  mutate(X.Sample=factor(X.Sample, levels=X.Sample))
# Now plot the read count for each sample. The horizontal line represents a 10,000 read threshold
ggplot(data=read_count, aes(x=X.Sample, y=log10(count), fill=ecosystem)) +
  geom_bar(stat="identity") +
  labs(x="Sample", y="Log10(Read count)") +
  geom_hline(yintercept=log10(10000)) +
  theme(text = element_text(size=16),
        axis.text.x = element_blank())
# Everything seems to be at or above 10000 total reads
bulk.physeq
```
All samples have over 10,000 reads, except one forest sample that sits right at about 10,000 reads, which is great. We don't need to remove any samples for lack of sequences.
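Had any sample fallen well below that depth, one way to drop it would be phyloseq's `prune_samples()`. A minimal sketch, not needed here, with a 10,000-read cutoff chosen purely for illustration:

```{r, eval=FALSE}
# Illustrative only: keep samples with at least 10,000 reads (no samples would be removed from this dataset)
bulk.physeq = prune_samples(sample_sums(bulk.physeq) >= 10000, bulk.physeq)
bulk.physeq
```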
Now we need to rarefy the data to normalize sequencing depth. We will also make a normalized dataset that gives relative abundances rather than read counts.
```{r, message=FALSE, warning=FALSE}
# Rarefy to an even depth
set.seed(72) # setting seed for reproducibility
bulk.physeq.rare = rarefy_even_depth(bulk.physeq)
# Normalize read counts (this gives relative abundance)
bulk.physeq.norm = transform_sample_counts(bulk.physeq.rare, function(x) x/sum(x))
```
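As a quick sanity check (not part of the original workflow), every rarefied sample should now share the same read depth, and every normalized sample should sum to 1:

```{r, eval=FALSE}
# Sanity check: rarefied depths are all equal, and relative abundances sum to 1 per sample
range(sample_sums(bulk.physeq.rare))
summary(sample_sums(bulk.physeq.norm))
```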
## Testing neutral model fit
Now let's fit the Sloan neutral model to our data. I will do this for the entire dataset together (regardless of land use). This code was adapted from Burns et al. (2016).
Burns, A., Stephens, W., Stagaman, K. et al. Contribution of neutral processes to the assembly of gut microbial communities in the zebrafish over host development. ISME J 10, 655–664 (2016) doi:10.1038/ismej.2015.142
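For reference, the quantity being fit below is the expected detection frequency of a taxon with mean metacommunity relative abundance $p$: the probability that its local relative abundance exceeds the detection limit $d = 1/N$, where the local abundance is beta-distributed with parameters set by the community size $N$ and the migration rate $m$. This is exactly what the `pbeta(..., lower.tail=FALSE)` call in the chunk below computes:

$$\hat{F}(p) = \Pr(x > d), \qquad x \sim \mathrm{Beta}\big(Nmp,\ Nm(1-p)\big), \qquad d = 1/N$$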
```{r, message=FALSE, warning=FALSE}
## Code adapted from Adam Burns (http://dx.doi.org/10.1038/ismej.2015.142)
require(minpack.lm)
require(Hmisc)
require(stats4)
# Extract the OTU table from the phyloseq object
OTU.table = t(otu_table(bulk.physeq.rare))
# Calculate the number of individuals in the metacommunity (average read depth)
N <- mean(apply(OTU.table, 1, sum))
# Calculate the average relative abundance of each taxon across communities
p.m <- apply(OTU.table, 2, mean)
p.m <- p.m[p.m != 0]
p <- p.m/N
p.df = data.frame(p) %>%
  rownames_to_column(var="OTU")
# Calculate the occurrence frequency of each taxon
OTU.table.bi <- 1*(OTU.table>0)
freq.table <- apply(OTU.table.bi, 2, mean)
freq.table <- freq.table[freq.table != 0]
freq.df = data.frame(OTU=names(freq.table), freq=freq.table)
# Combine the mean relative abundances and occurrence frequencies
C <- inner_join(p.df, freq.df, by="OTU") %>%
  arrange(p)
# Remove rows with any zeros (taxa absent from either the source pool or the local communities). This was already done above, but check again to be sure.
C.no0 <- C %>%
  filter(freq != 0, p != 0)
# Calculate the limit of detection
d <- 1/N
## Fit model parameter m (or Nm) using non-linear least squares (NLS)
p.list <- C.no0$p
freq.list <- C.no0$freq
m.fit <- nlsLM(freq.list ~ pbeta(d, N*m*p.list, N*m*(1-p.list), lower.tail=FALSE), start=list(m=0.1))
m.ci <- confint(m.fit, 'm', level=0.95)
m.sum <- summary(m.fit)
m.coef = coef(m.fit)
freq.pred <- pbeta(d, N*coef(m.fit)*p.list, N*coef(m.fit)*(1-p.list), lower.tail=FALSE)
Rsqr <- 1 - (sum((freq.list - freq.pred)^2))/(sum((freq.list - mean(freq.list))^2))
# Get table of model fit stats
fitstats <- data.frame(m=m.coef, m.low.ci=m.ci[1], m.up.ci=m.ci[2],
                       Rsqr=Rsqr, p.value=m.sum$parameters[4], N=N,
                       Samples=nrow(OTU.table), Richness=length(p.list),
                       Detect=d)
# Get confidence interval for predictions
freq.pred.ci <- binconf(freq.pred*nrow(OTU.table), nrow(OTU.table), alpha=0.05, method="wilson", return.df=TRUE)
# Get table of predictions
pred.df <- data.frame(metacomm_RA=p.list, frequency=freq.pred,
                      frequency_lowerCI=freq.pred.ci[,2],
                      frequency_upperCI=freq.pred.ci[,3]) %>%
  unique()
# Get table of observed occupancy and abundance
obs.df = C.no0 %>%
  rename(metacomm_RA = p, frequency=freq)
```
Plot the model predictions along with the observed occupancy and metacommunity abundance values.
```{r, fig.height=3.14961, fig.width=3.14961}
fulldata.model.plot = ggplot(data=obs.df) +
  geom_point(data=obs.df, aes(x=log10(metacomm_RA), y=frequency),
             alpha=.2, size=2, color="grey30") +
  geom_line(data=pred.df, aes(x=log10(metacomm_RA), y=frequency), color="black") +
  geom_line(data=pred.df, aes(x=log10(metacomm_RA), y=frequency_lowerCI), linetype=2, color="black") +
  geom_line(data=pred.df, aes(x=log10(metacomm_RA), y=frequency_upperCI), linetype=2, color="black") +
  geom_text(data=fitstats, aes(label = paste("R^2 == ", round(Rsqr, 3))),
            x=-4.9, y=0.95, size=4, parse=TRUE) +
  geom_text(data=fitstats, aes(label = paste("italic(m) ==", round(m, 3))),
            x=-4.9, y=0.85, size=4, parse=TRUE) +
  labs(x="Log10 abundance in\nmetacommunity", y="Frequency detected") +
  theme_bw() +
  theme(axis.line = element_line(color="black"),
        legend.position = "none",
        axis.title = element_text(size=14),
        axis.text = element_text(size=12))
fulldata.model.plot
#ggsave("region_neutral_model.tiff", plot=fulldata.model.plot, device="tiff",
# path="/home/sam/notebooks/fullCyc2/figures/community_assembly_MS/",
# width=80, height=80, units="mm")
```
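The fit above pools all samples; the same fitting code could be rerun on subsets to compare land uses. A minimal sketch, assuming the land-use factor is the `ecosystem` column used for the fill color in the read-count plot above ("agriculture" is a hypothetical level; check the actual levels first):

```{r, eval=FALSE}
# Illustrative only: refit the model within a single land use by subsetting the rarefied data first
# unique(sample_data(bulk.physeq.rare)$ecosystem)   # check the actual factor levels
ag.physeq = subset_samples(bulk.physeq.rare, ecosystem == "agriculture")
ag.physeq = prune_taxa(taxa_sums(ag.physeq) > 0, ag.physeq)
# Then rerun the fitting chunk above, starting from OTU.table = t(otu_table(ag.physeq))
```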
## Session info
```{r}
sessionInfo()
```