---
title: "Sloan Neutral Model"
author: "Samuel Barnett"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  github_document:
    toc: true
    toc_depth: 2
    html_preview: false
---
## Introduction
This notebook fits the Sloan neutral model to examine soil bacterial community assembly across the Finger Lakes region.
### Initiate libraries
```{r, message=FALSE, warning=FALSE}
# Packages needed for analysis
library(dplyr)
library(tibble)
library(phyloseq)
library(ape)
library(vegan)
library(FSA)
library(eulerr)
# Packages needed for plotting
library(ggplot2)
library(grid)
library(gridExtra)
```
### Import data
```{r, message=FALSE, warning=FALSE}
# Import bulk soil phyloseq data
bulk.physeq = readRDS("/home/sam/data/fullCyc2_data/bulk_soil_physeq.RDS")
## Check how many reads you have in each of the samples. This will tell you if you need to re-do anything
# Get read counts and make a new dataframe with this data
read_count = data.frame("count" = colSums(otu_table(bulk.physeq))) %>%
  rownames_to_column(var="X.Sample") %>%
  inner_join(data.frame(sample_data(bulk.physeq)), by="X.Sample") %>%
  arrange(-count) %>%
  mutate(X.Sample=factor(X.Sample, levels=X.Sample))
# Now plot the read count for each sample. The horizontal line represents a 10,000 read threshold
ggplot(data=read_count, aes(x=X.Sample, y=log10(count), fill=ecosystem)) +
  geom_bar(stat="identity") +
  labs(x="Sample", y="Log10(Read count)") +
  geom_hline(yintercept=log10(10000)) +
  theme(text = element_text(size=16),
        axis.text.x = element_blank())
# Everything seems to be at or above 10000 total reads
bulk.physeq
```
All samples have over 10,000 reads, except one forest sample that sits right at about 10,000 reads, which is great. We don't need to remove any samples for lack of sequences.
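Had any sample fallen well below that depth, one way to drop it would be phyloseq's `prune_samples()`. A minimal sketch, not needed here, with a 10,000-read cutoff chosen purely for illustration:

```{r, eval=FALSE}
# Illustrative only: keep samples with at least 10,000 reads (no samples would be removed from this dataset)
bulk.physeq = prune_samples(sample_sums(bulk.physeq) >= 10000, bulk.physeq)
bulk.physeq
```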
Now we need to rarefy the data to normalize sequencing depth. We will also make a normalized dataset that gives relative abundances rather than read counts.
```{r, message=FALSE, warning=FALSE}
# Rarefy to an even depth
set.seed(72) # setting seed for reproducibility
bulk.physeq.rare = rarefy_even_depth(bulk.physeq)
# Normalize read counts (this gives relative abundance)
bulk.physeq.norm = transform_sample_counts(bulk.physeq.rare, function(x) x/sum(x))
```
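As a quick sanity check (not part of the original workflow), every rarefied sample should now share the same read depth, and every normalized sample should sum to 1:

```{r, eval=FALSE}
# Sanity check: rarefied depths are all equal, and relative abundances sum to 1 per sample
range(sample_sums(bulk.physeq.rare))
summary(sample_sums(bulk.physeq.norm))
```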
## Testing neutral model fit
Now let's fit the Sloan neutral model to our data. I will do this for the entire dataset together (regardless of land use). This code was adapted from Burns et al. (2016).
Burns, A., Stephens, W., Stagaman, K. et al. Contribution of neutral processes to the assembly of gut microbial communities in the zebrafish over host development. ISME J 10, 655–664 (2016) doi:10.1038/ismej.2015.142
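For reference, the quantity being fit below is the expected detection frequency of a taxon with mean metacommunity relative abundance $p$: the probability that its local relative abundance exceeds the detection limit $d = 1/N$, where the local abundance is beta-distributed with parameters set by the community size $N$ and the migration rate $m$. This is exactly what the `pbeta(..., lower.tail=FALSE)` call in the chunk below computes:

$$\hat{F}(p) = \Pr(x > d), \qquad x \sim \mathrm{Beta}\big(Nmp,\ Nm(1-p)\big), \qquad d = 1/N$$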
```{r, message=FALSE, warning=FALSE}
## Code adapted from Adam Burns (http://dx.doi.org/10.1038/ismej.2015.142)
require(minpack.lm)
require(Hmisc)
require(stats4)
# Extract the OTU table from the phyloseq object
OTU.table = t(otu_table(bulk.physeq.rare))
# Calculate the number of individuals in the metacommunity (average read depth)
N <- mean(apply(OTU.table, 1, sum))
# Calculate the average relative abundance of each taxon across communities
p.m <- apply(OTU.table, 2, mean)
p.m <- p.m[p.m != 0]
p <- p.m/N
p.df = data.frame(p) %>%
  rownames_to_column(var="OTU")
# Calculate the occurrence frequency of each taxon
OTU.table.bi <- 1*(OTU.table>0)
freq.table <- apply(OTU.table.bi, 2, mean)
freq.table <- freq.table[freq.table != 0]
freq.df = data.frame(OTU=names(freq.table), freq=freq.table)
# Combine the mean relative abundances and occurrence frequencies
C <- inner_join(p.df, freq.df, by="OTU") %>%
  arrange(p)
# Remove rows with any zeros (taxa absent from either the source pool or the local communities). This was already done above, but check again to be sure.
C.no0 <- C %>%
  filter(freq != 0, p != 0)
# Calculate the limit of detection
d <- 1/N
## Fit model parameter m (or Nm) using non-linear least squares (NLS)
p.list <- C.no0$p
freq.list <- C.no0$freq
m.fit <- nlsLM(freq.list ~ pbeta(d, N*m*p.list, N*m*(1-p.list), lower.tail=FALSE), start=list(m=0.1))
m.ci <- confint(m.fit, 'm', level=0.95)
m.sum <- summary(m.fit)
m.coef = coef(m.fit)
freq.pred <- pbeta(d, N*coef(m.fit)*p.list, N*coef(m.fit)*(1-p.list), lower.tail=FALSE)
Rsqr <- 1 - (sum((freq.list - freq.pred)^2))/(sum((freq.list - mean(freq.list))^2))
# Get table of model fit stats
fitstats <- data.frame(m=m.coef, m.low.ci=m.ci[1], m.up.ci=m.ci[2],
                       Rsqr=Rsqr, p.value=m.sum$parameters[4], N=N,
                       Samples=nrow(OTU.table), Richness=length(p.list),
                       Detect=d)
# Get confidence interval for predictions
freq.pred.ci <- binconf(freq.pred*nrow(OTU.table), nrow(OTU.table), alpha=0.05, method="wilson", return.df=TRUE)
# Get table of predictions
pred.df <- data.frame(metacomm_RA=p.list, frequency=freq.pred,
                      frequency_lowerCI=freq.pred.ci[,2],
                      frequency_upperCI=freq.pred.ci[,3]) %>%
  unique()
# Get table of observed occupancy and abundance
obs.df = C.no0 %>%
  rename(metacomm_RA = p, frequency=freq)
```
Plot the model predictions along with the observed occupancy and metacommunity abundance values.
```{r, fig.height=3.14961, fig.width=3.14961}
fulldata.model.plot = ggplot(data=obs.df) +
  geom_point(data=obs.df, aes(x=log10(metacomm_RA), y=frequency),
             alpha=.2, size=2, color="grey30") +
  geom_line(data=pred.df, aes(x=log10(metacomm_RA), y=frequency), color="black") +
  geom_line(data=pred.df, aes(x=log10(metacomm_RA), y=frequency_lowerCI), linetype=2, color="black") +
  geom_line(data=pred.df, aes(x=log10(metacomm_RA), y=frequency_upperCI), linetype=2, color="black") +
  geom_text(data=fitstats, aes(label = paste("R^2 == ", round(Rsqr, 3))),
            x=-4.9, y=0.95, size=4, parse=TRUE) +
  geom_text(data=fitstats, aes(label = paste("italic(m) ==", round(m, 3))),
            x=-4.9, y=0.85, size=4, parse=TRUE) +
  labs(x="Log10 abundance in\nmetacommunity", y="Frequency detected") +
  theme_bw() +
  theme(axis.line = element_line(color="black"),
        legend.position = "none",
        axis.title = element_text(size=14),
        axis.text = element_text(size=12))
fulldata.model.plot
#ggsave("region_neutral_model.tiff", plot=fulldata.model.plot, device="tiff",
# path="/home/sam/notebooks/fullCyc2/figures/community_assembly_MS/",
# width=80, height=80, units="mm")
```
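The fit above pools all samples; the same fitting code could be rerun on subsets to compare land uses. A minimal sketch, assuming the land-use factor is the `ecosystem` column used for the fill color in the read-count plot above ("agriculture" is a hypothetical level; check the actual levels first):

```{r, eval=FALSE}
# Illustrative only: refit the model within a single land use by subsetting the rarefied data first
# unique(sample_data(bulk.physeq.rare)$ecosystem)   # check the actual factor levels
ag.physeq = subset_samples(bulk.physeq.rare, ecosystem == "agriculture")
ag.physeq = prune_taxa(taxa_sums(ag.physeq) > 0, ag.physeq)
# Then rerun the fitting chunk above, starting from OTU.table = t(otu_table(ag.physeq))
```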
## Session info
```{r}
sessionInfo()
```