diff --git a/inst/templates/singlecell/02_differential_expression/scRNA_pseudobulk.Rmd b/inst/templates/singlecell/02_differential_expression/scRNA_pseudobulk.Rmd
index ff70e4c..3848914 100644
--- a/inst/templates/singlecell/02_differential_expression/scRNA_pseudobulk.Rmd
+++ b/inst/templates/singlecell/02_differential_expression/scRNA_pseudobulk.Rmd
@@ -21,11 +21,9 @@ params:
contrasts: !r list(c("age", "YOUNG", "OLD"))
cluster_name: "2"
resolution_column: "integrated_snn_res.0.4"
-
---
-Template developed with materials from https://hbctraining.github.io/main/.
-
+Template developed with materials from .
```{r, message=FALSE, warning=FALSE}
# This set up the working directory to this file so all files can be found
@@ -41,7 +39,6 @@ stopifnot(compareVersion(as.character(BiocManager::version()), "3.18")>=0)
stopifnot(compareVersion(as.character(packageVersion("Seurat")), "5.0.0")>=0)
```
-
```{r setup, cache=FALSE, message=FALSE, warning=FALSE, echo=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Load libraries'
@@ -86,6 +83,7 @@ opts_chunk[["set"]](
cache.lazy = FALSE,
error = TRUE,
echo = TRUE,
+ warning = FALSE,
fig.height = 5L,
fig.retina = 2L,
fig.width = 9.6,
@@ -94,6 +92,8 @@ opts_chunk[["set"]](
warning = TRUE
)
+source(params$project_file)
+
seurat_obj <- params$seurat_obj
column <- params$column
contrasts <- params$contrasts
@@ -101,11 +101,8 @@ cluster_name <- params$cluster_name
resolution_column <- params$resolution_column
```
-
-
This code is in this ![](https://img.shields.io/badge/status-alpha-yellow) revision.
-
# Overview
- Project: `r project`
@@ -114,19 +111,18 @@ This code is in this ![](https://img.shields.io/badge/status-alpha-yellow) revis
- Experiment: `r experiment`
- Aim: `r aim`
-
Here we will apply a pseudobulk approach to look for differentially expressed genes in one of our cell types.
Using a pseudobulk approach involves the following steps:
-1. Subsetting to the cells for the cell type(s) of interest to perform the DE analysis.
-2. Extracting the raw counts after QC filtering of cells to be used for the DE analysis.
-3. Aggregating the counts and metadata to the sample level.
-4. Performing the DE analysis (you need at least two biological replicates per condition to perform the analysis, but more replicates are recommended).
+1. Subsetting to the cells for the cell type(s) of interest to perform the DE analysis.
+2. Extracting the raw counts after QC filtering of cells to be used for the DE analysis.
+3. Aggregating the counts and metadata to the sample level.
+4. Performing the DE analysis (you need at least two biological replicates per condition to perform the analysis, but more replicates are recommended).
## Dataset
-```{r load_data, cache = TRUE}
+```{r load_data, cache = FALSE}
# Loading object
if (isUrl(seurat_obj)){
seurat <- readRDS(url(seurat_obj))
@@ -145,16 +141,13 @@ cols3 <- unlist(strsplit(paste(colsD, colsM, colsL, sep = "_"), "_"))
cols2 <- c(unlist(strsplit(paste(colsD, colsM, sep = "_"), "_")), "deepskyblue2")
```
-
After filtering, each sample contributed the following number of cells to the analysis:
```{r meta pre doub}
table(seurat$orig.ident)
```
-
-
-# Aggregate counts
+# Aggregate counts
## Aggregate metadata at the sample level
@@ -168,13 +161,10 @@ meta <- seurat@meta.data %>%
meta
```
-
-
## Aggregate counts
To aggregate the counts, we will use the AggregateExpression() function from Seurat. It will take as input a Seurat object, and return summed counts ("pseudobulk") for each identity class. The default is to return a matrix with genes as rows, and identity classes as columns. We have set return.seurat to TRUE, which means rather than a matrix we will get an object of class Seurat. We have also specified which factors to aggregate on, using the group.by argument.
-
```{r}
seurat$sample <- seurat$orig.ident
@@ -186,7 +176,6 @@ bulk <- AggregateExpression(
)
```
-
## Add number of cells per sample per cluster to the metadata
```{r}
@@ -213,7 +202,7 @@ bulk@meta.data <- meta_bulk
#bulk[[column]] <- as.factor(bulk[[column]])
# test for consistency
-all(Cells(bulk) == row.names(bulk@meta.data))
+stopifnot(all(Cells(bulk) == row.names(bulk@meta.data)))
bulk@meta.data %>% head()
```
@@ -221,19 +210,19 @@ bulk@meta.data %>% head()
# DE analysis with DESeq2
## Subset to cell type of interest
+
```{r}
Idents(object = bulk) <- resolution_column
bulk_touse <- subset(x = bulk, idents = cluster_name)
```
-
## Check that we have enough cells
Before moving on to a pseudobulk DGE analysis, it is important to identify how many cells we aggregated for each sample. We need to make sure that we have enough cells per sample after subsetting to one celltype. We recommend 50 cells per sample to move forward with confidence.
```{r}
ggplot(bulk_touse@meta.data, aes(x=orig.ident, y=n_cells)) +
- geom_bar(stat="identity", color="black", aes_string(fill=column)) +
+ geom_bar(stat="identity", color="black", aes(fill=.data[[column]])) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(x="Sample name", y="Number of cells") +
@@ -242,12 +231,11 @@ ggplot(bulk_touse@meta.data, aes(x=orig.ident, y=n_cells)) +
## Run DE analysis
-Differential gene expression analysis of count data was performed using the Bioconductor R package, DESeq2, which fits the count data to a negative binomial model.
-
-Before fitting the model, we often look at a metric called dispersion, which is a measure for variance which also takes into consideration mean expression. A dispersion value is estimated for each individual gene, then 'shrunken' to a more accurate value based on expected variation for a typical gene exhibiting that level of expression. Finally, the shrunken dispersion value is used in the final GLM fit.
+Differential gene expression analysis of count data was performed using the Bioconductor R package, DESeq2, which fits the count data to a negative binomial model.
-We use the below dispersion plot, which should show an inverse relationship between dispersion and mean expression, to get an idea of whether our data is a good fit for the model.
+Before fitting the model, we often look at a metric called dispersion, which is a measure for variance which also takes into consideration mean expression. A dispersion value is estimated for each individual gene, then 'shrunken' to a more accurate value based on expected variation for a typical gene exhibiting that level of expression. Finally, the shrunken dispersion value is used in the final GLM fit.
+We use the below dispersion plot, which should show an inverse relationship between dispersion and mean expression, to get an idea of whether our data is a good fit for the model.
```{r}
cluster_counts <- t(FetchData(bulk_touse, layer="counts", vars=rownames(bulk_touse)))
@@ -258,12 +246,14 @@ dds_to_use <- DESeqDataSetFromMatrix(cluster_counts, bulk_touse@meta.data, desig
de <- DESeq(dds_to_use)
DESeq2::plotDispEsts(de)
norm_matrix <- as.data.frame(counts(de, normalized=T))
-```
+vsd <- vst(de)
+vsd_matrix <- assay(vsd)
+```
Because it is difficult to accurately detect and quantify the expression of lowly expressed genes, differences in their expression between treatment conditions can be unduly exaggerated after the model is fit. We correct for this so that gene LFC is not dependent overall on basal gene expression level.
-In cases there are multiple groups and conditions across groups is recommended to use dummy variables instead of interaction terms: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions.
+In cases there are multiple groups and conditions across groups is recommended to use dummy variables instead of interaction terms: .
The LRT is useful for testing multiple terms at once, for example testing 3 or more levels of a factor at once, or all interactions between two variables. The LRT for count data is conceptually similar to an analysis of variance (ANOVA) calculation in linear regression, except that in the case of the Negative Binomial GLM, we use an analysis of deviance (ANODEV), where the deviance captures the difference in likelihood between a full and a reduced model.
@@ -303,9 +293,9 @@ de_list=lapply(contrasts, function(contrast){
This plot can help to:
-- Identify Differential Expression: Genes that show a significant log-fold change (M value away from 0) indicate changes in expression between conditions.
-- Assess Data Quality: The plot can help in identifying biases or systematic errors in the data. Ideally, most points should scatter around the M=0 line, indicating that there is no significant systematic difference between the conditions.
-- Visualize data dispersion: The distribution of points along the A-axis gives a sense of the spread of expression levels and any patterns or anomalies in the dataset.
+- Identify Differential Expression: Genes that show a significant log-fold change (M value away from 0) indicate changes in expression between conditions.
+- Assess Data Quality: The plot can help in identifying biases or systematic errors in the data. Ideally, most points should scatter around the M=0 line, indicating that there is no significant systematic difference between the conditions.
+- Visualize data dispersion: The distribution of points along the A-axis gives a sense of the spread of expression levels and any patterns or anomalies in the dataset.
```{r after_lfc_shrink, results='asis', message=F, warning=F}
for (contrast in names(de_list)){
@@ -320,7 +310,7 @@ for (contrast in names(de_list)){
## Volcano plot {.tabset}
-This volcano plot shows the genes that are significantly up- and down-regulated as a result of the analysis comparison. The points highlighted in red are genes that have padj < 0.05 and a log2-fold change > 1. Points in blue have a padj < 0.05 and a log2-fold change < 1 and points in green have a padj > 0.05 and a log2-fold change > 2. Grey points are non-significant. The dashed lines correspond to the cutoff values of log2 foldchance and padj that we have chosen.
+This volcano plot shows the genes that are significantly up- and down-regulated as a result of the analysis comparison. The points highlighted in red are genes that have padj \< 0.05 and a log2-fold change \> 1. Points in blue have a padj \< 0.05 and a log2-fold change \< 1 and points in green have a padj \> 0.05 and a log2-fold change \> 2. Grey points are non-significant. The dashed lines correspond to the cutoff values of log2 foldchance and padj that we have chosen.
```{r volcano_plot, fig.height=6, results='asis'}
# degVolcano(res_mod[,c('lfc', 'padj')], plot_text = show)
@@ -343,7 +333,6 @@ for (contrast in names(de_list)){
}
```
-
## Heatmap {.tabset}
```{r heapmap, results='asis'}
@@ -367,7 +356,6 @@ for (contrast in names(de_list)){
}
```
-
## Differentially Expressed Genes {.tabset}
```{r sig_genes_table, results='asis'}
@@ -381,9 +369,9 @@ for (contrast in names(de_list)){
tagList(dt_list)
```
-## Plot top 16 genes {.tabset}
+## Plot top 16 genes - pseudobulk {.tabset}
-```{r top n DEGs, fig.height = 6, fig.width = 8, results='asis'}
+```{r top n DEGs pseudobulk, fig.height = 6, fig.width = 8, results='asis'}
n = 16
for (contrast in names(de_list)){
@@ -391,24 +379,76 @@ for (contrast in names(de_list)){
res_sig = de_list[[contrast]][["sig"]]
top_n <- res_sig %>% slice_min(order_by = padj, n = n, with_ties = F) %>%
dplyr::select(gene_id)
- top_n_exp <- norm_matrix %>% as.data.frame() %>%
+ top_n_exp <- vsd_matrix %>% as.data.frame() %>%
rownames_to_column('gene_id') %>%
# dplyr::select(-group, -group_name) %>%
- pivot_longer(!gene_id, names_to = 'sample', values_to = 'log2_expression') %>%
+ pivot_longer(!gene_id, names_to = 'sample', values_to = 'normalized_counts') %>%
right_join(top_n, relationship = "many-to-many") %>%
- left_join(meta, by = c("sample"="orig.ident"))
+ left_join(meta_bulk, by = c("sample"="orig.ident"))
- p1=ggplot(top_n_exp, aes_string(x = column, y = 'log2_expression')) +
+ p1=ggplot(top_n_exp, aes(x = .data[[column]], y = normalized_counts)) +
geom_boxplot(outlier.shape = NA, linewidth=0.5, color="grey") +
geom_point() +
- facet_wrap(~gene_id) +
- ggtitle(str_interp('Expression of Top ${n} DEGs')) +
+ facet_wrap(~gene_id, scales = 'free_y') +
+ ggtitle(str_interp('Expression of Top ${n} DEGs - Pseudobulk')) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
print(p1)
cat("\n\n")
}
```
+## Plot top 16 genes - single cell {.tabset}
+
+```{r top n DEGs single cell, fig.height = 6, fig.width = 8, results='asis'}
+
+n = 16
+
+for (contrast in names(de_list)){
+ cat("### ", contrast, "{.tabset} \n\n")
+ res_sig = de_list[[contrast]][["sig"]]
+ top_n <- res_sig %>% slice_min(order_by = padj, n = n, with_ties = F) %>%
+ dplyr::select(gene_id)
+
+ seurat_topn <- subset(seurat, features = top_n$gene_id)
+ seurat_topn_meta <- seurat_topn@meta.data %>% as.data.frame() %>%
+ rownames_to_column('barcode')
+
+ topn_counts_df <- LayerData(object = seurat_topn, assay = "RNA", layer = "data") %>%
+ # topn_counts_df <- seurat_topn@assays$RNA@layers$data %>%
+ as.data.frame() %>% rownames_to_column('gene_id') %>%
+ pivot_longer(!gene_id, names_to = 'barcode', values_to = 'lognorm_counts') %>%
+ left_join(seurat_topn_meta)
+
+ for (gene in top_n$gene_id){
+
+ cat("#### ", gene, "\n\n")
+
+ topn_counts_df_gene <- topn_counts_df %>% filter(gene_id == gene)
+
+ p2 <- ggplot(topn_counts_df_gene,
+ aes(x = sample, y = lognorm_counts, fill = .data[[column]])) +
+ scale_fill_viridis_d(option = "D")+
+ geom_violin(alpha=0.5,
+ position = position_dodge(width = .75), linewidth = 1, color = NA) +
+ geom_boxplot(notch = F, outlier.size = -1, color="black",
+ lwd=1, alpha = 0.7,show.legend = F,width=0.1)+
+ ggbeeswarm::geom_quasirandom(shape = 21,size=2, dodge.width = .75,
+ color="black",
+ alpha=.8,show.legend = F)
+
+ print(p2)
+
+ cat("\n\n")
+ }
+
+ cat("\n\n")
+}
+
+```
+
+# Methods
+
+Seurat ([package](https://satijalab.org/seurat/), [paper](https://www.nature.com/articles/s41587-023-01767-y)) is used to aggregate the single cell expression data into pseudobulk samples, and DESeq2 ([package](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html), [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8)) is used to perform statistical analysis.
# R session
@@ -417,4 +457,3 @@ List and version of tools used for the QC report generation.
```{r}
sessionInfo()
```
-
diff --git a/inst/templates/singlecell/README.md b/inst/templates/singlecell/README.md
index dbcd27d..23d21bf 100644
--- a/inst/templates/singlecell/README.md
+++ b/inst/templates/singlecell/README.md
@@ -22,6 +22,20 @@ Currently we are working on deploying a shiny app to inspect the single cell obj
# Differential expression
+## pseudobulk approach
+
+`scRNA_pseudobulk.Rmd` is a template that performs pseudobulk differential expression analysis using DESeq2. It takes as input:
+
+- information.R, containing basic facts about the project, analysis, and scientific question
+- Seurat object containing both raw counts and log normalized data
+- a metadata variable of interest and two levels of that variable to be compared
+- a column in the seurat object metadata containing cluster names/numbers
+- the name/number of the cluster of interest
+
+The template aggregates single cell counts to the sample x cluster x factor_of_interest pseudobulk level, subsets to the cluster of interest, performs the desired statistical comparisons, and visualizes the results.
+
+## MAST - single cell approach
+
`differential_expression/scRNA_MAST.Rmd` is a template to visualize differentially expressed genes (DEG) results generated from MAST analysis. Main visualizations include:
- Group-level mean expression shown in heatmap;