Merge pull request #85 from bcbio/feature_pseudobulk_plot_ALB

plot DEGs at single cell level, docs
bcbio · Feb 3, 2025 · 6c333b8 · 6c333b8
2 parents 7ce927d + b108c1a
commit 6c333b8
Show file tree

Hide file tree

Showing 2 changed files with 98 additions and 45 deletions.
diff --git a/inst/templates/singlecell/02_differential_expression/scRNA_pseudobulk.Rmd b/inst/templates/singlecell/02_differential_expression/scRNA_pseudobulk.Rmd
@@ -21,11 +21,9 @@ params:
    contrasts: !r list(c("age", "YOUNG", "OLD"))
    cluster_name: "2"
    resolution_column: "integrated_snn_res.0.4"
-
 ---
 
-Template developed with materials from https://hbctraining.github.io/main/.
-
+Template developed with materials from <https://hbctraining.github.io/main/>.
 
 ```{r, message=FALSE, warning=FALSE}
 # This set up the working directory to this file so all files can be found
@@ -41,7 +39,6 @@ stopifnot(compareVersion(as.character(BiocManager::version()), "3.18")>=0)
 stopifnot(compareVersion(as.character(packageVersion("Seurat")), "5.0.0")>=0)
 ```
 
-
 ```{r setup, cache=FALSE, message=FALSE, warning=FALSE, echo=FALSE}
 knitr::opts_chunk$set(echo = TRUE)
 # Load libraries'
@@ -86,6 +83,7 @@ opts_chunk[["set"]](
     cache.lazy = FALSE,
     error = TRUE,
     echo = TRUE,
+    warning = FALSE,
     fig.height = 5L,
     fig.retina = 2L,
     fig.width = 9.6,
@@ -94,18 +92,17 @@ opts_chunk[["set"]](
     warning = TRUE
 )
 
+source(params$project_file)
+
 seurat_obj <- params$seurat_obj
 column <- params$column
 contrasts <- params$contrasts
 cluster_name <- params$cluster_name
 resolution_column <- params$resolution_column
 ```
 
-
-
 This code is in this ![](https://img.shields.io/badge/status-alpha-yellow) revision.
 
-
 # Overview
 
 -   Project: `r project`
@@ -114,19 +111,18 @@ This code is in this ![](https://img.shields.io/badge/status-alpha-yellow) revis
 -   Experiment: `r experiment`
 -   Aim: `r aim`
 
-
 Here we will apply a pseudobulk approach to look for differentially expressed genes in one of our cell types.
 
 Using a pseudobulk approach involves the following steps:
 
-1. Subsetting to the cells for the cell type(s) of interest to perform the DE analysis.
-2. Extracting the raw counts after QC filtering of cells to be used for the DE analysis.
-3. Aggregating the counts and metadata to the sample level.
-4. Performing the DE analysis (you need at least two biological replicates per condition to perform the analysis, but more replicates are recommended).
+1.  Subsetting to the cells for the cell type(s) of interest to perform the DE analysis.
+2.  Extracting the raw counts after QC filtering of cells to be used for the DE analysis.
+3.  Aggregating the counts and metadata to the sample level.
+4.  Performing the DE analysis (you need at least two biological replicates per condition to perform the analysis, but more replicates are recommended).
 
 ## Dataset
 
-```{r load_data, cache = TRUE}
+```{r load_data, cache = FALSE}
 # Loading object
 if (isUrl(seurat_obj)){
   seurat <- readRDS(url(seurat_obj))  
@@ -145,16 +141,13 @@ cols3 <- unlist(strsplit(paste(colsD, colsM, colsL, sep = "_"), "_"))
 cols2 <- c(unlist(strsplit(paste(colsD, colsM, sep = "_"), "_")), "deepskyblue2")
 ```
 
-
 After filtering, each sample contributed the following number of cells to the analysis:
 
 ```{r meta pre doub}
 table(seurat$orig.ident)
 ```
 
-
-
-# Aggregate counts 
+# Aggregate counts
 
 ## Aggregate metadata at the sample level
 
@@ -168,13 +161,10 @@ meta <- [email protected] %>%
 meta
 ```
 
-
-
 ## Aggregate counts
 
 To aggregate the counts, we will use the AggregateExpression() function from Seurat. It will take as input a Seurat object, and return summed counts ("pseudobulk") for each identity class. The default is to return a matrix with genes as rows, and identity classes as columns. We have set return.seurat to TRUE, which means rather than a matrix we will get an object of class Seurat. We have also specified which factors to aggregate on, using the group.by argument.
 
-
 ```{r}
 seurat$sample <- seurat$orig.ident
 
@@ -186,7 +176,6 @@ bulk <- AggregateExpression(
 )
 ```
 
-
 ## Add number of cells per sample per cluster to the metadata
 
 ```{r}
@@ -213,27 +202,27 @@ [email protected] <- meta_bulk
 #bulk[[column]] <- as.factor(bulk[[column]])
 # test for consistency
 
-all(Cells(bulk) == row.names([email protected]))
+stopifnot(all(Cells(bulk) == row.names([email protected])))
 
 [email protected] %>% head()
 ```
 
 # DE analysis with DESeq2
 
 ## Subset to cell type of interest
+
 ```{r}
 Idents(object = bulk) <- resolution_column
 bulk_touse <- subset(x = bulk, idents = cluster_name)
 ```
 
-
 ## Check that we have enough cells
 
 Before moving on to a pseudobulk DGE analysis, it is important to identify how many cells we aggregated for each sample. We need to make sure that we have enough cells per sample after subsetting to one celltype. We recommend 50 cells per sample to move forward with confidence.
 
 ```{r}
 ggplot([email protected], aes(x=orig.ident, y=n_cells)) +
-    geom_bar(stat="identity", color="black", aes_string(fill=column)) +
+    geom_bar(stat="identity", color="black", aes(fill=.data[[column]])) +
     theme_classic() +
     theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
     labs(x="Sample name", y="Number of cells") +
@@ -242,12 +231,11 @@ ggplot([email protected], aes(x=orig.ident, y=n_cells)) +
 
 ## Run DE analysis
 
-Differential gene expression analysis of count data was performed using the Bioconductor R package, DESeq2, which fits the count data to a negative binomial model. 
-
-Before fitting the model, we often look at a metric called dispersion, which is a measure for variance which also takes into consideration mean expression. A dispersion value is estimated for each individual gene, then 'shrunken' to a more accurate value based on expected variation for a typical gene exhibiting that level of expression. Finally, the shrunken dispersion value is used in the final GLM fit. 
+Differential gene expression analysis of count data was performed using the Bioconductor R package, DESeq2, which fits the count data to a negative binomial model.
 
-We use the below dispersion plot, which should show an inverse relationship between dispersion and mean expression, to get an idea of whether our data is a good fit for the model. 
+Before fitting the model, we often look at a metric called dispersion, which is a measure for variance which also takes into consideration mean expression. A dispersion value is estimated for each individual gene, then 'shrunken' to a more accurate value based on expected variation for a typical gene exhibiting that level of expression. Finally, the shrunken dispersion value is used in the final GLM fit.
 
+We use the below dispersion plot, which should show an inverse relationship between dispersion and mean expression, to get an idea of whether our data is a good fit for the model.
 
 ```{r}
 cluster_counts <- t(FetchData(bulk_touse, layer="counts", vars=rownames(bulk_touse)))
@@ -258,12 +246,14 @@ dds_to_use <- DESeqDataSetFromMatrix(cluster_counts, [email protected], desig
 de <- DESeq(dds_to_use)
 DESeq2::plotDispEsts(de)
 norm_matrix <- as.data.frame(counts(de, normalized=T))
-```
 
+vsd <- vst(de)
+vsd_matrix <- assay(vsd)
+```
 
 Because it is difficult to accurately detect and quantify the expression of lowly expressed genes, differences in their expression between treatment conditions can be unduly exaggerated after the model is fit. We correct for this so that gene LFC is not dependent overall on basal gene expression level.
 
-In cases there are multiple groups and conditions across groups is recommended to use dummy variables instead of interaction terms: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions.
+In cases there are multiple groups and conditions across groups is recommended to use dummy variables instead of interaction terms: <https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions>.
 
 The LRT is useful for testing multiple terms at once, for example testing 3 or more levels of a factor at once, or all interactions between two variables. The LRT for count data is conceptually similar to an analysis of variance (ANOVA) calculation in linear regression, except that in the case of the Negative Binomial GLM, we use an analysis of deviance (ANODEV), where the deviance captures the difference in likelihood between a full and a reduced model.
 
@@ -303,9 +293,9 @@ de_list=lapply(contrasts, function(contrast){
 
 This plot can help to:
 
-- Identify Differential Expression: Genes that show a significant log-fold change (M value away from 0) indicate changes in expression between conditions.
-- Assess Data Quality: The plot can help in identifying biases or systematic errors in the data. Ideally, most points should scatter around the M=0 line, indicating that there is no significant systematic difference between the conditions.
-- Visualize data dispersion: The distribution of points along the A-axis gives a sense of the spread of expression levels and any patterns or anomalies in the dataset.
+-   Identify Differential Expression: Genes that show a significant log-fold change (M value away from 0) indicate changes in expression between conditions.
+-   Assess Data Quality: The plot can help in identifying biases or systematic errors in the data. Ideally, most points should scatter around the M=0 line, indicating that there is no significant systematic difference between the conditions.
+-   Visualize data dispersion: The distribution of points along the A-axis gives a sense of the spread of expression levels and any patterns or anomalies in the dataset.
 
 ```{r after_lfc_shrink, results='asis', message=F, warning=F}
 for (contrast in names(de_list)){
@@ -320,7 +310,7 @@ for (contrast in names(de_list)){
 
 ## Volcano plot {.tabset}
 
-This volcano plot shows the genes that are significantly up- and down-regulated as a result of the analysis comparison. The points highlighted in red are genes that have padj < 0.05 and a log2-fold change > 1. Points in blue have a padj < 0.05 and a log2-fold change < 1 and points in green have a padj > 0.05 and a log2-fold change > 2. Grey points are non-significant. The dashed lines correspond to the cutoff values of log2 foldchance and padj that we have chosen.
+This volcano plot shows the genes that are significantly up- and down-regulated as a result of the analysis comparison. The points highlighted in red are genes that have padj \< 0.05 and a log2-fold change \> 1. Points in blue have a padj \< 0.05 and a log2-fold change \< 1 and points in green have a padj \> 0.05 and a log2-fold change \> 2. Grey points are non-significant. The dashed lines correspond to the cutoff values of log2 foldchance and padj that we have chosen.
 
 ```{r volcano_plot, fig.height=6, results='asis'}
 # degVolcano(res_mod[,c('lfc', 'padj')], plot_text = show) 
@@ -343,7 +333,6 @@ for (contrast in names(de_list)){
 }
 ```
 
-
 ## Heatmap {.tabset}
 
 ```{r heapmap, results='asis'}
@@ -367,7 +356,6 @@ for (contrast in names(de_list)){
 }
 ```
 
-
 ## Differentially Expressed Genes {.tabset}
 
 ```{r sig_genes_table, results='asis'}
@@ -381,34 +369,86 @@ for (contrast in names(de_list)){
 tagList(dt_list)
 ```
 
-## Plot top 16 genes {.tabset}
+## Plot top 16 genes - pseudobulk {.tabset}
 
-```{r top n DEGs, fig.height = 6, fig.width = 8, results='asis'}
+```{r top n DEGs pseudobulk, fig.height = 6, fig.width = 8, results='asis'}
 n = 16
 
 for (contrast in names(de_list)){
   cat("### ", contrast, "\n\n")
   res_sig = de_list[[contrast]][["sig"]]
   top_n <- res_sig %>% slice_min(order_by = padj, n = n, with_ties = F) %>% 
     dplyr::select(gene_id)
-  top_n_exp <- norm_matrix %>% as.data.frame() %>% 
+  top_n_exp <- vsd_matrix %>% as.data.frame() %>% 
     rownames_to_column('gene_id') %>%
     # dplyr::select(-group, -group_name) %>% 
-    pivot_longer(!gene_id, names_to = 'sample', values_to = 'log2_expression') %>%
+    pivot_longer(!gene_id, names_to = 'sample', values_to = 'normalized_counts') %>%
     right_join(top_n, relationship = "many-to-many") %>%
-    left_join(meta, by = c("sample"="orig.ident"))
+    left_join(meta_bulk, by = c("sample"="orig.ident"))
   
-  p1=ggplot(top_n_exp, aes_string(x = column, y = 'log2_expression')) +
+  p1=ggplot(top_n_exp, aes(x = .data[[column]], y = normalized_counts)) +
     geom_boxplot(outlier.shape = NA, linewidth=0.5, color="grey") + 
     geom_point() +
-    facet_wrap(~gene_id) + 
-    ggtitle(str_interp('Expression of Top ${n} DEGs')) +
+    facet_wrap(~gene_id, scales = 'free_y') + 
+    ggtitle(str_interp('Expression of Top ${n} DEGs - Pseudobulk')) +
     theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
   print(p1)
   cat("\n\n")
 }
 ```
 
+## Plot top 16 genes - single cell {.tabset}
+
+```{r top n DEGs single cell, fig.height = 6, fig.width = 8, results='asis'}
+
+n = 16
+
+for (contrast in names(de_list)){
+  cat("### ", contrast, "{.tabset} \n\n")
+  res_sig = de_list[[contrast]][["sig"]]
+  top_n <- res_sig %>% slice_min(order_by = padj, n = n, with_ties = F) %>% 
+    dplyr::select(gene_id)
+  
+  seurat_topn <- subset(seurat, features = top_n$gene_id)
+  seurat_topn_meta <- [email protected] %>% as.data.frame() %>%
+    rownames_to_column('barcode')
+  
+  topn_counts_df <- LayerData(object = seurat_topn, assay = "RNA", layer = "data") %>%
+  # topn_counts_df <- seurat_topn@assays$RNA@layers$data %>% 
+    as.data.frame() %>% rownames_to_column('gene_id') %>% 
+    pivot_longer(!gene_id, names_to = 'barcode', values_to = 'lognorm_counts') %>%
+      left_join(seurat_topn_meta)
+
+  for (gene in top_n$gene_id){
+    
+    cat("#### ", gene, "\n\n")
+    
+    topn_counts_df_gene <- topn_counts_df %>% filter(gene_id == gene)
+    
+    p2 <- ggplot(topn_counts_df_gene, 
+                 aes(x = sample, y = lognorm_counts, fill = .data[[column]])) +
+      scale_fill_viridis_d(option = "D")+
+      geom_violin(alpha=0.5, 
+                  position = position_dodge(width = .75), linewidth = 1, color = NA) +
+      geom_boxplot(notch = F,  outlier.size = -1, color="black",
+                   lwd=1, alpha = 0.7,show.legend = F,width=0.1)+
+      ggbeeswarm::geom_quasirandom(shape = 21,size=2, dodge.width = .75,
+                                   color="black",
+                                   alpha=.8,show.legend = F) 
+    
+    print(p2)
+    
+    cat("\n\n")
+  }
+  
+  cat("\n\n")
+}
+
+```
+
+# Methods
+
+Seurat ([package](https://satijalab.org/seurat/), [paper](https://www.nature.com/articles/s41587-023-01767-y)) is used to aggregate the single cell expression data into pseudobulk samples, and DESeq2 ([package](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html), [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8)) is used to perform statistical analysis.
 
 # R session
 
@@ -417,4 +457,3 @@ List and version of tools used for the QC report generation.
 ```{r}
 sessionInfo()
 ```
-
diff --git a/inst/templates/singlecell/README.md b/inst/templates/singlecell/README.md
@@ -22,6 +22,20 @@ Currently we are working on deploying a shiny app to inspect the single cell obj
 
 # Differential expression
 
+## pseudobulk approach
+
+`scRNA_pseudobulk.Rmd` is a template that performs pseudobulk differential expression analysis using DESeq2. It takes as input:
+
+-   information.R, containing basic facts about the project, analysis, and scientific question
+-   Seurat object containing both raw counts and log normalized data
+-   a metadata variable of interest and two levels of that variable to be compared
+-   a column in the seurat object metadata containing cluster names/numbers
+-   the name/number of the cluster of interest
+
+The template aggregates single cell counts to the sample x cluster x factor_of_interest pseudobulk level, subsets to the cluster of interest, performs the desired statistical comparisons, and visualizes the results.
+
+## MAST - single cell approach
+
 `differential_expression/scRNA_MAST.Rmd` is a template to visualize differentially expressed genes (DEG) results generated from MAST analysis. Main visualizations include:
 
 - Group-level mean expression shown in heatmap;