03-single-cell-with-bioconductor.Rmd

```{r echo = FALSE}
knitr::opts_chunk$set(out.width = "100%")
```

# (PART\*) Single Cell with Bioconductor {-}   

# Overview {#single-cell-with-bioconductor-overview}

The `iSEE` Bioconductor package is a powerful tool for visualizing and exploring single-cell RNA-seq data. The iSEE viewer provides an interactive interface for exploring gene expression patterns, performing dimensionality reduction, clustering cells, visualizing metadata, and conducting differential expression analysis. 

In this demo, we'll learn how to launch the iSEE viewer in AnVIL's RStudio environment.

## Skills Level

::: {.notice}
_Genetics_

**Beginner**: some genetics knowledge helpful

_Programming skills_

**Beginner**: some programming experience helpful
:::

## Learning Objectives

1. Launch Terra
1. Clone Workspace
1. Launch RStudio-Bioconductor maintained cloud environment
1. Launch iSEE viewer
1. Create plots for investigating expression of different cell types
1. Shut down RStudio

# Preparation {#single-cell-with-bioconductor-preparation}

If you plan to follow along with these exercises, there are a couple of things you will need to take care of first:

## Review Background

If you aren't already familiar with RStudio, Bioconductor, and single cell RNA sequencing data analysis, we encourage you to check out our background slides [here](https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit?usp=sharing).

## Create AnVIL account

You will need an AnVIL account in order to view Workspaces and run analyses.

- If you do not already have an account, follow [these instructions](https://jhudatascience.org/AnVIL_Book_Getting_Started/overview-analysts.html) to set one up.  (You do not need to link any external accounts for these exercises.)
- Make sure that your Instructor (if participating in a workshop) or PI / Lab Manager has your username, so that they can add you to an appropriate *Billing Project*.  You can't clone or create Workspaces on AnVIL without a Billing Project.

## Clone Workspace

When you "clone" a copy of an AnVIL Workspace, it can take a few minutes for everything to propagate to your new Workspace. If you are participating in a course or workshop, your instructor may have you start by cloning the Workspace, so that it is ready by the time you need it. (If you are working at your own pace, feel free to come back to this step later, when you're ready to start using the Workspace.)

Follow the instructions below to clone your own copy of the Workspace for this Demo.

:::: {.borrowed_chunk}
```{r, echo = FALSE, results='asis'}
# Specify variables
AnVIL_module_settings <- list(
  workspace_name = "demos-combine-data-workspaces",
  workspace_link = "https://anvil.terra.bio/#workspaces/anvil-outreach/demos-combine-data-workspaces"
)

cow::borrow_chapter(
  doc_path = "child/_child_student_workspace_clone.Rmd",
  repo_name = "jhudsl/AnVIL_Template"
)
```
::::

Now your Workspace should be ready for you by the time you need it below.  You are ready to begin!

## Start Cloud Environment

You will need to launch the interactive RStudio environment to proceed.

### Video Overview

```{r, echo = FALSE, results='asis'}
cow::borrow_chapter(
  doc_path = "child/_child_rstudio_video.Rmd",
  repo_name = "jhudsl/AnVIL_Template"
)
```

### Launching RStudio

```{r, echo = FALSE, results='asis'}
cow::borrow_chapter(
  doc_path = "child/_child_rstudio_launch.Rmd",
  repo_name = "jhudsl/AnVIL_Template"
)
```

# Exercises {#single-cell-with-bioconductor-exercises}

The following exercises will walk you through the process of using iSEE to explore single cell RNA-seq data. You will use data that have already been prepared and is saved in a SingleCellExperiment object. You will use the `pbmc3k` dataset from the `TENxPBMCData` package, which contains gene expression profiles for 2,700 single peripheral blood mononuclear cells. Information on how the data were preprocessed can be found [here](https://isee.github.io/iSEEWorkshopEuroBioc2020/articles/dataset.html).

These exercises were adapted from [iSEEWorkshopEuroBioc2020/](https://isee.github.io/iSEEWorkshopEuroBioc2020/).

:::{.notice}
To follow along with these exercises, you will need to complete the steps described in the [Preparation](#scale-with-workflows-preparation) guide for this demo.
:::

## Loading the Data

First, we'll want to load the data that we'll be using. To do that, we'll need to load the `AnVIL` package created by the Bioconductor team for interfacing with files on AnVIL. Luckily, it's already installed.

```{r, eval=F}
library(AnVIL)
```

Next, we'll use the `avfiles_restore()` function from the `AnVIL` package to actually bring the data into our environment's persistent disk. 

```{r, eval=F}
avfiles_restore( 
  source = "sc_bioconductor_data.RData",
  namespace = "anvil-outreach",
  name = "demos-single-cell-bioconductor"
)
```

In the code above, we are copying the file `sc_bioconductor_data.RData` from the Workspace `demos-single-cell-bioconductor`. It was created under the `anvil-outreach` Billing Project.

## Installing `iSEE`

Once you've loaded the data, you'll want to install and load the `iSEE` library. You can easily install it into your own personal RStudio environment using the pre-installed `BiocManager` commands.

```{r, eval=F}
BiocManager::install('iSEE')
library(iSEE)
```

## Panel Display

First we will take a look at the interactive plots that iSEE can display.

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g24f58a64a6b_0_53")
```

## Visualize Cell Type Assignment

Next, let's focus specifically on visualizing cell type assignment by cluster membership. The goal is to identify the predominant cell type in each cluster. We can do this by plotting the column data in a `ColumnDataPlot`.

First, select the panel organization button and select "Organize panels".

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g254b1a8b951_0_12")
```

Remove all plots except for the column data plot. This will make things easier to view. Change the width to `12`.

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g24f58a64a6b_0_114")
```

You should now see a large scatter plot. Select "Data parameters" underneath the plot.

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g254b1a8b951_0_100")
```

First, select "labels_fine" under "Column of interest (Y-axis)". Directly below, select the "Column data" button for "X-axis". Once the dropdown menu appears for "Column of interest (X-axis)", select "Cluster". 

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g24f58a64a6b_0_124")
```

Since both cell annotations and the cluster are categorical, `iSEE` will generate a visual representation of a matrix called a "Hinton plot".

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g24f58a64a6b_0_129")
```

:::{.notice}
Now we know that cluster 4 contains almost all the cells that were annotated as classical monocytes. On the other hand, T cells can be found in multiple clusters.
:::

We can also save the R code used to create our iSEE plots. This helps make our work reproducible!

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g24f58a64a6b_0_134")
```

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g24f58a64a6b_0_139")
```

## Visualize Expression of a Single Gene

Now let's take a look at the expression data of a single gene across all the clusters. We can use the "Feature assay plot" panel to plot the distribution of the logcount values for a particular gene.

Click on the “Organize Panels” icon in the top right corner. Remove the “column plot” and choose “feature assay plot”. Change the width of the plot to 12.

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g254b1a8b951_0_33")
```

You should now see a rather underwhelming bar plot. We still need to change the data parameters, so click on the “Data parameters” box.

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g254b1a8b951_0_130")
```

Next, change the “Y-axis feature” to “LYZ”. This is the gene whose expression we’ll be examining. Change the feature selection box to “logcounts.” 

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g254b1a8b951_0_60")
```

:::{.notice}
The **LYZ** gene encodes an enzyme called **lysozyme**, which plays a crucial role in the immune system's defense against bacterial infections. The primary function of lysozyme is to break down bacterial cell walls. 

The highest levels of LYZ gene expression are typically observed in tissues with direct contact with the external environment, such as the epithelial cells of the respiratory tract, gastrointestinal tract, and genitourinary system. These tissues are often exposed to potential pathogens, and the expression of lysozyme helps provide an additional line of defense against bacterial invasion.
:::

Next, click the “column data” button under the “X-axis” header. Finally, choose “Cluster” from the drop down menu of “X-axis column data.”

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g254b1a8b951_0_66")
```

Now we have a much more exciting violin plot of LYZ gene expression levels across the 14 clusters in our dataset. LYZ is expressed more in clusters 4, 8, 9, and 13. We might be interested in also displaying cell type information in this plot, which we can do using the Visual Parameters options. Click the “Visual parameters” box.

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g254b1a8b951_0_77")
```

Make sure “Color” is checked in the first row. Choose “Column Data” under the “Color by” options and change the drop down menu to “labels_main”. We could also choose to color the data by “labels_fine”.

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g254b1a8b951_0_46")
```

We should see the dots in our violin plot colored by cell type annotation. Three of the clusters which have higher LYZ expression contain large numbers of cells identified as monocytes. Since LYZ codes for a human lysosome protein and is often used as a marker gene for monocytes, this makes a lot of sense. 

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g254b1a8b951_0_50")
```

:::{.notice}
**Monocytes** are a subset of white blood cells that play a pivotal role in our immune defense against infections. Upon encountering an infection or inflammation, they migrate from the bloodstream to the affected tissues, where they differentiate into specialized cells that engulf and eliminate pathogens.
:::

Let’s download this plot. Click on the “download” button in the top right corner like we did before, but this time choose “Download panel output”. A box will pop up asking you to choose which plots to download. This means you could have multiple plots being displayed in your panel but only choose to download a subset of them. Make sure “Feature assay plot” is checked and click “Download”. Your figure will be saved in a zip file in your Downloads folder.

```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/18J1x5rkWJisiv3EOMVZOma52XWxAVKANTg0Q2FXbOWU/edit#slide=id.g254b1a8b951_0_54")
```

::: {.reflection}
**Your turn!**

CD14 is a marker gene for the same type of cells as LYZ. Does it have the same cluster expression pattern as what we saw for LYZ?
:::

##  Get Session Info

It's a good idea to document information about the packages (and their versions) you used while running the analysis.  The last codeblock uses the `sessionInfo()` command to do just that. Here's an example of what that might look like:

```{r}
sessionInfo()
```

## Shutting Down

:::: {.borrowed_chunk}
```{r, echo = FALSE, results='asis'}
cow::borrow_chapter(
  doc_path = "child/_child_rstudio_delete.Rmd",
  repo_name = "jhudsl/AnVIL_Template"
)
```
::::

# Instructor Guide

Coming soon!