Skip to content

Commit

Permalink
Updates scanpy labs 05 & 06
Browse files Browse the repository at this point in the history
  • Loading branch information
asabjorklund committed Jan 19, 2024
1 parent b673039 commit 67fa9fd
Show file tree
Hide file tree
Showing 2 changed files with 40 additions and 32 deletions.
23 changes: 7 additions & 16 deletions labs/scanpy/scanpy_05_dge.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,7 @@ if not os.path.exists(path_results):
os.makedirs(path_results, exist_ok=True)
# path_file = "data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad"
path_file = "data/covid/results/scanpy_clustered_covid.h5ad"
path_file = "data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad"
if not os.path.exists(path_file):
urllib.request.urlretrieve(os.path.join(
path_data, 'covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad'), path_file)
Expand All @@ -66,14 +66,12 @@ print(adata.raw.X[:10,:10])

As you can see, the X matrix only contains the variable genes, while the raw matrix contains all genes.

Printing a few of the values in adata.raw.X shows that the raw matrix is not normalized.
Printing a few of the values in adata.raw.X shows that the raw matrix is normalized.

For DGE analysis we would like to run with all genes, but on normalized values, so we will have to revert back to the raw matrix and renormalize.
For DGE analysis we would like to run with all genes, on normalized values, so we will have to revert back to the raw matrix. In case you have raw counts in the matrix you also have to renormalize and logtransform.

```{python}
adata = adata.raw.to_adata()
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
sc.pp.log1p(adata)
```

Now lets look at the clustering of the object we loaded in the umap. We will use louvain_0.6 clustering in this exercise.
Expand Down Expand Up @@ -134,7 +132,7 @@ venn3([set(wc),set(tt),set(tt_ov)], ('Wilcox','T-test','T-test_ov') )
plt.show()
```

As you can see, the Wilcoxon test and the T-test with overestimated variance gives very similar result. Also the regular T-test has good overlap, while the Logistic regression gives quite different genes.
As you can see, the Wilcoxon test and the T-test with overestimated variance gives very similar result. Also the regular T-test has good overlap.

## Visualization

Expand Down Expand Up @@ -173,7 +171,7 @@ sc.pl.stacked_violin(adata, mynames, groupby = 'louvain_0.6')
{{< meta dge_cond_1 >}}

```{python}
cl1 = adata[adata.obs['louvain_0.6'] == '4',:]
cl1 = adata[adata.obs['louvain_0.6'] == '5',:]
cl1.obs['type'].value_counts()
sc.tl.rank_genes_groups(cl1, 'type', method='wilcoxon', key_added = "wilcoxon")
Expand Down Expand Up @@ -246,8 +244,6 @@ sc.pl.violin(cl1, genes2, groupby='sample')

As you can see, many of the genes detected as DGE in Covid are unique to one or 2 patients.

We can examine more genes with a DotPlot:

We can also plot the top Covid and top Ctrl genes as a dotplot:

```{python}
Expand All @@ -270,14 +266,11 @@ cl1.obs['sample'].value_counts()

So one obvious thing to consider is an equal amount of cells per individual so that the DGE results are not dominated by a single sample.

So we will downsample to an equal number of cells per sample.
So we will downsample to an equal number of cells per sample, in this case 34 cells per sample as it is the lowest number among all samples

```{python}
cl1.obs['sample'].value_counts()
```

```{python}
target_cells = 50
target_cells = 37
tmp = [cl1[cl1.obs['sample'] == s] for s in cl1.obs['sample'].cat.categories]
Expand Down Expand Up @@ -305,8 +298,6 @@ sc.pl.dotplot(cl1,genes, groupby='sample')

It looks much better now. But if we look per patient you can see that we still have some genes that are dominated by a single patient. Still, it is often a good idea to control the number of cells from each sample when doing differential expression.

Why do you think this is?

There are many different ways to try and resolve the issue of patient batch effects, however most of them require R packages. These can be run via rpy2 as is demonstraded in this compendium:
https://www.sc-best-practices.org/conditions/differential_gene_expression.html

Expand Down
49 changes: 33 additions & 16 deletions labs/scanpy/scanpy_06_celltyping.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ if not os.path.exists(path_results):
os.makedirs(path_results, exist_ok=True)
# path_file = "data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad"
path_file = "data/covid/results/scanpy_clustered_covid.h5ad"
path_file = "data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad"
if not os.path.exists(path_file):
urllib.request.urlretrieve(os.path.join(
path_data, 'covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad'), path_file)
Expand All @@ -70,6 +70,17 @@ adata = adata[adata.obs["sample"] == "ctrl_13",:]
print(adata.shape)
```

```{python}
adata.obs["louvain_0.6"].value_counts()
```

As you can see, we have only one cell from cluster 10 in this sample, so lets remove that cell for now.

```{python}
adata = adata[adata.obs["louvain_0.6"] != "10",:]
```


```{python}
sc.pl.umap(
adata, color=["louvain_0.6"], palette=sc.pl.palettes.default_20
Expand Down Expand Up @@ -206,6 +217,15 @@ sc.pl.umap(adata_merged, color=["sample","louvain",'predicted'])
sc.pl.umap(adata_merged[adata_merged.obs['sample']=='ctrl'], color='predicted')
```

Now plot how many cells of each celltypes can be found in each cluster.

```{python}
tmp = pd.crosstab(adata.obs['louvain_0.6'],adata.obs['predicted'], normalize='index')
tmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.8, 1),loc='upper right')
```


## Ingest

Another method for celltype prediction is Ingest, for more information, please look at
Expand All @@ -216,6 +236,15 @@ sc.tl.ingest(adata, adata_ref, obs='louvain')
sc.pl.umap(adata, color=['louvain','louvain_0.6'], wspace=0.5)
```

Now plot how many cells of each celltypes can be found in each cluster.

```{python}
tmp = pd.crosstab(adata.obs['louvain_0.6'],adata.obs['louvain'], normalize='index')
tmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.8, 1),loc='upper right')
```


## Compare results

The predictions from ingest is stored in the column 'louvain' while we
Expand Down Expand Up @@ -246,14 +275,12 @@ if not os.path.exists(path_file):
```{python}
df = pd.read_table(path_file)
df
```
```{python}
# Filter for number of genes per celltype
print(df.shape)
```

```{python}
# Filter for number of genes per celltype
df['nG'] = df.geneSymbol.str.split(",").str.len()
df = df[df['nG'] > 5]
Expand All @@ -263,20 +290,16 @@ print(df.shape)
```

```{python}
#| eval: false
# this chunk has issues and therefore not evaluated
df.index = df.cellName
gene_dict = df.geneSymbol.str.split(",").to_dict()
```

```{python}
# run differential expression per cluster
sc.tl.rank_genes_groups(adata, 'louvain_0.6', method='wilcoxon', key_added = "wilcoxon")
```

```{python}
# | eval: false
# this chunk has issues and therefore not evaluated
# do gene set overlap to the groups in the gene list and top 300 DEGs.
import gseapy
Expand Down Expand Up @@ -304,17 +327,11 @@ for cl in adata.obs['louvain_0.6'].cat.categories.tolist():


```{python}
# | eval: false
# this chunk has issues and therefore not evaluated
# prediction per cluster
pred
```

```{python}
# | eval: false
# this chunk has issues and therefore not evaluated
prediction = [pred[x] for x in adata.obs['louvain_0.6']]
adata.obs["GS_overlap_pred"] = prediction
Expand Down

0 comments on commit 67fa9fd

Please sign in to comment.