Organize analysis for publication #2

Open · wants to merge 22 commits into `main`
10 changes: 4 additions & 6 deletions .gitignore
@@ -1,10 +1,8 @@
.*
!.git*
t*
_*
*.out

env/
data/.*
results/*/.*

.*
!.gitignore

results/*/.*
54 changes: 32 additions & 22 deletions README.md
@@ -1,44 +1,54 @@
# Deep mutational scanning of ZIKV NS3 protein
Experiments by Blake Richardson and Matt Evans.
Analysis by Caroline Kikawa, adapting a pipeline by Jesse Bloom and David Bacsik.
# Deep mutational scanning of ZIKV NS2B/NS3 protease

## Analysis pipeline overview
This `Snakemake` takes input deep sequencing reads and processes them using the `dms_tools2` package written by the Bloom lab. For this project, a deep mutational scanning library was created in discrete 'tiles' across the NS2B/NS3 protein from Zika virus. A pool of virus particles expressing these variants was selected by passaging on cells. The pre-selection counts of each variant (e.g., their frequency in the plasmid library) were compared to their counts in the cell-passaging selected libraries. The *amino acid preferences* or `prefs` are the result of this comparison. We then calculate *mutational effects* or `muteffects` for each variant by taking the log ratio of the variant mutation versus wild-type.
Experiments by **Blake Richardson** and **Matt Evans**. Analysis by **Caroline Kikawa** with help from **Will Hannon**.

## Analysis

A deep mutational scanning library was created in 3 discrete non-overlapping 'tiles' across the ZIKV NS2B/NS3 protease. The resulting pool of virus particles expressing these variants was selected by passaging on cells. The counts of each variant in the plasmid library were compared to their counts after passaging using a pipeline adapted from a `Snakemake` pipeline written by Jesse Bloom and David Bacsik. The pipeline takes deep sequencing reads as input and processes them using [`dms_tools2`](https://jbloomlab.github.io/dms_tools2/). The **amino acid preferences** or `prefs` are calculated from the counts of each variant pre- and post-selection by passaging on cells. We then calculate **mutational effects** or `muteffects` for each variant by taking the log ratio of the variant mutation versus wild type.
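The preference and log-ratio calculation described above can be sketched as follows. This is a hypothetical minimal example with made-up counts, not the actual `dms_tools2` implementation (which uses a more sophisticated statistical model):

```python
import math

# Hypothetical counts for one site (wild type "A"); not real data.
pre_counts = {"A": 5000, "V": 4000, "D": 1000}   # plasmid library
post_counts = {"A": 9000, "V": 3600, "D": 200}   # after cell passaging

def preferences(pre, post):
    """Amino-acid preferences: per-variant enrichment (post-selection
    frequency / pre-selection frequency), normalized to sum to 1."""
    pre_total, post_total = sum(pre.values()), sum(post.values())
    enrich = {aa: (post[aa] / post_total) / (pre[aa] / pre_total) for aa in pre}
    norm = sum(enrich.values())
    return {aa: e / norm for aa, e in enrich.items()}

prefs = preferences(pre_counts, post_counts)

# Mutational effect: log ratio of the mutant preference versus wild type.
wt = "A"
muteffects = {aa: math.log2(prefs[aa] / prefs[wt]) for aa in prefs if aa != wt}
```

Here a negative `muteffect` means the mutation is depleted relative to wild type during selection.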

## Results
For a summary of the results, see [results/summary/](results/summary/), which has Markdown summaries for the analysis of each tile (e.g., [results/summary/dms_tile_1_analysis.md](results/summary/dms_tile_1_analysis.md), etc).

Other results are placed in [./results/](results), although not all files are tracked in the GitHub repo. Again, these files are sub-divided by tile and analysis (e.g., [results/tile_1/prefs](results/tile_1/prefs), [results/tile_1/muteffects](results/tile_1/muteffects), etc).
All results are located in [results/](results), although large files are not tracked in the GitHub repo. Results files are sub-divided by tile and analysis (e.g., [results/tile_1/prefs](results/tile_1/prefs), [results/tile_1/muteffects](results/tile_1/muteffects), etc).

See [results/summary/](results/summary/) for markdown summaries of the analysis for each tile (e.g., [results/summary/dms_tile_1_analysis.md](results/summary/dms_tile_1_analysis.md), etc).

To view the data on 3D protein structures of the NS3/NS2B protease, download [this file](/data/dms-viz/output/ZIKV-NS2B-NS3-DMS.json) and upload it to [`dms-viz`](https://dms-viz.github.io/v0/).

## Running analysis
First activate the *ZIKV_DMS_NS5_EvansLab* [conda](https://docs.conda.io/projects/conda/en/latest/index.html) environment for the analysis.
If you have not already created this environment, build it from [environment.yml](ZIKV_DMS_NS3_EvansLab) with:

conda env create -f environment.yml
First activate the *ZIKV_DMS_NS3_EvansLab* [conda](https://docs.conda.io/projects/conda/en/latest/index.html) environment for the analysis.
If you have not already created this environment, build it from [environment.yml](environment.yml) with:

```bash
conda env create -f environment.yml
```

Then activate the environment with:

conda activate ZIKV_DMS_NS3_EvansLab
```bash
conda activate ZIKV_DMS_NS3_EvansLab
```

The analysis is run by the [snakemake](https://snakemake.readthedocs.io/) pipeline in [Snakefile](Snakefile).
Essentially, this pipeline runs the Jupyter notebook [dms_tile_analysis.ipynb](dms_tile_analysis.ipynb) for each deep mutational scanning tile, with the tile information specified in [config.yml](config.yml).
To run the pipeline using 36 jobs, use the command:
The analysis is run by the [`snakemake`](https://snakemake.readthedocs.io/) pipeline in [Snakefile](Snakefile). Essentially, this pipeline runs the Jupyter notebook [dms_tile_analysis.ipynb](dms_tile_analysis.ipynb) for each deep mutational scanning tile, with the tile information specified in [config.yml](config.yml).

snakemake -j 36
To run the pipeline using 36 jobs, use the command:

Add the `--keep-incomplete` flag if you don't want to delete results on an error.
```bash
snakemake --keep-incomplete -j 36
```
To run on the Hutch cluster using `slurm`, do:

sbatch -c 36 run_Snakemake.bash

```bash
sbatch -c 36 run_analysis.bash
```

## Input data
The input data are in [./data/](data):

- `./data/tile_*_amplicon.fasta`: amplicons for each tile of the barcoded-subamplicon sequencing.
The input data are in [data/](data):

- `./data/tile_*_subamplicon_alignspecs.txt`: the alignment specs for the [barcoded subamplicon sequencing](https://jbloomlab.github.io/dms_tools2/bcsubamp.html) for each amplicon.
- `data/tile_*_amplicon.fasta`: amplicons for each tile of the barcoded-subamplicon sequencing.

- `./data/tile_*_samplelist.csv`: all the samples that we sequenced and the locations of the associated deep-sequencing data for each amplicon.
- `data/tile_*_subamplicon_alignspecs.txt`: the alignment specs for the [barcoded subamplicon sequencing](https://jbloomlab.github.io/dms_tools2/bcsubamp.html) for each amplicon.

- `data/tile_*_samplelist.csv`: all the samples that we sequenced and the locations of the associated deep-sequencing data for each amplicon.
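As a rough illustration of how a samplelist might be consumed downstream, here is a sketch using only the standard library. The column names (`name`, `selection`, `R1`) are assumptions for illustration; the real `tile_*_samplelist.csv` files may differ:

```python
import csv
import io

# Hypothetical samplelist contents; column names are assumed.
samplelist_csv = """name,selection,R1
tile_1_plasmid,pre,/shared/ngs/tile1_plasmid_R1.fastq.gz
tile_1_passage,post,/shared/ngs/tile1_passage_R1.fastq.gz
"""

samples = list(csv.DictReader(io.StringIO(samplelist_csv)))

# Split pre-selection (plasmid) samples from post-selection passages,
# which is the pairing the prefs calculation needs.
pre = [s["name"] for s in samples if s["selection"] == "pre"]
post = [s["name"] for s in samples if s["selection"] == "post"]
```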

76 changes: 59 additions & 17 deletions Snakefile
@@ -1,38 +1,42 @@
"""``snakemake`` pipeline that runs analysis."""
"""
Pipeline that runs the ZIKV NS2B/NS3 analysis for each tile.
Authors: Caroline Kikawa, Will Hannon, and David Bacsik
"""

#### ----------------------- Imports ----------------------- ####

import os

import pandas as pd

#### -------------------- Configuration -------------------- ####

configfile: 'config.yml'

#### ----------------------- Targets ----------------------- ####

wildcard_constraints:
    tile=r"tile_\d+"

rule all:
input:
expand("results/{tile}",
tile=config['tiles']),
"results/summary/all_tiles_effects_and_preferences.csv",
"results/summary/all_tiles_effects_and_preferences_with_stops.csv",
expand("results/summary/dms_{tile}_analysis.md",
tile=config['tiles']),
expand("results/summary/dms_{tile}_analysis.html",
tile=config['tiles']),
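The `wildcard_constraints` regex above keeps the `{tile}` wildcard from matching non-tile paths such as `results/summary`. Its behavior can be checked with Python's `re` module (Snakemake applies constraints as full matches):

```python
import re

# Same pattern as the Snakefile's wildcard_constraints.
tile_re = re.compile(r"tile_\d+")

def is_tile(value):
    """Full-match check, mirroring how Snakemake applies constraints."""
    return tile_re.fullmatch(value) is not None

assert is_tile("tile_1")
assert is_tile("tile_12")
assert not is_tile("summary")        # results/summary is never a {tile}
assert not is_tile("tile_1_extra")   # fullmatch rejects trailing text
```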

#### ------------------------ Rules ------------------------ ####

rule jupnb_to_md:
"""Convert Jupyter notebook to Markdown format."""
input: notebook="results/notebooks/{notebook}.ipynb"
output: markdown="results/summary/{notebook}.md"
params: outdir=lambda wildcards, output: os.path.dirname(output.markdown)
conda: 'environment.yml'
shell:
rule clean:
shell:
"""
jupyter nbconvert \
--output-dir {params.outdir} \
--to markdown \
{input.notebook}
rm -rf logs/
rm -rf tmp/
rm -f slurm*.out
"""


rule dms_tile_analysis:
"""Analyze DMS data for a tile."""
input:
@@ -47,5 +51,43 @@ rule dms_tile_analysis:
threads: config['max_cpus']
conda: 'environment.yml'
log: notebook='results/notebooks/dms_{tile}_analysis.ipynb'
notebook: 'dms_tile_analysis.py.ipynb'

notebook: 'notebooks/dms_tile_analysis.py.ipynb'


rule combine_all_tiles:
input: expand("results/{tile}", tile=config['tiles'])
output: without_stops_csv = "results/summary/all_tiles_effects_and_preferences.csv",
with_stops_csv = "results/summary/all_tiles_effects_and_preferences_with_stops.csv",
params: tiles = config['tiles']
conda: 'environment.yml'
log: notebook='results/notebooks/combine_all_tiles.ipynb'
notebook: 'notebooks/combine_all_tiles.ipynb'


rule jupnb_to_md:
"""Convert Jupyter notebook to Markdown format."""
input: notebook="results/notebooks/{notebook}.ipynb"
output: markdown="results/summary/{notebook}.md"
params: outdir=lambda wildcards, output: os.path.dirname(output.markdown)
conda: 'environment.yml'
shell:
"""
jupyter nbconvert \
--output-dir {params.outdir} \
--to markdown \
{input.notebook}
"""

rule jupnb_to_html:
"""Convert Jupyter notebook to HTML format."""
input: notebook="results/notebooks/{notebook}.ipynb"
output: html="results/summary/{notebook}.html"
params: outdir=lambda wildcards, output: os.path.dirname(output.html)
conda: 'environment.yml'
shell:
"""
jupyter nbconvert \
--output-dir {params.outdir} \
--to html \
{input.notebook}
"""
8 changes: 8 additions & 0 deletions cluster.yml
@@ -0,0 +1,8 @@
# cluster configuration for running Snakefile on Hutch cluster

__default__:
cpus: 4
partition: campus-new
time: 0-2
mem: 32000
name: "{rule}"
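Entries in a cluster config like this are typically expanded per rule into a submission command, with Snakemake filling `{rule}` with the executing rule's name. The sketch below shows one plausible expansion; the exact `--cluster` template used on the Hutch cluster is an assumption, not taken from this repo:

```python
# Hypothetical expansion of the cluster profile into an sbatch command.
cluster_config = {
    "__default__": {
        "cpus": 4,
        "partition": "campus-new",
        "time": "0-2",
        "mem": 32000,
        "name": "{rule}",   # Snakemake substitutes the rule name here
    }
}

def sbatch_command(rule):
    """Build an sbatch line for a rule from the default profile."""
    cfg = dict(cluster_config["__default__"])
    cfg["name"] = cfg["name"].format(rule=rule)
    return (
        "sbatch -c {cpus} -p {partition} -t {time} --mem {mem} -J {name}"
    ).format(**cfg)

cmd = sbatch_command("dms_tile_analysis")
```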
3 changes: 3 additions & 0 deletions data/dms-viz/datasets.csv
@@ -0,0 +1,3 @@
input,sitemap,name,metric,metric_name,structure,included_chains,excluded_chains,title,exclude_amino_acids
../data/dms-viz/input/ZIKV_NS2B_NS3_DMS.csv,../data/dms-viz/sitemap/5GJ4_sitemap.csv,ZIKV NS2B-NS3 (Open),log2effect,Log2(Effect),5GJ4,polymer,none,ZIKV NS2B-NS3 (Open) DMS,*
../data/dms-viz/input/ZIKV_NS2B_NS3_DMS.csv,../data/dms-viz/sitemap/5LC0_sitemap.csv,ZIKV NS2B-NS3 (Closed),log2effect,Log2(Effect),5LC0,polymer,none,ZIKV NS2B-NS3 (Closed) DMS,*
12 changes: 12 additions & 0 deletions data/dms-viz/environment.yml
@@ -0,0 +1,12 @@
name: dms-viz

channels:
- conda-forge
- bioconda

dependencies:
- python
- pandas
- pip
- pip:
- configure-dms-viz