Organize analysis for publication #2

Open · wants to merge 22 commits into `main`
10 changes: 4 additions & 6 deletions .gitignore
@@ -1,10 +1,8 @@
.*
!.git*
t*
_*
*.out

env/
data/.*
results/*/.*

.*
!.gitignore

results/*/.*
54 changes: 32 additions & 22 deletions README.md
@@ -1,44 +1,54 @@
# Deep mutational scanning of ZIKV NS3 protein
Experiments by Blake Richardson and Matt Evans.
Analysis by Caroline Kikawa, adapting a pipeline by Jesse Bloom and David Bacsik.
# Deep mutational scanning of ZIKV NS2B/NS3 protease

## Analysis pipeline overview
This `Snakemake` takes input deep sequencing reads and processes them using the `dms_tools2` package written by the Bloom lab. For this project, a deep mutational scanning library was created in discrete 'tiles' across the NS2B/NS3 protein from Zika virus. A pool of virus particles expressing these variants was selected by passaging on cells. The pre-selection counts of each variant (e.g., their frequency in the plasmid library) were compared to their counts in the cell-passaging selected libraries. The *amino acid preferences* or `prefs` are the result of this comparison. We then calculate *mutational effects* or `muteffects` for each variant by taking the log ratio of the variant mutation versus wild-type.
Experiments by **Blake Richardson** and **Matt Evans**. Analysis by **Caroline Kikawa** with help from **Will Hannon**.

## Analysis

A deep mutational scanning library was created in 3 discrete non-overlapping 'tiles' across the ZIKV NS2B/NS3 protease. The resulting pool of virus particles expressing these variants was selected by passaging on cells. The counts of each variant in the plasmid library were compared to their counts after passaging using a pipeline adapted from a `Snakemake` pipeline written by Jesse Bloom and David Bacsik. The pipeline takes deep sequencing reads as input and processes them using [`dms_tools2`](https://jbloomlab.github.io/dms_tools2/). The **amino acid preferences** or `prefs` are calculated from the counts of each variant pre- and post-selection by passaging on cells. We then calculate **mutational effects** or `muteffects` for each variant by taking the log ratio of the variant mutation versus wild type.
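The preference and log-ratio calculation described above can be sketched as follows. This is a hypothetical minimal example with made-up counts, not the actual `dms_tools2` implementation (which uses a more sophisticated statistical model):

```python
import math

# Hypothetical counts for one site (wild type "A"); not real data.
pre_counts = {"A": 5000, "V": 4000, "D": 1000}   # plasmid library
post_counts = {"A": 9000, "V": 3600, "D": 200}   # after cell passaging

def preferences(pre, post):
    """Amino-acid preferences: per-variant enrichment (post-selection
    frequency / pre-selection frequency), normalized to sum to 1."""
    pre_total, post_total = sum(pre.values()), sum(post.values())
    enrich = {aa: (post[aa] / post_total) / (pre[aa] / pre_total) for aa in pre}
    norm = sum(enrich.values())
    return {aa: e / norm for aa, e in enrich.items()}

prefs = preferences(pre_counts, post_counts)

# Mutational effect: log ratio of the mutant preference versus wild type.
wt = "A"
muteffects = {aa: math.log2(prefs[aa] / prefs[wt]) for aa in prefs if aa != wt}
```

Here a negative `muteffect` means the mutation is depleted relative to wild type during selection.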

## Results
For a summary of the results, see [results/summary/](results/summary/), which has Markdown summaries for the analysis of each tile (e.g., [results/summary/dms_tile_1_analysis.md](results/summary/dms_tile_1_analysis.md), etc).

Other results are placed in [./results/](results), although not all files are tracked in the GitHub repo. Again, these files are sub-divided by tile and analysis (e.g., [results/tile_1/prefs](results/tile_1/prefs), [results/tile_1/muteffects](results/tile_1/muteffects), etc).
All results are located in [results/](results), although large files are not tracked in the GitHub repo. Results files are sub-divided by tile and analysis (e.g., [results/tile_1/prefs](results/tile_1/prefs), [results/tile_1/muteffects](results/tile_1/muteffects), etc).

See [results/summary/](results/summary/) for markdown summaries of the analysis for each tile (e.g., [results/summary/dms_tile_1_analysis.md](results/summary/dms_tile_1_analysis.md), etc).

To view the data on 3D protein structures of the NS3/NS2B protease, download [this file](/data/dms-viz/output/ZIKV-NS2B-NS3-DMS.json) and upload it to [`dms-viz`](https://dms-viz.github.io/v0/).

## Running analysis
First activate the *ZIKV_DMS_NS5_EvansLab* [conda](https://docs.conda.io/projects/conda/en/latest/index.html) environment for the analysis.
If you have not already created this environment, build it from [environment.yml](ZIKV_DMS_NS3_EvansLab) with:

conda env create -f environment.yml
First activate the *ZIKV_DMS_NS3_EvansLab* [conda](https://docs.conda.io/projects/conda/en/latest/index.html) environment for the analysis.
If you have not already created this environment, build it from [environment.yml](environment.yml) with:

```bash
conda env create -f environment.yml
```

Then activate the environment with:

conda activate ZIKV_DMS_NS3_EvansLab
```bash
conda activate ZIKV_DMS_NS3_EvansLab
```

The analysis is run by the [snakemake](https://snakemake.readthedocs.io/) pipeline in [Snakefile](Snakefile).
Essentially, this pipeline runs the Jupyter notebook [dms_tile_analysis.ipynb](dms_tile_analysis.ipynb) for each deep mutational scanning tile, with the tile information specified in [config.yml](config.yml).
To run the pipeline using 36 jobs, use the command:
The analysis is run by the [`snakemake`](https://snakemake.readthedocs.io/) pipeline in [Snakefile](Snakefile). Essentially, this pipeline runs the Jupyter notebook [dms_tile_analysis.ipynb](dms_tile_analysis.ipynb) for each deep mutational scanning tile, with the tile information specified in [config.yml](config.yml).

snakemake -j 36
To run the pipeline using 36 jobs, use the command:

Add the `--keep-incomplete` flag if you don't want to delete results on an error.
```bash
snakemake --keep-incomplete -j 36
```
To run on the Hutch cluster using `slurm`, do:

sbatch -c 36 run_Snakemake.bash

```bash
sbatch -c 36 run_analysis.bash
```

## Input data
The input data are in [./data/](data):

- `./data/tile_*_amplicon.fasta`: amplicons for each tile of the barcoded-subamplicon sequencing.
The input data are in [data/](data):

- `./data/tile_*_subamplicon_alignspecs.txt`: the alignment specs for the [barcoded subamplicon sequencing](https://jbloomlab.github.io/dms_tools2/bcsubamp.html) for each amplicon.
- `data/tile_*_amplicon.fasta`: amplicons for each tile of the barcoded-subamplicon sequencing.

- `./data/tile_*_samplelist.csv`: all the samples that we sequenced and the locations of the associated deep-sequencing data for each amplicon.
- `data/tile_*_subamplicon_alignspecs.txt`: the alignment specs for the [barcoded subamplicon sequencing](https://jbloomlab.github.io/dms_tools2/bcsubamp.html) for each amplicon.

- `data/tile_*_samplelist.csv`: all the samples that we sequenced and the locations of the associated deep-sequencing data for each amplicon.
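As a rough illustration of how a samplelist might be consumed downstream, here is a sketch using only the standard library. The column names (`name`, `selection`, `R1`) are assumptions for illustration; the real `tile_*_samplelist.csv` files may differ:

```python
import csv
import io

# Hypothetical samplelist contents; column names are assumed.
samplelist_csv = """name,selection,R1
tile_1_plasmid,pre,/shared/ngs/tile1_plasmid_R1.fastq.gz
tile_1_passage,post,/shared/ngs/tile1_passage_R1.fastq.gz
"""

samples = list(csv.DictReader(io.StringIO(samplelist_csv)))

# Split pre-selection (plasmid) samples from post-selection passages,
# which is the pairing the prefs calculation needs.
pre = [s["name"] for s in samples if s["selection"] == "pre"]
post = [s["name"] for s in samples if s["selection"] == "post"]
```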

76 changes: 59 additions & 17 deletions Snakefile
@@ -1,38 +1,42 @@
"""``snakemake`` pipeline that runs analysis."""
"""
Pipeline that runs the ZIKV NS2B/NS3 analysis for each tile.
Authors: Caroline Kikawa, Will Hannon, and David Bacsik
"""

#### ----------------------- Imports ----------------------- ####

import os

import pandas as pd

#### -------------------- Configuration -------------------- ####

configfile: 'config.yml'

#### ----------------------- Targets ----------------------- ####

wildcard_constraints:
    tile=r"tile_\d+"

rule all:
input:
expand("results/{tile}",
tile=config['tiles']),
"results/summary/all_tiles_effects_and_preferences.csv",
"results/summary/all_tiles_effects_and_preferences_with_stops.csv",
expand("results/summary/dms_{tile}_analysis.md",
tile=config['tiles']),
expand("results/summary/dms_{tile}_analysis.html",
tile=config['tiles']),
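The `wildcard_constraints` regex above keeps the `{tile}` wildcard from matching non-tile paths such as `results/summary`. Its behavior can be checked with Python's `re` module (Snakemake applies constraints as full matches):

```python
import re

# Same pattern as the Snakefile's wildcard_constraints.
tile_re = re.compile(r"tile_\d+")

def is_tile(value):
    """Full-match check, mirroring how Snakemake applies constraints."""
    return tile_re.fullmatch(value) is not None

assert is_tile("tile_1")
assert is_tile("tile_12")
assert not is_tile("summary")        # results/summary is never a {tile}
assert not is_tile("tile_1_extra")   # fullmatch rejects trailing text
```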

#### ------------------------ Rules ------------------------ ####

rule jupnb_to_md:
"""Convert Jupyter notebook to Markdown format."""
input: notebook="results/notebooks/{notebook}.ipynb"
output: markdown="results/summary/{notebook}.md"
params: outdir=lambda wildcards, output: os.path.dirname(output.markdown)
conda: 'environment.yml'
shell:
rule clean:
shell:
"""
jupyter nbconvert \
--output-dir {params.outdir} \
--to markdown \
{input.notebook}
rm -rf logs/
rm -rf tmp/
rm -f slurm*.out
"""


rule dms_tile_analysis:
"""Analyze DMS data for a tile."""
input:
@@ -47,5 +51,43 @@ rule dms_tile_analysis:
threads: config['max_cpus']
conda: 'environment.yml'
log: notebook='results/notebooks/dms_{tile}_analysis.ipynb'
notebook: 'dms_tile_analysis.py.ipynb'

notebook: 'notebooks/dms_tile_analysis.py.ipynb'


rule combine_all_tiles:
input: expand("results/{tile}", tile=config['tiles'])
output: without_stops_csv = "results/summary/all_tiles_effects_and_preferences.csv",
with_stops_csv = "results/summary/all_tiles_effects_and_preferences_with_stops.csv",
params: tiles = config['tiles']
conda: 'environment.yml'
log: notebook='results/notebooks/combine_all_tiles.ipynb'
notebook: 'notebooks/combine_all_tiles.ipynb'


rule jupnb_to_md:
"""Convert Jupyter notebook to Markdown format."""
input: notebook="results/notebooks/{notebook}.ipynb"
output: markdown="results/summary/{notebook}.md"
params: outdir=lambda wildcards, output: os.path.dirname(output.markdown)
conda: 'environment.yml'
shell:
"""
jupyter nbconvert \
--output-dir {params.outdir} \
--to markdown \
{input.notebook}
"""

rule jupnb_to_html:
"""Convert Jupyter notebook to HTML format."""
input: notebook="results/notebooks/{notebook}.ipynb"
output: html="results/summary/{notebook}.html"
params: outdir=lambda wildcards, output: os.path.dirname(output.html)
conda: 'environment.yml'
shell:
"""
jupyter nbconvert \
--output-dir {params.outdir} \
--to html \
{input.notebook}
"""
8 changes: 8 additions & 0 deletions cluster.yml
@@ -0,0 +1,8 @@
# cluster configuration for running Snakefile on Hutch cluster

__default__:
cpus: 4
partition: campus-new
time: 0-2
mem: 32000
name: "{rule}"
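Entries in a cluster config like this are typically expanded per rule into a submission command, with Snakemake filling `{rule}` with the executing rule's name. The sketch below shows one plausible expansion; the exact `--cluster` template used on the Hutch cluster is an assumption, not taken from this repo:

```python
# Hypothetical expansion of the cluster profile into an sbatch command.
cluster_config = {
    "__default__": {
        "cpus": 4,
        "partition": "campus-new",
        "time": "0-2",
        "mem": 32000,
        "name": "{rule}",   # Snakemake substitutes the rule name here
    }
}

def sbatch_command(rule):
    """Build an sbatch line for a rule from the default profile."""
    cfg = dict(cluster_config["__default__"])
    cfg["name"] = cfg["name"].format(rule=rule)
    return (
        "sbatch -c {cpus} -p {partition} -t {time} --mem {mem} -J {name}"
    ).format(**cfg)

cmd = sbatch_command("dms_tile_analysis")
```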
3 changes: 3 additions & 0 deletions data/dms-viz/datasets.csv
@@ -0,0 +1,3 @@
input,sitemap,name,metric,metric_name,structure,included_chains,excluded_chains,title,exclude_amino_acids
../data/dms-viz/input/ZIKV_NS2B_NS3_DMS.csv,../data/dms-viz/sitemap/5GJ4_sitemap.csv,ZIKV NS2B-NS3 (Open),log2effect,Log2(Effect),5GJ4,polymer,none,ZIKV NS2B-NS3 (Open) DMS,*
../data/dms-viz/input/ZIKV_NS2B_NS3_DMS.csv,../data/dms-viz/sitemap/5LC0_sitemap.csv,ZIKV NS2B-NS3 (Closed),log2effect,Log2(Effect),5LC0,polymer,none,ZIKV NS2B-NS3 (Closed) DMS,*
12 changes: 12 additions & 0 deletions data/dms-viz/environment.yml
@@ -0,0 +1,12 @@
name: dms-viz

channels:
- conda-forge
- bioconda

dependencies:
- python
- pandas
- pip
- pip:
- configure-dms-viz