
describe outputs of each subworkflow and explain how to test varCA
this should allow users to fully reproduce our results
and it also includes a new schematic!
aryarm committed Jul 2, 2020
1 parent 48b9a5a commit 970c60e
Showing 5 changed files with 118 additions and 14 deletions.
2 changes: 1 addition & 1 deletion configs/README.md
@@ -53,7 +53,7 @@ You can choose which variant callers are used by providing their "caller IDs" in
4. Create new, trained classification models using the training data (see the [rules README](/rules#creating-your-own-trained-model) for instructions)

### [callers.yaml](callers.yaml)
The `prepare` subworkflow (and, by extension, the master pipeline) reads parameters specific to each variant caller from the `callers.yaml` config file. These parameters change the data that is output by the `prepare` subworkflow. As a general rule, changing any of the config options in the `callers.yaml` file will require that you [retrain the classification models](/rules#creating-your-own-trained-model) to recognize the new data that you are providing to the `classify` subworkflow.
The `prepare` subworkflow (and, by extension, the master pipeline) reads parameters specific to each variant caller from the `callers.yaml` config file. These parameters change the data that is output by the `prepare` subworkflow. As a general rule, changing any of the config options in the `callers.yaml` file will require that you [retrain the classification models](/rules#training-and-testing-varca) to recognize the new data that you are providing to the `classify` subworkflow.

For [each caller script in the callers directory](/callers) you can specify:

9 changes: 8 additions & 1 deletion configs/classify.yaml
@@ -37,7 +37,7 @@ data:
# We include this sample here if you ever want to create your own trained model:
SRR891269:
# this is our platinum genomes training data! but you can replace it with whatever data you create
path: data/indel.tsv.gz
path: data/indel.tsv.gz # if you are trying to reproduce our results, you'll probably want to replace this with something like out/merged_indel/SRR891269/odds.tsv.gz
truth: &truth pg-indel
filter: &filter
- 'gatk-indel~DP>10' # we only train on sites with a read depth above 10
@@ -51,6 +51,12 @@ data:
path: out/merged_indel/jurkat_chr1/final.tsv.gz
merged: out/merged_indel/jurkat_chr1/merge.tsv.gz
filter: *filter # use the same filtering expression as the SRR891269 data
# # uncomment the lines below if you want to reproduce our results
# SRR891269_evens:
# path: out/merged_indel/SRR891269/evens.tsv.gz
# merged: out/merged_indel/SRR891269/merge.tsv.gz
# truth: *truth # use the same truth caller ID as the SRR891269 data
# filter: *filter # use the same filtering expression as the SRR891269 data


# The unique ID (from "data" above) of the dataset you'd like to train the
@@ -83,6 +89,7 @@ tune: false
# serve as test data (see "truth" above).
# This config entry is required!
predict: [molt4_chr1, jurkat_chr1]
# predict: [molt4_chr1, jurkat_chr1, SRR891269_evens] # uncomment this line if you want to reproduce our results

# Callers to use when creating a VCF from the merged .tsv.gz file.
# Supply a list of caller names in the order in which they should be considered
88 changes: 77 additions & 11 deletions rules/README.md
@@ -1,42 +1,108 @@
# The varCA pipeline
![Pipeline Skeleton](https://drive.google.com/uc?export=view&id=1u1KGkAPXmQ1ez5I1XKN6veTTSni0qJe8)

The pipeline consists of the `prepare` (red) and `classify` (green) subworkflows. The prepare subworkflow runs a set of variant callers on the provided ATAC-seq data and prepares their output for use by the classify subworkflow, which uses a trained ensemble classifier to predict the existence of variants.
The prepare subworkflow can use FASTQ or BAM/BED files as input. The classify subworkflow can predict variants from the prepared datasets or use them for training/testing.
If a pre-trained model is available (orange), the two subworkflows can be executed together automatically via the master pipeline. However, the subworkflows must be executed separately for training and testing (see [below](#training-and-testing-varca)).
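For example, a minimal invocation of the master pipeline (assuming you've already filled out the config files and are running from the root of the repository, where the master [Snakefile](/Snakefile) lives) looks like this:

    snakemake --use-conda -j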

## The `prepare` subworkflow
The [`prepare` subworkflow](prepare.smk) is a [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for preparing data for the classifier. It uses ATAC-seq FASTQ (or BAM) files to generate a tab-delimited table containing variant caller output for every site in open chromatin regions of the genome. The `prepare` subworkflow uses the scripts in the [callers directory](callers) to run every variant caller in the ensemble.
The [`prepare` subworkflow](prepare.smk) is a [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for preparing data for the classifier. It generates a tab-delimited table containing variant caller output for every site in open chromatin regions of the genome. The `prepare` subworkflow uses the scripts in the [callers directory](callers) to run every variant caller in the ensemble.

### execution
The `prepare` subworkflow is included within the [master pipeline](/Snakefile) automatically. However, you can also execute the `prepare` subworkflow on its own, as a separate Snakefile.

First, make sure that you fill out the [`prepare.yaml`](/configs/prepare.yaml) and the [`callers.yaml`](/configs/callers.yaml) config files, which specify input and parameters for the `prepare` subworkflow. See the [config README](/configs) for more information.

To execute the workflow, just call Snakemake with `-s rules/prepare.smk`:
Then, just call Snakemake with `-s rules/prepare.smk`:

snakemake -s rules/prepare.smk --use-conda -j

The output of the `prepare` pipeline will be in `<output_directory>/merged_<variant_type>/<sample_ID>/final.tsv.gz`
### output
The primary outputs of the `prepare` pipeline will be in `<output_directory>/merged_<variant_type>/<sample_ID>/final.tsv.gz`. However, several intermediate directories and files are also generated:

- align/ - output from the BWA FASTQ alignment step and samtools PCR duplicate removal steps
- peaks/ - output from MACS2 and other files required for calling peaks
- callers/ - output from each [caller script](/callers) in the ensemble (see the [callers README](/callers/README.md) for more information) and the variant normalization and feature extraction steps
- merged_<variant_type>/ - all other output from the `prepare` subworkflow, including the merged and final datasets for each variant type (i.e., SNVs or indels)
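For example, assuming your output directory is `out`, your variant type is indels, and your sample ID is `SRR891269`, you could inspect the final table like this:

    # preview the first few rows and columns of the final, merged table
    zcat out/merged_indel/SRR891269/final.tsv.gz | cut -f 1-6 | head -4
    # count the number of sites it contains (excluding the header)
    zcat out/merged_indel/SRR891269/final.tsv.gz | tail -n +2 | wc -l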

## The `classify` subworkflow
The [`classify` subworkflow](classify.smk) is a [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for training and testing the classifier. It uses the TSV output from the `prepare` subworkflow. Its final output is a VCF containing predicted variants.

### execution
The `classify` subworkflow is included within the [master pipeline](/Snakefile) automatically. However, you can also execute the `classify` subworkflow on its own, as a separate Snakefile.

First, make sure that you fill out the [`classify.yaml`](/configs/classify.yaml) config file, which specifies input and parameters for the `classify` subworkflow. See the [config README](/configs) for more information.

To execute the workflow, just call Snakemake with `-s rules/classify.smk`:
Then, just call Snakemake with `-s rules/classify.smk`:

snakemake -s rules/classify.smk --use-conda -j

The output of the `classify` pipeline will be in `<output_directory>/classify/<sample_ID>/final.vcf.gz`
### output
The primary output of the `classify` subworkflow will be in `<output_directory>/classify/<sample_ID>/final.vcf.gz`. However, several intermediate files are also generated:

## Creating your own trained model
#### for all datasets

- subset.tsv.gz - the result of subsetting some callers from the input (only if requested)
- filter.tsv.gz - the result of filtering rows from the input (only if requested)
- prepared.tsv.gz - datasets that are ready to be interpreted by the ensemble classifier

#### for prediction datasets

- results.tsv.gz - the predicted variants in TSV format
- results.vcf.gz - the predicted variants in VCF format with recalibrated QUAL scores
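For example, you can peek at the predicted variants for the `jurkat_chr1` dataset (the path assumes the default output layout described above):

    # show the first few predicted variants, skipping over the VCF header lines
    zcat out/classify/jurkat_chr1/results.vcf.gz | grep -v '^##' | head -5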

#### for training datasets

- model.rda - the trained classifier
- variable_importance.tsv - a table containing importance values for every feature in the RF classifier; visualize this with [`importance_plot.py`](/scripts/importance_plot.py) (see the sketch after this list)
- tune_matrix.tsv - the results from hyperparameter tuning (only if requested)
- tune.pdf - a visualization of the results in tune_matrix.tsv
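As a sketch of how you might plot the variable importances mentioned above (the invocation below is hypothetical, so check `python scripts/importance_plot.py --help` for the actual arguments):

    # hypothetical invocation: visualize feature importances from a trained indel model
    python scripts/importance_plot.py out/classify/SRR891269/variable_importance.tsv importance.pdf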

#### for test datasets

- prc/results.pdf - precision-recall plot for evaluating the performance of varCA and comparing it to other variant callers
- prc/curves - data used to create the precision-recall plot
- prc/pts - single point metrics evaluating the performance of varCA and comparing it to other callers; you may aggregate these with [`metrics_table.py`](/scripts/metrics_table.py)
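For example, to collect every caller's point metrics for a test set into a single table sorted by f-beta score (assuming the default output layout and a test set named `jurkat_chr1`):

    # summarize all of the metrics files in the pts directory as one TSV
    python scripts/metrics_table.py out/classify/jurkat_chr1/prc/pts metrics.tsv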

# Training and Testing varCA
## Motivation
### Training
You may want to create your own trained models (rather than use the ones we provided in the example data) for any number of reasons. The most common are:

1. You changed one of the caller-specific parameters in the `callers.yaml` config file
2. You would like to use a different set of variant callers than the example models support
3. You'd like to include a new variant caller that hasn't already been implemented as a caller script in the [callers directory](/callers)
4. You'd like to fully reproduce our results (since the training and testing steps are usually skipped by the master pipeline)

In general, you will need to create a new trained model whenever the columns in the output of the `prepare` subworkflow change.
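A quick way to check whether those columns have changed is to compare the header of your new `final.tsv.gz` against the header of the data your current model was trained on. For example:

    # list the column names of the prepare subworkflow's output, one per line
    zcat out/merged_indel/SRR891269/final.tsv.gz | head -1 | tr '\t' '\n'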
### Testing
You may want to test a trained model if:

1. You want to create precision-recall curves and generate performance metrics for every variant caller in the ensemble
2. You want to compare the performance of varCA to other variant callers
3. You'd like to fully reproduce our results (since the training and testing steps are usually skipped by the master pipeline)

## Creating your own trained model
For the sake of this example, let's say you'd like to include a new indel variant caller (i.e., #3 above). You've also already followed the directions in the [callers README](/callers/README.md) to create your own caller script, and you've modified the `prepare.yaml` and `callers.yaml` config files to include your new indel caller. However, before you can predict variants using the indel caller, you must create a new trained classification model that knows how to interpret your new input.

To do this, we recommend downloading the training data and truth set we used to create our model. First, download the [GM12878 FASTQ files from Buenrostro et al](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47753). Specify the path to these files in `data/samples.tsv`, the samples file within the example data. Then, download the corresponding [Platinum Genomes VCF for that sample](https://www.illumina.com/platinumgenomes.html).
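As a sketch, assuming you have [sra-tools](https://github.com/ncbi/sra-tools) installed, the FASTQ files for the GM12878 run used in our training data can be downloaded like this:

    # download the GM12878 ATAC-seq run used in our example training data
    prefetch SRR891269
    fasterq-dump SRR891269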

Fill out the `prepare.yaml` config file with the necessary information to run it on the GM12878 sample. Add `pg-indel` as the last caller under the `indel_callers` list, so that the `prepare` subworkflow will know to include the Platinum Genomes VCF in its output. Also, include the path to your Platinum Genomes VCF in the `callers.yaml` config file under `pg`, `pg-snp`, and `pg-indel`. Next, execute the `prepare` subworkflow on its own (see above).

After the `prepare` subworkflow has finished running, add the sample (specifically, the path to its `final.tsv.gz` file) as a dataset in the `classify.yaml` file and specify its attributes. Most of the yaml for this has already been written for you. In particular, make sure you've specified `pg-indel` (for the Platinum Genomes VCF) as the truth caller ID. Next, execute the `classify` subworkflow on its own (see above).

The `classify` subworkflow can only create one trained model at a time, so you will need to repeat these steps if you'd also like to create a trained model for SNVs. When you do, make sure to replace every mention of "indel" in `classify.yaml` with "snp".
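Since those replacements are mostly mechanical, one way to draft the SNV version of the config is a simple `sed` command (a sketch; `classify-snp.yaml` is just an illustrative name, and you should review the result by hand before pointing Snakemake at it, e.g. via its `--configfile` option):

    # create an SNV copy of the classify config, then inspect it before use
    sed 's/indel/snp/g' configs/classify.yaml > configs/classify-snp.yaml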

## Testing your model
We recommend testing your trained model before using it. You can include test sets as datasets in the `classify.yaml` file.

For the sake of this example, let's say you'd like to include a new indel variant caller (i.e., #3). You've also already followed the directions in the [callers README](/callers/README.md) to create your own caller script, and you've modified the `prepare.yaml` and `callers.yaml` config files to include your new indel caller. However, before you can predict variants using the indel caller, you must create a new trained classification model that knows how to interpret your new input.
For this example, we will demonstrate how you can test the model described [above](#creating-your-own-trained-model). You should do these steps after running the `prepare` subworkflow but before running the `classify` subworkflow.

To do this, we recommend first downloading the training and truth sets we used to create our model. First, download the [GM12878 FASTQ files from Buenrostro et al](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47753). Then, download the corresponding [Platinum Genomes VCF for that sample](https://www.illumina.com/platinumgenomes.html).
First, split the output of the `prepare` subworkflow by chromosome parity using `awk` commands like this:

Fill out the `prepare.yaml` config file with the necessary information to run it on the GM12878 sample. Add `pg-indel` as the last caller under the `indel_callers` list. Also, include the path to your Platinum Genomes VCF in the `callers.yaml` config file under `pg`, `pg-snp`, and `pg-indel`. Next, execute the `prepare` subworkflow on its own (see above).
zcat out/merged_indel/SRR891269/final.tsv.gz | { read -r head && echo "$head" && awk -F $'\t' -v OFS='\t' '$1 ~ /^[0-9]+$/ && $1%2'; } | gzip > out/merged_indel/SRR891269/odds.tsv.gz
zcat out/merged_indel/SRR891269/final.tsv.gz | { read -r head && echo "$head" && awk -F $'\t' -v OFS='\t' '$1 ~ /^[0-9]+$/ && !($1%2)'; } | gzip > out/merged_indel/SRR891269/evens.tsv.gz
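To confirm the split behaved as expected, you can list the chromosomes that ended up in each file (the first column holds the chromosome, and the `awk` filters above keep only numeric ones):

    # should print only odd chromosome numbers
    zcat out/merged_indel/SRR891269/odds.tsv.gz | tail -n +2 | cut -f1 | sort -un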

After the `prepare` subworkflow has finished running, add the sample (specifically, the path to its `final.tsv.gz` file) as a dataset in the `classify.yaml` file. Most of the yaml for this has already been written for you. Next, execute the `classify` subworkflow on its own (see above).
Then, specify in `classify.yaml` the path to the `odds.tsv.gz` file as your training set and the path to `evens.tsv.gz` as your test set. Make sure that both your training and test sets have a truth caller ID and that you've specified the test set ID in the list of `predict` datasets. Finally, execute the `classify` subworkflow on its own (see above).

We recommend testing your trained model before using it. You can include test sets as datasets in the `classify.yaml` file. For example, you could test on odd chromosomes by subsetting the `final.tsv.gz` file and providing that as a separate "predict" dataset.
When provided with a test set, the `classify` subworkflow will produce a plot and metrics to help you evaluate the performance of the trained model and compare it with the other variant callers in the ensemble. These will be stored in the test set's `prc` folder. The precision-recall plot will be named `results.pdf` and the metrics will appear in the `pts` directory. You can create the other plots and tables in our paper using the scripts in the [`scripts` directory](/scripts).
5 changes: 4 additions & 1 deletion scripts/README.md
@@ -5,7 +5,7 @@ However, you can use most of these scripts on their own, too. Some may even be h
All python scripts implement the `--help` argument. For bash, R, and awk scripts, you can run `head <script>` to read about their usage.

### [2vcf.py](2vcf.py)
A python script that uses files from the `prepare` and `classify` pipelines to create a VCF with the final, predicted variants.
A python script that uses files from the `prepare` and `classify` pipelines to create a VCF with the final, predicted variants. This script also has a special internal mode, which can be used for recalibrating the QUAL scores output in the VCF.

### [cgrep.bash](cgrep.bash)
A bash script for extracting columns from TSVs via `grep`. Every argument besides the first is passed directly to `grep`.
@@ -28,6 +28,9 @@ A python script for creating plots of the importance of each variable (ie feature
### [metrics.py](metrics.py)
A python script for calculating evaluation metrics on a two-column TSV of binary labels: truth and predictions.

### [metrics_table.py](metrics_table.py)
A python script for summarizing multiple metrics files output by `metrics.py` in a nicely formatted table.

### [norm_nums.awk](norm_nums.awk)
A fast awk script for ensuring that unusual numerical values in a large TSV can be read by R.

28 changes: 28 additions & 0 deletions scripts/metrics_table.py
@@ -0,0 +1,28 @@
#!/usr/bin/env python3

import sys
import argparse
import pandas as pd
from pathlib import Path


parser = argparse.ArgumentParser(description="Creates a summary metrics.tsv table containing performance metrics for varCA and the variant callers in the ensemble.")
parser.add_argument(
    "pts", type=Path, help="path to the pts dir within your test set's classify/prc/ directory"
)
parser.add_argument(
    "out", nargs='?', default=sys.stdout,
    help="the filename to which to save the metrics table (default: stdout)"
)
args = parser.parse_args()

# read each caller's metrics into its own column, renaming 'breakca'
# (varCA's internal name) to 'varca'
df = pd.concat([
    pd.read_csv(f, sep="\t", header=None, names=['varca' if f.stem == 'breakca' else f.stem])
    for f in args.pts.rglob('*.txt')
], axis=1)
# label each row with the metric it holds, in the order output by metrics.py
df.index = ['recall', 'precision', 'f-beta', 'total positives', 'total negatives', 'auroc', 'avg precision']
# sort the callers from best to worst f-beta score
df.sort_values('f-beta', axis=1, inplace=True, ascending=False)

df.to_csv(args.out, sep="\t")
