Tests passing, merge with devel_python3 #4

Open
wants to merge 34 commits into master
Commits (34)
00700bb
Filter tests migrated to python 3
troycomi Mar 20, 2019
4fe85ec
Filter 1 working with single gz file using seek
troycomi Mar 20, 2019
1a94214
Filter 1 running, matching output
troycomi Mar 21, 2019
f77e1ef
Filter 1 with region yield
troycomi Mar 21, 2019
42d91dd
Filter 2 refactor
troycomi Mar 22, 2019
a1c6155
Adding docstrings and return types
troycomi Mar 22, 2019
8844007
Filter 2 thresholds refactor
troycomi Mar 22, 2019
7c82f95
Summarize Strain under test, refactored
troycomi Mar 26, 2019
3267582
Working on documentation
troycomi Apr 1, 2019
58634eb
Suppress log warnings
troycomi Apr 9, 2019
17d51bf
Merge branch 'devel_python3' into merge_wip
troycomi Apr 9, 2019
96d6cfb
Tests passing
troycomi Apr 9, 2019
4be5be7
Config yaml
troycomi Apr 10, 2019
9102a16
Working on predict main
troycomi Apr 19, 2019
0ee0ecf
Started Predict refactor
troycomi Apr 19, 2019
dfc0846
Translated predict_main
troycomi Apr 25, 2019
85eaac8
Fixed readme
troycomi Apr 25, 2019
29ae638
Tested predict main
troycomi Apr 26, 2019
9438339
Log file, progress bar
troycomi Apr 26, 2019
3fef252
Refactor id_regions
troycomi Apr 30, 2019
1bd8cf2
Moved validate code, required position
troycomi Apr 30, 2019
c42b6fd
Refactor summarize region quality
troycomi May 9, 2019
1ac85ed
Combined filter methods
troycomi May 13, 2019
2667a6a
Filter Regions Refactor
troycomi May 21, 2019
b00140a
Flake8 passing, summarize_strains refactored
troycomi Jun 1, 2019
21c52ef
Updated readme, removed unused argument parser
troycomi Jun 5, 2019
df69936
Update readme
troycomi Jun 5, 2019
d113279
Working on setting up travisci and codecov
troycomi Jun 10, 2019
8b266e5
Travis debugging
troycomi Jun 10, 2019
deaea99
Travis debugging
troycomi Jun 10, 2019
000385d
Travis debugging
troycomi Jun 10, 2019
a56b5a0
Travis debugging
troycomi Jun 10, 2019
1cfd268
Travis debugging
troycomi Jun 10, 2019
3c2fa43
Badges in readme
troycomi Jun 10, 2019
3 changes: 3 additions & 0 deletions .gitignore
@@ -33,3 +33,6 @@ code/*/scratch/*
code/setup/*
.coverage
*.swp
*egg-info
tags
*.log
25 changes: 25 additions & 0 deletions .travis.yml
@@ -0,0 +1,25 @@
language: python
python:
- "3.6"
before_install:
  cd code
install:
- sudo apt-get update
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
- bash miniconda.sh -b -p $HOME/miniconda
- export PATH="$HOME/miniconda/bin:$PATH"
- hash -r
- conda config --set always_yes yes --set changeps1 no
- conda update -q conda
# Useful for debugging any issues with conda
- conda info -a

- conda env create -q -n test-environment --file environment.yml
- source activate test-environment
- pip install codecov

script:
  pytest --cov --cov-config .coveragerc --flake8

after_success:
- codecov
199 changes: 199 additions & 0 deletions README.md
@@ -0,0 +1,199 @@
[![Build Status](https://travis-ci.com/troycomi/introgression.svg?branch=master)](https://travis-ci.com/troycomi/introgression)
[![codecov](https://codecov.io/gh/troycomi/introgression/branch/master/graph/badge.svg)](https://codecov.io/gh/troycomi/introgression)

# introgression
> Discovering yeast admixture through sequencing

## Background
TBA

## Installation
All required packages are specified in the conda environment located in
`code/environment.yml`. The introgression environment can be generated with
```bash
conda env create -f environment.yml
```
To access the command line bindings of the main analyze class,
install the package in editable mode with
```bash
conda activate introgression
pip install --editable .
```
while in the code directory.

## Usage

### Configuration
A set of initial parameters is provided in `code/config.yaml`; these need to
be set for your system and dataset.

Strings of the form `__KEY__`
are substituted during execution and are used as a shortcut. For example,
with `output_root` set to `/data/results`, the value `__OUTPUT_ROOT__/genes/`
becomes `/data/results/genes/`.

Strings of the form `{state}` are used for wildcards within the code. Their
location and surrounding characters can change, but the wildcard must be the
same. For example, `blocks_{state}.txt` can be changed to
`{state}_with-block.txt` but not `blocks_{st}.txt`.
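
As an illustration, a minimal configuration file might look like the sketch
below; only the output root, the chromosome list, and the known states are
mentioned in this README, so the exact key names and remaining entries are
assumptions.
```bash
# A minimal, hypothetical config.yaml sketch. The key names (output_root,
# chromosomes, known_states, blocks) and all values are illustrative
# placeholders, not the project's actual schema.
cat > config.yaml << 'EOF'
output_root: /data/results
chromosomes: [I, II, III, IV]
known_states: [S288c, CBS432]
blocks: __OUTPUT_ROOT__/blocks_{state}.txt
EOF
```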

### Command Line
With the package installed and the conda environment activated, main methods
are accessed with the `introgression` command. Some documentation is provided
by adding `--help` to `introgression` or any of its subcommands.

#### introgression
Options include:
- --config: specify one or more configuration files. Files are evaluated in
order, and conflicting values are overwritten by the most recently listed file.
This allows a base configuration for the system, with analysis-specific
configurations added as needed.
- -v: verbosity is set by varying the number of v's attached to the option, with
`-v` indicating a log level of critical and `-vvvvv` indicating debug logging.
- --log-file: Optional location to store log information. Default is stdout.
If set and run on an interactive shell, some commands will display progress
bars.
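
For example, a top-level invocation combining a base configuration with an
analysis-specific one, increased verbosity, and a log file might look like the
following sketch (file names here are placeholders):
```bash
# Hypothetical top-level invocation; base.yaml, analysis.yaml and run.log are
# placeholder names, and <subcommand> stands for any subcommand listed below.
introgression --config base.yaml --config analysis.yaml -vvv \
    --log-file run.log <subcommand> [options]
```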

Most subcommand options will overwrite corresponding values in the config
file. Leaving an option unset without supplying a value in the config file
will raise an error. Some values, including the list of chromosomes and the
known states, can only be set through the config file.

Available subcommands are:
##### predict
The predict subcommand uses an HMM to predict regions of introgression from
alignment files. Several outputs are used in subsequent steps which refine
the predicted introgressed regions.

Test strains on which to predict introgression can be supplied in the config
file under the name `strains` or pulled from the directory structure of
`test_strains`.

Available options are:
- --alignment: input alignment file location with wildcards for
{prefix} (optional), {strain} and {chrom}.
- --prefix: An optional wildcard value for alignment files. If left blank,
will default to the known states joined with an underscore. Leaving the
{prefix} wildcard out of the alignment file will prevent its use as well.
- --blocks: An output file containing the regions predicted to belong to the
given state. Must contain the {state} wildcard, which will be populated with a
known state during analysis. Columns are the strain, chromosome, the
predicted state, start position, end position, and the number of sites
supporting the assignment.
- --test-strains: If strains are not provided in the config, this file with
{strain} and {chrom} wildcards will be used to populate the strains for
prediction.
- --hmm-initial: Output file with the initial parameters of the HMM for each
strain.
- --hmm-trained: Output file with HMM parameters following Baum-Welch training.
- --positions: Output file with indices of non-gapped, sequenced sites which
differ between the reference alignments.
- --probabilities: Output file with the probability of each position belonging
to the master reference strain.
- --threshold: The threshold value to apply when filtering the predicted HMM
path through the test sequence. Either a float, indicating cutoff probability,
or 'viterbi' to indicate the Viterbi algorithm should be used to find the most
likely sequence of states.
- --only-poly-sites/--all-sites: A switch to indicate if all non-gapped,
sequenced sites should be considered during HMM training, or only polymorphic
sites. Default is only polymorphic sites.
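
A sketch of a predict invocation is shown below; all paths and wildcard
patterns are placeholders rather than the project's actual layout.
```bash
# Hypothetical predict run; every path and wildcard pattern here is a
# placeholder chosen for illustration.
introgression --config config.yaml predict \
    --alignment "alignments/{prefix}_{strain}_chr{chrom}.fa" \
    --blocks "blocks_{state}.txt" \
    --hmm-initial hmm_initial.txt \
    --hmm-trained hmm_trained.txt \
    --positions positions.txt \
    --probabilities probabilities.txt \
    --threshold viterbi \
    --only-poly-sites
```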

##### id-regions
id-regions prepends a column to block files with a unique region id of the
form 'r#'. Regions are sorted by start position. Changing the states to label
will change the region numbers, as a different set of regions is considered.

Available options are:
- --blocks: The input file to label with the wildcard {state}. This is the
file produced by predict in the previous step.
- --labeled: The output file, also containing {state} wildcard.
- --state: May be specified multiple times to indicate which states to add
labels to. Leaving unset will use the states in the config file (recommended).
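
A minimal sketch of an id-regions invocation, with placeholder file names:
```bash
# Hypothetical id-regions run; file names are placeholders.
introgression --config config.yaml id-regions \
    --blocks "blocks_{state}.txt" \
    --labeled "blocks_{state}_labeled.txt"
```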

##### summarize-regions
Analyzes the regions predicted to be introgressed. Several columns are added
to the block file with information about each region, including the number
of sites matching each state.

Available options are:
- --state: May be set multiple times for each state to summarize. Leaving
unset will default to all states in the config file.
- --labeled: the labeled block file with {state} wildcard created in the
previous step.
- --masks: Sequence mask files with {strain} and {chrom} wildcards.
- --alignment: The input alignment file similar to the predict option.
- --positions: The position file created during predict.
- --quality: The output file with a {state} wildcard.
- --region: The alignment for each region in the labeled file with {state}
wildcard. Each state file contains all regions for the state.
- --region-index: A pickled Python dictionary used for random access into the
region file. Must have the {state} wildcard.
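
A sketch of a summarize-regions invocation, again using placeholder paths:
```bash
# Hypothetical summarize-regions run; paths and wildcard patterns are
# placeholders.
introgression --config config.yaml summarize-regions \
    --labeled "blocks_{state}_labeled.txt" \
    --masks "masks/{strain}_chr{chrom}.txt" \
    --alignment "alignments/{prefix}_{strain}_chr{chrom}.fa" \
    --positions positions.txt \
    --quality "quality_{state}.txt" \
    --region "regions_{state}.fa" \
    --region-index "regions_{state}.pkl"
```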

##### filter-regions
From the quality files produced by `summarize-regions`, filter out regions
with weak support for the alternative hypothesis and regions which can be
assigned to multiple alternative states.

Regions passing the 'introgressed filter' satisfy all of the following:
- fraction of gaps masked in reference > 0.5
- fraction of gaps masked in predicted state > 0.5
- number of matches to predicted > 7
- number of matches to predicted > number of matches to reference
- sequence identity with predicted state is higher than reference
- sequence identity with reference is > 0.7

Regions passing the 'ambiguous filter' match only the predicted state. No
other state has:
- sequence identity >= sequence identity with predicted state * threshold
- matching bases >= matching bases with predicted state * threshold

Available options are:
- --region: The region file from summarize-regions.
- --region-index: The region index file from summarize-regions.
- --quality: The quality file produced by summarize-regions with {state}
wildcard.
- --introgress-filter: The output file with only regions passing the
introgression filter. Must contain {state} wildcard.
- --introgress-inter: An output file with all regions, including the reason
each region was removed by the introgression filter, or blank if it passed.
Must contain {state} wildcard.
- --ambiguous-filter: Output file containing only regions which pass the
ambiguous filter after passing the introgression filter.
Must contain {state} wildcard.
- --ambiguous-inter: Contains all regions from the introgression filter with a
column for the reason the region failed ambiguous filtering. Must contain
{state} wildcard.
- --thresh: The threshold to apply to the ambiguous filter.
- --filter-sweep: If set and threshold values are supplied as arguments,
will output summary information for applying the ambiguous filter with various
threshold values.

`filter-regions` accepts multiple threshold values as arguments to test
and output to the `filter-sweep` file. Sample usage would be
```bash
introgression --config config.yml \
filter-regions \
--threshold 0.995 \
--filter-sweep sweep.txt \
0.99 0.98 0.8 # these are the sweep arguments
```
where 0.99, 0.98 and 0.8 are used as test threshold values as summarized in
sweep.txt. Note that the ambiguous filter will only use the threshold 0.995
in this example.

##### summarize-strains
summarize-strains produces summary information for each test strain, including
the number of regions and bases assigned to each hidden state, filtered at
each stage, and ambiguous between states.

Available options are:
- --introgress-inter: The introgressed filter file as used in `filter-regions`.
- --ambiguous-inter: The ambiguous filter file as used in `filter-regions`.
- --strain-info: Tab-separated table with information on the strain to include
with the summary output. Columns should be the strain name, alternate name,
location, environment, and population.
- --state-counts: The summary output file.
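
A sketch of a summarize-strains invocation with placeholder file names:
```bash
# Hypothetical summarize-strains run; file names are placeholders.
introgression --config config.yaml summarize-strains \
    --introgress-inter "introgressed_intermediate_{state}.txt" \
    --ambiguous-inter "ambiguous_intermediate_{state}.txt" \
    --strain-info strain_info.tsv \
    --state-counts state_counts.txt
```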

## License
TBD
25 changes: 14 additions & 11 deletions code/align/aggregate_alignment_stats.py
@@ -1,14 +1,17 @@
import os
import sys
sys.path.insert(0, '..')
import global_params as gp

gp_dir = '../'
stats_files = [gp_dir + gp.alignments_dir + x for x in filter(\
lambda x: 'stats' in x and 'summary' not in x, os.listdir(gp_dir + gp.alignments_dir))]
stats_files = [gp_dir + gp.alignments_dir + x for x in filter(
lambda x: 'stats' in x and 'summary' not in x,
os.listdir(gp_dir + gp.alignments_dir))]

# goal is to generate file for R (e.g. for two references and test strain):
# chromosome strain frac_S288c_S288c frac_S288c_CBS432 frac_S288c_x frac_CBS432_S288c frac_CBS432_CBS432 frac_CBS432_x frac_x_S288c frac_x_CBS432 frac_x_x aligned_length_S288c aligned_length_CBS432 aligned_length_x num_align_columns_0 num_align_columns_1 num_align_columns_2 num_align_columns_3
# chromosome strain frac_S288c_S288c frac_S288c_CBS432 frac_S288c_x
# frac_CBS432_S288c frac_CBS432_CBS432 frac_CBS432_x frac_x_S288c
# frac_x_CBS432 frac_x_x aligned_length_S288c aligned_length_CBS432
# aligned_length_x num_align_columns_0 num_align_columns_1
# num_align_columns_2 num_align_columns_3

f = open(gp_dir + gp.alignments_dir + 'mafft_stats_summary.txt', 'w')

@@ -25,15 +28,15 @@

for i in range(0, len(gp.alignment_ref_order) + 2):
f.write('\t' + 'num_align_columns_' + str(i))

f.write('\n')

all_strains = gp.alignment_ref_order + ['x']

# one line for each of these files
for fn in stats_files:
print fn
print(fn)

lines = [line.strip() for line in open(fn, 'r').readlines()]

# histogram of number of number of strains aligned
@@ -43,10 +46,10 @@
c.append(float(lines[i + offset].split(',')[1]))

# aligned lengths
l = []
lengths = []
offset += len(all_strains) + 1 + 2
for i in range(len(all_strains)):
l.append(float(lines[i + offset].split(',')[1]))
lengths.append(float(lines[i + offset].split(',')[1]))

sx = lines[offset + len(all_strains) - 1].split(',')[0]

@@ -67,7 +70,7 @@
for j in range(len(all_strains)):
f.write('\t' + str(fr[i][j]))
for i in range(len(all_strains)):
f.write('\t' + str(l[i]))
f.write('\t' + str(lengths[i]))
for i in range(len(all_strains) + 1):
f.write('\t' + str(c[i]))
f.write('\n')
29 changes: 24 additions & 5 deletions code/align/align_helpers.py
@@ -1,12 +1,24 @@
import os
import global_params as gp
from typing import List, Tuple


def flatten(l):
def flatten(l: List[List]) -> List:
'''
Flatten list of lists into a single list
'''
return [item for sublist in l for item in sublist]


def get_strains(dirs):
def get_strains(dirs: List[str]) -> List[Tuple[str, str]]:
'''
Find all strains in the provided list of directories
Returns a sorted list of tuples with (strain_name, directory) entries
Checks for files with the fasta_suffix and contain _chr
strain_name is the name of the file up to _chr.
Raises assertion error if the number of files found is < number of strains
* the number of chromosomes
'''
# get all non-reference strains of cerevisiae and paradoxus; could
# generalize this someday...

@@ -15,10 +27,11 @@ def get_strains(dirs):
for d in dirs:
fns = os.listdir(d)
# only look at fasta files in the directory
fns = filter(lambda x: x.endswith(gp.fasta_suffix), fns)
# only look at files containing '_chr' which should be chromosome
# sequence files
fns = list(filter(lambda x: '_chr' in x, fns))
fns = list(
filter(lambda x: x.endswith(gp.fasta_suffix) and '_chr' in x,
fns))
num_files = len(fns)
if num_files == 0:
print(f'found no chromosome sequence files in {d} '
@@ -33,7 +46,13 @@ def get_strains(dirs):
return sorted(s)


def concatenate_fasta(input_files, names, output_file):
def concatenate_fasta(input_files: List[str],
names: List[str],
output_file: str) -> None:
'''
Combines several fasta files together into a single output
Adds header between each input fasta as > name[i] filename
'''
with open(output_file, 'w') as output:
for i, file in enumerate(input_files):
with open(file, 'r') as input:
13 changes: 7 additions & 6 deletions code/align/alignment_stats.py
@@ -1,8 +1,6 @@
import sys
sys.path.insert(0, '..')
import global_params as gp
sys.path.insert(0, '../misc')
import read_fasta
from misc import read_fasta


# count sites where n, ..., 3, 2, 1 genomes aligned, etc.
def num_strains_aligned_by_site(seqs):
@@ -17,7 +15,8 @@ def num_strains_aligned_by_site(seqs):

return num_strains_hist

# fraction of each strain's sequence contained in alignment

# fraction of each strain's sequence contained in alignment
# (should be 1)
def fraction_strains_aligned(headers, seqs):
nseqs = len(seqs)
@@ -34,6 +33,7 @@

return fracs_aligned, seq_lengths


# using each genome as reference, percentage of other genomes aligned
def frac_aligned_to_reference(seqs, seq_lengths):
nseqs = len(seqs)
@@ -47,7 +47,8 @@ def frac_aligned_to_reference(seqs, seq_lengths):
else:
total = 0
for i in range(nsites):
if seqs[ref][i] != gp.gap_symbol and seqs[other][i] != gp.gap_symbol:
if seqs[ref][i] != gp.gap_symbol and \
seqs[other][i] != gp.gap_symbol:
total += 1
r.append(float(total) / seq_lengths[other])
fracs_aligned_to_ref.append(r)