Tests passing, merge with devel_python3 #4

Open
wants to merge 34 commits into master
Commits (34)
00700bb
Filter tests migrated to python 3
troycomi Mar 20, 2019
4fe85ec
Filter 1 working with single gz file using seek
troycomi Mar 20, 2019
1a94214
Filter 1 running, matching output
troycomi Mar 21, 2019
f77e1ef
Filter 1 with region yield
troycomi Mar 21, 2019
42d91dd
Filter 2 refactor
troycomi Mar 22, 2019
a1c6155
Adding docstrings and return types
troycomi Mar 22, 2019
8844007
Filter 2 thresholds refactor
troycomi Mar 22, 2019
7c82f95
Summarize Strain under test, refactored
troycomi Mar 26, 2019
3267582
Working on documentation
troycomi Apr 1, 2019
58634eb
Suppress log warnings
troycomi Apr 9, 2019
17d51bf
Merge branch 'devel_python3' into merge_wip
troycomi Apr 9, 2019
96d6cfb
Tests passing
troycomi Apr 9, 2019
4be5be7
Config yaml
troycomi Apr 10, 2019
9102a16
Working on predict main
troycomi Apr 19, 2019
0ee0ecf
Started Predict refactor
troycomi Apr 19, 2019
dfc0846
Translated predict_main
troycomi Apr 25, 2019
85eaac8
Fixed readme
troycomi Apr 25, 2019
29ae638
Tested predict main
troycomi Apr 26, 2019
9438339
Log file, progress bar
troycomi Apr 26, 2019
3fef252
Refactor id_regions
troycomi Apr 30, 2019
1bd8cf2
Moved validate code, required position
troycomi Apr 30, 2019
c42b6fd
Refactor summarize region quality
troycomi May 9, 2019
1ac85ed
Combined filter methods
troycomi May 13, 2019
2667a6a
Filter Regions Refactor
troycomi May 21, 2019
b00140a
Flake8 passing, summarize_strains refactored
troycomi Jun 1, 2019
21c52ef
Updated readme, removed unused argument parser
troycomi Jun 5, 2019
df69936
Update readme
troycomi Jun 5, 2019
d113279
Working on setting up travisci and codecov
troycomi Jun 10, 2019
8b266e5
Travis debugging
troycomi Jun 10, 2019
deaea99
Travis debugging
troycomi Jun 10, 2019
000385d
Travis debugging
troycomi Jun 10, 2019
a56b5a0
Travis debugging
troycomi Jun 10, 2019
1cfd268
Travis debugging
troycomi Jun 10, 2019
3c2fa43
Badges in readme
troycomi Jun 10, 2019
3 changes: 3 additions & 0 deletions .gitignore
@@ -33,3 +33,6 @@ code/*/scratch/*
code/setup/*
.coverage
*.swp
*egg-info
tags
*.log
25 changes: 25 additions & 0 deletions .travis.yml
@@ -0,0 +1,25 @@
language: python
python:
- "3.6"
before_install:
  cd code
install:
- sudo apt-get update
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
- bash miniconda.sh -b -p $HOME/miniconda
- export PATH="$HOME/miniconda/bin:$PATH"
- hash -r
- conda config --set always_yes yes --set changeps1 no
- conda update -q conda
# Useful for debugging any issues with conda
- conda info -a

- conda env create -q -n test-environment --file environment.yml
- source activate test-environment
- pip install codecov

script:
  pytest --cov --cov-config .coveragerc --flake8

after_success:
- codecov
199 changes: 199 additions & 0 deletions README.md
@@ -0,0 +1,199 @@
[![Build Status](https://travis-ci.com/troycomi/introgression.svg?branch=master)](https://travis-ci.com/troycomi/introgression)
[![codecov](https://codecov.io/gh/troycomi/introgression/branch/master/graph/badge.svg)](https://codecov.io/gh/troycomi/introgression)

# introgression
> Discovering yeast admixture through sequencing

## Background
TBA

## Installation
All required packages are specified in the conda environment located in
`code/environment.yml`. The introgression environment can be generated with
```bash
conda env create -f environment.yml
```
To access the command line bindings of the main analyze class,
install the package in editable mode with
```bash
conda activate introgression
pip install --editable .
```
while in the code directory.

## Usage

### Configuration
A set of initial parameters is provided in `code/config.yaml`; these need to
be set for your system and dataset.

Strings of the form `__KEY__`
are substituted during execution and are used as a shortcut. For example,
with `output_root` set to `/data/results`, the value `__OUTPUT_ROOT__/genes/`
becomes `/data/results/genes/`.

Strings of the form `{state}` are used for wildcards within the code. Their
location and surrounding characters can change, but the wildcard must be the
same. For example, `blocks_{state}.txt` can be changed to
`{state}_with-block.txt` but not `blocks_{st}.txt`.
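
As an illustration, a minimal configuration file might look like the sketch
below; only the output root, the chromosome list, and the known states are
mentioned in this README, so the exact key names and remaining entries are
assumptions.
```bash
# A minimal, hypothetical config.yaml sketch. The key names (output_root,
# chromosomes, known_states, blocks) and all values are illustrative
# placeholders, not the project's actual schema.
cat > config.yaml << 'EOF'
output_root: /data/results
chromosomes: [I, II, III, IV]
known_states: [S288c, CBS432]
blocks: __OUTPUT_ROOT__/blocks_{state}.txt
EOF
```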

### Command Line
With the package installed and the conda environment activated, main methods
are accessed with the `introgression` command. Some documentation is provided
by adding `--help` to `introgression` or any of its subcommands.

#### introgression
Options include:
- --config: specify one or more configuration files. Files are evaluated in
order, and conflicting values are overwritten by the most recently listed file.
This allows a base configuration for the system, with analysis-specific
configurations added as needed.
- -v: verbosity is set by varying the number of v's attached to the option, with
`-v` indicating a log level of critical and `-vvvvv` indicating debug logging.
- --log-file: Optional location to store log information. Default is stdout.
If set and run on an interactive shell, some commands will display progress
bars.
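
For example, a top-level invocation combining a base configuration with an
analysis-specific one, increased verbosity, and a log file might look like the
following sketch (file names here are placeholders):
```bash
# Hypothetical top-level invocation; base.yaml, analysis.yaml and run.log are
# placeholder names, and <subcommand> stands for any subcommand listed below.
introgression --config base.yaml --config analysis.yaml -vvv \
    --log-file run.log <subcommand> [options]
```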

Most subcommand options will overwrite corresponding values in the config
file. Leaving an option unset without supplying a value in the config file
will raise an error. Some values, including the list of chromosomes and the
known states, can only be set through the config file.

Available subcommands are:
##### predict
The predict subcommand uses an HMM to predict regions of introgression from
alignment files. Several outputs are used in subsequent steps which refine
the predicted introgressed regions.

Test strains on which to predict introgression can be supplied in the config
file under the name `strains` or pulled from the directory structure of
`test_strains`.

Available options are:
- --alignment: input alignment file location with wildcards for
{prefix} (optional), {strain} and {chrom}.
- --prefix: An optional wildcard value for alignment files. If left blank,
will default to the known states joined with an underscore. Leaving the
{prefix} wildcard out of the alignment file will prevent its use as well.
- --blocks: An output file containing the regions predicted to belong to the
given state. Must contain the {state} wildcard, which will be populated with a
known state during analysis. Columns are the strain, chromosome, the
predicted state, start position, end position, and the number of sites
supporting the assignment.
- --test-strains: If strains are not provided in the config, this file with
{strain} and {chrom} wildcards will be used to populate the strains for
prediction.
- --hmm-initial: Output file with the initial parameters of the HMM for each
strain.
- --hmm-trained: Output file with HMM parameters following Baum-Welch training.
- --positions: Output file with indices of non-gapped, sequenced sites which
differ between the reference alignments.
- --probabilities: Output file with the probability of each position belonging
to the master reference strain.
- --threshold: The threshold value to apply when filtering the predicted HMM
path through the test sequence. Either a float, indicating cutoff probability,
or 'viterbi' to indicate the Viterbi algorithm should be used to find the most
likely sequence of states.
- --only-poly-sites/--all-sites: A switch to indicate if all non-gapped,
sequenced sites should be considered during HMM training, or only polymorphic
sites. Default is only polymorphic sites.
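
A sketch of a predict invocation is shown below; all paths and wildcard
patterns are placeholders rather than the project's actual layout.
```bash
# Hypothetical predict run; every path and wildcard pattern here is a
# placeholder chosen for illustration.
introgression --config config.yaml predict \
    --alignment "alignments/{prefix}_{strain}_chr{chrom}.fa" \
    --blocks "blocks_{state}.txt" \
    --hmm-initial hmm_initial.txt \
    --hmm-trained hmm_trained.txt \
    --positions positions.txt \
    --probabilities probabilities.txt \
    --threshold viterbi \
    --only-poly-sites
```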

##### id-regions
id-regions prepends a column to block files with a unique region id of the
form 'r#'. Regions are sorted by start position. Changing the states to label
will change the region numbers, as a different set of regions is considered.

Available options are:
- --blocks: The input file to label with the wildcard {state}. This is the
file produced by predict in the previous step.
- --labeled: The output file, also containing {state} wildcard.
- --state: May be specified multiple times to indicate which states to add
labels to. Leaving unset will use the states in the config file (recommended).
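
A minimal sketch of an id-regions invocation, with placeholder file names:
```bash
# Hypothetical id-regions run; file names are placeholders.
introgression --config config.yaml id-regions \
    --blocks "blocks_{state}.txt" \
    --labeled "blocks_{state}_labeled.txt"
```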

##### summarize-regions
Analyzes the regions predicted to be introgressed. Several columns are added
to the block file with information about each region, including the number
of sites matching each state.

Available options are:
- --state: May be set multiple times for each state to summarize. Leaving
unset will default to all states in the config file.
- --labeled: the labeled block file with {state} wildcard created in the
previous step.
- --masks: Sequence mask files with {strain} and {chrom} wildcards.
- --alignment: The input alignment file similar to the predict option.
- --positions: The position file created during predict.
- --quality: The output file with a {state} wildcard.
- --region: The alignment for each region in the labeled file with {state}
wildcard. Each state file contains all regions for the state.
- --region-index: A pickled Python dictionary used for random access into the
region file. Must have the {state} wildcard.
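
A sketch of a summarize-regions invocation, again using placeholder paths:
```bash
# Hypothetical summarize-regions run; paths and wildcard patterns are
# placeholders.
introgression --config config.yaml summarize-regions \
    --labeled "blocks_{state}_labeled.txt" \
    --masks "masks/{strain}_chr{chrom}.txt" \
    --alignment "alignments/{prefix}_{strain}_chr{chrom}.fa" \
    --positions positions.txt \
    --quality "quality_{state}.txt" \
    --region "regions_{state}.fa" \
    --region-index "regions_{state}.pkl"
```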

##### filter-regions
From the quality files produced by `summarize-regions`, filter out regions
with weak support for the alternative hypothesis and regions which can be
assigned to multiple alternative states.

Regions passing the 'introgressed filter' satisfy all of the following:
- fraction of gaps masked in reference > 0.5
- fraction of gaps masked in predicted state > 0.5
- number of matches to predicted > 7
- number of matches to predicted > number of matches to reference
- sequence identity with predicted state is higher than reference
- sequence identity with reference is > 0.7

Regions passing the 'ambiguous filter' match only the predicted state. No
other state has:
- sequence identity >= sequence identity with predicted state * threshold
- matching bases >= matching bases with predicted state * threshold

Available options are:
- --region: The region file from summarize-regions.
- --region-index: The region index file from summarize-regions.
- --quality: The quality file produced by summarize-regions with {state}
wildcard.
- --introgress-filter: The output file with only regions passing the
introgression filter. Must contain {state} wildcard.
- --introgress-inter: An output file with all regions, including the reason
each region was removed by the introgression filter, or blank if it passed.
Must contain {state} wildcard.
- --ambiguous-filter: Output file containing only regions which pass the
ambiguous filter after passing the introgression filter.
Must contain {state} wildcard.
- --ambiguous-inter: Contains all regions from the introgression filter with a
column for the reason the region failed ambiguous filtering. Must contain
{state} wildcard.
- --thresh: The threshold to apply to the ambiguous filter.
- --filter-sweep: If set and threshold values are supplied as arguments,
will output summary information for applying the ambiguous filter with various
threshold values.

`filter-regions` accepts multiple threshold values as arguments to test
and output to the `filter-sweep` file. Sample usage would be
```bash
introgression --config config.yml \
filter-regions \
--threshold 0.995 \
--filter-sweep sweep.txt \
0.99 0.98 0.8 # these are the sweep arguments
```
where 0.99, 0.98 and 0.8 are used as test threshold values as summarized in
sweep.txt. Note that the ambiguous filter will only use the threshold 0.995
in this example.

##### summarize-strains
summarize-strains produces summary information for each test strain, including
the number of regions and bases assigned to each hidden state, filtered at
each stage, and ambiguous between states.

Available options are:
- --introgress-inter: The introgressed filter file as used in `filter-regions`.
- --ambiguous-inter: The ambiguous filter file as used in `filter-regions`.
- --strain-info: Tab-separated table with information on the strain to include
with the summary output. Columns should be the strain name, alternate name,
location, environment, and population.
- --state-counts: The summary output file.
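
A sketch of a summarize-strains invocation with placeholder file names:
```bash
# Hypothetical summarize-strains run; file names are placeholders.
introgression --config config.yaml summarize-strains \
    --introgress-inter "introgressed_intermediate_{state}.txt" \
    --ambiguous-inter "ambiguous_intermediate_{state}.txt" \
    --strain-info strain_info.tsv \
    --state-counts state_counts.txt
```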

## License
TBD
25 changes: 14 additions & 11 deletions code/align/aggregate_alignment_stats.py
@@ -1,14 +1,17 @@
import os
import sys
sys.path.insert(0, '..')
import global_params as gp

gp_dir = '../'
stats_files = [gp_dir + gp.alignments_dir + x for x in filter(\
lambda x: 'stats' in x and 'summary' not in x, os.listdir(gp_dir + gp.alignments_dir))]
stats_files = [gp_dir + gp.alignments_dir + x for x in filter(
lambda x: 'stats' in x and 'summary' not in x,
os.listdir(gp_dir + gp.alignments_dir))]

# goal is to generate file for R (e.g. for two references and test strain):
# chromosome strain frac_S288c_S288c frac_S288c_CBS432 frac_S288c_x frac_CBS432_S288c frac_CBS432_CBS432 frac_CBS432_x frac_x_S288c frac_x_CBS432 frac_x_x aligned_length_S288c aligned_length_CBS432 aligned_length_x num_align_columns_0 num_align_columns_1 num_align_columns_2 num_align_columns_3
# chromosome strain frac_S288c_S288c frac_S288c_CBS432 frac_S288c_x
# frac_CBS432_S288c frac_CBS432_CBS432 frac_CBS432_x frac_x_S288c
# frac_x_CBS432 frac_x_x aligned_length_S288c aligned_length_CBS432
# aligned_length_x num_align_columns_0 num_align_columns_1
# num_align_columns_2 num_align_columns_3

f = open(gp_dir + gp.alignments_dir + 'mafft_stats_summary.txt', 'w')

@@ -25,15 +28,15 @@

for i in range(0, len(gp.alignment_ref_order) + 2):
f.write('\t' + 'num_align_columns_' + str(i))

f.write('\n')

all_strains = gp.alignment_ref_order + ['x']

# one line for each of these files
for fn in stats_files:
print fn
print(fn)

lines = [line.strip() for line in open(fn, 'r').readlines()]

# histogram of number of number of strains aligned
@@ -43,10 +46,10 @@
c.append(float(lines[i + offset].split(',')[1]))

# aligned lengths
l = []
lengths = []
offset += len(all_strains) + 1 + 2
for i in range(len(all_strains)):
l.append(float(lines[i + offset].split(',')[1]))
lengths.append(float(lines[i + offset].split(',')[1]))

sx = lines[offset + len(all_strains) - 1].split(',')[0]

@@ -67,7 +70,7 @@
for j in range(len(all_strains)):
f.write('\t' + str(fr[i][j]))
for i in range(len(all_strains)):
f.write('\t' + str(l[i]))
f.write('\t' + str(lengths[i]))
for i in range(len(all_strains) + 1):
f.write('\t' + str(c[i]))
f.write('\n')
29 changes: 24 additions & 5 deletions code/align/align_helpers.py
@@ -1,12 +1,24 @@
import os
import global_params as gp
from typing import List, Tuple


def flatten(l):
def flatten(l: List[List]) -> List:
'''
Flatten list of lists into a single list
'''
return [item for sublist in l for item in sublist]


def get_strains(dirs):
def get_strains(dirs: List[str]) -> List[Tuple[str, str]]:
'''
Find all strains in the provided list of directories
Returns a sorted list of tuples with (strain_name, directory) entries
Checks for files with the fasta_suffix and contain _chr
strain_name is the name of the file up to _chr.
Raises assertion error if the number of files found is < number of strains
* the number of chromosomes
'''
# get all non-reference strains of cerevisiae and paradoxus; could
# generalize this someday...

@@ -15,10 +27,11 @@ def get_strains(dirs):
for d in dirs:
fns = os.listdir(d)
# only look at fasta files in the directory
fns = filter(lambda x: x.endswith(gp.fasta_suffix), fns)
# only look at files containing '_chr' which should be chromosome
# sequence files
fns = list(filter(lambda x: '_chr' in x, fns))
fns = list(
filter(lambda x: x.endswith(gp.fasta_suffix) and '_chr' in x,
fns))
num_files = len(fns)
if num_files == 0:
print(f'found no chromosome sequence files in {d} '
@@ -33,7 +46,13 @@ def get_strains(dirs):
return sorted(s)


def concatenate_fasta(input_files, names, output_file):
def concatenate_fasta(input_files: List[str],
names: List[str],
output_file: str) -> None:
'''
Combines several fasta files together into a single output
Adds header between each input fasta as > name[i] filename
'''
with open(output_file, 'w') as output:
for i, file in enumerate(input_files):
with open(file, 'r') as input:
13 changes: 7 additions & 6 deletions code/align/alignment_stats.py
@@ -1,8 +1,6 @@
import sys
sys.path.insert(0, '..')
import global_params as gp
sys.path.insert(0, '../misc')
import read_fasta
from misc import read_fasta


# count sites where n, ..., 3, 2, 1 genomes aligned, etc.
def num_strains_aligned_by_site(seqs):
@@ -17,7 +15,8 @@ def num_strains_aligned_by_site(seqs):

return num_strains_hist

# fraction of each strain's sequence contained in alignment

# fraction of each strain's sequence contained in alignment
# (should be 1)
def fraction_strains_aligned(headers, seqs):
nseqs = len(seqs)
@@ -34,6 +33,7 @@

return fracs_aligned, seq_lengths


# using each genome as reference, percentage of other genomes aligned
def frac_aligned_to_reference(seqs, seq_lengths):
nseqs = len(seqs)
@@ -47,7 +47,8 @@ def frac_aligned_to_reference(seqs, seq_lengths):
else:
total = 0
for i in range(nsites):
if seqs[ref][i] != gp.gap_symbol and seqs[other][i] != gp.gap_symbol:
if seqs[ref][i] != gp.gap_symbol and \
seqs[other][i] != gp.gap_symbol:
total += 1
r.append(float(total) / seq_lengths[other])
fracs_aligned_to_ref.append(r)