addition of draft machine learning chapter (applied-bioinformatics#325)
* additionally updates homology searching to re-use code between chapters

* installs local iab library for travis testing (previously was installing master, which caused changes in the library code to not be available during testing)
gregcaporaso authored Feb 11, 2020
1 parent 795328f commit ceec861
Showing 4 changed files with 86 additions and 65 deletions.
3 changes: 2 additions & 1 deletion .travis.yml
@@ -1,4 +1,4 @@
# Taken and modified from
# https://github.com/biocore/scikit-bio/blob/master/.travis.yml
dist: bionic
language: python
@@ -14,6 +14,7 @@ before_install:
install:
  - conda env create -n iab -f environment.yml
  - source activate iab
  - pip install .
  - conda install -y nose
  - pip install https://github.com/caporaso-lab/build-iab/archive/master.zip
  - biab notebook -i book -o ipynb
38 changes: 16 additions & 22 deletions book/fundamentals/database-searching.md
@@ -43,24 +43,13 @@ First, let's load Greengenes into a list of ``skbio.DNA`` sequence objects, and

```python
>>> from iab.algorithms import load_taxonomy_reference_database
...
>>> %psource load_taxonomy_reference_database
```

```python
>>> reference_taxonomy, reference_db = load_taxonomy_reference_database()
```

Next, we'll just inspect a couple of the sequences we loaded. Notice how the specificity of our taxonomic annotations (i.e., how many taxonomic levels are annotated versus unknown) differs for different sequences.
@@ -87,11 +76,14 @@ We'll also extract some sequences from Greengenes to use as query sequences in o
Note that some of our query sequences may also be in our subsampled reference database and some won't. This is realistic: sometimes we're working with sequences that are exact matches to known sequences, and sometimes we're working with sequences that don't match any known sequences (or at least any in the reference database that we're working with).

```python
>>> from iab.algorithms import load_taxonomy_query_sequences
...
>>> %psource load_taxonomy_query_sequences
```

```python
>>> queries = load_taxonomy_query_sequences()
>>> queries = random.sample(queries, k=50)
```

Let's inspect a couple of the query sequences that we'll work with.
@@ -416,6 +408,8 @@ Try increasing and decreasing the number of sequences we'll align by increasing
Another metric of sequence composition is *kmer composition*. A [kmer](alias://C7hMX5) is simply a word (or list of adjacent characters) of length *k* found within a sequence. Here are the kmer frequencies in a short DNA sequence. The ``overlap=True`` parameter here means that our kmers can overlap one another.

```python
>>> import skbio
...
>>> skbio.DNA('ACCGTGACCAGTTACCAGTTTGACCAA').kmer_frequencies(k=5, overlap=True)
```
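
To make the ``overlap=True`` behavior concrete, here's a minimal pure-Python sketch of overlapping kmer counting. The helper name is ours for illustration; in practice you'd use ``skbio``'s built-in method as shown above.

```python
from collections import Counter

def overlapping_kmer_frequencies(seq, k):
    # step one position at a time, so adjacent kmers overlap by k-1 characters
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

print(overlapping_kmer_frequencies('ACCGTGACCAGTTACCAGTTTGACCAA', k=5))
```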

73 changes: 31 additions & 42 deletions book/fundamentals/machine-learning.md
@@ -1,14 +1,14 @@
# Machine learning in bioinformatics (work-in-progress) <link src="7SFnTr"/>

**This chapter is currently a work-in-progress, and is incomplete.**

Machine learning algorithms are commonly used in bioinformatics for a variety of tasks. The common thread in these tasks is typically that the user would like the algorithm to assist in the identification of patterns in a complex data set. In this chapter we'll implement a few machine learning algorithms so we can gain an in-depth understanding of how they work. In practice, though, there are many mature machine learning libraries that you'd want to use instead. [scikit-learn](http://scikit-learn.org/) is a popular and well-documented Python library for machine learning which many bioinformatics researchers and software developers use in their work.

These algorithms generally begin with a collection of samples and some user-defined features of those samples. These data are typically represented in a matrix, where samples are the rows and features are the columns. There are a few different high-level tasks that are common in machine learning, including classification, regression, and dimensionality reduction. In a classification task, a user provides examples of data that fall into certain discrete classes (for example, _healthy_ and _disease_), and tries to have the computer develop a model that can differentiate those classes based on the defined features. If successful, the resulting model could be applied to data where the class isn't known ahead of time, in an attempt to predict the class from the features. A regression task is similar, except that a continuous value is predicted rather than a discrete value. Dimensionality reduction tasks, on the other hand, generally don't have classes or labels assigned ahead of time, and the user is hoping to identify which samples are most similar to each other based on new features that are defined by the algorithm. The goal here might be to reduce the number of features from thousands or more to around two or three that explain most of the variation in the data. This allows the user to explore the samples visually, for example in a scatter plot, which would not be feasible if there were thousands of features.
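
To make the samples-by-features layout concrete, here's a minimal sketch of a classification task using scikit-learn. The feature values and class labels below are invented purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# rows are samples, columns are user-defined features (e.g., kmer counts)
X = np.array([[3, 0, 1],
              [4, 1, 0],
              [0, 5, 2],
              [1, 4, 3]])
# one discrete class label per sample
y = ['healthy', 'healthy', 'disease', 'disease']

# fit a model to the labeled examples...
classifier = MultinomialNB()
classifier.fit(X, y)

# ...then predict the class of a new, unlabeled sample from its features
print(classifier.predict(np.array([[2, 1, 0]])))
```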

In this chapter we'll explore two classification algorithms and one dimensionality reduction task in the context of some real-world examples.

## Defining a classification problem <link src="2R6CTy"/>

We'll explore machine learning classifiers in the context of a familiar topic: taxonomic classification of 16S rRNA sequences. We previously explored this problem in [Sequence Homology Searching](alias://d22e6b), so it is likely worth spending a few minutes skimming that chapter if it's not fresh in your mind.

@@ -18,7 +18,7 @@ This time, instead of using sequence alignment to identify the most likely taxon

Let's jump in...

### Naive Bayes classifiers <link src="H8vYPu"/>

The first classifier we'll explore is the popular and relatively simple Naive Bayes classifier. This classifier uses Bayes' Theorem to determine the most likely label for an unknown input, based on a probabilistic model it constructs from training data. The model is based on user-defined features of the sequences. The most commonly used features for sequence classification tasks such as this are overlapping [kmers](alias://C7hMX5).
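
In case it's useful to see the model written out, this is the standard form of a Naive Bayes classifier, using the $P(w_i | taxon)$ notation that appears later in this chapter. Given the kmers $w_1, w_2, \ldots, w_n$ observed in a query sequence, each candidate taxon is scored under the "naive" assumption that the kmers are independent of one another, and the highest-scoring taxon is assigned:

$$P(taxon | w_1, \ldots, w_n) \propto P(taxon) \prod_{i=1}^{n} P(w_i | taxon)$$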

@@ -38,27 +38,13 @@ We'll begin by importing some libraries that we'll use in this chapter, and then

```python
>>> from iab.algorithms import load_taxonomy_reference_database
...
>>> %psource load_taxonomy_reference_database
```

```python
>>> reference_taxonomy, reference_db = load_taxonomy_reference_database()
```

@@ -189,17 +175,16 @@ Our "kmer probability table" is $P(w_i | taxon)$ computed for all kmers in W and
With our kmer probability table we are now ready to classify unknown sequences. We'll begin by defining some query sequences. We'll pull these at random from our reference sequences, which means that some of the query sequences will be represented in our reference database and some won't be. This is the situation that is typically encountered in practice. To simulate real-world 16S taxonomy classification tasks, we'll also extract 200-base segments of our reference sequences to use as queries, since (as of this writing) we typically don't obtain full-length 16S sequences from a DNA sequencing instrument.

```python
>>> from iab.algorithms import load_taxonomy_query_sequences
...
>>> %psource load_taxonomy_query_sequences
```

```python
>>> import random
...
>>> queries = load_taxonomy_query_sequences()
>>> queries = random.sample(queries, k=50)
```

@@ -237,9 +222,9 @@ Since we know the actual taxonomy assignment for this sequence, we can look that

```python
>>> get_taxon_at_level(reference_taxonomy[queries[0].metadata['id']], taxonomic_level)
```
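
`get_taxon_at_level` is defined earlier in the chapter and isn't shown in this diff. Conceptually, it trims a semicolon-delimited, Greengenes-style taxonomy annotation down to a given number of levels. A sketch of that idea (not necessarily the chapter's exact implementation):

```python
def get_taxon_at_level(taxon, level):
    # keep only the first `level` ranks of an annotation such as
    # 'k__Bacteria; p__Firmicutes; c__Bacilli; ...'
    taxon = [t.strip() for t in taxon.split(';')]
    return '; '.join(taxon[:level])

print(get_taxon_at_level('k__Bacteria; p__Firmicutes; c__Bacilli', 2))
# k__Bacteria; p__Firmicutes
```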

Because the query and reference sequences that we're working with were randomly selected from the full reference database, each time you run this notebook you should observe different results. Chances are, however, that if you run the above steps multiple times you'll get the wrong taxonomy assignment at least some of the time. Up to this point, we've left out an important piece of information: how confident should we be in our assignment, or in other words, how dependent is our taxonomy assignment on our specific query? If there were slight differences in our query (e.g., because we observed a very closely related organism, such as one of the same species but a different strain, or because we sequenced a different region of the 16S sequence) would we obtain the same taxonomy assignment? If so, we should have higher confidence in our assignment. If not, we should have lower confidence in our assignment.

We can quantify confidence using an approach called bootstrapping. With a bootstrap approach, we'll get our taxonomy assignment as we did above, but then for some user-specified number of times, we'll create random subsets of V sampled with replacement (meaning that the same kmer can be drawn more than once in a given subset). We'll then assign taxonomy to each random subset of V, and count the number of times the resulting taxonomy assignment is the same as the one we achieved when assigning taxonomy to V. The count divided by the number of iterations we've chosen to run will be our confidence value. If the assignments are often the same we'll have a high confidence value. If the assignments are often different, we'll have a low confidence value.
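
As a rough sketch of that procedure (the `classify` function here is a hypothetical placeholder for the taxonomy assignment step implemented in this chapter, not its actual API):

```python
import random

def bootstrap_confidence(V, classify, iterations=100):
    # the taxonomy assignment for the full set of kmers, V
    full_assignment = classify(V)
    matches = 0
    for _ in range(iterations):
        # sample len(V) kmers from V with replacement, so the same kmer
        # can be drawn more than once in a single iteration
        resampled = random.choices(list(V), k=len(V))
        if classify(resampled) == full_assignment:
            matches += 1
    # the fraction of iterations that reproduced the full assignment
    return matches / iterations
```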

Let's now assign taxonomy and compute a confidence for that assignment.

@@ -302,6 +287,10 @@ What does this plot tell you about how well setting a confidence threshold is li
>>> summary # maybe explore whether certain taxa are more frequently wrong than others...
```

### Random Forest classifiers <link src="N7CyaN"/>

Coming soon...
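
In the meantime, here's a minimal sketch of a Random Forest classifier using scikit-learn, reusing the invented samples-by-features matrix from the sketch near the start of this chapter:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# rows are samples, columns are features; data invented for illustration
X = np.array([[3, 0, 1], [4, 1, 0], [0, 5, 2], [1, 4, 3]])
y = ['healthy', 'healthy', 'disease', 'disease']

# an ensemble of decision trees, each trained on a bootstrapped sample of
# the data, with random subsets of the features considered at each split
classifier = RandomForestClassifier(n_estimators=10, random_state=42)
classifier.fit(X, y)
print(classifier.predict(np.array([[2, 1, 0]])))
```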

## Defining a dimensionality reduction problem <link src="Y0TtkW"/>

## Neural networks and "deep learning" <link src="DgnnyS"/>
[This content](alias://b1cdbe) will be adapted and ported here.
37 changes: 37 additions & 0 deletions iab/algorithms/__init__.py
@@ -16,6 +16,7 @@
import skbio
from skbio import Sequence, DNA, TabularMSA, TreeNode, DistanceMatrix
from skbio.alignment import local_pairwise_align_ssw
import qiime_default_reference as qdr


blosum50 = {'A': {'A': 5, 'C': -1, 'D': -2, 'E': -1, 'F': -3, 'G': 0, 'H': -2, 'I': -1, 'K': -1, 'L': -2, 'M': -1, 'N': -1, 'P': -1, 'Q': -1, 'R': -2, 'S': 1, 'T': 0, 'V': 0, 'W': -3, 'Y': -2},
@@ -1146,3 +1147,39 @@ def evolve_generations(ancestral_sequence, generations, substitution_probability

    # upon completion of all generations, return the last generation's sequences
    return previous_generation_sequences


def load_taxonomy_reference_database():
    """Load Greengenes reference sequences and their taxonomy annotations.

    Returns a dict mapping sequence identifiers to taxonomy strings, and a
    list of skbio.DNA reference sequences with the taxonomy annotation
    attached to each as metadata.
    """
    # Load the taxonomic data
    reference_taxonomy = {}
    for e in open(qdr.get_reference_taxonomy()):
        seq_id, seq_tax = e.strip().split('\t')
        reference_taxonomy[seq_id] = seq_tax

    # Load the reference sequences, and associate the taxonomic annotation with
    # each as metadata
    reference_db = []
    for e in skbio.io.read(qdr.get_reference_sequences(), format='fasta', constructor=skbio.DNA):
        if e.has_degenerates():
            # For the purpose of this lesson, we're going to ignore sequences that contain
            # degenerate characters (i.e., characters other than A, C, G, or T)
            continue
        seq_tax = reference_taxonomy[e.metadata['id']]
        e.metadata['taxonomy'] = seq_tax
        reference_db.append(e)

    print("%s sequences were loaded from the reference database." % len(reference_db))

    return reference_taxonomy, reference_db


def load_taxonomy_query_sequences(start_position=100, length=200):
    """Load query sequences as fixed-length slices of the reference sequences."""
    queries = []
    for e in skbio.io.read(qdr.get_reference_sequences(), format='fasta', constructor=skbio.DNA):
        if e.has_degenerates():
            # For the purpose of this lesson, we're going to ignore sequences that contain
            # degenerate characters (i.e., characters other than A, C, G, or T)
            continue
        e = e[start_position:start_position + length]
        queries.append(e)

    return queries
