Skip to content

Latest commit

 

History

History
428 lines (350 loc) · 18.6 KB

README.md

File metadata and controls

428 lines (350 loc) · 18.6 KB

MACHINA - Metastatic And Clonal History INtegrative Analysis

MACHINA is a computational framework for inferring migration patterns between a primary tumor and metastases using DNA sequencing data. Overview of MACHINA

Contents

  1. Installation
  2. Usage instructions

Installation

bioconda

  1. Install Anaconda or Miniconda if you do not already have one installed.
  2. (recommended) Create a new conda environment for machina and activate it:
conda create -n machina
conda activate machina
  1. Set up conda channels for bioconda (once per Anaconda/Miniconda installation):
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
  1. Install machina from bioconda:
conda install machina

Manual compilation

Note that binaries for macOS and linux are available here. These binaries require a valid Gurobi installation and license key. License key location can be specified via the environment variable GRB_LICENSE_KEY. In addition, installation of Gurobi in a non-standard location will require updating LD_LIBRARY_PATH (linux) and DYLD_LIBRARY_PATH (macOS).

Also note that to run the below examples, you must either provide the full path to the executable (e.g., /path/to/machina/build/pmh_sankoff) or add the build directory to your PATH.

Dependencies

MACHINA is written in C++11 and thus requires a modern C++ compiler (GCC >= 4.8.1, or Clang). In addition, MACHINA has the following dependencies.

Graphviz is required to visualize the resulting DOT files, but is not required for compilation.

Gurobi is a commercial ILP solver with two licensing options: (1) a single-host license where the license is tied to a single computer and (2) a network license for use in a compute cluster. Both options are freely available for users in academia.

In case doxygen is available, extended source code documentation will be generated.

Compilation

To compile MACHINA, execute the following commands from the root of the repository:

$ mkdir build
$ cd build
$ cmake ..
$ make

In case CMake fails to detect LEMON or Gurobi, run the following command with adjusted paths:

$ cmake -DLIBLEMON_ROOT=~/lemon \
-DGUROBI_HOME=/path/to/gurobiXXX

where XXX is the 3-digit version of gurobi.

The compilation results in the following files in the build directory:

COMMAND DESCRIPTION
cluster Cluster mutations using a combinatorial algorithm that models variant read counts using a binomial distribution.
generatemigrationtrees Generates all migration trees given anatomical site labels. These migration trees can be used to constrain the search space of the pmh, pmh_pr and pmh_cti algorithms.
generatemutationtrees Generates all mutation trees given a frequency matrix.
pmh_sankoff Enumerates all minimum-migration vertex labelings given a clone tree.
pmh Solves the Parsimonious Migration History (PMH) problem given a migration pattern restriction and a clone tree.
pmh_tr Solves the Parsimonious Migration History with Polytomy Resolution (PMH-PR) problem given a migration pattern restriction and a clone tree.
pmh_ti Solves the Parsimonious Migration History and Tree Inference (PMH-TI) given a migration pattern restriction, a mutation tree and a frequency matrix.
simulate Simulates a metastatic tumor.
visualizeclonetree Visualizes a clone tree and optional vertex labeling.
visualizemigrationgraph Visualizes the migration graph given a clone tree and vertex labeling.

Usage instructions

I/O formats

Below we describe the various formats used by the algorithms of the MACHINA framework.

Clone tree

A clone tree is provided as an edge list. Each line specifies an edge by listing the labels of the incident vertices separated by a space or tab character. For example:

A A1
A A2
A A3
A A4
A A5
A A6
...

See patient1.tree for the complete clone tree.

Leaf labeling

A leaf labeling assigns an anatomical site label to each leaf of a clone tree. Each line contains two values, the leaf label and the anatomical site label separated by a space or tab character. For example:

A1 Om
A2 SBwl
A3 LFTB
A4 LOv
A5 ApC
A6 RFTA
...

See patient1.labeling for the complete leaf labeling.

Vertex labeling

A vertex labeling assigns an anatomical site label to each vertex of a clone tree (including the leaves). Each line contains two values, the vertex label and the anatomical site label separated by a space or tab character. For example:

A ROv
B SBwl
D ROv
F ROv
H ROv
A1 Om
A2 SBwl
...

See patient1.reported.labeling for the complete vertex labeling.

Frequencies

A frequency file encodes the frequency of every mutation (cluster) in an anatomical site (sample). It is a tab separated file. The first line lists the number of anatomical sites followed by the number of samples and then the number of mutations, each on separate lines. The fourth line is ignored but describes the format of the rest of the file. Each subsequent line encodes the cell frequency of a mutation in a sample: first the sample 0-based index is given, followed by the label of the sample, the 0-based index of the anatomical site, the anatomical site label, the 0-based index of the mutation, the label of the mutation, the frequency lower bound and upper bound.

6 #anatomical sites							
6 #samples							
10 #mutation clusters							
#sample_index	sample_label	anatomical_site_index	anatomical_site_label	character_index	character_label	f_lb	f_ub
0	breast	0	breast	0	1	0.503628522	0.545237495
0	breast	0	breast	1	2	0	0.01213794
...

See F.tsv for the complete frequency file. For an example on how obtain this file from read data, please see: process_A7_new.ipynb. Specifically, you will need to process the bulk DNA sequencing data by first calling single-nucleotide variants and copy-number aberrations. Then, SNVs that occur in copy-neutral regions need to be clustered (e.g., using SciClone or PyClone). Confidence intervals can then be obtained by first pooling for each sample the read counts of the mutations that belong to the same cluster followed by using a beta distribution. Please see the supplement of the MACHINA paper for more details.

Parsimonious Migration History (pmh_sankoff and pmh)

In the parsimonious migration history we are given a clone tree T whose leaves are labeled by anatomical sites. The task is to label the inner vertices of T such that the resulting migration graph G has minimum migration number and comigration number. Additionally, it is possible to specify constraints on the topology of the migration graph.

PATTERN DESCRIPTION
PS (parallel single-source seeding) Each metastatic site is seeded directly from the primary tumor, i.e. G is a multi-tree such that the primary P is the only vertex with out-degree greater than 1.
S (single-source seeding) Each metastatic site is seeded from only one other anatomical site, i.e. G is a multi-tree.
M (multi-source seeding) A metastatic site may be seeded from multiple anatomical sites, but no directed cycles are introduced. That is, G is multi-DAG.
R (reseeding) Directed cycles in G are allowed.

In our algorithms we allow for the following restrictions on the migration pattern:

  1. Unrestricted: PS, S, M and R
  2. No reseeding: PS, S, M
  3. No reseeding and no multi-source seeding: PS and S
  4. No reseeding, no multi-source seeding and no single-source seeding: PS

The unconstrained PMH problem can be solved by running pmh_sankoff, which is an adaptation of the Sankoff algorithm and enumerates all migration histories:

Usage:
pmh_sankoff [--help|-h|-help] [-c str] [-o str] [-p str] T leaf_labeling
Where:
T
    Clone tree
leaf_labeling
    Leaf labeling
--help|-h|-help
    Print a short help message
-c str
    Color map file
-o str
    Output prefix
-p str
    Primary anatomical sites separated by commas (if omitted, every
    anatomical site will be considered iteratively as the primary)

An example execution of the pmh_sankoff algorithm (executed from the root directory of the MACHINA repository):

$ mkdir patient1
$ pmh_sankoff -p LOv,ROv -c data/mcpherson_2016/coloring.txt data/mcpherson_2016/patient1.tree \
data/mcpherson_2016/patient1.labeling -o patient1/ 2> patient1/result.txt

$ cat patient1/result.txt
Clone tree has 7 anatomical sites
Found 4 maximum parsimony labelings with primary 'LOv'
Found 2 labelings with 7 comigrations, 2 seeding sites and pR
Found 2 labelings with 11 comigrations, 3 seeding sites and pR
Labeling 0: 13 migrations, 11 comigrations, 3 seeding sites and pR
Labeling 1: 13 migrations, 11 comigrations, 3 seeding sites and pR
Labeling 2: 13 migrations, 7 comigrations, 2 seeding sites and pR
Labeling 3: 13 migrations, 7 comigrations, 2 seeding sites and pR
Found 1 maximum parsimony labelings with primary 'ROv'
Found 1 labelings with 10 comigrations, 2 seeding sites and pM
Labeling 0: 13 migrations, 10 comigrations, 2 seeding sites and pM

The above command considers the left ovary (LOv) and right ovary (ROv) as the primary tumor site and enumerates all minimum migration vertex labelings of the given clone tree and leaf labeling. The output is stored in the directory patient1. In the DOT files the given color map is used for coloring the anatomical sites.

To further constrain the migration graph, we can use pmh:

Usage:
  pmh [--help|-h|-help] [-G str] [-OLD] [-UB_gamma int] [-UB_mu int]
     [-UB_sigma int] -c str [-e] [-g] [-l int] [-log] [-m str] [-o str]
     -p str [-t int] T leaf_labeling
Where:
  T
     Clone tree
  leaf_labeling
     Leaf labeling
  --help|-h|-help
     Print a short help message
  -G str
     Optional file with migration graphs
  -OLD
     Use old ILP (typically much slower)
  -UB_gamma int
     Upper bound on the comigration number (default: -1, disabled)
  -UB_mu int
     Upper bound on the migration number (default: -1, disabled)
  -UB_sigma int
     Upper bound on the seeding site number (default: -1, disabled)
  -c str
     Color map file
  -e
     Export ILP
  -g
     Output search graph
  -l int
     Time limit in seconds (default: -1, no time limit)
  -log
     Gurobi logging
  -m str
     Allowed migration patterns:
       0 : PS
       1 : PS, S
       2 : PS, S, M
       3 : PS, S, M, R
     If no pattern is specified, all allowed patterns will be enumerated.
  -o str
     Output prefix
  -p str
     Primary anatomical site
  -t int
     Number of threads (default: -1, #cores)

An example execution of the pmh algorithm (executed from the root directory of the MACHINA repository):

$ mkdir patient1_constrained
$ pmh -p LOv -c data/mcpherson_2016/coloring.txt data/mcpherson_2016/patient1.tree \
data/mcpherson_2016/patient1.labeling -o patient1_constrained/ > patient1_constrained/result.txt

$ cat patient1_constrained/result.txt
LOv-    (PS)    15      6       1       pPS     15.125  15.125  0.106424
LOv-    (PS, S) 15      6       1       pPS     15.125  15.125  0.057286
LOv-    (PS, S, M)      15      6       1       pPS     15.125  15.125  0.060853
LOv-    (PS, S, M, R)   13      7       2       pR      13.148  13.148  0.44595

Each line lists the solution found by MACHINA. First the primary anatomical site is given, then the provided migration pattern restriction set, followed by the migration number, comigration number and seeding site number. Finally, the identified migration pattern is given, followed by a lower bound (LB) on the optimal solution and then an upper bound (UB), ending with the total running time in seconds. In case LB == UB, the identified solution is optimal.

Parsimonious Migration History with Tree Resolution (pmh_tr)

In the parsimonious migration history with polytomy resolution we are given a clone tree T whose leaves are labeled by anatomical sites. The task is to find a refinement T' of T and label its inner vertices of such that the resulting migration graph G has minimum migration number, comigration number and seeding site number. It is possible to specify constraints on the topology of the migration graph.

Usage:
  pmh_tr [--help|-h|-help] [-G str] [-OLD] [-UB_gamma int] [-UB_mu int]
     [-UB_sigma int] -c str [-e] [-g] [-l int] [-log] [-m str] [-o str]
     -p str [-t int] T leaf_labeling
Where:
  T
     Clone tree
  leaf_labeling
     Leaf labeling
  --help|-h|-help
     Print a short help message
  -G str
     Optional file with migration graphs
  -OLD
     Use old ILP (typically much slower)
  -UB_gamma int
     Upper bound on the comigration number (default: -1, disabled)
  -UB_mu int
     Upper bound on the migration number (default: -1, disabled)
  -UB_sigma int
     Upper bound on the seeding site number (default: -1, disabled)
  -c str
     Color map file
  -e
     Export ILP
  -g
     Output search graph
  -l int
     Time limit in seconds (default: -1, no time limit)
  -log
     Gurobi logging
  -m str
     Allowed migration patterns:
       0 : PS
       1 : PS, S
       2 : PS, S, M
       3 : PS, S, M, R
     If no pattern is specified, all allowed patterns will be
     enumerated (default: '0,1,2,3')
  -o str
     Output prefix
  -p str
     Primary anatomical site
  -t int
     Number of threads (default: -1, #cores)

An example execution (executed from the root directory of the MACHINA repository):

$ mkdir patient1_tr
$ pmh_tr -p LOv -c data/mcpherson_2016/coloring.txt data/mcpherson_2016/patient1.tree \
data/mcpherson_2016/patient1.labeling -o patient1_tr/ > patient1_tr/result.txt

$ cat patient1_tr/result.txt
LOv-    (PS)    12      6       1       pPS     12125   12125   0.53603
LOv-    (PS, S) 12      6       1       pPS     12125   12125   0.242606
LOv-    (PS, S, M)      12      6       1       pPS     12125   12125   0.248338
LOv-    (PS, S, M, R)   11      7       2       pR      11148   11148   15.9117

The results.txt file is formatted in exactly the same way as pmh.

Parsimonious Migration History with Tree Inference (pmh_ti)

Given a mutation tree T with mutation frequencies F- and F+, the task is to find a frequency assignment F yielding a refined clone tree T' that admits a vertex labeling l such that the resulting migration graph G has minimum number of migrations and comigrations. It is possible to specify constraints on the topology of the migration graph.

Usage:
  pmh_ti [--help|-h|-help] -F str [-G str] [-OLD] [-UB_gamma int]
     [-UB_mu int] [-UB_sigma int] -barT str -c str [-e] [-g] [-l int] [-log]
     [-m str] [-mutTreeIdx int] [-noPR] [-o str] -p str [-t int]
Where:
  --help|-h|-help
     Print a short help message
  -F str
     Frequencies file
  -G str
     Optional file with migration graphs
  -OLD
     Use old ILP (typically much slower)
  -UB_gamma int
     Upper bound on the comigration number (default: -1, disabled)
  -UB_mu int
     Upper bound on the migration number (default: -1, disabled)
  -UB_sigma int
     Upper bound on the seeding site number (default: -1, disabled)
  -barT str
     Mutation trees
  -c str
     Color map file
  -e
     Export ILP
  -g
     Output search graph
  -l int
     Time limit in seconds for the ILP (default: -1, unlimited)
  -log
     Gurobi logging
  -m str
     Allowed migration patterns:
       0 : PS
       1 : PS, S
       2 : PS, S, M
       3 : PS, S, M, R
     If no pattern is specified, all allowed patterns will be
     enumerated (default: '0,1,2,3')
  -mutTreeIdx int
     Mutation tree index (default: -1)
  -noPR
     Disable polytomy resolution
  -o str
     Output prefix
  -p str
     Primary anatomical site
  -t int
     Number of threads (default: -1, #cores)

An example execution (executed from the root directory of the MACHINA repository):

$ mkdir A7
$ generatemutationtrees data/hoadley_2016/A7/A7_MACHINA_0.95.tsv > A7/mutation_trees.txt
8/1/1/-1 (1)
9/3/4/-1 (8)
10/4/8/-1 (9)
Found 2 mutation trees with 10 out of 10 mutations

$ pmh_ti -p breast -c data/hoadley_2016/coloring.txt -m 1 -o A7 -F data/hoadley_2016/A7/A7_MACHINA_0.95.tsv \
-barT A7/mutation_trees.txt > A7/result.txt
Read 2 mutation trees.

$ cat A7/result.txt
0-      (PS, S) 5       5       2       mS      5146.83 5146.83 36.4647
1-      (PS, S) 5       5       2       mS      5146.83 5146.83 37.8299

The program generatemutationtrees uses the SPRUCE algorithm to enumerate all mutation trees given a frequency matrix. The program pmh_ti considers solves the PMH-TI problem for each enumerated mutation tree. The results.txt file is formatted in exactly the same way as in pmh.