This repository contains the source code used in the manuscript 'Persistent, Private and Mobile genes: a model for gene dynamics in evolving pangenomes' by Jasmine Gamblin, Amaury Lambert and François Blanquart.
- folder source: source code for inference tool
- folder test: test files containing simulated tree and matrix to run a test inference
- Makefile: compilation file
- simulation_aux.R, run_simulations.R: R scripts used to simulate the PPM model
- GCC (or other C++ compiler)
- libtbb (Intel Threading Building Blocks)
Install TBB on Ubuntu/Debian:
sudo apt-get install libtbb-dev
On Fedora:
sudo dnf install tbb-devel
On macOS:
brew install tbb
git clone https://github.com/JasmineGamblin/PPMmodelPangenome
cd PPMmodelPangenome
make
- Species tree tree.nwk: Newick format with leaf labels (internal node labels are accepted but ignored). The tree must be ultrametric. Example:
(genome2:0.35,genome1:0.35):0;
- Presence/absence matrix pa_matrix.txt: Matrix in CSV format (comma-separated values) where rows are genomes and columns are genes. The matrix must contain row names (genome IDs matching the tree leaf labels) and column names (which are ignored). Values must be either 0 or 1. Example:
gene1,gene2,gene3
genome1,1,0,1
genome2,0,1,1
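The input requirements above (ultrametric tree, 0/1 values, genome IDs matching the leaf labels) can be checked before launching an inference. The following is a minimal sketch and not part of this repository: the function names `leaf_depths`, `is_ultrametric` and `read_pa_matrix` are hypothetical, the Newick reader assumes unquoted labels and a single tree, and the tolerance `tol` is an arbitrary choice.

```python
import csv


def leaf_depths(newick):
    """Return {leaf label: root-to-leaf distance} for one Newick string.

    Assumption: labels are unquoted and contain none of the characters "(),:".
    """
    text = newick.strip().rstrip(";")
    pos = 0

    def label_and_length():
        # Read an optional label, then an optional ":length" suffix.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:":
            pos += 1
        label = text[start:pos].strip()
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in "(),":
                pos += 1
            length = float(text[start:pos])
        return label, length

    def node():
        # Distances are measured from the parent end of this node's branch.
        nonlocal pos
        below = {}
        if text[pos] == "(":
            pos += 1
            below.update(node())
            while text[pos] == ",":
                pos += 1
                below.update(node())
            if text[pos] != ")":
                raise ValueError("malformed Newick string")
            pos += 1
            _, length = label_and_length()  # internal node label is ignored
        else:
            label, length = label_and_length()
            below[label] = 0.0
        return {leaf: d + length for leaf, d in below.items()}

    return node()


def is_ultrametric(depths, tol=1e-9):
    """True if all leaves are at the same distance from the root (within tol)."""
    values = list(depths.values())
    return max(values) - min(values) <= tol


def read_pa_matrix(lines):
    """Parse the CSV matrix into {genome ID: [0/1, ...]}; the header row is ignored."""
    rows = [row for row in csv.reader(lines) if row]
    matrix = {}
    for row in rows[1:]:
        genome, values = row[0], row[1:]
        if any(v.strip() not in ("0", "1") for v in values):
            raise ValueError(f"non-binary value in row {genome!r}")
        matrix[genome] = [int(v) for v in values]
    if len({len(v) for v in matrix.values()}) > 1:
        raise ValueError("rows have different numbers of genes")
    return matrix
```

With the two example files above, `set(read_pa_matrix(open("pa_matrix.txt")))` should equal `set(leaf_depths(open("tree.nwk").read()))`, and `is_ultrametric` should return `True` for the tree.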
To run an inference with a random starting point, use the following command:
./inference seed "test/tree.nwk" "test/pa_matrix.txt" "test/mle_param.txt" "test/inf_cat.txt"
where:
- seed is the random seed (must be a nonzero integer, as 0 indicates that the user is choosing the starting point)
- tree.nwk and pa_matrix.txt are the input files
- mle_param.txt and inf_cat.txt are the output files
To run an inference with a chosen starting point, use instead:
./inference 0 "test/tree.nwk" "test/pa_matrix.txt" "test/mle_param.txt" "test/inf_cat.txt" N0 l0 i1 l1 g2 l2 s_10 s_01
where N0, l0, i1, l1, g2, l2, s_10, and s_01 are replaced by the chosen initial values.
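When launching many runs from a script, the two calling conventions above can be assembled programmatically. This is a hedged sketch rather than a tool shipped with the repository: `build_inference_command` and `START_ORDER` are hypothetical names, and the only facts taken from this README are the argument order and the rule that a first argument of 0 means a user-chosen starting point.

```python
# Order of the initial values expected after the four file paths.
START_ORDER = ("N0", "l0", "i1", "l1", "g2", "l2", "s_10", "s_01")


def build_inference_command(tree, matrix, out_param, out_cat,
                            seed=None, start=None, exe="./inference"):
    """Return the argument list for one inference run.

    Pass either a nonzero integer `seed` (random starting point) or a
    `start` dict holding one value per name in START_ORDER (chosen start).
    """
    if (seed is None) == (start is None):
        raise ValueError("pass exactly one of seed or start")
    files = [tree, matrix, out_param, out_cat]
    if start is None:
        if not isinstance(seed, int) or seed == 0:
            raise ValueError("seed must be a nonzero integer")
        return [exe, str(seed)] + files
    missing = [name for name in START_ORDER if name not in start]
    if missing:
        raise ValueError(f"missing initial values: {missing}")
    # A first argument of 0 signals that the starting point is user-chosen.
    return [exe, "0"] + files + [str(start[name]) for name in START_ORDER]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list instead of a shell string avoids quoting problems in file paths.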
Inference should take around 5 minutes on a laptop with the provided test data (20 genomes, 949 genes), but using a cluster is recommended for bigger datasets.
- Parameter estimates mle_param.txt: values are stored in the following order: seed, N0, l0, i1, l1, i2, g2, l2, s_10, s_01, and the maximum log-likelihood value reached
- Inferred gene categories inf_cat.txt: file containing the inferred category number for each gene (0 for Persistent, 1 for Private and 2 for Mobile), separated by blank spaces
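The two output files can be read back for downstream analysis along these lines. This is an illustrative sketch: `parse_mle_param` and `count_categories` are hypothetical helpers, and the separator used in mle_param.txt is not stated in this README, so the parser accepts whitespace or commas as an assumption.

```python
import re
from collections import Counter

# Value order documented for mle_param.txt.
PARAM_NAMES = ("seed", "N0", "l0", "i1", "l1", "i2", "g2", "l2",
               "s_10", "s_01", "log_likelihood")
CATEGORY_NAMES = {0: "Persistent", 1: "Private", 2: "Mobile"}


def parse_mle_param(text):
    """Map the values in mle_param.txt to their parameter names.

    Assumption: values are separated by whitespace and/or commas.
    """
    tokens = [t for t in re.split(r"[,\s]+", text.strip()) if t]
    if len(tokens) != len(PARAM_NAMES):
        raise ValueError(f"expected {len(PARAM_NAMES)} values, got {len(tokens)}")
    return dict(zip(PARAM_NAMES, (float(t) for t in tokens)))


def count_categories(text):
    """Count genes per inferred category from the contents of inf_cat.txt."""
    counts = Counter(int(t) for t in text.split())
    unknown = set(counts) - set(CATEGORY_NAMES)
    if unknown:
        raise ValueError(f"unexpected category codes: {sorted(unknown)}")
    return {name: counts.get(code, 0) for code, name in CATEGORY_NAMES.items()}
```

For example, `count_categories(open("test/inf_cat.txt").read())` returns the number of genes assigned to each of the three categories.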