-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathsimpg.Rd
178 lines (154 loc) · 8.51 KB
/
simpg.Rd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/simpg.R
\name{simpg}
\alias{simpg}
\title{Simulate a bacterial pangenome}
\usage{
simpg(ref = "pan_genome_reference.fa", norg = 10, br, ne = 1e+11,
C = 100, u = 1e-08, v = 1e-11, mu = 5e-12, write_by = "gene",
dir_out = "sim_pg", replace = FALSE, smat, force = FALSE,
verbose = TRUE)
}
\arguments{
\item{ref}{A multi-fasta file from where to sample genes. Each fasta header
is expected to identify a single gene. Each sampled gene is taken as the
most recent common ancestor of each gene family sampled at the end of the
simulation (final time); in other words, each gene family will coalesce to
each of the original genes sampled from this file.}
\item{norg}{The number of organisms sampled at final time.}
\item{br}{\code{numeric} vector of length \code{norg - 1} for setting the
coalescent branch lengths. By default is calculated as:
\code{rexp( norg - 1, choose(seq(norg, 2, -1), 2) )}
as expected by the coalescent theory (see Wakeley, 2008 "Coalescent Theory:
An Introduction". Roberts & Company Publishers).}
\item{ne}{Effective population size. Baumdicker et al. (2012) estimates
the effective population sizes of \emph{Prochlorococcus} and \emph{Synechococcus} to be
around 10e11, which was taken as default. This is also the number of
generations from the MRCA to the sampled (final) organisms.}
\item{C}{The coregenome size (default is 100).}
\item{u}{Probability of gene gain per generation. Default is 1e-8, to be in
accordance to values presented in Baumdicker et al. (2012). (theta = 2Ne*u,
and taking Ne = 10e11, and estimates of theta ~2000 for \emph{Prochlorococcus}).}
\item{v}{Probability of gene loss rate per generation. Default is 1e-11, to
be in accordance to values presented in Baumdicker et al. (2012). (rho = 2Ne*v,
and taking Ne = 10e11, and estimates of rho ~2 for \emph{Prochlorococcus}).}
\item{mu}{The per site per generation substitution rate. According to Duchene
et al. (2016), a typical substitution rate is around 1e-5 and 1e-9 per site,
per annum. Taking 1e-9, and assuming 200 generation per annum (Shierup and Wuif,
2010), this gives mu = 5e-12, which was taken as default.}
\item{write_by}{One of \code{'gene'} of \code{'genome'}. Whether to write
sequences in files grouped by orthology or by genome, respectively.}
\item{dir_out}{The non-existent directory where to put the simulated
orthologous groups.}
\item{replace}{\code{logical}. Whether to replace (see \link[base]{sample}) when sampling
genes from reference. Default and recommended is FALSE, since IMG model states
that each new gene is a different one. Setting \code{replace = TRUE} will probably
cause duplicated genes, although each one will follow a different evolutionary
history in the simulation.}
\item{smat}{A codon substitution matrix with probability weights for
transition. It must be a \code{data.frame} with \code{dim(smat) == c(64, 64)}, where
column and row names are the possible combination of codons (i.e. "aaa",
"taa", "caa", etc). The values in the \code{data.frame} correspond to probability
weights for obtaining the elements of the vector being sampled. See
\link[base]{sample} for more information. The default is a transition matrix
with 0 probability of transition from a coding codon to any stop codon and to
itself, and equal probability weights of transition to any other codon which
differ in, at most, one base (point mutations; one site per codon at most is
mutated each time). If you want to use a different codon matrix, I recommend
to use the default as reference: \code{simba:::.codon.subst.mat} .
See package source (R/subs_mat.R) for a guide on how to generate a similar
matrix.}
\item{force}{\code{logical}. If \code{dir_out} exists, whether to force
directory creation by overwriting the existing one. It will eliminate any
existing content.}
\item{verbose}{\code{logical}. Show (or not) progress messages.}
}
\value{
An object of class \code{pangenomeSimulation}, which consist in a
list of 5 elements: (1) The simulated coalescent, as an object of class
\link[ape:read.tree]{phylo} (ape package); (2) A list with gene families;
(3) The panmatrix; (4) a list with substitution codon position for each
gene, per branch, since gene birth to sampling time, and (5) a list of
genetic distances, for each gene family. Also a series of attributes are
returned, with calling parameters. A summary method is also provided to
retrieve statistics from the simulation.
}
\description{
Simulate the evolution of a pangenome using a random ultrametric
tree as guide to resemble a coalescent process, and according to both the
neutral model (mutations) and the Infinitely Many Genes (IMG) model (gene
presence/absence).
}
\details{
The algorithm first simulates a random ultrametric tree (\link[ape]{rcoal})
describing the evolutionary history of organisms sampled at the final time.
Effective population size (ne) is also interpreted as the number of
generations from the MRCA to the sampling time.
A IMG process is simulated using this tree, and parameters C (coregenome
size), u (probability of gene gain, per generation), and v (probability of
gene loss, per generation). At the end of this process a final panmatrix is
obtained, describing the presence or absence of each gene for each organism,
at the sampling time. The final gene presence/absence pattern is in
accordance with the previously simulated coalescent tree, as the model
describes.
As many genes as in the above panmatrix are sampled from the reference
file (ref). For each one the MRCA node in the tree is identified, and from
this node to the final time the gene is subjected to mutations at rate mu
(probability of substitution, per generation). Genes are first replicated
as many times it appears at the final time, and mutations are computed in
accordance to the simulated tree (each gene will follow an independent path).
This implementation mutates codons, rather than single nucleotides, because
we wanted to be able to avoid stop codons of being created at the middle of
a gene. The probability of substitution from one codon to another is decribed
by the \code{smat} parameter, which by defaults assigns equal probability of
changing one codon to another with a single nucleotide difference between
them, 0 probability of changing from one codon to another with more than one
difference, and 0 probability of changing from any codon to a stop codon.
This behavior can be changed by modifying smat parameter, which by default
takes the substitution matrix \code{'.codon.subst.mat'} in the namespace of
this package. To see a reference of how it was created, the file subs_mat.R
in the package source shows details, and can be taken as guide to create a
custom one.
Finally, sequences are written to the output directory (dir_out) by gene
(group of orthologous sequences), or by genome, in fasta format. Each header
contains information about which gene in the reference file was used to
generate it. One can set the algorithm to be able to sample genes from the
reference file with replace (replace = TRUE), but it is not recommended
since the IMG model states that each gene only is gained once in the
evolutionary history of the pan organism (so replace = FALSE, by default,
and gives error if number of columns in the final panmatrix is greater than
available genes in ref), although if it is set to TRUE, each copy will
follow an independent evolutionary path.
}
\examples{
\dontrun{
library(simurg)
ref_file <- system.file('extdata', 'ref_tutorial.tar.gz', package = 'simurg')
untar(tarfile = ref_file)
ref <- 'ref_tutorial.fasta'
pg <- simpg(ref = ref, norg = 10, ne = 1e10, C = 100, u = 1e-8, v = 1e-11,
mu = 5e-12, write_by = 'genome', dir_out = 'sim_pg',
replace = TRUE)
summary(pg)
}
}
\references{
Baumdicker, F., Hess, W. R., & Pfaffelhuber, P. (2010). "The
Diversity of a Distributed Genome in Bacterial Populations". The Annals of
Applied Probability.
Baumdicker, F., Hess, W. R., & Pfaffelhuber, P. (2012). "The Infinitely Many
Genes Model for the Distributed Genome of Bacteria". Genome Biology and
Evolution.
Duchene, S. (2016). "Genome-scale rates of evolutionary change in bacteria".
Microbial Genomics.
Ashley Robinson, D., Falush, D. Fiel, E. J. (2010). "Bacterial Population
Genetics in Infectious Disease." Wiley-Blackwell, Chapter 1: "The Coalescent
of Bacterial Populations"; Section 1.5; Schierup, M. H. & Wiuf, C. "From
Coalescent Time to Real Time".
Jukes, T. H., Cantor, C. R. (1969). "Evolution of Protein Molecules". New
York: Academic Press.
Kimura, M. (1983). "The neutral theory of molecular evolution". Cambridge.
}
\author{
Ignacio Ferres
}