forked from popsim-consortium/manuscript
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpopSimManu.tex
756 lines (672 loc) · 45.7 KB
/
popSimManu.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
\documentclass[12pt,halfline,a4paper]{ouparticle}
\usepackage{amsmath,amssymb}
\usepackage[round]{natbib}
\usepackage{graphicx,rotating}
\usepackage{dsfont}
\usepackage{xspace}
\newcommand{\Est}[1]{\hat{#1}}
\newcommand{\beginsupplement}{%
\setcounter{table}{0}
\renewcommand{\thetable}{S\arabic{table}}%
\setcounter{figure}{0}
\renewcommand{\thefigure}{S\arabic{figure}}%
}
\newcommand{\stopsupplement}{%
\setcounter{table}{0}
\renewcommand{\thetable}{\arabic{table}}%
\setcounter{figure}{0}
\renewcommand{\thefigure}{\arabic{figure}}%
}
% method names
\newcommand{\stdpopsim}{\texttt{stdpopsim}\xspace}
\newcommand{\dadi}{$\partial a \partial i$\xspace}
\newcommand{\MSMC}{\texttt{MSMC}\xspace}
\newcommand{\smcpp}{\texttt{smc++}\xspace}
\newcommand{\stairwayplot}{\texttt{stairwayplot}\xspace}
\newcommand{\fastsimcoal}{\texttt{fastsimcoal}\xspace}
\newcommand{\tskit}{\texttt{tskit}\xspace}
\newcommand{\adk}[1]{\textcolor{red}{ADK: #1}}
\usepackage[hidelinks]{hyperref}
\begin{document}
\title{A community-maintained standard library of population genetic simulations}
\author{%
\name{The PopSim Consortium}
\address{}
\email{}}
% JK: didn't bother with the \texttt stuff for stdpopsim in the abstract as it's nearly
% always dropped in production anyway.
\abstract{The explosion in population genomic data demands ever more complex
modes of analysis, and increasingly these analyses depend on sophisticated simulations.
Recent advances in population genetic simulation have made it possible to
simulate large and complex models, but specifying such models for a particular
simulation engine remains a difficult and error-prone task.
Computational genetics researchers currently re-implement simulation models
independently, leading to duplication of effort and the possibility for error.
Population genetics, as a field, also lacks standard benchmarks by which new tools for inference might
be measured. Here we describe a new resource, \stdpopsim, that attempts to rectify this situation.
\stdpopsim is a community-driven, open source project, which provides a standard
catalog of published simulation models from a wide range of organisms,
and supports multiple simulation engine backends.
% PopSim assures correctness in our implementations through a mature, community-based quality control pipeline.
We share some examples demonstrating how \stdpopsim
can be used to systematically compare demographic inference methods,
and encourange an even broader
community of developers to join in the effort to standardize
population genetic methods.}
\date{\today}
\keywords{Population genetics, Simulation, Inference}
\maketitle
\section*{Introduction}
While population genetics has always used statistical methods to make inferences from data,
the degree of sophistication of the questions, models, data, and computational approaches
used have all increased over the past decade. Currently there exist myriad computational methods
that can infer the histories of populations
\citep{gutenkunst2009inferring,li2011inference,excoffier2013robust,schiffels2014inferring, terhorst2017robust,ragsdale2019models},
the distribution of fitness effects \citep{kim2017inference},
recombination rates \citep{chan2012genome,lin2013fast,Adrion662247},
the extent of positive selection in genome sequence data
\citep{eyre2009estimating, alachiotis2012omegaplus,degiorgio2016sweepfinder2,kern2018diplos,sugden2018localization}.
While en masse these methods have increased our understanding of the
impacts of genetic and evolutionary processes, very little has been done to systematically
benchmark the quality of inferences gleaned from computational population genetics.
As large databases of population genetic variation begin to be used to inform public health procedures,
the accuracy and quality of these inferences is becoming even more important.
This is in large part because ``ground-truth'' in population genetics comes from simulation,
not from empirical observations, and as a field there has been no agreement as to what should
constitute community standards or best practices. The general modus operandi to date has been to
introduce methods which are validated by a set of bespoke simulations produced by individual
groups. Beyond a lack of efficiency from duplicated effort, such a situation leads
to decreases in reproducibility and transparency across the entire field.
Here we describe \stdpopsim, a community-driven catalog
of standardized population genetics models spanning a variety of common study organisms.
This resource will empower the field to rigorously evaluate population genetic software
and select the best tool for a given task, thereby helping to usher in the next era of population genetic data analysis.
%DRS: still not in love with this last sentence here, but tried to make it a little less sappy by adding some explanation.
%ADK: made some edits here
The degree to which modeling assumptions and different choices of summaries of the
data can affect the ultimate results has been challenging to assess. Yet, there
have been several clear examples of different methods yielding fundamentally
different conclusions. For example, \MSMC methods applied to human genomes have
% JK is this right? > 100 years doesn't sound very ancient to me.
suggested large ancient ($>100$ years ago) ancestral population sizes and
bottlenecks that have not been detected by SFS-based methods
\citep[see][]{beichman2017comparison}.
These distinct methods differ in how they model, summarize, and optimize fit to
genetic variation data, suggesting that such design choices can greatly affect the
performance of the inference. Furthermore, it is reasonable
to assume generally that specific methods will perform better than others under certain scenarios.
However, because these conditions are typically unknown, empirical researchers
lack principled guidelines for deciding which statistical method is best suited
to accurately answer their particular question. The need for empirical
guidance will only increase as researchers continue to generate more genomic
variation data from non-model taxa and wish to make inferences regarding
demography and selection.
One way to benchmark different methods for statistical inference in population
genetics is to apply the methods to simulated datasets where the true model
parameters are known. Methods that perform well will accurately infer these
parameters and will also provide reasonable estimates of uncertainty of the
estimated parameters. Indeed, simulation studies are a standard ingredient in
most papers proposing new methods in population genetics. However, these
studies often use simulations of the particular models for which the method was designed,
which showcases the novel method rather than rigorously comparing performance to competing methods.
Of course, testing a new method in the situation for which it was designed is an important step,
but more rigorous testing is necessary for tools that are to be widely used in the field.
% Moreover researchers are generally forced to reimplement previously
% inferred scenarios independently, which makes good benchmarking
% a major additional time investment, and can lead to errors.
% Lastly, the degree to which different methods are compared can be quite variable.
Many more researchers in the broader field
use simulations to perform power and other exploratory analyses.
The many steps of implementing a moderately complicated population genetics simulation,
including translating published demographic models into input for a simulator
and obtaining genetic maps,
present a major obstacle, especially to new researchers.
% One reason why simulation studies do not compare the performance of many methods
% across a wide range of situations
% is that simulation studies are challenging to carry out, for several reasons.
% First, while there are many published population genetic models available, their
% parameters are often not easily accessible, usually presented in supplemental
% tables. Translating these parameters into input for a given simulation engine can
% be a time-consuming and error prone process. In addition, papers often present many
% different models, and different authors may choose to select distinct models,
% making comparison difficult. While improvements in simulation software and ever
% faster computers have partially ameliorated this concern, generating the
% simulated datasets can still be a time and computing resource bottleneck.
For these reasons, we have generated a standardized, community-driven resource
for simulating published demographic models from a number of popular study systems.
This resource, which we call \texttt{stdpopsim}, makes running
realistic simulations for population genetic analysis a simple matter of
choosing pre-implemented models from a community maintained catalog.
The \stdpopsim catalog currently contains four organisms: humans,
\emph{D.~melanogaster}, \emph{A.~thaliana}, and \emph{E.~coli}. For each
organism the catalog contains details on the physical organization (e.g., chromosome structure)
of their genomes, one or more genetic maps, default population-level parameters (mutation rate,
generation time) and one or more published demographic histories. Through
either a command line interface or a simple Python API, the user can specify which
organism, genetic map, chromosome, and demographic history they are interested in simulating, and the
simulation output from their chosen model is returned.
The software has been developed by the PopSim Consortium using a
distributed open source model, with strong procedures in place
to continue its growth and maintain its accuracy.
Importantly, we have rigorous quality control methods to ensure implemented models are accurate,
and have documented methods for others to contribute new modules.
We welcome new collaborators.
Below we describe the resource and give an
example of how it can be used to benchmark demographic inference methods.
\begin{figure}[t]
\begin{center}
\includegraphics[width=0.5\linewidth]{display_items/Figure1.pdf}
\caption{\textbf{Structure of \stdpopsim}. \textbf{A.} The
hierarchical organization of the PopSim catalog contains all information
within individual species. Information within the species level is shown here
for \emph{Homo sapiens} only. We see that \emph{H. sapiens} is associated
with a default generation time and mutation rate,
a representation of the physical genome, and two genetic maps. Also shown is a
subset of the demographic models currently implemented. \textbf{B.} Caching
and automated download of genetic maps. \textbf{C.} Example code to specify
and simulate models using \stdpopsim via the API (top) or
command-line interface (bottom). \textbf{D.} Simulation output
is a tree sequence file that can
be manipulated and analysed with \tskit.}
\label{fig:cartoon}
\end{center}
\end{figure}
\section*{Results}
\paragraph{The \stdpopsim library}
The central contribution of the PopSim consortium currently is \stdpopsim, a
community-maintained library of empirical genome data and population genetics simulation
models, providing easy access to realistic simulations. Figure \ref{fig:cartoon} shows a graphical
representation of the structure of \stdpopsim. The package centers
on a catalog of species (Fig. \ref{fig:cartoon}A), initially consisting of humans, \emph{Drosophila},
\emph{Arabidopsis}, \emph{E. coli}, and \emph{Pongo}. A species definition consists
of two key elements. Firstly, the library defines
some basic information about the species' genome, including information about chromosome
lengths, average mutation rates and so on. We also provide access to detailed
empirical information such as recombination maps, which model empirically-observed
heterogeneity along chromosomes. As such maps are often large and hard
to find, we provide a mechanism to automatically download (and cache locally) a
set of previously published maps (Fig. \ref{fig:cartoon}B). For example, in humans we support the
HapMapII~\citep{international2007second} and
deCODE~\citep{kong2010fine} genetic maps. The second key element of a species description
within \stdpopsim is a set of carefully curated population genetic model
descriptions from the literature, which are the basis for accurate simulations
of specific historical scenarios that have influenced present-day patterns of
genetic polymorphism.
Given the genome data and simulation model descriptions defined within the
library, it is then straightforward to run accurate, standardized simulations
across a range of organisms. \stdpopsim has a Python API and a user-friendly
command line interface (Fig. \ref{fig:cartoon}C), allowing users with minimal experience direct access to
state-of-the-art simulations. Simulations are output in the “tree sequence”
format \citep{kelleher2016efficient,kelleher2018efficient,kelleher2019inferring}, which
contains complete genealogical information about the simulated samples, is
extremely compact, and can be processed very efficiently using the \tskit library
\citep{kelleher2016efficient,kelleher2018efficient}. Currently,
\stdpopsim uses the \texttt{msprime} coalescent simulator \citep{kelleher2016efficient}
as the default simulation engine. We plan to include other simulation
engines to support scenarios that cannot be modelled under the coalescent framework,
and have already implemented \texttt{SLiM} \citep{haller2019slim} as
an alternative back-end.
The \stdpopsim command line interface, by default, outputs citation information
for the models, genetic maps and simulation engines used in any particular run.
We hope that this will encourage users to appropriately acknowledge the
resources used in published work, and authors
of simulation models to contribute to our ongoing community-driven development process.
\subsection*{The \stdpopsim Catalog}
The current contents of the \stdpopsim catalog are shown in \autoref{tab:catalog}. At the time of
writing the PopSim Consortium has implemented previously published demographic models for
humans, \emph{Drosophila}, and \emph{Arabidopsis}. These range from
simple, single population histories \cite[e.g.,][]{sheehan2016deep},
to complex models which include population splitting, migration, and archaic
admixture \cite[e.g.,][]{ragsdale2019models}.
\begin{table}[t]
\begin{footnotesize}
\input{model_table.tex}
\end{footnotesize}
\caption{\label{tab:catalog}
Details of the initial set of population models across four species.
\textbf{NOTES: Column names are ID, name (or description maybe?),
reference, CPU time, RAM}
}
\end{table}
At the time of writing, more models of human history have been implemented than for
any other organism. These models include: a simplified version of the \cite{tennessen2012evolution}
model with only the African population specified (expansion from the ancestral
population and recent growth), the three population model of \cite{gutenkunst2009inferring}
which specifies the out-of-Africa bottleneck as well as the subsequent divergence of
the European and Asian populations, the \cite{tennessen2012evolution} two population variant of the
Gutenkunst model which does not include Asian populations, but more explicitly models
recent rapid human population growth, the \cite{browning2018ancestry} admixture model
for American populations which specifies ancestral African, European, and Asian population
components, and the three-population out-of-Africa model from \cite{ragsdale2019models}
which includes archaic admixture.
Together these models
contain features believed to have widespread impacts in real data (e.g. bottlenecks, population growth,
admixture), and are therefore highly pertinent in the context of method development.
For non-human genomes we have have implemented two demographic histories for
\emph{Drosophila melanogaster} and one each from \emph{A. thaliana} and \emph{E. coli}.
For \emph{D. melanogaster} we have implemented the three-epoch model estimated by \cite{sheehan2016deep} from
an African sample, as well as the out-of-Africa divergence
and associated bottleneck model of \cite{li2006inferring} which jointly models African
and European populations. For \emph{Arabidopsis thaliana}, we implemented the
model in \cite{durvasula2017african} inferred using \MSMC. This model includes
a continuous change in population size over time, rather than pre-specified epochs of different
population sizes.
\subsection*{Use case: comparing methods of demographic inference}
As an example of the utility of \stdpopsim, we demonstrate how it can be used
to simply and fairly compare popular demographic inference methods.
Although we present comparison of results from several
competing methods, our aim at this stage is not to provide an exhaustive
evaluation or ranking of these methods. Our hope is instead that future work built upon this resource
will enable more detailed exploration of the strengths and weaknesses of the numerous
inference methods that are available to the population genetics community.
We start by comparing popular methods for estimating
population size histories ($N(t)$) of single populations and subsequently
show simple examples of multi-population inference.
To reproducibly evaluate and compare the performance of inference methods, we developed
\texttt{snakemake} \citep{koster2012snakemake} workflows,
available from \url{https://github.com/popgensims/analysis}.
These workflows allow efficient computing in multicore or cluster environments.
For single-population population size histories, we compared
\MSMC~\citep{schiffels2014inferring}, \smcpp~\citep{terhorst2017robust}, and
\stairwayplot~\citep{liu2015exploring}
on simulated genomes sampled from a single population,
in a number of the demographic models described above. The pipeline was generalized to
run $R$ replicates with $C$ chromosomes for $n$ samples, for a total of $R \times C$
simulations for each demographic model. After simulation, the pipeline prepares
input files for each of the respective inference methods by grouping all
chromosomes, for each sample, into a single file. For each of the $R$ simulation replicates, this step results in an
input file for each of the
respective inference methods and derived from the same simulated tree sequences.
Each of the inference programs are then run in parallel, followed by plotting of
$N(t)$ estimates from each program.
\begin{figure}
\begin{center}
\includegraphics[width=0.8\linewidth]{display_items/homo_sapiens_mask_Ragsdale.pdf}
\caption{\textbf{Comparing estimates of $N(t)$ in humans}. Here we show estimates of population
size over time ($N(t)$) inferred using 4 different methods: \smcpp, \texttt{stairwayplot},
\MSMC with $n=2$ and $n=8$. Data were generated by simulating
replicate human genomes under the \cite{ragsdale2019models} model and using the
HapMapII genetic map~\citep{international2007second}. From top to bottom we show estimates for each
of the three populations in the model (YRI, CEU, and CHB). In shades of blue we show the estimated
$N(t)$ trajectories for each replicate. In black we show the true population size history as inferred
for the rate of coalescence in the demographic model.}
\label{fig:n_t_ragsdale}
\end{center}
\end{figure}
% JK here --- I think it would really help if we described what the models
% were, briefly, rather than saying who inferred them.
In Figure \ref{fig:n_t_ragsdale} we present results from our $N(t)$ analysis pipeline
run on whole genome simulations under a sophisticated demographic model
and empirical genetic map. In each column of this figure
we show inferred $N(t)$ for the three extant populations in the model.
In each row we show comparisons among the methods (note that we compare two differing
sample sizes for \MSMC). The solid black lines represent the true effective population
size as inferred from the inverse coalescent rate defined by the model.
Blue lines represent estimates from each of three replicate whole genome simulations.
While there is variation in accuracy among methods, populations, and individual replicates,
generally the methods are accurate for this model of human history.
\stdpopsim allows us to readily compare these estimates to those drawn from a different
model of human history. In Figure \ref{fig:n_t_gutenkunst} we show estimates of
$N(t)$ from simulations using the same physical and genetic maps, but a different demographic
history---that inferred by \cite{gutenkunst2009inferring}. Again we see that each
of the methods is capturing relevant parts of the population history, although the
discrepancy between inferred and simulated values varies with $t$. In comparing inferences between the
models it is interesting to note that $N(t)$ estimates for the CHB and CEU
simulated populations are generally better across methods than estimates from the YRI
simulated population.
We can even do away with the demographic model entirely, and simulate under a human
genome architecture and genetic map to see how well methods might do at recovering
the population history of a constant sized population. We show results of such an
experiment in Figure \ref{fig:n_t_HomSap_constant}. Again we see that all methods
are recovering a modern population size within a factor of two of the truth, however
SMC-based methods, perhaps due to their regularization, tend to learn sinusoidal
patterns of population size even though no change is present.
\begin{figure}
\begin{center}
\includegraphics[width=0.8\linewidth]{display_items/d_mel_Sheehan_mask2.pdf}
\caption{\textbf{Comparing estimates of $N(t)$ in \emph{Drosophila}}. Population
size over time ($N(t)$) estimated from an African population sample. Data were generated by simulating
replicate \emph{D.melanogaster} genomes under the three-epoch \cite{sheehan2016deep} model
with the genetic map inferred in \cite{comeron2012many}. In shades of blue we show the estimated
$N(t)$ trajectories for each replicate. In black we show the true population size history as inferred
for the rate of coalescence in the demographic model.}
\label{fig:n_t_sheehan}
\end{center}
\end{figure}
As most method development for population genetics has been focused on human-scale
data, it is of consequence to ask how such methods might perform in non-human
genomes. Figure \ref{fig:n_t_sheehan} shows parameter estimates from a demographic
model estimated from an African sample of \emph{Drosophila melanogaster} \citep{sheehan2016deep}.
As with humans, we use \stdpopsim to simulate replicate \emph{Drosophila} genomes under
an empirically derived genetic map, and then try to infer back the parameters of the model.
Accuracy is mixed among methods in this setting, and generally worse than what we
observe under simulations of the human genome.
\begin{figure}
\begin{center}
\includegraphics[width=0.7\linewidth]{display_items/homo_sapiens_two_popn_comp.png}
\caption{\textbf{Parameters estimated using a multi-population human model}.
Here we show estimates of $N(t)$ inferred using \dadi, \texttt{fastsimcoal2}, or \smcpp.
A) Data were generated by simulating
replicate human genomes under the \cite{gutenkunst2009inferring} model and using the genetic map
inferred in \cite{international2007second}. B) For \dadi and \texttt{fastsimcoal2} we show parameters inferred
by fitting the depicted IM model, which includes population sizes, migration rates, and a split
time between CEU and YRI samples. C) Population size estimates for each population (rows)
from \dadi, \texttt{fastsimcoal2}, and \smcpp (columns). In shades of blue we show the estimated
$N(t)$ trajectories for each replicate. In black we show the true population size history as inferred
for the rate of coalescence in the demographic model. The population split date, $T_{div}$, is shown at
the bottom, with a common X-axis to the background figure.}
\label{fig:IM_popn_human}
\end{center}
\end{figure}
\paragraph*{Multi-population demographic models}
As \texttt{stdpopsim} implements multi-population demographic models, we also
explored parameter estimation of population divergence parameters. In particular
we performed simulations under multi-population models for humans and \emph{Drosophila}
and inferred simulated parameters using \dadi, \fastsimcoal, and \smcpp.
For simplicity, we conducted inference in \dadi and \fastsimcoal by fitting an isolation with migration (IM) model
with constant population sizes and bi-directional migration. For human
models with more than two-populations, e.g. \cite{gutenkunst2009inferring},
this means that we are inferring parameters for a model which itself does
not match the model from which the data were generated (Figure
\ref{fig:IM_popn_human}A and B), however this is common practice in the field
when applying IM-type models.
In Figure \ref{fig:IM_popn_human}C we show estimates of population sizes and divergence
time, for each of the inference methods, using samples drawn from African and European populations
simulated under the \cite{gutenkunst2009inferring} model. Each of the methods compared has stronger
and weaker points to their accuracy across parameters. For instance the simple, IM-style
analyses we have performed do not capture recent population size change in CEU, however
they do a good job of estimating the simpler simulated YRI population size history from this model.
Again we can compare between species and look at the performance of these methods in
the context of a two-population model from \emph{Drosophila melanogaster}. Figure
\ref{fig:two_popn_fly} shows parameter estimates for simulations drawn from
the model of \cite{li2006inferring}, which includes
an ancestral population in Africa, then a population expansion with a population
split and bottleneck into a European population. These simulations are generated
using the recombination map of \cite{comeron2012many}. Here methods differ dramatically
in the estimates, with \smcpp yielding quite noisy estimates of population size
towards the more ancient past and all methods failing to capture the very brief
population bottleneck in Europe, although the IM model of course only allows for
two epochs in the population size history.
Again the goal of these analyses are not to perform thorough benchmarking of
the methods at hand under a wide range of simulations, but instead to show how
the tools that we have built could be used to facilitate such comparisons.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Discussion}
Here we have described what we view as the first product from the PopSim Consortium:
the \stdpopsim library. We have founded the Consortium with a number of specific aims
in mind: standardization of simulation within the computational genetics community,
increasing reproducibility of simulation based results, community-based development and
decision making in guiding best practices in population genetics, and eventual benchmarking
of inferential methods.
The \stdpopsim library is the first piece of this effort, in that it allows for rigorous
standardization of complex population genetic simulations. Population genetics, as a field,
has yet to coalesce around a set of gold-standards for the crucial task of method
evaluation, which in our field hinges on simulation. In contrast, other fields such as
structural biology \citep{moult1995large} and machine learning \citep{russakovsky2015imagenet} have a long track record
of standardized method testing. We hope that our efforts represent the beginning of what
will prove to be an equally longstanding and invaluable tradition in population genetics.
Moreover individual research groups are often in the position of having to independently
reimplement complex, previously published models. \stdpopsim offers solutions to both of
these issues by creating a quality controlled, easy-to-use, gateway to models and organisms
that are central to modern research in genetics.
We have built \stdpopsim, and the PopSim Consortium, to be an open, community-developed
project, meaning that it has excellent potential for future growth.
Thus far we have implemented genome representations and genetic maps for the three most
common study systems in computational genetics: humans, \emph{Drosophila}, and \emph{Arabidopsis}.
The standard operating procedures that we have outlined in our detailed documentation
means that new researchers, who are interested in adding their favorite organism/genome
are free to do so with very few barriers to entry (see Methods).
Further, new demographic models can be added
to each of the organisms in the catalog through similarly well documented steps.
A long-term goal then is that through a mechanism such as \stdpopsim, the community
can determine (in an open way) what is the correct implementation for a published demographic model,
and in turn, what are the models in which the community is most interested.
We have shown in the results above how \stdpopsim can be used for direct comparisons
of inferential methods on a common set of simulations. While we have not made a concerted
effort at benchmarking methods yet, we believe that this would be an important goal for
the PopSim Consortium and population genetics more broadly. The analysis workflows that
we have released as part of \stdpopsim should provide a cookbook for the community to
dive in to more rigorous benchmarking than we have done thus far.
In the use case comparisons we have performed,
we found that the methods tested \emph{do} generally perform well on human genomes,
up to, for instance, the usual factor-of-two differences in inferred population sizes.
This will not be surprising to people familiar with the landscape of
demographic inference methods in population genetics,
but the nature and scale of uncertainty may be not clear to people new to the field.
The ability for \stdpopsim to generate data with and without significant features, such
as the genetic map or population size change (e.g., Figure \ref{fig:n_t_HomSap_constant}), allows
us to ask directly about the various failure modes of popular methods.
Moreover the comparison of methods across the various genome organizations, genetic maps,
and demographic histories of different organisms, provides valuable information
about how methods might perform on non-human systems.
In our initial comparison we find that inference of population histories from simulated \emph{Drosophila}
or \emph{Arabidopsis} genomes is a fair bit more difficult from SMC based methods
than in the human genome. Likely this is at least partly a consequence of the reduced physical and genetic
map lengths of these genomes, although we will leave further investigation of this to the future.
Finally, comparison of results across methods or simulation runs
gives a strong estimate of inference uncertainty, analogous to parametric
bootstrapping---although different methods have different strengths and weaknesses,
strong conclusions should probably not rest entirely on features inferred by only one method.
The broad brush comparisons we have done, while far from exhaustive, are informative,
and highlight the utility of easy comparison of methods on the same footing.
The tests here were under best-case scenarios---simulations of discrete,
randomly mating populations that fit the assumptions of the methods nearly
exactly---and yet different methods, as well as applications of the same method
to different simulations of the same model,
produced somewhat different results. Future efforts will extend the resource described here to
encompass additional population genetic tasks, such as detecting natural selection, inferring
recombination rates, etc, in hopes of producing a general-purpose resource for fair and
standardized evaluation of the ever-increasing arsenal of computational population genetic
inference tools.
%\adk{should we mention benchmarking of tree sequence inference here? what else?}
% \paragraph{The problem of comparison} % Moved from the intro
% Benchmarking the performance of inference methods in population genetics is not straightforward.
% Different methods often fit different types of model
% or use different summaries of the genetic variation data, even when
% attempting to address the same biological question. For example, there are
% numerous methods to infer population size changes.
% The \MSMC method \citep{schiffels2014inferring} allows for the
% population to continuously change in size over time. Other methods, such as
% \dadi \citep{gutenkunst2009inferring}, allow for a smaller number of size changes to occur at certain
% points throughout the population's history. The types of data that the
% methods use also differs. While \MSMC uses one to four whole genome sequences and leverages the
% spatial patterns of genetic variation along the sequence, \dadi instead uses the
% site frequency spectrum (SFS), which summarizes the frequencies of SNPs in the
% sample, ignoring any information about the correlation structure (linkage
% disequilibrium) among SNPs. Thus from the start these methods may not be
% expected to perform equally well in all scenarios,
% and indeed, it is not always clear how to compare to the truth --
% for instance, how well can a method of piecewise constant population sizes
% describe a period of exponential growth?
% Or, how should discrete population models be mapped onto real-world (continuous) geography?
% The answer depends on determining what is most important for the question at hand.
\section*{Methods}
\subsection*{Generating the PopSim resource}
\subsection*{Model QC procedures}
As a consortium we have agreed to a standardized proceedure for model inclusion
into \stdpopsim that allows for rigorous quality control. Imagine Developer A
wants to introduce a new model into \stdpopsim. Developer A implements the
demographic model for the relevant organism along with clear documentation
of the model parameters and populations. This model is submitted as a Pull Request,
where it is evaluated by a reviewer and then included as `preliminary',
but is not linked to the online documentation nor the command line interface.
Inclusion of a preliminary model triggers a QC issue, and a second developer,
Developer B, then independently reimplements the model from the relevant
primary sources and adds an automatic unit test for equality between the
QC implementation and the preliminary production model. If the two
implementations are equal the original model is included in \stdpopsim.
If not, we move to an arbitration process whereby the A and B first try
to work out the details of what went wrong. If that fails, the original
authors of the published model must be contacted
to resolve ambiguities. This process can lead to cycles of revision of
the implementation from Developers A and B but allows the project to assure
quality. Indeed we have already had a number of interesting cases of arbitration
to this point. Further details of our QC process can be found in our
\href{https://stdpopsim.readthedocs.io/en/latest/development.html#}{developer documentation}.
\subsection*{Workflow for analysis of simulated data}
To demonstrate the utility of \stdpopsim we created \texttt{Snakemake}
workflows \citep{koster2012snakemake} that perform demographic inference on
tree sequence output from our package using a few common software packages.
Our choice of \texttt{Snakemake} allows complete reproducibility of the
analyses shown and the entire Analysis repository is available from
\url{https://github.com/popgensims/analysis}.
We performed demographic analysis for one-and two-population tasks.
Our one population task was to infer historical population size over
time ($N(t)$). This was done using three software packages: \stairwayplot
which uses site frequency spectrum information only \citep{liu2015exploring};
the SMC based \MSMC~\citep{schiffels2014inferring} which we ran with two different
sample sizes $n\in (2 , 8)$; and \smcpp~\citep{terhorst2017robust},
which combines information from the site frequency spectrum, with
recombination information as in SMC-based methods. No attempt
was made at trying to optimize the analysis from
any particular software package,
as our goal was not to benchmark performance of methods but
instead show how such benchmarking could be easily done using
the \stdpopsim resource. In this spirit we ran each software package as near
to default parameters as possibile. For \stairwayplot we
set the parameters numRuns=1 and dimFactor=5000. For \smcpp we used the
``estimate" run mode to infer $N(t)$ with all other parameters set
to their default values. For \MSMC we used the ``--fixedRecombination"
option and used the default number of iterations.
For the single-population task we ran human (HomSap) simulations
using a variety of models: OutOfAfricaArchaicAdmixture\_5R19, OutOfAfrica\_3G09,
a constant-sized generic model, and a two-epoch generic model where the
population size instantaneously decreases from $N=10^4$ to $N=10^3$, 500
generations before the present. Each HomSap simulation was run
using the HapmapII\_GRCh37 genetic maps. For \emph{D. melanogaster}
we estimated $N(t)$ from an African sample simulated under the DroMel,
African3Epoch\_1S16 model using the Comeron2012\_dm6 map. Finally we ran
simulations of \emph{A. thaliana} genomes using the AraTha African2Epoch\_1H18
model under the Salome2012\_TAIR7 map. For each model, three replicate whole genomes were
simulated and the population size estimated from those data. In all cases we
set the sample size of the focal population to $N=50$ chromosomes.
Following simulation, low-recombination portions of chromosomes were masked
from the analysis in a manner that reflects the ``accessible" subset of sites
used in empirical population genomic studies \citep[e.g.,][]{danecek20111000,langley2012genomic}.
Specifically we masked regions of 1\,cM or greater in the lowest 5th percentile of the empirical
distribution of recombination, regions which are nearly uniformly absent for
empirical analysis.
Our goal for the two-population analysis pipeline was to explore two-population inference
on some of the multi-population demographic models implemented in \stdpopsim.
For HomSap we examined the OutOfAfrica\_3G09 model with the HapmapII\_GRCh37 genetic map,
and for DroMel we simulated under the OutOfAfrica\_2L06 model with the Comeron2012\_dm6 map.
The HomSap model is a three population model (Africa, Europe, and Asia) including post-divergence
migration and exponential growth (Figure \ref{fig:IM_popn_human}C), whereas the
DroMel model is a two population model (Africa and Europe) with no post-divergence
migration and constant population sizes (Figure \ref{fig:two_popn_fly}).
To conduct inference on these models, we applied three commonly used methods:
\dadi~\citep{gutenkunst2009inferring}, \fastsimcoal2~\citep{excoffier2013robust},
and \smcpp~\citep{terhorst2017robust}. As above, these methods were used
generally with default settings and we did not attempt to optimize their performance or fit
parameter-rich demographic models. As with the single population inference pipeline,
we developed a \texttt{snakemake}~\citep{koster2012snakemake} workflow to manage
our inference pipeline and maintain reproducibility.
For both \dadi and \fastsimcoal2, we fit a two population
isolation-with-migration (IM) model, with constant population sizes, to the simulated
data from the HomSap and DroMel models. This IM model contains six parameters:
the ancestral population size, the size of population 1 after the split, the size of
population 2 after the split, the divergence time, and two migration rate parameters.
Importantly, this meant that for both species, the
fitted model did not match the simulated model (Figures \ref{fig:IM_popn_human} and \ref{fig:two_popn_fly}).
In the HomSap case, we therefore performed inference solely on the Africa
and Europe populations, meaning that the Asia population functioned as a ghost
population that was ignored by our inference. Our motivation for fitting this simple
IM model was to mimic the typical approach of two population inference on empirical
data, where the user is not aware of the `true' underlying demography and the inference
model is often misspecified. To ground-truth our inference approach, we also conducted
inference on a generic IM model that was identical to the model used for inference.
From HomSap we ran replicate whole genome simulations under the model
and specified taking 20 samples each from the Europe and Africa populations.
For DroMel, runtimes under OutOfAfrica\_2L06 were prohibitively slow when simulating whole genomes.
For example, both chromosomes 3R and X failed to coalesce within 30 days of starting using $n = 50$.
Runtimes using the identical sample size were 19, 21, and 12 days for chromosomes 2L, 2R, and 3L,
respectively. For this reason, we chose to present only data from chromosome 2R for simulations
under OutOfAfrica\_2L06. For the generic IM simulations, we used the HomSap genome along with the
HapmapII\_GRCh37 genetic map, and sampled 20 individuals from each population.
Following simulation, we outputted tree sequences and masked low-recombination
regions using the same approach described for the single population pipeline above. We
converted tree sequences into a two-dimensional site frequency spectrum for all
chromosomes in the appropriate format for \dadi and \fastsimcoal2. For each simulation
replicate, we performed 10 runs of \dadi and \fastsimcoal2 and checked for convergence.
Detailed settings for \dadi and \fastsimcoal2 can be found in the Snakefile
on the popgensims/analysis GitHub repository (\url{https://github.com/popgensims/analysis}).
The parameter estimates from the top run of each simulation replicate
are shown in Figures \ref{fig:IM_popn_human}C and \ref{fig:two_popn_fly}C.
For \smcpp , we converted the tree sequences into vcf format and performed inference
with default settings. Importantly, \smcpp assumes no migration post-divergence, so
here again the model fitted during inference does not exactly match the simulated
human demography that does include migration. However, because \smcpp allows
for continuous population size changes, it is better equipped to capture many of the
more complex aspects of the simulated demographic models (e.g., exponential growth).
To visualize our results, we plotted the inferred population size trajectories
for each simulation replicate along side the true (census) population sizes
(Figures \ref{fig:IM_popn_human}C and \ref{fig:two_popn_fly}C). Here, unlike the single-population pipeline,
we focus on census size rather than the inverse coalescence rate as the `true' population size.
% \subsection*{Extra text}
% [This section describes the stdpopsim Python library. What it’s for, how it’s used and what it can do]
% [First para: running accurate simulations is hard]
% Running realistic population genetic simulations is a complex task today. It is
% straightforward to run simple simulations of population models such as the
% coalescent or Wright-Fisher using (e.g.) msprime or SLiM, but the process of
% instantiating these models to accurately reflect the properties of natural
% populations that are important for a particular study is fraught with
% difficulties. There are typically three important inputs that must be decided:
% the demographic model, recombination and mutation rates. Demographic models for
% a particular species are typically chosen by examining the literature for
% published model specifications, which can consist of dozens of parameters. These
% parameters must be transcribed from the original publication and translated into
% the input format required for the simulator in question. This is a difficult and
% error prone task, and mistakes are common. Recombination and mutation rates are
% simpler to describe and often a single genome-wide estimate is appropriate.
% However, such estimates are updated frequently and there is wide variation in
% the values used for the simulations of the same species. When more fine-grained
% simulations of the local chromosome structure is required, empirical
% recombination and mutation rate maps are used. However, there is no standard
% format for such maps, and finding the maps can be challenging.
\bibliographystyle{plainnat}
\bibliography{popSim.bib}
\pagebreak
\beginsupplement
\section*{Supplemental Figures}
\begin{figure}
\begin{center}
\includegraphics[width=0.8\linewidth]{display_items/homo_sapiens_mask_Gutenkunst.pdf}
\caption{\textbf{Comparing estimates of $N(t)$ in humans}. Estimates of population
size over time ($N(t)$) inferred using 4 different methods, \smcpp, \stairwayplot,
\MSMC with $n=2$ and $n=8$. Data were generated by simulating
replicate human genomes under the \cite{gutenkunst2009inferring} model and using the genetic map
inferred in \cite{international2007second}. From top to bottom we show estimates for each
of the three populations in the model: YRI, CEU, and CHB. In shades of blue we show the estimated
$N(t)$ trajectories for each replicate. In black we show the true population size history as inferred
for the rate of coalescence in the demographic model.}
\label{fig:n_t_gutenkunst}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
\includegraphics[width=0.8\linewidth]{display_items/homo_sapiens_constant.pdf}
\caption{\textbf{Comparing estimates of $N(t)$ in humans}. Here we show estimates of population
size over time ($N(t)$) inferred using 4 different methods, \smcpp, \texttt{stairwayplot},
\MSMC with $n=2$ and $n=8$. Data were generated by simulating
replicate human genomes under a constant sized population model with $N=10^4$ and using the
HapMapII genetic map~\citep{international2007second}. In black we show the true population size history
of the model.}
\label{fig:n_t_HomSap_constant}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
\includegraphics[width=0.8\linewidth]{display_items/d_mel_two_popn_comp.png}
\caption{\textbf{Parameters estimated using a two-population \emph{Drosophila} model}.
Here we show estimates of $N(t)$ inferred using \dadi, \texttt{fastsimcoal2}, or \smcpp.
Data were generated by simulating
replicate \emph{Drosophila} genomes under the \cite{li2006inferring} model and using the genetic map
inferred in \cite{comeron2012many}. See legend of Figure \ref{fig:IM_popn_human} for details.
In shades of blue we show the estimated
$N(t)$ trajectories for each replicate. In black we show the true population size history as inferred
for the rate of coalescence in the demographic model.}
\label{fig:two_popn_fly}
\end{center}
\end{figure}
\stopsupplement
\end{document}