forked from jorgemasgomez/almondsim
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathLineBreeding.Rmd
405 lines (310 loc) · 14.2 KB
/
LineBreeding.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
---
title: "Line breeding program"
output: html_document
date: "2023-12-07"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
In this vignette we present AlphaSimR code for a line breeding program using DH technology, as would be used in wheat breeding. Sections of this vignette mirror subsections 3.2 -- 3.5 in the Results section of the main text, and we suggest the reader refer to it while reading these sections of the main text. For simplicity, in this vignette we only display a single year of the burn-in for a single replicate. For code of the full multi-year and multi-replicate line breeding program, refer to the AlphaSimR code found in the supplementary material and on GitHub `01_LineBreeding/01_PhenotypicSelection/04_DoubledHaploid` (start with `00RUNME.R`. Also, the supplementary material and GitHub contain simulation code for numerous other breeding programs.
## Preliminaries
First we clear our workspace, install required packages if they are not yet installed, and load required packages.
```{r}
rm(list = ls())
# install.packages(pkgs = "AlphaSimR")
library(package = "AlphaSimR")
```
## Specifying global parameters
We define global parameters that are used throughout a simulation. Such parameter values can be chosen based on estimates from prior analyses of real data or long-term trends, trial and error, or practical experience. For definitions of the parameters see Table 3 in the main text. Note that in the script files, this step is included with `source(file = "GlobalParameters.R")`, but here, for illustrative purposes, we include all code into a single continuous script. We also define a scenario name so we can track multiple possible scenarios.
```{r}
# ---- Number of simulation replications and breeding cycles ----
nReps = 1
nBurnin = 1
nFuture = 0
nCycles = nBurnin + nFuture
# ---- Genome simulation ----
nQtl = 1000
nSnp = 0
# ---- Initial parents mean and variance ----
initMeanG = 1
initVarG = 1
initVarEnv = 1e-6
initVarGE = 2
varE = 4
# ---- Breeding program details ----
nParents = 50
nCrosses = 100
nDH = 100
famMax = 10
nPYT = 500
nAYT = 50
nEYT = 10
# Effective replication of yield trials
repHDRW = 4/9
repPYT = 1
repAYT = 4
repEYT = 8
scenarioName = "LinePheno_DH"
```
We create a list to store results from replications.
```{r}
results = list()
```
First we always run just a single replicate of our simulation. Once we have completed all the steps of the simulation and the simulation runs as expected, we can run multiple replicates. Note that in the script files we run through a `for` loop of the multiple replicates, but for illustrative purposes here, we only have one replicate.
```{r}
REP = 1
```
We create a data frame to track key parameters.
```{r}
output = data.frame(year = 1:nCycles,
rep = rep(REP, nCycles),
scenario = rep(scenarioName, nCycles),
meanG = numeric(nCycles),
varG = numeric(nCycles),
accSel = numeric(nCycles))
```
## Simulating genomes and founders
We simulate founder genomes and create initial parents. AlphaSimR embeds MaCS software to generate founder genomes through a backward-in-time (coalescent) simulation. Here we use a pre-defined demographic history for wheat. Note that in the script files this step is included with `source(file = "CreateParents.R")`.
First we create a founder population, by generating initial haplotypes using `runMacs()`; and make the global simulation parameter (`SP`) object.
```{r}
founderPop = runMacs(nInd = nParents,
segSites = nQtl + nSnp,
inbred = TRUE,
species = "WHEAT")
SP = SimParam$new(founderPop)
```
We can include a SNP chip if desired (relevant for genomic selection strategies explained elsewhere).
```{r}
SP$restrSegSites(nQtl, nSnp)
if (nSnp > 0) {
SP$addSnpChip(nSnp)
}
```
We add a trait, representing yield, and specify the mean, and variance of QTL effects, as well as the environmental variance and variance due to genotype-by-environment interactions.
```{r}
SP$addTraitAG(nQtlPerChr = nQtl,
mean = initMeanG,
var = initVarG,
varEnv = initVarEnv,
varGxE = initVarGE)
```
We opt to track pedigree information.
```{r}
SP$setTrackPed(TRUE)
```
We create founder parents and set a phenotype, reflecting evaluation in EYT stage (see the main text).
```{r}
Parents = newPop(founderPop)
Parents = setPheno(Parents, varE = varE, reps = repEYT)
```
Finally we clear the founderPop object to free some computer memory.
```{r}
rm(founderPop)
```
## Filling the breeding pipeline
We fill the breeding pipeline with distinct populations for each stage. The filling process is necessary to capture the overlap of different breeding cycles in real breeding programs, meaning that at any given time, a breeding program will have populations in all stages of the breeding program.
We achieve this by running the founder parents through the crosses and selection steps of our breeding program, saving a population at a different breeding stage each time.
Our wheat breeding program consists of six breeding stages and hence, we need to generate six distinct breeding populations through this filling process. The first round runs through all stages, saving the population in the final (sixth) stage. The second round runs through the stages up to the fifth stage, saving the population at that stage. This process is repeated until all stages have been populated.
Despite using the same founder parents, distinct populations are obtained because each round involves unique crossings, random process of inheritance, and selections. We will update parents after the filling process.
Note that in the script files this step is included with `source(file = "FillPipeline.R")`.
```{r}
# Set initial yield trials with unique individuals
for (cohort in 1:7) {
cat(" Fill pipeline stage:", cohort, "of 7\n")
if (cohort < 7) {
# Stage 1
F1 = randCross(Parents, nCrosses)
}
if (cohort < 6) {
# Stage 2
DH = makeDH(F1, nDH)
}
if (cohort < 5) {
# Stage 3
HDRW = setPheno(DH, varE = varE, reps = repHDRW)
}
if (cohort < 4) {
# Stage 4
PYT = selectWithinFam(HDRW, famMax)
PYT = selectInd(PYT, nPYT)
PYT = setPheno(PYT, varE = varE, reps = repPYT)
}
if (cohort < 3) {
# Stage 5
AYT = selectInd(PYT, nAYT)
AYT = setPheno(AYT, varE = varE, reps = repAYT)
}
if (cohort < 2) {
# Stage 6
EYT = selectInd(AYT, nEYT)
EYT = setPheno(EYT, varE = varE, reps = repEYT)
}
if (cohort < 1) {
# Stage 7
}
}
```
We will now explain each of the operations performed in filling the pipeline, which are seen again in the Running the Burn-in phase section.
In "Stage 1", `F1 = randCross(Parents, nCrosses)`, performs `nCrosses` random crosses of the parents to produce `nCrosses` F1s.
In "Stage 2" `DH = makeDH(F1, nDH)` uses doubled-haploid technology to make `nDH` double haploids per each F1. By definition, these DH individuals are homozygous at all loci.
In "Stage 3" `HDRW = setPheno(DH, varE = varE, reps = repHDRW)` sets/simulates a phenotype for the DH lines with a certain error variance and replication, which captures a certain heritability. The phenotype will be used as the basis for breeder's selection.
In "Stage 4" `PYT = selectWithinFam(HDRW, famMax)` performs selection within families such that no more than `famMax` DH lines per family can be carried forward;
`PYT = selectInd(PYT, nPYT)` performs selection among all individuals in the Preliminary Yield Trial choosing the best `nPYT`; and
`PYT = setPheno(PYT, varE = varE, reps = repPYT)` sets/simulates a phenotype of the remaining individuals.
In "Stage 5" `AYT = selectInd(PYT, nAYT)` performs selection in the Preliminary Yield Trial; and
`AYT = setPheno(AYT, varE = varE, reps = repAYT)` sets/simulates a phenotype for the remaining individuals.
In "Stage 6" `EYT = selectInd(AYT, nEYT)` performs selection in the Advanced Yield Trial; and
`EYT = setPheno(EYT, varE = varE, reps = repEYT)` sets a phenotype for the remaining individuals in the first year of Elite Yield Trials.
## Running the burn-in phase
Now we will go through a single year of the breeding program's burn-in phase. In a real simulation, we would run multiple years of burn-in followed by multiple years of future breeding.
```{r}
year = 1
```
### Update Parents
First we update our parents by replacing the 10 oldest parents with 10 new parents from EYT stage. Note that in the script files this step is included with `source(file = "UpdateParents.R")`.
```{r}
Parents = c(Parents[11:nParents], EYT)
```
### Advance Year
Then we advance a year, going through all the crossing and selection steps. In one year, each stage of the breeding program is updated, reflecting the parallel breeding pipelines that are progressing. Note that in the script files this step is included with `source(file = "AdvanceYear.R")`.
```{r}
# Stage 7
# Release variety
# Stage 6
EYT = selectInd(AYT, nEYT)
EYT = setPheno(EYT, varE = varE, reps = repEYT)
# Stage 5
AYT = selectInd(PYT, nAYT)
AYT = setPheno(AYT, varE = varE, reps = repAYT)
# Stage 4
PYT = selectWithinFam(HDRW, famMax)
PYT = selectInd(PYT, nPYT)
PYT = setPheno(PYT, varE = varE, reps = repPYT)
# Stage 3
HDRW = setPheno(DH, varE = varE, reps = repHDRW)
# Stage 2
DH = makeDH(F1, nDH)
# Stage 1
F1 = randCross(Parents, nCrosses)
```
### Backward vs forward pipeline pass
An important note! We work backward through the pipeline to avoid copying a population before performing operations on it. This is an important point because failing to do so is a common source of errors in simulating breeding programs with AlphaSimR. To explain further, consider what would happen if we went forward through the pipeline instead.
```{r, eval = FALSE}
# Stage 1
F1 = randCross(Parents, nCrosses)
# Stage 2
DH = makeDH(F1, nDH)
```
While the above might look reasonable and is syntactically valid R and AlphaSimR code, note that by making a new F1, we have erased the F1 from the previous year, which we require to make the DH for this year. The above code effectively fast-tracks the crossing of parents to create F1 and creation of DH from the F1 in one time step. If we continue in this way, we would fast-track cohorts through all the stage of the breeding pipeline in one time step, which is biologically and logistically impossible.
A correct way to work forward through the pipeline would be to duplicate populations at each stage, as in the following, but this is inefficient both in terms of computer memory and code presentation.
```{r, eval = FALSE}
# Stage 1
F1_saved_for_DH = F1
F1 = randCross(Parents, nCrosses)
# Stage 2
DH_saved_for_HDRW = DH
DH = makeDH(F1_saved_for_DH, nDH)
```
We avoid this needles duplication of populations (with a forward pass through the pipeline) by running backward through the pipeline, as shown above in previous code chunks.
### Saving simulation summaries
Finally, for each year we save results in the `output` data frame. Note that results could here invilve complete simulated populations, associated data, or just data summaries as shown here. Here, we save mean of genetic values and variance of genetic values for individuals in the DH stage, while for accuracy of selection we calculate it at the HDRW stage where first phenotypes are available. Note that heritability of these phenotypes is very low, representing breeder's visual selection.
```{r}
output$meanG[year] = meanG(DH)
output$varG[year] = varG(DH)
output$accSel[year] = cor(HDRW@pheno, HDRW@gv)
```
## Wrap-up
The above completed simulation of a single year of the burn-in of a wheat breeding program for a single replicate. Explore additional material available in R scripts with multi-year, multi-rep, and multi-scenario examples.
## Full Code
The code below runs through all steps and 20 years of breeding.
```{r}
# Create founder haplotypes
founderPop = runMacs(nInd = nParents,
segSites = nQtl + nSnp,
inbred = TRUE,
species = "WHEAT")
# Set simulation parameters
SP = SimParam$new(founderPop)
# Add trait
SP$addTraitAG(nQtlPerChr = nQtl,
mean = initMeanG,
var = initVarG,
varEnv = initVarEnv,
varGxE = initVarGE)
# Create founder parents
Parents = newPop(founderPop)
# Fill breeding program pipeline
for (cohort in 1:7) {
cat(" Fill pipeline stage:", cohort, "of 7\n")
if (cohort < 7) {
# Stage 1
F1 = randCross(Parents, nCrosses)
}
if (cohort < 6) {
# Stage 2
DH = makeDH(F1, nDH)
}
if (cohort < 5) {
# Stage 3
HDRW = setPheno(DH, varE = varE, reps = repHDRW)
}
if (cohort < 4) {
# Stage 4
PYT = selectWithinFam(HDRW, famMax)
PYT = selectInd(PYT, nPYT)
PYT = setPheno(PYT, varE = varE, reps = repPYT)
}
if (cohort < 3) {
# Stage 5
AYT = selectInd(PYT, nAYT)
AYT = setPheno(AYT, varE = varE, reps = repAYT)
}
if (cohort < 2) {
# Stage 6
EYT = selectInd(AYT, nEYT)
EYT = setPheno(EYT, varE = varE, reps = repEYT)
}
if (cohort < 1) {
# Stage 7
}
}
# Create results output data frame
output = data.frame(year = 1:20,
rep = rep(1, 20),
scenario = rep(scenarioName, 20),
meanG = numeric(20),
varG = numeric(20))
# Simulate 20 years of breeding
for(year in 1:20) {
# Stage 7
# Release variety
# Stage 6
EYT = selectInd(AYT, nEYT)
EYT = setPheno(EYT, varE = varE, reps = repEYT)
# Stage 5
AYT = selectInd(PYT, nAYT)
AYT = setPheno(AYT, varE = varE, reps = repAYT)
# Stage 4
PYT = selectWithinFam(HDRW, famMax)
PYT = selectInd(PYT, nPYT)
PYT = setPheno(PYT, varE = varE, reps = repPYT)
# Stage 3
HDRW = setPheno(DH, varE = varE, reps = repHDRW)
# Stage 2
DH = makeDH(F1, nDH)
# Stage 1
Parents = c(Parents[11:nParents], EYT)
F1 = randCross(Parents, nCrosses)
# Report results
output$meanG[year] = meanG(DH)
output$varG[year] = varG(DH)
}
# Plot results
plot(1:20,output$meanG,type="l",
main="Genetic gain",xlab="Year",ylab="Yield")
```
```{r}
plot(1:20,output$varG,type="l",
main="Genetic variance",xlab="Year",ylab="Variance")
```