-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
executable file
·491 lines (358 loc) · 21.8 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# epifish
<!-- badges: start -->
<!-- badges: end -->
This package provides tools to use Chris Miller's fishplot package (https://github.com/chrisamiller/fishplot) with epidemiological datasets, to generate fishplot epi-curves.
```{r, include=FALSE}
library(fishplot); library(dplyr); library(tidyr); library(lubridate); library(epifish); library(knitr); library(kableExtra)
sample_df <- read.csv("https://raw.githubusercontent.com/learithe/epifish/main/inst/extdata/samples.csv", stringsAsFactors=FALSE)
parent_df <- read.csv("https://raw.githubusercontent.com/learithe/epifish/main/inst/extdata/parents.csv", stringsAsFactors=FALSE)
colour_df <- read.csv("https://raw.githubusercontent.com/learithe/epifish/main/inst/extdata/colours.csv", stringsAsFactors=FALSE)
epifish_output <- epifish::build_epifish( sample_df, parent_df, colour_df )
```
```{r, echo=FALSE}
vlines <- c((4/7), 3, 8.5, 14)
vlabs <- c("first\ncases", "wave 1", "quarantine\nbreach", "wave 2")
fishplot::fishPlot(epifish_output$fish, shape="spline", vlines=vlines, vlab=vlabs)
```
**Why?**
A fishplot is variety of [themeriver / streamgraph](https://www.data-to-viz.com/graph/streamgraph.html), which is designed specifically for categorical data where where individual categories can mutate to form subcategories. Originally designed for [plotting evolution of tumor cell lineages,](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3195-z) we have found fishplots especially useful for illustrating the rise and fall of genomic clusters in disease outbreaks, which can have a similar evolutionary pattern.
However, a count matrix for a fishplot has a set of specific rules which an epidemiological dataset will not naturally fulfill:
- cluster counts per timepoint can never go completely to zero, if cases reappear later
- if a cluster has a parent/child relationship, at every timepoint the parent must always have >= the count of all its children.
- counts should be normalised to fit the fishplot y-axis
This package exists to make it easy to convert a list of samples in an epidemological dataset into a normalised and appropriately "padded" relative count matrix that fulfils these requirements.
<br><br>
## Contents
- [Installation](#installation)
- Usage
- [Quick start](#quick-start)
- [Basic demo](#basic-demo)
- Input/Output
- [Input format](#input-format)
- [Output](#output)
- Calculate timepoints
- [Calculate epi weeks](#calculate-epi-weeks-and-text-labels)
- [Calculate timepoint labels](#calculate-epi-weeks-and-text-labels)
- [Calculate epi months](#calculate-epi-months)
- Using informative timepoint labels
- [Using timepoint labels](#using-informative-timepoint-labels)
- [Plot every other week](#plot-every-other-week)
- [Add extra timepoint lines](#add-extra-timepoint-lines-not-present-in-the-data)
- [Use completely manual timepoint lines and labels](#use-completely-manual-timepoint-lines-and-labels)
- Controlling appearance
- [Control how far back in time clusters appear](#control-how-far-back-in-time-clusters-appear)
- [Add extra start or end timepoints](#add-extra-start-or-end-timepoints)
- [Control legend spacing](#control-legend-spacing)
- [Change cluster label appearance](#change-cluster-label-appearance)
- [Don’t show cluster labels](#dont-show-cluster-labels)
- [Change fishplot titles and background](#change-fishplot-titles-and-background)
- [Citation](#citation)
- [Acknowledgements](#acknowledgements)
## Installation
You can install epifish with:
``` r
#install devtools if you don't have it already
install.packages("devtools")
library(devtools)
#install fishplot if you haven't already
devtools::install_github("chrisamiller/fishplot")
#install epifish
devtools::install_github("learithe/epifish")
```
## Quick Start
To get started with a basic epi fishplot, given an input file in the right format (details below), this is all you need:
``` r
# load required libraries
library(fishplot); library(dplyr); library(tidyr); library(lubridate); library(epifish)
# read data file
sample_df <- read.csv("https://raw.githubusercontent.com/learithe/epifish/main/inst/extdata/samples.csv", stringsAsFactors=FALSE)
# run epifish
epifish_output <- epifish::build_epifish( sample_df )
# run fishplot on the epifish output
fishplot::fishPlot( epifish_output$fish, shape="spline" )
```
```{r include=FALSE}
epifish_output <- epifish::build_epifish( sample_df )
```
```{r echo=FALSE}
fishplot::fishPlot(epifish_output$fish, shape="spline")
```
**If you want to include evolutionary relationships with subclusters:**
``` r
# load required libraries
library(fishplot); library(dplyr); library(tidyr); library(lubridate); library(epifish)
# read data files
sample_df <- read.csv("https://raw.githubusercontent.com/learithe/epifish/main/inst/extdata/samples.csv", stringsAsFactors=FALSE)
parent_df <- read.csv("https://raw.githubusercontent.com/learithe/epifish/main/inst/extdata/parents.csv", stringsAsFactors=FALSE)
# run epifish
epifish_output <- epifish::build_epifish( sample_df, parent_df )
# run fishplot on the epifish output
fishplot::fishPlot( epifish_output$fish, shape="spline" )
```
```{r, include=FALSE}
epifish_output <- epifish::build_epifish( sample_df, parent_df)
```
```{r, echo=FALSE}
fishplot::fishPlot(epifish_output$fish, shape="spline")
```
## Basic demo
*This demo expands on the quick-start example. It runs on a made-up set of example data that can be accessed here, in the [`inst/extdata`](https://github.com/learithe/epifish/tree/main/inst/extdata) directory.*
<br><br>
Load epifish and required packages
``` r
library(fishplot); library(dplyr); library(tidyr); library(lubridate); library(epifish)
```
Read in the tables of sample data, cluster parent-child relationships, and custom colour scheme:
```r
sample_df <- read.csv("https://raw.githubusercontent.com/learithe/epifish/main/inst/extdata/samples.csv", stringsAsFactors=FALSE)
parent_df <- read.csv("https://raw.githubusercontent.com/learithe/epifish/main/inst/extdata/parents.csv", stringsAsFactors=FALSE)
colour_df <- read.csv("https://raw.githubusercontent.com/learithe/epifish/main/inst/extdata/colours.csv", stringsAsFactors=FALSE)
```
Use epifish to convert this into a fishplot object, with extra assorted summary information:
```{r}
epifish_output <- epifish::build_epifish( sample_df, parent_df=parent_df, colour_df=colour_df, add_missing_timepoints=TRUE)
```
Then use the fishplot package to generate a fishplot:
```{r}
fishplot::fishPlot(epifish_output$fish, shape="spline",
vlines=epifish_output$timepoints, vlab=epifish_output$timepoints)
fishplot::drawLegend(epifish_output$fish, nrow=1)
```
If you're happy with the default colours, or all your clusters are independent, you don't need those dataframes:
```{r}
epifish_output <- epifish::build_epifish( sample_df )
fishplot::fishPlot(epifish_output$fish, shape="spline",
vlines=epifish_output$timepoints, vlab=epifish_output$timepoints)
fishplot::drawLegend(epifish_output$fish, nrow=1)
```
You also can automatically collapse any clusters of a minimum size into a group with `min_cluster_size`:
*Note: this currently does not work well with parent/child relationships if any child clusters are small!*
``` {r}
epifish_output <- epifish::build_epifish(sample_df, colour_df=colour_df, min_cluster_size=10)
fishplot::fishPlot(epifish_output$fish, shape="spline", vlines=epifish_output$timepoints, vlab=epifish_output$timepoints)
fishplot::drawLegend(epifish_output$fish, nrow=1)
```
## Input format
Example input files/templates can be found in the `inst/extdata` folder in this repository. The basic requirement is a data frame containing one row per sample, with columns `cluster_id` and `timepoint` (any other columns are ignored). Optionally, the `timepoint` column can be calculated using epifish from a column of dates (see below).
Optional data frames may also be provided that describe parent-child relationships for clusters (eg cluster A.1 evolved from cluster A), or a custom colour scheme.
It is easiest and safest (especially when working with dates) to save and maintain these tables in `.csv` (comma-separated values) format, and to read them into R using `read.csv("filename", stringsAsFactors=FALSE)` as shown in the example above. However you can use whatever methods you want to create these dataframes, as long as they contain the required columns in character or numeric (NOT factor) format.
**the last few rows of sample data:**
(Note that the order doesn't matter)
```{r echo=F}
kable(tail(sample_df), row.names=FALSE)
```
**the parent-child data:**
```{r echo=F}
options(knitr.kable.NA = '')
kable(parent_df)
```
**a custom colour scheme:**
(Note that you can use [named ggplot colours](https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf) or hex codes (eg "red" or "#ff0000") )
```{r echo=F}
kable(colour_df)
```
## Output
The output of epifish is a list variable (named `epifish_output` here) containing: a fishplot object (`epifish_output$fish`), the data structures needed to generate it, and some extra data summary tables:
- `fish` fishplot object to pass to `fishplot::fishPlot()`
- `timepoint_counts` summary table of number of samples per cluster per timepoint
- `timepoint_sums` summary table of number of samples per timepoint
- `cluster_sums` summary table of total number of samples per cluster
- `timepoints` vector of timepoints used
- `timepoint_labels` vector of the names of timepoints assigned in the plot
- `parents` named list matching child clusters to their parent's position in the matrix (0 means cluster is independent)
- `raw_table` initial table of counts per cluster per timepoint, before padding and normalisation
- `fish_table` normalised and parent-padded table for the epi-fishplot
- `fish_matrix` final transformed matrix used to make the epifish object
The epifish fishplot object output `epifish_output$fish` is used with the fishplot package's `fishPlot()` function to generate an R plot image, as shown above. If using RStudio, it is most straightforward to save the R plot as PDF image from the RStudio plot window (Export -> "Save as PDF").
If you wish to save individual tables from the epifish output list for any reason, it can be done like so:
``` r
write.csv(epifish_output$fish_table, "epifishplot_table.csv", row.names=FALSE)
```
This is the extra summary data that epifish creates:
```{r include=FALSE}
epifish_output <- epifish::build_epifish( sample_df, parent_df=parent_df, colour_df=colour_df)
```
```{r}
# total cases per cluster per timepoint
print( as.data.frame(epifish_output$timepoint_counts), row.names=FALSE )
# total cases per timepoint
print( as.data.frame(epifish_output$timepoint_sums), row.names=FALSE )
# total cases per cluster
print( as.data.frame(epifish_output$cluster_sums), row.names=FALSE )
# parent relationship table
print( epifish_output$parents )
# list of timepoints to display
print( epifish_output$timepoints )
# list of labels for each timepoint
print( epifish_output$timepoint_labels )
# raw count table
print( epifish_output$raw_table )
# normalised and padded count table
print( epifish_output$fish_table )
# rotated final matrix used to generate the epifish fishplot object
print( epifish_output$fish_matrix )
```
## Calculate timepoints
Epifish also has a few functions to make it easy to convert dates to epidemic weeks or months (to use as timepoints), and generate label-friendly versions of timepoint dates.
*NOTE: when working with dates in both R and Excel, be sure to check that your values match what you expect! When using R for analysis it is best practice to save your data files in a text-based format like `.csv` (comma-separated-value) format rather than Excel format, because [Excel has many issues with how it handles dates](https://datacarpentry.org/spreadsheets-socialsci/03-dates-as-data/), and using a text-only format avoids having your dates messed up by Excel.*
#### Calculate epi weeks and text labels
Given a date column name (`date_of_collection` here), the start date of the epidemic, and the date format, you can use `get_epiweek()` to calculate the number of weeks since the start of the epidemic each sample belongs to, and `get_epiweek_span()` to give the epi week a clear text label. *Note: these functions have customisation options for different date/range formats; check their documentation for details.*
``` {r}
#calculate epiweek timepoints from the column "date_of_collection" & create text labels to match them
sample_df <- sample_df %>% rowwise() %>%
mutate("epiweek"= epifish::get_epiweek(cdate = date_of_collection,
start_date = "1/1/20",
date_format = "dmy"))
#create a timepoint label column that gives the last day of each epi week the sample was collected in:
sample_df <- sample_df %>% rowwise() %>%
mutate("epiweek_label"= epifish::get_epiweek_span(cdate = date_of_collection,
date_format = "dmy",
return_end = TRUE,
newline=TRUE))
```
``` r
#peek at what we created
tail(sample_df)
```
```{r echo=FALSE}
#peek at what we created
kable( tail(sample_df) )
```
<br>
#### Calculate epi months:
Epifish also has `get_epimonth()` and `get_month_text()` functions for calculating epi months from dates:
``` {r}
#create a "epimonth" timepoint:
sample_df <- sample_df %>% rowwise() %>%
mutate("epimonth"= epifish::get_epimonth(cdate = date_of_collection,
start_date = "1/1/20",
date_format = "dmy"))
#and an epimonth label
sample_df <- sample_df %>% rowwise() %>%
mutate("epimonth_label"= epifish::get_month_text(cdate = date_of_collection,
date_format = "dmy"))
```
``` r
#peek at what we created
tail(sample_df)
```
```{r echo=FALSE}
#peek at what we created
kable( tail(sample_df) )
```
## Using informative timepoint labels
If you call `build_epifish ()` with `timepoint_labels=TRUE`, epifish will look for a column called `timepoint_label` to use as the timepoint labels. You can set this up as a column in your input file by hand, or you can calculate it from dates as shown above.) *Note: you can only have one unique label per timepoint value.*
``` {r}
#to use the epiweeks and epiweek labels we calculated above, we need to set these as columns named "timepoint" and "timepoint_label" in the sample dataframe:
sample_df$timepoint <- sample_df$epiweek
sample_df$timepoint_label <- sample_df$epiweek_label
#then generate the epifish object, specifying that you want to use timepoint labels
epifish_output <- epifish::build_epifish( sample_df, parent_df, colour_df, timepoint_labels=TRUE)
#and draw the fishplot
fishplot::fishPlot(epifish_output$fish, shape="spline",
vlines=epifish_output$timepoints, vlab=epifish_output$timepoint_labels)
fishplot::drawLegend(epifish_output$fish, nrow=1)
```
The fishplot package provides further flexibility in where to display the vertical lines and what text to show, which can be used to create custom combinations rather than using the epifish defaults:
#### Plot every other week:
If things are getting crowded, you can label just every other week:
``` {r}
#subset timepoints and labels to ever other entry
vlines <- epifish_output$timepoints[c(TRUE, FALSE)]
vlabs <- epifish_output$timepoint_labels[c(TRUE, FALSE)]
fishplot::fishPlot(epifish_output$fish, shape="spline", vlines=vlines, vlab=vlabs)
```
#### Add extra timepoint lines not present in the data
Or add a "zero" timepoint with the first case, which starts on the fourth day of the first epi week (we'll also make the text a bit smaller so it doesn't overlap):
``` {r}
vlines <- c((4/7), epifish_output$timepoints)
vlabs <- c("1\nJan", epifish_output$timepoint_labels)
fishplot::fishPlot(epifish_output$fish, shape="spline",
vlines=vlines, vlab=vlabs, cex.vlab=0.5)
```
#### Use completely manual timepoint lines and labels
We can specify completely custom timepoints and labels that describe an epidemiological story, with red lines:
``` {r}
vlines <- c((4/7), 3, 8.5, 14)
vlabs <- c("first\ncases", "wave 1", "quarantine\nbreach", "wave 2")
fishplot::fishPlot(epifish_output$fish, shape="spline",
vlines=vlines, vlab=vlabs, col.vline="red")
```
## Controlling appearance
### Control how far back in time clusters appear
You can modify how far back in time the clusters seem to "begin" using the `pad_left` argument to `fishplot::fishPlot()`:
```{r, include=FALSE}
epifish_output <- build_epifish (sample_df, parent_df, colour_df)
```
``` {r}
fishplot::fishPlot(epifish_output$fish, shape="spline", vlines=epifish_output$timepoints, vlab=epifish_output$timepoint_labels, pad.left=0.05)
fishplot::drawLegend(epifish_output$fish, nrow=2, widthratio=0.3, xsp=0.2)
```
### Add extra start or end timepoints
You can add extra "padding" timepoints to the start and end of the data using the `start_time` and `end_time` arguments of `epifish::build_epifish`:
```{r}
epifish_output <- build_epifish (sample_df, parent_df, colour_df, start_time = 0, end_time = 17)
fishplot::fishPlot(epifish_output$fish, shape="spline", vlines=epifish_output$timepoints, vlab=epifish_output$timepoint_labels, pad.left=0.05)
fishplot::drawLegend(epifish_output$fish, nrow=2, widthratio=0.3, xsp=0.2)
```
### Control legend spacing
Using fishplot v0.5.1+, you can modify the spacing of the epi-fishplot legend, which is especially useful with long cluster names. Use `widthratio` to adjust the width between columns relative to the longest cluster name (smaller value = more space), and `xsp` to control space between the colour box and the text (larger = more space)
```{r, include=FALSE}
epifish_output <- build_epifish (sample_df, parent_df, colour_df)
```
``` {r}
fishplot::fishPlot(epifish_output$fish, shape="spline", vlines=epifish_output$timepoints, vlab=epifish_output$timepoint_labels)
fishplot::drawLegend(epifish_output$fish, nrow=2, widthratio=0.3, xsp=0.2)
```
### Change cluster label appearance
Using fishplot v0.5.1+, you can modify the size, colour, position, and angle of cluster labels when building fishplot objects. You can set these values using arguments to `epifish::build_epifish()`.
```{r}
epifish_output <- build_epifish (sample_df, parent_df, colour_df,
label_col = "purple",
label_angle = -45,
label_cex = 1.2,
label_pos=2,
label_offset=0.05
)
fishplot::fishPlot(epifish_output$fish, shape="spline", vlines=epifish_output$timepoints, vlab=epifish_output$timepoint_labels, pad.left=0.05)
fishplot::drawLegend(epifish_output$fish)
```
### Don't show cluster labels
If you don't want to show the cluster labels on the fishplot, set `label_clusters=FALSE` in `epifish::build_epifish()`.
```{r}
epifish_output <- build_epifish (sample_df, parent_df, colour_df, label_clusters=FALSE)
fishplot::fishPlot(epifish_output$fish, shape="spline", vlines=epifish_output$timepoints, vlab=epifish_output$timepoint_labels, pad.left=0.05)
fishplot::drawLegend(epifish_output$fish)
```
### Change fishplot titles and background
You can also adjust assorted aspects of the fishplot as arguments to `fishplot::fishPlot()`. For details, refer to the documentation for the fishplot package (try the command `?fishplot::fishPlot`).
```{r}
epifish_output <- build_epifish (sample_df, parent_df, colour_df)
fishplot::fishPlot(epifish_output$fish, shape="spline", vlines=epifish_output$timepoints, vlab=epifish_output$timepoint_labels, pad.left=0.2,
cex.vlab=0.85,
title="Epi-fishplot of outbreak waves",
title.btm="number of cases per epi week over time",
cex.title=1,
bg.type="solid",
bg.col="grey95")
fishplot::drawLegend(epifish_output$fish, nrow=1, xpos=-1)
```
## Citation:
**If you use epifish in your work, please cite:**
- **epifish**: Documenting elimination of co-circulating COVID-19 clusters using genomics in New South Wales, Australia. Arnott A, Draper J et al. BMC Research Notes. [10.1186/s13104-021-05827-x](https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-021-05827-x)
- **fishplot**: Visualizing tumor evolution with the fishplot package for R. Miller CA, McMichael J, Dang HX, Maher CA, Ding L, Ley TJ, Mardis ER, Wilson RK. BMC Genomics. [doi:10.1186/s12864-016-3195-z](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3195-z)
<br>
## Acknowledgements:
This work extends the Chris Miller's fishplot package (https://github.com/chrisamiller/fishplot). It was written by Dr. Jenny Draper, a member of the pathogen genomics team employed by [New South Wales Health Pathology](https://www.pathology.health.nsw.gov.au), at the Westmead Hospital Institute of Clinical Pathology & Medical Research (ICPMR) [Centre for Infectious Diseases and Microbiology - Public Health](https://www.wslhd.health.nsw.gov.au/Education-Portal/Research/Research-Categories/Centre-for-Infectious-Diseases-and-Microbiology-Public-Health/About-CIDMPH) in Australia. `epifish` was initially developed as part of the NSW government's response to COVID-19.