-
Notifications
You must be signed in to change notification settings - Fork 5
/
02-data-import.Rmd
508 lines (373 loc) · 20.9 KB
/
02-data-import.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
# Data Importing/Exporting and interaction with other programmes {#data-import}
This chapter is for code examples of data importing/exporting and interactions with other programmes and databases.
## Libraries
```{r message=FALSE}
library(tidyverse)
library(fs) #cross-platform file systems operations (based on libuv C library)
library(knitr) #general purpose tool for dynamic report generation in R
library(kableExtra) #presentation of complex tables with customizable styles
library(DT) #presentation of tables (wrapper of JavaScript library DataTables)
library(DBI) #database connection
library(dbplyr) #database connection
library(haven) #for importing/exporting SPSS, Stata, and SAS files
library(bigrquery) #connecting to GCP BigQuery
library(openxlsx) #formatting xslx outputs
library(xltabr) #MoJ RAP enabler built on openxlsx
```
## Star functions
- `readxl::read_excel`
- `readxl::excel_sheets`
- `purrr::map_dfr`
- `purrr::map_dfc`
- `purrr::map2_dfc`
- `fs::dir_ls`
## Navigating folders
A couple of pointers to navigate from your working directory, which, if you're using R projects (it is highly recommended that you do) will be wherever the `.Rproj` file is located
### Down
To navigate down folders use `/`. The path given below saves the file **my_df.csv** in the **data** folder, which itself is inside the **monthly_work** folder
```{r eval=FALSE}
readr::write_csv(
x = my_dataframe
, path = "monthly_work/data/my_df.csv"
)
```
### Up
To go up a folder use `../`. In particular you may need to do this when running Rmarkdown files. Rmarkdown files use their location as the working directory. If you have created an **R** folder, say, to stash all your scripts in, and a **data** folder to stash your data files in, then you will need to go up, before going down...
The path below goes up one folder, then into the **data** folder, where the **lookup_table.csv** is located.
```{r eval=FALSE}
lookup_table <- readr::read_csv(
file = "../data/lookup_table.csv"
)
```
## .rds
.rds is R's native file format, any object you create in R can be saved as a .rds file. The functions `readRDS` and `saveRDS` are base R functions.
```{r eval=FALSE}
#not run
saveRDS(
object = my_model #specify the R object you want to save
, file = "2019_0418_my_model.rds" #give it a name, don't forget the file extension
)
```
## .csv
We use the functions `read_csv` and `write_csv` from the `readr` package (which is part of the `tidyverse`). These are a little bit *cleverer* than their base counterparts, however, this cleverness can catch you out.
The file **messy_pokemon_data.csv** contains pokemon go captures data which has been deliberately messed up a bit. `read_csv` imputes the column specification from the first 1000 rows, which is fine if your first 1000 rows are representative of the data type. If not then subsequent data that can't be coerced into the imputed data type will be replaced with NA.
Looking at the column specification below notice that `read_csv` has recognised **time_first_capture** as a time type, but not **date_first_capture** as date type. Given the information that **combat_power** should be numeric we can see that something is also amiss here as `read_csv` has guessed character type for this column.
```{r}
pokemon <- readr::read_csv(
file = "data/messy_pokemon_data.csv"
)
```
Let's have a quick look at some data from these columns
```{r}
pokemon %>%
dplyr::select(species, combat_power, date_first_capture, time_first_capture) %>%
dplyr::arrange(desc(combat_power)) %>%
head()
```
The pokemon dataset has less than 1000 rows so `read_csv` has 'seen' the letters mixed in with some of the numbers in the **combat_power** column. It has guessed at character type because everything it has read in the column can be coerced to character type.
What if there are more than 1000 rows? For example, say you have a numeric column, but there are some letters prefixed to the numbers in some of the post-row-1000 rows. These values are still meaningful to you, and with some data wrangling you can extract the actual numbers. Unfortunately `read_csv` has guessed at type double based on the first 1000 rows and since character type cannot be coerced into double, these values will be replaced with `NA`. If you have messy data like this the best thing to do is to force `read_csv` to read in as character type to preserve all values as they appear, you can then sort out the mess yourself.
You can specify the column data type using the `col_types` argument. Below I have used a compact string of abbreviations to specify the column types, see the help at `?read_csv` or the `readr` vignette for the full list. You can see I got many parsing failures, which I can access with `problems()`, which is a data frame of the values that `read_csv` was unable to coerce into the type I specified, and so has replaced with NA.
```{r}
pokemon <- readr::read_csv(
file = "data/messy_pokemon_data.csv"
, col_types = "cdddcdcccDt"
)
# c = character, d = double, D = Date, t = time
tibble::glimpse(pokemon)
```
Let's take a look at the problems.
```{r}
problems(pokemon) %>%
head()
```
And since I know that there are problems with **combat_power** let's take a look there.
```{r}
problems(pokemon) %>%
dplyr::filter(col == "combat_power") %>%
head()
```
The `problems()` feature in `read_csv` is super useful, it helps you isolate the problem data so you can fix it.
Other arguments within `read_csv` that I will just mention, with their default settings are
- `col_names = TRUE`: the first row on the input is used as the column names.
- `na = c("", "NA")`: the default values to interpret as `NA`.
- `trim_ws = TRUE`: by default trims leading/trailing white space.
- `skip = 0`: number of lines to skip before reading data.
- `guess_max = min(1000, n_max)`: maximum number of records to use for guessing column type. NB the bigger this is the longer it will take to read in the data.
## .xlsx and .xls
Excel workbooks come in many shapes and sizes. You may have one or many worksheets in one or many workbooks, there may only be certain cells that you are interested in. Below are a few examples of how to cope with these variations using functions from `readxl` and `purrr` to iterate over either worksheets and/or workbooks, the aim being to end up with all the data in a single tidy dataframe.
### Single worksheet - single workbook
The simplest combination, you are interested in one rectangular dataset in a particular worksheet in one workbook. Leaving the defaults works fine on this dataset. Note that `readxl::read_excel` detects if the file is `.xlsx` or `.xls` and behaves accordingly.
```{r}
readxl::read_excel(path = "data/port0499.xlsx") %>%
head()
```
Let's set a few of the other arguments, run `?read_excel` in the console to see the full list.
```{r}
readxl::read_excel(
path = "data/port0499.xlsx"
, sheet = 1 #number or name of sheet, default is first sheet
, col_names = TRUE #default
, col_types = "text" #a single type will recycle to all columns, specify each using character vector of the same length eg c("numeric", "text", ...)
) %>%
head()
```
### Single worksheet - many workbooks
For example, you collect pokemon go capture data from many different players, the data all has the same structure and you want to read it in and row bind into a single dataframe in R.
![](image/pokemon_player.png)
<br/>
The code below collects the names of the 3 excel workbooks using `fs::dir_ls`, and, as these are not the only files in that folder, I've specified them using regular expressions (regex). Then we use `purrr::map_dfr` to iterate and rowbind over this list of files, applying the function we supply, that is `readxl::read_excel`. Since we are only reading a single worksheet per workbook we don't need to supply any arguments to `readxl:read_excel`, the defaults will work fine, each workbook path is piped in, in turn. The `.id` argument in `purrr:map_dfr` adds the file path into a new column, which we have named "player" in this instance. The "dfr" in `map_dfr` refers to the output "data-frame-rowbound".
```{r}
pokemon <- fs::dir_ls(path = "data", regex = "pokemon_player_.\\.xlsx$") %>%
purrr::map_dfr(.f = readxl::read_excel, .id = "player")
tibble::glimpse(pokemon)
```
Using `DT::datatable` for ease of viewing we can see that all 5 rows of data from each of the 3 workbooks has been read in, rowbound, and an id column has been added showing the workbook path.
```{r}
DT::datatable(data = pokemon)
```
Note that the `regex` argument in `fs::dir_ls` is applied to the full file path so if I had tried to specify that the file name starts with "pokemon" by front anchoring it using "^pokemon" this would return no results, since the full name is actually "data/pokemon...". Helpful regex links below.
[regex cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf)
[stringr cheatsheet including regex](http://edrub.in/CheatSheets/cheatSheetStringr.pdf)
### Many worksheets - single workbook
You have a single workbook, but it contains many worksheets of interest, each containing rectangular data with the same structure. For example, you have a workbook containing pokemon go captures data, where each different data collection point has its own sheet. The data structure, column names and data types are consistent. You want to read in and combine these data into a single dataframe.
The code below sets the location of the workbook and puts this in the object `path`. It then collects the names of all the sheets in that workbook using `readxl::excel_sheets`. Next `purrr::set_names` sets these names in a vector so that they can be used in the next step. This vector of names is implictly assigned to the `.x` argument in `purrr::map_dfr` as it is the first thing passed to it. This means we can refer to it as `.x` in the function we are iterating, in this case `readxl::read_excel`. Finally, an id column is included, made up of the sheet names and named "sheet". The output is a single dataframe with all the sheets row bound together.
```{r}
path <- "data/multi_tab_messy_pokemon_data.xlsx"
pokemon_collections <- readxl::excel_sheets(path = path) %>%
purrr::set_names() %>%
purrr::map_dfr(
~ readxl::read_excel(path = path, sheet = .x)
, .id = "sheet"
)
DT::datatable(data = pokemon_collections)
```
### Many worksheets - many workbooks
Now we can use the above two solutions to combine data from many worksheets spread across many workbooks. As before, the data is rectangular and has the same structure. For example, you receive a workbook every month, containing pokemon go captures data, and each data collection point has its own sheet.
![](image/pokemon_collection_point.png)
We create a function to import and combine the sheets from a single workbook, and then iterate this function over all the workbooks using `purrr::map_df`.
```{r}
#function to combine sheets from a single workbook
read_and_combine_sheets <- function(path){
readxl::excel_sheets(path = path) %>%
purrr::set_names() %>%
purrr::map_df(
~ readxl::read_excel(path = path, sheet = .x)
, .id = "sheet"
)
}
#code to iterate over many workbooks
pokemon_monthly_collections <- fs::dir_ls(
path = "data", regex = "pokemon_2019\\d{2}\\.xlsx$") %>%
purrr::map_df(
read_and_combine_sheets
, .id = "month"
)
DT::datatable(data = pokemon_monthly_collections)
```
### Non-rectangular data - single worksheet - single workbook
You have received some kind of data entry form that has been done in excel in a more human readable, rather than machine readable, format. Some of the cells contain instructions and admin data so you only want the data held in specific cells. This is non-rectangular data, that is, the data of interest is dotted all over the place. In this example we have pet forms, and the data of interest is in cells **B2**, **D5** and **E8** only.
Here's an image of what the data looks like.
![](image/pet_form.png)
Let's see what we get if we naively try to read it in.
```{r message=FALSE}
readxl::read_excel(
path = "data/pet_form_1.xlsx"
) %>%
knitr::kable() %>%
kableExtra::kable_styling(full_width = F, position = "left")
```
It's not what we wanted, let's try again, now using the `range` argument
```{r message=FALSE}
readxl::read_excel(
path = "data/pet_form_1.xlsx"
, col_names = FALSE
, range = "A2:B2"
) %>%
knitr::kable() %>%
kableExtra::kable_styling(full_width = F, position = "left")
```
The `range` argument helps, we have picked up one bit of the data, and its name. The `range` argument uses the `cellranger` package which allows you to refer to ranges in Excel files in Excel style. However, we have 3 disconnected data points, we need to iterate, so it's `purrr` to the rescue once more.
The code below demonstrates explicitly that the `.x` argument in `purrr::map_dfr` takes the vector of things that will be iterated over in the supplied function. In this case we are giving the `range` argument of `readxl::read_excel` three individual cells to iterate over. These will then be rowbound so we end up with a single dataframe comprising a single column, named "cells", containing 3 rows.
```{r}
pet_details <- purrr::map_dfr(
.x = c("B2", "D5", "E8")
, ~ readxl::read_excel(
path = "data/pet_form_1.xlsx"
, range = .x
, col_names = "cells" #assign name
, col_types = "text" #have to use text to preserve all data in single column
)
)
pet_details %>%
knitr::kable() %>%
kableExtra::kable_styling(full_width = F, position = "left")
```
This is an improvement, we have a dataframe named `pet_details` comprising a single "cells" column, which contains all the relevant data from this worksheet.
We could now try to reshape it, however, a better idea is to use `map_dfc` since we actually want to column bind these data rather than rowbind them. The read out from `tibble::glimpse` shows that the different variable types have been picked up, which is also helpful. The default naming of the columns gives a clue as to how the function works.
```{r}
pet_details <- purrr::map_dfc(
.x = c("B2", "D5", "E8") #vector of specific cells containing the data
, ~ readxl::read_excel(
path = "data/pet_form_1.xlsx"
, range = .x
, col_names = FALSE
)
)
tibble::glimpse(pet_details)
```
```{r echo=FALSE}
pet_details %>%
knitr::kable() %>%
kableExtra::kable_styling(full_width = F, position = "left")
```
This is pretty close to what we want, the only sticking point is that we still don't have the correct column names. We could deal with this using `dplyr::rename`, but an even better idea is to use `purrr::map2_dfc`. The `map2` variant allows you to iterate over two arguments simultaneously (into the same function).
```{r}
pet_details_2 <- purrr::map2_dfc(
.x = c("B2", "D5", "E8") #vector of specific data cells
, .y = c("Name", "Age", "Species") #vector of column names
, ~ readxl::read_excel(
path = "data/pet_form_1.xlsx"
, range = .x
, col_names = .y
)
)
tibble::glimpse(pet_details_2)
```
```{r echo=FALSE}
pet_details_2 %>%
knitr::kable() %>%
kableExtra::kable_styling(full_width = F, position = "left")
```
### Non-rectangular data - single worksheet - many workbooks
Having solved for one workbook and worksheet, we can functionalise and iterate to gather the data from every wookbook, two of which are shown below.
![](image/pet_forms.png)
<br/>
The function `cells_to_rows` below iterates over `read_excel` reading each of the three cells from the worksheet, applying the corresponding column name as it goes. It takes three character or character vector inputs, `path`, `cells`, and `col_names`.
```{r}
cells_to_rows <- function(path, cells, col_names){
purrr::map2_dfc(
.x = cells
, .y = col_names
, ~ readxl::read_excel(
path = path
, range = .x
, col_names = .y
)
)
}
```
Let's test it on the first pet form data, first setting the paramaters to use in the function.
```{r}
path <- "data/pet_form_1.xlsx"
cells <- c("B2", "D5", "E8")
col_names <- c("Name", "Age", "Species")
pet_form_1 <- cells_to_rows(
path = path, cells = cells, col_names = col_names
)
```
```{r echo=FALSE}
pet_form_1 %>%
knitr::kable() %>%
kableExtra::kable_styling(full_width = F, position = "left")
```
It works! So now we can iterate this over all the pet form workbooks, specifying the paths using regex as before. Note below we use `.x` in the `path` argument in the `cells_to_rows` function to refer to the vector of paths piped to `purrr::map_dfr` from `fs::dir_ls`.
```{r}
cells <- c("B2", "D5", "E8")
col_names <- c("Name", "Age", "Species")
all_pet_forms <- fs::dir_ls(
path = "data", regex = "pet_form_\\d\\.xlsx$") %>%
purrr::map_dfr(
~ cells_to_rows(path = .x, cells = cells, col_names = col_names)
, .id = "path"
)
```
```{r echo=FALSE}
all_pet_forms %>%
knitr::kable() %>%
kableExtra::kable_styling(full_width = F, position = "left")
```
### Non-rectangular data - many worksheets - single workbook
Now we have more than one worksheet in a single workbook, and the data looks like this, the workbook is from a "pet house" and each worksheet is pet details.
![](image/pet_house.png)
<br/>
To incorporate the worksheets element we rejig the `cells_to_rows` function from above and give it a "sheet" argument, so it can be passed a specific sheet.
```{r}
sheet_cells_to_rows <- function(path, sheet, cells, col_names){
purrr::map2_dfc(
.x = cells
, .y = col_names
, ~ readxl::read_excel(
path = path
, sheet = sheet
, range = .x
, col_names = .y
)
)
}
```
We now have the function `sheet_cells_to_rows` that can accept a list of worksheet names. As before we use `readxl::excel_sheets` to collect the worksheet names, first setting the other parameters
```{r}
path <- "data/pet_house_1.xlsx"
cells <- c("B2", "D5", "E8")
col_names <- c("Name", "Age", "Species")
pet_house_1 <- readxl::excel_sheets(path = path) %>%
purrr::set_names() %>%
purrr::map_dfr(
~ sheet_cells_to_rows(path = path
, sheet = .x
, cells = cells
, col_names = col_names)
, .id = "sheet"
)
```
```{r echo=FALSE, eval=FALSE}
pet_house_1 %>%
knitr::kable() %>%
kableExtra::kable_styling(full_width = F, position = "left")
```
### Non-rectangular data - many worksheets - many workbooks
Finally we have many workbooks each containing many worksheets, each containing many cells, as before we want to read them in and combine.
![](image/pet_houses.png)
<br/>
We could functionalise the code above that reads and combines the cells in many worksheets from a single workbook, but an alternative approach is used below. We create an anonymous function an use that on the fly. This is useful if the function is a one off, and not too complicated. The anonymous function below still depends on the `sheet_cells_to_rows` we created earlier though.
```{r}
cells <- c("B2", "D5", "E8")
col_names <- c("Name", "Age", "Species")
pet_house_all <- fs::dir_ls(
path = "data", regex = "pet_house_\\d\\.xlsx$") %>%
purrr::map_dfr(
function(path){
readxl::excel_sheets(path = path) %>%
purrr::set_names() %>%
purrr::map_dfr(
~ sheet_cells_to_rows(path = path
, sheet = .x
, cells = cells
, col_names = col_names)
, .id = "sheet"
)
}
, .id = "path"
)
```
```{r echo=FALSE}
pet_house_all %>%
knitr::kable() %>%
kableExtra::kable_styling(full_width = F, position = "left")
```
## Exporting to .xlsx
We recommend `openxlsx` and `xltabr` for writing and formatting tables to MS Excel. The latter is a wrapper built on `openxlsx` developed by MoJ and is specifically aimed at making it easier to produce publication ready tables. `xltabr` has one known drawback in that it applies the foramtting cell by cell, so if your tables are massive (~100,000 rows) it will take too long, in this instance you should resort to `openxlsx`.
https://cran.r-project.org/web/packages/openxlsx/openxlsx.pdf
https://cran.r-project.org/web/packages/xltabr/xltabr.pdf
So you can see an example of the formatting available, here is a link to DfT's Search and Rescue Helicopter statistics tables that are produced in R using `xltabr`.
https://www.gov.uk/government/statistical-data-sets/search-and-rescue-helicopter-sarh01
## .sav
Use `haven` to import SPSS, Stata and SAS files.
## SQL
Below are links to DfT Coffee and Coding materials on the subject of connecting R to SQL
[20181114_Connecting_R_to_SQL](https://github.com/departmentfortransport/coffee-and-coding/tree/master/20181114_Connecting_R_to_SQL)
[20190130_SQL_and_Excel_to_R](https://github.com/departmentfortransport/coffee-and-coding/tree/master/20190130_SQL_and_Excel_to_R)
## GCP
### BigQuery
Link below to DfT Coffee and Coding materials on how to use bigrquery to interact with GCP's BigQuery
[20190403_bigrquery](https://github.com/departmentfortransport/coffee-and-coding/tree/master/20190403_bigrquery)