forked from UofTCoders/rcourse
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlec03-basic-r.Rmd
585 lines (458 loc) · 21.5 KB
/
lec03-basic-r.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
---
title: "Manipulating and analyzing data with dplyr"
author: Joel Östblom
---
## Lesson preamble
> ### Learning objectives
>
> * Describe what a data frame is.
> * Load external data from a .csv file into a data frame in R.
> * Summarize the contents of a data frame in R.
> * Understand the purpose of the **`dplyr`** package.
> * Learn to use data wrangling commands `select`, `filter`, `%>%,` and `mutate`
> from the **`dplyr`** package.
>
> ### Lesson outline
>
> - Data set background (10 min)
> - What are data frames (15 min)
> - R packages for data analyses (5 min)
> - Data wrangling in dplyr (40 min)
-----
## Dataset background
Today, we will be working with real data from a longitudinal study of the
species abundance in the Chihuahuan desert ecosystem near Portal, Arizona, USA.
This study includes observations of plants, ants, and rodents from 1977 - 2002,
and has been used in over 100 publications. More information is available in
[the abstract of this paper from 2009](
https://onlinelibrary.wiley.com/doi/10.1890/08-1222.1/full). There are several
datasets available related to this study, and we will be working with datasets
that have been preprocessed by the [Data
Carpentry](https://www.datacarpentry.org) to facilitate teaching. These are made
available online as _The Portal Project Teaching Database_, both at the [Data
Carpentry website](https://datacarpentry.org/ecology-workshop/data/), and on
[Figshare](https://figshare.com/articles/Portal_Project_Teaching_Database/1314459/6).
Figshare is a great place to publish data, code, figures, and more openly to
make them available for other researchers and to communicate findings that are
not part of a longer paper.
### Presentation of the survey data
We are studying the species and weight of animals caught in plots in our study
area. The dataset is stored as a comma separated value (CSV) file. Each row
holds information for a single animal, and the columns represent:
| Column | Description |
|------------------|------------------------------------|
| record_id | unique id for the observation |
| month | month of observation |
| day | day of observation |
| year | year of observation |
| plot_id | ID of a particular plot |
| species_id | 2-letter code |
| sex | sex of animal ("M", "F") |
| hindfoot_length | length of the hindfoot in mm |
| weight | weight of the animal in grams |
| genus | genus of animal |
| species | species of animal |
| taxa | e.g. rodent, reptile, bird, rabbit |
| plot_type | type of plot |
To read the data into R, we are going to use a function called `read_csv`. This
function is contained in an R-package called
[`readr`](https://readr.tidyverse.org/). R-packages are a bit like browser
extensions; they are not essential, but can provide nifty functionality. We will
go through R-packages in general and which ones are good for data analyses in
detail later in this lecture. Now, let's install `readr`:
```{r, eval=FALSE}
install.packages('readr') # You only need to install a package once
```
Now we can use the `read_csv` function. One useful option that `read_csv`
includes, is the ability to read a CSV file directly from a URL, without
downloading it in a separate step:
```{r, eval=FALSE}
surveys <- readr::read_csv('https://ndownloader.figshare.com/files/2292169')
```
However, it is often a good idea to download the data first, so you have a copy
stored locally on your computer in case you want to do some offline analyses, or
the online version of the file changes or the file is taken down. You can either
download the data manually or from within R:
```{r, eval=FALSE}
download.file("https://ndownloader.figshare.com/files/2292169",
"portal_data.csv") # Saves this name in the current directory
```
The data is read in by specifying its local path.
```{r, eval=FALSE}
surveys <- readr::read_csv('portal_data.csv')
```
```{r, echo=FALSE}
surveys <- readr::read_csv('data/portal_data.csv')
```
This statement produces some output regarding which data type it found in each
column. If we want to check this in more detail, we can print the variable's
value: `surveys`.
```{r}
surveys
```
In the online html-version of this lecture, you only see the first few rows of
the data frame. Running the code chunk above in the R Notebook would display a
nice tabular view of the data, which also includes pagination when there are
many rows and we can click the green arrow to view all the columns.
Technically, this object is actually a `tibble` rather than a data frame, as
indicated in the output. The reason for this is that `read_csv` automatically
converts the data into to a `tibble` when loading it. Since a `tibble` is just
a data frame with some convenient extra functionality, we will use these words
interchangeably from now on.
If we just want to glance at how the data frame looks, it is sufficient to
display only the top (the first 6 lines) using the function `head()`:
```{r}
head(surveys)
```
## What are data frames?
Data frames are the _de facto_ data structure for most tabular data, and what we
use for statistics and plotting. A data frame can be created by hand, but most
commonly they are generated by the function `read_csv()`; in other words, when
importing spreadsheets from your hard drive (or the web).
A data frame is a representation of data in the format of a table where the
columns are vectors that all have the same length. Because the columns are
vectors, they all contain the same type of data as we discussed in last class
(e.g., characters, integers, factors). We can see this when inspecting the
structure of a data frame with the function `str()`:
```{r}
str(surveys)
```
Integer refers to a whole number, such as 1, 2, 3, 4, etc. Numbers with
decimals, 1.0, 2.4, 3.333, are referred to as floats. Factors are used to
represent categorical data. Factors can be ordered or unordered, and
understanding them is necessary for statistical analysis and for plotting.
Factors are stored as integers, and have labels (text) associated with these
unique integers. While factors look (and often behave) like character vectors,
they are actually integers under the hood, and you need to be careful when
treating them like strings.
### Inspecting `data.frame` objects
We already saw how the functions `head()` and `str()` can be useful to check the
content and the structure of a data frame. Here is a non-exhaustive list of
functions to get a sense of the content/structure of the data. Let's try them
out!
* Size:
* `dim(surveys)` - returns a vector with the number of rows in the first element
and the number of columns as the second element (the dimensions of the object)
* `nrow(surveys)` - returns the number of rows
* `ncol(surveys)` - returns the number of columns
* Content:
* `head(surveys)` - shows the first 6 rows
* `tail(surveys)` - shows the last 6 rows
* Names:
* `names(surveys)` - returns the column names (synonym of `colnames()` for `data.frame`
objects)
* `rownames(surveys)` - returns the row names
* Summary:
* `str(surveys)` - structure of the object and information about the class,
length, and content of each column
* `summary(surveys)` - summary statistics for each column
Note: most of these functions are "generic", they can be used on other types of
objects besides `data.frame`.
#### Challenge
Based on the output of `str(surveys)`, can you answer the following questions?
* What is the class of the object `surveys`?
* How many rows and how many columns are in this object?
* How many species have been recorded during these surveys?
```{r, include=FALSE}
## Answers
##
## * class: data frame
## * how many rows: 34786, how many columns: 13
## * how many species: 48
```
### Indexing and subsetting data frames
Our survey data frame has rows and columns (it has 2 dimensions). If we want to
extract some specific data from it, we need to specify the "coordinates" we
want from it. Row numbers come first, followed by column numbers. When
indexing, base R data frames return a different format depending on how we
index the data (i.e. either a vector or a data frame), but with enhanced data
frames, `tibbles`, the returned object is almost always a data frame.
```{r}
surveys[1, 1] # first element in the first column of the data frame
surveys[1, 6] # first element in the 6th column
surveys[, 1] # first column in the data frame
surveys[1] # first column in the data frame
surveys[1:3, 7] # first three elements in the 7th column
surveys[3, ] # the 3rd element for all columns
surveys[1:6, ] # equivalent to head(surveys)
```
`:` is a special operator that creates numeric vectors of integers in
increasing or decreasing order; test `1:10` and `10:1` for instance. This works
similarly to `seq`, which we looked at earlier in class:
```{r}
0:10
seq(0, 10)
# We can test if all elements are the same
0:10 == seq(0,10)
all(0:10 == seq(0,10))
```
You can also exclude certain parts of a data frame using the "`-`" sign:
```{r}
surveys[,-1] # All columns, except the first
surveys[-c(7:34786),] # Equivalent to head(surveys)
```
As well as using numeric values to subset a `data.frame` (or `matrix`), columns
can be called by name, using one of the four following notations: <!-- Not sure how important it is to learn the difference vs just teaching the preferred way with the footnote that there are other ways also. -->
```{r}
surveys["species_id"] # Result is a data.frame
surveys[, "species_id"] # Result is a data.frame
```
For our purposes, these notations are equivalent. RStudio knows about
the columns in your data frame, so you can take advantage of the autocompletion
feature to get the full and correct column name.
Another syntax that is often used to specify column names is `$`. In this case,
the returned object is actually a vector. We will not go into detail about this,
but since it is such common usage, it is good to be aware of this.
```{r}
# We use `head()` since the output from vectors are not automatically cut off
# and we don't want to clutter the screen with all the `species_id` values
head(surveys$species_id) # Result is a vector
```
#### Challenge
1. Create a `data.frame` (`surveys_200`) containing only the observations from
row 200 of the `surveys` dataset.
2. Notice how `nrow()` gave you the number of rows in a `data.frame`?
* Use that number to pull out just that last row in the data frame.
* Compare that with what you see as the last row using `tail()` to make
sure it's meeting expectations.
* Pull out that last row using `nrow()` instead of the row number.
* Create a new data frame object (`surveys_last`) from that last row.
3. Use `nrow()` to extract the row that is in the middle of the data
frame. Store the content of this row in an object named `surveys_middle`.
4. Combine `nrow()` with the `-` notation above to reproduce the behavior of
`head(surveys)` keeping just the first through 6th rows of the surveys
dataset.
```{r, include=FALSE}
## Answers
surveys_200 <- surveys[200, ]
surveys_last <- surveys[nrow(surveys), ]
surveys_middle <- surveys[nrow(surveys)/2, ]
surveys_head <- surveys[-c(7:nrow(surveys)),]
```
## R packages for data analyses
There are certainly many tools built in to base R which can be used to
understand data, but we are going to use a package called `dplyr` which makes
exploratory data analysis (EDA) particularly intuitive and effective.
First, let's explain the concept of an R-package. What we have used so far is
all part of base R (except `read_csv`), together with many more functions. Every
package included in base R will be installed on any computer where R is
installed, since they are considered critical for using R, e.g. `c()`, `mean()`,
`+`, `-`, etc. However, since R is an open language, it is easy to develop your
own R-package that provides new functionality and submit it to the official
repository for R-packages called CRAN (Comprehensive R Archive Network). CRAN
has thousands of packages, and all these cannot be installed by default, because
then base R installation would be huge and most people would only be using a
fraction of everything installed on their machine. It would be like if you
downloaded the Firefox or Chrome browser and you would get all extensions and
addons installed by default, or as if your phone came with every app ever made
for it already installed when you bought it: quite impractical.
To install a package in R, we use the function `install.packages()`. In this
case, the package `dplyr` is part of a bigger collections of packages called
[`tidyverse`](https://www.tidyverse.org/) (just like Microsoft Word is part of
Microsoft Office), which also contains the `readr` package we installed in the
beginning alongside many more packages that makes exploratory data analyses more
intuitive and effective.
```{r, eval=FALSE}
install.packages('tidyverse')
```
Now all the `dplyr` functions are available to us by prefacing them with
`dplyr::`:
```{r}
dplyr::glimpse(surveys) # `glimpse` is similar to `str`
```
We will be using this package a lot, and it would be a little annoying to have
to type `dplyr::` every time, so we will load it into our current environment.
This needs to be done once for every new R session and makes all functions
accessible without their package prefix, which is very convenient, as long as
you are aware of which function you are using and don't load a function with the
same name from two different packages.
```{r}
# We could also do `library(dplyr)`, but we need the rest of the
# tidyverse packages later, so we might as well import the entire collection.
library(tidyverse)
glimpse(surveys)
```
## Data wrangling with dplyr
Wrangling here is used in the sense of maneuvering, managing, controlling, and
turning your data upside down and inside out to look at it from different angles
in order to understand it. The package **`dplyr`** provides easy tools for the
most common data manipulation tasks. It is built to work directly with data
frames, with many common tasks optimized by being written in a compiled language
(C++), this means that many operations run much faster than similar tools
in R. An additional feature is the ability to work directly with data stored in
an external database, such as SQL-databases. The ability to work with databases
is great because you are able to work with much bigger datasets (100s of GB)
than your computer could normally handle. We will not talk in detail about this
in class, but there are great resources online to learn more (e.g. [this lecture
from Data
Carpentry](https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html)).
### Selecting columns and filtering rows
We're going to learn some of the most common **`dplyr`** functions: `select()`,
`filter()`, `mutate()`, `group_by()`, and `summarise()`. To select columns of a
data frame, use `select()`. The first argument to this function is the data
frame (`surveys`), and the subsequent arguments are the columns to keep.
```{r}
select(surveys, plot_id, species_id, weight, year)
```
To choose rows based on a specific criteria, use `filter()`:
```{r}
filter(surveys, year == 1995)
```
#### An aside on conditionals
Note that to check for equality, R requires *two* equal signs (`==`). This is
to prevent confusion with object assignment, since otherwise `year = 1995`
might be interpreted as 'set the `year` parameter to `1995`', which is not
what `filter` does!
Basic conditionals in R are broadly similar to how they're already expressed
mathematically:
```{r}
2 < 3
5 > 9
```
However, there are a few idiosyncrasies to be mindful of for other conditionals:
```{r}
2 != 3 # not equal
2 <= 3 # less than or equal to
5 >= 9 # greater than or equal to
```
Finally, the `%in%` operator is used to check for membership:
```{r}
2 %in% c(2, 3, 4) # check whether 2 in c(2, 3, 4)
```
All of the above conditionals are compatible with `filter`, with the
key difference being that `filter` expects column names as part of
conditional statements instead of individual numbers.
### Chaining functions together using pipes
But what if you wanted to select and filter at the same time? There are three
ways to do this: use intermediate steps, nested functions, or pipes. With
intermediate steps, you essentially create a temporary data frame and use that
as input to the next function. This can clutter up your workspace with lots of
objects:
```{r}
temp_df <- select(surveys, plot_id, species_id, weight, year)
filter(temp_df, year == 1995)
```
You can also nest functions (i.e. one function inside of another).
This is handy, but can be difficult to read if too many functions are nested as
things are evaluated from the inside out.
```{r}
filter(select(surveys, plot_id, species_id, weight, year), year == 1995)
```
The last option, pipes, are a fairly recent addition to R. Pipes let you take
the output of one function and send it directly to the next, which is useful
when you need to do many things to the same dataset. Pipes in R look like `%>%`
and are made available via the `magrittr` package that also is included in the
`tidyverse`. If you use RStudio, you can type the pipe with <kbd>Ctrl/Cmd</kbd> +
<kbd>Shift</kbd> + <kbd>M</kbd>.
```{r}
surveys %>%
select(., plot_id, species_id, weight, year) %>%
filter(., year == 1995)
```
The `.` refers to the object that is passed from the previous line. In this
example, the data frame `surveys` is passed to the `.` in the `select()`
statement. Then, the modified data frame which is the result of the `select()`
operation, is passed to the `.` in the filter() statement. Put more simply:
whatever was the result from the line above the current line, will be used in
the current line.
Since it gets a bit tedious to write out all the dots, **`dplyr`** allows for
them to be omitted. By default, the pipe will pass its input to the _first_
argument of the right hand side function; in `dplyr`, the first argument
is always a data frame. The chunk below gives the same output as the one above:
```{r}
surveys %>%
select(plot_id, species_id, weight, year) %>%
filter(year == 1995)
```
Another example:
```{r}
surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
```
In the above code, we use the pipe to send the `surveys` dataset first through
`filter()` to keep rows where `weight` is less than 5, then through `select()`
to keep only the `species_id`, `sex`, and `weight` columns. Since `%>%` takes
the object on its left and passes it as the first argument to the function on
its right, we don't need to explicitly include it as an argument to the
`filter()` and `select()` functions anymore.
If this runs off your screen and you just want to see the first few rows, you
can use a pipe to view the `head()` of the data. (Pipes work with
non-**`dplyr`** functions, too, as long as either the **`dplyr`** or `magrittr`
package is loaded).
```{r}
surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight) %>%
head()
```
If we wanted to create a new object with this smaller version of the data, we
could do so by assigning it a new name:
```{r}
surveys_sml <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
surveys_sml
```
Note that the final data frame is the leftmost part of this expression.
A single expression can also be used to filter for several criteria, either
matching *all* criteria (`&`) or *any* criteria (`|`):
```{r}
surveys %>%
filter(taxa == 'Rodent' & sex == 'F') %>%
select(sex, taxa)
```
```{r}
surveys %>%
filter(species == 'clarki' | species == 'leucophrys') %>%
select(species, taxa)
```
#### Challenge
Using pipes, subset the `survey` data to include individuals collected before
1995 and retain only the columns `year`, `sex`, and `weight`.
```{r, include=FALSE}
## Answer
surveys %>%
filter(year < 1995) %>%
select(year, sex, weight)
```
### Creating new columns with mutate
Frequently, you'll want to create new columns based on the values in existing
columns. For instance, you might want to do unit conversions, or find the ratio
of values in two columns. For this we'll use `mutate()`.
To create a new column of weight in kg:
```{r}
surveys %>%
mutate(weight_kg = weight / 1000)
```
You can also create a second new column based on the first new column within the
same call of `mutate()`:
```{r}
surveys %>%
mutate(weight_kg = weight / 1000,
weight_kg2 = weight_kg * 2)
```
The first few rows of the output are full of `NA`s, so if we wanted to remove
those we could insert a `filter()` in the chain:
```{r}
surveys %>%
filter(!is.na(weight)) %>%
mutate(weight_kg = weight / 1000)
```
`is.na()` is a function that determines whether something is an `NA`. The `!`
symbol negates the result, so we're asking for everything that *is not* an `NA`.
#### Challenge
Create a new data frame from the `surveys` data that meets the following
criteria: contains only the `species_id` column and a new column called
`hindfoot_half` containing values that are half the `hindfoot_length` values.
In this `hindfoot_half` column, there are no `NA`s and all values are less
than 30.
**Hint**: think about how the commands should be ordered to produce this data frame!
```{r, include=FALSE}
## Answer
surveys_hindfoot_half <- surveys %>%
filter(!is.na(hindfoot_length)) %>%
mutate(hindfoot_half = hindfoot_length / 2) %>%
filter(hindfoot_half < 30) %>%
select(species_id, hindfoot_half)
```