-
Notifications
You must be signed in to change notification settings - Fork 26
/
Copy pathdates.Rmd
150 lines (105 loc) · 8.77 KB
/
dates.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
# (PART) Computational thinking {-}
# (PART) Computational methods {-}
# Dates
Computational historians unsurprisingly often have to deal with dates. It is difficult to do even basic operations with dates, however. For example, since months and years have unequal numbers of days, even figuring out the duration of time between two dates can be tricky. But R includes a special type of object, `Date`, which makes the task much easier. As long as you are dealing with [Gregorian dates](https://en.wikipedia.org/wiki/Gregorian_calendar), you can do almost anything that you might need to do with dates using R's built-in functions and the [lubridate](https://cran.r-project.org/package=lubridate) package.^[@R-lubridate.]
We should be precise in our definition of what a date is. A date must be specified to an exact year, month, and day. If you are dealing only with years, then a simple numeric or integer vector is sufficient. If you are dealing with dates and times, then you will need to specify the time and possibly the time zone as well. This chapter will not get into dates and times, but if you understand how to work with date objects, those principles are easily extended to date and time object.^[See R's documentation at `?POSIXt` or look at the [lubridate](https://cran.r-project.org/package=lubridate) documentation for time classes.]
In this chapter we will use the [lubridate](https://cran.r-project.org/package=lubridate) package alongside our customary [tidyverse](https://cran.r-project.org/package=tidyverse) and [historydata](https://cran.r-project.org/package=historydata) packages.
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
library(historydata)
library(lubridate)
```
## Years
If you are dealing only with dates in the form of years, then a numeric column in your data with the year information is sufficient. For example, the `dijon_prices` dataset in [historydata](https://cran.r-project.org/package=historydata) contains price series for various commodities in Dijon, France, from 1568 to 1630.
```{r}
dijon_prices
```
Using the year column, we can filter the data frame to a particular year as we are accustomed to doing in [dplyr](https://cran.r-project.org/package=dplyr), or we can use the `year` column as a variable in a [ggplot2](https://cran.r-project.org/package=ggplot2) plot.
```{r}
dijon_prices %>%
filter(year == 1600)
```
Often, though, we have a year embedded in some other kind of variable, and we would like to extract that year so that we can manipulate it directly. For example, we might have document IDs or file names that contain a year, or a set of sentences that include dates.
```{r}
doc_ids <- c("NY1850", "CA1851", "CA1850", "WA1855", "NV1861")
files <- c("sotu-1968.txt", "sotu-1969.txt", "sotu-1970.txt", "sotu-1971.txt")
sentences <- c("George Washington became president in 1789.",
"John Adams became president in 1798.",
"In 1801, Thomas Jefferson became president.",
"James Madison was inaugurated in 1809.")
```
In each of these cases, we can describe what we need to do. We need to extract a four-character sequence of digits, which will be our year, and we need to convert that sequence of characters into an integer which we can treat as a number. You will often find yourself writing a function that looks like this.
```{r}
extract_year <- function(x) {
stopifnot(is.character(x))
year_char <- stringr::str_extract(x, "\\d{4}")
year_int <- as.integer(year_char)
year_int
}
```
This function first checks that the input vector `x` is a character vector; if it's not, then the input is probably a mistake. Then it uses the `str_extract()` function from the [stringr](https://cran.r-project.org/package=stringr) package to find the first sequence of four digits. (That is the meaning of the regular expression `"\\d{4}"`.) Then it turns the resulting character vector into an integer and returns it.
We can test this function on our sample data.
```{r}
extract_year(doc_ids)
extract_year(files)
extract_year(sentences)
```
Because this function is vectorized, we could use it in a [dplyr](https://cran.r-project.org/package=dplyr) expression in order to create a new column of years from an existing column.
## R's date objects
R includes a `Date` class for representing dates and doing calculations with them. You can turn a text representation of a date into a `Date` object using the `as.Date()` function. This function takes a `format =` argument that lets you specify the order of the elements of the date, but the default is to accept dates in the form `YYYY-MM-DD`. Let's create two date objects
```{r}
fort_sumter <- as.Date("1861-04-12")
appomattox <- as.Date("1865-04-09")
```
Now that we have created our date objects we can use comparison functions to figure out if dates come before or after one another.
```{r}
fort_sumter <= appomattox
fort_sumter >= appomattox
fort_sumter == appomattox
```
We can also calculate the difference in time using the `-` function. Note that the returned object prints out the length in days, but it is actually an object of class `difftime`. You can get a measurement of a time difference in different intervals by using the `difftime()` function.
```{r}
appomattox - fort_sumter
```
Another useful operation with dates from base R is creating a sequence between two dates at some regular interval.
```{r}
seq(from = as.Date("1860-01-01"), to = as.Date("1861-01-01"), by = "month")
```
## Parsing dates with lubridate
While there are other things that you can do with dates in base R, you almost always better off using the [lubridate](https://cran.r-project.org/package=lubridate) package. That package provides many additional functions and some additional classes for dealing with dates in a sensible way.
The [lubridate](https://cran.r-project.org/package=lubridate) package provides a series of functions in the form `mdy()`, `ymd()`, and `dmy()`, where those letters correspond to the position of the month, day, and year in a date. As long as your dates are in a reasonably consistent format, [lubridate](https://cran.r-project.org/package=lubridate)
should be able to parse them. For example, lubridate can parse these different ways of writing the same dates.
```{r}
mdy(c("September 17, 1862", "July 21, 1861", "July 1, 1863"))
dmy(c("17 September 1862", "21 July 1861", "1 July 1863"))
mdy("9/17/1862", "07/21/1861", "07/01/1863")
ymd("1862-09-17", "1861-07-21", "1863-07-01")
```
You often don't have a choice about the formats of dates in data that you receive, so
## Other operations on dates
Once you have a vector of dates, you might need to extract some component of the date, such as the year or the day of the week. The [lubridate](https://cran.r-project.org/package=lubridate) package provides functions to pull out those pieces.
```{r}
gettysburg <- mdy("July 1, 1863", "July 2, 1863", "July 3 1863")
year(gettysburg)
month(gettysburg)
day(gettysburg)
weekdays(gettysburg)
```
Sometimes you have dates that are specific down to the day, but you are interested in aggregating them by year, month, or week. For example, you might have the dates of newspaper issues, but want to know how many papers were published in a year or a month. For that you can use lubridate's `round_date()`, `floor_date()`, and `ceiling_date()` functions.
```{r}
floor_date(gettysburg, unit = "year")
floor_date(gettysburg, unit = "month")
floor_date(gettysburg, unit = "week")
```
Note that `floor_date()` will give you the date at the start of the week or month or year, while `round_date()` will give you the nearest start of the week or month or year.
```{r}
floor_date(gettysburg, unit = "week")
round_date(gettysburg, unit = "week")
```
The [lubridate](https://cran.r-project.org/package=lubridate) package also contains classes and functions for intervals and periods, which you can read about in [its vignette](https://cran.rstudio.com/web/packages/lubridate/vignettes/lubridate.html).
## Creating data with dates
When you are creating your own data, such as when you transcribe a source, you should write dates in a standardized way. The standard way of writing a date (called [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601)) is to include a four-digit year, a two-digit month, and two-digit day, each separated by hyphens: `YYYY-MM-DD`. This way of writing dates has several virtues. One of them is that even when the dates are treated as text, they sort correctly in chronological order. The other is that by default many R functions expect dates to be in that format.
For example, in the toy data file `webster-speeches.csv` ([download here](data/webster-speeches.csv)), the dates are written as `1800-07-04`. When we load the file with the `read_csv()` function from [readr](https://cran.r-project.org/package=readr), that `date` column is automatically parsed into a date format.
```{r, message=FALSE}
read_csv("data/webster-speeches.csv")
```