---
title: "R4DS Study Group - Week 41"
author: Pierrette Lo
date: 1/15/2021
output:
  github_document:
    toc: true
    toc_depth: 2
    html_preview: true
editor_options:
  chunk_output_type: inline
---
```{r setup, include=FALSE, cache=FALSE}
# allow code to show errors
knitr::opts_chunk$set(error = TRUE)
```
## This week's assignment
* Chapter 14.4
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
```
## Chapter 14\.4 Tools
### 14\.4\.1\.1 Detect matches
>1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
>Find all words that start or end with x.
I subbed "a" for "x" below, as there are no words in this dataset that start with x.
```{r}
str_subset(words, "^a|a$")
words[str_detect(words, "^a") | str_detect(words, "a$")]
```
>Find all words that start with a vowel and end with a consonant.
```{r}
str_subset(words, "^[aeiou].*[^aeiou]$")
words[str_detect(words, "^[aeiou]") & str_detect(words, "[^aeiou]$")]
```
>Are there any words that contain at least one of each different vowel?
There isn't a simple way to do this with a single regex -- you'd need lookaheads, where each `(?=.*a)` asserts that the vowel appears somewhere in the word:
```{r}
str_subset(words, "(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)")
```
```{r}
words[str_detect(words, "a") &
        str_detect(words, "e") &
        str_detect(words, "i") &
        str_detect(words, "o") &
        str_detect(words, "u")]
```
Apparently there are no such words in this dataset... but examples would include "facetious", "abstemious", and "sequoia".
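As a quick check (my own addition, not part of the dataset), the lookahead pattern does find such words if we append them:

```{r}
str_subset(c(words, "facetious", "abstemious", "sequoia"), "(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)")
```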
>2. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
```{r}
vowel_counts <- str_count(words, "[aeiou]")
# counting matches of "" counts characters (stringr treats an empty pattern as
# boundary("character")); str_length(words) is the more idiomatic denominator
vowel_props <- vowel_counts / str_count(words, "")
words[vowel_counts == max(vowel_counts)]
words[vowel_props == max(vowel_props)]
```
Alternate method (probably more likely to be used with "real life" data):
```{r}
words_df <- words %>%
  as_tibble_col(column_name = "word") %>%
  mutate(vowel_count = str_count(word, "[aeiou]"),
         vowel_prop = vowel_count / str_count(word, ""))

words_df %>%
  slice_max(vowel_count)

words_df %>%
  slice_max(vowel_prop)
```
### 14\.4\.2\.1 Extract matches
>1. In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.
You only want to match "red" as a complete word (i.e. not words like "flickeRED" or "REDuce"), so you need to use \b to indicate word boundaries around the pattern:
```{r}
colours2 <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match2 <- str_c("\\b(", str_c(colours2, collapse = "|"), ")\\b")
sentences %>%
  str_subset(colour_match2) %>%
  str_match_all(colour_match2)
```
>2. From the Harvard sentences data, extract:
>The first word from each sentence.
Regex: one or more letters, or one or more "word characters" (letter, digit, or underscore)
```{r}
# match one or more letters - NOTE that this doesn't catch words with apostrophes, like "It's"
sentences %>%
  str_extract("[:alpha:]+")

# match one or more "word characters" - also doesn't match the apostrophe
sentences %>%
  str_extract("\\w+")

# use the stringr shortcut - does match the apostrophe
sentences %>%
  str_extract(boundary("word"))
```
>All words ending in ing.
The pattern "ing$" finds words ending in -ing in the `words` dataset, but `$` anchors to the end of the whole string, so it won't find words inside `sentences` -- there we need word boundaries instead.
```{r}
words %>%
  str_subset("ing$")

# match one or more letters, followed by "ing", surrounded by word boundaries
# first get the whole sentences containing that pattern
# then extract the words
sentences %>%
  str_subset("\\b[:alpha:]+ing\\b") %>%
  str_extract_all("\\b[:alpha:]+ing\\b", simplify = TRUE)
```
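If you'd rather have a plain character vector than the `""`-padded matrix that `simplify = TRUE` produces, one option is to `unlist()` the list output:

```{r}
# str_extract_all() returns a list with one character vector per sentence
sentences %>%
  str_extract_all("\\b[:alpha:]+ing\\b") %>%
  unlist()
```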
>All plurals.
Start by matching words ending with "s" - but you need to account for exceptions like "is", "was", etc. Requiring at least four letters before the final "s" filters out some of those, but it isn't a complete solution (it still matches non-plurals like the verb "makes" and words ending in "ss" like "across", and misses irregular plurals). You should use text mining/NLP packages if you really need to do a rigorous grammatical analysis.
```{r}
sentences %>%
  str_subset("\\b[:alpha:]{4,}s\\b") %>%
  str_extract_all("\\b[:alpha:]{4,}s\\b", simplify = TRUE)
```
### 14.4.3.1 Grouped matches
>1. Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
For simplicity, I stuck to numbers 1 through 10.
```{r}
# start with word boundary (so you don't match words like "tONE")
num_match <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"
sentences %>%
  str_subset(num_match) %>%
  str_extract(num_match)
```
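To actually pull out the number and the word as separate pieces, `str_match()` returns the full match plus each parenthesized group as its own matrix column:

```{r}
# column 1 = whole match, column 2 = the number, column 3 = the following word
sentences %>%
  str_subset(num_match) %>%
  str_match(num_match)
```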
>2. Find all contractions. Separate out the pieces before and after the apostrophe.
Use parentheses to designate groups, and `str_match_all()` to extract those groups:
```{r}
matches <- sentences %>%
  str_subset("'") %>%
  str_match_all("(\\w+)'(\\w+)")
```
`str_match_all` returns a list of matrices. It's a bit complicated to convert to a tibble, but here is one method:
```{r}
matches %>%
  # enframe() converts the list to a tibble with a nested column called "value"
  # each element of "value" is one match: a one-row character matrix, which behaves like a character vector below
  # the "name" column holds each list element's name, which in this case is just its position number
  enframe() %>%
  # use purrr::map() with setNames() to name the elements of the vector in each row
  mutate(value = map(value, setNames, c("full_word", "part1", "part2"))) %>%
  # unnest_wider() splits each vector in the "value" column into 3 columns, named as set above
  unnest_wider(value) %>%
  # remove the "name" column
  select(-name)
```
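For comparison, here's a sketch using `tidyr::extract()` (loaded with the tidyverse), which applies the same grouped regex and spreads the groups straight into columns; note it keeps only the first contraction in each sentence:

```{r}
tibble(sentence = sentences) %>%
  # extract() pulls each regex group into its own column; rows with no match get NA
  extract(sentence, into = c("part1", "part2"), regex = "(\\w+)'(\\w+)", remove = FALSE) %>%
  drop_na(part1)
```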
### 14\.4\.4\.1 Replacing matches
>1. Replace all forward slashes in a string with backslashes.
Remember you need to use `writeLines()` to see the printed string without the escaping slashes.
```{r}
test_string <- "for/ward//slash/"
writeLines(test_string)

# the replacement "\\\\" is the two-character string \\ ; in a replacement
# string, \\ stands for one literal backslash
str_replace_all(test_string, "/", "\\\\") %>%
  writeLines()
```
>2. Implement a simple version of str_to_lower() using replace_all().
`str_to_lower()` makes all of the letters in a string lower-case.
The long way to do this would be to create a named vector of replacements (I only did "A" and "R", due to laziness):
```{r}
test_string <- "cApItAl lEtTeRs"
replacements <- c("A" = "a",
                  "R" = "r")
str_replace_all(test_string, replacements)
```
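A sketch of the full version, using `purrr::set_names()` to build the whole 26-letter lookup from the built-in `LETTERS`/`letters` vectors (unlike the real, locale-aware `str_to_lower()`, this handles ASCII letters only):

```{r}
# named vector: c(A = "a", B = "b", ..., Z = "z")
replacements_all <- set_names(letters, LETTERS)
str_replace_all(test_string, replacements_all)
```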
>3. Switch the first and last letters in `words`. Which of those strings are still words?
```{r}
replaced_words <- words %>%
  str_replace("^(.)(.*)(.)$", "\\3\\2\\1")
# this search pattern also works: "(^.)(.*)(.$)"

# keep only the swapped strings that are themselves in `words`
# (matching words against their own swapped form would only find words
# whose first and last letters are the same, missing pairs like read/dear)
intersect(replaced_words, words)
```
### 14\.4\.5\.1 Splitting
>1. Split up a string like "apples, pears, and bananas" into individual components.
```{r}
test_string <- "apples, pears, and bananas"
# "(and )?" optionally consumes the "and " so "bananas" doesn't keep a leading space
str_split(test_string, ", (and )?")
```
>2. Why is it better to split up by boundary("word") than " "?
`boundary("word")` handles punctuation, like commas or periods.
```{r}
test_string <- "Apples, pears, and bananas."
str_split(test_string, " ")
str_split(test_string, boundary("word"))
```
>3. What does splitting with an empty string ("") do? Experiment, and then read the documentation.
"" is equivalent to `boundary("character")`
```{r}
test_string <- "Apples, pears, and bananas."
str_split(test_string, "")
```
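You can confirm the equivalence:

```{r}
str_split(test_string, boundary("character"))
```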
### 14.5.1 Other types of pattern
>1. How would you find all strings containing `\` with `regex()` vs. with `fixed()`?
```{r}
test_string <- "this\\sentence\\has\\slashes"
writeLines(test_string)
# with usual regex()
str_view_all(test_string, "\\\\")
# with fixed()
str_view_all(test_string, fixed("\\"))
```
>2. What are the five most common words in sentences?
```{r}
sentences %>%
  str_extract_all(boundary("word")) %>%
  unlist() %>%
  as_tibble_col(column_name = "word") %>%
  mutate(word = tolower(word)) %>%
  # remember count() = group_by() %>% summarize(n = n())
  count(word, sort = TRUE) %>%
  slice(1:5)
```
If you're working with an actual piece of text, I would use the {tidytext} text mining package instead.
More info about {tidytext}, which is very fun to play with: https://www.tidytextmining.com/index.html
```{r}
# install.packages("tidytext")
library(tidytext)
sentences %>%
  as_tibble_col(column_name = "sentence") %>%
  # tolower by default
  unnest_tokens(output = word,
                input = sentence) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
```
### 14.7.1 stringi
>1. Find the stringi functions that:
The base {stringi} functions are listed in the help for each {stringr} function (or you can just Google).
To get a list of {stringr} functions, start typing "str_" into the console or a code chunk.
>Count the number of words.
Look at the help for `str_count()` -> `stringi::stri_count()`; to count words specifically, use `stringi::stri_count_words()`.
>Find duplicated strings.
`stri_duplicated()` finds duplicated strings. (It has no direct {stringr} equivalent -- don't confuse it with `str_dup()`/`stri_dup()`, which *repeat* a string.)
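A quick demo of `stri_duplicated()` (it flags each element that has already appeared earlier in the vector):

```{r}
stringi::stri_duplicated(c("a", "b", "a", "c", "b", "a"))
```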
>Generate random text.
This function has no {stringr} equivalent either, so it isn't attached when you load {tidyverse} -- call it via `stringi::` (or load {stringi} directly).
```{r}
# 4 random strings, each 5 characters long
stringi::stri_rand_strings(4, 5)
```
>2. How do you control the language that stri_sort() uses for sorting?
Use the `locale` argument of `stri_opts_collator()` (per the help, you can use the `stri_opts_collator` arguments directly in `stri_sort`).
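For example, a small sketch (the classic Slovak example, where "ch" sorts after "h"; arguments like `locale` are passed through to `stri_opts_collator()`):

```{r}
stringi::stri_sort(c("hladny", "chladny"), locale = "sk_SK")
stringi::stri_sort(c("hladny", "chladny"), locale = "en_US")
```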