Pierrette Lo 12/4/2020
Chapter 13.7 talks about “set operations” - these work with entire rows, comparing the values of every variable, so they add or remove whole rows of your starting table (vs. “joins”, which match on key columns and add columns).
The {dplyr} cheatsheet provides a helpful visualization.
- Ch. 14
library(tidyverse)
- In code that doesn’t use {stringr}, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?
The paste functions are the base R equivalents of str_c. I usually use paste0 just because I learned it first.
- paste: lets you specify a separator with sep (default is a space)
- paste0: like paste, but always squashes strings together with no separator (same as paste with sep = ""), which matches the default behaviour of str_c
Handling of missing values:
- The paste functions coerce NA to the string “NA”
- The str_c function leaves missing values as NA, so any combined element that involves an NA is itself NA. You can use str_replace_na to convert NA to the string “NA” first
e.g.:
string1 <- c("A", "B", NA)
string2 <- c(1, 2, 3)
paste(string1, string2)
## [1] "A 1" "B 2" "NA 3"
paste0(string1, string2)
## [1] "A1" "B2" "NA3"
str_c(string1, string2)
## [1] "A1" "B2" NA
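To illustrate the str_replace_na option mentioned above, here is a minimal sketch reusing the same example vectors (the name string1_filled is made up for this example). Converting the NA first should make str_c behave like the paste functions:

```r
library(stringr)

string1 <- c("A", "B", NA)
string2 <- c(1, 2, 3)

# replace the missing value with the literal string "NA" before combining
string1_filled <- str_replace_na(string1)

# default sep = "" mirrors paste0()
str_c(string1_filled, string2)
## [1] "A1" "B2" "NA3"

# sep = " " mirrors paste()
str_c(string1_filled, string2, sep = " ")
## [1] "A 1" "B 2" "NA 3"
```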
- In your own words, describe the difference between the sep and collapse arguments to str_c().
If you’re combining two vectors using str_c:
- sep specifies the separator inserted between the corresponding elements of the input vectors (the result is still a vector)
- collapse specifies the separator used if you’re squashing all of the elements of that result into a single string
e.g.:
string3 <- c("A", "B", "C")
string4 <- c(1, 2, 3)
str_c(string3, string4, sep = "-")
## [1] "A-1" "B-2" "C-3"
str_c(string3, string4, sep = "-", collapse = "|")
## [1] "A-1|B-2|C-3"
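It can also help to see each argument on its own with a single vector. This is just a small sketch using the same string3 as above:

```r
library(stringr)

string3 <- c("A", "B", "C")

# collapse alone: one vector in, one string out
str_c(string3, collapse = "|")
## [1] "A|B|C"

# sep alone with a single vector has nothing to separate,
# so the vector comes back unchanged
str_c(string3, sep = "-")
## [1] "A" "B" "C"
```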
- Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
If odd number of characters:
string5 <- "abcdefg"
# use str_length() to get the length of string5
# use seq() to generate a sequence of numbers from 1 to length of string5
# use median() to get the middle position of that sequence (i.e. 4)
middle <- median(seq(str_length(string5)))
str_sub(string5,
start = middle,
end = middle)
## [1] "d"
If even number of characters:
Note that if you pass a non-integer number to str_sub, it is converted to an integer by truncating, so a positive value is always rounded down (e.g. as.integer(3.8) returns 3).
For a string with 6 characters, our code above returns a median of 3.5, and by default str_sub will convert that to 3. If you want to follow normal rounding rules, use round().
Note: R, like many programming languages, follows the “round half to even” rule, meaning that x.5 will be rounded to the nearest even number. So even.5 numbers will be rounded down, and odd.5 numbers will be rounded up. This is meant to prevent the bias that would be introduced in large datasets if .5 were always rounded up.
round(2.5)
## [1] 2
round(3.5)
## [1] 4
string6 <- "abcdef"
middle <- median(seq(str_length(string6)))
str_sub(string6,
start = middle,
end = middle)
## [1] "c"
str_sub(string6,
start = round(middle),
end = round(middle))
## [1] "d"
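Another option for even-length strings is to return both middle characters instead of picking one. Here is a rough sketch of that idea; middle_chars is a hypothetical helper name, not something from the book or from {stringr}:

```r
library(stringr)

# hypothetical helper: returns the middle character for odd-length strings
# and the two middle characters for even-length strings
middle_chars <- function(x) {
  n <- str_length(x)
  # odd n:  ceiling(n / 2) and floor(n / 2) + 1 are the same position
  # even n: they are the two positions on either side of the midpoint
  str_sub(x, start = ceiling(n / 2), end = floor(n / 2) + 1)
}

middle_chars("abcdefg")
## [1] "d"
middle_chars("abcdef")
## [1] "cd"
```

Using ceiling() and floor() explicitly avoids relying on how str_sub or round() treat a value like 3.5.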
- What does str_wrap() do? When might you want to use it?
This function inserts line breaks into text so that each line fits within a given width. It could be useful if you were trying to fit a block of text into a table, for example.
string7 <- "This chapter introduces you to string manipulation in R. You’ll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions, or regexps for short. Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings. When you first look at a regexp, you’ll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense."
writeLines(string7)
## This chapter introduces you to string manipulation in R. You’ll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions, or regexps for short. Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings. When you first look at a regexp, you’ll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.
# can use writeLines() or cat() here
cat(str_wrap(string7, width = 10, exdent = 3))
## This
## chapter
## introduces
## you to
## string
## manipulation
## in R.
## You’ll
## learn the
## basics
## of how
## strings
## work and
## how to
## create
## them by
## hand, but
## the focus
## of this
## chapter
## will be
## on regular
## expressions,
## or regexps
## for short.
## Regular
## expressions
## are useful
## because
## strings
## usually
## contain
## unstructured
## or semi-
## structured
## data, and
## regexps
## are a
## concise
## language
## for
## describing
## patterns
## in
## strings.
## When you
## first
## look at
## a regexp,
## you’ll
## think a
## cat walked
## across
## your
## keyboard,
## but as
## your
## understanding
## improves
## they will
## soon start
## to make
## sense.
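To see what str_wrap actually returns, here is a small sketch with a shorter sentence and a more realistic width. The sentence and the names sentence and wrapped are made up for this example:

```r
library(stringr)

# hypothetical example sentence, shortened for illustration
sentence <- "Regular expressions are a concise language for describing patterns in strings."

# str_wrap() returns a single string with newline characters inserted
wrapped <- str_wrap(sentence, width = 30)

# writeLines() (or cat()) prints the line breaks instead of showing "\n"
writeLines(wrapped)

# split on the newlines to check the length of each wrapped line
str_length(str_split(wrapped, "\n")[[1]])
```

The point is that the wrapping is stored in the string itself, so you need writeLines() or cat() to see it laid out.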
- What does str_trim() do? What’s the opposite of str_trim()?
str_trim removes whitespace from the start and end of a string (often useful to apply to an entire column when cleaning up a dataset).
str_pad does the opposite: it adds whitespace (or another padding character) until the string reaches a given width.
e.g.:
string8 <- " abcde "
str_trim(string8, side = "left")
## [1] "abcde "
string9 <- "abcde"
str_pad(string9, width = 8, pad = "#")
## [1] "###abcde"
str_squish trims whitespace from both ends and also collapses repeated (i.e. >1) interior spaces into a single space
str_squish(" a b cde")
## [1] "a b cde"
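Both functions take a side argument. As a quick sketch of the defaults and the other options (the input strings here are just for illustration):

```r
library(stringr)

# str_trim() defaults to side = "both"
str_trim("  abcde  ")
## [1] "abcde"

# str_pad() defaults to side = "left"; other sides can be requested
str_pad("abcde", width = 8, side = "right", pad = "#")
## [1] "abcde###"
str_pad("abcde", width = 9, side = "both", pad = "#")
## [1] "##abcde##"

# padding with spaces and then trimming gets you back where you started
str_trim(str_pad("abcde", width = 9, side = "both"))
## [1] "abcde"
```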
- Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.
General format of a function:
function_name <- function(arguments) {
  body
}
First figure out the steps manually for a simple test case:
test_output <- str_c(c("a", "b", "c"), collapse = ", ")
str_sub(test_output, -2, -2) <- " and "
cat(test_output)
## a, b, and c
Then figure out the input that you would want to change each time, and make it into a function:
test_function <- function(input) {
  test_output <- str_c(input, collapse = ", ")
  str_sub(test_output, -2, -2) <- " and "
  cat(test_output)
}
Test it with different cases to see what happens with exceptions:
test_function(c("b"))
## and b
Example exception: If you have a vector of length 2, you don’t want to keep the comma:
test_output <- str_c(c("a", "b"), collapse = ", ")
str_sub(test_output, -3, -2) <- " and "
cat(test_output)
## a and b
Now add an if/else statement to your function to handle the special cases:
make_string <- function(input) {
  output <- str_c(input, collapse = ", ")
  if (length(input) >= 3) {
    str_sub(output, -2, -2) <- " and "
  } else if (length(input) == 2) {
    str_sub(output, -3, -2) <- " and "
  }
  cat(output)
}
Now test it with different inputs:
make_string(c("a", "b", "c"))
## a, b, and c
make_string(c("a", "b"))
## a and b
make_string("a")
## a
make_string("")
make_string(c("a", "b", "c", "d"))
## a, b, c, and d
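Note that "" is a vector of length 1 (containing an empty string); a true length-0 input would be character(0). Below is an alternative sketch that handles each length explicitly and returns the string instead of printing it with cat(), so the result can be stored or tested. The name make_string_value is hypothetical, and returning "" for a length-0 input is my own assumption about what is sensible there:

```r
library(stringr)

# hypothetical variant of make_string() that returns its result
make_string_value <- function(input) {
  n <- length(input)
  if (n == 0) {
    # assumption: an empty input gives an empty string
    return("")
  }
  if (n == 1) {
    return(input)
  }
  if (n == 2) {
    return(str_c(input, collapse = " and "))
  }
  # 3 or more: join all but the last with ", ", then append ", and <last>"
  str_c(str_c(input[-n], collapse = ", "), ", and ", input[n])
}

make_string_value(c("a", "b", "c"))
## [1] "a, b, and c"
make_string_value(c("a", "b"))
## [1] "a and b"
make_string_value("a")
## [1] "a"
make_string_value(character(0))
## [1] ""
```

Building the string from its pieces avoids having to count character positions for str_sub, at the cost of a few more branches.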