---
title: "Text Methods"
engine: knitr
---
## Overview
Data stored as text (i.e., [string]{.py}/[object]{.py} or [character]{.r}) is notoriously typo-prone and often requires extensive quality control checks throughout the data tidying process. Below is a non-exhaustive set of common text methods that may prove valuable to people working with text data in either [{{< fa brands python >}} Python]{.py} or [{{< fa brands r-project >}} R]{.r}.
## Find & Replace
The bulk of text tidying often boils down to finding unwanted [strings]{.py}/[characters]{.r} and replacing them with desired variants. Find and replace is the scripted equivalent of fixing typos by hand.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
We can use the [{{< fa brands r-project >}} R]{.r} function `gsub` to find and replace (part of) a character object.
```{r r-find-replace}
# Make a vector of characters
text_r <- c("he1lo", "HELLO", "bye")
# Find and replace the number "1" with a lowercase "l"
fix1_r <- gsub(pattern = "1", replacement = "l", x = text_r)
# Print the result
print(fix1_r)
```
## [{{< fa brands python >}} Python]{.py}
We can use the [{{< fa brands python >}} Python]{.py} method `replace` to find and replace (part of) a string/object variable.
```{python py-find-replace}
# Make a list of strings
text_py = ["he1lo", "HELLO", "bye"]
# Find and replace the number "1" with a lowercase "l"
fix1_py = [item.replace("1", "l") for item in text_py]
# Print the result
print(fix1_py)
```
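Note that `replace` is a method of a single string, so in [{{< fa brands python >}} Python]{.py} we have to loop across our list (here with a list comprehension) to apply it to every item.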
:::
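Real data often contain more than one kind of typo. The find-and-replace step above can be repeated for each one, as in the minimal [{{< fa brands python >}} Python]{.py} sketch below. The `typo_fixes` dictionary, the `fix_typos` helper, and the example strings are hypothetical names and values chosen for illustration; only the built-in `replace` method comes from the text above.

```python
# Hypothetical lookup of known typos and their fixes (an assumption for illustration)
typo_fixes = {"1": "l", "0": "o"}

# Example strings containing more than one kind of typo
messy = ["he1lo", "w0rld", "bye"]


def fix_typos(item, fixes):
    """Apply every find-and-replace pair in `fixes` to a single string."""
    for wrong, right in fixes.items():
        item = item.replace(wrong, right)
    return item


# Loop across the list, as before, applying all fixes to each item
cleaned = [fix_typos(item, typo_fixes) for item in messy]

# Print the result
print(cleaned)
```

Keeping the typo/fix pairs in one dictionary makes it easy to add new fixes without touching the loop itself.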
## Casing
Text casing (i.e., either UPPERCASE or lowercase) is also a frequent source of issues in code because many scripted operations are case-sensitive. Fortunately, we can coerce text to uppercase or lowercase with relative ease.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
We can use the `tolower` or `toupper` functions to coerce all text into lowercase or uppercase, respectively.
```{r r-case}
# Coerce everything to lowercase
fix2_r <- tolower(x = fix1_r)
# Print the result
print(fix2_r)
```
## [{{< fa brands python >}} Python]{.py}
We can use the `lower` or `upper` methods to coerce all text into lowercase or uppercase, respectively.
```{python py-case}
# Coerce everything to lowercase
fix2_py = [item.lower() for item in fix1_py]
# Print the result
print(fix2_py)
```
:::
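To see why case sensitivity matters, the short [{{< fa brands python >}} Python]{.py} sketch below compares each item against `"hello"` before and after coercing to lowercase. The variable names (`raw`, `exact_matches`, `lower_matches`) are hypothetical and chosen only for this illustration.

```python
# Example strings that differ only in casing (plus one unrelated word)
raw = ["hello", "HELLO", "bye"]

# A case-sensitive comparison misses the uppercase variant
exact_matches = [item == "hello" for item in raw]

# Coercing to lowercase first lets both variants match
lower_matches = [item.lower() == "hello" for item in raw]

# Print both results
print(exact_matches)  # [True, False, False]
print(lower_matches)  # [True, True, False]
```

Standardizing case before comparing, filtering, or joining on text is one way to avoid silently missing such matches.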