---
title: 'Worksheet 5: Text Analysis'
author: ''
date: ''
---
_This is the fifth in a series of worksheets for History 8500 at Clemson University. The goal of these worksheets is simple: practice, practice, practice. The worksheet introduces concepts and techniques and includes prompts for you to practice in this interactive document. When you are finished, you should change the author name (above), render your document to a PDF, and upload it to Canvas. Don't forget to commit your changes as you go and push to GitHub when you finish the worksheet._
Text analysis is an umbrella term for a number of different methodologies. Generally speaking, it involves taking a set (or corpus) of textual sources, turning them into data that a computer can understand, and then running calculations and algorithms on that data. At its most basic level, that typically involves counting words.
**Text analysis can be broken down into 4 general steps:**
1. Acquiring a corpus
2. Preparing the text or Pre-processing
3. Choosing an analytical tool
* (There are many different tools or methods for text analysis. Take a minute and Google each of these methodologies: tf-idf, topic modeling, sentiment analysis, word vector analysis, n-grams)
4. Analyzing the results
In this worksheet we are focusing on basic text analysis. We'll learn how to load textual data into R, how to prepare it, and then how to analyze it using tf-idf, or term frequency-inverse document frequency.
Before doing too much, let's load a few relevant libraries. You will likely need to install the last few.
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(tidytext)
library(readtext)
library(widyr)
library(SnowballC)
```
## Acquiring a Corpus
First, let's install the State of the Union package. This package contains the text of all the State of the Union addresses from Washington to Trump. Run `install.packages()` to install the `sotu` package.
```{r}
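# install.packages("sotu")  # run this once if the sotu package is not yet installed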
library(sotu)
```
This package includes both the metadata about these speeches in `sotu_meta` and the texts themselves in `sotu_texts`. Let's first look at the metadata associated with this package.
```{r}
meta <- as.data.frame(sotu_meta)
head(meta)
```
This package also includes a function that will let us write all of the files to disk. This is a crucial but somewhat unusual step, because when conducting text analysis in the real world you will not have an R package filled with the data. Rather, you will have to organize the metadata and load the files yourself. Writing these files to disk lets us practice that step.
```{r}
file_paths <- sotu_dir(dir = "sotu_files")
head(file_paths)
```
This creates a new directory (`sotu_files`) and writes each State of the Union address to it as a text file. Notice that each speech is its own .txt file containing just the text of the speech.
(@) Take a look at the directory in your files pane and open one of the documents.
Now let's load all these texts into R using the `readtext()` function. First, look up the documentation for this function and read about it.
```{r}
sotu_texts <- readtext(file_paths)
```
Take a look at `sotu_texts` now. Notice that we have two columns: one with a document id and one filled with the text.
```{r}
head(sotu_texts, n = 5)
```
Now our textual data is loaded into R, but the texts and the metadata are in two different data frames. Let's combine them. Note that this isn't the way I would typically recommend doing this, but it's a quirk of the SOTU data. Typically, when I create a metadata spreadsheet for a textual dataset, I include a column for the file name, which makes joining the textual data and the metadata easier. Here, we'll need to sort the metadata so that it is alphabetical and then bind the two together.
```{r}
sotu_whole <-
sotu_meta %>%
arrange(president) %>% # sort metadata
bind_cols(sotu_texts) %>% # combine with texts
as_tibble() # convert to tibble for better screen viewing
glimpse(sotu_whole)
```
Now our data is loaded into R and it's ready to be pre-processed.
## Pre-Processing
### Tokenizing
One of the most basic pre-processing techniques for textual data is to tokenize it. Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token. Tokens can be words, numbers, or punctuation marks, but for historians it's common to remove the numbers and punctuation as well. To do this we'll create a data frame where each row contains a single word, along with its metadata, as the unit of observation.
`tidytext` provides a function called `unnest_tokens()`. We can use it to convert our `sotu_whole` data frame into one that is tokenized. It takes three arguments:
* a tibble or data frame which contains the text
* the name of the newly created column that will contain the tokens
* the name of the column within the data frame which contains the text to be tokenized
```{r}
tidy_sotu <- sotu_whole %>%
unnest_tokens(word, text)
tidy_sotu
```
`unnest_tokens()` also did something else that is really important: it made everything lowercase and took out all punctuation. The function has options to keep those elements if you want them, but for our purposes we don't.
`unnest_tokens()` also has an argument called `token`. Tokenizing by word is the default, but you could also tokenize by characters, ngrams, lines, or sentences.
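For instance, here is a minimal sketch tokenizing by character (the column name `sotu_characters` is just an illustrative choice, and the chunk is set to `eval: false` because the result is very large):
```{r}
#| eval: false
sotu_whole %>%
  unnest_tokens(sotu_characters, text, token = "characters")
```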
(@) Use the documentation to tokenize the dataset into sentences:
```{r}
```
We've talked about n-grams loosely in class, but let's define them more formally. An n-gram is a contiguous sequence of n items from a given sample of text or speech. The n stands for the number of items. So, for example, a bi-gram is a set of two words.
For example, if I had the string "Nothing to fear but fear itself," the bi-grams would look like this:
Nothing to, to fear, fear but, but fear, fear itself
The tri-grams would look like this:
Nothing to fear, to fear but, fear but fear, but fear itself
We can use `unnest_tokens()` to create n-grams for us. To do that we just have to add an extra argument that defines n.
```{r}
sotu_bigrams <- sotu_whole %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
head(sotu_bigrams$bigram)
```
(@) Use `unnest_tokens()` to create tri-grams.
```{r}
```
### Stopwords
Another crucial component of text analysis is removing stopwords. Stopwords are words like "I," "he," "she," "of," and "the" that are extremely common and carry little meaning on their own. Because they are so common, they don't tell us much about the content of the text itself.
There are stopwords that come with the `tidytext` package.
```{r}
stop_words
```
This is just one example of a stopword list. You can find others, such as stopwords in other languages or [stopwords designed specifically for the 19th century.](https://www.matthewjockers.net/macroanalysisbook/expanded-stopwords-list/) It's also possible you may want to edit the list of stopwords to include some of your own. For example, if we wanted to add the word "America" to the stopwords list, we could use `add_row()` to do so:
```{r}
stop_words_custom <- stop_words %>% add_row(word="America", lexicon="NA")
```
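If you wanted to apply this custom list later, a minimal sketch (assuming the tokenized `tidy_sotu` created above):
```{r}
#| eval: false
# remove any token that appears in the custom stopword list
tidy_sotu %>%
  anti_join(stop_words_custom, by = "word")
```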
For now, let's just remove the default stopwords. The easiest way to do that here is an anti-join: we return all rows from our table of tokens, `tidy_sotu`, that have no matching values in our list of stopwords.
```{r}
tidy_sotu_words <- tidy_sotu %>%
anti_join(stop_words)
tidy_sotu_words
#another way to do this would be to filter by words NOT in the stop word list like this: filter(!word %in% stop_words$word)
```
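As the comment inside the chunk notes, an equivalent approach uses `filter()` rather than an anti-join. A minimal sketch:
```{r}
# keep only the rows whose word is NOT in the stopword list
tidy_sotu %>%
  filter(!word %in% stop_words$word)
```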
### Stemming
The third common kind of pre-processing is called word stemming. This process reduces a word to its root stem. So, for example: fishing becomes fish, fished becomes fish, fishes becomes fish. You can easily see how this might be useful for capturing all forms of a word.
`tidytext` doesn't have its own word stemming function. Instead, we have to rely on the functions provided by `hunspell` or `SnowballC`. I prefer `SnowballC`. You may need to install it before running the code below.
```{r}
library(SnowballC)
tidy_sotu_words %>%
mutate(word_stem = wordStem(word))
```
Now if you compare the `word` and `word_stem` columns you can see the effect that `wordStem()` had. Notice that it works well in cases like
citizens = citizen
but it does some odd things to words like "representatives." Whether this is useful for you will depend on the question you're asking (and the OCR accuracy), but it's a useful technique to be familiar with nevertheless.
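To see this on a handful of words directly, here is a quick sketch (the example words come from the description above; inspect the output to see the oddity mentioned for "representatives"):
```{r}
# stem a few example words and compare the output with the originals
wordStem(c("fishing", "fished", "fishes", "citizens", "representatives"))
```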
## Analysis
Let's reset our workspace and make sure our data frame is loaded with single tokenized words, filtered by our stopword list. Go ahead and clear your environment using the broom button and then run the code below. This code is simply everything we've run up to this point.
```{r}
meta <- as.data.frame(sotu_meta)
file_paths <- sotu_dir(dir = "sotu_files")
sotu_texts <- readtext(file_paths)
sotu_whole <-
sotu_meta %>%
arrange(president) %>% # sort metadata
bind_cols(sotu_texts) %>% # combine with texts
as_tibble()
tidy_sotu <- sotu_whole %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
```
(@) Before we move forward, take a minute and describe the chunk of code you just ran. What does each section do, and how does it reflect the workflow for a text analysis project? What are the important steps that are unique to working with textual data?
>
The most basic kind of analysis we might be interested in doing is counting words. We can do that easily using the `count()` function:
```{r}
tidy_sotu %>%
count(word, sort = TRUE)
```
Now we know that the most used word in State of the Union speeches is "government." But what if we wanted to look at when presidents use the word "war" versus the word "peace"?
```{r}
tidy_sotu %>%
filter(word %in% c("war", "peace")) %>%
count(year, word)
```
This data frame is too big to understand quickly without visualizing it. We can create a bar chart to better understand it:
```{r}
library(ggplot2)
tidy_sotu %>%
filter(word %in% c("war", "peace")) %>%
count(year, word) %>%
ggplot(aes(year, n, fill = word)) +
geom_col(position = "fill")
```
We also might want to ask about the average length of each president's State of the Union addresses. Who had the longest speeches and who had the shortest?
```{r}
tidy_sotu %>%
count(president, doc_id) %>%
group_by(president) %>%
summarize(avg_words = mean(n)) %>%
arrange(desc(avg_words))
```
(@) Think back to the metadata that we loaded about these speeches. Why are more modern presidents' State of the Union addresses shorter?
>
(@) Filter the dataset to address this discrepancy and then recreate these statistics:
```{r}
```
### Term Frequency
Often, the raw frequency of a term is not as useful as its relative frequency: how often that word appears relative to the total number of words in a text. This ratio is called **term frequency**.
You can calculate the term frequency by dividing the total occurrences of a word by the total number of words. Typically you want to do this per document.
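In symbols (a standard formulation; the notation here is introduced just for convenience):
$$
\text{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}
$$
where $n_{t,d}$ is the number of times term $t$ appears in document $d$ and the denominator is the total number of words in that document.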
Here's an easy way to calculate it:
```{r}
tidy_sotu_rel.freq <- tidy_sotu %>%
  count(doc_id, word, sort = TRUE) %>% # count occurrences of each word and sort descending
  group_by(doc_id) %>%
  mutate(n_tot = sum(n),          # total number of words per document
         term_freq = n / n_tot)   # term frequency
```
We can assume that words with a high frequency in a text are more important or significant. Here we can find the word with the highest term frequency for each president:
```{r}
tidy_sotu %>%
count(president, word) %>% # count n for each word
group_by(president) %>%
mutate(n_tot = sum(n), # count total number of words per doc
term_freq = n/n_tot) %>%
arrange(desc(term_freq)) %>% # sort by term frequency
top_n(1) %>% # take the top for each president
print(n = Inf) # print all rows
```
(@) The code above is commented to help you follow it. Walk through it and explain what each line does in your own words. If a function is unfamiliar, look up its documentation.
>
### TF-IDF
The above measures the frequency of terms within individual documents. But what if we want to know which words seem more important based on the contents of the **entire** corpus? That is where tf-idf, or term frequency-inverse document frequency, comes in.
Tf-idf measures how important a word is within a corpus by scaling term frequency per document according to the inverse of the term’s document frequency (number of documents within the corpus in which the term appears divided by the number of documents). The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
Don't worry too much about how tf-idf is calculated. But if you feel like you are a bit lost and want to understand the specifics - I recommend reading the [tf-idf wikipedia page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and this blog post from [_Learn Data Science_](https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/).
We'll calculate tf-idf in the next code chunk, but let's talk for a second about what that number will represent. It will be:
* lower for words that appear frequently in many documents of the corpus, and lowest when the word occurs in virtually all documents.
* higher for words that appear frequently in just a few documents of the corpus, thus lending high discriminatory power to those few documents.
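For reference, the most common formulation (the exact weighting scheme can vary by implementation) is:
$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \ln\!\left(\frac{N}{n_t}\right)
$$
where $N$ is the number of documents in the corpus and $n_t$ is the number of documents that contain term $t$.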
Luckily, `tidytext` provides a function for calculating tf-idf. The function needs a list of every word in every document along with its count, like this:
```{r}
tidy_sotu %>%
count(doc_id, word, sort = TRUE)
```
We can feed that to the function and get the tf-idf:
```{r}
sotu.tf.idf <- tidy_sotu %>%
count(doc_id, word, sort = TRUE) %>%
bind_tf_idf(word, doc_id, n)
head(sotu.tf.idf)
```
The resulting data frame adds three columns: term frequency (tf), inverse document frequency (idf), and tf-idf (tf_idf).
Let's look at which words have the highest tf-idf scores.
```{r}
sotu.tf.idf %>% arrange(desc(tf_idf))
```
(@) Pick a president who served more than one term. Filter the dataset and generate both raw word counts and tf-idf scores for that president. What words are most significant in each method? Why, and what does that tell you about that president?
```{r}
```
### Co-Occurrence
Co-occurrence gives us a sense of words that appear in the same text, but not necessarily next to each other. It shows words that are likely to co-occur. Note that this is different from topic modeling, which we'll discuss next week.
For this section we will make use of the `widyr` package. The function that helps us do this is `pairwise_count()`. It lets us count common pairs of words co-appearing within the same speech. This might take a moment to run, as the resulting data frame will be quite large.
```{r}
sotu_word_pairs <- sotu_whole %>%
  mutate(speech_end = word(text, -5000, end = -1)) %>% # extract the last 5000 words of each speech
  unnest_tokens(word, speech_end) %>% # tokenize
  filter(!word %in% stop_words$word) %>% # remove stopwords
  pairwise_count(word, doc_id, sort = TRUE, upper = FALSE) # don't include the upper triangle of the matrix
head(sotu_word_pairs)
```
Now we have a list of word pairs that co-occur in the same speech, along with how often they appear together. Once again this dataset is far too large to look at in a data frame. Instead, we'll create a network graph that shows the relationships between words for any word pairs that co-occur 200 or more times. I chose 200 after looking at the dataset above and seeing that the highest count was 239. You want the network graph to be manageable and not too large.
```{r}
library(igraph)
library(ggraph)
sotu_word_pairs %>%
filter(n >= 200) %>% # only word pairs that occur 200 or more times
graph_from_data_frame() %>% #convert to graph
ggraph(layout = "fr") + # place nodes according to the force-directed algorithm of Fruchterman and Reingold
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "tomato") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
```
(@) Create a network graph that shows the relationships between word pairs that co-occur between 125 and 175 times.
```{r}
```
## Analyzing Historical Newspapers About Buffalo Bill
To practice text analysis this week you will be helping Dr. Seefeldt with his research. In the GitHub repository linked below I have included a metadata file containing the associated info for each article and a txt folder with each text file.
The dataset is described by Dr. Seefeldt this way:
> The project staff identified 99 visits to Indiana communities, some of which included performances spanning two or more days. We located and captured a total of 720 Indiana newspaper pieces, including 273 articles promoting specific Wild West performances, 74 substantive pieces documenting locally-based reception to these performances, and 373 extraneous items, including advertisements, images, and reprints of articles generated elsewhere. In total, the team transcribed approximately 740 items (R = Reception; P = Promotion; E = Extraneous)
Use this link to read more about the project and the historical context: [https://docs.google.com/document/d/1LnmyCob5zc984EkMAx3sXlXQqhNbm9O8XV4O3HrQUbQ/edit?usp=sharing](https://docs.google.com/document/d/1LnmyCob5zc984EkMAx3sXlXQqhNbm9O8XV4O3HrQUbQ/edit?usp=sharing)
You should use the rest of this worksheet to practice text analysis (and R more broadly) by exploring this dataset. Using the metadata provided and the raw text files, can you use what you learned above to analyze these texts? What historical conclusions might you be able to draw? Perform the textual analysis and then write up your findings for Dr. Seefeldt. What might he be able to draw from this analysis? I will then take your work, compile it, and share it with him. Here are three research questions to get you started. You can use one of these or come up with your own.
* Is there a difference between promotion language and reception language? In other words, what was BBWW selling, and what did Indiana newspapers find remarkable?
* Can we detect change over time in promotion language and reception language (marketing and reviewing)? Were there types of characters, scenarios, or action promised in promotional material and/or noted in reviews earlier versus later?
* What topics can be gleaned from the items tagged as extraneous? These are news items that are somehow related to BBWW: crime, finances, celebrity, etc.
Write up a research plan in the space below. What are you looking to explore?
>
Let's download the data.
```{r}
#| eval: false
#zip file of all the .txt files. One for each issue.
download.file("https://github.com/dseefeldt/IndianaNewspapers/raw/main/bb-txt.zip", "bb-txt.zip")
unzip("bb-txt.zip")
# Metadata that includes info about each article.
bb.metadata <- read.csv("https://raw.githubusercontent.com/dseefeldt/IndianaNewspapers/main/metadata.csv")
```
(@) There are a few problems with this metadata from a tidy data perspective. First, the date column isn't useful for assessing change over time. Can you split it into three columns, one each for year, month, and day? This will be useful if you want to try to identify change over time. Use the Date column, not the FullDate column. The Date column is messy, and you'll need to clean it up in order to have useful data.
```{r}
```
(@) Now let's load the BB data. This happens in a slightly different way than with the SOTU data since it's local data rather than data from an R package.
```{r}
data_dir <- paste(getwd(), "/txt", sep = "")
bb <- readtext(paste0(data_dir, "/*.txt"))
```
(@) Now you need to join this with the metadata for each file. Note that doc_id and Filename are the same value. How would you join two datasets together in R so that you can proceed with textual analysis?
```{r}
```
(@) Now you can begin your textual analysis for Dr. S. Add code chunks below and intersperse text in the worksheet to explain what you are doing and why.
```{r}
```
(@) Finally, when you are all done, write up your findings. What research question did you focus on, and what did you learn?