# CCHC notebook
This report is a part of the extension of the [*America's Public Bible*](https://americaspublicbible.org) project.

The purpose of this report is to demonstrate the viability of the initial results of applying machine-learning models to Library of Congress full-text digital collections. Full interpretative findings, which are outside the specific scope of the project, will subsequently be published on the *America's Public Bible* website or in other suitable publication venues.

## Creation of the datasets

The two datasets used in this report were created between November 2021 and January 2022 by applying machine-learning models to a subset of Library of Congress digital collections. This subset comprised (1) items that were part of collections identified by subject as pertaining to American history; (2) items that were marked in the library catalog as having full text available; and (3) items for which that full text could be readily identified and accessed.

Two datasets were created. One, generated by the `cchc-language-detector` service, seeks to identify multilingual items in the collections. The other, generated by the `cchc-predictor` service, seeks to identify biblical quotations in the English-language collections.
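In dplyr terms, that selection logic might look like the following sketch. The `items` table and its `subjects` column appear later in this notebook, but `has_full_text` and `full_text_url` are hypothetical column names, and in practice the subject criterion applies to collections rather than to individual items.

```{r}
# A rough sketch of the three selection criteria described above.
# `has_full_text` and `full_text_url` are hypothetical column names;
# the subject criterion is actually applied at the collection level.
eligible_items <- tbl(db, "items") |>
  filter(
    grepl("United States--History", subjects), # (1) American history by subject
    has_full_text,                             # (2) catalog marks full text available
    !is.na(full_text_url)                      # (3) full text can be accessed
  )
```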

## Language detection dataset


```{r}
lang_tbl <- tbl(db, in_schema("results", "languages")) |>
  select(-job_id)
```

```{r}
lang |>
  knitr::kable()
```

Some of the combinations are prima facie quite plausible. The first combination (English and Latin) and the third combination (English and French) make perfect sense. Other combinations, including English and Welsh (`CYM`), raise some questions. If, upon inspection of the items, Welsh is not being accurately detected, perhaps it should be eliminated from the language detector. Alternatively, the parameters for the filter described above may need to be more restrictive to avoid further false positives.
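Since the actual filter is defined in an earlier portion of the notebook, here is only a minimal sketch of what a more restrictive version might look like; the threshold values are illustrative assumptions, not the parameters used above.

```{r}
# A minimal sketch of a more restrictive language filter. The thresholds are
# illustrative assumptions; the actual filter_langs() used later in this
# notebook is defined elsewhere.
filter_langs_strict <- function(df, min_sentences = 20, min_percentage = 10) {
  df |>
    filter(sentences >= min_sentences, percentage >= min_percentage)
}
```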

While further analysis and inspection are necessary, my cautious conclusion is that these preliminary results validate the process of identifying potential multilingual documents computationally.

## Biblical quotations dataset

Running the biblical quotation identifier across even just a subset of the Library of Congress collections has produced 49.2 million possible combinations of verses, items, and versions (i.e., rows in the database table). This represents 232,290 unique combinations of references and items. (Keep in mind that items can comprise many pages or sub-items, and if a verse appears multiple times in an item it is counted only once.)
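Those two figures can be computed directly from the results table. Below is a sketch of the counting query, assuming the `results.biblical_quotations` table queried in the chunks later in this report.

```{r}
# Count total rows (verse/item/version combinations) and distinct
# reference/item pairs. Assumes the results.biblical_quotations table
# used in the chunks below.
dbGetQuery(db, "
  SELECT
    count(*) AS total_rows,
    count(DISTINCT (reference_id, item_id)) AS unique_combinations
  FROM
    results.biblical_quotations;
")
```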

I am continuing to work on improving the prediction model:

1. Ironically, the model may be overeager to identify quotations because of the better-quality OCR found in most digitized collections. The Chronicling America OCR can (understandably) be spotty in places; most digitized collections have better OCR, and so the model may overestimate the likelihood of some quotations.

2. Refining away the algorithm's obvious errors requires some hand correction. For instance, some verses are very likely to be false positives: they are not scripturally significant, but they contain very common words that appear in many other documents. (A sketch of this correction step follows this list.)

3. I am continuing to refine the CCHC software to extract full text from digitized items in a clean, plain text format.
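As a sketch of the hand correction described in point 2, a denylist of known false positives can be filtered out before analysis. Here `quotations` is a hypothetical stand-in for a data frame of predictions, and the listed verses are examples flagged as likely errors elsewhere in this report.

```{r}
# A sketch of hand correction via a denylist. `quotations` is a hypothetical
# data frame of predictions; these verses are flagged as likely false
# positives elsewhere in this report.
denylist <- c("Baruch 4:17", "1 Esdras 8:80", "Tobit 10:9")
quotations_cleaned <- quotations |>
  filter(!(reference_id %in% denylist))
```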

Nevertheless, as hoped, the model is clearly capable of identifying quotations across collections besides Chronicling America, though perhaps with more noise than I had initially expected.

Given the scale of the results, and given that those results are still preliminary, the purpose of this report is not to advance definitively the interpretations that I hope to make on the *America's Public Bible* website. Rather, I want to demonstrate the validity of computing across collections by making interpretative comparisons among them.

```{r}
verses_in_collection <- function(collection) {
  # Probably need to improve this query with a "DISTINCT ON" clause
  raw <- "
    SELECT
      q.reference_id,
      count(q.reference_id) AS n
    FROM
      items_in_collections ic
      LEFT JOIN (
        SELECT
          *
        FROM
          results.biblical_quotations
        WHERE
          probability >= 0.9) AS q ON ic.item_id = q.item_id
    WHERE
      ic.collection_id = {collection}
    GROUP BY
      q.reference_id
    ORDER BY
      n DESC
    LIMIT 20;
  "
  # glue_sql() safely interpolates the collection ID into the query
  query <- glue::glue_sql(raw, .con = db)
  dbGetQuery(db, query)
}
```

```{r}
top_verses <- function() {
  # The results of this query have been precomputed and cached in
  # temp.top_verses:
  # SELECT reference_id, COUNT(reference_id) AS n
  # FROM (
  #   SELECT DISTINCT ON (item_id, reference_id) item_id, reference_id
  #   FROM (
  #     SELECT *
  #     FROM results.biblical_quotations
  #     WHERE probability >= 0.9) q
  #   ORDER BY item_id, reference_id, probability) u
  # GROUP BY reference_id
  # ORDER BY n DESC
  # LIMIT 100;
  query <- "SELECT * FROM temp.top_verses"
  dbGetQuery(db, query)
}
```

Let's start by looking at the 100 most quoted verses across the collections. Although we are going to retrieve the top 100 verses, for the sake of space we will display only the top 20.

```{r}
top100 <- top_verses() |>
  select(-n) |>
  mutate(rank = 1:100)
knitr::kable(top100 |> head(20))
```

This list contains some verses which are likely to have been quoted very frequently. Consider some of the top verses:

- Mark 9:40: "For he that is not against us is on our part."
- John 11:35: "Jesus wept."
- John 10:30: "I and my Father are one."
- Mark 13:37: "And what I say unto you I say unto all, Watch."

These are precisely the kinds of proverbial, well-known scriptural texts which were quotable, and thus frequently quoted.

Other verses, however, are clearly errors. It is unlikely that a quotation from the Apocrypha (Baruch 4:17) was among the most commonly quoted. (For *America's Public Bible*, I have simply eliminated some of these verses, in effect "blacklisting" them. I will do the same here after identifying which verses should be a part of that list.)

However, one of the fundamental premises of computing across collections is that we can learn something by comparison across collections, and not just from within a collection (however large).

So let us get the top 20 verses from a single collection. In this instance, let us find the top verses quoted in the [Civil Rights History Project](http://www.loc.gov/collections/civil-rights-history-project/about-this-collection/).

```{r}
cr <- verses_in_collection("http://www.loc.gov/collections/civil-rights-history-project/about-this-collection/") |>
  select(-n) |>
  mutate(rank = 1:20)
knitr::kable(cr)
```

We can notice immediately that we have a different set of verses, though with some of the same likely errors. (1 Esdras and Tobit are in the Apocrypha; Ezra and Chronicles seem to be overrepresented due to the presence of numbers and formulaic phrases.) What we want to figure out is not just what the top verses in a collection are, but which verses are _unusual_ in a collection. While there are a number of ways to do that, including computing the likelihood, a simple approach suffices. We can compare the two lists of verses and keep only the verses in a collection that are _not_ contained in the list of top verses across all collections. This approach also has the neat property that verses that are erroneously represented in one collection are even more likely to be represented across all the collections, so the comparison eliminates many of our errors.

Undertaking this comparison for the Civil Rights collection, we find the following top 10 verses:

```{r}
cr10 <- cr |>
  anti_join(top100, by = "reference_id") |>
  filter(reference_id != "1 Esdras 8:80") |>
  filter(reference_id != "Tobit 10:9") |>
  mutate(revised_rank = 1:10)
knitr::kable(cr10)
```

The results are immediately suggestive:

- Galatians 4:3: "Even so we, when we were children, were in bondage under the elements of the world."
- 1 Thessalonians 3:4: "For verily, when we were with you, we told you before that we should suffer tribulation; even as it came to pass, and ye know."
- 1 John 2:3: "And hereby we do know that we know him, if we keep his commandments."
- Romans 5:10: "For if, when we were enemies, we were reconciled to God by the death of his Son, much more, being reconciled, we shall be saved by his life."
- 2 Thessalonians 3:10: "For even when we were with you, this we commanded you, that if any would not work, neither should he eat."

Key words and concepts include "bondage," "tribulation," the keeping of commandments (the commandment in this instance being "love your brother"), "reconciliation," and labor.

Now, it would be interpretatively irresponsible to assume that because a verse was quoted in a collection about civil rights, we immediately understand how it must have been used. What is essential at this point in the research is to bring in conventional historical methods: to go back to the sources, to identify the context of the quotations, and to listen to the voices of the people who used these scriptural texts. I merely mean to suggest that the results are prima facie interesting and worthy of further interpretation via a mix of computational and conventional methods, paying due attention to the ethical considerations of this kind of research.

## Table of interesting language items

```{r}
# Items in which more than one language was detected, joined with item
# metadata and exported for manual inspection.
lang_interest <- lang |>
  filter_langs() |>
  group_by(item_id) |>
  mutate(n = n()) |>
  filter(n > 1)

items <- tbl(db, "items") |>
  select(id, title, year, date, subjects, languages)

# Metadata for each multilingual item of interest
lang_interest_md <- lang_interest |>
  select(item_id) |>
  distinct() |>
  left_join(items, by = c("item_id" = "id")) |>
  collect()

# One column per detected language, holding its percentage of the item
lang_interest_langs <- lang_interest |>
  select(-sentences) |>
  pivot_wider(names_from = lang, values_from = percentage) |>
  collect() %>%
  select(item_id, sort(colnames(.)))

lang_interest_md |>
  left_join(lang_interest_langs, by = "item_id") |>
  write_csv("~/Desktop/multilingual-of-interest.csv", na = "")
```

## Cleanup

```{r disconnect}
dbDisconnect(db)
```
