fill in coded-data vignette

khusmann · Mar 5, 2024 · 688ffc1
1 parent 03a9002
commit 688ffc1
Showing 6 changed files with 279 additions and 38 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -26,7 +26,7 @@ Imports:
 Suggests: 
     knitr,
     rmarkdown,
-    haven,
+    forcats,
     testthat (>= 3.0.0)
 Config/testthat/edition: 3
 Encoding: UTF-8

diff --git a/_pkgdown.yml b/_pkgdown.yml
@@ -8,5 +8,5 @@ articles:
   contents:
   - mutations
   - column-types
-  - recipes
+  - coded-data
   - other-approaches
diff --git a/inst/extdata/colors_coded_string.csv b/inst/extdata/colors_coded_char.csv
diff --git a/vignettes/coded-data.Rmd b/vignettes/coded-data.Rmd
@@ -0,0 +1,274 @@
+---
+title: "Coded Data"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Coded Data}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+In addition to interlacing values and missing reasons, many statistical software
+packages will store categorical values and missing reasons as alphanumeric
+codes. These codes are often chosen so that numeric comparisons or casts can be
+used to determine if a value represents a real value or missing reason.
+Like 8-character variable name limits, this practice comes from a
+historical need to save digital storage space even if it made analyses less
+readable and more error-prone.
+
+Even though storage is cheap these days, coded formats continue to be the
+standard used by statistical software packages like SPSS, SAS, and Stata. This
+article describes these common coding schemes and how they can be decoded and
+deinterlaced to make them easier to work with in R.
+
+## Numeric codes with negative missing reasons (SPSS)
+
+It's extremely common to find data sources that encode all categorical responses
+as numeric values, with negative values representing missing reason codes. SPSS
+is one such example. Here's an SPSS-formatted version of the `colors.csv`
+example:
+
+```{r}
+library(readr)
+library(interlacer)
+
+read_file(
+  interlacer_example("colors_coded.csv")
+) |>
+  cat()
+```
+
+Where missing values are:
+
+> -99: N/A
+>
+> -98: REFUSED
+>
+> -97: OMITTED
+
+And colors are coded:
+
+> 1: BLUE
+>
+> 2: RED
+>
+> 3: YELLOW
+
+This format gives you the ability to load everything as a numeric type:
+
+```{r}
+(df_coded <- read_csv(
+  interlacer_example("colors_coded.csv"),
+  col_types = "n"
+))
+```
+
+To test if a value is a missing code, you can check if it's less than 0:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+
+df_coded |>
+  mutate(
+    favorite_color_missing = if_else(favorite_color < 0, favorite_color, NA),
+    age = if_else(age > 0, age, NA)
+  ) |>
+  summarize(
+    mean_age = mean(age, na.rm = TRUE),
+    n = n(),
+    .by = favorite_color_missing
+  )
+```
+
+The downsides of this approach are twofold: 1) all of your values and 
+missing reasons become codes you have to remember and 2) it's really easy to
+make mistakes.
+
+What sort of mistakes? Well, because everything is numeric, there's nothing
+stopping us from treating missing values as if they are regular values...
+If you forget to remove your missing values, R will still happily
+compute aggregations using the negative numbers!
+
+```{r}
+df_coded |>
+  mutate(
+    favorite_color_missing = if_else(favorite_color < 0, favorite_color, NA),
+    # age = if_else(age > 0, age, NA)  # forgot to remove the missing codes
+  ) |>
+  summarize(
+    mean_age = mean(age, na.rm = TRUE),
+    n = n(),
+    .by = favorite_color_missing
+  )
+```
+
+Have you ever thought you had a significant result, only to find that it's
+only because there are some stray missing reason codes still interlaced with
+your values? It's a bad time.
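+
+To see how far a few stray codes can pull a summary, here is a minimal sketch
+on a handful of made-up ages (hypothetical values, not taken from
+`colors_coded.csv`), using only base R:
+
+```{r}
+# Hypothetical ages; -98 and -97 are missing reason codes, not real ages
+ages <- c(21, 34, -98, 45, -97)
+
+# Forgetting to filter: the codes are averaged in as if they were ages
+mean(ages)
+
+# Dropping the negative codes first averages only the three real values
+mean(ages[ages > 0])
+```
+
+The unfiltered mean comes out negative here, which is at least easy to spot;
+with fewer missing codes the result can look plausible and slip through
+unnoticed.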
+
+You're much better off loading these formats with interlacer, then converting
+the codes into labelled factor levels:
+
+```{r}
+library(forcats)
+
+(df_decoded_deinterlaced <- read_interlaced_csv(
+  interlacer_example("colors_coded.csv"),
+  na = c("-99", "-98", "-97")
+) |>
+  mutate(
+    across(
+      missing_cols(),
+      \(x) fct_recode(x,
+        `N/A` = "-99",
+        REFUSED = "-98",
+        OMITTED = "-97",
+      )
+    ),
+    favorite_color = fct_recode(
+      as.character(favorite_color),
+      BLUE = "1",
+      RED = "2",
+      YELLOW = "3",
+    )
+  ))
+```
+
+Now aggregations won't mix up values and missing codes, and you won't have to
+keep cross-referencing your codebook to know what values mean:
+
+```{r}
+df_decoded_deinterlaced |>
+  summarize(
+    mean_age = mean(age, na.rm = TRUE),
+    n = n(),
+    .by = .favorite_color.
+  )
+```
+
+## Numeric codes with character missing reasons (SAS, Stata)
+
+Like SPSS, SAS and Stata encode factor levels as numeric values, but
+instead of representing missing reasons as negative codes, they represent them
+as character codes:
+
+```{r}
+read_file(
+  interlacer_example("colors_coded_char.csv")
+) |>
+  cat()
+```
+
+Here, the same value codes are used as in the previous example, except the
+missing reasons are coded as follows:
+
+> ".": N/A
+>
+> ".a": REFUSED
+>
+> ".b": OMITTED
+
+
+To handle these missing reasons without interlacer, columns must be loaded as
+character vectors:
+
+```{r}
+(df_coded_char <- read_csv(
+  interlacer_example("colors_coded_char.csv"),
+  col_types = "c"
+))
+```
+
+To test if a value is missing, you can cast it to a numeric type. If the cast
+fails, you know it's a missing code. If it succeeds, you know it's a coded value.
+
+```{r}
+df_coded_char |>
+  mutate(
+    favorite_color_missing = if_else(
+      is.na(as.numeric(favorite_color)),
+      favorite_color,
+      NA
+    ),
+    age = if_else(!is.na(as.numeric(age)), as.numeric(age), NA)
+  ) |>
+  summarize(
+    mean_age = mean(age, na.rm = TRUE),
+    n = n(),
+    .by = favorite_color_missing
+  )
+```
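+
+The cast-based test can also be seen in isolation. The sketch below uses a
+small made-up character vector (not the example file) and wraps the cast in
+`suppressWarnings()`, which is one way to silence the coercion warning. It
+assumes every non-numeric string is a missing code:
+
+```{r}
+# Hypothetical column mixing coded values with SAS/Stata-style missing codes
+x <- c("1", ".a", "3", ".")
+
+# as.numeric() returns NA (with a coercion warning) for non-numeric strings
+x_num <- suppressWarnings(as.numeric(x))
+
+# Successful casts are values; failed casts keep their original code as reason
+values <- x_num
+reasons <- ifelse(is.na(x_num), x, NA)
+
+values
+reasons
+```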
+
+Although the character missing codes help prevent us from mistakenly including
+missing codes in value aggregations, having to cast our columns to numeric
+all the time to check for missingness is hardly ergonomic, and generates
+annoying warnings. Like before, it's easier to import with interlacer and
+decode the values and missing reasons:
+
+```{r}
+read_interlaced_csv(
+  interlacer_example("colors_coded_char.csv"),
+  na = c(".", ".a", ".b")
+) |>
+  mutate(
+    across(
+      missing_cols(),
+      \(x) fct_recode(x,
+        `N/A` = ".",
+        REFUSED = ".a",
+        OMITTED = ".b",
+      )
+    ),
+    favorite_color = fct_recode(
+      as.character(favorite_color),
+      BLUE = "1",
+      RED = "2",
+      YELLOW = "3",
+    )
+  )
+```
+
+## Encoding a decoded & deinterlaced data frame
+
+Re-coding and re-interlacing a data frame is easily done as follows:
+
+```{r, eval = FALSE}
+df_decoded_deinterlaced |>
+  mutate(
+    across(
+      missing_cols(),
+      \(x) fct_recode(x,
+        `-99` = "N/A",
+        `-98` = "REFUSED",
+        `-97` = "OMITTED"
+      )
+    ),
+    favorite_color = fct_recode(
+      favorite_color,
+      `1` = "BLUE",
+      `2` = "RED",
+      `3` = "YELLOW"
+    )
+  ) |>
+  write_interlaced_csv("output.csv")
+```
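+
+Because encoding is just decoding with the names and values swapped, one
+option is to keep a single codebook and derive the other direction from it.
+This is a sketch on a made-up factor; `decode`, `encode`, and `colors` are
+hypothetical names, and the codebook is spliced into `fct_recode()` with `!!!`:
+
+```{r}
+library(forcats)
+
+# Codebook for decoding, in fct_recode()'s `new = "old"` form
+decode <- c(BLUE = "1", RED = "2", YELLOW = "3")
+
+# Swap names and values to get the encoding direction
+encode <- setNames(names(decode), decode)
+
+colors <- factor(c("BLUE", "RED", "YELLOW", "RED"))
+coded <- fct_recode(colors, !!!encode)
+decoded <- fct_recode(coded, !!!decode)
+
+# The round trip should give back the original labels
+identical(as.character(decoded), as.character(colors))
+```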
+
+## haven
+
+The [haven](https://haven.tidyverse.org/) package has functions for loading
+native SPSS, SAS, and Stata file formats into
+special data frames that use column attributes and special values to keep track
+of interlaced values and missing reasons. For a complete discussion of how this
+compares to interlacer's approach, see `vignette("other-approaches")`.
+
+Future versions of interlacer could have the ability to convert haven data
+frames to and from deinterlaced data frames, but I want to gauge interest in
+this feature before I invest the time to implement it. If this is a feature
+you'd use, [please let me know](mailto:[email protected])!
diff --git a/vignettes/other-approaches.Rmd b/vignettes/other-approaches.Rmd
@@ -12,27 +12,13 @@ knitr::opts_chunk$set(
   collapse = TRUE,
   comment = "#>"
 )
-library(interlacer)
 ```
 
-Why don't data
 
-Values and missing reasons 
+## Labelled missing values
 
-## Negative numeric codes
-
-## String types in numeric fields
-
-## haven
-
-```{r}
-library(haven)
-```
-
-### Labelled missing values
-
-### "Tagged" missing values
+## "Tagged" missing values
 
 ## The "ideal" approach
 
-
+An "ideal" missing value API would make use of truly typed 
diff --git a/vignettes/recipes.Rmd b/vignettes/recipes.Rmd