add text to intro vignette

khusmann · Mar 1, 2024 · 79b02ea · 79b02ea
1 parent 22a5b60
commit 79b02ea
Show file tree

Hide file tree

Showing 2 changed files with 203 additions and 9 deletions.
diff --git a/inst/extdata/colors.csv b/inst/extdata/colors.csv
@@ -1,7 +1,7 @@
 person_id,age,favorite_color
 1,20,BLUE
 2,REFUSED,BLUE
-3,21,RED
+3,21,REFUSED
 4,30,OMITTED
 5,1,N/A
 6,41,RED

diff --git a/vignettes/interlacer.Rmd b/vignettes/interlacer.Rmd
@@ -12,15 +12,209 @@ knitr::opts_chunk$set(
   collapse = TRUE,
   comment = "#>"
 )
+library(interlacer)
+library(readr)
+library(dplyr)
 ```
 
-```{r setup}
-library(interlacer)
+In many datasets, reasons for missing values are interlaced with data as special
+values or codes. For example, consider the following CSV:
+
+```{r}
+interlacer_example("colors.csv") |>
+  read_file() |>
+  cat()
+```
+
+As you can see, this data source has three variables: `person_id` and `age`,
+both numeric variables and `favorite_color`, a character or factor variable.
+Interlaced in their values are three possible missing reasons: `REFUSED`,
+`OMITTED`, and `N/A`.
+
+To load the values of this data source, it is an easy call to the venerable
+`readr::read_csv()`:
+
+```{r}
+(df <- read_csv(
+  interlacer_example("colors.csv"),
+  na = c("REFUSED", "OMITTED", "N/A"),
+))
+```
+
+As you can see, the data were loaded into a dataframe with three columns,
+and all of the missing reasons were replaced with `NA` values.
+
+## Aggregations with missing reasons
+
+Now, if we were only interested in the *values* of our source data, this
+functionality is all we need. But what if we wanted to know *why* some values
+were `NA`? Although that information was encoded in our source data, it was
+lost when all of the missing reasons were converted into `NA` values.
+
+For example, consider the `favorite_color` column. How many respondents 
+`REFUSED` to give their favorite color? How many people just `OMITTED` their 
+answer? Was the question `N/A` for some respondents (e.g. wasn't on their
+survey form)? What was the mean respondent age for each of these groups?
+
+Our current dataframe only gets us part way:
+
+```{r}
+df |>
+  mutate(
+    favorite_color_missing = is.na(favorite_color)
+  ) |>
+  summarize(
+    mean_age = mean(age, na.rm = T),
+    n = n(),
+    .by = favorite_color_missing
+  )
+```
+
+As you can see, because we converted all our missing reasons into a single `NA`,
+we can only answer these questions about missingness in general, rather than
+work with the specific reasons stored in our source data.
+
+Unfortunately, if we try load our data with the missing reasons intact, we lose
+something else: the type information of the values.
+
+```{r}
+(df_with_missing <- read_csv(
+  interlacer_example("colors.csv"),
+  col_types = cols(.default = "c")
+))
+```
+
+Now we have access to our missing reasons, but all the columns are character
+vectors. This means that in order to do anything with our values, we always
+have to filter out the missing reasons, and cast the remaining values to our 
+desired type:
+
+```{r}
+reasons <- c("REFUSED", "OMITTED", "N/A")
+
+df_with_missing |>
+  mutate(
+    age_values = as.numeric(if_else(age %in% reasons, NA, age)),
+    favorite_color_missing_reasons = if_else(
+      favorite_color %in% reasons,
+      favorite_color,
+      NA
+    )
+  ) |>
+  summarize(
+    mean_age = mean(age_values, na.rm=T),
+    n = n(),
+    .by = favorite_color_missing_reasons
+  )
+```
+
+This gives us the information we want, but it is cumbersome and starts to get
+really complex when different columns have different sets of possible missing
+reasons. It means you have to do a lot of type conversion
+gymnastics to switch between value types and missing types.
+
+### The interlacer approach
+
+Interlacer was built based on the insight that everything becomes much more
+tidy, simple, and expressive when we explicitly work with values and missing
+reasons as separate *channels* of the same variable. The functions
+the `read_interlaced_*` functions in interlacer do this for you: 
+they *deinterlace* variables from interlaced data sources into two columns per
+variable: one for holding values, one for holding missing reasons.
+
+```{r}
+(df_deinterlaced <- read_interlaced_csv(
+  interlacer_example("colors.csv"),
+  na = c("REFUSED", "OMITTED", "N/A"),
+))
+```
+As you can see, missing reasons columns are denoted by names surrounded by
+dots: the `.age.` column holds the missing reasons for the `age` variable,
+and so on.
+
+Now, all the missing reason information you need is right at your fingertips,
+AND the value types are preserved. To make the same report as we did before,
+we would run:
+
+```{r}
+df_deinterlaced |>
+  summarize(
+    mean_age = mean(age, na.rm=T),
+    n = n(),
+    .by = .favorite_color.
+  )
+```
+
+We get the same results as before but without needing to do any type gymnastics!
+
+## Filtering based on missing reasons
+
+Having separate columns for values and missing reasons also helpful for creating
+samples with inclusion / exclusion criteria based on missing reasons. For
+example, using our example data, say we wanted to create a sample of respondents
+that `REFUSED` to give their age?
+
+```{r}
+df_deinterlaced |>
+  filter(.age. == "REFUSED")  
+```
+
+How about people who `REFUSED` to report their age AND favorite color?
+
+```{r}
+df_deinterlaced |>
+  filter(.age. == "REFUSED" & .favorite_color. == "REFUSED")
+```
+
+With separate columns, we can combine value conditions with missing reason
+conditions. For example, this will select everyone who `REFUSED` to give
+their favorite color, who were over 20 years old:
+
+```{r}
+df_deinterlaced |>
+  filter(age > 20 & .favorite_color. == "REFUSED")
+```
+
+After we've created our sample, and are ready to start analyzing our data,
+we typically don't need to keep the missing reasons around anymore. Interlacer
+provides a convenient `drop_missing_reasons()` function to take care of this:
+
+```{r}
+df_deinterlaced |>
+  filter(.age. == "REFUSED") |>
+  drop_missing_reasons()
+```
+
+## Next steps
+
+So far, we've covered how interlacer's `read_interlaced_*` family
+of functions enabled us to deinterlace value and missing reason channels from
+interlaced data sources into separate dataframe columns. Separate value and
+missing reason columns enable us to create tidy and type-aware aggregation
+and filtering pipelines that can simultaneously consider a variable's value 
+AND missing reasons.
+
+That's all well and good, but what happens when we want to make modifications
+to our dataframe? What if we want to add variables to our dataframe, replace
+values with missing reasons, or missing reasons with values? Inevitably, we'll
+create situations where we simultaneously have a value and a missing reason,
+or neither a value nor a missing reason:
+
+```{r}
+# Value and missing reason:
+df_deinterlaced |>
+  mutate(.age. = "REDACTED")
+
+# No value, no missing reason:
+df_deinterlaced |>
+  mutate(
+    favorite_color = na_if(favorite_color, "BLUE")
+  )
 ```
 
-Unfortunately, when these are loaded into R with 
-functions like `readr::read_csv()` these all become `NA` values and lose the
-missing reason. If you want to create samples or calculate statistics connected
-to the *reasons* values are missing, you're forced to load your data as raw
-character columns and do a bunch of manual string operations to obtain the
-information you want.
+These operations produce dataframes that don't conform to the rule of "one value
+OR missing reason per variable row". We could manually solve this by manually
+fixing the corresponding column, but as the above output hints,
+interlacer provides an easier way by way of the function
+`coalesce_missing_reasons()`. In the next vignette,
+`vignette("mutations")`, we will show how this works!