add text to intro vignette
khusmann committed Mar 1, 2024
1 parent 22a5b60 commit 79b02ea
Showing 2 changed files with 203 additions and 9 deletions.
2 changes: 1 addition & 1 deletion inst/extdata/colors.csv
@@ -1,7 +1,7 @@
person_id,age,favorite_color
1,20,BLUE
2,REFUSED,BLUE
-3,21,RED
+3,21,REFUSED
4,30,OMITTED
5,1,N/A
6,41,RED
210 changes: 202 additions & 8 deletions vignettes/interlacer.Rmd
@@ -12,15 +12,209 @@ knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(interlacer)
library(readr)
library(dplyr)
```

-```{r setup}
-library(interlacer)
In many datasets, reasons for missing values are interlaced with data as special
values or codes. For example, consider the following CSV:

```{r}
interlacer_example("colors.csv") |>
  read_file() |>
  cat()
```

As you can see, this data source has three variables: `person_id` and `age`,
both numeric, and `favorite_color`, a character or factor variable.
Interlaced with their values are three possible missing reasons: `REFUSED`,
`OMITTED`, and `N/A`.

Loading the values of this data source takes a single call to the venerable
`readr::read_csv()`:

```{r}
(df <- read_csv(
  interlacer_example("colors.csv"),
  na = c("REFUSED", "OMITTED", "N/A")
))
```

As you can see, the data were loaded into a dataframe with three columns,
and all of the missing reasons were replaced with `NA` values.

## Aggregations with missing reasons

Now, if we were only interested in the *values* of our source data, this
functionality is all we need. But what if we wanted to know *why* some values
were `NA`? Although that information was encoded in our source data, it was
lost when all of the missing reasons were converted into `NA` values.

For example, consider the `favorite_color` column. How many respondents
`REFUSED` to give their favorite color? How many people just `OMITTED` their
answer? Was the question `N/A` for some respondents (e.g. wasn't on their
survey form)? What was the mean respondent age for each of these groups?

Our current dataframe only gets us part way:

```{r}
df |>
  mutate(
    favorite_color_missing = is.na(favorite_color)
  ) |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing
  )
```

As you can see, because we converted all our missing reasons into a single `NA`,
we can only answer these questions about missingness in general, rather than
work with the specific reasons stored in our source data.

Unfortunately, if we try to load our data with the missing reasons intact, we
lose something else: the type information of the values.

```{r}
(df_with_missing <- read_csv(
  interlacer_example("colors.csv"),
  col_types = cols(.default = "c")
))
```

Now we have access to our missing reasons, but all the columns are character
vectors. This means that in order to do anything with our values, we always
have to filter out the missing reasons, and cast the remaining values to our
desired type:

```{r}
reasons <- c("REFUSED", "OMITTED", "N/A")

df_with_missing |>
  mutate(
    age_values = as.numeric(if_else(age %in% reasons, NA, age)),
    favorite_color_missing_reasons = if_else(
      favorite_color %in% reasons,
      favorite_color,
      NA
    )
  ) |>
  summarize(
    mean_age = mean(age_values, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing_reasons
  )
```

This gives us the information we want, but it is cumbersome and quickly gets
complex when different columns have different sets of possible missing
reasons: you have to do a lot of type conversion gymnastics to switch between
value types and missing types.

### The interlacer approach

Interlacer was built on the insight that everything becomes much tidier,
simpler, and more expressive when we explicitly work with values and missing
reasons as separate *channels* of the same variable. The `read_interlaced_*`
functions in interlacer do this for you: they *deinterlace* variables from
interlaced data sources into two columns per variable, one holding values
and one holding missing reasons.

```{r}
(df_deinterlaced <- read_interlaced_csv(
  interlacer_example("colors.csv"),
  na = c("REFUSED", "OMITTED", "N/A")
))
```

As you can see, missing reason columns are denoted by names surrounded by
dots: the `.age.` column holds the missing reasons for the `age` variable,
and so on.
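
To make the idea of two channels concrete, here is a minimal base R sketch of deinterlacing a single column by hand. It is an illustration under stated assumptions, not interlacer's implementation: `deinterlace_column()` is a hypothetical helper, and the input vector simply copies the first `age` entries from the CSV shown earlier.

```r
# Conceptual sketch only: NOT interlacer's implementation.
# `deinterlace_column()` is a hypothetical helper written for illustration.
deinterlace_column <- function(x, reasons, convert = identity) {
  is_reason <- x %in% reasons
  list(
    # Value channel: missing reasons become NA, the rest is converted
    value = convert(ifelse(is_reason, NA, x)),
    # Missing reason channel: only the reason codes are kept
    reason = ifelse(is_reason, x, NA_character_)
  )
}

raw_age <- c("20", "REFUSED", "21", "30", "1", "41")

age <- deinterlace_column(
  raw_age,
  reasons = c("REFUSED", "OMITTED", "N/A"),
  convert = as.numeric
)

age$value   # numeric values, NA where a reason was recorded
age$reason  # reason codes, NA where a value was recorded
```

Each element ends up with either a value or a reason, never both, which is the property the deinterlaced columns rely on.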

Now, all the missing reason information you need is right at your fingertips,
AND the value types are preserved. To make the same report as we did before,
we would run:

```{r}
df_deinterlaced |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = .favorite_color.
  )
```

We get the same results as before but without needing to do any type gymnastics!

## Filtering based on missing reasons

Having separate columns for values and missing reasons is also helpful for
creating samples with inclusion / exclusion criteria based on missing reasons.
For example, using our example data, say we wanted to create a sample of
respondents who `REFUSED` to give their age:

```{r}
df_deinterlaced |>
  filter(.age. == "REFUSED")
```

How about people who `REFUSED` to report their age AND favorite color?

```{r}
df_deinterlaced |>
  filter(.age. == "REFUSED" & .favorite_color. == "REFUSED")
```

With separate columns, we can combine value conditions with missing reason
conditions. For example, this will select everyone over 20 years old who
`REFUSED` to give their favorite color:

```{r}
df_deinterlaced |>
  filter(age > 20 & .favorite_color. == "REFUSED")
```

After we've created our sample, and are ready to start analyzing our data,
we typically don't need to keep the missing reasons around anymore. Interlacer
provides a convenient `drop_missing_reasons()` function to take care of this:

```{r}
df_deinterlaced |>
  filter(.age. == "REFUSED") |>
  drop_missing_reasons()
```
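
For intuition, dropping the missing reason channel amounts to keeping only the value columns. The base R sketch below relies solely on the dot-surrounded naming convention described earlier; it is not how `drop_missing_reasons()` is implemented, and `df_example` is a small made-up data frame used only for illustration.

```r
# Conceptual sketch only: NOT how drop_missing_reasons() is implemented.
# It assumes the naming convention described earlier, where missing reason
# columns have names surrounded by dots (e.g. `.age.`).
df_example <- data.frame(
  person_id = c(1, 2, 3),
  .person_id. = c(NA_character_, NA_character_, NA_character_),
  age = c(20, NA, 21),
  .age. = c(NA, "REFUSED", NA),
  check.names = FALSE
)

# TRUE for columns whose name starts and ends with a dot
is_reason_col <- grepl("^\\..+\\.$", names(df_example))

# Keep only the value columns
df_values_only <- df_example[, !is_reason_col, drop = FALSE]

names(df_values_only)
```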

## Next steps

So far, we've covered how interlacer's `read_interlaced_*` family
of functions enables us to deinterlace value and missing reason channels from
interlaced data sources into separate dataframe columns. Separate value and
missing reason columns enable us to create tidy and type-aware aggregation
and filtering pipelines that can simultaneously consider a variable's value
AND missing reasons.

That's all well and good, but what happens when we want to make modifications
to our dataframe? What if we want to add variables to our dataframe, replace
values with missing reasons, or missing reasons with values? Inevitably, we'll
create situations where we simultaneously have a value and a missing reason,
or neither a value nor a missing reason:

```{r}
# Value and missing reason:
df_deinterlaced |>
  mutate(.age. = "REDACTED")

# No value, no missing reason:
df_deinterlaced |>
  mutate(
    favorite_color = na_if(favorite_color, "BLUE")
  )
```

-Unfortunately, when these are loaded into R with
-functions like `readr::read_csv()` these all become `NA` values and lose the
-missing reason. If you want to create samples or calculate statistics connected
-to the *reasons* values are missing, you're forced to load your data as raw
-character columns and do a bunch of manual string operations to obtain the
-information you want.
These operations produce dataframes that don't conform to the rule of "one value
OR missing reason per variable row". We could solve this by manually fixing the
corresponding column, but as the above output hints, interlacer provides an
easier way: the `coalesce_missing_reasons()` function. In the next vignette,
`vignette("mutations")`, we will show how this works!
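
As a rough illustration of the "one value OR missing reason" rule itself, the base R sketch below flags rows that break it. This is a hand-rolled check on made-up vectors, not an interlacer function, and it says nothing about how `coalesce_missing_reasons()` resolves such rows.

```r
# Conceptual sketch only: a hand-rolled check, not an interlacer function.
# Each row should have exactly one of: a value, or a missing reason.
age_value  <- c(20, NA, 21, NA)
age_reason <- c("REDACTED", "REFUSED", NA, NA)

has_value  <- !is.na(age_value)
has_reason <- !is.na(age_reason)

# xor() is TRUE when exactly one of the two channels is populated
row_is_valid <- xor(has_value, has_reason)
row_is_valid

# Positions of rows that break the rule: row 1 has both a value and a
# reason, row 4 has neither
which(!row_is_valid)
```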
