finish writing other-approaches
khusmann committed Mar 5, 2024
1 parent 3c5ccd7 commit c491143
Showing 4 changed files with 157 additions and 5 deletions.
16 changes: 12 additions & 4 deletions data-raw/colors.R
@@ -8,6 +8,12 @@ missing_labels <- c(
OMITTED = -97
)

missing_tags <- c(
`N/A` = tagged_na(""),
REFUSED = tagged_na("a"),
OMITTED = tagged_na("b")
)

color_labels <- c(
BLUE = 1,
RED = 2,
@@ -54,15 +60,17 @@ df_stata <- read_csv(
         suppressWarnings(as.double(x))
       )
     ),
-    person_id = labelled(person_id, label = "Person ID"),
-    age = labelled(age, label = "Age"),
+    person_id = labelled(person_id, labels = missing_tags, label = "Person ID"),
+    age = labelled(age, labels = missing_tags, label = "Age"),
     favorite_color = labelled(
-      favorite_color, labels = color_labels, label = "Favorite color"
+      favorite_color,
+      labels = c(missing_tags, color_labels),
+      label = "Favorite color"
     )
   )

 # Interesting. write_dta does not save variable label if there are no value
-# labels? When this file is read, neither person_id and age have variable
+# labels? When this file is read, person_id doesn't have variable
 # labels...

 write_dta(df_stata, "inst/extdata/colors.dta")
Binary file modified inst/extdata/colors.dta
Binary file not shown.
Binary file modified inst/extdata/colors.sav
Binary file not shown.
146 changes: 145 additions & 1 deletion vignettes/other-approaches.Rmd
@@ -105,6 +105,150 @@ df_spss |>

## "Tagged" missing values

When loading Stata and SAS files, haven uses a "tagged missingness" approach
that mirrors how missing values are handled in Stata and SAS:

```{r}
(df_stata <- read_stata(
  interlacer_example("colors.dta")
))
```

This approach is deviously clever. It takes advantage of the way `NaN` floating
point values are stored in memory to make it possible to have different
"flavors" of `NA` values. (For more info on how this is done, check out
[tagged_na.c](https://github.com/tidyverse/haven/blob/main/src/tagged_na.c) in
the source code for haven.)

They still all act like regular `NA` values... but now they can carry a
single-character "tag" (usually a letter from a-z). This means that they work
with `is.na()` AND that aggregations will not include missing reason codes!

```{r}
is.na(df_stata$age)
mean(df_stata$age, na.rm=TRUE)
```
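
To make the "flavors" of `NA` concrete, here is a minimal sketch on a
hand-built vector rather than the example dataset (the values and tags below
are made up for illustration). It uses `haven::tagged_na()` to create tagged
values and `haven::na_tag()` / `haven::is_tagged_na()` to read the tags back:

```r
library(haven)

# Made-up data: two observed values and two differently tagged NAs
x <- c(1, 5, tagged_na("a"), tagged_na("b"))

# Tagged NAs are still NA as far as base R is concerned
is.na(x)

# ...so aggregations skip them like any other NA
mean(x, na.rm = TRUE)

# The tags are only visible through haven's helpers
na_tag(x)
is_tagged_na(x, "a")
```

The point of the sketch is that the tag rides along invisibly: nothing in base
R needs to know about it for the values to behave as missing.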

Unfortunately, you can't group by them, because `dplyr::group_by()` is not
aware of missing tags :(

```{r}
df_stata |>
  mutate(
    favorite_color_missing_reasons = if_else(
      is.na(favorite_color), favorite_color, NA
    )
  ) |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing_reasons
  )
```
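
One possible workaround, sketched here on made-up data rather than the example
file, is to copy the tags into an ordinary character column with
`haven::na_tag()` before grouping. The column name `color_missing_tag` is just
an illustrative choice:

```r
library(dplyr)
library(haven)

# Made-up data: one observed color code and three tagged missing values
df <- tibble(
  age = c(20, 31, 45, 52),
  favorite_color = c(1, tagged_na("a"), tagged_na("a"), tagged_na("b"))
)

by_tag <- df |>
  # Pull the tag out into a plain character column that dplyr can group on
  mutate(color_missing_tag = na_tag(favorite_color)) |>
  summarize(
    mean_age = mean(age),
    n = n(),
    .by = color_missing_tag
  )

by_tag
```

This is essentially a hand-rolled version of keeping values and missing
reasons in separate columns.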

Another limitation of this approach is that it requires value types to be
numeric, because the trick of "tagging" the `NA` values depends on the
peculiarities of how floating point values are stored in memory. Again,
keeping separate columns for values and missing reasons solves all these issues.
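
A quick way to see the numeric restriction, as a small sketch that is not tied
to the example data: tagged values are doubles, and combining one with a
string coerces it to a plain character `NA`, which has nowhere to store a tag.

```r
library(haven)

# A tagged NA is stored as a double
typeof(tagged_na("a"))

# Mixing it with strings coerces the whole vector to character...
colors <- c("RED", tagged_na("a"))
typeof(colors)

# ...and the coerced element is an ordinary character NA
is.na(colors)
```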

## The "ideal" approach

The biggest downside of keeping separate columns for values and missing reasons
is the set of invalid states that can come up when you start mutating your data
frames. `coalesce_channels()` helps a lot, but it's not ideal.

I think the ideal way to handle missing reasons would be to implement a proper
generic [`Result` type](https://en.wikipedia.org/wiki/Result_type) natively
into R's type system. A real `Result` type would act similar to haven's
`haven::tagged_na()`, but be a container for any type of value, not only
missing values.
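
To make the idea concrete, here is a toy sketch of a `Result`-like value in
base R. The names `result_value()`, `result_missing()`, `is_missing_result()`,
and `result_reason()` are hypothetical and exist only for this illustration;
they are not part of interlacer or haven, and this scalar version says nothing
about how a vectorized, native implementation would work.

```r
# Hypothetical constructors: a result holds either a value or a missing reason
result_value <- function(value) {
  structure(list(value = value, reason = NA_character_), class = "toy_result")
}

result_missing <- function(reason) {
  structure(list(value = NA, reason = reason), class = "toy_result")
}

# Hypothetical accessors for the "missing reason" channel
is_missing_result <- function(x) !is.na(x$reason)
result_reason <- function(x) x$reason

# Unlike a tagged NA, the value can be of any type, e.g. a string
color <- result_value("RED")
refused <- result_missing("REFUSED")

is_missing_result(color)
is_missing_result(refused)
result_reason(refused)
```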

In an early version of this library, I tried using nested data frames to get
this effect:

```{r}
df_interlaced <- read_interlaced_csv(
  interlacer_example("colors.csv"),
  na = c("REFUSED", "OMITTED", "N/A")
)
(df_nested <- tibble(
  person_id = tibble(
    v = df_interlaced$person_id,
    m = df_interlaced$.person_id.,
  ),
  age = tibble(
    v = df_interlaced$age,
    m = df_interlaced$.age.,
  ),
  favorite_color = tibble(
    v = df_interlaced$favorite_color,
    m = df_interlaced$.favorite_color.,
  )
))
```

This sort of works, because we can use `$v` and `$m` to reference separate
channels of the data frame. Unfortunately it requires creating separate columns
when grouping:

```{r}
df_nested |>
  mutate(
    favorite_color_missing = favorite_color$m
  ) |>
  summarize(
    mean_age = mean(age$v, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing
  )
```

And mutations get ugly... (There's probably a way to coax this back into a
nicely displayed data frame, but I couldn't find it.)

```{r}
df_nested |>
  mutate(
    favorite_color = if_else(
      favorite_color$v %in% c("RED", "YELLOW"),
      tibble(v = favorite_color$v, m = NA),
      tibble(v = NA, m = "TECHNICAL_ERROR")
    )
  )
```

If we were to implement this somehow as a custom native type in R, I'd want
syntax something like this instead:

```{r, eval = FALSE}
df_mutated <- df |>
  mutate(
    favorite_color = if_else(
      favorite_color %in% c("RED", "YELLOW"),
      favorite_color,
      missing_reason("TECHNICAL_ERROR")
    )
  )

df_mutated |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = missing_reason(favorite_color)
  )
```

This would be "ideal" in my book: we can use values as usual, but anytime
we want to access the "missing reason" channel, we can wrap it in a
`missing_reason()` (similar to how `haven::tagged_na()` works). It's type safe
and super ergonomic. But implementing this would be a major headache and
involve very intimate knowledge of R internals... (@Hadley Wickham if by some
miracle you're reading this, could we talk sometime??)

So this is why I'm using the current "deinterlaced data frame"
approach. It is easy to understand and use, even though it's not "perfect"
from a strongly typed functional programming perspective. If there's enough
demand for missing-reason-aware tooling in R though, it might convince me
to go down the "generic tagged type" rabbit hole...
[Please drop me a line](mailto:[email protected]) to let me know what you think!
