fill in coded-data vignette
khusmann committed Mar 5, 2024
1 parent 03a9002 commit 688ffc1
Showing 6 changed files with 279 additions and 38 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -26,7 +26,7 @@ Imports:
Suggests:
    knitr,
    rmarkdown,
    haven,
    forcats,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Encoding: UTF-8
2 changes: 1 addition & 1 deletion _pkgdown.yml
@@ -8,5 +8,5 @@ articles:
contents:
- mutations
- column-types
- recipes
- coded-data
- other-approaches
File renamed without changes.
274 changes: 274 additions & 0 deletions vignettes/coded-data.Rmd
@@ -0,0 +1,274 @@
---
title: "Coded Data"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Coded Data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In addition to interlacing values and missing reasons, many statistical software
packages will store categorical values and missing reasons as alphanumeric
codes. These codes are often chosen so that numeric comparisons or casts can be
used to determine if a value represents a real value or missing reason.
Like 8-character variable name limits, this practice comes from a
historical need to save digital storage space even if it made analyses less
readable and more error-prone.

Even though storage is cheap these days, coded formats continue to be the
standard format used
by statistical software packages like SPSS, SAS, and Stata. This article
will describe these common coding schemes and how they can be decoded and
deinterlaced to make them easier to work with in R.

## Numeric codes with negative missing reasons (SPSS)

It's extremely common to find data sources that encode all categorical responses
as numeric values, with negative values representing missing reason codes. SPSS
is one such example. Here's an SPSS-formatted version of the `colors.csv`
example:

```{r}
library(readr)
library(interlacer)
read_file(
  interlacer_example("colors_coded.csv")
) |>
  cat()
```

Where missing values are:

> -99: N/A
>
> -98: REFUSED
>
> -97: OMITTED

And colors are coded:

> 1: BLUE
>
> 2: RED
>
> 3: YELLOW

This format gives you the ability to load everything as a numeric type:

```{r}
(df_coded <- read_csv(
  interlacer_example("colors_coded.csv"),
  col_types = "n"
))
```

To test if a value is a missing code, you can check if it's less than 0:

```{r}
library(dplyr, warn.conflicts = FALSE)
df_coded |>
  mutate(
    favorite_color_missing = if_else(favorite_color < 0, favorite_color, NA),
    age = if_else(age > 0, age, NA)
  ) |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing
  )
```
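
The sign test can also be sketched in base R on a standalone vector. This is only an illustration: the `favorite_color` vector below is made-up data (not the contents of `colors_coded.csv`), and the variable names are hypothetical.

```r
# Made-up coded column: positive codes are values, negative codes are
# missing reasons (assumption: same scheme as the codebook above).
favorite_color <- c(1, 2, -99, 3, -98, 1)

# A value is a missing code when it is negative.
is_missing_code <- favorite_color < 0

# Keep the real values, blanking out the missing codes.
color_values <- ifelse(is_missing_code, NA_real_, favorite_color)

# Keep the missing reasons, blanking out the real values.
color_missing <- ifelse(is_missing_code, favorite_color, NA_real_)

print(color_values)
print(color_missing)
```

Each position ends up with either a value or a missing reason, never both, which is the invariant a deinterlaced representation is meant to preserve.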

The downsides of this approach are twofold: 1) all of your values and
missing reasons become codes you have to remember and 2) it's really easy to
make mistakes.

What sort of mistakes? Well, because everything is numeric, there's nothing
stopping us from treating missing values as if they are regular values...
If you forget to remove your missing values, R will still happily
compute aggregations using the negative numbers!

```{r}
df_coded |>
  mutate(
    favorite_color_missing = if_else(favorite_color < 0, favorite_color, NA),
    # Oops: the step that blanks out the missing codes in `age` was skipped
    # age = if_else(age > 0, age, NA)
  ) |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing
  )
```
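
To make the size of the error concrete, here is a small worked example in base R. The `age` vector is made-up data chosen for easy arithmetic, not taken from the example file.

```r
# Made-up ages with one stray missing reason code (-99) left in.
age <- c(21, 35, -99, 40)

# Forgetting to remove the code: (21 + 35 - 99 + 40) / 4
naive_mean <- mean(age)
print(naive_mean)
#> [1] -0.75

# Removing the code first: (21 + 35 + 40) / 3
clean_mean <- mean(age[age > 0])
print(clean_mean)
#> [1] 32
```

A single stray code drags the mean from 32 to -0.75, and R raises no warning in either case.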

Have you ever thought you had a significant result, only to find that it's
only because there are some stray missing reason codes still interlaced with
your values? It's a bad time.

You're much better off loading these formats with interlacer, then converting
the codes into labelled factor levels:

```{r}
library(forcats)
(df_decoded_deinterlaced <- read_interlaced_csv(
  interlacer_example("colors_coded.csv"),
  na = c("-99", "-98", "-97")
) |>
  mutate(
    across(
      missing_cols(),
      \(x) fct_recode(x,
        `N/A` = "-99",
        REFUSED = "-98",
        OMITTED = "-97",
      )
    ),
    favorite_color = fct_recode(
      as.character(favorite_color),
      BLUE = "1",
      RED = "2",
      YELLOW = "3",
    )
  ))
```

Now aggregations won't mix up values and missing codes, and you won't have to
keep cross-referencing your codebook to know what values mean:

```{r}
df_decoded_deinterlaced |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = .favorite_color.
  )
```
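
The decoding step earlier in this section uses forcats. As a rough base R sketch of the same idea, a codebook can be applied with `factor()` and its `levels` / `labels` arguments. The `color_codes` vector is made-up data and the variable names are hypothetical.

```r
# Made-up value codes; NA stands for a position whose missing reason
# lives in a separate column.
color_codes <- c(1, 2, 3, 1, NA)

# Apply the codebook: levels are the stored codes, labels are what
# they mean.
color_labels <- factor(
  color_codes,
  levels = c(1, 2, 3),
  labels = c("BLUE", "RED", "YELLOW")
)

print(color_labels)
```

One caveat with this approach: a code that is absent from `levels` silently becomes `NA`, so it is worth checking for unexpected codes before decoding.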

## Numeric codes with character missing reasons (SAS, Stata)

Like SPSS, SAS and Stata encode factor levels as numeric values, but
instead of representing missing reasons as negative codes, they represent them
as character codes:

```{r}
read_file(
  interlacer_example("colors_coded_char.csv")
) |>
  cat()
```

Here, the same value codes are used as in the previous example, except the
missing reasons are coded as follows:

> ".": N/A
>
> ".a": REFUSED
>
> ".b": OMITTED

To handle these missing reasons without interlacer, columns must be loaded as
character vectors:

```{r}
(df_coded_char <- read_csv(
  interlacer_example("colors_coded_char.csv"),
  col_types = "c"
))
```

To test if a value is missing, you can cast it to a numeric type. If the cast
fails (producing `NA`), you know it's a missing code. If it succeeds, you know
it's a coded value.

```{r}
df_coded_char |>
  mutate(
    favorite_color_missing = if_else(
      is.na(as.numeric(favorite_color)),
      favorite_color,
      NA
    ),
    age = if_else(!is.na(as.numeric(age)), as.numeric(age), NA)
  ) |>
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    n = n(),
    .by = favorite_color_missing
  )
```
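
The cast-based test can be sketched in base R on a standalone vector. This is an illustration under assumptions: `age_raw` is made-up data, the variable names are hypothetical, and `suppressWarnings()` is used here only to silence the expected coercion warning.

```r
# Made-up character column mixing numeric values and character
# missing reason codes.
age_raw <- c("21", ".", "35", ".a", "40", ".b")

# Attempt the cast; entries that are not numbers become NA.
age_parsed <- suppressWarnings(as.numeric(age_raw))

# A missing code is an entry that was present but failed to parse.
is_missing_code <- is.na(age_parsed) & !is.na(age_raw)

# Values keep the parsed numbers; reasons keep the original codes.
age_values <- age_parsed
age_missing <- ifelse(is_missing_code, age_raw, NA_character_)

print(age_values)
print(age_missing)
```

The extra `!is.na(age_raw)` condition keeps a genuinely empty cell from being reported as a missing reason code.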

Although the character missing codes help prevent us from mistakenly including
missing codes in value aggregations, having to cast our columns to numeric
all the time to check for missingness is hardly ergonomic, and generates
annoying warnings. Like before, it's easier to import with interlacer and
decode the values and missing reasons:

```{r}
read_interlaced_csv(
  interlacer_example("colors_coded_char.csv"),
  na = c(".", ".a", ".b")
) |>
  mutate(
    across(
      missing_cols(),
      \(x) fct_recode(x,
        `N/A` = ".",
        REFUSED = ".a",
        OMITTED = ".b",
      )
    ),
    favorite_color = fct_recode(
      as.character(favorite_color),
      BLUE = "1",
      RED = "2",
      YELLOW = "3",
    )
  )
```

## Encoding a decoded & deinterlaced data frame

Re-coding and re-interlacing a data frame is easily done as follows:

```{r, eval = FALSE}
df_decoded_deinterlaced |>
  mutate(
    across(
      missing_cols(),
      \(x) fct_recode(x,
        `-99` = "N/A",
        `-98` = "REFUSED",
        `-97` = "OMITTED"
      )
    ),
    favorite_color = fct_recode(
      favorite_color,
      `1` = "BLUE",
      `2` = "RED",
      `3` = "YELLOW"
    )
  ) |>
  write_interlaced_csv("output.csv")
```
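
Re-encoding is the decoding map run in reverse, so a quick round-trip check can catch a mistyped code. The following is a minimal base R sketch using a named character vector as the codebook. The `color_codebook` and `decoded` objects are hypothetical and not part of interlacer.

```r
# Hypothetical codebook: names are labels, elements are stored codes.
color_codebook <- c(BLUE = "1", RED = "2", YELLOW = "3")

# Made-up decoded values.
decoded <- c("BLUE", "YELLOW", "RED", "BLUE")

# Encode: look each label up by name.
encoded <- unname(color_codebook[decoded])

# Decode again: find each code's position, then take the matching name.
round_trip <- names(color_codebook)[match(encoded, color_codebook)]

print(encoded)
#> [1] "1" "3" "2" "1"

# The round trip should reproduce the original labels exactly.
stopifnot(identical(round_trip, decoded))
```

Keeping the codebook in one object means the decode and encode steps cannot drift apart.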

## haven

The [haven](https://haven.tidyverse.org/) package has functions for loading
native SPSS, SAS, and Stata file formats into
special data frames that use column attributes and special values to keep track
of interlaced values and missing reasons. For a complete discussion of how this
compares to interlacer's approach, see `vignette("other-approaches")`.

Future versions of interlacer could have the ability to convert haven data
frames to and from deinterlaced data frames, but I want to gauge interest in
this feature before I invest the time to implement it. If this is a feature
you'd use, [please let me know](mailto:[email protected])!
20 changes: 3 additions & 17 deletions vignettes/other-approaches.Rmd
@@ -12,27 +12,13 @@ knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(interlacer)
```

Why don't data

Values and missing reasons
## Labelled missing values

## Negative numeric codes

## String types in numeric fields

## haven

```{r}
library(haven)
```

### Labelled missing values

### "Tagged" missing values
## "Tagged" missing values

## The "ideal" approach


An "ideal" missing value API would make use of truly typed
19 changes: 0 additions & 19 deletions vignettes/recipes.Rmd

This file was deleted.
