Commit
Showing 6 changed files with 279 additions and 38 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,5 +8,5 @@ articles: | |
contents: | ||
- mutations | ||
- column-types | ||
- recipes | ||
- coded-data | ||
- other-approaches |
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,274 @@ | ||
--- | ||
title: "Coded Data" | ||
output: rmarkdown::html_vignette | ||
vignette: > | ||
%\VignetteIndexEntry{Coded Data} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
```{r, include = FALSE} | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>" | ||
) | ||
``` | ||
|
||
In addition to interlacing values and missing reasons, many statistical software | ||
packages will store categorical values and missing reasons as alphanumeric | ||
codes. These codes are often chosen so that numeric comparisons or casts can be | ||
used to determine if a value represents a real value or missing reason. | ||
Like 8-character variable name limits, this practice comes from a | ||
historical need to save digital storage space even if it made analyses less | ||
readable and more error-prone. | ||
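
As a minimal sketch of those two tests, here is a base R illustration on made-up vectors. The codes and values are hypothetical and are not taken from any particular package's output:

```r
# Hypothetical numeric coding: negative numbers are missing reason codes
ages <- c(23, -99, 41, -98)
is_missing_numeric <- ages < 0

# Hypothetical character coding: anything that fails a numeric cast
# is treated as a missing reason code
ages_chr <- c("23", ".", "41", ".a")
is_missing_cast <- is.na(suppressWarnings(as.numeric(ages_chr)))

is_missing_numeric
is_missing_cast
```

Both tests flag the second and fourth elements. Note that neither test says *why* a value is missing; that still requires looking the code up in a codebook.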
|
||
Even though storage is cheap these days, coded formats remain the standard | ||
for statistical software packages like SPSS, SAS, and Stata. This article | ||
describes these common coding schemes and how to decode and deinterlace | ||
them so they are easier to work with in R. | ||
|
||
## Numeric codes with negative missing reasons (SPSS) | ||
|
||
It's extremely common to find data sources that encode all categorical responses | ||
as numeric values, with negative values representing missing reason codes. SPSS | ||
is one such example. Here's an SPSS-formatted version of the `colors.csv` | ||
example: | ||
|
||
```{r} | ||
library(readr) | ||
library(interlacer) | ||
read_file( | ||
  interlacer_example("colors_coded.csv") | ||
) |> | ||
  cat() | ||
``` | ||
|
||
Where missing reasons are coded: | ||
|
||
> -99: N/A | ||
> | ||
> -98: REFUSED | ||
> | ||
> -97: OMITTED | ||
|
||
And colors are coded: | ||
|
||
> 1: BLUE | ||
> | ||
> 2: RED | ||
> | ||
> 3: YELLOW | ||
|
||
This format gives you the ability to load everything as a numeric type: | ||
|
||
```{r} | ||
(df_coded <- read_csv( | ||
  interlacer_example("colors_coded.csv"), | ||
  # Parse every column as a number | ||
  col_types = cols(.default = col_number()) | ||
)) | ||
``` | ||
|
||
To test if a value is a missing code, you can check if it's less than 0: | ||
|
||
```{r} | ||
library(dplyr, warn.conflicts = FALSE) | ||
df_coded |> | ||
  mutate( | ||
    # Keep only the negative missing codes in a separate column | ||
    favorite_color_missing = if_else(favorite_color < 0, favorite_color, NA), | ||
    # Blank out the negative missing codes so they don't skew the mean | ||
    age = if_else(age >= 0, age, NA) | ||
  ) |> | ||
  summarize( | ||
    mean_age = mean(age, na.rm = TRUE), | ||
    n = n(), | ||
    .by = favorite_color_missing | ||
  ) | ||
``` | ||
|
||
The downsides of this approach are twofold: 1) all of your values and | ||
missing reasons become codes you have to remember and 2) it's really easy to | ||
make mistakes. | ||
|
||
What sort of mistakes? Well, because everything is numeric, there's nothing | ||
stopping us from treating missing values as if they are regular values... | ||
If you forget to remove your missing values, R will still happily | ||
compute aggregations using the negative numbers! | ||
|
||
```{r} | ||
df_coded |> | ||
  mutate( | ||
    favorite_color_missing = if_else(favorite_color < 0, favorite_color, NA), | ||
    # Oops: the negative missing codes in `age` are never blanked out | ||
    # age = if_else(age >= 0, age, NA) | ||
  ) |> | ||
  summarize( | ||
    mean_age = mean(age, na.rm = TRUE), | ||
    n = n(), | ||
    .by = favorite_color_missing | ||
  ) | ||
``` | ||
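
The size of the distortion is easy to see on a tiny made-up vector. This is only an arithmetic illustration with hypothetical ages, not the data in `colors_coded.csv`:

```r
# Hypothetical ages: three real values and one -99 "N/A" code
age <- c(20, 30, 40, -99)

# Forgetting to remove the code drags the mean below any real age
naive_mean <- mean(age)
naive_mean
#> [1] -2.25

# Blanking out the negative codes first gives the intended result
clean_mean <- mean(ifelse(age < 0, NA, age), na.rm = TRUE)
clean_mean
#> [1] 30
```

Here the error is obvious because the mean is negative. With a larger sample, a handful of stray codes can shift a mean without producing an impossible value, which is what makes the mistake hard to spot.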
|
||
Have you ever thought you had a significant result, only to discover that it | ||
was an artifact of stray missing reason codes still interlaced with | ||
your values? It's a bad time. | ||
|
||
You're much better off loading these formats with interlacer, then converting | ||
the codes into labelled factor levels: | ||
|
||
```{r} | ||
library(forcats) | ||
(df_decoded_deinterlaced <- read_interlaced_csv( | ||
  interlacer_example("colors_coded.csv"), | ||
  na = c("-99", "-98", "-97") | ||
) |> | ||
  mutate( | ||
    across( | ||
      missing_cols(), | ||
      \(x) fct_recode(x, | ||
        `N/A` = "-99", | ||
        REFUSED = "-98", | ||
        OMITTED = "-97", | ||
      ) | ||
    ), | ||
    favorite_color = fct_recode( | ||
      as.character(favorite_color), | ||
      BLUE = "1", | ||
      RED = "2", | ||
      YELLOW = "3", | ||
    ) | ||
  )) | ||
``` | ||
|
||
Now aggregations won't mix up values and missing codes, and you won't have to | ||
keep cross-referencing your codebook to know what values mean: | ||
|
||
```{r} | ||
df_decoded_deinterlaced |> | ||
  summarize( | ||
    mean_age = mean(age, na.rm = TRUE), | ||
    n = n(), | ||
    .by = .favorite_color. | ||
  ) | ||
``` | ||
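
To make the idea of decoding and deinterlacing concrete, here is a rough base R sketch of what splitting one coded column into a value column and a missing reason column amounts to. It is not how interlacer is implemented. The vectors, the codebook objects, and the `missing_reason` column name are all made up for illustration:

```r
# Hypothetical codebooks, mirroring the tables above
color_labels   <- c("1" = "BLUE", "2" = "RED", "3" = "YELLOW")
missing_labels <- c("-99" = "N/A", "-98" = "REFUSED", "-97" = "OMITTED")

# A made-up coded column
favorite_color_coded <- c(1, -98, 3, -99, 2)

# Look each code up by name; codes absent from a codebook become NA
key <- as.character(favorite_color_coded)
values <- factor(
  unname(color_labels[key]),
  levels = unname(color_labels)
)
reasons <- factor(
  unname(missing_labels[key]),
  levels = unname(missing_labels)
)

# Each row now has either a value or a missing reason, never both
data.frame(favorite_color = values, missing_reason = reasons)
```

Because values and reasons live in separate columns with readable labels, an aggregation over `favorite_color` can no longer pick up a missing code by accident.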
|
||
## Numeric codes with character missing reasons (SAS, Stata) | ||
|
||
Like SPSS, SAS and Stata encode factor levels as numeric values, but | ||
instead of representing missing reasons as negative codes, they use | ||
character codes: | ||
|
||
```{r} | ||
read_file( | ||
  interlacer_example("colors_coded_char.csv") | ||
) |> | ||
  cat() | ||
``` | ||
|
||
Here, the same value codes are used as in the previous example, except the missing | ||
reasons are coded as follows: | ||
|
||
> ".": N/A | ||
> | ||
> ".a": REFUSED | ||
> | ||
> ".b": OMITTED | ||
|
||
To handle these missing reasons without interlacer, columns must be loaded as | ||
character vectors: | ||
|
||
```{r} | ||
(df_coded_char <- read_csv( | ||
  interlacer_example("colors_coded_char.csv"), | ||
  # Parse every column as a character vector | ||
  col_types = cols(.default = col_character()) | ||
)) | ||
``` | ||
|
||
To test if a value is missing, cast it to a numeric type. If the cast fails, | ||
you know it's a missing code. If it succeeds, you know it's a coded value. | ||
|
||
```{r} | ||
df_coded_char |> | ||
  mutate( | ||
    # A failed numeric cast (NA) marks a missing reason code | ||
    favorite_color_missing = if_else( | ||
      is.na(as.numeric(favorite_color)), | ||
      favorite_color, | ||
      NA | ||
    ), | ||
    # as.numeric() already yields NA for the missing reason codes | ||
    age = as.numeric(age) | ||
  ) |> | ||
  summarize( | ||
    mean_age = mean(age, na.rm = TRUE), | ||
    n = n(), | ||
    .by = favorite_color_missing | ||
  ) | ||
``` | ||
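
The cast test above can be wrapped in a small helper so the coercion warning is raised in only one place. The sketch below uses base R only; `is_missing_code()` is a hypothetical name and is not part of interlacer or readr:

```r
# Hypothetical helper: TRUE where a string cannot be read as a number.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
is_missing_code <- function(x) {
  is.na(suppressWarnings(as.numeric(x)))
}

# Made-up character-coded ages
age_chr <- c("25", ".", "31", ".b")

missing_flag <- is_missing_code(age_chr)
age_num <- suppressWarnings(as.numeric(age_chr))

missing_flag
age_num
```

Suppressing warnings wholesale can also hide genuinely malformed input, such as a typo in a number, so this is a stopgap rather than a fix.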
|
||
Although the character missing codes help prevent us from mistakenly including | ||
missing codes in value aggregations, having to cast our columns to numeric | ||
all the time to check for missingness is hardly ergonomic, and generates | ||
annoying warnings. Like before, it's easier to import with interlacer and | ||
decode the values and missing reasons: | ||
|
||
```{r} | ||
read_interlaced_csv( | ||
  interlacer_example("colors_coded_char.csv"), | ||
  na = c(".", ".a", ".b") | ||
) |> | ||
  mutate( | ||
    across( | ||
      missing_cols(), | ||
      \(x) fct_recode(x, | ||
        `N/A` = ".", | ||
        REFUSED = ".a", | ||
        OMITTED = ".b", | ||
      ) | ||
    ), | ||
    favorite_color = fct_recode( | ||
      as.character(favorite_color), | ||
      BLUE = "1", | ||
      RED = "2", | ||
      YELLOW = "3", | ||
    ) | ||
  ) | ||
``` | ||
|
||
## Encoding a decoded & deinterlaced data frame | ||
|
||
Re-coding and re-interlacing a data frame is easily done as follows: | ||
|
||
```{r, eval = FALSE} | ||
df_decoded_deinterlaced |> | ||
  mutate( | ||
    across( | ||
      missing_cols(), | ||
      \(x) fct_recode(x, | ||
        `-99` = "N/A", | ||
        `-98` = "REFUSED", | ||
        `-97` = "OMITTED" | ||
      ) | ||
    ), | ||
    favorite_color = fct_recode( | ||
      favorite_color, | ||
      `1` = "BLUE", | ||
      `2` = "RED", | ||
      `3` = "YELLOW" | ||
    ) | ||
  ) |> | ||
  write_interlaced_csv("output.csv") | ||
``` | ||
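
Conceptually, re-interlacing merges each value column with its missing reason column back into a single coded column. The base R sketch below shows that merge on made-up vectors. It is an illustration of the idea under assumed names (`missing_reason`, `color_codes`, `missing_codes`), not a description of what `write_interlaced_csv()` does internally:

```r
# Hypothetical deinterlaced columns: exactly one of the pair is NA per row
favorite_color <- factor(c("BLUE", NA, "YELLOW", NA))
missing_reason <- factor(c(NA, "REFUSED", NA, "N/A"))

# Hypothetical codebooks mapping labels back to codes
color_codes   <- c(BLUE = "1", RED = "2", YELLOW = "3")
missing_codes <- c("N/A" = "-99", REFUSED = "-98", OMITTED = "-97")

# Take the value code where a value exists, otherwise the reason code
interlaced <- unname(ifelse(
  is.na(favorite_color),
  missing_codes[as.character(missing_reason)],
  color_codes[as.character(favorite_color)]
))

interlaced
```

The result is a single character vector in which values and missing reason codes sit side by side again, which is the shape the coded CSV files above store.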
|
||
## haven | ||
|
||
The [haven](https://haven.tidyverse.org/) package has functions for loading | ||
native SPSS, SAS, and Stata file formats into | ||
special data frames that use column attributes and special values to keep track | ||
of interlaced values and missing reasons. For a complete discussion of how this | ||
compares to interlacer's approach, see `vignette("other-approaches")`. | ||
|
||
Future versions of interlacer could have the ability to convert haven data | ||
frames to and from deinterlaced data frames, but I want to gauge interest in | ||
this feature before I invest the time to implement it. If this is a feature | ||
you'd use, [please let me know](mailto:[email protected])! |
This file was deleted.