The repository shows example code that I used to

- Make a simulated data set,
- Examine the missingness,
- Specify the imputation model,
- Perform multiple imputation with the mice package,
- Calculate a number of agreement statistics, and
- Pool the results with mice.
Also included are the slides from my presentation at CSP (2018) and my final submitted thesis (2019-06-29).
Due to the sensitive nature of electronic health record (EHR) data, in order to provide a data set for a reproducible example, I wrote code below to simulate a data set that can be used with the rest of the code in this example. The only similarities to the original data are the variable names and the data categories.
Note that the eligibility requirements for screening services are based on sex, age, and medical history. I do not simulate medical history and so the eligibility here is based only on sex and age.
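As a rough illustration of what such rules could look like, here is a hypothetical sketch. The actual rules live in src/simulate-data-set.R; the function name and age cutoffs below are illustrative only, not the real criteria.

# Hypothetical sketch of sex/age-based eligibility rules; the cutoffs are
# illustrative only -- the real logic is in src/simulate-data-set.R
assign_eligibility <- function(df) {
  dplyr::mutate(
    df,
    elig_cervical = factor(as.integer(sex == "F" & age_start >= 21)),
    elig_breast = factor(as.integer(sex == "F" & age_start >= 40))
  )
}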
The proportions of screenings in the data were chosen to try to simulate data that would make for an interesting example with the rest of the code. The proportions of missingness are similar to those in the actual work that I presented, in that I did a little bit of rounding before simulation.
Disclaimer: All the data in this example are simulated from the following code. Any similarities to the original data set or any other existing data are purely by chance.
source(here::here("src", "simulate-data-set.R"))
valdata <- make_sim_data(n_rows = 14000, seed = seed_for_imp)
dplyr::glimpse(valdata)
Observations: 14,000
Variables: 42
$ study_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ sex <fct> F, F, M, M, F, M, M, M, M, F, M, F, F, F, M, M, M, F…
$ age_start <int> 44, 22, 34, 47, 36, 50, 42, 29, 27, 38, 48, 63, 42, …
$ primary_dept <fct> 021, 031, 039, 020, 015, 005, 026, 008, 005, 024, 03…
$ ethnic_cat <fct> NH White, NH White, NH White, NH White, NH White, NH…
$ lang_cat <fct> English, English, English, English, English, English…
$ race_cat <fct> Black, Black, White, White, White, White, White, Whi…
$ fpl_cat <fct> <=138% FPL, <=138% FPL, >138% FPL, <=138% FPL, <=138…
$ age_cat <fct> "[35,51)", "[19,35)", "[19,35)", "[35,51)", "[35,51)…
$ elig_cervical <fct> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1…
$ elig_breast <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1…
$ elig_colon <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
$ elig_colonoscopy <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
$ elig_flexsig <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
$ elig_fobt <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
$ elig_bmi <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ elig_flu <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
$ elig_chlam <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ elig_smoking <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ elig_cholest <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dmap_cervical <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ dmap_breast <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1…
$ dmap_colon <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ dmap_colonoscopy <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ dmap_flexsig <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ dmap_fobt <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
$ dmap_bmi <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ dmap_flu <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0…
$ dmap_chlam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ dmap_smoking <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ dmap_cholest <fct> 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0…
$ ehr_cervical <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_breast <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_colon <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ ehr_colonoscopy <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_flexsig <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_fobt <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_bmi <fct> 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1…
$ ehr_flu <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1…
$ ehr_chlam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_smoking <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ehr_cholest <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
Here is a quick table one to look at the demographics of the sample data.
#### Create a table one --------------------------------
# Table one summary stats using the tableone package
tab1 <- tableone::CreateTableOne(
vars = c("sex", "race_cat", "ethnic_cat", "lang_cat", "fpl_cat", "age_cat"),
data = valdata,
factorVars =
c("sex", "race_cat", "ethnic_cat", "lang_cat", "fpl_cat", "age_cat"),
includeNA = TRUE
) %>%
print(.,
showAllLevels = TRUE,
printToggle = FALSE,
noSpaces = TRUE
)
#### Print table one --------------------------------
tab1 %>%
as.data.frame() %>%
tibble::rownames_to_column(., var = "rowname") %>%
tibble::as_tibble() %>%
mutate(rowname = rownames(tab1)) %>%
mutate(level = as.character(level),
level = tidyr::replace_na(level, "(Missing)")) %>%
knitr::kable(booktabs = TRUE,
longtable = TRUE,
col.names = c("", names(.)[-1])) %>%
kableExtra::kable_styling(full_width = FALSE,
latex_options = c("repeat_header")) %>%
kableExtra::column_spec(1, width = "10em")
|                | level          | Overall      |
|----------------|----------------|--------------|
| n              |                | 14000        |
| sex (%)        | F              | 9144 (65.3)  |
|                | M              | 4856 (34.7)  |
| race_cat (%)   | AIAN           | 295 (2.1)    |
|                | API            | 687 (4.9)    |
|                | Black          | 1346 (9.6)   |
|                | Multiple Races | 131 (0.9)    |
|                | White          | 10548 (75.3) |
|                | (Missing)      | 993 (7.1)    |
| ethnic_cat (%) | Hispanic       | 1405 (10.0)  |
|                | NH Other       | 2083 (14.9)  |
|                | NH White       | 9801 (70.0)  |
|                | (Missing)      | 711 (5.1)    |
| lang_cat (%)   | English        | 11920 (85.1) |
|                | Other          | 1442 (10.3)  |
|                | Spanish        | 638 (4.6)    |
| fpl_cat (%)    | <=138% FPL     | 10463 (74.7) |
|                | >138% FPL      | 709 (5.1)    |
|                | (Missing)      | 2828 (20.2)  |
| age_cat (%)    | [19,35)        | 4666 (33.3)  |
|                | [35,51)        | 4666 (33.3)  |
|                | [51,65)        | 4668 (33.3)  |
The table one above shows that there are missing data in the variables race_cat, ethnic_cat, and fpl_cat. When beginning to think about how to handle the missing data in your project, visualization is a great place to begin. The naniar and visdat packages provide helpful plots.
In the figure below, we get an overview of the missing values in the data set. Missing values are shown in black and observed values in gray. We see that only 3 variables in the data set have missing values.
naniar::vis_miss(valdata)
We can also get a numeric percent missing for the variables.
naniar::miss_var_summary(valdata) %>%
head(., n = 10)
# A tibble: 10 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 fpl_cat 2828 20.2
2 race_cat 993 7.09
3 ethnic_cat 711 5.08
4 study_id 0 0
5 sex 0 0
6 age_start 0 0
7 primary_dept 0 0
8 lang_cat 0 0
9 age_cat 0 0
10 elig_cervical 0 0
It can also be useful to look at the missingness plot with the values clustered. This gives a sense of how many values are missing in one row. Many rows with multiple missing values can be problematic when trying to do imputation.
I’m only going to show this plot for a few variables since we saw above that many were completely observed.
valdata %>%
dplyr::select(race_cat, ethnic_cat, fpl_cat, lang_cat, sex, age_cat) %>%
naniar::vis_miss(.,
cluster = TRUE)
The VIM package also has many helpful tools for visualizing missing values. I really like their combination bar plot and aggregation plot.
Here I have filtered the data to just those with missing values; I think that this helps the plot to be more clear. The bar plot on the left shows the proportion of missing values in each variable. The aggregation plot on the right shows the combinations of missing (dark gray) and observed (light gray) values that exist in the data.
valdata %>%
dplyr::select(race_cat, ethnic_cat, fpl_cat) %>%
VIM::aggr(.,
col = c("gray", "gray29"),
numbers = TRUE,
sortVars = TRUE,
labels = names(.),
cex.axis = .7,
gap = 3,
ylab = c("Histogram of missing data","Pattern"))
Variables sorted by number of missings:
Variable Count
fpl_cat 0.20200000
race_cat 0.07092857
ethnic_cat 0.05078571
mssng_pattern <- valdata %>%
dplyr::select(race_cat, ethnic_cat, fpl_cat) %>%
mice::md.pattern(.) %>%
as.data.frame() %>%
tibble::as_tibble(.) %>%
tibble::rownames_to_column(., var = "count") %>%
dplyr::rename(.,
# new = old
num_available = V4)
mssng_pattern %>%
knitr::kable(., escape = FALSE)
| count | ethnic_cat | race_cat | fpl_cat | num_available |
|-------|------------|----------|---------|---------------|
| 1     | 1          | 1        | 1       | 0             |
| 2     | 1          | 1        | 0       | 1             |
| 3     | 1          | 0        | 1       | 1             |
| 4     | 1          | 0        | 0       | 2             |
| 5     | 0          | 1        | 1       | 1             |
| 6     | 0          | 1        | 0       | 2             |
| 7     | 0          | 0        | 1      | 2             |
| 8     | 0          | 0        | 0       | 3             |
| 9     | 711        | 993      | 2828    | 4532          |
So the first row of all 1’s means that all the listed variables are observed; the count to the left shows the number of rows in the data set that fit this description. The last row of all 0’s means that all the variables are missing; similarly, the count on the left shows the number of rows in the data set that are missing all of the variables shown. The rows of 1’s and 0’s in between represent the other combinations of missingness among the variables.
The naniar package has a summary that is a little simpler and gives some of the same information, but not as much detail as md.pattern() in the mice package.
valdata %>%
naniar::miss_case_table(.) %>%
knitr::kable(.)
| n_miss_in_case | n_cases | pct_cases  |
|----------------|---------|------------|
| 0              | 9859    | 70.4214286 |
| 1              | 3758    | 26.8428571 |
| 2              | 375     | 2.6785714  |
| 3              | 8       | 0.0571429  |
mssng_pairs <- valdata %>%
dplyr::select(ethnic_cat, race_cat, fpl_cat) %>%
mice::md.pairs(.)
mssng_pairs
$rr
ethnic_cat race_cat fpl_cat
ethnic_cat 13289 12340 10608
race_cat 12340 13007 10387
fpl_cat 10608 10387 11172
$rm
ethnic_cat race_cat fpl_cat
ethnic_cat 0 949 2681
race_cat 667 0 2620
fpl_cat 564 785 0
$mr
ethnic_cat race_cat fpl_cat
ethnic_cat 0 667 564
race_cat 949 0 785
fpl_cat 2681 2620 0
$mm
ethnic_cat race_cat fpl_cat
ethnic_cat 711 44 147
race_cat 44 993 208
fpl_cat 147 208 2828
Four missingness patterns:

- rr: both variables are observed,
- rm: the first variable is observed, the second is missing,
- mr: the first variable is missing, the second is observed, and
- mm: both variables are missing.
The proportion of usable cases measures how many cases with missing data on the target variable actually have observed values on the predictor. The proportion will be low if both target and predictor are missing on the same cases.
prop_usable_cases <- valdata %>%
dplyr::select(ethnic_cat, race_cat, fpl_cat) %>%
mice::md.pairs(.)
with(prop_usable_cases,
round(mr / (mr + mm), digits = 3))
ethnic_cat race_cat fpl_cat
ethnic_cat 0.000 0.938 0.793
race_cat 0.956 0.000 0.791
fpl_cat 0.948 0.926 0.000
Target on the vertical axis (i.e. left), predictor on the horizontal (i.e. top).
Interpret: of the records missing ethnic_cat, 93.8% have observed information on race_cat and 79.3% have observed information on fpl_cat, and so on.
This gives a sense of which variables may be good to include or exclude in the imputation model. A higher percentage indicates more information and a likely good predictor; a lower percentage indicates that the variables are missing for the same observations and may not make good predictors.
More recent advice on how many imputations to perform suggests a rule of thumb that the number of imputations should be similar to the percentage of cases that are incomplete. So, not just as a part of the EDA, it is important to know the number of incomplete cases to inform the specification of the imputation model later.
An incomplete case is an observation (or row) with at least one missing value. There are a number of ways to get at this information, but the naniar package makes it super easy.
n_missing <- naniar::prop_miss_case(valdata)
n_missing
[1] 0.2957857
So we know that roughly 30% of the observations in the data set have missing values. I round the 29.6% up to 30 to select the number of imputations to perform.
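Under that rule of thumb, the suggested number of imputations can be computed directly; a minimal sketch using the naniar summary above:

# Rule of thumb: number of imputations ~ percentage of incomplete cases
m_suggested <- ceiling(100 * naniar::prop_miss_case(valdata))
m_suggested # 30 with this data set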
Below I show some of the set up that I went through to perform the multiple imputations using the mice package in R. Some resources were absolutely indispensable in my set up and learning:

- the mice package documentation,
- the vignette from the Journal of Statistical Software (December 2011, Volume 45, Issue 3), and
- Flexible Imputation of Missing Data by Stef van Buuren.
Also, since I did this work, more online resources and GitHub material have become available.
For more information about MICE: Multivariate Imputation by Chained Equations, sometimes called Fully Conditional Specification, I highly recommend any of the materials (code or books) or published papers by Stef van Buuren. Much can be found through his website, www.multiple-imputation.com.
Here I am just going to define the number of imputations and the number of iterations to objects for use in later code. This is just a convenience step so that I only have to update these values in one place if I want to change them later.
The default number of imputations in the mice software is 5. Based on the exploration of the missingness above, we saw that the suggested number of imputations is much higher. Here in this example, I am going to keep it set at the default 5, just to limit the computation time if someone wants to run this code on their own.
In practice, my suggestion would be to “tune” the imputation using a lower number like the default setting. Then, once the imputation model is set, perform your final or near-final imputations using the higher number suggested by the recent literature.
As far as iterations go, they tend to converge rather quickly with the mice algorithm. My advice is to run them out long enough to see whether there is convergence or not, while not getting super bogged down with computation time.
Here I am going to continue to use the software default of 5 for our example. In my actual work, I used 20. In your own practice, I would suggest trying 10 to 20.
imp_num <- 5 # number of imputations, dflt = 5
iter_num <- 5 # number of iterations, dflt = 5
We can also tell the mice software to run an “initial” empty imputation. The only effect of this is to give us some objects in R to work with as we go through the steps. See below where I run the initial imputation. Note that the maximum number of iterations (maxit) is set to zero.
init <- mice::mice(valdata, maxit = 0)
meth <- init$method
predM <- init$predictorMatrix
# print(init)
The init object contains lots of information that we will work with as we go forward. Of interest for now, though, is the method selected for each variable. This is the form of the imputation model for the variables to be imputed.
meth
study_id sex age_start primary_dept
"" "" "" ""
ethnic_cat lang_cat race_cat fpl_cat
"polyreg" "" "polyreg" "logreg"
age_cat elig_cervical elig_breast elig_colon
"" "" "" ""
elig_colonoscopy elig_flexsig elig_fobt elig_bmi
"" "" "" ""
elig_flu elig_chlam elig_smoking elig_cholest
"" "" "" ""
dmap_cervical dmap_breast dmap_colon dmap_colonoscopy
"" "" "" ""
dmap_flexsig dmap_fobt dmap_bmi dmap_flu
"" "" "" ""
dmap_chlam dmap_smoking dmap_cholest ehr_cervical
"" "" "" ""
ehr_breast ehr_colon ehr_colonoscopy ehr_flexsig
"" "" "" ""
ehr_fobt ehr_bmi ehr_flu ehr_chlam
"" "" "" ""
ehr_smoking ehr_cholest
"" ""
We see that the software made no choice for the variables without missing data. For those with missing data, based on the type of variable, it makes some default choices. We can override these later.
We also get the default matrix of predictors that the software chose. This is an object of 1’s and 0’s. For those variables with no missing and that we won’t be imputing, the values are zero. Here is a glimpse of just those that we intend to impute.
predM[rowSums(predM) > 0, ]
study_id sex age_start primary_dept ethnic_cat lang_cat
study_id 0 1 1 1 1 1
sex 1 0 1 1 1 1
age_start 1 1 0 1 1 1
primary_dept 1 1 1 0 1 1
ethnic_cat 1 1 1 1 0 1
lang_cat 1 1 1 1 1 0
race_cat 1 1 1 1 1 1
fpl_cat 1 1 1 1 1 1
age_cat 1 1 1 1 1 1
elig_cervical 1 1 1 1 1 1
elig_breast 1 1 1 1 1 1
elig_colon 1 1 1 1 1 1
elig_colonoscopy 1 1 1 1 1 1
elig_flexsig 1 1 1 1 1 1
elig_fobt 1 1 1 1 1 1
elig_bmi 1 1 1 1 1 1
elig_flu 1 1 1 1 1 1
elig_chlam 1 1 1 1 1 1
elig_smoking 1 1 1 1 1 1
elig_cholest 1 1 1 1 1 1
dmap_cervical 1 1 1 1 1 1
dmap_breast 1 1 1 1 1 1
dmap_colon 1 1 1 1 1 1
dmap_colonoscopy 1 1 1 1 1 1
dmap_flexsig 1 1 1 1 1 1
dmap_fobt 1 1 1 1 1 1
dmap_bmi 1 1 1 1 1 1
dmap_flu 1 1 1 1 1 1
dmap_chlam 1 1 1 1 1 1
dmap_smoking 1 1 1 1 1 1
dmap_cholest 1 1 1 1 1 1
ehr_cervical 1 1 1 1 1 1
ehr_breast 1 1 1 1 1 1
ehr_colon 1 1 1 1 1 1
ehr_colonoscopy 1 1 1 1 1 1
ehr_flexsig 1 1 1 1 1 1
ehr_fobt 1 1 1 1 1 1
ehr_bmi 1 1 1 1 1 1
ehr_flu 1 1 1 1 1 1
ehr_chlam 1 1 1 1 1 1
ehr_smoking 1 1 1 1 1 1
ehr_cholest 1 1 1 1 1 1
race_cat fpl_cat age_cat elig_cervical elig_breast elig_colon
study_id 1 1 1 0 1 1
sex 1 1 1 0 1 1
age_start 1 1 1 0 1 1
primary_dept 1 1 1 0 1 1
ethnic_cat 1 1 1 0 1 1
lang_cat 1 1 1 0 1 1
race_cat 0 1 1 0 1 1
fpl_cat 1 0 1 0 1 1
age_cat 1 1 0 0 1 1
elig_cervical 1 1 1 0 1 1
elig_breast 1 1 1 0 0 1
elig_colon 1 1 1 0 1 0
elig_colonoscopy 1 1 1 0 1 1
elig_flexsig 1 1 1 0 1 1
elig_fobt 1 1 1 0 1 1
elig_bmi 1 1 1 0 1 1
elig_flu 1 1 1 0 1 1
elig_chlam 1 1 1 0 1 1
elig_smoking 1 1 1 0 1 1
elig_cholest 1 1 1 0 1 1
dmap_cervical 1 1 1 0 1 1
dmap_breast 1 1 1 0 1 1
dmap_colon 1 1 1 0 1 1
dmap_colonoscopy 1 1 1 0 1 1
dmap_flexsig 1 1 1 0 1 1
dmap_fobt 1 1 1 0 1 1
dmap_bmi 1 1 1 0 1 1
dmap_flu 1 1 1 0 1 1
dmap_chlam 1 1 1 0 1 1
dmap_smoking 1 1 1 0 1 1
dmap_cholest 1 1 1 0 1 1
ehr_cervical 1 1 1 0 1 1
ehr_breast 1 1 1 0 1 1
ehr_colon 1 1 1 0 1 1
ehr_colonoscopy 1 1 1 0 1 1
ehr_flexsig 1 1 1 0 1 1
ehr_fobt 1 1 1 0 1 1
ehr_bmi 1 1 1 0 1 1
ehr_flu 1 1 1 0 1 1
ehr_chlam 1 1 1 0 1 1
ehr_smoking 1 1 1 0 1 1
ehr_cholest 1 1 1 0 1 1
elig_colonoscopy elig_flexsig elig_fobt elig_bmi elig_flu
study_id 0 0 0 0 0
sex 0 0 0 0 0
age_start 0 0 0 0 0
primary_dept 0 0 0 0 0
ethnic_cat 0 0 0 0 0
lang_cat 0 0 0 0 0
race_cat 0 0 0 0 0
fpl_cat 0 0 0 0 0
age_cat 0 0 0 0 0
elig_cervical 0 0 0 0 0
elig_breast 0 0 0 0 0
elig_colon 0 0 0 0 0
elig_colonoscopy 0 0 0 0 0
elig_flexsig 0 0 0 0 0
elig_fobt 0 0 0 0 0
elig_bmi 0 0 0 0 0
elig_flu 0 0 0 0 0
elig_chlam 0 0 0 0 0
elig_smoking 0 0 0 0 0
elig_cholest 0 0 0 0 0
dmap_cervical 0 0 0 0 0
dmap_breast 0 0 0 0 0
dmap_colon 0 0 0 0 0
dmap_colonoscopy 0 0 0 0 0
dmap_flexsig 0 0 0 0 0
dmap_fobt 0 0 0 0 0
dmap_bmi 0 0 0 0 0
dmap_flu 0 0 0 0 0
dmap_chlam 0 0 0 0 0
dmap_smoking 0 0 0 0 0
dmap_cholest 0 0 0 0 0
ehr_cervical 0 0 0 0 0
ehr_breast 0 0 0 0 0
ehr_colon 0 0 0 0 0
ehr_colonoscopy 0 0 0 0 0
ehr_flexsig 0 0 0 0 0
ehr_fobt 0 0 0 0 0
ehr_bmi 0 0 0 0 0
ehr_flu 0 0 0 0 0
ehr_chlam 0 0 0 0 0
ehr_smoking 0 0 0 0 0
ehr_cholest 0 0 0 0 0
elig_chlam elig_smoking elig_cholest dmap_cervical dmap_breast
study_id 1 0 1 1 1
sex 1 0 1 1 1
age_start 1 0 1 1 1
primary_dept 1 0 1 1 1
ethnic_cat 1 0 1 1 1
lang_cat 1 0 1 1 1
race_cat 1 0 1 1 1
fpl_cat 1 0 1 1 1
age_cat 1 0 1 1 1
elig_cervical 1 0 1 1 1
elig_breast 1 0 1 1 1
elig_colon 1 0 1 1 1
elig_colonoscopy 1 0 1 1 1
elig_flexsig 1 0 1 1 1
elig_fobt 1 0 1 1 1
elig_bmi 1 0 1 1 1
elig_flu 1 0 1 1 1
elig_chlam 0 0 1 1 1
elig_smoking 1 0 1 1 1
elig_cholest 1 0 0 1 1
dmap_cervical 1 0 1 0 1
dmap_breast 1 0 1 1 0
dmap_colon 1 0 1 1 1
dmap_colonoscopy 1 0 1 1 1
dmap_flexsig 1 0 1 1 1
dmap_fobt 1 0 1 1 1
dmap_bmi 1 0 1 1 1
dmap_flu 1 0 1 1 1
dmap_chlam 1 0 1 1 1
dmap_smoking 1 0 1 1 1
dmap_cholest 1 0 1 1 1
ehr_cervical 1 0 1 1 1
ehr_breast 1 0 1 1 1
ehr_colon 1 0 1 1 1
ehr_colonoscopy 1 0 1 1 1
ehr_flexsig 1 0 1 1 1
ehr_fobt 1 0 1 1 1
ehr_bmi 1 0 1 1 1
ehr_flu 1 0 1 1 1
ehr_chlam 1 0 1 1 1
ehr_smoking 1 0 1 1 1
ehr_cholest 1 0 1 1 1
dmap_colon dmap_colonoscopy dmap_flexsig dmap_fobt dmap_bmi
study_id 1 1 1 1 1
sex 1 1 1 1 1
age_start 1 1 1 1 1
primary_dept 1 1 1 1 1
ethnic_cat 1 1 1 1 1
lang_cat 1 1 1 1 1
race_cat 1 1 1 1 1
fpl_cat 1 1 1 1 1
age_cat 1 1 1 1 1
elig_cervical 1 1 1 1 1
elig_breast 1 1 1 1 1
elig_colon 1 1 1 1 1
elig_colonoscopy 1 1 1 1 1
elig_flexsig 1 1 1 1 1
elig_fobt 1 1 1 1 1
elig_bmi 1 1 1 1 1
elig_flu 1 1 1 1 1
elig_chlam 1 1 1 1 1
elig_smoking 1 1 1 1 1
elig_cholest 1 1 1 1 1
dmap_cervical 1 1 1 1 1
dmap_breast 1 1 1 1 1
dmap_colon 0 1 1 1 1
dmap_colonoscopy 1 0 1 1 1
dmap_flexsig 1 1 0 1 1
dmap_fobt 1 1 1 0 1
dmap_bmi 1 1 1 1 0
dmap_flu 1 1 1 1 1
dmap_chlam 1 1 1 1 1
dmap_smoking 1 1 1 1 1
dmap_cholest 1 1 1 1 1
ehr_cervical 1 1 1 1 1
ehr_breast 1 1 1 1 1
ehr_colon 1 1 1 1 1
ehr_colonoscopy 1 1 1 1 1
ehr_flexsig 1 1 1 1 1
ehr_fobt 1 1 1 1 1
ehr_bmi 1 1 1 1 1
ehr_flu 1 1 1 1 1
ehr_chlam 1 1 1 1 1
ehr_smoking 1 1 1 1 1
ehr_cholest 1 1 1 1 1
dmap_flu dmap_chlam dmap_smoking dmap_cholest ehr_cervical
study_id 1 1 1 1 1
sex 1 1 1 1 1
age_start 1 1 1 1 1
primary_dept 1 1 1 1 1
ethnic_cat 1 1 1 1 1
lang_cat 1 1 1 1 1
race_cat 1 1 1 1 1
fpl_cat 1 1 1 1 1
age_cat 1 1 1 1 1
elig_cervical 1 1 1 1 1
elig_breast 1 1 1 1 1
elig_colon 1 1 1 1 1
elig_colonoscopy 1 1 1 1 1
elig_flexsig 1 1 1 1 1
elig_fobt 1 1 1 1 1
elig_bmi 1 1 1 1 1
elig_flu 1 1 1 1 1
elig_chlam 1 1 1 1 1
elig_smoking 1 1 1 1 1
elig_cholest 1 1 1 1 1
dmap_cervical 1 1 1 1 1
dmap_breast 1 1 1 1 1
dmap_colon 1 1 1 1 1
dmap_colonoscopy 1 1 1 1 1
dmap_flexsig 1 1 1 1 1
dmap_fobt 1 1 1 1 1
dmap_bmi 1 1 1 1 1
dmap_flu 0 1 1 1 1
dmap_chlam 1 0 1 1 1
dmap_smoking 1 1 0 1 1
dmap_cholest 1 1 1 0 1
ehr_cervical 1 1 1 1 0
ehr_breast 1 1 1 1 1
ehr_colon 1 1 1 1 1
ehr_colonoscopy 1 1 1 1 1
ehr_flexsig 1 1 1 1 1
ehr_fobt 1 1 1 1 1
ehr_bmi 1 1 1 1 1
ehr_flu 1 1 1 1 1
ehr_chlam 1 1 1 1 1
ehr_smoking 1 1 1 1 1
ehr_cholest 1 1 1 1 1
ehr_breast ehr_colon ehr_colonoscopy ehr_flexsig ehr_fobt
study_id 1 1 1 1 1
sex 1 1 1 1 1
age_start 1 1 1 1 1
primary_dept 1 1 1 1 1
ethnic_cat 1 1 1 1 1
lang_cat 1 1 1 1 1
race_cat 1 1 1 1 1
fpl_cat 1 1 1 1 1
age_cat 1 1 1 1 1
elig_cervical 1 1 1 1 1
elig_breast 1 1 1 1 1
elig_colon 1 1 1 1 1
elig_colonoscopy 1 1 1 1 1
elig_flexsig 1 1 1 1 1
elig_fobt 1 1 1 1 1
elig_bmi 1 1 1 1 1
elig_flu 1 1 1 1 1
elig_chlam 1 1 1 1 1
elig_smoking 1 1 1 1 1
elig_cholest 1 1 1 1 1
dmap_cervical 1 1 1 1 1
dmap_breast 1 1 1 1 1
dmap_colon 1 1 1 1 1
dmap_colonoscopy 1 1 1 1 1
dmap_flexsig 1 1 1 1 1
dmap_fobt 1 1 1 1 1
dmap_bmi 1 1 1 1 1
dmap_flu 1 1 1 1 1
dmap_chlam 1 1 1 1 1
dmap_smoking 1 1 1 1 1
dmap_cholest 1 1 1 1 1
ehr_cervical 1 1 1 1 1
ehr_breast 0 1 1 1 1
ehr_colon 1 0 1 1 1
ehr_colonoscopy 1 1 0 1 1
ehr_flexsig 1 1 1 0 1
ehr_fobt 1 1 1 1 0
ehr_bmi 1 1 1 1 1
ehr_flu 1 1 1 1 1
ehr_chlam 1 1 1 1 1
ehr_smoking 1 1 1 1 1
ehr_cholest 1 1 1 1 1
ehr_bmi ehr_flu ehr_chlam ehr_smoking ehr_cholest
study_id 1 1 1 1 1
sex 1 1 1 1 1
age_start 1 1 1 1 1
primary_dept 1 1 1 1 1
ethnic_cat 1 1 1 1 1
lang_cat 1 1 1 1 1
race_cat 1 1 1 1 1
fpl_cat 1 1 1 1 1
age_cat 1 1 1 1 1
elig_cervical 1 1 1 1 1
elig_breast 1 1 1 1 1
elig_colon 1 1 1 1 1
elig_colonoscopy 1 1 1 1 1
elig_flexsig 1 1 1 1 1
elig_fobt 1 1 1 1 1
elig_bmi 1 1 1 1 1
elig_flu 1 1 1 1 1
elig_chlam 1 1 1 1 1
elig_smoking 1 1 1 1 1
elig_cholest 1 1 1 1 1
dmap_cervical 1 1 1 1 1
dmap_breast 1 1 1 1 1
dmap_colon 1 1 1 1 1
dmap_colonoscopy 1 1 1 1 1
dmap_flexsig 1 1 1 1 1
dmap_fobt 1 1 1 1 1
dmap_bmi 1 1 1 1 1
dmap_flu 1 1 1 1 1
dmap_chlam 1 1 1 1 1
dmap_smoking 1 1 1 1 1
dmap_cholest 1 1 1 1 1
ehr_cervical 1 1 1 1 1
ehr_breast 1 1 1 1 1
ehr_colon 1 1 1 1 1
ehr_colonoscopy 1 1 1 1 1
ehr_flexsig 1 1 1 1 1
ehr_fobt 1 1 1 1 1
ehr_bmi 0 1 1 1 1
ehr_flu 1 0 1 1 1
ehr_chlam 1 1 0 1 1
ehr_smoking 1 1 1 0 1
ehr_cholest 1 1 1 1 0
Here I will follow the 7 steps that van Buuren suggests in order to set up the algorithm. See his writings for more details than I will go into here.
In this example, we assigned the missing values at random, so the assumption is reasonable more or less by construction. In practice, though, this can be challenging to know for sure, which is why exploring the data and the missingness, as I showed above, is such an important step.
Assuming MAR is typically a reasonable place to start. There is literature on sensitivity analyses with the imputations to examine whether this assumption is met, and there are techniques to model the missingness mechanism within the imputation if it is violated. That work is outside the scope of what I hope to share here.
We want to decide the form of the model used to impute the missing values of each variable. This can be specified on a variable-by-variable basis. We saw above from the meth object that the software made default decisions for us:

- FPL – logistic regression (logreg), for a factor with 2 levels.
- Race – multinomial logit regression (polyreg), for a factor with > 2 levels.
- Ethnicity – multinomial logit regression (polyreg), for a factor with > 2 levels.
I am going to overwrite those, just to show how it is done. By overwriting the meth object, we can force the algorithm to use these choices later.
meth[c("ethnic_cat")] <- "polyreg"
meth[c("race_cat")] <- "polyreg"
meth[c("fpl_cat")] <- "logreg"
meth
study_id sex age_start primary_dept
"" "" "" ""
ethnic_cat lang_cat race_cat fpl_cat
"polyreg" "" "polyreg" "logreg"
age_cat elig_cervical elig_breast elig_colon
"" "" "" ""
elig_colonoscopy elig_flexsig elig_fobt elig_bmi
"" "" "" ""
elig_flu elig_chlam elig_smoking elig_cholest
"" "" "" ""
dmap_cervical dmap_breast dmap_colon dmap_colonoscopy
"" "" "" ""
dmap_flexsig dmap_fobt dmap_bmi dmap_flu
"" "" "" ""
dmap_chlam dmap_smoking dmap_cholest ehr_cervical
"" "" "" ""
ehr_breast ehr_colon ehr_colonoscopy ehr_flexsig
"" "" "" ""
ehr_fobt ehr_bmi ehr_flu ehr_chlam
"" "" "" ""
ehr_smoking ehr_cholest
"" ""
What variables to include in the multiple imputation model?
The advice is to include as many relevant variables as possible. One should include all the variables in the scientific model of interest that will be used after imputation, as well as variables related to the missingness of the variables being imputed. Van Buuren has more advice here.
Including as many predictors as possible makes the MAR assumption more reasonable. But with larger data sets, this is not advisable for computation purposes. Van Buuren suggests that 15 to 25 variables will work well. He also offers advice to cull that list.
My case is interesting. I am not doing modelling; I am calculating scalar statistics of agreement. Also, my data set isn’t really too large (41 variables once you ignore study ID which isn’t too important for imputation purposes).
To aid in these decisions, the mice package has a function that produces a “quick predictor matrix,” which is useful for dealing with data sets with a large number of variables. The software chooses predictors by calculating two correlations with the available cases, taking the larger, and seeing whether it meets a minimum threshold. Type ?quickpred in the R console for a fuller description.
Below I run quickpred() to see what the software chooses, showing the matrix only for rows and columns where at least one predictor was selected.
predGuess <- valdata %>%
mice::quickpred(.)
predGuess[rowSums(predGuess) > 0, colSums(predGuess) > 0]
<0 x 0 matrix>
Hmm. As I am working through this example with the simulated data, the software did not choose any predictors. In my actual work, it discovered about 2 to 3 important predictors for each variable.

Since we randomly generated each variable independently, there should not be high correlations between them. So, as surprised as I was at first, this really does make sense.
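If the defaults feel too strict, quickpred() can be tuned; for example, lowering the minimum correlation threshold (the mincor argument, which defaults to 0.1) lets weaker predictors in. A quick sketch:

# Loosen the minimum correlation threshold (the default is mincor = 0.1)
predGuess_loose <- valdata %>%
  mice::quickpred(., mincor = 0.05)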
In my actual work, I went to the lead investigator for insight into the data set and suggestions on which variables would be informative. The code chunk below shows how I took the list that they provided and modified the predM object.
# Store the names of the variables in an object
var_names <- dput(names(valdata))
c("study_id", "sex", "age_start", "primary_dept", "ethnic_cat",
"lang_cat", "race_cat", "fpl_cat", "age_cat", "elig_cervical",
"elig_breast", "elig_colon", "elig_colonoscopy", "elig_flexsig",
"elig_fobt", "elig_bmi", "elig_flu", "elig_chlam", "elig_smoking",
"elig_cholest", "dmap_cervical", "dmap_breast", "dmap_colon",
"dmap_colonoscopy", "dmap_flexsig", "dmap_fobt", "dmap_bmi",
"dmap_flu", "dmap_chlam", "dmap_smoking", "dmap_cholest", "ehr_cervical",
"ehr_breast", "ehr_colon", "ehr_colonoscopy", "ehr_flexsig",
"ehr_fobt", "ehr_bmi", "ehr_flu", "ehr_chlam", "ehr_smoking",
"ehr_cholest")
# create another vector of the names selected by the investigator
pi_list <-
c("fpl_cat", "race_cat", "ethnic_cat", "lang_cat", "age_start", "sex",
"primary_dept", "ehr_cervical", "ehr_breast", "ehr_colon",
"ehr_colonoscopy", "dmap_breast", "dmap_colonoscopy", "ehr_cholest",
"dmap_cholest", "elig_cholest", "ehr_flexsig", "ehr_fobt", "ehr_bmi",
"ehr_flu", "ehr_chlam", "ehr_smoking", "dmap_cervical", "dmap_colon",
"dmap_flexsig", "dmap_fobt", "dmap_bmi", "dmap_flu", "dmap_chlam",
"elig_cervical", "elig_breast", "elig_colon", "elig_bmi", "elig_flu",
"elig_chlam", "elig_smoking")
Note that the investigator had me exclude the variables elig_colonoscopy, elig_flexsig, and elig_fobt because they carry exactly the same information as elig_colon: the eligibility for all of these screenings is the same.
We also did not include dmap_smoking because there was little to no information there.
Also, we included age_start as a continuous variable and did not include the categorical version of this variable, age_cat, hoping to get more information.
# Make a vector of the variable names that we want to include.
vars_to_include <- var_names[(var_names %in% pi_list)]
# adjust the default prediction matrix for the variables we want to include
pred <- predM # Set equal to the original pred matrix
pred[, ] <- 0 # change to all zeroes to clean it out
pred[, vars_to_include] <- 1 # Set to 1 for variables that we want
diag(pred) <- 0 # set the diagonal to zero (can't predict itself)
# take a glimpse
head(pred[, 1:10], n = 10)
study_id sex age_start primary_dept ethnic_cat lang_cat race_cat
study_id 0 1 1 1 1 1 1
sex 0 0 1 1 1 1 1
age_start 0 1 0 1 1 1 1
primary_dept 0 1 1 0 1 1 1
ethnic_cat 0 1 1 1 0 1 1
lang_cat 0 1 1 1 1 0 1
race_cat 0 1 1 1 1 1 0
fpl_cat 0 1 1 1 1 1 1
age_cat 0 1 1 1 1 1 1
elig_cervical 0 1 1 1 1 1 1
fpl_cat age_cat elig_cervical
study_id 1 0 1
sex 1 0 1
age_start 1 0 1
primary_dept 1 0 1
ethnic_cat 1 0 1
lang_cat 1 0 1
race_cat 1 0 1
fpl_cat 0 0 1
age_cat 1 0 1
elig_cervical 1 0 0
Note that there are 1s in the matrix for variables that we will not be imputing. Since those variables are never imputed, this has no impact.
In the actual work, I made 3 scenarios:
- Full list of variables chosen by the investigator
- Reduced list of about half the number
- Very reduced list based on the defaults from the software (2-3)
Here in this example, to keep it simple, I will just work with the full list. But the other scenarios could be created by adjusting the predictor matrix similar to above, as sketched below.
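For example, a reduced-list scenario could be set up like this (the particular subset of variables below is hypothetical, just to show the mechanics):

# A sketch of a reduced predictor list (hypothetical subset of variables)
pred_reduced <- predM
pred_reduced[, ] <- 0
pred_reduced[, c("sex", "age_start", "lang_cat",
                 "race_cat", "ethnic_cat", "fpl_cat")] <- 1
diag(pred_reduced) <- 0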
Transformations, sum scores, etc., were not used in this data set, so there is not much to consider here. In some cases there can be a lot to think about, particularly if a variable is transformed solely to meet the normality assumption in a regression model. So do a literature review if this issue applies to you.
In my example, I mentioned above that I am including continuous age_start instead of the categorical age. fpl_cat is a variable derived from a continuous value that had missingness. We considered including the continuous value, but decided that the binary fpl_cat is more relevant to our question.
The default in the software goes by appearance in the data set, left to right. It can be overwritten per the user's direction. This becomes more of an issue when the data set has a longitudinal nature, where missingness at an earlier time point would affect the missingness later; in that case, impute from early to late.
I examined the imputation order by magnitude of missingness: low to high and high to low. There did not seem to be a difference in performance or convergence, nor an impact on the estimates. In the actual work, we decided to impute from highest percent missing to lowest.
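One way to derive such an order programmatically is from the missingness summary we computed earlier; a minimal sketch:

# Visit sequence ordered from highest to lowest percent missing
miss_order <- naniar::miss_var_summary(valdata) %>%
  dplyr::filter(n_miss > 0) %>%
  dplyr::arrange(dplyr::desc(pct_miss)) %>%
  dplyr::pull(variable)
miss_order # c("fpl_cat", "race_cat", "ethnic_cat")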
The code init$visitSequence gives the default order in which the variables will be visited for imputation, as chosen by the software. It is shown in the next chunk.
init$visitSequence
[1] "study_id" "sex" "age_start" "primary_dept"
[5] "ethnic_cat" "lang_cat" "race_cat" "fpl_cat"
[9] "age_cat" "elig_cervical" "elig_breast" "elig_colon"
[13] "elig_colonoscopy" "elig_flexsig" "elig_fobt" "elig_bmi"
[17] "elig_flu" "elig_chlam" "elig_smoking" "elig_cholest"
[21] "dmap_cervical" "dmap_breast" "dmap_colon" "dmap_colonoscopy"
[25] "dmap_flexsig" "dmap_fobt" "dmap_bmi" "dmap_flu"
[29] "dmap_chlam" "dmap_smoking" "dmap_cholest" "ehr_cervical"
[33] "ehr_breast" "ehr_colon" "ehr_colonoscopy" "ehr_flexsig"
[37] "ehr_fobt" "ehr_bmi" "ehr_flu" "ehr_chlam"
[41] "ehr_smoking" "ehr_cholest"
I override this with the following to order things as: FPL -> Race -> Ethnicity.
visit_order <- c("fpl_cat",
"race_cat",
"ethnic_cat")
visit_order
[1] "fpl_cat" "race_cat" "ethnic_cat"
This is to ensure convergence. The default is 5; 10 to 20 iterations are recommended. In actual practice, we chose 20. Here I will keep it set at 5, as defined above, to keep the computational strain low.
iter_num
[1] 5
The rule of thumb from more recent authors is that the number of imputations should be similar to the percentage of cases (observations) that are incomplete (at least 5).
The software default is 5 and we will use that for this example to keep computation time low. We had set this previously above.
imp_num
[1] 5
All the setup work has been done and the considerations made. Using the specifications that we saved in objects above, we will run the mice command to impute the data sets.

Note that I wrap this in system.time() to see how long it runs.
system.time(
imp_full <-
mice::mice(data = valdata,
m = imp_num, # number of imputations, dflt = 5
method = meth, # specify the method
predictorMatrix = pred,
visitSequence = visit_order,
seed = seed_for_imp,
maxit = iter_num, # number of iterations, dflt = 5
print = FALSE
)
)
user system elapsed
229.679 10.563 243.362
That’s elapsed time in seconds.
We plot the values of the mean and standard deviation for each imputed variable by iteration number. We want to see the lines mixing well, and we do not want to see any trend.
plot(imp_full, c("ethnic_cat", "race_cat", "fpl_cat"))
Now that we have our imputed data sets, we want to analyze each of them and then pool the results.
When pooling results, the mice software has tools to help, whether pooling model results or scalar estimates. Things get messy quickly because I am calculating a number of agreement statistics, for 11 procedures, with 7 categorical variables, and each categorical variable has a number of levels.
When I first did these steps, I wrote many nested for loops. While this worked and it ran, it was slow and a little bit of a black box. A wise person once said that if you find yourself writing nested for loops in R, then you should consider using the functions from the purrr package. They were right. I didn’t compare speed with the previous approach, but the way I present here has clear points where you can check that things make sense. That advantage makes the choice clear for me.
The first step in wrangling the results is to get the separate imputed data sets into one data frame with a variable to track which imputation number the data came from.
# Create an empty list
mylist <- list()
# Put the imputed sets into the list
for (i in 1:imp_full$m) {
mylist[[i]] <- mice::complete(imp_full, i)
}
# mylist[[1]]
# Take the list and stack the data sets into a data frame
# with an indication for the number of imputed data set
output <- dplyr::bind_rows(mylist, .id = "n_imp")
# This would do the same thing, maybe faster, but the above has the added
# benefit of easily adding the number of the imputation so that I can track
# things. Either way, I chose the one above as my preference.
# df <- do.call(rbind, mylist)
# df
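# Another option (an aside, not how I did it originally): mice::complete()
# can return all the imputations stacked at once, with .imp and .id columns
# tracking the imputation number:
# output_alt <- mice::complete(imp_full, action = "long")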
# Check it out that the dimensions make sense and spot check the data
dplyr::glimpse(output)
Observations: 70,000
Variables: 43
$ n_imp <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
$ study_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ sex <fct> F, F, M, M, F, M, M, M, M, F, M, F, F, F, M, M, M, F…
$ age_start <int> 44, 22, 34, 47, 36, 50, 42, 29, 27, 38, 48, 63, 42, …
$ primary_dept <fct> 021, 031, 039, 020, 015, 005, 026, 008, 005, 024, 03…
$ ethnic_cat <fct> NH White, NH White, NH White, NH White, NH White, NH…
$ lang_cat <fct> English, English, English, English, English, English…
$ race_cat <fct> Black, Black, White, White, White, White, White, Whi…
$ fpl_cat <fct> <=138% FPL, <=138% FPL, >138% FPL, <=138% FPL, <=138…
$ age_cat <fct> "[35,51)", "[19,35)", "[19,35)", "[35,51)", "[35,51)…
$ elig_cervical <fct> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1…
$ elig_breast <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1…
$ elig_colon <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
$ elig_colonoscopy <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
$ elig_flexsig <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
$ elig_fobt <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
$ elig_bmi <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ elig_flu <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
$ elig_chlam <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ elig_smoking <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ elig_cholest <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dmap_cervical <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ dmap_breast <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1…
$ dmap_colon <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ dmap_colonoscopy <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ dmap_flexsig <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ dmap_fobt <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
$ dmap_bmi <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ dmap_flu <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0…
$ dmap_chlam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ dmap_smoking <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ dmap_cholest <fct> 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0…
$ ehr_cervical <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_breast <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_colon <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ ehr_colonoscopy <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_flexsig <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_fobt <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_bmi <fct> 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1…
$ ehr_flu <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1…
$ ehr_chlam <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ehr_smoking <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ehr_cholest <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
# dim(output)
# head(output)
# tail(output)
Then I gather the data into a tidy format so that I can group by category, imputation number, or whatever I want, really. I employ a user-written function because it’s a little bit of work to do this.
source(here::here("src", "make_df_long.R"))
output_long <- make_df_long(df = output)
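The real function lives in src/make_df_long.R. A rough sketch of the idea (a hypothetical implementation, showing only a few of the demographic categories): pivot the elig_/dmap_/ehr_ columns so that each row is one person-procedure pair, then stack one copy of the data per demographic category.

# Hypothetical sketch of the reshaping; the real code is in
# src/make_df_long.R
make_df_long_sketch <- function(df) {
  long <- df %>%
    tidyr::pivot_longer(
      cols = dplyr::matches("^(elig|dmap|ehr)_"),
      names_to = c(".value", "proc"),
      names_sep = "_"
    )
  dplyr::bind_rows(
    dplyr::mutate(long, cat = "all", level = "all"),
    dplyr::mutate(long, cat = "sex", level = as.character(sex)),
    dplyr::mutate(long, cat = "race", level = as.character(race_cat))
    # ... and similarly for ethnicity, language, fpl, and age
  )
}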
We can check that this works by grouping by category and checking that the sums and counts of some of the variables match.
output_long %>%
group_by(cat) %>%
summarise(
n = n(),
ELIG = sum(as.numeric(elig), na.rm = TRUE),
EHR = sum(as.numeric(ehr), na.rm = TRUE),
DMAP = sum(as.numeric(dmap), na.rm = TRUE),
id = sum(as.numeric(study_id), na.rm = TRUE),
age = sum(age_start, na.rm = TRUE)
)
# A tibble: 7 x 7
cat n ELIG EHR DMAP id age
<chr> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 age 770000 409795 208525 84905 5390385000 32458360
2 all 770000 409795 208525 84905 5390385000 32458360
3 ethnicity 770000 409795 208525 84905 5390385000 32458360
4 fpl 770000 409795 208525 84905 5390385000 32458360
5 language 770000 409795 208525 84905 5390385000 32458360
6 race 770000 409795 208525 84905 5390385000 32458360
7 sex 770000 409795 208525 84905 5390385000 32458360
I now want the data to be grouped by (1) procedure, (2) category, (3) level, and (4) imputation number.
output_nested <- output_long %>%
group_by(proc, cat, level, n_imp) %>%
tidyr::nest()
output_nested %>%
head(., n = 10)
# A tibble: 10 x 5
# Groups: n_imp, proc, cat, level [10]
n_imp proc cat level data
<chr> <chr> <chr> <chr> <list>
1 1 cervical all all <tibble [14,000 × 6]>
2 2 cervical all all <tibble [14,000 × 6]>
3 3 cervical all all <tibble [14,000 × 6]>
4 4 cervical all all <tibble [14,000 × 6]>
5 5 cervical all all <tibble [14,000 × 6]>
6 1 breast all all <tibble [14,000 × 6]>
7 2 breast all all <tibble [14,000 × 6]>
8 3 breast all all <tibble [14,000 × 6]>
9 4 breast all all <tibble [14,000 × 6]>
10 5 breast all all <tibble [14,000 × 6]>
# df_nested$data[[1]]
# head(df_nested$data[[1]])
# tail(df_nested$data[[1]])
We now have a nested column of the data by the groups and imputations that we want. Next, we need to calculate two columns: Q, the estimates, and U, their standard errors. These rely on user-written functions to make the process automated and efficient.
source(here::here("src", "calc-stats.R"))
output_nested <- output_nested %>%
mutate(Q = purrr::map(.x = data, .f = calc_stats_p),
U = purrr::map(.x = data, .f = calc_stats_se))
output_nested %>%
head(., n = 10)
# A tibble: 10 x 7
# Groups: n_imp, proc, cat, level [10]
n_imp proc cat level data Q U
<chr> <chr> <chr> <chr> <list> <list> <list>
1 1 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
2 2 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
3 3 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
4 4 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
5 5 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
6 1 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
7 2 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
8 3 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
9 4 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
10 5 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
# output_nested$Q[[1]]
# output_nested$U[[1]]
Next, we want to unnest the results from calculating the statistics and the standard errors.
output_nested <- output_nested %>%
tidyr::unnest(c(Q, U)) %>%
mutate_at(.vars = vars(Q, U),
.funs = list(~ as.numeric(.)))
Error: Column name `stat` must not be duplicated.
Use .name_repair to specify repair.
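This error comes from a newer version of tidyr complaining that the Q and U data frames share a column name. If you hit it, one possible workaround (an assumption on my part: the clash is the shared stat column) is to let unnest() repair the names:

# Possible workaround sketch for newer tidyr versions
# output_nested <- output_nested %>%
#   tidyr::unnest(c(Q, U), names_repair = "universal")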
output_nested %>%
head(., n = 10)
# A tibble: 10 x 7
# Groups: n_imp, proc, cat, level [10]
n_imp proc cat level data Q U
<chr> <chr> <chr> <chr> <list> <list> <list>
1 1 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
2 2 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
3 3 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
4 4 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
5 5 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
6 1 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
7 2 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
8 3 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
9 4 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
10 5 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
We want to re-group_by the data by procedure, category, level, and type of statistic. This step gives us the nested estimates and variances for all the imputed sets, which we can now pool.
output_nested <- output_nested %>%
group_by(proc, cat, level, stat) %>%
tidyr::nest()
Error: Column `stat` is unknown
Pool and unnest. Note that my user-written function mi_pool is really just a wrapper for the pool.scalar() function in the mice package.
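A minimal sketch of what such a wrapper might look like (hypothetical; it assumes each nested data frame carries numeric Q and U columns, one row per imputation):

# Hypothetical sketch of a wrapper around mice::pool.scalar()
mi_pool_sketch <- function(df) {
  pooled <- mice::pool.scalar(Q = df$Q, U = df$U)
  data.frame(qbar = pooled$qbar, ubar = pooled$ubar,
             b = pooled$b, t = pooled$t)
}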
output_nested <- output_nested %>%
mutate(pooled = purrr::map(.x = data,
.f = ~ mi_pool(.x)))
Error in var(Q): 'x' is NULL
output_nested %>%
head(., n = 10)
# A tibble: 10 x 7
# Groups: n_imp, proc, cat, level [10]
n_imp proc cat level data Q U
<chr> <chr> <chr> <chr> <list> <list> <list>
1 1 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
2 2 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
3 3 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
4 4 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
5 5 cervical all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
6 1 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
7 2 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
8 3 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
9 4 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
10 5 breast all all <tibble [14,000 ×… <df[,2] [23 × … <df[,2] [23 × …
Unnesting the pooled results, we see that for each procedure, by category, by strata, we have the results that the pooling in mice provides for each statistic.
output_nested <- output_nested %>%
tidyr::unnest(pooled)
Error: Can't subset columns that don't exist.
�[31mx�[39m The column `pooled` doesn't exist.
output_nested %>%
dplyr::select(-data) %>%
dplyr::glimpse(.)
Observations: 1,045
Variables: 6
Groups: n_imp, proc, cat, level [1,045]
$ n_imp <chr> "1", "2", "3", "4", "5", "1", "2", "3", "4", "5", "1", "2", "3"…
$ proc <chr> "cervical", "cervical", "cervical", "cervical", "cervical", "br…
$ cat <chr> "all", "all", "all", "all", "all", "all", "all", "all", "all", …
$ level <chr> "all", "all", "all", "all", "all", "all", "all", "all", "all", …
$ Q <list> [<data.frame[23 x 2]>, <data.frame[23 x 2]>, <data.frame[23 x …
$ U <list> [<data.frame[23 x 2]>, <data.frame[23 x 2]>, <data.frame[23 x …
The value qbar is the pooled post-imputation estimate that we are most interested in. See the mice documentation for information on the rest. So if we just want to get those results for all the agreement statistics:
Annoyingly, the variables are in alphabetical order when they are unnested, so I add an extra select command to order them.
output_nested %>%
dplyr::select(proc:stat, qbar) %>%
tidyr::spread(., key = stat, value = qbar) %>%
dplyr::select(
proc, cat, level,
n, a, b, c, d,
EHR.n, EHR.p, DMAP.n, DMAP.p, Combo.n, Combo.p,
Po, Pe,
Ppos, Pneg, sens, spec,
kap, Kmax, Kmin,
BI, PI, PABAK
) %>%
knitr::kable(
booktabs = TRUE,
digits = 2
)
Error: This tidyselect interface doesn't support predicates yet.
ℹ Contact the package author and suggest using `eval_select()`.
Well, the procedures and the categories should be arranged differently for formal presentation. But that’s pretty straightforward, so I hope you forgive me for stopping here. Thanks for reading!
I saved a start-time object above when this all started to run. Now I want to see how long it took.
end_time <- Sys.time()
lubridate::as.period(
lubridate::interval(start_time, end_time)
)
[1] "5M 5.79863500595093S"