There are several steps in an analysis of missing data. Initially,
users must get their data into R. There are several ways to do so,
including the read.table
, read.csv
,
read.fwf
functions plus several functions in the
foreign package. All of these functions will generate a
data.frame
, which is a bit like a spreadsheet of data. http://cran.r-project.org/doc/manuals/R-data.html for
more information.
From there, the first step is to convert the data.frame
to a missing_data.frame
, which is an enhanced version of a
data.frame
that includes metadata about the variables that
is essential in a missing data context.
## NOTE: In the following pairs of variables, the missingness pattern of the first is a subset of the second.
## Please verify whether they are in fact logically distinct variables.
## [,1] [,2]
## [1,] "b.marr" "income"
The missing_data.frame
constructor function creates a
missing_data.frame
called mdf
, which in turn
contains seven missing_variable
s, one for each column of
the nlsyV
dataset.
The most important aspect of a missing_variable
is its
class, such as continuous
, binary
, and
count
among many others (see the table in the Slots section
of the help page for missing_variable-class
. The
missing_data.frame
constructor function will try to guess
the appropriate class for each missing_variable
, but rarely
will it correspond perfectly to the user’s intent. Thus, it is very
important to call the show
method on a
missing_data.frame
to see the initial guesses
## Object of class missing_data.frame with 400 observations on 7 variables
##
## There are 20 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
## type missing method model
## ppvtr.36 continuous 75 ppd linear
## first binary 0 <NA> <NA>
## b.marr binary 12 ppd logit
## income continuous 82 ppd linear
## momage continuous 0 <NA> <NA>
## momed ordered-categorical 40 ppd ologit
## momrace ordered-categorical 117 ppd ologit
##
## family link transformation
## ppvtr.36 gaussian identity standardize
## first <NA> <NA> <NA>
## b.marr binomial logit <NA>
## income gaussian identity standardize
## momage <NA> <NA> standardize
## momed multinomial logit <NA>
## momrace multinomial logit <NA>
and to modify them, if necessary, using the change
function, which can be used to change many things about
amissing_variable
, so see its help page for more details.
In the example below, we change the class of the momrace (race
of the mother) variable from the initial guess of
ordered-categorical
to a more appropriate
unordered-categorical
and change the income
nonnegative-continuous
.
## NOTE: In the following pairs of variables, the missingness pattern of the first is a subset of the second.
## Please verify whether they are in fact logically distinct variables.
## [,1] [,2]
## [1,] "b.marr" "income"
## Object of class missing_data.frame with 400 observations on 7 variables
##
## There are 20 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
## type missing method model
## ppvtr.36 continuous 75 ppd linear
## first binary 0 <NA> <NA>
## b.marr binary 12 ppd logit
## income nonnegative-continuous 82 ppd linear
## income:is_zero binary 82 ppd logit
## momage continuous 0 <NA> <NA>
## momed ordered-categorical 40 ppd ologit
## momrace unordered-categorical 117 ppd mlogit
##
## family link transformation
## ppvtr.36 gaussian identity standardize
## first <NA> <NA> <NA>
## b.marr binomial logit <NA>
## income gaussian identity logshift
## income:is_zero binomial logit <NA>
## momage <NA> <NA> standardize
## momed multinomial logit <NA>
## momrace multinomial logit <NA>
Once all of the missing_variable
s are set appropriately,
it is useful to get a sense of the raw data, which can be accomplished
by looking at the summary
, image
, and / or
hist
of a missing_data.frame
## ppvtr.36 first b.marr
## Min. : 41.00 Min. :0.000 Min. :0.0000
## 1st Qu.: 74.00 1st Qu.:0.000 1st Qu.:0.0000
## Median : 87.00 Median :0.000 Median :1.0000
## Mean : 85.94 Mean :0.435 Mean :0.7062
## 3rd Qu.: 99.00 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :132.00 Max. :1.000 Max. :1.0000
## NA's :75 NA's :12
## income momage momed momrace
## Min. : 0 Min. :16.00 Min. :1.000 1 : 55
## 1st Qu.: 8590 1st Qu.:22.00 1st Qu.:1.000 2 : 80
## Median : 17906 Median :24.00 Median :2.000 3 :148
## Mean : 32041 Mean :23.75 Mean :2.042 NA's:117
## 3rd Qu.: 31228 3rd Qu.:26.00 3rd Qu.:3.000
## Max. :1057448 Max. :32.00 Max. :4.000
## NA's :82 NA's :40
Next we use the mi
function to do the actual imputation,
which has several extra arguments that, for example, govern how many
independent chains to utilize, how many iterations to conduct, and the
maximum amount of time the user is willing to wait for all the
iterations of all the chains to finish. The imputation step can be quite
time consuming, particularly if there are many
missing_variable
s and if many of them are categorical. One
important way in which the computation time can be reduced is by
imputing in parallel, which is highly recommended and is implemented in
the mi function by default on non-Windows machines. If users encounter
problems running mi
with parallel processing, the problems
are likely due to the machine exceeding available RAM. Sequential
processing can be used instead for mi
by using the
parallel=FALSE
option.
rm(nlsyV) # good to remove large unnecessary objects to save RAM
options(mc.cores = 2)
imputations <- mi(mdf, n.iter = 30, n.chains = 4, max.minutes = 20)
show(imputations)
## Object of class mi with 4 chains, each with 30 iterations.
## Each chain is the evolution of an object of missing_data.frame class with 400 observations on 7 variables.
The next step is very important and essentially verifies whether enough iterations were conducted. We want the mean of each completed variable to be roughly the same for each of the 4 chains.
## chain:1 chain:2 chain:3 chain:4
## ppvtr.36 0.005 -0.007 0.003 0.011
## first 1.435 1.435 1.435 1.435
## b.marr 1.685 1.685 1.690 1.685
## income 9.566 9.538 9.555 9.556
## momage 0.000 0.000 0.000 0.000
## momed 2.080 2.050 2.058 2.045
## momrace 2.283 2.240 2.268 2.277
## missing_ppvtr.36 0.188 0.188 0.188 0.188
## missing_b.marr 0.030 0.030 0.030 0.030
## missing_income 0.205 0.205 0.205 0.205
## missing_momed 0.100 0.100 0.100 0.100
## missing_momrace 0.292 0.292 0.292 0.292
## mean_ppvtr.36 mean_b.marr mean_income mean_momed
## 0.9989784 1.0161485 1.0872158 1.0080560
## mean_momrace sd_ppvtr.36 sd_b.marr sd_income
## 1.0612264 0.9883607 1.0153688 1.1798300
## sd_momed sd_momrace
## 0.9941880 1.0388617
If so — and when it does in the example depends on the pseudo-random number seed — we can procede to diagnosing other problems. For the sake of example, we continue our 4 chains for another 5 iterations by calling
to illustrate that this process can be continued until convergence is reached.
Next, the plot
of an object produced by mi
displays, for all missing_variable
s (or some subset
thereof), a histogram of the observed, imputed, and completed data, a
comparison of the completed data to the fitted values implied by the
model for the completed data, and a plot of the associated binned
residuals. There will be one set of plots on a page for the first three
chains, so that the user can get some sense of the sampling variability
of the imputations. The hist
function yields the same
histograms as plot
, but groups the histograms for all
variables (within a chain) on the same plot. The
image
function gives a sense of the missingness patterns in
the data.
## $ppvtr.36
## $ppvtr.36$is_missing
## missing
## FALSE TRUE
## 325 75
##
## $ppvtr.36$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.68707 -0.42052 -0.06966 -0.04628 0.33780 1.24409
##
## $ppvtr.36$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.20189 -0.31924 0.02848 0.00000 0.34944 1.23210
##
##
## $first
## $first$is_missing
## [1] "all values observed"
##
## $first$observed
##
## 1 2
## 226 174
##
##
## $b.marr
## $b.marr$crosstab
##
## observed imputed
## 0 456 46
## 1 1096 2
##
##
## $income
## $income$is_missing
## missing
## FALSE TRUE
## 318 82
##
## $income$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.592 8.262 9.411 8.901 10.198 12.405
##
## $income$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.323 9.079 9.804 9.699 10.358 13.872
##
##
## $momage
## $momage$is_missing
## [1] "all values observed"
##
## $momage$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.21377 -0.27468 0.03835 0.00000 0.35137 1.29046
##
##
## $momed
## $momed$crosstab
##
## observed imputed
## 1 472 43
## 2 540 57
## 3 324 45
## 4 104 15
##
##
## $momrace
## $momrace$crosstab
##
## observed imputed
## 1 220 127
## 2 320 133
## 3 592 208
Finally, we pool over m = 5
imputed datasets – pulled from
across the 4 chains – in order to estimate a descriptive linear
regression of test scores (ppvtr.36) at 36 months on a variety
of demographic variables pertaining to the mother of the child.
analysis <- pool(ppvtr.36 ~ first + b.marr + income + momage + momed + momrace,
data = imputations, m = 5)
display(analysis)
## bayesglm(formula = ppvtr.36 ~ first + b.marr + income + momage +
## momed + momrace, data = imputations, m = 5)
## coef.est coef.se
## (Intercept) 79.67 8.10
## first1 4.67 1.94
## b.marr1 5.96 2.96
## income 0.00 0.00
## momage -0.10 0.32
## momed.L 11.12 2.98
## momed.Q 0.99 2.52
## momed.C -0.12 1.64
## momrace2 -5.98 3.42
## momrace3 11.64 3.17
## n = 390, k = 10
## residual deviance = 93678.4, null deviance = 143164.5 (difference = 49486.1)
## overdispersion parameter = 240.2
## residual sd is sqrt(overdispersion) = 15.50
The rest is optional and only necessary if you want to perform some
operation that is not supported by the mi package,
perhaps outside of R. Here we create a list of data.frame
s,
which can be saved to the hard disk and / or exported in a variety of
formats with the foreign package. Imputed data can be
exported to Stata by using the mi2stata
function instead of
complete
.