---
title: "An Example of mi Usage"
author: "Ben Goodrich and Jonathan Kropko, for this version, based on earlier 
versions written by Yu-Sung Su, Masanao Yajima, Maria Grazia Pittau, Jennifer Hill, 
and Andrew Gelman"
date: "06/16/2014"
output: pdf_document
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{An Example of mi Usage}
-->

There are several steps in an analysis of missing data. Initially, users must 
get their data into R. There are several ways to do so, including the 
`read.table`, `read.csv`, `read.fwf` functions plus several functions in the 
__foreign__ package. All of these functions will generate a `data.frame`, which
is a bit like a spreadsheet of data.
http://cran.r-project.org/doc/manuals/R-data.html for more information.

```{r step0}
options(width = 65)
suppressMessages(library(mi))
data(nlsyV, package = "mi")
```

From there, the first step is to convert the `data.frame` to a `missing_data.frame`, 
which is an enhanced version of a `data.frame` that includes metadata about the 
variables that is essential in a missing data context. 
```{r step1}
mdf <- missing_data.frame(nlsyV)
```
The `missing_data.frame` constructor function creates a `missing_data.frame` 
called `mdf`, which in turn contains seven `missing_variable`s, one for each 
column of the `nlsyV` dataset.

The most important aspect of a `missing_variable` is its class, such as
`continuous`, `binary`, and `count` among many others (see the table in the Slots 
section of the help page for `missing_variable-class`. The `missing_data.frame`
constructor function will try to guess the appropriate class for each 
`missing_variable`, but rarely will it correspond perfectly to the user's intent. 
Thus, it is very important to call the `show` method on a `missing_data.frame` to 
see the initial guesses 
```{r step1.5}
show(mdf) # momrace is guessed to be ordered
```
and to modify them, if necessary, using the `change` function, which can be used
to change many things about a`missing_variable`, so see its help page for more 
details. In the example below, we change the class of the _momrace_ (race of the 
mother) variable from the initial guess of `ordered-categorical` to a more appropriate 
`unordered-categorical` and change the income `nonnegative-continuous`.
```{r, step2}
mdf <- change(mdf, y = c("income", "momrace"), what = "type", 
                     to = c("non", "un"))
show(mdf)
```
Once all of the `missing_variable`s are set appropriately, it is useful to get a 
sense of the raw data, which can be accomplished by looking at the `summary`,
`image`, and / or `hist` of a `missing_data.frame`
```{r, step3}
summary(mdf)
image(mdf)
hist(mdf)
```
Next we use the `mi` function to do the actual imputation, which has several 
extra arguments that, for example, govern how many independent chains to utilize, 
how many iterations to conduct, and the maximum amount of time the user is willing 
to wait for all the iterations of all the chains to finish. The imputation step 
can be quite time consuming, particularly if there are many `missing_variable`s 
and if many of them are categorical. One important way in which the computation 
time can be reduced is by imputing in parallel, which is highly recommended and 
is implemented in the mi function by default on non-Windows machines. If users encounter problems 
running `mi` with parallel processing, the problems are likely due  to the machine 
exceeding available RAM.  Sequential processing can be used instead for `mi`
by using the `parallel=FALSE` option.

```{r, step4}
rm(nlsyV)       # good to remove large unnecessary objects to save RAM
options(mc.cores = 2)
imputations <- mi(mdf, n.iter = 30, n.chains = 4, max.minutes = 20)
show(imputations)
```
The next step is very important and essentially verifies whether enough iterations 
were conducted. We want the mean of each completed variable to be roughly the 
same for each of the 4 chains. 
```{r, step5A}
round(mipply(imputations, mean, to.matrix = TRUE), 3)
Rhats(imputations)
```
If so --- and when it does in the example depends on the pseudo-random number seed 
--- we can procede to diagnosing other problems. For the sake of example, we 
continue our 4 chains for another 5 iterations by calling 
```{r, step5B}
imputations <- mi(imputations, n.iter = 5)
```
to illustrate that this process can be continued until convergence is reached.

Next, the `plot` of an object produced by `mi` displays, for all 
`missing_variable`s (or some subset thereof), a histogram of the observed, imputed, 
and completed data, a comparison of the completed data to the fitted values implied 
by the model for the completed data, and a plot of the associated binned residuals. 
There will be one set of plots on a page for the first three chains, so that the 
user can get some sense of the sampling variability of the imputations. The `hist` 
function yields the same histograms as `plot`, but groups the histograms for all 
variables (within a chain) on the same plot. The `image`function gives a sense of 
the missingness patterns in the data.
```{r, step6}
plot(imputations)
plot(imputations, y = c("ppvtr.36", "momrace"))
hist(imputations)
image(imputations)
summary(imputations)
```
Finally, we pool over `m = 5` imputed datasets -- pulled from across the 
4 chains -- in order to estimate a descriptive linear regression of test scores 
(_ppvtr.36_) at 36 months on a variety of demographic variables pertaining to the 
mother of the child.
```{r, step7}
analysis <- pool(ppvtr.36 ~ first + b.marr + income + momage + momed + momrace, 
                 data = imputations, m = 5)
display(analysis)
```
The rest is optional and only necessary if you want to perform some operation that 
is not supported by the __mi__ package, perhaps outside of R. Here we create a 
list of `data.frame`s, which can be saved to the hard disk and / or 
exported in a variety of formats with the __foreign__ package.  Imputed data can 
be exported to Stata by using the `mi2stata` function instead of `complete`.
```{r, step8}
dfs <- complete(imputations, m = 2)
```