Multiple Imputation in Stata (ice): How and when to use it.
How ice() works
Each variable with missing data is the subject of a regression.
  – Typically all other variables are used as predictors
  – Estimate β, σ via the regression
  – Draw σ* from its posterior distribution (non-informative prior)
  – Draw β* from its posterior distribution (non-informative prior)
  – Find predicted values: Ŷ = Xβ*, then either:
      Keep Ŷ for the missing values (default option)
      Predictive Mean Matching
  – Move on to the next variable, using the newly-predicted values
  – Cycle through the variables a number of times (10 is default)
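A minimal sketch of one such regression step, written in plain Python so the draws are visible. It is an illustration under stated assumptions, not ice's actual code: there is a single predictor x, missing y values are coded as None, the function name impute_once is hypothetical, and a residual draw is added to the predicted mean (the usual practice for regression imputation; the slide's formula shows only the mean part).

```python
import math
import random

def impute_once(x, y, rng):
    """Fill the None entries of y using one posterior draw of (sigma*, beta*)."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    xbar = sum(xi for xi, _ in obs) / n
    ybar = sum(yi for _, yi in obs) / n
    sxx = sum((xi - xbar) ** 2 for xi, _ in obs)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in obs)

    # Estimate beta, sigma via OLS on the observed cases
    slope = sxy / sxx
    rss = sum((yi - ybar - slope * (xi - xbar)) ** 2 for xi, yi in obs)
    df = n - 2

    # Draw sigma*: sigma*^2 = RSS / (chi-square draw with df degrees of freedom)
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    sigma_star = math.sqrt(rss / chi2)

    # Draw beta*: with x centered at xbar, the intercept and slope are independent
    a_star = ybar + sigma_star / math.sqrt(n) * rng.gauss(0, 1)
    b_star = slope + sigma_star / math.sqrt(sxx) * rng.gauss(0, 1)

    completed = []
    for xi, yi in zip(x, y):
        if yi is None:
            y_hat = a_star + b_star * (xi - xbar)       # predicted value from beta*
            yi = y_hat + sigma_star * rng.gauss(0, 1)   # assumed residual draw
        completed.append(yi)
    return completed

# Made-up example data: two missing values in y
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
completed = impute_once(x, y, random.Random(0))
```

In a full chained-equations run this step would be repeated for each incomplete variable in turn, and the whole pass repeated for the chosen number of cycles.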
Assumptions
Missing at Random
  – No getting around this one. MCAR is fine, of course.
Distinct Parameters
  – Does the missing data mechanism govern what data-generating parameters you can see? Ex: limits of detection.
Adequate Sample Size
  – Hard to quantify. Regression on continuous variables doesn’t take much, but other methods certainly can.
Convergence to a Posterior Distribution
  – Standard MI (such as Proc MI) is known to converge to a posterior distribution with enough iterations. ice() does not have this guarantee. This is typically ignored when ice() is used.
Predictive Mean Matching
We have ŷ_mis, the predicted value for an observation that is missing the variable
  – Previously
      Find the ŷ_obs that is closest to ŷ_mis, and fill in the missing value with the true (observed) value from the case that has that ŷ_obs
      Was the default behavior for previous versions of ice()
      Could be a problem; not enough variability.
  – Currently
      Find a set of ŷ_obs that are close to ŷ_mis, choose one randomly, and fill in the missing value with the true (observed) value from the case that has that ŷ_obs
      Invoked by using the "match" argument
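The two matching rules can be sketched in a few lines of Python. This is an illustration only, not ice's implementation: the function name pmm_impute and the donor-pool size k are hypothetical, and k = 1 stands in for the older closest-match behavior.

```python
import random

def pmm_impute(yhat_mis, yhat_obs, y_obs, k, rng):
    """Borrow an observed y from one of the k cases whose prediction is closest to yhat_mis."""
    # Rank the observed cases by distance between their prediction and yhat_mis
    order = sorted(range(len(yhat_obs)), key=lambda i: abs(yhat_obs[i] - yhat_mis))
    donors = order[:k]
    # Pick one donor at random and return its true (observed) value
    return y_obs[rng.choice(donors)]

# Made-up predictions and observed values for five complete cases
yhat_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [1.2, 1.9, 3.4, 3.8, 5.1]
rng = random.Random(1)

closest = pmm_impute(3.1, yhat_obs, y_obs, 1, rng)   # single nearest donor
pooled = pmm_impute(3.1, yhat_obs, y_obs, 3, rng)    # random pick among 3 nearest
```

With k = 1 every missing case with the same ŷ_mis gets the same donor, which is the lack of variability noted above. A larger pool trades a little closeness for more spread in the imputed values.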
Other Regression Methods
Multinomial Logistic Regression
  – For categorical variables, ordered or unordered
  – Finds a probability for each category value, then imputes a value using those probabilities.
  – My advice: try to avoid using it, as I’ve found its results to be incorrect (biased)
Ordinal Logistic Regression
  – For ordered categorical variables
  – My advice: it seems to work well, but it needs a large (n>1000) sample size to work
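The "impute a value using those probabilities" step amounts to one random draw per missing case. A minimal Python sketch, assuming the fitted model has already produced a probability for each category (the function name draw_category and the probabilities below are made up for illustration):

```python
import random

def draw_category(probs, rng):
    """probs maps each category value to its predicted probability (summing to 1)."""
    u = rng.random()          # uniform draw on [0, 1)
    cumulative = 0.0
    for category, p in probs.items():
        cumulative += p
        if u < cumulative:
            return category
    return category           # guard against rounding when probabilities sum to just under 1

# Hypothetical predicted probabilities for a 3-category variable coded 4, 5, 6
probs = {4: 0.2, 5: 0.5, 6: 0.3}
rng = random.Random(2)
imputed = [draw_category(probs, rng) for _ in range(10000)]
share_of_5 = imputed.count(5) / len(imputed)
```

Because each case gets a random draw rather than its most likely category, the imputed values keep the spread of the predicted distribution.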
Useful Material: How to run ice()
Getting the program
  – Help -> Search -> [Search all] "ice imputation"
  – Click on st_0067_2 (
  – Click "click here to install"
  – This gets you ice and micombine, as well as a few other commands
Running ice
  – Have the dataset open
      insheet using "C:\path\example.csv", clear
  – Four variables with missing information
      npnitm: binary variable
      npceradm, npneurm: continuous variables
      npbrkm: 3-category ordered variable
  – Four variables with complete data
  – We need to make dummy variables for categorical variables:
      recode npbrkm (4=0) (5=1) (6=0) (.=.), generate(brk5)
      recode npbrkm (4=0) (5=0) (6=1) (.=.), generate(brk6)
Running ice, continued (1)
  – Call ice()
      ice educ mmselast npdage npgender npnitm npceradm npbrkm brk5 brk6 npneurm using "C:\path\outfile", m(5) passive(brk5:npbrkm==5 \ brk6:npbrkm==6) substitute(npbrkm:brk5 brk6) cmd(npbrkm:mlogit, npnitm:logit)
  – Here’s what the code pieces do:
      educ … npneurm: Variables to be used for imputation
      using "C:\path\outfile": the result; outfile.dta
      m(5): 5 imputed datasets
      passive(brk5:npbrkm==5 \ brk6:npbrkm==6)
        – Stata will not impute for brk5 and brk6: they will be updated from the new values in npbrkm
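What the passive update does can be pictured as recomputing the dummies from npbrkm each time npbrkm has been imputed. A small Python sketch of that idea, using made-up values and a hypothetical helper update_passive; it mirrors the expressions npbrkm==5 and npbrkm==6 and is not how ice itself is written:

```python
def update_passive(npbrkm):
    """Recompute the dummy variables from the current (possibly imputed) npbrkm values."""
    brk5 = [int(v == 5) for v in npbrkm]
    brk6 = [int(v == 6) for v in npbrkm]
    return brk5, brk6

# npbrkm after an imputation pass: no missing values remain (values are made up)
npbrkm_filled = [4, 5, 6, 5, 4]
brk5, brk6 = update_passive(npbrkm_filled)
```

Deriving the dummies this way keeps them consistent with npbrkm, which would not be guaranteed if each dummy were imputed by its own regression.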
Running ice, continued (2)
  – Here’s what the code pieces do:
      substitute(npbrkm:brk5 brk6)
        – npbrkm won’t be used to impute other variables; brk5 and brk6 will be used in its place
      cmd(npbrkm:mlogit, npnitm:logit)
        – npbrkm will be imputed with multinomial logistic regression
        – npnitm will be imputed with logistic regression
        – all other variables with missing data use default methods:
            » continuous: OLS
            » n=2 categories: Logistic Regression
            » n>2 categories: Multinomial Logistic Regression
Results
A dataset, outfile.dta
  – use "C:\path\outfile.dta", clear
New variables
  – _i: row number per dataset (not generally used)
  – _j: imputed dataset number (same as _Imputation_ from Proc MI)
Analyzing the results using micombine, an example
  – xi: micombine regress mmselast npgender npnitm npceradm i.npbrkm
  – xi: expand interactions. Used to break npbrkm into dummy variables for the analysis
  – micombine: automatically does the MI analysis, using _j to distinguish between the imputed datasets
      See its help file for a list of supported regression commands
      For some methods, SAS’s MIANALYZE may be needed
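The "MI analysis" step is conventionally done with Rubin's rules: fit the model once per imputed dataset, then pool the coefficient estimates and their standard errors. The Python sketch below shows that pooling for a single coefficient. The function name rubin_combine and the numbers are made up, and this is the standard textbook formula rather than a copy of micombine's code.

```python
import math

def rubin_combine(estimates, std_errors):
    """Pool one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled = sum(estimates) / m                                    # average estimate
    within = sum(se ** 2 for se in std_errors) / m                 # average sampling variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)  # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical coefficient and standard error from each of m = 3 imputed datasets
pooled, pooled_se = rubin_combine([1.0, 1.2, 0.8], [0.5, 0.5, 0.5])
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates disagree across imputations, which is how the uncertainty due to the missing data enters the final result.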
The end. Questions?