Multiple Imputation Using Stata
Chuck Huber, PhD
StataCorp
chuber@stata.com
University of Michigan
January 30, 2018
Outline
- Example Dataset
- Missing Data Mechanisms
- What is multiple imputation?
- Multiple imputation in Stata
- Why use multiple imputation?
Example Dataset
Example Dataset
The objective is to examine the relationship between smoking and heart attacks, adjusting for age, body mass index, educational status, and gender.
We want to fit a logistic regression of heart attack (attack) with the other variables as regressors.
Complete Case Analysis
Mean Substitution?
Mean Substitution? [Side-by-side comparison of results: Complete Case Analysis (N=132) vs. Mean Substitution]
Missing Data Mechanisms
- Missing Completely At Random (MCAR)
- Missing At Random (MAR)
- Missing Not At Random (MNAR)
Missing Completely At Random (MCAR)
Definition: Missing data are MCAR if the reason for missing data is unrelated to the observed or unobserved (missing) data. That is, missing values are a simple random sample of all data values.
Examples:
- Subjects withdraw from a study for reasons unrelated to the study.
- Data are missing because of equipment failures or data-recording errors.
Missing Completely At Random (MCAR)
- Missing values are not related to the variable with missing data.
- Missing values are not related to other observed variables.
Missing At Random (MAR)
Definition: Missing data are MAR if the reason for missing data is unrelated to the unobserved (missing) data but may depend on the observed data. That is, missing values are not a simple random sample of all data values.
Examples:
- In a study of blood pressure, subjects withdraw from the study because of severe side effects caused by a high dosage of a treatment.
- In a study of income, respondents with low education might be less inclined to report their income.
Missing At Random (MAR)
- Missing values are not related to the variable with missing data.
- Missing values are related to other observed variables.
Missing Not At Random (MNAR)
Definition: Missing data are MNAR if the reason for missing data is related to the unobserved (missing) data.
Examples:
- In a study of income, respondents with low or high income might be less inclined to report their income.
- In a study of depression, respondents who are depressed might be less likely to report that they are depressed.
Missing Not At Random (MNAR)
- Missing values are related to the variable with missing data.
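The three mechanisms can be made concrete with a small simulation. The sketch below is an illustration in Python rather than Stata, and everything in it is an assumption made for the example: the variable names, the income distribution, and the missingness probabilities are invented and do not come from the heart attack dataset. It reuses the income and education example from the definitions above.

```python
import random
import statistics

def simulate(n=40000, seed=1):
    """Simulate income with three hypothetical missingness indicators."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        low_education = rng.random() < 0.5                      # always observed
        income = rng.gauss(40 if low_education else 60, 10)      # variable with missing values
        rows.append({
            "low_education": low_education,
            "income": income,
            # MCAR: the chance of being missing ignores all data
            "miss_mcar": rng.random() < 0.3,
            # MAR: the chance depends only on the observed education variable
            "miss_mar": rng.random() < (0.5 if low_education else 0.1),
            # MNAR: the chance depends on the (unobserved) income itself
            "miss_mnar": rng.random() < (0.5 if income < 50 else 0.1),
        })
    return rows

def observed_mean(rows, mask_key=None, low_education=None):
    """Mean income among rows not flagged as missing by mask_key.

    mask_key=None uses every row; low_education restricts to one group.
    """
    values = [
        r["income"] for r in rows
        if (mask_key is None or not r[mask_key])
        and (low_education is None or r["low_education"] == low_education)
    ]
    return statistics.mean(values)

rows = simulate()
print("full data mean:      ", round(observed_mean(rows), 2))
for key in ("miss_mcar", "miss_mar", "miss_mnar"):
    print(key, "overall:", round(observed_mean(rows, key), 2),
          "| low-education group:", round(observed_mean(rows, key, True), 2))
```

In this setup the complete-case mean stays close to the full-data mean under MCAR, is biased overall under MAR but not within an education group, and is biased even within an education group under MNAR.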
Checking MCAR vs MAR
What is multiple imputation?
Multiple imputation (MI) is a flexible, simulation-based statistical technique for handling missing data. Multiple imputation consists of three steps:
1. Imputation step. M imputations (completed datasets) are generated under some chosen imputation model.
2. Completed-data analysis (estimation) step. The desired analysis is performed separately on each imputation (m = 1, ..., M). This is called completed-data analysis and is the primary analysis to be performed once missing data have been imputed.
3. Pooling step. The results obtained from the M completed-data analyses are combined into a single multiple-imputation result.
Notation and some terminology
- Original data are the data containing missing values.
- With a slight abuse of terminology, by an imputation we mean a copy of the original data in which missing values are imputed.
- M is the number of imputations.
- m (= 0, ..., M) refers to the original or imputed data: m = 0 means original data and m > 0 means imputed data. m = 1 means the first imputation, m = 2 means the second imputation, etc.
The Imputation Step Original Data (m=0) Copy of Data (m = 1)
The Imputation Step
The Imputation Step bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female)
The Imputation Step bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
The Imputation Step
Original Data (m = 0)
m = 1: bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
m = 2: bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
m = 3: bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
The Imputation Step
Original Data (m = 0)
m = 1: bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + 1.7
m = 2: bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + 0.9
m = 3: bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) - 2.1
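The idea on these slides, a regression prediction plus a fresh random draw for each imputation, can be sketched in a few lines. This is a simplified illustration in Python, not what `mi impute` does internally: it copies the slide's coefficients (with intercept 26.5), adds a standard normal draw in the spirit of `rnormal()`, and holds the coefficients fixed across imputations. The example rows, the function names, and the use of `None` for a missing value are all assumptions made for the illustration.

```python
import random

# Coefficients copied from the slide's bmi_new equation (intercept 26.5).
COEF = {"_cons": 26.5, "attack": 1.7, "smokes": -0.47, "age": -0.03, "female": -0.31}

def predict_bmi(row):
    """Deterministic part of the imputation model: the regression prediction."""
    return (COEF["_cons"]
            + COEF["attack"] * row["attack"]
            + COEF["smokes"] * row["smokes"]
            + COEF["age"] * row["age"]
            + COEF["female"] * row["female"])

def impute(rows, n_imputations, seed):
    """Return a list of completed copies of rows (m = 1, ..., M).

    Observed bmi values are kept; each missing bmi (None) is replaced by the
    prediction plus a new standard normal draw, so copies differ from each other.
    """
    rng = random.Random(seed)  # fixed seed so the imputations can be reproduced
    imputations = []
    for _ in range(n_imputations):
        completed = []
        for row in rows:
            new_row = dict(row)  # copy so the original data (m = 0) stay untouched
            if new_row["bmi"] is None:
                new_row["bmi"] = predict_bmi(row) + rng.gauss(0, 1)
            completed.append(new_row)
        imputations.append(completed)
    return imputations

# Hypothetical rows, invented for the example (not taken from heart.dta).
original = [
    {"attack": 0, "smokes": 0, "age": 45, "female": 1, "bmi": 24.0},
    {"attack": 1, "smokes": 1, "age": 50, "female": 0, "bmi": None},
]
for m, completed in enumerate(impute(original, 3, seed=2018), start=1):
    print("m =", m, "imputed bmi:", round(completed[1]["bmi"], 2))
```

Each completed copy gets a different value for the missing bmi, which is what lets the later pooling step measure the uncertainty due to the missing data.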
The Estimation Step
Original Data (m = 0)
m = 1: logistic attack smokes age bmi_new hsgrad female
m = 2: logistic attack smokes age bmi_new hsgrad female
m = 3: logistic attack smokes age bmi_new hsgrad female
The Pooling Step
The within-imputation variance ($W_m$) is calculated for each imputed dataset during the estimation step. The between-imputation variance ($B$) is calculated during the pooling step. The total variance ($T$) is then:
$$T = \frac{1}{M}\sum_{m=1}^{M} W_m + \left(1 + \frac{1}{M}\right) B$$
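As a worked illustration of the pooling step, the sketch below applies Rubin's combination rules to one coefficient: the pooled estimate is the mean of the M estimates, the within-imputation variance is the mean of the M squared standard errors, the between-imputation variance is the sample variance of the M estimates, and the total is the within part plus (1 + 1/M) times the between part. It is written in Python for illustration only; the function name and the five estimate and variance values are made up and are not results from the heart attack data.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Combine M completed-data results for one coefficient.

    estimates: coefficient estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    """
    m_count = len(estimates)
    q_bar = statistics.mean(estimates)       # pooled point estimate
    w_bar = statistics.mean(variances)       # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance (divisor M - 1)
    t = w_bar + (1 + 1 / m_count) * b        # total variance
    return {"estimate": q_bar, "within": w_bar, "between": b,
            "total": t, "se": math.sqrt(t)}

# Made-up results from M = 5 imputations, for illustration only.
example_estimates = [0.50, 0.60, 0.55, 0.45, 0.65]
example_variances = [0.040, 0.050, 0.045, 0.040, 0.050]
pooled = pool_rubin(example_estimates, example_variances)
print(pooled)
```

With these made-up inputs the within variance is 0.045, the between variance is 0.00625, and the total variance is 0.045 + 1.2 × 0.00625 = 0.0525, so the pooled standard error is larger than any single-imputation analysis would suggest on average.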
Main features of Stata’s mi command
Stata’s mi suite of commands performs all three steps of multiple imputation:
1. Create imputed datasets, each with the missing values filled in (mi impute)
2. Fit your model on each imputed dataset (mi estimate)
3. Collect all the model fits and apply Rubin’s combination rules to form “mi-adjusted” parameter estimates and standard errors (mi estimate)
Multiple Imputation Using Stata
- The mi Control Panel
- Examining and setting up mi data
- Univariate imputation
- Estimation
- Testing
- Prediction
The mi Control Panel
Examine Missing Data
The Imputation Step NOTE: We’re only using 5 imputations to keep things simple, but you should use at least 20.
The Imputation Step
The Imputation Step
Three new variables were created by mi set and mi impute:
- _mi_id: An identification number for records within an imputed dataset
- _mi_miss: An indicator for missing values of the imputed variable
- _mi_m: The number (m) for each imputed dataset (m = 0 is original data)
The Imputation Step
Data Management
The Estimation Step
The Estimation and Pooling Step
Testing Coefficients
Predictions
Why use multiple imputation?
The objective of MI is not to predict missing values as closely as possible to the true ones but to handle missing data in a way that results in valid statistical inference (Rubin 1996).
Why use multiple imputation?
- It is more flexible than fully parametric methods, e.g., maximum likelihood or purely Bayesian analysis.
- It can be more efficient than listwise deletion (complete-case analysis) and can avoid potential bias.
- It accounts for missing-data uncertainty and thus, unlike single-imputation methods, does not underestimate the variance of estimates.
Statistical validity of MI
- MI yields statistically valid inference if the imputation method used is proper in the sense of Rubin (1987, 118–119).
- Loosely speaking, the mechanism that produces the imputations must maintain the existing characteristics of the data and incorporate adequate variability (uncertainty) induced by the unobserved data.
Summary
- MI is a stochastic method. Remember to set the random-number seed to reproduce the same point estimates later.
- MI preserves all available data and thus can be more efficient than complete-case analysis. It can also avoid potential bias when complete cases differ from incomplete cases.
- Unlike fully parametric methods, MI can easily be applied to a wide range of analyses.
Summary
- MI separates the stochastic imputation step from the analysis step, so the imputer and the analyst can be different people!
- In Stata, use mi impute for imputation and mi estimate for analysis.
- Use the MI Control Panel to guide you through all the phases of MI.
For more information
Files:
- 09_multiple_imputation.do
- heart.dta
Videos:
- Multiple imputation in Stata®: Setup, imputation, estimation--regression imputation
- Multiple imputation in Stata®: Setup, imputation, estimation--predictive mean matching
- Multiple imputation in Stata®: Setup, imputation, estimation--logistic regression
Thanks for letting me hang out with you today! Questions? You can contact me anytime at chuber@stata.com