Multiple Imputation Using Stata

1 Multiple Imputation Using Stata
Chuck Huber, PhD, StataCorp; University of Michigan, January 30, 2018

2 Outline Example Dataset Missing Data Mechanisms
What is multiple imputation? Multiple imputation in Stata Why use multiple imputation?

3 Example Dataset

4 Example Dataset
The objective is to examine the relationship between smoking and heart attacks, adjusting for age, body mass index, educational status, and gender. We want to perform a logistic regression of heart attack (attack) with the other variables as regressors.

5 Example Dataset

6 Example Dataset

7 Complete Case Analysis
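
A minimal sketch of the complete-case fit, assuming heart.dta (the file named on the "For more information" slide) is in the working directory and uses the variable names shown in the estimation commands later in the deck. The slide's actual output is not preserved in this transcript, so this is an illustration rather than a reproduction.

```stata
* Assumption: heart.dta is in the current working directory
use heart, clear

* Look at the analysis variables before modeling
describe attack smokes age bmi hsgrad female

* logistic silently drops any observation with a missing value in
* the outcome or a regressor, so this is a complete-case analysis
logistic attack smokes age bmi hsgrad female
```

The number of observations reported in the output header shows how many cases survived listwise deletion.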

8 Mean Substitution?

9 Mean Substitution?

10 Mean Substitution?

11 Mean Substitution? Complete Case Analysis (N=132) Mean Substitution
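
A rough sketch of what mean substitution looks like in Stata, assuming bmi is the variable with missing values and heart.dta is in the working directory. The variable name bmi_mean is made up for this illustration.

```stata
* Assumption: heart.dta is in the current working directory
use heart, clear

* Store the observed mean of bmi in r(mean)
summarize bmi, meanonly

* bmi_mean is a hypothetical name: a copy of bmi with gaps filled by the mean
generate double bmi_mean = bmi
replace bmi_mean = r(mean) if missing(bmi)

* Compare spread before and after: filling with the mean shrinks the SD
summarize bmi bmi_mean

* Same model as the complete-case fit, but now using every observation
logistic attack smokes age bmi_mean hsgrad female
```

Every filled-in value is identical, which is why the standard deviation of bmi_mean is smaller than that of bmi and why the resulting standard errors tend to be too small.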

12 Outline Example Dataset Missing Data Mechanisms
What is multiple imputation? Multiple imputation in Stata Why use multiple imputation?

13 Missing Data Mechanisms
Missing Completely At Random (MCAR) Missing At Random (MAR) Missing Not At Random (MNAR)

14 Missing Completely At Random (MCAR)
Definition: Missing data are MCAR if the reason for missing data is unrelated to the observed or unobserved (missing) data. That is, missing values are a simple random sample of all data values.
Examples: Subjects withdraw from a study for reasons unrelated to the study. Data are missing because of equipment failures or data-recording errors.

15 Missing Completely At Random (MCAR)
[Diagram label: Other variables]
Missing values are not related to the variable with missing data.
Missing values are not related to other observed variables.

16 Missing At Random (MAR)
Definition: Missing data are MAR if the reason for missing data is unrelated to the unobserved (missing) data but may depend on the observed data. That is, missing values are not a simple random sample of all data values.
Examples: In a study of blood pressure, subjects withdraw from the study because of severe side effects caused by a high dosage of a treatment. In a study of income, respondents with low education might be less inclined to report their income.

17 Missing At Random (MAR)
[Diagram label: Other variables]
Missing values are not related to the variable with missing data.
Missing values are related to other observed variables.

18 Missing Not At Random (MNAR)
Definition: Missing data are MNAR if the reason for missing data is related to the unobserved (missing) data.
Examples: In a study of income, respondents with low or high income might be less inclined to report their income. In a study of depression, respondents who are depressed might be less likely to report that they are depressed.

19 Missing Not At Random (MNAR)
[Diagram label: Other variables]
Missing values are related to the variable with missing data.

20 Checking MCAR vs MAR
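
The content of this slide is not preserved in the transcript. One common informal check, sketched here under the assumption that bmi is the incomplete variable, is to model a missingness indicator on the observed variables. The indicator name miss_bmi is made up for this illustration.

```stata
* Assumption: heart.dta is in the current working directory
use heart, clear

* miss_bmi is a hypothetical indicator: 1 if bmi is missing, 0 otherwise
generate byte miss_bmi = missing(bmi)

* If observed variables predict missingness, the data are unlikely to be MCAR
logistic miss_bmi attack smokes age hsgrad female
```

Clear associations here are evidence against MCAR. The check cannot separate MAR from MNAR, because that distinction depends on the unobserved values themselves.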

21 Outline Example Dataset Missing Data Mechanisms
What is multiple imputation? Multiple imputation in Stata Why use multiple imputation?

22 What is multiple imputation?
Multiple imputation (MI) is a flexible, simulation-based statistical technique for handling missing data. Multiple imputation consists of three steps:
Imputation step. M imputations (completed datasets) are generated under some chosen imputation model.
Completed-data analysis (estimation) step. The desired analysis is performed separately on each imputation (m = 1, …, M). This is called completed-data analysis and is the primary analysis to be performed once missing data have been imputed.
Pooling step. The results obtained from M completed-data analyses are combined into a single multiple-imputation result.

23 Notation and some terminology
Original data are the data containing missing values.
With a slight abuse of terminology, by an imputation we mean a copy of the original data in which missing values are imputed.
M is the number of imputations.
m (= 0, …, M) refers to the original or imputed data: m = 0 means original data and m > 0 means imputed data. m = 1 means the first imputation, m = 2 means the second imputation, etc.

24 The Imputation Step Original Data (m=0) Copy of Data (m = 1)

25 The Imputation Step

26 The Imputation Step bmi_new = (attack) - .47(smokes) - .03(age) - .31(female)

27 The Imputation Step bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()

28 The Imputation Step Original Data
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + rnormal()

29 The Imputation Step Original Data
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + 1.7
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female) + 0.9
bmi_new = (attack) - .47(smokes) - .03(age) - .31(female)
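
The idea on these slides (predict bmi from the other variables, then add random noise) can be sketched by hand as follows. This is an illustration only, assuming heart.dta is in the working directory; the seed value and the name bmi_hat are arbitrary choices made here. It is not what mi impute regress does internally: that command also draws the regression parameters from their posterior distribution for each imputation, which this sketch skips.

```stata
* Assumption: heart.dta is in the current working directory
use heart, clear
set seed 12345                      // arbitrary seed, for reproducibility

* Fit the imputation model on observations where bmi is observed
regress bmi attack smokes age female

* Linear prediction for every observation, including those missing bmi
predict double bmi_hat, xb

* One hand-made "imputation": observed values kept as they are,
* missing values replaced by prediction + noise scaled by the residual SD
generate double bmi_new = bmi
replace bmi_new = bmi_hat + rnormal(0, e(rmse)) if missing(bmi)
```

Repeating the last two lines with fresh random draws gives different filled-in values each time, which is the variability that multiple imputation is designed to carry through to the final standard errors.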

30 The Estimation Step Original Data
logistic attack smokes age bmi_new hsgrad female
logistic attack smokes age bmi_new hsgrad female
logistic attack smokes age bmi_new hsgrad female

31 The Pooling Step
The within-imputation variance (W) is calculated for each imputed dataset during the estimation step. The between-imputation variance (B) is calculated during the pooling step. The total variance (T) is then:
$T = \frac{1}{M}\sum_{m=1}^{M} W_m + \left(1 + \frac{1}{M}\right) B$
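
For reference, Rubin's combination rules for a single coefficient can be written out as below, followed by a small worked example. The symbols $Q_m$ (the estimate from imputation m) and $\bar{W}$ (the average within-imputation variance) are notation introduced here, and the numbers are made up purely to show the arithmetic.

```latex
% Rubin's rules for one coefficient, M imputations
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M} Q_m, \qquad
\bar{W} = \frac{1}{M}\sum_{m=1}^{M} W_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M} \left(Q_m - \bar{Q}\right)^2, \qquad
T = \bar{W} + \left(1 + \frac{1}{M}\right) B

% Illustrative numbers (not from the slides): M = 3
% Q_1, Q_2, Q_3 = 1.0, 1.2, 1.4    W_1, W_2, W_3 = 0.04, 0.05, 0.06
\bar{Q} = 1.2, \qquad \bar{W} = 0.05, \qquad
B = \frac{(-0.2)^2 + 0^2 + (0.2)^2}{2} = 0.04

T = 0.05 + \left(1 + \tfrac{1}{3}\right)(0.04) \approx 0.1033, \qquad
\sqrt{T} \approx 0.32
```

The pooled standard error (about 0.32) is larger than the square root of the average within-imputation variance (about 0.22) because it also reflects how much the estimates vary from one imputation to the next.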

32 Outline Example Dataset Missing Data Mechanisms
What is multiple imputation? Multiple imputation in Stata Why use multiple imputation?

33 Main features of Stata’s mi command
Stata’s mi suite of commands performs all three steps of multiple imputation:
Create imputed datasets, each with the missing values filled in (mi impute).
Fit your model on each imputed dataset (mi estimate).
Collect all the model fits and apply Rubin’s combination rules to form “mi-adjusted” parameter estimates and standard errors (mi estimate).

34 Multiple Imputation Using Stata
The mi Control Panel Examining and setting up mi data Univariate imputation Estimation Testing Prediction

35 The mi Control Panel

36 Examine Missing Data

37 Examine Missing Data
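
The screenshots for these two slides are not preserved. A typical way to examine missing data from the command line, assuming heart.dta is in the working directory, is sketched below; it may differ from what the slides showed.

```stata
* Assumption: heart.dta is in the current working directory
use heart, clear

* Count missing values for each variable that has any
misstable summarize

* Show which combinations of variables are missing together
misstable patterns
```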

38 The Imputation Step
NOTE: We’re only using 5 imputations to keep things simple, but you should use at least 20.
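
A sketch of the setup and imputation commands, assuming heart.dta is in the working directory, bmi is the only variable being imputed, and the imputation model uses the predictors from the earlier bmi_new equation. The mlong storage style and the seed value are choices made for this illustration, not details taken from the slides.

```stata
* Assumption: heart.dta is in the current working directory
use heart, clear

* Declare the data as mi data; mlong is one of several storage styles
mi set mlong

* Tell mi which variable has missing values to be imputed
mi register imputed bmi

* Linear-regression imputation of bmi; add(5) matches the 5 imputations
* used in the slides (at least 20 is recommended); the seed is arbitrary
mi impute regress bmi attack smokes age female, add(5) rseed(12345)
```

Setting rseed() makes the imputed values, and therefore the later pooled estimates, reproducible.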

39 The Imputation Step

40 The Imputation Step
Three new variables were created by mi set and mi impute:
_mi_id: an identification number for records within an imputed dataset
_mi_miss: an indicator for missing values of the imputed variable
_mi_m: the number (m) for each imputed dataset (m = 0 is original data)
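
One way to look at these system variables is sketched below. The block repeats a minimal setup so it can run on its own, and it assumes heart.dta is in the working directory and the mlong style, in which only incomplete observations are copied into m > 0. Both assumptions are made for this illustration.

```stata
* Assumption: heart.dta is in the current working directory
use heart, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age female, add(5) rseed(12345)

* Summary of the mi setup: style, number of imputations, registered variables
mi describe

* Number of records stored for the original data (0) and each imputation (1-5)
tabulate _mi_m

* Cross-tabulate imputation number against the missing-value indicator
tabulate _mi_m _mi_miss, missing
```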

41 The Imputation Step

42 Data Management

43

44 The Estimation Step

45 The Estimation and Pooling Step
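
The estimation and pooling steps are carried out together by prefixing the model with mi estimate. The sketch below repeats a minimal setup so it can run on its own; the file location, storage style, and seed are assumptions made here.

```stata
* Assumption: heart.dta is in the current working directory
use heart, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age female, add(5) rseed(12345)

* Fit the logistic model on each of the 5 imputations and pool the
* results with Rubin's rules; the table shows the pooled coefficients
mi estimate: logistic attack smokes age bmi hsgrad female
```

Note that the regressor is the registered variable bmi itself, because mi estimate substitutes the imputed values from each dataset automatically.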

46 Testing Coefficients

47 Testing Coefficients
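
The tests shown on these slides are not preserved. As an illustration, pooled coefficients can be tested with mi test after mi estimate; the particular coefficients tested below are chosen arbitrarily, and the setup lines repeat the same assumptions about file location, style, and seed.

```stata
* Assumption: heart.dta is in the current working directory
use heart, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age female, add(5) rseed(12345)
mi estimate: logistic attack smokes age bmi hsgrad female

* Test a single pooled coefficient
mi test smokes

* Joint test that two pooled coefficients are both zero
mi test age bmi
```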

48 Predictions
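
The slide's output is not preserved. A sketch of one way to obtain pooled predictions is to save the estimation results and pass them to mi predict. The names miest and xb_mi are arbitrary choices for this illustration, and the setup lines repeat the same assumptions as before.

```stata
* Assumption: heart.dta is in the current working directory
use heart, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age female, add(5) rseed(12345)

* Save the per-imputation estimation results to the file miest.ster
mi estimate, saving(miest, replace): logistic attack smokes age bmi hsgrad female

* Pooled linear prediction (log odds of a heart attack) for each observation
mi predict xb_mi using miest
```

By default mi predict returns the linear prediction, so the values in xb_mi are on the log-odds scale rather than the probability scale.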

49 Outline Example Dataset Missing Data Mechanisms
What is multiple imputation? Multiple imputation in Stata Why use multiple imputation?

50 Why use multiple imputation?
The objective of MI is not to predict missing values as close as possible to the true ones but to handle missing data in a way that results in valid statistical inference (Rubin 1996).

51 Why use multiple imputation?
It is more flexible than fully parametric methods, e.g., maximum likelihood or purely Bayesian analysis.
It can be more efficient than listwise deletion (complete-cases analysis) and can avoid potential bias.
It accounts for missing-data uncertainty and, thus, does not underestimate the variance of estimates, unlike single-imputation methods.

52 Statistical validity of MI
MI yields statistically valid inference if the imputation method used is proper per Rubin (1987, 118–119). Loosely speaking, the imputation mechanism, which produces imputations, must maintain the existing characteristics of the data and incorporate adequate variability (uncertainty) induced by unobserved data.

53 Summary
MI is a stochastic method. Remember to set the random-number seed to reproduce the same point estimates later.
MI preserves all available data and thus can be more efficient than complete-cases analysis. It can also avoid potential bias when complete cases differ from incomplete cases.
Unlike fully parametric methods, MI can easily be applied to a wide range of analyses.

54 Summary
MI separates the stochastic imputation step from the analysis step, so the imputer and the analyst can be different people!
In Stata, use mi impute for imputation and mi estimate for analysis.
Use the MI Control Panel to guide you through all the phases of MI.

55 For more information

56 For more information
Files: 09_multiple_imputation.do, heart.dta
Videos:
Multiple imputation in Stata®: Setup, imputation, estimation (regression imputation)
Multiple imputation in Stata®: Setup, imputation, estimation (predictive mean matching)
Multiple imputation in Stata®: Setup, imputation, estimation (logistic regression)

57 Thanks for letting me hang out with you today! Questions?
You can contact me anytime at
