Multiple Imputation (MI) Technique Using a Sequence of Regression Models. OJOC Cohort 15. Veronika N. Stiles, BSDH, University of Michigan, September 2012.


1 Multiple Imputation (MI) Technique Using a Sequence of Regression Models. OJOC Cohort 15. Veronika N. Stiles, BSDH, University of Michigan, September 2012. BIOSTATISTICS 590

2 Basis for Presentation. This presentation is based on an article by T.E. Raghunathan, J.M. Lepkowski, J. Van Hoewyk, and P. Solenberger: “A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models”, Survey Methodology, June 2001, Vol. 27, No. 1, pp. 85-95.

3 Rationale for Multiple Imputation. Incomplete data are a common problem. Once the missing values have been imputed, existing complete-data software can be used for the analysis.

4 Basic Definitions. “Imputation” is the placement of one or more estimated answers into a field of a data record that previously had NO data. Imputed values are draws from a predictive distribution. Basic strategy: create imputations by fitting a sequence of multiple regressions. Each regression uses a variable with missing data as the outcome (Y) variable; the model is fitted to the cases where Y is observed and is then used to predict Y where Y is missing. Imputed values are drawn from the resulting predictive distributions, and the variables are processed in a cyclical manner. The type of regression model varies by imputed variable (an example follows in later slides).

5 Types of Regression Models Used: 1. Linear 2. Logistic 3. Poisson 4. Generalized logit 5. Mixture of the above. Remember! The type of regression model depends on the type of imputed variable!
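As a rough illustration of the reminder above, the pairing of variable type and model family can be written as a simple lookup. This is only a sketch: the dictionary `MODEL_FOR_TYPE` and the function `choose_model` are hypothetical names, and the pairings follow the usual convention (linear for continuous, logistic for binary, Poisson for counts, generalized logit for categorical).

```python
# Hypothetical lookup: type of the variable being imputed -> regression model family.
MODEL_FOR_TYPE = {
    "continuous": "linear",
    "binary": "logistic",
    "count": "Poisson",
    "categorical": "generalized logit",
    "mixed": "mixture of the above",
}


def choose_model(variable_type: str) -> str:
    """Return the model family conventionally used to impute this variable type."""
    try:
        return MODEL_FOR_TYPE[variable_type.lower()]
    except KeyError:
        raise ValueError(f"unknown variable type: {variable_type!r}") from None


for vtype in ("continuous", "binary", "count", "categorical", "mixed"):
    print(f"{vtype:12s} -> {choose_model(vtype)}")
```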

6 Assumptions in MI Technique. The population is infinite. The sample is a simple random sample (SRS). Each variable is one of the following types: continuous, binary, categorical, count, or mixed.

7 Advantages of Multiple Imputation
+ The method for imputation is known;
+ Analyses are based on the same number of cases;
+ All data provided are used in each analysis;
+ Allows for multiple predictors;
+ Valid point and interval estimates are obtained, under a general set of conditions, by repeatedly applying the complete-data software.

8 Imputation Method. Each imputation consists of “rounds”. Start round 1 by regressing the variable with the fewest missing values (Y) on the available predictors. Remember! Imputations for missing values in Y are draws from the predictive distribution (use the predicted mean of Y plus a random draw from the normal error distribution). Then update X by replacing each missing Y with its imputed value, where X is the full matrix with all variables (including Y).
[Table: example data matrix with columns Lesion Location, Etiology, Lesion Size, and Chronicity; rows for Temporal/Lobectomy, Occipital/Stroke, and Temporal/Hemorrhage.]
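The "predicted mean plus a random normal error" step can be sketched for a single continuous variable as follows. This is a simplified illustration, not the authors' implementation: the function name `impute_linear` and the toy data are made up, and the regression coefficients are held fixed at their least-squares estimates, whereas a full implementation would also draw the parameters themselves.

```python
import numpy as np


def impute_linear(y, X, rng):
    """Fill NaN entries of y using a linear regression of y on X.

    Simplification (assumption): coefficients and error variance are fixed
    at their estimates instead of being drawn from a posterior distribution.
    """
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    observed = ~np.isnan(y)

    # Fit the regression on the rows where y is observed.
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    residuals = y[observed] - design[observed] @ beta
    dof = max(observed.sum() - design.shape[1], 1)
    sigma = np.sqrt(residuals @ residuals / dof)

    # Imputation = predicted mean + random draw from the normal error distribution.
    filled = y.copy()
    n_missing = (~observed).sum()
    filled[~observed] = design[~observed] @ beta + rng.normal(0.0, sigma, n_missing)
    return filled


# Toy data (made up for illustration): two values of y are missing.
rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, np.nan, 8.2, 9.8, np.nan])
completed = impute_linear(y, x, rng)
print(completed)
```

Observed values are left untouched; only the missing entries receive a draw, so repeated calls with a different random state give different imputations.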

9 Imputation Method. Move on to the Y with the next-fewest missing values. Repeat the imputation using the updated X as predictors until all variables have been imputed. Run the process M times; this yields M entire datasets. Each dataset has a different set of imputed values, but the same data for the observed values.
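The sequence described on slides 8 and 9 can be sketched end to end for continuous variables. The sketch below makes several assumptions: every variable is imputed with a plain linear model, parameters are held at their least-squares estimates, only the first round is performed, and the names `draw_imputations`, `sequential_impute`, and the synthetic data are all hypothetical.

```python
import numpy as np


def draw_imputations(y, X, rng):
    """Least-squares prediction plus normal noise for the NaN entries of y."""
    design = np.column_stack([np.ones(len(y)), X])
    obs = ~np.isnan(y)
    beta, *_ = np.linalg.lstsq(design[obs], y[obs], rcond=None)
    resid = y[obs] - design[obs] @ beta
    sigma = np.sqrt(resid @ resid / max(obs.sum() - design.shape[1], 1))
    return design[~obs] @ beta + rng.normal(0.0, sigma, (~obs).sum())


def sequential_impute(data, n_datasets, rng):
    """Return n_datasets completed copies of data (2-D array, NaN = missing)."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    n_missing = missing.sum(axis=0)
    complete_cols = [j for j in range(data.shape[1]) if n_missing[j] == 0]
    # Incomplete columns, fewest missing values first.
    todo = sorted((j for j in range(data.shape[1]) if n_missing[j] > 0),
                  key=lambda j: n_missing[j])

    results = []
    for _ in range(n_datasets):          # run the whole process M times
        filled = data.copy()
        predictors = list(complete_cols)
        for j in todo:
            draws = draw_imputations(filled[:, j], filled[:, predictors], rng)
            filled[missing[:, j], j] = draws
            predictors.append(j)         # the updated column joins the predictors
        results.append(filled)
    return results


# Synthetic data (made up): one complete column and two incomplete ones.
rng = np.random.default_rng(1)
age = rng.normal(58, 10, 200)
bmi = 20 + 0.1 * age + rng.normal(0, 2, 200)
years = 0.5 * age + rng.normal(0, 5, 200)
bmi[:10] = np.nan
years[5:40] = np.nan
data = np.column_stack([age, bmi, years])

datasets = sequential_impute(data, n_datasets=5, rng=rng)
print(len(datasets), datasets[0].shape)
```

Each returned dataset shares the observed values and differs only in the imputed cells, which is what later allows between-imputation variability to be measured.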

10 Example Time: Effect of Smoking on Primary Cardiac Arrest (CA). Case-control study. Aim: examine the relationship between smoking and CA.

11 Means and Proportions of Key Variables and Percent Missing

Variable             Control (n = 551)            Cases (n = 347)
                     % Missing    Mean (SD)       % Missing    Mean (SD)
Age                  0            58.4 (10.4)     0            59.4 (9.9)
BMI                  8.2          25.8 (4.1)      2.6          26.4 (4.6)
Years Smoked         16.8         24.8 (14.7)     5.4          31.7 (13.8)
Proportion Female    0            23.2            0            19.9
>= High School       0            76.8            0            61.9
Smoking Status       0
  Never Smoked       0            47.2            0            27.3
  Former Smoker      0            42.1            0            38.2
  Current Smoker     0            10.7            0            34.5

12 Intuitively… What variables might predict missing data? Could age, education, smoking status predict BMI? Could age predict years smoked? However, years smoked can only be imputed for current and former smokers! Some values may need to be fixed post-MI

13 Multiple Imputation Process in CA Study. Log (BMI) has the fewest missing values. Regress Log (BMI) on age, female, education, Years_Smoked, smoking status, and cardiac arrest through a normal linear model. Cardiac Arrest IS included in the imputation model. Imputed values of log (BMI) are saved to the dataset, replacing the missing values.

14 Multiple Imputation Process in CA Study. Next, Years Smoked was regressed on all of the variables above plus log (BMI). (Please note that the regression excludes ‘never-smokers’.) Imputed values of Years Smoked are saved to the dataset, replacing the missing values. M = 25 imputations were created (note: many researchers use M = 5 or 5 < M < 10). The original logistic regression model was then fit to each MI data set.
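Restricting an imputation model to a subpopulation, as done for Years Smoked, might look like the sketch below. It rests on assumptions that are not in the slides: never-smokers are assigned zero years by definition, negative draws are clipped to zero as one possible post-imputation fix, and the function name `impute_years_smoked` and the toy data are hypothetical.

```python
import numpy as np


def impute_years_smoked(years, ever_smoker, predictors, rng):
    """Impute years smoked for current/former smokers only."""
    years = np.asarray(years, dtype=float).copy()
    ever_smoker = np.asarray(ever_smoker, dtype=bool)

    # Assumption: never-smokers have zero years smoked by definition.
    years[~ever_smoker] = 0.0

    fit_rows = ever_smoker & ~np.isnan(years)    # smokers with observed values
    fill_rows = ever_smoker & np.isnan(years)    # smokers with missing values

    design = np.column_stack([np.ones(len(years)), predictors])
    beta, *_ = np.linalg.lstsq(design[fit_rows], years[fit_rows], rcond=None)
    resid = years[fit_rows] - design[fit_rows] @ beta
    sigma = np.sqrt(resid @ resid / max(fit_rows.sum() - design.shape[1], 1))

    draws = design[fill_rows] @ beta + rng.normal(0.0, sigma, fill_rows.sum())
    # Assumption: clip impossible negative values as a simple post-imputation fix.
    years[fill_rows] = np.clip(draws, 0.0, None)
    return years


# Toy data (made up): indices 2 and 5 are never-smokers.
rng = np.random.default_rng(2)
age = np.array([45.0, 52.0, 60.0, 63.0, 70.0, 58.0, 49.0, 66.0])
ever_smoker = np.array([1, 1, 0, 1, 1, 0, 1, 1], dtype=bool)
years = np.array([20.0, np.nan, np.nan, 35.0, 40.0, np.nan, 22.0, np.nan])

completed = impute_years_smoked(years, ever_smoker, age, rng)
print(completed)
```

Never-smokers never enter the regression, so their records neither influence the fitted model nor receive a random draw.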

15 How were estimates of coefficients and covariance matrices obtained? IVEware software performs the calculations, using the estimates and covariance matrix from each imputed dataset. It combines the results from the 5-25 regressions, accounting for both within-regression and between-regression error. IVEware: Imputation and Variance Estimation Software, http://www.isr.umich.edu/src/smp/ive/. Developed by our own Dr. Raghunathan and researchers at the Survey Methodology Program.
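Combining within-regression and between-regression error is conventionally done with Rubin's combining rules. The sketch below shows those rules for a single coefficient; that IVEware computes exactly this is an assumption, the function name `pool_estimates` is hypothetical, and the five estimates and standard errors are made-up numbers for illustration only.

```python
import numpy as np


def pool_estimates(estimates, std_errors):
    """Pool one coefficient across M imputed datasets (Rubin's rules).

    Returns the pooled estimate and its standard error.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(std_errors, dtype=float) ** 2
    m = len(estimates)

    pooled = estimates.mean()               # average of the M estimates
    within = variances.mean()               # average within-imputation variance
    between = estimates.var(ddof=1)         # variance of estimates across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, np.sqrt(total)


# Made-up per-dataset results for one coefficient (M = 5).
estimates = [0.012, 0.015, 0.017, 0.014, 0.016]
std_errors = [0.009, 0.009, 0.009, 0.009, 0.009]

pooled, pooled_se = pool_estimates(estimates, std_errors)
print(f"pooled estimate = {pooled:.4f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term adds the uncertainty due to the missing values.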

16 Complete-Case Analysis vs MI
Entries are Estimate (SE).

Predictor Variable             Complete Case (n = 795)    SRMI Method 1 (n = 898)
Intercept                      -2.922 (0.791)             -2.61 (0.757)
Age                            0.015 (0.009)              0.015 (0.009)
Female                         -0.007 (0.203)             -0.115 (0.189)
Education                      -0.448 (0.173)             -0.467 (0.166)
BMI                            0.056 (0.018)              0.049 (0.013)
Current Smoker                 1.693 (0.569)              2.001 (0.543)
Former Smoker                  0.003 (0.284)              -0.029 (0.262)
Current Smoker x Yrs Smoked    -0.003 (0.015)             -0.008 (0.013)
Former Smoker x Yrs Smoked     0.019 (0.009)              0.014 (0.009)

17 Results of the Multiple Imputations. MI standard errors are the same or smaller, due to the additional subjects in the imputed data. There are modest changes in the relationship between smoking and CA. Years Smoked in Former Smokers is a significant predictor of cardiac arrest in the complete-case analysis, but NOT in the MI analysis (!!!)
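The change in significance can be checked directly from the Former Smoker x Yrs Smoked row on slide 16 by dividing each estimate by its standard error. The sketch below uses a normal approximation with the usual 1.96 cutoff, which is an assumption; an MI analysis would normally use a t reference distribution with imputation-specific degrees of freedom. The names `wald_z` and `CRITICAL_Z` are hypothetical.

```python
CRITICAL_Z = 1.96  # two-sided 5% level under a normal approximation (assumption)

# (estimate, SE) for Former Smoker x Yrs Smoked, taken from slide 16.
coefficients = {
    "complete case (n = 795)": (0.019, 0.009),
    "multiple imputation (n = 898)": (0.014, 0.009),
}


def wald_z(estimate, std_error):
    """Wald statistic: estimate divided by its standard error."""
    return estimate / std_error


z_scores = {name: wald_z(est, se) for name, (est, se) in coefficients.items()}

for name, z in z_scores.items():
    verdict = "significant" if abs(z) > CRITICAL_Z else "not significant"
    print(f"{name}: z = {z:.2f} ({verdict} at the 5% level)")
```

The ratios are about 2.11 for the complete-case analysis and about 1.56 for the MI analysis, so only the former exceeds the cutoff.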

18 Additional Variables MI Approach. Additional variables NOT in the substantive analysis can be used in the imputation model. The prediction for missing values in each variable borrows strength from all other variables. In our cardiac arrest example, imputing the dataset with 50 additional variables gave smaller SEs, i.e. improved efficiency compared with using only the variables in the analysis model.

19 In Addition… IVEware performs: 1. Single or multiple imputations. 2. Analyses accounting for clustering, stratification, and weighting. 3. Combining of information from multiple sources (plus some other functions beyond the scope of this presentation).

20 Critique. This article might be too challenging and complicated as an entry-level description of multiple imputation. Some of the foundational concepts from this article, such as the nonignorable missing-data mechanism, have not been covered thus far in the OJOC program. RECOMMENDATION: Start with “Survey Methodology” (2nd edition) by R.M. Groves, F.J. Fowler, Jr., M.P. Couper, J.M. Lepkowski, E. Singer, and R. Tourangeau, Wiley Series in Survey Methodology, John Wiley & Sons, Inc., 2009, p. 356.

21 Thank You for Your Attention!

