Missing Data
Peng Zhang, Jinnan Liu, Mei-ting Chiang, Yin Liu
12/29/2018
Outline
- Introduction
- Exercise 1
- Exercise 2
- Exercise 3
- Conclusion
Introduction
Objectives:
- Distinguish non-response mechanisms
- Examine methods used to deal with non-response
Data background
Variable name   Description

Health Determinants
AGE             Non-negative integer, 1~15; ordinal; no missing values
INCOME          Non-negative integer, 1~11; 99 = 'NOT STATED' (i.e. missing); ordinal
DEPRESSION      Probability of depression, 0~1, two decimal places; no missing values
CHRONIC         Number of chronic conditions; non-negative integer, 0~20; continuous; no missing values
VISITS          Number of doctor visits; non-negative integer, possibly greater than 10; continuous; no missing values
BodyMass        Body Mass Index (BMI)
SEX             Binary: Male (1), Female (2)
SMOKING         Smoking status; non-negative integer, 1~6; 99 = 'NOT STATED' (i.e. missing); ordinal

Health Status
HINDEX1         Self-assessed quality of health; integer from 1 to 5; ordinal; 1 is good and 5 is bad
HINDEX2         Health utility index; two-decimal value from 0 to 1; continuous; 1 is perfect and 0 is poor
Exercise One
Assessing the nature of the response mechanism:
- MCAR (Missing Completely at Random)
- MAR (Missing at Random)
- NMAR (Not Missing at Random)
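The three mechanisms can be illustrated with a small simulation. This is only a sketch on made-up data, not the survey data used in this talk: the variables x and y, the deletion probabilities, and the helper observed_mean are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: x is always observed, y is the variable that may go missing.
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.8 * xi + random.gauss(0, 0.6) for xi in x]


def observed_mean(values, drop_prob):
    """Mean of the values that survive deletion with per-record probability drop_prob."""
    kept = [v for v, p in zip(values, drop_prob) if random.random() >= p]
    return statistics.fmean(kept)


full = statistics.fmean(y)

# MCAR: every record has the same chance of being missing.
mcar = observed_mean(y, [0.3] * n)

# MAR: the chance of being missing depends only on the observed covariate x.
mar = observed_mean(y, [0.6 if xi > 0 else 0.0 for xi in x])

# NMAR: the chance of being missing depends on the unobserved value of y itself.
nmar = observed_mean(y, [0.6 if yi > 0 else 0.0 for yi in y])

print(f"full={full:.3f}  MCAR={mcar:.3f}  MAR={mar:.3f}  NMAR={nmar:.3f}")
```

In this sketch the complete-case mean stays close to the full mean only under MCAR. Under MAR the drift is driven by x, which is observed and can therefore be used to correct it; under NMAR the drift is driven by y itself, which is exactly what is missing.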
Assessing response mechanism
SAS output: Analysis of Maximum Likelihood Estimates

Parameter    DF   Estimate   Standard Error   Chi-Square   Pr > ChiSq
Intercept     1   -1.3812    0.2657            27.0207     <.0001
Intercept2    1    0.7257    0.2640             7.5563     0.0060
Intercept3    1    2.8791    0.2803           105.4983     <.0001
Intercept4    1    5.0895    0.3701           189.1461     <.0001
age           1   -0.1260    0.0252            25.0844     <.0001
sex           1    0.1492    0.0937             2.5361     0.1113
income        1    0.1146    0.0198            33.5146     <.0001
bodymass      1   -0.00350   0.00385            0.8236     0.3641
smoking       1    0.1623    0.0224            52.5551     <.0001
depression    1   -0.2716    0.1819             2.2294     0.1354
chronic       1   -0.3970    0.0408            94.8984     <.0001
visits        1   -0.0371    0.00497           55.6429     <.0001
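The Chi-Square column appears to be the Wald statistic, that is, the squared ratio of the estimate to its standard error. The sketch below recomputes it from the printed values as a sanity check; small differences are expected because the printed estimates and standard errors are rounded. The function name wald_chi_square is an illustrative assumption.

```python
# (estimate, standard error, reported chi-square) copied from the SAS output above.
rows = {
    "Intercept":  (-1.3812, 0.2657, 27.0207),
    "Intercept2": (0.7257, 0.2640, 7.5563),
    "Intercept3": (2.8791, 0.2803, 105.4983),
    "Intercept4": (5.0895, 0.3701, 189.1461),
    "age":        (-0.1260, 0.0252, 25.0844),
    "sex":        (0.1492, 0.0937, 2.5361),
    "income":     (0.1146, 0.0198, 33.5146),
    "bodymass":   (-0.00350, 0.00385, 0.8236),
    "smoking":    (0.1623, 0.0224, 52.5551),
    "depression": (-0.2716, 0.1819, 2.2294),
    "chronic":    (-0.3970, 0.0408, 94.8984),
    "visits":     (-0.0371, 0.00497, 55.6429),
}


def wald_chi_square(estimate, std_error):
    """Wald statistic with one degree of freedom: (estimate / SE) squared."""
    return (estimate / std_error) ** 2


recomputed = {name: wald_chi_square(est, se) for name, (est, se, _) in rows.items()}

for name, (_, _, reported) in rows.items():
    print(f"{name:11s} recomputed={recomputed[name]:9.4f}  reported={reported:9.4f}")
```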
Result of the assessment
The response mechanism is not MCAR but MAR.
All of the following imputations are based on this response mechanism.
Exercise Two
Deciding on a method to deal with the missing data, chosen from the popular methods:
- Mean imputation
- Regression imputation
- Multiple imputation
- EM algorithm
- Nearest neighbour
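Two of the simpler methods in this list, mean imputation and nearest-neighbour imputation, can be sketched in a few lines. The records, the choice of the first field as the matching variable, and the function names below are hypothetical; this is not the procedure actually run for the exercise.

```python
import statistics

# Hypothetical records: (age, income); None marks a missing income.
records = [(1, 3), (2, 4), (2, None), (5, 8), (6, None), (6, 9)]


def mean_impute(rows):
    """Replace every missing income with the mean of the observed incomes."""
    observed = [inc for _, inc in rows if inc is not None]
    fill = statistics.fmean(observed)
    return [(age, fill if inc is None else inc) for age, inc in rows]


def nearest_neighbour_impute(rows):
    """Replace a missing income with the income of the donor closest in age."""
    donors = [(age, inc) for age, inc in rows if inc is not None]
    completed = []
    for age, inc in rows:
        if inc is None:
            # The first donor wins when several are equally close.
            _, inc = min(donors, key=lambda d: abs(d[0] - age))
        completed.append((age, inc))
    return completed


print(mean_impute(records))
print(nearest_neighbour_impute(records))
```

Mean imputation gives every incomplete record the same value and so shrinks the variance of the imputed variable, whereas the nearest-neighbour version borrows a value that was actually observed in a similar record.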
Conclusion of imputation
- Impute missing values using regression.
Comparing the results of the five methods, we conclude that regression imputation is the most efficient in our case.
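Regression imputation fits a model on the complete records and fills each gap with the model's prediction. The sketch below uses a single hypothetical predictor and made-up values; the predictors actually used in the exercise are not stated here, so both the data and the function names are assumptions.

```python
import statistics

# Hypothetical records: (chronic, income); None marks a missing income.
data = [(0, 8.0), (1, 7.0), (2, None), (3, 5.0), (4, 4.0), (5, None)]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def regression_impute(rows):
    """Fit on the complete records, then predict the missing responses."""
    complete = [(x, y) for x, y in rows if y is not None]
    intercept, slope = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [(x, intercept + slope * x if y is None else y) for x, y in rows]


completed = regression_impute(data)
print(completed)
```

Because the prediction uses the observed covariates, this approach is compatible with a MAR mechanism in a way that plain mean imputation is not. A common refinement, not shown here, adds a random residual to each prediction so that the imputed values keep realistic variability.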
Exercise Three
Analysis and comparison:
- The linear mixed model
- The log-linear regression model
- Comparison of hindex1 and hindex2
Linear Mixed Model
Figure 1: Histogram of hindex2
Figure 2: The relationship between hindex2 and income in each age group
Figure 3: The relationship between hindex1 and income in each age group
Linear Mixed Model
Fit the linear mixed model:
Fixed effects: hindex2log ~ income + depression + chronic + visits

              Value       Std.Error    DF     t-value     p-value
(Intercept)    1.535417   0.07323138   2376    20.96665   <.0001
income         0.028688   0.00400363   2376     7.16541   <.0001
depression    -0.400730   0.03896362   2376   -10.28473   <.0001
chronic       -0.059074   0.00765291   2376    -7.71910   <.0001
visits        -0.011897   0.00100756   2376   -11.80800   <.0001
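The fixed-effects table can be turned into approximate 95% confidence intervals. The sketch below assumes the normal quantile 1.96 is an acceptable stand-in for the t quantile, which seems reasonable with 2376 degrees of freedom; the function name approx_ci is illustrative.

```python
# (estimate, standard error) for the fixed effects, copied from the table above.
fixed_effects = {
    "(Intercept)": (1.535417, 0.07323138),
    "income":      (0.028688, 0.00400363),
    "depression":  (-0.400730, 0.03896362),
    "chronic":     (-0.059074, 0.00765291),
    "visits":      (-0.011897, 0.00100756),
}

Z_95 = 1.96  # assumption: normal quantile used in place of the t quantile (DF = 2376)


def approx_ci(estimate, std_error, z=Z_95):
    """Approximate two-sided interval: estimate plus or minus z standard errors."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


intervals = {name: approx_ci(est, se) for name, (est, se) in fixed_effects.items()}

for name, (low, high) in intervals.items():
    print(f"{name:12s} [{low: .4f}, {high: .4f}]")
```

None of the intervals contains zero, which agrees with the p-values reported in the table.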
Log-linear Regression Model
Fit the log-linear model:
Coefficients:

              Value          Std. Error      t value
(Intercept)    0.848604497   0.00204314170    415.34295
age            0.005055119   0.00019830895     25.49113
income        -0.027352596   0.00019163407   -142.73348
smoking       -0.030203852   0.00021805431   -138.51527
chronic        0.077974419   0.00033867474    230.23394
visits         0.006523856   0.00004247647    153.58752
Association between hindex1 and hindex2
Figure 4: The relationship between hindex1 and hindex2
Coefficients:

              Value        Std. Error   t value
(Intercept)    1.1409198   0.03973353    28.71428
hindex2log    -0.2669291   0.02450110   -10.89457
Figure 5: The relationship between hindex1 and hindex2 in each age group
Conclusion
- hindex2, the health utility index, is a more objective, useful, and appropriate measure of health status than hindex1, the self-assessment answer, and it reveals more information.
- hindex1 values are almost all close to 1 or 2, which suggests that people tend to overestimate their health status regardless of age and income level.
- Age still plays the most important role in people's health status.
Thank you!
- Statistical Society of Canada, for providing the data
- Prof. Peggy Ng, for financial support
- Prof. Peter Song, for providing books on the EM algorithm
- Mr. BaiFang Xing, for helpful discussions