A REVIEW By Chi-Ming Kam Surajit Ray April 23, 2001
Imputation Techniques Implemented in SOLAS 3.0. SINGLE IMPUTATION: Hot Decking; Predicted Mean Imputation; Last Value Carried Forward. MULTIPLE IMPUTATION: Propensity Score Based Imputation; Predictive Model Based Imputation.
Method 1: Propensity Score Based Imputation. This was the only method in Version 1. The method is similar to Lavori, Dawson, and Shera (1995), “A multiple imputation strategy for clinical trials with truncation of patient data”. GOAL: to impute missing values under minimal distributional assumptions.
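How it Works: Let R be the indicator for the missingness pattern (R = 0 or 1). Data layout: covariates $X_1, X_2, \dots, X_P$, an outcome $Y$ with some values missing, and the indicator $R$. Model R from $X_1, X_2, \dots, X_P$ using logistic regression and compute $p = \Pr(R = 1 \mid X_1, X_2, \dots, X_P)$ for each case, yielding $N$ propensity scores $p_i$.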
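The propensity step above can be sketched in a few lines. This is a minimal illustration, not the SOLAS implementation: the function names `fit_logistic` and `propensity_scores`, the Newton-Raphson fitting routine, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def fit_logistic(X, r, n_iter=25, ridge=1e-8):
    """Fit logit Pr(R=1 | X) = b0 + b1*X1 + ... by Newton-Raphson."""
    A = np.column_stack([np.ones(len(X)), X])      # design matrix with intercept
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))        # current fitted probabilities
        grad = A.T @ (r - p)                       # score vector
        w = p * (1.0 - p)                          # Newton weights
        hess = A.T @ (A * w[:, None]) + ridge * np.eye(A.shape[1])
        beta = beta + np.linalg.solve(hess, grad)  # Newton update
    return beta

def propensity_scores(X, r):
    """Return one fitted probability p_i per case."""
    beta = fit_logistic(X, r)
    A = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-A @ beta))

# Hypothetical data: N cases, two covariates, R depends on the first one.
rng = np.random.default_rng(0)
N = 500
X = rng.normal(size=(N, 2))
true_p = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * X[:, 0])))
r = (rng.uniform(size=N) < true_p).astype(float)

p_hat = propensity_scores(X, r)   # the N values p_i
```

Each case ends up with a single number `p_hat[i]`, which is the only summary of the covariates that the later steps use.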
How it works… (Approximate Bayesian bootstrap; Rubin, 1987): Group the units by the quintiles of p (the grouping is user specified). Suppose that within a particular group there are $n_1$ observed and $n_0$ missing values.
(1) Sample $n_1 + n_0$ units with replacement from the observed values. (2) From the sampled pool, subsample $n_0$ units with replacement. (3) Use these $n_0$ units as the imputed values for the $n_0$ missing values. (4) Repeat the procedure m times to get m imputations. [Diagram: $n_1$ observed -> pool of $n_0 + n_1$ (with replacement) -> $n_0$ imputed (with replacement)]
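The grouping and resampling steps can be sketched as follows. This is an illustrative reading of the slides rather than SOLAS code: the function names, the use of `np.nan` to mark missing values, and the stand-in propensity scores in the demo are assumptions. The pool size of n1 + n0 follows the description above.

```python
import numpy as np

def quintile_groups(p, n_groups=5):
    """Label each case 0..n_groups-1 by where its score falls among the quantiles of p."""
    cuts = np.quantile(p, np.linspace(0, 1, n_groups + 1)[1:-1])  # interior cut points
    return np.searchsorted(cuts, p, side="right")

def abb_draw(y_obs, n0, rng):
    """Two-stage resampling: build a pool from the observed values, then draw n0 from it."""
    n1 = len(y_obs)
    pool = rng.choice(y_obs, size=n1 + n0, replace=True)  # pool size as stated on the slide
    return rng.choice(pool, size=n0, replace=True)

def propensity_abb_impute(y, p, m=5, seed=0):
    """y uses np.nan for missing values; returns a list of m completed copies of y."""
    rng = np.random.default_rng(seed)
    groups = quintile_groups(p)
    is_missing = np.isnan(y)
    completed = []
    for _ in range(m):
        y_new = y.copy()
        for g in np.unique(groups):
            miss = (groups == g) & is_missing
            obs = (groups == g) & ~is_missing
            # A group with no observed values cannot be imputed and is left as NaN.
            if miss.any() and obs.any():
                y_new[miss] = abb_draw(y[obs], int(miss.sum()), rng)
        completed.append(y_new)
    return completed

# Hypothetical demo: p stands in for fitted propensity scores.
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-x))
y = x + rng.normal(size=n)
y[rng.uniform(size=n) < 0.6 * p] = np.nan

completed = propensity_abb_impute(y, p, m=5)
```

Every imputed value is an observed Y from the same propensity group, which is why the method needs no distributional model for Y.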
Theoretical Justification: It produces an imputed distribution of Y that has been corrected for biases due to missingness related to X. It is similar in spirit to reweighting, but in a multiple imputation form. The method produces unbiased estimates of the marginal distribution of Y.
Problems/Drawbacks: The method does not preserve the association between Y and the individual $X_i$'s. Reasoning: The only aspect of the $X_i$'s that is used here is the linear predictor ($\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$) in the logistic model. This is the function that predicts the missingness of Y (that is, R), not Y itself.
Problems/Drawbacks (Continued…): Suppose $X_1$ is highly correlated with Y but is unrelated to P(R=1). Then $X_1$ will drop out of the logistic model and will not be used in the imputation. As a result, the imputed data will misrepresent the correlation between $X_1$ and Y. Also, by not using $X_1$ in the imputation, we fail to impute Y efficiently.
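A small simulation can illustrate this drawback. The setup is hypothetical and not taken from the slides: $X_1$ drives Y, missingness is unrelated to $X_1$, and because the propensity scores would then be essentially constant, the sketch simplifies by treating all cases as a single propensity group.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
x1 = rng.normal(size=n)
y = x1 + rng.normal(size=n)              # X1 is strongly related to Y
missing = rng.uniform(size=n) < 0.5      # missingness ignores X1 entirely

# Single-group resampling: pool from the observed Y values, then draw imputations.
y_obs = y[~missing]
n0 = int(missing.sum())
pool = rng.choice(y_obs, size=len(y_obs) + n0, replace=True)
y_imp = y.copy()
y_imp[missing] = rng.choice(pool, size=n0, replace=True)

corr_full = np.corrcoef(x1, y)[0, 1]                    # before any values are removed
corr_imputed = np.corrcoef(x1, y_imp)[0, 1]             # after imputation
corr_among_imputed = np.corrcoef(x1[missing], y_imp[missing])[0, 1]

print(f"complete data:      {corr_full:.2f}")
print(f"after imputation:   {corr_imputed:.2f}")
print(f"imputed cases only: {corr_among_imputed:.2f}")
```

Because the imputed values are drawn without reference to $X_1$, the correlation among the imputed cases should be close to zero, which pulls the overall correlation well below its complete-data value.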
Simulation Results Using SOLAS 1.1. Data Generation Mechanism: $Y = X + Z + \varepsilon$, where … and $\varepsilon \sim N(0, 1)$. Source: Paul D. Allison, “Multiple Imputation for Missing Data: A Cautionary Tale”.
Some Comments About the Propensity Score Based Method: The method can provide valid but possibly inefficient inferences about the marginal distribution of Y. The method can lead to very misleading inferences about the relationships between Y and other variables.
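Method 2: Predictive Model Based Multiple Imputation. This method is implemented in SOLAS 2.0 and 3.0. HOW IT WORKS: (1) Regress Y on $X_1, X_2, \dots, X_p$. (2) Get the estimates of $\beta_0, \beta_1, \beta_2, \dots, \beta_p$ and $\sigma$. (3) Draw $\beta_0^*, \beta_1^*, \beta_2^*, \dots, \beta_p^*, \sigma^*$ from an approximate posterior distribution. (4) Impute $Y^* = \beta_0^* + \beta_1^* X_1 + \beta_2^* X_2 + \dots + \beta_p^* X_p + \varepsilon^*$, where $\varepsilon^* \sim \text{Normal}(0, \sigma^{*2})$. (5) Repeat m times to get the m imputed datasets.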
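The steps above can be sketched as follows. The slides do not say which approximate posterior SOLAS uses, so this example assumes the standard draws under a noninformative prior for normal linear regression: $\sigma^{*2}$ is the residual sum of squares divided by a chi-square draw, and $\beta^*$ is normal around the least-squares estimate. The function name `draw_imputations` and the simulated data are likewise hypothetical.

```python
import numpy as np

def draw_imputations(X, y, m=5, seed=0):
    """y uses np.nan for missing values; returns a list of m completed copies of y."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    A = np.column_stack([np.ones(len(y)), X])     # design matrix with intercept
    A_obs, y_obs, A_mis = A[obs], y[obs], A[~obs]
    n1, q = A_obs.shape

    # Steps 1-2: least-squares fit on the cases with Y observed.
    XtX_inv = np.linalg.inv(A_obs.T @ A_obs)
    beta_hat = XtX_inv @ A_obs.T @ y_obs
    resid = y_obs - A_obs @ beta_hat
    sse = resid @ resid

    completed = []
    for _ in range(m):
        # Step 3 (assumed posterior): draw sigma*^2, then beta* given sigma*^2.
        sigma2_star = sse / rng.chisquare(n1 - q)
        beta_star = rng.multivariate_normal(beta_hat, sigma2_star * XtX_inv)
        # Step 4: prediction from the drawn coefficients plus fresh normal noise.
        noise = rng.normal(0.0, np.sqrt(sigma2_star), size=A_mis.shape[0])
        y_star = y.copy()
        y_star[~obs] = A_mis @ beta_star + noise
        completed.append(y_star)                  # Step 5: one dataset per pass
    return completed

# Hypothetical demo data: missingness depends on the first covariate.
rng = np.random.default_rng(7)
n = 400
X = rng.normal(size=(n, 2))
y_full = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)
missing = rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X[:, 0]))
y = np.where(missing, np.nan, y_full)

completed = draw_imputations(X, y, m=5)
```

Drawing new parameters for each of the m datasets, rather than reusing the point estimates, is what lets the imputations reflect uncertainty about the regression itself.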
Good points: The method provides correct model-based MI under the regression model and MAR. It also preserves the correlation between the $X_i$'s and Y. What is the difference from NORM? NORM does the same thing with MCMC. Under a multivariate normal model, both methods give the same results.
Which Software is More General? “I work for arbitrary missingness patterns.” “I work for non-linear relations of Y on X.” “But that’s probably very similar to NORM with rounding.”
Concluding Remarks: SOLAS is the first commercial missing data software. It has a good graphical interface. Data import from and export to other software is easy. It performs well under a monotone missingness pattern. Estimates are not always unbiased.