How to Handle Missing Values in Multivariate Data
By Jeff McNeal & Marlen Roberts

The Missing Data Problem
- Problems with Statistical Inference
- Sample Size & Power
- Biased Results
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 1-2). Hoboken, New Jersey: John Wiley & Sons.

Real World Examples
- Respondents in a household survey refuse to report income
- Missing results of manufacturing experiment due to equipment failure
- Voters' inability to express preference for a political candidate in an opinion poll
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 1-2). Hoboken, New Jersey: John Wiley & Sons.

Outline
- Common Assumptions and Missing Data Patterns
- Taxonomy of Methods for Handling Missing Values
- Multiple Imputation
- Maximum Likelihood
- Simulation

Missing Data Patterns
- Not all missing data are created equal
- Missing due to a random process
- Missing due to a non-random process

A Simple Example: Income Survey
Westfall, P., & Henning, K. (2013). Understanding Advanced Statistical Methods (1st ed.). Boca Raton, Florida: CRC Press, Taylor & Francis Group.

Univariate Missing Data Process: MCAR
P.H. Westfall
Multivariate Missing Data Processes: MCAR and MAR
Missing Data Processes: MNAR
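The three missing-data processes (MCAR, MAR, MNAR) can be illustrated with a small simulation in the spirit of the income survey example. This is only a sketch: the variables `age` and `income`, the missingness probabilities, and the sample size are all hypothetical choices made for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical survey: age is always observed, income may be missing.
age = rng.uniform(20, 70, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, size=n)

# MCAR: every respondent has the same 30% chance of not reporting income.
miss_mcar = rng.random(n) < 0.30

# MAR: the chance of missingness depends only on an observed variable (age).
miss_mar = rng.random(n) < np.where(age > 45, 0.50, 0.10)

# MNAR: the chance of missingness depends on the unobserved income itself.
miss_mnar = rng.random(n) < np.where(income > np.median(income), 0.50, 0.10)

# Compare the mean of the reported incomes with the mean of all incomes.
true_mean = income.mean()
bias = {}
for name, miss in [("MCAR", miss_mcar), ("MAR", miss_mar), ("MNAR", miss_mnar)]:
    bias[name] = income[~miss].mean() - true_mean
    print(f"{name}: observed mean minus true mean = {bias[name]:,.0f}")
```

In this setup the mean of the reported incomes stays close to the truth only under MCAR. Under MAR the bias could in principle be removed by conditioning on age; under MNAR it cannot be removed using the observed data alone.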
Taxonomy of Missing-Data Methods
- Complete Case Analysis (Listwise Deletion)
- Available Case Analysis (Pairwise Deletion)
- Least Squares on Imputed Data
- Multiple Imputation
- Maximum Likelihood (and Bayes)
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp ). Hoboken, New Jersey: John Wiley & Sons.
Complete Case Analysis (Listwise Deletion)
- Easy to implement
- Works well when MCAR assumption is met
- Wastes a lot of information
Q/Regression%20with%20Missing%20X's.pdf
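A minimal sketch of listwise deletion with NumPy, assuming missing values are coded as NaN. The small matrix `X` is made up for illustration; it shows how dropping incomplete cases also throws away values that were actually observed.

```python
import numpy as np

# Hypothetical data matrix: rows are cases, columns are variables,
# and NaN marks a missing value.
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, np.nan],
    [10.0, 11.0, 12.0],
    [np.nan, 14.0, 15.0],
])

# Keep only the cases with no missing value in any variable.
complete_rows = ~np.isnan(X).any(axis=1)
X_complete = X[complete_rows]

n_values_observed = int((~np.isnan(X)).sum())
n_values_kept = X_complete.size
print(f"cases kept: {X_complete.shape[0]} of {X.shape[0]}")
print(f"observed values kept: {n_values_kept} of {n_values_observed}")
```

Here only two of five cases survive, and half of the observed values are discarded along with the incomplete rows.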
Available Case Analysis (Pairwise Deletion)
- Attempts to minimize the loss of data in listwise deletion
- Increases the power of your test
- Is usually outperformed by Maximum Likelihood
- Caveat: Can result in non-positive definite covariance matrices
Q/Regression%20with%20Missing%20X's.pdf
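The non-positive definite caveat can be seen in a deliberately extreme toy example. The sketch below assumes NaN-coded missing values; the helper `pairwise_corr` and the data are hypothetical. Each pair of variables is observed on a different block of cases, so the three pairwise correlations are mutually inconsistent.

```python
import numpy as np

def pairwise_corr(X):
    """Correlation matrix where each entry uses the cases observed on both variables."""
    p = X.shape[1]
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            R[j, k] = R[k, j] = np.corrcoef(X[both, j], X[both, k])[0, 1]
    return R

nan = np.nan
X = np.array([
    # x1 and x2 observed together: perfectly positively related
    [1, 1, nan], [2, 2, nan], [3, 3, nan], [4, 4, nan],
    # x2 and x3 observed together: perfectly positively related
    [nan, 1, 1], [nan, 2, 2], [nan, 3, 3], [nan, 4, 4],
    # x1 and x3 observed together: perfectly negatively related
    [1, nan, 4], [2, nan, 3], [3, nan, 2], [4, nan, 1],
], dtype=float)

R = pairwise_corr(X)
eigenvalues = np.linalg.eigvalsh(R)
print(R)
print("eigenvalues:", eigenvalues)
```

The smallest eigenvalue comes out negative, so `R` is not a valid correlation matrix, and procedures that need to invert or factor it can fail. Real data are rarely this extreme, but the same mechanism is at work.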
Least Squares Imputation Methods
- Unconditional Mean Substitution
- Conditional Mean Imputation based on X
- Conditional Mean Imputation based on X and Y
Q/Regression%20with%20Missing%20X's.pdf
Unconditional Mean Substitution
- Just take the sample mean of the observed data and use it for the missing values
- Heavily biases the covariance matrix
- Bias can be corrected but the inferences (confidence intervals, tests, etc.) are distorted and over-precise
Q/Regression%20with%20Missing%20X's.pdf
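A small sketch of why mean substitution biases the covariance matrix, using made-up vectors `x` and `y`. Every imputed value sits exactly at the mean, so it adds nothing to the sums of squares and cross-products while still inflating the sample size.

```python
import numpy as np

# Hypothetical data: x is missing for the last four cases, y is complete.
x = np.array([2.0, 4.0, 6.0, 8.0, np.nan, np.nan, np.nan, np.nan])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

observed = ~np.isnan(x)
# Replace each missing x with the mean of the observed x values.
x_filled = np.where(observed, x, x[observed].mean())

var_observed = x[observed].var(ddof=1)
var_filled = x_filled.var(ddof=1)
cov_observed = np.cov(x[observed], y[observed])[0, 1]
cov_filled = np.cov(x_filled, y)[0, 1]

print(f"variance of x:  observed cases {var_observed:.3f}, after mean fill {var_filled:.3f}")
print(f"cov(x, y):      observed cases {cov_observed:.3f}, after mean fill {cov_filled:.3f}")
```

Both the variance and the covariance shrink toward zero after filling, which is the source of the over-precise inferences noted above.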
Conditional Mean Imputation
Q/Regression%20with%20Missing%20X's.pdf

Multiple Imputation
Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp ). Hoboken, New Jersey: John Wiley & Sons.
Steps Involved in Multiple Imputation
- Introduce random variation into the process of imputing missing values
- Generate several data sets, each with different imputed values
- Perform an analysis on each data set
- Combine the results into a single set of parameter estimates, standard errors, and test statistics
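The four steps can be sketched end to end on simulated data. Everything here is an assumption made for illustration: the data-generating model, the 30% MCAR missingness in `x`, the number of imputations `M`, the helper `ols`, and the choice of a simple regression as the analysis. For brevity the imputation model's parameters are held fixed at their estimates; a fuller implementation would also draw them at random for each imputed data set.

```python
import numpy as np

rng = np.random.default_rng(2)
n, M = 500, 20

# Hypothetical data: y is fully observed, x is missing (MCAR) for about 30% of cases.
x_full = rng.normal(0, 1, size=n)
y = 1.0 + 2.0 * x_full + rng.normal(0, 1, size=n)
missing = rng.random(n) < 0.30
obs = ~missing
x = np.where(missing, np.nan, x_full)

def ols(design, response):
    """Return OLS coefficients, their covariance matrix, and the residual variance."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    sigma2 = resid @ resid / (len(response) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef, cov, sigma2

# Imputation model: regress x on y using the complete cases.
gamma, _, s2 = ols(np.column_stack([np.ones(obs.sum()), y[obs]]), x[obs])
Z_mis = np.column_stack([np.ones(missing.sum()), y[missing]])

slopes, variances = [], []
for _ in range(M):
    # Steps 1-2: impute with random variation, giving a different data set each time.
    x_imp = x.copy()
    x_imp[missing] = Z_mis @ gamma + rng.normal(0, np.sqrt(s2), size=missing.sum())
    # Step 3: run the analysis of interest on the completed data (regress y on x).
    coef, cov, _ = ols(np.column_stack([np.ones(n), x_imp]), y)
    slopes.append(coef[1])
    variances.append(cov[1, 1])

# Step 4: combine the M results into one estimate and one standard error.
slopes, variances = np.array(slopes), np.array(variances)
pooled_slope = slopes.mean()
within = variances.mean()           # average sampling variance
between = slopes.var(ddof=1)        # variance across imputations
total_var = within + (1 + 1 / M) * between
print(f"pooled slope = {pooled_slope:.3f}, standard error = {np.sqrt(total_var):.3f}")
```

Once the imputed data sets exist, step 3 is an ordinary complete-data analysis repeated `M` times, and the only extra work is the pooling in step 4.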
Introducing Randomness into a M.I. Model
Adding Variability to the Imputed Values
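One common way to add variability, sketched here under assumed simulated data, is to impute the conditional mean from a regression plus a random draw from the estimated residual distribution. The comparison below contrasts that with imputing the conditional mean alone; the variable names and the 40% missingness rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical data: y is fully observed, x is missing (MCAR) for about 40% of cases.
y = rng.normal(0, 1, size=n)
x_true = 0.5 * y + rng.normal(0, 1, size=n)
missing = rng.random(n) < 0.40
obs = ~missing

# Regress x on y using the complete cases.
A = np.column_stack([np.ones(obs.sum()), y[obs]])
coef, *_ = np.linalg.lstsq(A, x_true[obs], rcond=None)
resid = x_true[obs] - A @ coef
sigma = np.sqrt(resid @ resid / (obs.sum() - 2))

cond_mean = coef[0] + coef[1] * y[missing]

# Deterministic: impute the predicted value only.
x_det = x_true.copy()
x_det[missing] = cond_mean

# Stochastic: predicted value plus a random residual draw.
x_sto = x_true.copy()
x_sto[missing] = cond_mean + rng.normal(0, sigma, size=missing.sum())

print(f"variance of x, full data:           {x_true.var(ddof=1):.3f}")
print(f"variance of x, conditional mean:    {x_det.var(ddof=1):.3f}")
print(f"variance of x, with random residual: {x_sto.var(ddof=1):.3f}")
```

Imputing the conditional mean alone places every imputed point exactly on the regression line and understates the spread of `x`; adding the residual draw restores a variance close to that of the full data.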
Why Do We Want to Add Variability?
- This is the whole point of multiple imputation
Combining Inferences from Imputed Data
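The standard combining rules (Rubin's rules) can be written out as follows. The notation is an assumption chosen for this sketch and may differ from the slide's: $M$ is the number of imputed data sets, $\hat{Q}_m$ is the estimate from data set $m$, and $U_m$ is its estimated sampling variance.

```latex
\begin{aligned}
\bar{Q} &= \frac{1}{M}\sum_{m=1}^{M} \hat{Q}_m
  && \text{(pooled estimate)} \\
\bar{U} &= \frac{1}{M}\sum_{m=1}^{M} U_m
  && \text{(within-imputation variance)} \\
B &= \frac{1}{M-1}\sum_{m=1}^{M} \bigl(\hat{Q}_m - \bar{Q}\bigr)^2
  && \text{(between-imputation variance)} \\
T &= \bar{U} + \Bigl(1 + \frac{1}{M}\Bigr) B
  && \text{(total variance)}
\end{aligned}
```

The pooled standard error is $\sqrt{T}$. The between-imputation term $B$ is what carries the extra uncertainty due to the missing values.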
Simplified Form using a Regression Example
Likelihood-Based Inference
ML with Ignorable Missing Data
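As a sketch of what ignorability buys, write $Y_{obs}$ and $Y_{mis}$ for the observed and missing parts of the data, $R$ for the missingness indicators, $\theta$ for the parameters of the data model, and $\psi$ for the parameters of the missingness process (this notation is assumed here, not taken from the slide). If the data are MAR and $\theta$ and $\psi$ are distinct, the joint density of what is observed factors as

```latex
f(Y_{obs}, R \mid \theta, \psi)
  = f(R \mid Y_{obs}, \psi)\,
    \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis},
```

so inference about $\theta$ can be based on the observed-data likelihood alone:

```latex
L(\theta \mid Y_{obs}) \;\propto\; \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis}.
```

The missingness process does not need to be modeled in this case, which is why it is called ignorable.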
Comparison of Methods

Listwise
- Easiest to implement
- Has minimal effect if data are MCAR, or MAR for large sample sizes
- Has a tendency to bias results

Pairwise
- Uses more information than listwise
- Increases statistical power
- Also easy to implement

Multiple Imputation / Maximum Likelihood
- Requires no special software once the imputed datasets are generated
- Requires specification of a model
- Requires more assumptions
- Requires specification of a model for each variable
- Most asymptotically efficient
- Most complex
- You get model comparison statistics (AIC, BIC, etc.)