Published by Hubert Norman. Modified over 6 years ago.
1
Missing data: Why you should care about it and what to do about it
2
Lecture overview
- Why care about missing data?
- Diagnosing different types of missing data
- Bad methods of handling missing data
- Better methods of handling missing data
3
- General info about missing data: Acock (2005)
- Theory behind multiple imputation: Rubin (1996); Schafer (1999)
- How to do multiple imputation using the MICE package in R: Van Buuren and Groothuis-Oudshoorn (2011)
- Empirical examples: Van Buuren, Boshuizen, and Knook (1999); Sundell et al. (2008); Devine et al. (2012)
5
Not all missing data are the same
- Missing by design: values are missing by definition of the population of interest
- Missing completely at random (MCAR): missingness is unrelated to any variable, observed or unobserved
- Missing at random (MAR): after accounting for one or more other observed variables, missing values are randomly distributed
- Non-ignorable (NI): missingness depends on the unobserved values of the variable itself
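The three random-missingness mechanisms can be illustrated with a small simulation. This is a minimal sketch on made-up data: the occupations, the income formula, and the cutoffs (60% women, income below 40000, the 30% and 75% deletion rates) are all assumptions chosen for illustration, not values from the lecture.

```python
import random

random.seed(0)

# Hypothetical occupations: income is lower, on average, where pct_women is high.
n = 1000
rows = []
for _ in range(n):
    pct_women = random.uniform(0, 100)
    income = 60000 - 300 * pct_women + random.gauss(0, 8000)
    rows.append({"pct_women": pct_women, "income": income})

def make_missing(rows, rule):
    """Return the income column with None wherever rule(row) is True."""
    return [None if rule(r) else r["income"] for r in rows]

# MCAR: every income value has the same 30% chance of being missing.
mcar = make_missing(rows, lambda r: random.random() < 0.3)
# MAR: missingness depends only on an observed variable (pct_women).
mar = make_missing(rows, lambda r: r["pct_women"] > 60 and random.random() < 0.75)
# NI: missingness depends on the income value that goes missing.
ni = make_missing(rows, lambda r: r["income"] < 40000 and random.random() < 0.75)

def observed_mean(values):
    """Mean of the values that are not missing."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

full_mean = sum(r["income"] for r in rows) / n
print(f"full data mean: {full_mean:.0f}")
for name, column in [("MCAR", mcar), ("MAR", mar), ("NI", ni)]:
    print(f"{name}: observed mean {observed_mean(column):.0f}, missing {column.count(None)}")
```

In this setup the observed mean under MCAR stays close to the full-data mean, while under MAR and NI it drifts upward because low incomes are missing more often.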
7
Income randomly missing (MCAR)
8
[Figure panels: full data; MCAR data]
9
Income missing for professions with a high percentage of women (MAR)
10
[Figure panels: full data; MAR data]
11
Income missing for low-income professions (NI)
12
[Figure panels: full data; NI data]
13
Why care about missing data?
- Missing-by-design data are not a problem
- MCAR data inflate the standard errors of your parameter estimates
- MAR or NI data bias BOTH parameter estimates and standard errors in unpredictable ways
14
How much missing data is too much?
- Hard to say
- Small amounts of missing data can sometimes greatly affect analysis if missing values are extreme
- Missing data are particularly problematic when the data are MAR or NI
16
Missing data diagnostics
Goals:
- Make a reasonable guess about the type of missing data you have
- Find variables that predict missingness or observed missing values
Diagnostic options:
- Statistical
  - Missingness patterns
  - Pairwise complete correlations between your available variables
  - Correlations between response indicators for your variables with missing values and other variables
- Graphical
  - Margin plots
  - Other options in the VIM package
17
Finding patterns of missingness
- rr matrix: the number of observations for which both the row and column variables were observed
- rm matrix: the number of observations for which the row variable was observed, but the column variable was not
- mr matrix: the number of observations for which the column variable was observed, but the row variable was not
- mm matrix: the number of observations for which neither the row nor column variables were observed
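The four matrices defined above can be computed by counting, for every pair of variables, how many cases fall into each observed/missing combination. The sketch below is a plain-Python illustration of those definitions, not the MICE implementation; the function name `md_pairs` and the toy data are assumptions.

```python
def md_pairs(data):
    """Count observed/missing combinations for every pair of variables.

    data maps a variable name to a list of values, with None marking a missing value.
    Returns a dict with keys "rr", "rm", "mr", "mm"; each value is a nested dict
    indexed as [row_variable][column_variable].
    """
    names = list(data)
    n_cases = len(data[names[0]])
    counts = {
        key: {row: {col: 0 for col in names} for row in names}
        for key in ("rr", "rm", "mr", "mm")
    }
    for i in range(n_cases):
        for row in names:
            for col in names:
                row_code = "r" if data[row][i] is not None else "m"
                col_code = "r" if data[col][i] is not None else "m"
                counts[row_code + col_code][row][col] += 1
    return counts

# Hypothetical toy data with five cases.
data = {
    "income": [10, None, 30, None, 50],
    "prestige": [1, 2, None, 4, 5],
}
pairs = md_pairs(data)
for key in ("rr", "rm", "mr", "mm"):
    print(key, pairs[key]["income"]["prestige"])
```

With income as the row variable and prestige as the column variable, two cases have both observed (rr), one has income only (rm), two have prestige only (mr), and none lack both (mm).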
18
Finding patterns of missingness
Reading the missingness pattern table:
- 1: not missing; 0: missing
- # of cases fitting this missingness pattern
- # of variables with missing values following this pattern
- # of cases with missing values on the column variable
With this simple pattern of missingness, all available variables potentially have information about why some values of income are missing.
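A table of this kind can be approximated by coding each case as a tuple of 1s and 0s and counting how often each tuple occurs. This is a rough sketch of the idea rather than the output format of any particular package; the function name `missing_patterns` and the toy data are assumptions.

```python
from collections import Counter

# Hypothetical toy data; None marks a missing value.
data = {
    "education": [12, 16, 10, 14, 18],
    "prestige": [40, 70, 30, None, 80],
    "income": [10, None, 30, None, 50],
}

def missing_patterns(data):
    """Tabulate missingness patterns (1 = observed, 0 = missing) across cases."""
    names = list(data)
    n_cases = len(data[names[0]])
    patterns = Counter(
        tuple(1 if data[v][i] is not None else 0 for v in names)
        for i in range(n_cases)
    )
    missing_per_variable = {v: sum(x is None for x in data[v]) for v in names}
    return names, patterns, missing_per_variable

names, patterns, missing_per_variable = missing_patterns(data)
print("n_cases", names, "n_missing_variables")
for pattern, count in patterns.most_common():
    # pattern.count(0) is the number of variables missing under this pattern.
    print(count, pattern, pattern.count(0))
print("cases missing per variable:", missing_per_variable)
```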
19
Correlations
- Predicting available cases: pairwise complete correlations
- Predicting missingness: correlations with response indicators
- In this example, % women is a strong predictor of missingness, and prestige is a strong predictor of observed income
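Both kinds of correlation can be sketched in a few lines. A response indicator is 1 where a value was observed and 0 where it is missing; correlating it with another variable shows whether that variable predicts missingness. The data below are invented for illustration, and `pearson` is a hypothetical helper that uses pairwise-complete cases.

```python
import math

def pearson(x, y):
    """Pearson correlation using only the cases where both values are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical occupations; None marks a missing income.
pct_women = [10, 20, 30, 70, 80, 90]
prestige = [70, 60, 55, 40, 35, 30]
income = [50, 45, 40, None, None, 20]

# Response indicator: 1 if income was observed, 0 if it is missing.
income_observed = [1 if v is not None else 0 for v in income]

# Predicting missingness: does pct_women relate to whether income was observed?
r_missingness = pearson(income_observed, pct_women)
# Predicting observed values: pairwise complete correlation of income and prestige.
r_observed = pearson(income, prestige)

print(f"response indicator vs pct_women: {r_missingness:.2f}")
print(f"income vs prestige (pairwise complete): {r_observed:.2f}")
```

A negative correlation between the response indicator and `pct_women` means income is observed less often where the percentage of women is high.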
20
Margin plots
22
Bad methods: Casewise exclusion
- Eliminate each case that has a missing value
- The implicit “standard” method of dealing with missing data
- Can be a (somewhat) acceptable method when data are MCAR
23
How reasonable is it to assume that missing data in the social sciences are MCAR?
24
Bad methods: Casewise exclusion
- Eliminate each case that has a missing value
- The implicit “standard” method of dealing with missing data
- Can be a (somewhat) acceptable method when data are MCAR
- When data are MAR or NI, it produces unpredictable bias in standard errors and parameter estimates
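Casewise exclusion is easy to express in code, which is part of why it is the default in so much software. The sketch below uses invented occupation records and a hypothetical helper, `casewise_exclude`, to show how the retained cases can differ from the full sample when missingness is related to another variable.

```python
# Hypothetical occupation records; None marks a missing value.
rows = [
    {"prestige": 70, "pct_women": 10, "income": 50},
    {"prestige": 60, "pct_women": 20, "income": 45},
    {"prestige": 55, "pct_women": 30, "income": 40},
    {"prestige": 40, "pct_women": 70, "income": None},
    {"prestige": 35, "pct_women": 80, "income": None},
    {"prestige": None, "pct_women": 90, "income": 20},
]

def casewise_exclude(rows):
    """Keep only the cases with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

complete = casewise_exclude(rows)

mean_pct_all = sum(r["pct_women"] for r in rows) / len(rows)
mean_pct_complete = sum(r["pct_women"] for r in complete) / len(complete)

print(f"cases kept: {len(complete)} of {len(rows)}")
print(f"mean pct_women, all cases: {mean_pct_all:.0f}")
print(f"mean pct_women, complete cases: {mean_pct_complete:.0f}")
```

Half of the cases are dropped here, and the surviving cases over-represent occupations with few women, so any estimate related to `pct_women` is computed on an unrepresentative subsample.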
26
Bad methods: Mean substitution
- Substitute the mean of the variable for the missing values
- Leads to systematic bias in SE and, when data are MAR or NI, parameter estimates
- NEVER a good method of handling missing data
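One source of the bias in standard errors can be seen directly: filling in the mean adds cases that sit exactly at the center of the distribution, so the variable's spread shrinks while the sample size appears larger. A minimal sketch on invented income values:

```python
import statistics

# Hypothetical incomes; None marks a missing value.
income = [20, 25, 30, None, 45, None, 60, 80]

observed = [v for v in income if v is not None]
mean_observed = statistics.mean(observed)

# Mean substitution: replace every missing value with the observed mean.
filled = [mean_observed if v is None else v for v in income]

sd_observed = statistics.stdev(observed)
sd_filled = statistics.stdev(filled)

print(f"mean before and after: {mean_observed:.2f}, {statistics.mean(filled):.2f}")
print(f"SD of observed values: {sd_observed:.2f}")
print(f"SD after mean substitution: {sd_filled:.2f}")
```

The mean is unchanged, but the standard deviation is smaller after substitution, which understates uncertainty in anything computed from the filled-in variable.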
28
Better methods of handling missing data
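Full information maximum likelihood (FIML) methods
- Can handle data that are MAR; standard implementations assume ignorable missingness, so NI data require explicitly modeling the missingness mechanism
- Implemented as part of particular statistical models
- Missing data are handled during analysis
Multiple imputation
- Can also handle data that are MAR, with the same caveat for NI data
- Simulation-based approach
- Missing data are handled separately from analysis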
29
Multiple imputation
1. Generate multiple completed datasets (imputations) through simulation (typically only 5 to 10 are needed)
2. Perform analyses on each imputation
3. Combine the multiple analyses using a set of special rules (Rubin’s (1987) rules)
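The three steps can be sketched end to end on simulated data. This is a deliberately simplified stand-in, not the algorithm used by the MICE package: it imputes one variable from one predictor by fitting a regression to a bootstrap sample of the complete cases and adding random residual noise. The data-generating model, the missingness rule, and the helper names (`fit_line`, `impute_once`) are all assumptions.

```python
import random
import statistics

random.seed(1)

# Hypothetical data: x is fully observed; y is missing more often when x is large (MAR).
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y_true = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]
y = [None if (xi > 0 and random.random() < 0.6) else yi for xi, yi in zip(x, y_true)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual SD)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(xs, ys)]
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd

complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]

def impute_once():
    """Step 1: build one completed dataset for y."""
    # Bootstrap the complete cases so each imputation reflects parameter uncertainty.
    boot = random.choices(complete, k=len(complete))
    a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
    # Fill each missing y with its prediction plus random residual noise.
    return [yi if yi is not None else a + b * xi + random.gauss(0, sd)
            for xi, yi in zip(x, y)]

m = 5
# Step 2: run the analysis (here, just the mean of y) on each imputation.
estimates = [statistics.mean(impute_once()) for _ in range(m)]
# Step 3, point estimate only: average the m estimates.
pooled_mean = statistics.mean(estimates)

complete_case_mean = statistics.mean(yi for _, yi in complete)
full_mean = statistics.mean(y_true)
print(f"full data mean:     {full_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"pooled MI mean:     {pooled_mean:.2f}")
```

Because missingness depends on `x`, the complete-case mean is pulled downward, while the imputations use `x` to recover values in the under-observed region. Step 3 here pools only the point estimate; the variance calculations in Rubin's rules are a separate matter.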
32
Generating imputations
- Imputations are generated through maximum-likelihood based Markov-chain Monte Carlo (MCMC)
- Exact details of how imputations are generated vary from method to method
- The quality of the simulations depends on how well the analyst can explain observed values and missingness of imputed variables
33
Which variables do you use to generate your simulations?
- In general, the more variables the better (to a point; multicollinear variables can crash the simulation)
- Always use variables that will be involved in your final analysis, including interaction terms and contrasts
- Use variables that will not be included in the analysis, but that are good predictors of observed values of imputed variables
- Include variables that are good predictors of missingness
- Only use predictors that are themselves mostly observed in the cases where the imputed variables are missing
34
Convergence
- For each imputation, the missing values of each variable are iteratively estimated
- In each iteration, the means and standard deviations of those missing values are slightly different
- Iteration continues until the means and standard deviations of the imputed values across the imputed datasets start to cluster (“converge”)
35
Imputation procedure Checking for convergence
36
Two examples of non-convergence
[Figure: trace plots of the means and SDs of imputed values for the two examples]
37
The estimated means for gen and phb start high in the first few iterations, then converge toward lower values
[Figure: trace plots of means and SDs]
38
Checking that the imputed values are reasonable
40
Perform analyses as usual on each simulated dataset
The intercepts and slopes of the linear model vary slightly across the simulated datasets
42
- The overall estimate of your parameter (Q-bar) is its mean across the m imputations
- The within-imputation variance (U-bar) of the Q parameter is the mean of the variances across the m imputations
- The between-imputation variance (B) of the Q parameter is the variance of Q across the m imputations
- The total variance of Q is a function of U-bar and B: T = U-bar + (1 + 1/m) B. This total variance is used to calculate the standard error used for test statistics
- The degrees of freedom (v) are adjusted for the amount of information lost to missing data
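The pooling rules can be written out as a short function. This sketch takes one estimate and one squared standard error per imputation; it uses Rubin's (1987) large-sample degrees of freedom, v = (m - 1)(1 + U-bar / ((1 + 1/m) B))^2, and software may apply further small-sample adjustments. The function name `pool_rubin` and the input numbers are made up for illustration.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool m estimates and their variances (squared standard errors)."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)      # overall estimate
    u_bar = statistics.mean(variances)      # within-imputation variance
    b = statistics.variance(estimates)      # between-imputation variance (m - 1 denominator)
    t = u_bar + (1 + 1 / m) * b             # total variance
    # Rubin's (1987) large-sample degrees of freedom; infinite if the estimates never vary.
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2 if b > 0 else math.inf
    return {"q_bar": q_bar, "u_bar": u_bar, "b": b, "t": t, "se": math.sqrt(t), "df": df}

# Hypothetical slope estimates and squared standard errors from m = 5 imputations.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]

pooled = pool_rubin(estimates, variances)
for name, value in pooled.items():
    print(f"{name}: {value:.4f}")
```

The pooled standard error is larger than the standard error from any single imputation, because the between-imputation variance carries the extra uncertainty due to the missing values.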
43
[Table: model results compared across the pooled multiple-imputation estimates, the data with no missing values, and casewise exclusion]
44
Conclusions
- When you have missing data, think about WHY they are missing
- Missing data handled improperly can bias your conclusions
- Multiple imputation is one good way of handling missing data
- Caveat: multiple imputation is complex, so do some reading before you do it