Missing data: Why you should care about it and what to do about it

Lecture overview
Why care about missing data?
Diagnosing different types of missing data
Bad methods of handling missing data
Better methods of handling missing data

General info about missing data: Acock (2005)
Theory behind multiple imputation: Rubin (1996); Schafer (1999)
How to do multiple imputation using the MICE package in R: Van Buuren and Groothuis-Oudshoorn (2011)
Empirical examples: Van Buuren, Boshuizen, and Knook (1999); Sundell et al. (2008); Devine et al. (2012)

Why care about missing data

Not all missing data are the same
Missing by design: values are missing by definition of the population of interest
Missing completely at random (MCAR): missing values are randomly distributed across all observations
Missing at random (MAR): after accounting for one or more other observed variables, missing values are randomly distributed
Non-ignorable (NI): whether a value is missing depends on the unobserved value itself
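
The three random mechanisms can be illustrated with a small simulation. This is only a sketch on made-up data: the variable names, the coefficients, and the cut-offs (70% women, income below 5000) are assumptions chosen for illustration, not values from the lecture.

```python
import random
import statistics

random.seed(1)

# Made-up "professions": income falls as the percentage of women rises.
rows = []
for _ in range(2000):
    women = random.uniform(0, 100)
    income = 8000 - 40 * women + random.gauss(0, 1500)
    rows.append({"women": women, "income": income})

def delete(rows, should_drop):
    """Copy rows, setting income to None wherever should_drop(row) is true."""
    return [dict(r, income=None) if should_drop(r) else dict(r) for r in rows]

# MCAR: every income has the same 30% chance of being missing.
mcar = delete(rows, lambda r: random.random() < 0.3)
# MAR: missingness depends only on an observed variable (% women).
mar = delete(rows, lambda r: r["women"] > 70 and random.random() < 0.9)
# NI: missingness depends on the income value itself.
ni = delete(rows, lambda r: r["income"] < 5000)

def observed_mean(data):
    """Mean income over the cases where income is observed."""
    return statistics.mean(r["income"] for r in data if r["income"] is not None)

full_mean = observed_mean(rows)
mcar_mean = observed_mean(mcar)
mar_mean = observed_mean(mar)
ni_mean = observed_mean(ni)

for label, value in [("full", full_mean), ("MCAR", mcar_mean),
                     ("MAR", mar_mean), ("NI", ni_mean)]:
    print(f"{label:5s} mean observed income: {value:8.1f}")
```

In this setup the MCAR mean stays close to the full-data mean, while the MAR and NI means drift upward because the low-income cases are the ones that go missing.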

Income randomly missing (MCAR)
[Figure panels: Full data, MCAR]

Income missing for professions with a high percentage of women (MAR)
[Figure panels: Full data, MAR]

Income missing for low-income professions (NI)
[Figure panels: Full data, NI]

Why care about missing data?
Missing-by-design data are not a problem
MCAR data inflate the standard errors of your parameter estimates
MAR or NI data bias BOTH parameter estimates and standard errors in unpredictable ways

How much missing data is too much?
Hard to say
Small amounts of missing data can sometimes greatly affect analysis if missing values are extreme
Missing data are particularly problematic when the data are MAR or NI

Missing data diagnostics
Goals:
  Make a reasonable guess about the type of missing data you have
  Find variables that predict missingness or that predict the observed values of variables with missing data
Diagnostic options:
  Statistical
    Missingness patterns
    Pairwise complete correlations between your available variables
    Correlations between response indicators for your variables with missing values and other variables
  Graphical
    Margin plots
    Other options in the VIM package

Finding patterns of missingness
rr matrix: the number of observations for which both the row and column variables were observed
rm matrix: the number of observations for which the row variable was observed, but the column variable was not
mr matrix: the number of observations for which the column variable was observed, but the row variable was not
mm matrix: the number of observations for which neither the row nor column variables were observed
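
To make the four matrices concrete, here is a minimal sketch that computes the same counts by hand. The toy data and the helper name pair_counts are hypothetical; this does not call any missing-data package, it only follows the definitions above.

```python
from itertools import product

# Made-up toy data; None marks a missing value.
rows = [
    {"income": 10, "women": 5, "prestige": 60},
    {"income": None, "women": 80, "prestige": 40},
    {"income": 12, "women": None, "prestige": 55},
    {"income": None, "women": 90, "prestige": None},
]
variables = ["income", "women", "prestige"]

def pair_counts(rows, variables):
    """Count observed/missing combinations for every (row var, column var) pair."""
    counts = {name: {} for name in ("rr", "rm", "mr", "mm")}
    for row_var, col_var in product(variables, repeat=2):
        tally = {"rr": 0, "rm": 0, "mr": 0, "mm": 0}
        for case in rows:
            row_obs = case[row_var] is not None
            col_obs = case[col_var] is not None
            if row_obs and col_obs:
                tally["rr"] += 1      # both observed
            elif row_obs:
                tally["rm"] += 1      # row observed, column missing
            elif col_obs:
                tally["mr"] += 1      # row missing, column observed
            else:
                tally["mm"] += 1      # both missing
        for name in counts:
            counts[name][(row_var, col_var)] = tally[name]
    return counts

counts = pair_counts(rows, variables)
for name in ("rr", "rm", "mr", "mm"):
    print(name, "income x women:", counts[name][("income", "women")])
```

For any pair of variables the four counts add up to the number of cases, which is a quick sanity check on the output.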

Finding patterns of missingness
How to read the missingness pattern table:
1: not missing
0: missing
Each pattern is annotated with the # of cases fitting this missingness pattern and the # of variables with missing values following this pattern
Each column is annotated with the # of cases with missing values on the column variable
With this simple pattern of missingness, all available variables potentially have information about why some values of income are missing
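
A missingness pattern table of this kind can be tallied in a few lines. The sketch below uses made-up data in which only income is ever missing, mirroring the simple pattern described above; the names are hypothetical.

```python
from collections import Counter

# Made-up toy data; None marks a missing value. Only income is ever missing.
rows = [
    {"income": 10, "women": 5, "prestige": 60},
    {"income": 14, "women": 20, "prestige": 70},
    {"income": None, "women": 80, "prestige": 40},
    {"income": None, "women": 90, "prestige": 35},
    {"income": None, "women": 75, "prestige": 45},
    {"income": 12, "women": 10, "prestige": 55},
]
variables = ["income", "women", "prestige"]

def pattern(case):
    """Return a tuple with 1 for each observed variable and 0 for each missing one."""
    return tuple(1 if case[v] is not None else 0 for v in variables)

# Number of cases fitting each missingness pattern.
pattern_counts = Counter(pattern(case) for case in rows)

# Number of cases with a missing value on each variable.
missing_per_variable = {v: sum(case[v] is None for case in rows) for v in variables}

print("variables:", variables)
for pat, n_cases in pattern_counts.most_common():
    print(n_cases, "cases", pat, "variables missing:", pat.count(0))
print("missing per variable:", missing_per_variable)
```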

Correlations
Predicting available cases: pairwise complete correlations
  Prestige is a strong predictor of observed income
Predicting missingness: correlations with response indicators
  % women is a strong predictor of missingness
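
Both kinds of correlation can be sketched on simulated data. Everything here is an assumption made for illustration: the data-generating coefficients, the 70% cut-off, and the hand-written pearson helper. A response indicator is simply 1 where income was observed and 0 where it is missing.

```python
import random

random.seed(3)

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Made-up data: income depends on prestige and % women;
# income tends to go missing when % women is high.
women, prestige, income = [], [], []
for _ in range(2000):
    w = random.uniform(0, 100)
    p = random.gauss(50, 15)
    y = 2000 + 80 * p - 20 * w + random.gauss(0, 800)
    if w > 70 and random.random() < 0.9:
        y = None
    women.append(w)
    prestige.append(p)
    income.append(y)

# Predicting missingness: correlate the response indicator with other variables.
income_observed = [1 if y is not None else 0 for y in income]
r_indicator_women = pearson(income_observed, women)
r_indicator_prestige = pearson(income_observed, prestige)

# Predicting available cases: pairwise complete correlation with income.
complete = [(p, y) for p, y in zip(prestige, income) if y is not None]
r_income_prestige = pearson([p for p, _ in complete], [y for _, y in complete])

print("indicator vs women:   ", round(r_indicator_women, 2))
print("indicator vs prestige:", round(r_indicator_prestige, 2))
print("income vs prestige (complete pairs):", round(r_income_prestige, 2))
```

A strongly negative correlation between the indicator and % women says that income is less likely to be observed where % women is high, which is the kind of signal that makes a variable worth including in the imputation model.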

Margin plots

Bad methods: Casewise exclusion
Eliminate each case that has a missing value
The implicit “standard” method of dealing with missing data
Can be a (somewhat) acceptable method when data are MCAR

How reasonable is it to assume that missing data in the social sciences is MCAR?

Bad methods: Casewise exclusion
When data are MAR or NI, casewise exclusion produces unpredictable bias in both standard errors and parameter estimates

Bad methods: Mean substitution
Substitute the mean of the variable for the missing values
Leads to systematic bias in standard errors and, when data are MAR or NI, in parameter estimates
NEVER a good method of handling missing data
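
A short sketch shows why mean substitution distorts standard errors even in the friendliest case (MCAR). The data are simulated and the 40% missingness rate is an arbitrary assumption; the point is only that filling in the mean adds cases without adding spread.

```python
import random
import statistics

random.seed(11)

# Made-up variable with 40% of values deleted completely at random.
full = [random.gauss(50, 10) for _ in range(500)]
with_missing = [x if random.random() > 0.4 else None for x in full]

observed = [x for x in with_missing if x is not None]
fill_value = statistics.mean(observed)

# Mean substitution: every missing value becomes the observed mean.
filled = [x if x is not None else fill_value for x in with_missing]

observed_sd = statistics.stdev(observed)
filled_sd = statistics.stdev(filled)

# Standard error of the mean, computed naively from each version of the data.
observed_se = observed_sd / len(observed) ** 0.5
filled_se = filled_sd / len(filled) ** 0.5

print("SD of observed values:     ", round(observed_sd, 2))
print("SD after mean substitution:", round(filled_sd, 2))
print("SE from observed values:   ", round(observed_se, 3))
print("SE after mean substitution:", round(filled_se, 3))
```

The mean itself is unchanged, but the standard deviation shrinks and the sample size is inflated, so the standard error comes out too small.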

Better methods of handling missing data
Full information maximum likelihood (FIML) methods
  Handle data that are MAR; with NI data they can reduce bias but do not remove it unless the missingness mechanism is modeled
  Implemented as part of particular statistical models
  Missing data handled during analysis
Multiple imputation
  Handles MAR and NI data under the same conditions as FIML
  Simulation-based approach
  Missing data are handled separately from analysis

Multiple imputation
Generate multiple completed datasets (imputations) through simulation (only 5 to 10 are needed)
Perform analyses on each imputation
Combine the multiple analyses using a set of special rules (Rubin’s (1987) rules)
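
The three steps can be sketched end to end on simulated data. This is a deliberately simplified illustration, not the MICE algorithm: it imputes from a single fitted regression plus random noise and ignores uncertainty in the regression coefficients, which a proper imputation method would also simulate. The data, coefficients, and function names are all assumptions.

```python
import random
import statistics

random.seed(7)

# Made-up data: income depends on % women and is MAR given % women.
data = []
for _ in range(1000):
    women = random.uniform(0, 100)
    income = 8000 - 40 * women + random.gauss(0, 1500)
    if women > 70 and random.random() < 0.9:
        income = None
    data.append((women, income))

def fit_line(pairs):
    """Least-squares fit of y on x; returns intercept, slope, residual SD."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    resid_sd = (sum(e * e for e in residuals) / (len(pairs) - 2)) ** 0.5
    return intercept, slope, resid_sd

observed = [(w, y) for w, y in data if y is not None]
intercept, slope, resid_sd = fit_line(observed)

def impute_once():
    """Step 1: fill each missing income with a prediction plus random noise."""
    return [(w, y if y is not None
             else intercept + slope * w + random.gauss(0, resid_sd))
            for w, y in data]

m = 5
imputations = [impute_once() for _ in range(m)]

# Step 2: run the analysis (here, just the mean income) on each imputation.
estimates = [statistics.mean(y for _, y in imp) for imp in imputations]

# Step 3 (point estimate only): average the m estimates.
pooled_mean = statistics.mean(estimates)
casewise_mean = statistics.mean(y for _, y in observed)

print("estimates per imputation:", [round(e) for e in estimates])
print("pooled mean:  ", round(pooled_mean))
print("casewise mean:", round(casewise_mean))
```

The data were generated with a true mean income of about 6000; in this setup the pooled estimate lands near that value while the casewise mean is noticeably too high.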

Generating imputations
Imputations are generated through maximum-likelihood based Markov-chain Monte Carlo (MCMC)
Exact details of how imputations are generated vary from method to method
The quality of the simulations depends on how well the analyst can explain observed values and missingness of imputed variables

Which variables do you use to generate your simulations?
In general, the more variables the better (to a point; multicollinear variables can crash the simulation)
Always use variables that will be involved in your final analysis, including interaction terms and contrasts
Use variables that will not be included in the analysis, but that are good predictors of observed values of imputed variables
Include variables that are good predictors of missingness
Only use predictors that are themselves mostly observed in the cases where the imputed variable is missing

Convergence
For each imputation, the missing values of each variable are iteratively estimated
In each iteration, the means and standard deviations of those missing values are slightly different
Iteration continues until the means and standard deviations of the imputed values across the imputed datasets start to cluster (“converge”)

Imputation procedure

Checking for convergence

Two examples of non-convergence
[Figure panels: Means, SDs (first example); Means, SDs (second example)]

The estimated means for gen and phb start high in the first few iterations, then converge toward lower values
[Figure panels: Means, SDs]

Checking that the imputed values are reasonable

Perform analyses as usual on each simulated dataset
The intercepts and slopes of the linear model vary slightly across the simulated datasets

The overall estimate of your parameter (Q-bar) is its mean across the m imputations
The within-imputation variance (U-bar) of the Q parameter is the mean of the variances across the m imputations
The between-imputation variance (B) of the Q parameter is the variance of Q across the m imputations
The total variance of Q is a function of U-bar and B: T = U-bar + (1 + 1/m) B. The square root of this total variance is the standard error used for test statistics
The degrees of freedom (v) are adjusted for the amount of information lost to missing data
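
The pooling rules can be written out as a short function. This is a sketch of Rubin's rules for a single scalar parameter; the function name pool_scalar and the example numbers are made up, and the degrees of freedom use Rubin's original large-sample formula, v = (m - 1)(1 + 1/r)^2 with r = (1 + 1/m) B / U-bar, without any small-sample adjustment.

```python
import math
import statistics

def pool_scalar(estimates, variances):
    """Pool m estimates of one parameter and their squared standard errors."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)           # overall estimate
    u_bar = statistics.mean(variances)           # within-imputation variance
    b = statistics.variance(estimates)           # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b              # total variance
    if b == 0:
        df = math.inf                            # imputations agree exactly
    else:
        r = (1 + 1 / m) * b / u_bar              # relative increase in variance
        df = (m - 1) * (1 + 1 / r) ** 2
    return {"q_bar": q_bar, "u_bar": u_bar, "b": b,
            "total": total, "se": math.sqrt(total), "df": df}

# Made-up example: five imputations, each with estimate Q and variance U = SE^2.
estimates = [10.0, 12.0, 11.0, 13.0, 9.0]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]

pooled = pool_scalar(estimates, variances)
for name, value in pooled.items():
    print(f"{name:6s} {value:.3f}")
```

With these numbers U-bar is 4 and B is 2.5, so the total variance is 4 + 1.2 x 2.5 = 7: the disagreement between imputations adds to the uncertainty that any single imputed dataset would report.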

[Output panels: Pooled results, No missing data, Casewise exclusion]

Conclusions
When you have missing data, think about WHY they are missing
Missing data handled improperly can bias your conclusions
Multiple imputation is one good way of handling missing data
Caveat: Multiple imputation is complex, so do some reading before you do it
