Missing data: Why you should care about it and what to do about it


1 Missing data: Why you should care about it and what to do about it

2 Lecture overview
Why care about missing data?
Diagnosing different types of missing data
Bad methods of handling missing data
Better methods of handling missing data

3 General info about missing data: Acock (2005)
Theory behind multiple imputation: Rubin (1996); Schafer (1999)
How to do multiple imputation using the MICE package in R: Van Buuren and Groothuis-Oudshoorn (2011)
Empirical examples: Van Buuren, Boshuizen, and Knook (1999); Sundell et al. (2008); Devine et al. (2012)

4 Why care about missing data
Why care about missing data?
Diagnosing different types of missing data
Bad methods of handling missing data
Better methods of handling missing data

5 Not all missing data are the same
Missing by design
  Values are missing by definition of the population of interest
Missing completely at random (MCAR)
  Missing values are randomly distributed across cases
Missing at random (MAR)
  After accounting for one or more other observed variables, missing values are randomly distributed
Non-ignorable (NI)
  Whether a value is missing depends on the value itself
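The three mechanisms can be made concrete with a small simulation. This is a minimal sketch on invented data, not the dataset used in the slides: the variable names, the linear relation between percent women and income, and the missingness probabilities are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical occupations: (percent women, income).
# Assumption: income falls as the percentage of women rises.
full = []
for _ in range(2000):
    women = rng.uniform(0, 100)
    full.append((women, 10000 - 60 * women + rng.gauss(0, 1000)))

def drop_income(rows, mechanism, rng):
    """Return a copy of rows with some incomes replaced by None."""
    out = []
    for women, income in rows:
        if mechanism == "MCAR":
            p = 0.3                               # same chance for every case
        elif mechanism == "MAR":
            p = 0.6 if women > 50 else 0.05       # depends on an observed variable
        elif mechanism == "NI":
            p = 0.6 if income < 5000 else 0.05    # depends on the value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((women, None if rng.random() < p else income))
    return out

def observed_mean(rows):
    """Mean income over the cases where income is observed."""
    return statistics.fmean([y for _, y in rows if y is not None])

full_mean = statistics.fmean([y for _, y in full])
means = {mech: observed_mean(drop_income(full, mech, rng))
         for mech in ("MCAR", "MAR", "NI")}
for mech, value in means.items():
    print(f"{mech}: observed mean {value:.0f} vs full-data mean {full_mean:.0f}")
```

Under these assumptions the observed mean stays close to the full-data mean for MCAR, but drifts upward for MAR and NI because lower incomes are removed more often.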

6

7 Income randomly missing (MCAR)

8 Full data MCAR

9 Income missing for professions with a high percentage of women (MAR)

10 Full data MAR

11 Income missing for low-income professions (NI)

12 Full data NI

13 Why care about missing data?
Data that are missing by design are not a problem
MCAR data inflate the standard errors of your parameter estimates
MAR or NI data bias BOTH parameter estimates and standard errors in unpredictable ways

14 How much missing data is too much?
Hard to say
Small amounts of missing data can sometimes greatly affect analysis if missing values are extreme
Missing data are particularly problematic when the data are MAR or NI

15 Why care about missing data
Why care about missing data?
Diagnosing different types of missing data
Bad methods of handling missing data
Better methods of handling missing data

16 Missing data diagnostics
Goals:
  Make a reasonable guess about the type of missing data you have
  Find variables that predict missingness or the observed values of variables with missing data
Diagnostic options:
  Statistical
    Missingness patterns
    Pairwise complete correlations between your available variables
    Correlations between response indicators for your variables with missing values and other variables
  Graphical
    Margin plots
    Other options in the VIM package

17 Finding patterns of missingness
rr matrix: The number of observations for which both the row and column variables were observed
rm matrix: The number of observations for which the row variable was observed, but the column variable was not
mr matrix: The number of observations for which the column variable was observed, but the row variable was not
mm matrix: The number of observations for which neither the row nor column variables were observed
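The four matrices defined above can be computed by counting, for every pair of variables, how many cases fall into each observed/missing combination. The sketch below uses a tiny invented dataset and a hypothetical helper named pair_counts; it illustrates the definitions and is not the code used for the slides.

```python
def pair_counts(rows, n_vars):
    """Count observed/missing combinations for every (row, column) variable pair.

    Keys follow the slide: the first letter is the row variable's status,
    the second is the column variable's status (r = observed, m = missing).
    """
    counts = {key: [[0] * n_vars for _ in range(n_vars)]
              for key in ("rr", "rm", "mr", "mm")}
    for row in rows:
        status = ["r" if value is not None else "m" for value in row]
        for i in range(n_vars):
            for j in range(n_vars):
                counts[status[i] + status[j]][i][j] += 1
    return counts

# Invented data: columns are (education, income); None marks a missing value.
names = ["education", "income"]
rows = [(12, 5000), (10, None), (None, 4000), (14, 7000), (None, None)]

counts = pair_counts(rows, len(names))
for key in ("rr", "rm", "mr", "mm"):
    print(key, counts[key])
```

Entry [0][1] of each matrix describes the pair (row = education, column = income); for example, rm[0][1] counts cases with education observed and income missing.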

18 Finding patterns of missingness
Annotations on the missingness pattern table:
1: not missing; 0: missing
Number of cases fitting each missingness pattern
Number of variables with missing values under each pattern
Number of cases with missing values on each column variable
With this simple pattern of missingness, all available variables potentially have information about why some values of income are missing
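A missingness pattern table of this kind can be tabulated by hand. The sketch below is a minimal illustration on invented data (two variables, five cases); it reproduces the three counts annotated on the slide but is not the output of any particular package.

```python
from collections import Counter

# Invented data: columns are (education, income); None marks a missing value.
names = ["education", "income"]
rows = [(12, 5000), (10, None), (None, 4000), (14, 7000), (None, None)]

# One pattern per case: 1 = not missing, 0 = missing, in column order.
patterns = Counter(tuple(int(value is not None) for value in row) for row in rows)

# Number of cases with a missing value on each column variable.
missing_per_column = {name: sum(row[i] is None for row in rows)
                      for i, name in enumerate(names)}

for pattern, n_cases in sorted(patterns.items(), reverse=True):
    n_missing_vars = pattern.count(0)   # variables missing under this pattern
    print(f"cases={n_cases} pattern={pattern} variables missing={n_missing_vars}")
print("missing per column:", missing_per_column)
```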

19 Correlations
Predicting available cases: pairwise complete correlations
Predicting missingness: correlations with response indicators
% women is a strong predictor of missingness; prestige is a strong predictor of observed income
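Both kinds of correlation can be sketched in a few lines. The data below are simulated under assumptions made up for this example (income depends on prestige and percent women; income is more often missing when percent women is high), so the numbers only illustrate the idea and are not the values shown on the slide.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(7)
women, prestige, income = [], [], []
for _ in range(1000):
    w = rng.uniform(0, 100)
    p = rng.uniform(20, 80)
    y = 2000 + 100 * p - 30 * w + rng.gauss(0, 1000)   # assumed relation
    is_missing = rng.random() < (0.6 if w > 50 else 0.05)
    women.append(w)
    prestige.append(p)
    income.append(None if is_missing else y)

# Predicting missingness: correlate the response indicator with other variables.
response = [0 if y is None else 1 for y in income]   # 1 = income observed
r_missingness = pearson(response, women)

# Predicting observed values: pairwise complete correlation with income.
complete = [(p, y) for p, y in zip(prestige, income) if y is not None]
r_observed = pearson([p for p, _ in complete], [y for _, y in complete])

print(f"response indicator vs % women: {r_missingness:.2f}")
print(f"prestige vs observed income:   {r_observed:.2f}")
```

A strongly negative first correlation says that income is observed less often where percent women is high, which makes percent women a useful predictor of missingness.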

20 Margin plots

21 Why care about missing data
Why care about missing data?
Diagnosing different types of missing data
Bad methods of handling missing data
Better methods of handling missing data

22 Bad methods: Casewise exclusion
Eliminate each case that has a missing value
The implicit “standard” method of dealing with missing data
Can be a (somewhat) acceptable method when data are MCAR

23 How reasonable is it to assume that missing data in the social sciences is MCAR?

24 Bad methods: Casewise exclusion
Eliminate each case that has a missing value
The implicit “standard” method of dealing with missing data
Can be a (somewhat) acceptable method when data are MCAR
When data are MAR or NI, it introduces unpredictable bias in standard errors and parameter estimates

25 Bad methods: Mean substitution
Substitute the mean of the variable for the missing values

26 Bad methods: Mean substitution
Substitute the mean of the variable for the missing values
Leads to systematic bias in standard errors (the filled-in variable has artificially low variance) and, when data are MAR or NI, in parameter estimates
NEVER a good method of handling missing data
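The variance problem with mean substitution is easy to see on a toy example. The income values below are invented; the sketch only shows that filling in the observed mean leaves the mean unchanged while shrinking the standard deviation and the naive standard error.

```python
import statistics

# Invented incomes; None marks a missing value.
income = [3000, 4500, None, 6000, None, 8000, 12000, None]

observed = [y for y in income if y is not None]
mean_observed = statistics.fmean(observed)

# Mean substitution: every missing value becomes the observed mean.
filled = [mean_observed if y is None else y for y in income]

sd_observed = statistics.stdev(observed)
sd_filled = statistics.stdev(filled)

# Naive standard error of the mean, treating filled-in values as real data.
se_observed = sd_observed / len(observed) ** 0.5
se_filled = sd_filled / len(filled) ** 0.5

print(f"mean: {mean_observed:.0f} (observed) vs {statistics.fmean(filled):.0f} (filled)")
print(f"SD:   {sd_observed:.0f} (observed) vs {sd_filled:.0f} (filled)")
print(f"SE:   {se_observed:.0f} (observed) vs {se_filled:.0f} (filled)")
```

The filled-in values add cases without adding any spread, so the standard error looks smaller than the data can justify.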

27 Why care about missing data
Why care about missing data?
Diagnosing different types of missing data
Bad methods of handling missing data
Better methods of handling missing data

28 Better methods of handling missing data
Full information maximum likelihood (FIML) methods
  Can handle data that are MAR and, if the missingness mechanism is modeled, NI
  Implemented as part of particular statistical models
  Missing data are handled during analysis
Multiple imputation
  Can also handle data that are MAR and, if the missingness mechanism is modeled, NI
  Simulation-based approach
  Missing data are handled separately from analysis

29 Multiple imputation
Generate multiple completed datasets (imputations) through simulation (typically only 5 to 10 are needed)
Perform analyses on each imputation
Combine the multiple analyses using a set of special rules (Rubin’s (1987) rules)

30

31 Multiple imputation
Generate multiple completed datasets (imputations) through simulation (typically only 5 to 10 are needed)
Perform analyses on each imputation
Combine the multiple analyses using a set of special rules (Rubin’s (1987) rules)

32 Generating imputations
Imputations are generated through maximum-likelihood based Markov-chain Monte Carlo (MCMC)
Exact details of how imputations are generated vary from method to method
The quality of the simulations depends on how well the analyst can explain observed values and missingness of imputed variables
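To give a feel for what "generating imputations" means, here is a deliberately simplified sketch. It is not the MCMC procedure described above and not what any particular package does: it swaps in stochastic regression imputation with a bootstrap refit, on simulated data, with all names (fit_line, impute_once) and parameters invented for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical data: income depends linearly on percent women, plus noise.
full = []
for _ in range(2000):
    women = rng.uniform(0, 100)
    full.append((women, 10000 - 60 * women + rng.gauss(0, 1000)))

# MAR: income is missing more often in occupations with many women.
data = [(w, None if rng.random() < (0.6 if w > 50 else 0.05) else y)
        for w, y in full]

def fit_line(pairs):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in pairs]
    sd = (sum(r * r for r in resid) / (len(pairs) - 2)) ** 0.5
    return intercept, slope, sd

def impute_once(rows, rng):
    """Return one completed dataset with missing incomes filled by random draws."""
    observed = [(x, y) for x, y in rows if y is not None]
    # Refit on a bootstrap resample so that uncertainty about the regression
    # line carries over into the imputed values.
    boot = [rng.choice(observed) for _ in observed]
    a, b, sd = fit_line(boot)
    return [(x, y if y is not None else a + b * x + rng.gauss(0, sd))
            for x, y in rows]

m = 5
imputations = [impute_once(data, rng) for _ in range(m)]

full_mean = statistics.fmean([y for _, y in full])
cc_mean = statistics.fmean([y for _, y in data if y is not None])
mi_means = [statistics.fmean([y for _, y in imp]) for imp in imputations]

print(f"full data: {full_mean:.0f}  casewise exclusion: {cc_mean:.0f}")
print("imputed datasets:", [round(value) for value in mi_means])
```

Each imputed dataset gives a slightly different mean, and that spread is what carries the uncertainty due to missing data into the final analysis.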

33 Which variables do you use to generate your simulations?
In general, the more variables the better (to a point; multicollinear variables can crash the simulation)
Always use variables that will be involved in your final analysis, including interaction terms and contrasts
Use variables that will not be included in the analysis, but that are good predictors of observed values of imputed variables
Include variables that are good predictors of missingness
Only use predictors that are themselves mostly observed in the cases where the imputed variables are missing

34 Convergence
For each imputation, the missing values of each variable are iteratively estimated
In each iteration, the means and standard deviations of those missing values are slightly different
Iteration continues until the means and standard deviations of the imputed values across the imputed datasets start to cluster (“converge”)

35 Imputation procedure Checking for convergence

36 Two examples of non-convergence
[Figure: two examples, each with one panel for the means and one panel for the SDs of the imputed values]

37 The estimated means for gen and phb start high in the first few iterations, then converge toward lower values
[Figure panels: Means, SDs]

38 Checking that the imputed values are reasonable

39 Multiple imputation
Generate multiple completed datasets (imputations) through simulation (typically only 5 to 10 are needed)
Perform analyses on each imputation
Combine the multiple analyses using a set of special rules (Rubin’s (1987) rules)

40 Perform analyses as usual on each simulated dataset
The intercepts and slopes of the linear model vary slightly across the simulated datasets

41 Multiple imputation
Generate multiple completed datasets (imputations) through simulation (typically only 5 to 10 are needed)
Perform analyses on each imputation
Combine the multiple analyses using a set of special rules (Rubin’s (1987) rules)

42 The overall estimate of your parameter (Q-bar) is its mean across the m imputations
The within-imputation variance (U-bar) of the Q parameter is the mean of the variances across the m imputations
The between-imputation variance (B) of the Q parameter is the variance of Q across the m imputations
The total variance of Q is a function of U-bar and B: T = U-bar + (1 + 1/m) B. This total variance is used to calculate the standard error used for test statistics
The degrees of freedom (v) are adjusted for the amount of information lost to missing data
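The pooling rules can be written out directly. The sketch below follows Rubin's (1987) formulas for a single scalar parameter: Q-bar is the mean of the estimates, U-bar the mean of their squared standard errors, B the sample variance of the estimates (divisor m - 1), T = U-bar + (1 + 1/m) B, and v = (m - 1) (1 + U-bar / ((1 + 1/m) B))^2. The function name pool_rubin and the input numbers are invented for illustration; statistical software may apply additional small-sample adjustments to the degrees of freedom.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool m estimates of one parameter and their squared standard errors."""
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two estimates and one variance per estimate")
    q_bar = statistics.fmean(estimates)        # overall estimate
    u_bar = statistics.fmean(variances)        # within-imputation variance
    b = statistics.variance(estimates)         # between-imputation variance
    total = u_bar + (1 + 1 / m) * b            # total variance
    if b == 0:
        df = math.inf                          # no between-imputation spread
    else:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return {"estimate": q_bar, "within": u_bar, "between": b,
            "total": total, "se": math.sqrt(total), "df": df}

# Invented slope estimates and squared standard errors from m = 3 imputations.
pooled = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled)
```

The pooled standard error is larger than the within-imputation standard error alone because it also counts the disagreement between imputations.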

43 Results compared: pooled results, no missing data, casewise exclusion

44 Conclusions
When you have missing data, think about WHY they are missing
Missing data handled improperly can bias your conclusions
Multiple imputation is one good way of handling missing data
Caveat: Multiple imputation is complex, so do some reading before you do it

