Best Practices for Handling Missing Data
David R. Johnson, Professor of Sociology, Demography and Family Studies, Pennsylvania State University
Outline
What are missing data and why do we need to do something about them?
Classic approaches and their problems
Modern approaches: Multiple Imputation (MI) and Maximum Likelihood (ML or FIML)
When to use modern approaches
Focus on decisions in multiple imputation
What are Missing Data?
Refusals
Don’t know (can be treated either as a missing value or as an actual value)
Not applicable (may or may not be missing data)
No response (common in self-report questionnaires)
Bad data (when the value is clearly wrong and you don’t know what the correct value would be)
Planned missing or missing by design (e.g., questions missing in some years of the GSS)
When the partner does not respond
How common are missing data?
In surveys there are almost always some missing values.
In official statistics (states, countries, etc.) some values are either not collected or not available for some reason.
Missing values are quite common in data collected using contingency questions or skip patterns (e.g., occupational prestige is only answered if the respondent works).
They may also occur when questions were not asked in some years of a multiyear survey, or were asked only of a random sample of respondents (a planned missing pattern).
Why should we be concerned about missing data?
In multivariate analysis, such as regression, all variables in the model must have values; the methods can’t handle incomplete matrices.
This is called analysis of complete cases: any case with even one missing value is excluded from the analysis.
Even a small amount of missing data on each variable can result in the loss of many cases.
When a case is lost you lose all the information on that case, and your standard errors go up.
Using all available information requires some way of combining both complete and incomplete cases.
How do we handle missing data in regression analysis?
Most regression models require a complete data matrix: each cell of every row and column must hold a valid, nonmissing value.
This is called “complete case analysis.”
With a large number of independent variables, the proportion of the sample with complete data may be quite small even if the proportion missing on each variable is not large.
If missingness on one variable is independent of missingness on another, then the proportion of cases with no missing data = (1 − m_x1)(1 − m_x2)…(1 − m_xk), where m_xi is the proportion of cases missing on variable x_i.
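As a quick check of that formula, a minimal Python sketch (the per-variable missingness rates here are hypothetical, and independence across variables is assumed):

```python
import numpy as np

# Hypothetical proportion missing on each of five predictors.
missing_rates = np.array([0.05, 0.10, 0.08, 0.03, 0.12])

# Under independent missingness, the expected share of complete cases
# is the product of the per-variable non-missing proportions.
complete_fraction = np.prod(1 - missing_rates)
print(f"Expected complete cases: {complete_fraction:.1%}")  # about 67%
```

Even though no variable is missing more than 12% of its values, about a third of the sample would be lost to casewise deletion.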
Many strategies for handling missing data in regression
Complete case analysis (listwise or casewise deletion)
Pairwise deletion
Mean substitution
Regression substitution or regression imputation
Hot (and cold) deck methods
Expectation-Maximization (EM) methods
Full Information Maximum Likelihood (FIML)
Multiple imputation methods
Plus a number of other lesser-known (and less used) methods
Complete Case Analysis (casewise or listwise deletion)
May yield a small sample size.
The remaining sample of cases with complete data may no longer be representative of the total sample or the population.
May reduce statistical power: you are throwing away the partial information you have on incomplete cases.
Probably the most common solution in the literature up until the last several years.
There are still situations in which it might be an appropriate method (Allison suggests that when you lose only a small proportion of cases, 5% or so, casewise deletion is OK).
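For reference, listwise deletion is what most regression software does by default; a minimal pandas sketch (the DataFrame and its columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical survey extract with scattered missing values.
df = pd.DataFrame({
    "income":    [52_000, np.nan, 61_000, 48_000, np.nan],
    "education": [12, 16, np.nan, 14, 12],
    "age":       [34, 51, 45, np.nan, 29],
})

# Listwise (casewise) deletion: keep only rows with no missing values.
complete = df.dropna()
print(f"{len(complete)} of {len(df)} cases retained")  # 1 of 5
```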
Pairwise Deletion (obsolete method)
Incomplete data are analyzed by computing a correlation or covariance matrix based on the pairs of cases that are complete on each pair of variables, and analyzing this matrix rather than the raw data.
Works for OLS regression and is an option in some computer programs (e.g., SPSS).
The main problem is that each correlation may be based on a very different set of cases.
Can produce a covariance/correlation matrix that does not meet the assumptions of a proper matrix (not Gramian, i.e., not positive semidefinite).
Can result in very biased and poorly estimated coefficients.
Nowadays it is seldom used in the literature.
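To see how each correlation can rest on a different subsample, here is a small pandas illustration (hypothetical data; pandas computes pairwise-complete correlations by default):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income":    [52_000, np.nan, 61_000, 48_000, 55_000, np.nan],
    "education": [12, 16, np.nan, 14, 18, 12],
    "age":       [34, 51, 45, 38, 29, 40],
})

# Each cell of the matrix is computed from the pairwise-complete cases,
# which is why the result may not be a proper (Gramian) matrix.
print(df.corr())

# The effective n behind each pair differs:
for a, b in [("income", "education"), ("income", "age")]:
    n = len(df[[a, b]].dropna())
    print(a, b, "n =", n)   # n = 3 for the first pair, 4 for the second
```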
Mean Substitution
Fill in the missing value with the mean of that variable.
Because the regression line always goes through the means of the variables, this seems like a good answer: it will not pull the regression line one way or the other.
However, placing values exactly on the regression line leads to no error for these cases, which artificially inflates the explained variance.
The procedure also biases the estimated standard deviations downward, because the imputed cases all sit at the mean and contribute no variance.
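A quick demonstration of that variance shrinkage (hypothetical values):

```python
import numpy as np
import pandas as pd

x = pd.Series([10, 12, np.nan, 15, np.nan, 9, 14])
x_filled = x.fillna(x.mean())

print("mean unchanged:", x.mean(), x_filled.mean())                 # 12.0  12.0
print("sd shrinks:", round(x.std(), 2), round(x_filled.std(), 2))   # 2.55  2.08
```

The mean is preserved by construction, but the standard deviation is biased downward because the imputed cases add no spread.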
Mean Substitution with a Missing Data Indicator (dummy variable)
Occasionally researchers will use mean substitution along with a dummy variable for each variable indicating whether or not that observation was missing on that variable; this is often seen with income in published papers.
Allison has demonstrated, both statistically and with simulation data, that including a dummy variable along with mean substitution will likely lead to biased estimates.
Some articles indicate that mean substitution is not likely to yield biased estimates, particularly when the amount of missing data is quite low. However, I would not recommend it.
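For clarity, here is what the construction looks like in pandas (hypothetical income values; shown to illustrate the pattern, not to endorse it, given Allison’s results):

```python
import numpy as np
import pandas as pd

income = pd.Series([52_000, np.nan, 61_000, np.nan, 48_000])

income_missing = income.isna().astype(int)    # dummy: 1 if the value was missing
income_filled = income.fillna(income.mean())  # mean substitution
```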
Regression Imputation (Regression substitution)
These methods may vary from program to program, but are basically as follows:
Estimate a regression model from the complete cases in which the variable X is the dependent variable; all other variables in the model should be included as independent variables.
Get a predicted value for each case with a missing observation and substitute that predicted value for the missing value.
Better than mean substitution, because the imputed data now have some variance on a given variable (different respondents get different values).
Still biased, as the missing values are perfectly predicted by a linear combination of the other independent variables. Tends to inflate R-square and bias standard errors.
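A minimal sketch of deterministic regression imputation using scikit-learn (the variable names and data are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Impute missing income from education and age.
df = pd.DataFrame({
    "income":    [52.0, np.nan, 61.0, 48.0, np.nan, 57.0],
    "education": [12, 16, 18, 12, 14, 16],
    "age":       [34, 51, 45, 29, 40, 48],
})

obs = df["income"].notna()
model = LinearRegression().fit(df.loc[obs, ["education", "age"]],
                               df.loc[obs, "income"])

# Substitute the predicted value for each missing observation;
# every imputed case lies exactly on the regression plane.
df.loc[~obs, "income"] = model.predict(df.loc[~obs, ["education", "age"]])
```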
Regression Imputation (regression substitution with a stochastic error term)
A form of regression imputation (stochastic regression imputation) adds a random error component to each predicted value.
This reduces the problem of all imputed cases lying on the regression line and inflating the explained variance.
Similar to the more statistically complex methods used in some multiple imputation approaches.
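The same sketch with a stochastic error term added (again hypothetical data; the residual standard deviation from the complete cases supplies the error scale):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=1)

df = pd.DataFrame({
    "income":    [52.0, np.nan, 61.0, 48.0, np.nan, 57.0],
    "education": [12, 16, 18, 12, 14, 16],
    "age":       [34, 51, 45, 29, 40, 48],
})

obs = df["income"].notna()
X_obs, y_obs = df.loc[obs, ["education", "age"]], df.loc[obs, "income"]
model = LinearRegression().fit(X_obs, y_obs)

# Residual standard deviation estimated from the complete cases.
resid_sd = (y_obs - model.predict(X_obs)).std()

# Prediction plus a random error draw, so imputed cases no longer
# sit exactly on the regression line.
preds = model.predict(df.loc[~obs, ["education", "age"]])
df.loc[~obs, "income"] = preds + rng.normal(0, resid_sd, size=len(preds))
```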
Cold and Hot Deck Imputation
A method used by the Census Bureau since the 1940s to impute missing values in census data. (The Census uses the hot version.)
Defining a “deck”: a big N-way crosstab of (usually) 3 to 5 demographic variables (e.g., age, gender, education, employment, marital status), each with a small number of categories (<5).
Each cell of the crosstab holds a value for another variable (e.g., income).
When a record is missing on that variable (e.g., income), it is assigned the value in the cell that corresponds to its demographic characteristics.
If the value in the cell stays the same, it is called a “cold deck”; if it changes as more records are processed, it is a “hot deck.”
Hot Deck Imputation
Create an n1 × n2 × n3 × … table stored in the computer, using basic demographic characteristics of respondents (e.g., age in 4 categories, gender (2), marital status (5), education (4), yielding 160 cells).
Fill in the hot deck matrix for a given variable (say, employment status) by going through the records sequentially, filling in an observed value of employment status for each cell based on a person with that set of characteristics. The values in the deck keep changing as more records are read.
If a person has a missing value on employment, they get the current value in the matrix based on their demographic characteristics.
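A toy sketch of a sequential hot deck in Python (the demographic cells and variable names are hypothetical, and real implementations are considerably more elaborate):

```python
import numpy as np
import pandas as pd

# Hypothetical records; employment status is missing for some people.
df = pd.DataFrame({
    "age_cat":  [1, 1, 2, 1, 2, 2],
    "gender":   [0, 1, 0, 0, 0, 1],
    "employed": [1, 0, 1, np.nan, np.nan, 0],
})

deck = {}  # current value held in each demographic cell
for i, row in df.iterrows():
    key = (row["age_cat"], row["gender"])
    if pd.isna(row["employed"]):
        # Missing: assign the current value in the matching cell, if any.
        if key in deck:
            df.at[i, "employed"] = deck[key]
    else:
        # Observed: the record updates the deck, which is what makes it "hot".
        deck[key] = row["employed"]
```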
Limitations of Hot Deck
Requires a very large sample to work adequately.
The deck can use only a small number (3–5) of characteristics, with a limited number of categories in each.
Can produce biased estimates.
Does not take into account the uncertainty introduced by the imputation.
Expectation-Maximization (EM) Method
An iterative method for filling in missing values based on the MAR assumption.
Implemented in the SPSS MVA module for single imputations, and in SAS MI for generating the starting point for multiple imputations.
This is a “model-based” imputation method. The most common model is the normal model, which assumes that the variables are quantitative (continuous) and that their distribution is multivariate normal.
Although quantitative variables are required, it can be used with categorical variables if they are transformed into a set of dummy variables.
Estimates have been found to be quite robust to violations of the multivariate normality assumption.
Example of Observed and Imputed Values for the Normal Model
[Figure: scatterplot contrasting observed and imputed values of “Importance of having children”; the imputed values are continuous and follow a multivariate normal distribution.]
How EM works (simplified)
Uses regression (or some other method) to develop a plausible initial value for a missing value on variable x, taking into account another set of variables. For example, use the predicted value of x from its regression on all other variables in the model, plus an error term. This is the E (expectation) step.
With these initial estimates substituted into the data matrix, it re-computes the regression (or other method) and develops a new plausible value. This is the M (maximization) step.
It continues alternating these two steps until the estimated missing values stop changing (within a certain tolerance limit).
Depending on the algorithm used, these estimates can be unbiased under the MAR assumption and the statistical assumptions of the “model.”
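A toy illustration of the iterate-until-convergence idea (this fills in values with a regression at each pass, mirroring the simplified description above; a full EM implementation works with expected sufficient statistics rather than single filled-in values):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: y is partly missing, x is fully observed.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, np.nan, 6.3, np.nan, 9.8, 12.2],
})
miss = df["y"].isna()
df.loc[miss, "y"] = df["y"].mean()  # crude starting values

for step in range(100):
    old = df.loc[miss, "y"].copy()
    # Re-estimate the model from the currently completed data ...
    model = LinearRegression().fit(df[["x"]], df["y"])
    # ... then update the missing entries with fresh predictions.
    df.loc[miss, "y"] = model.predict(df.loc[miss, ["x"]])
    # Stop when the imputed values change by less than the tolerance.
    if np.abs(df.loc[miss, "y"] - old).max() < 1e-6:
        break
```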
Full Information Maximum Likelihood (FIML or ML)
Works for statistical methods based on a covariance matrix (e.g., OLS regression, structural equation models). Models with binary outcomes can be estimated in some implementations (e.g., Mplus).
Does not produce values for the missing data; it just estimates the covariance matrix in the presence of missing data.
Similar to EM, except it does not actually assign values to cases. (EM is itself a maximum likelihood method.)
Full Information Maximum Likelihood (FIML or ML), continued
Creates a proper covariance matrix taking into account the information in both complete and incomplete cases (in contrast with pairwise deletion, where the covariance matrix is often not “proper”). This covariance matrix can then be used for regression or other covariance-based methods (e.g., SEM).
Assumes MAR and correctly adjusts the standard errors for the uncertainty due to missingness.
Available in most structural equation modeling (SEM) computer programs (Mplus, Amos, LISREL, Mx, EQS, and the new SEM package in Stata).
In some of these packages you can estimate a wide variety of regression-type models (e.g., logistic regression); Mplus is probably the most flexible.
Some missing data experts (e.g., Paul Allison) argue that this is probably the best approach if you can use it.
Multiple Imputation (MI)
One of the “modern methods” that is widely recommended.
You create several versions of the dataset that differ only in the missing values assigned. You should create at least 5 datasets, but more (e.g., 20+) is generally recommended for most models.
The value assigned for a case on a variable X has both a fixed (predicted) component and a random error component, with the size of the random component selected to reflect the degree of uncertainty in assigning a value.
The imputed datasets differ in the random components assigned: high certainty leads to little variation between datasets in the imputed values, low certainty to more variation.
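One way to generate such datasets in Python is scikit-learn’s IterativeImputer with posterior sampling, which adds a random draw on top of the prediction; a sketch with hypothetical data:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan],
    "x2": [2.2, np.nan, 6.1, 8.3, np.nan, 12.0],
    "x3": [0.5, 1.1, 1.4, np.nan, 2.6, 3.0],
})

# m completed datasets that differ only in the random component:
# sample_posterior=True draws imputations rather than using the
# fixed prediction, reflecting the uncertainty in each assignment.
m = 20
imputed_sets = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=i).fit_transform(df),
        columns=df.columns,
    )
    for i in range(m)
]
```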
Multiple Imputation (MI), continued
Once the datasets are generated, the researcher runs the regression in each dataset and then combines the estimates using “Rubin’s rules”: a set of equations that yield “correct” standard errors, taking into account the uncertainty introduced by the imputation.
In Stata, ice and mi do multiple imputation; SAS has the MI procedure. These packages also have procedures that combine the multiple datasets and compute pooled estimates (SAS has mianalyze; Stata has micombine and mim).
An MI module is now available in SPSS (or PASW), which can also combine the multiple datasets for many statistical procedures. The SPSS MI is similar to ice in Stata.
Version 11+ of Stata has an “official” mi program, which uses either the same statistical method as SAS MI or the ice procedure.
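For the pooling step, Rubin’s rules combine the m point estimates and their standard errors; a small self-contained helper (the numbers at the end are made up for illustration):

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool m estimates of one coefficient and their squared standard
    errors (within-imputation variances) using Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)

    q_bar = estimates.mean()        # pooled point estimate
    w = variances.mean()            # average within-imputation variance
    b = estimates.var(ddof=1)       # between-imputation variance
    t = w + (1 + 1 / m) * b         # total variance
    return q_bar, np.sqrt(t)        # pooled estimate and standard error

# A coefficient estimated in each of 5 imputed datasets:
est, se = rubin_pool([0.42, 0.45, 0.40, 0.44, 0.43],
                     [0.010, 0.011, 0.010, 0.012, 0.011])
```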
More on Multiple Imputation
Lots of issues remain with MI. How many variables should you take into consideration when imputing values? If a variable is not used to inform the imputation, then the imputed missing values are assumed to be uncorrelated with that variable.
MI with many variables and cases takes a LONG time to compute, because most imputation software uses “data augmentation” methods that are simulation-based and go through many thousands of iterations. More recent versions are more efficient.
See Johnson & Young (2011), where the consequences of these decisions are compared and recommendations about best choices are made.
Other Issues
Are you “making up” data when you do imputations?
If done properly, no. Imputation just enables you to use methods that require a full data matrix. With a proper imputation, no new information is added: the imputed values can be viewed as neutral “fillers” that allow the use of complete-case statistical procedures.
Why are some of the imputed values so strange? (Sometimes imputed values will be out of the range of the observed data.)
The values themselves are not all that important for the regression estimates. Many people recode them to fall within the range, but this is not necessary for proper estimation. Recoding them into the correct range and onto the discrete values is OK if the data will be used for other purposes as well; it may, but probably will not, produce a slight bias.
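If the completed data will be used for other purposes, the recode can be as simple as clipping and rounding (a hypothetical 1–4 scale item):

```python
import numpy as np

# Hypothetical imputed values for a 1-4 Likert item, some out of range.
imputed = np.array([2.3, 4.7, 0.6, 3.1])

# Optional recode: round to the discrete categories and clip to range.
# Not required for proper regression estimation.
recoded = np.clip(np.round(imputed), 1, 4)   # -> [2., 4., 1., 3.]
```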
Other Issues, continued
What is the most acceptable approach to use today to get published in the literature?
FIML is acceptable for SEM or OLS regression and is becoming more common for other approaches (e.g., mixture models). MI methods are accepted for other approaches.
How many imputations should you use?
The old standard was 5. Recent work suggests that the number of imputed datasets should increase with the amount of missing information; from 10 to 25 is acceptable. (We did not find much difference in estimates based on different numbers of datasets.)
Working with Missing Data: What will the future bring?
The literature on how missing data should be handled is still in transition; many issues are unresolved and more work is needed.
Our own work shows that, if you have a relatively small proportion of missing data, you are not likely to bias your findings as long as you use one of the “modern” methods.
My prognosis is that in the near future, going through special steps, creating special datasets, and so on will no longer be necessary: proper handling of missing data will be built into the statistical software, as is already the case with Mplus and other SEM packages.
Already, Mplus can handle a wide variety of models with incomplete data, and the newest version (6) can also do the imputations and pooling of estimates automatically if you request that option; there is no need for a two-step process.