Missing data: Why you should care about it and what to do about it



Lecture overview
  Why care about missing data?
  Diagnosing different types of missing data
  Bad methods of handling missing data
  Better methods of handling missing data

General info about missing data: Acock (2005)
Theory behind multiple imputation: Rubin (1996); Schafer (1999)
How to do multiple imputation using the MICE package in R: Van Buuren and Groothuis-Oudshoorn (2011)
Empirical examples: Van Buuren, Boshuizen, and Knook (1999); Sundell et al. (2008); Devine et al. (2012)

Why care about missing data?

Not all missing data are the same
  Missing by design: values are missing by definition of the population of interest
  Missing completely at random (MCAR): missing values are randomly distributed
  Missing at random (MAR): after accounting for one or more other variables, missing values are randomly distributed
  Non-ignorable (NI): whether a value is missing depends on the value itself, even after accounting for other variables

Income randomly missing (MCAR)

Figure panels: Full data, MCAR

Income missing for professions with a high percentage of women (MAR)

Figure panels: Full data, MAR

Income missing for low-income professions (NI)

Figure panels: Full data, NI
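The three mechanisms can be imitated on synthetic data. The sketch below is only an illustration: the occupations, the coefficients, and the cut-offs (30% missing at random, more than 70% women, income below 5000) are all assumptions made up for the example, not the data behind the slides.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical synthetic occupations (not the data shown in the slides):
# income rises with prestige and falls with the percentage of women.
rows = []
for _ in range(5000):
    prestige = random.uniform(0, 100)
    pct_women = random.uniform(0, 100)
    income = 6000 + 50 * prestige - 40 * pct_women + random.gauss(0, 500)
    rows.append({"prestige": prestige, "pct_women": pct_women, "income": income})

# One missingness indicator per mechanism (True = income is missing).
missing = {
    "MCAR": [random.random() < 0.3 for _ in rows],  # pure chance
    "MAR": [r["pct_women"] > 70 for r in rows],     # depends on an observed variable
    "NI": [r["income"] < 5000 for r in rows],       # depends on income itself
}

full_mean = sum(r["income"] for r in rows) / len(rows)
observed_means = {}
for name, mask in missing.items():
    kept = [r["income"] for r, m in zip(rows, mask) if not m]
    observed_means[name] = sum(kept) / len(kept)
    print(f"{name}: observed mean {observed_means[name]:.0f}, full-data mean {full_mean:.0f}")
```

In this setup the MCAR mean should stay close to the full-data mean, while the MAR and NI means drift upward because the cases that drop out are systematically the lower-income ones.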

Why care about missing data?
  Missing by design data are not a problem
  MCAR data inflate the standard errors of your parameter estimates: you lose precision, but the estimates themselves are not biased
  MAR or NI data can bias BOTH parameter estimates and standard errors in unpredictable ways

How much missing data is too much?
  Hard to say
  Small amounts of missing data can sometimes greatly affect analysis if missing values are extreme
  Missing data are particularly problematic when the data are MAR or NI
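A small arithmetic illustration of the point about extreme values, using made-up numbers: only 2 of 100 cases are missing, but because they are extreme the observed mean is far from the full-data mean.

```python
# Hypothetical data: 98 ordinary cases and 2 extreme cases.
values = [30.0] * 98 + [1000.0] * 2

full_mean = sum(values) / len(values)

# Suppose the 2 extreme cases are exactly the ones that went missing.
observed = values[:98]
observed_mean = sum(observed) / len(observed)

share_missing = 1 - len(observed) / len(values)
print(f"missing: {share_missing:.0%}, full mean: {full_mean:.1f}, observed mean: {observed_mean:.1f}")
```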

Diagnosing different types of missing data

Missing data diagnostics
  Goals:
    Make a reasonable guess about the type of missing data you have
    Find variables that predict missingness or observed missing values
  Diagnostic options:
    Statistical: missingness patterns; pairwise complete correlations between your available variables; correlations between response indicators for your variables with missing values and other variables
    Graphical: margin plots; other options in the VIM package

Finding patterns of missingness
  rr matrix: the number of observations for which both the row and column variables were observed
  rm matrix: the number of observations for which the row variable was observed, but the column variable was not
  mr matrix: the number of observations for which the column variable was observed, but the row variable was not
  mm matrix: the number of observations for which neither the row nor column variables were observed
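The four matrices are plain counts, so they can be reproduced by hand. The sketch below assumes missing values are stored as None; the function name pair_counts and the toy data are hypothetical, and it is meant to mirror the definitions above rather than any package's implementation.

```python
def pair_counts(data):
    """Count observed/missing combinations for every pair of variables.

    data maps a variable name to a list of values, with None for missing.
    Returns a dict with keys "rr", "rm", "mr", "mm"; each entry is indexed
    as counts[key][row_variable][column_variable].
    """
    names = list(data)
    n = len(data[names[0]])
    counts = {key: {a: {b: 0 for b in names} for a in names}
              for key in ("rr", "rm", "mr", "mm")}
    for i in range(n):
        for row in names:
            for col in names:
                # First letter: row variable, second letter: column variable.
                key = ("r" if data[row][i] is not None else "m") + \
                      ("r" if data[col][i] is not None else "m")
                counts[key][row][col] += 1
    return counts


# Hypothetical toy data.
toy = {
    "income": [10, None, 30, None, 50],
    "prestige": [1, 2, None, 4, 5],
}
counts = pair_counts(toy)
for key in ("rr", "rm", "mr", "mm"):
    print(key, "income x prestige:", counts[key]["income"]["prestige"])
```

For any pair of variables the four counts add up to the number of cases, which is a quick sanity check on the output.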

Finding patterns of missingness
  Reading the pattern table: 1 = not missing, 0 = missing
  The table reports the # of cases fitting each missingness pattern, the # of variables with missing values following that pattern, and the # of cases with missing values on each column variable
  With this simple pattern of missingness, all available variables potentially have information about why some values of income are missing
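A pattern table of this kind can be built by coding each case as a row of 1s and 0s and counting identical rows. This is a minimal sketch with hypothetical toy data and a hypothetical function name; missing values are assumed to be stored as None.

```python
from collections import Counter


def missingness_patterns(data):
    """Return variable names and a Counter of patterns (1 = observed, 0 = missing)."""
    names = list(data)
    n = len(data[names[0]])
    patterns = Counter(
        tuple(1 if data[name][i] is not None else 0 for name in names)
        for i in range(n)
    )
    return names, patterns


# Hypothetical toy data: only income has missing values.
toy = {
    "income": [10, None, 30, None, 50],
    "prestige": [1, 2, 3, 4, 5],
    "pct_women": [5, 6, 7, 8, 9],
}
names, patterns = missingness_patterns(toy)

print("cases", names, "n_missing_vars")
for pattern, n_cases in patterns.most_common():
    print(n_cases, pattern, pattern.count(0))

# Number of cases with a missing value on each column variable.
missing_per_variable = {name: sum(v is None for v in toy[name]) for name in names}
print(missing_per_variable)
```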

Correlations
  Predicting observed values: pairwise complete correlations (prestige is a strong predictor of observed income)
  Predicting missingness: correlations with response indicators (% women is a strong predictor of missingness)
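Both kinds of correlation can be computed directly. The sketch below uses made-up numbers chosen to echo the pattern described in the slide; the helper names are hypothetical and missing values are assumed to be stored as None.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pairwise_complete(xs, ys):
    """Correlation using only the cases where both variables are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])


# Hypothetical toy data.
pct_women = [10, 20, 30, 40, 80, 85, 90, 95]
prestige = [70, 60, 65, 50, 40, 45, 30, 35]
income = [9000, 7500, 8000, 6000, None, None, None, 4000]

# Response indicator: 1 if income was observed, 0 if it is missing.
observed_income = [0 if v is None else 1 for v in income]

r_values = pairwise_complete(prestige, income)      # predicts observed values
r_missing = pearson(pct_women, observed_income)     # predicts missingness
print(f"prestige vs income (pairwise complete): {r_values:.2f}")
print(f"pct_women vs response indicator: {r_missing:.2f}")
```

A strongly negative correlation with the response indicator means that cases with a high value on that variable are the ones most likely to have income missing.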

Margin plots

Bad methods of handling missing data

Bad methods: Casewise exclusion
  Eliminate each case that has a missing value
  The implicit “standard” method of dealing with missing data
  Can be a (somewhat) acceptable method when data are MCAR

How reasonable is it to assume that missing data in the social sciences are MCAR?

When data are MAR or NI, casewise exclusion leads to unpredictable bias in standard errors and parameter estimates
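Casewise exclusion is simple to write down, which is part of why it is the implicit default. The sketch below uses hypothetical toy records with None for missing values and shows how quickly cases disappear even when each variable is missing only once.

```python
def casewise_exclusion(rows):
    """Keep only the cases that have no missing value on any variable."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical toy records: each variable is missing in just one case.
rows = [
    {"income": 9000, "prestige": 70, "pct_women": 10},
    {"income": None, "prestige": 40, "pct_women": 80},
    {"income": 7500, "prestige": None, "pct_women": 20},
    {"income": 6000, "prestige": 50, "pct_women": None},
    {"income": 4000, "prestige": 35, "pct_women": 95},
]

complete = casewise_exclusion(rows)
print(f"{len(complete)} of {len(rows)} cases survive casewise exclusion")
```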

Bad methods: Mean substitution
  Substitute the mean of the variable for the missing values

Mean substitution leads to systematic bias in standard errors and, when data are MAR or NI, in parameter estimates
It is NEVER a good method of handling missing data
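One way to see the systematic bias: every substituted value sits exactly at the mean, so the variable's variance shrinks and standard errors computed from the filled-in data come out too small. A minimal sketch with made-up numbers (None marks a missing value):

```python
import statistics

# Hypothetical variable with two missing values.
values = [2.0, 4.0, None, 8.0, None, 10.0]

observed = [v for v in values if v is not None]
mean_observed = statistics.mean(observed)

# Mean substitution: replace every missing value with the observed mean.
filled = [mean_observed if v is None else v for v in values]

var_observed = statistics.variance(observed)  # sample variance of observed cases
var_filled = statistics.variance(filled)      # sample variance after substitution

print(f"mean is unchanged: {statistics.mean(filled)}")
print(f"variance of observed cases: {var_observed:.2f}")
print(f"variance after mean substitution: {var_filled:.2f}")
```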

Better methods of handling missing data

Better methods of handling missing data
  Full information maximum likelihood (FIML) methods
    Can handle data that are MAR, and NI data to the extent that the missingness can be explained by other variables in the model
    Implemented as part of particular statistical models
    Missing data are handled during analysis
  Multiple imputation
    Can handle the same kinds of missing data as FIML
    Simulation-based approach
    Missing data are handled separately from analysis

Multiple imputation
  1. Generate multiple completed datasets (imputations) through simulation (only 5 to 10 are needed)
  2. Perform analyses on each imputation
  3. Combine the multiple analyses using a set of special rules (Rubin’s (1987) rules)


Generating imputations
  Imputations are generated through maximum-likelihood based Markov-chain Monte Carlo (MCMC)
  The exact details of how imputations are generated vary from method to method
  The quality of the simulations depends on how well the analyst can explain observed values and missingness of imputed variables
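To make the idea of simulated imputations concrete, here is a deliberately simplified stand-in, not the MCMC procedure the slide refers to and not what the MICE package does internally. It imputes one variable from one predictor: each imputation refits a regression on a bootstrap sample of the observed cases and then adds random residual noise to the predictions. All names and data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (sse / (n - 2)) ** 0.5


def impute_once(xs, ys, rng):
    """Return a copy of ys with each None replaced by a simulated value."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    # Bootstrap the observed cases so the regression itself varies by imputation.
    boot = [rng.choice(observed) for _ in observed]
    a, b, sigma = fit_line([p[0] for p in boot], [p[1] for p in boot])
    # Observed values are kept; missing ones get prediction plus random noise.
    return [y if y is not None else a + b * x + rng.gauss(0, sigma)
            for x, y in zip(xs, ys)]


rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]
# Hypothetical missingness: every third value of y is unobserved.
ys_missing = [None if i % 3 == 0 else y for i, y in enumerate(ys)]

imputations = [impute_once(xs, ys_missing, rng) for _ in range(5)]
print([round(imp[0], 2) for imp in imputations])  # five different draws for case 0
```

The imputed values differ from one dataset to the next, and that spread is what later carries the uncertainty about the missing values into the pooled standard errors.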

Which variables do you use to generate your simulations?
  In general, the more variables the better (to a point; multicollinear variables can crash the simulation)
  Always use variables that will be involved in your final analysis, including interaction terms and contrasts
  Use variables that will not be included in the analysis, but that are good predictors of observed values of imputed variables
  Include variables that are good predictors of missingness
  Only use predictors that are themselves mostly observed in the cases where the imputed variables are missing

Convergence
  For each imputation, the missing values of each variable are iteratively estimated
  In each iteration, the means and standard deviations of those missing values are slightly different
  Iteration continues until the means and standard deviations of the imputed values across the imputed datasets start to cluster (“converge”)
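The slides judge convergence by eye from plots. As a numeric companion, one common option (an assumption added here, not something the lecture uses) is the Gelman-Rubin statistic, which compares the spread between the imputation streams with the spread within each stream. Values near 1 suggest the streams have clustered; values well above 1 suggest they have not. The traces below are made up.

```python
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor for equally long chains of one statistic."""
    n = len(chains[0])
    within = statistics.mean(statistics.variance(chain) for chain in chains)
    between = n * statistics.variance([statistics.mean(chain) for chain in chains])
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Hypothetical per-iteration means of the imputed values, one list per imputation.
mixed = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [2.0, 4.0, 1.0, 3.0]]
separated = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]

rhat_mixed = gelman_rubin(mixed)
rhat_separated = gelman_rubin(separated)
print(f"well-mixed streams: {rhat_mixed:.2f}")
print(f"separated streams: {rhat_separated:.2f}")
```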

Imputation procedure: checking for convergence

Two examples of non-convergence (figure panels: Means, SDs)

The estimated means for gen and phb start high in the first few iterations, then converge toward lower values (figure panels: Means, SDs)

Checking that the imputed values are reasonable


Perform analyses as usual on each simulated dataset
  The intercepts and slopes of the linear model vary slightly across the simulated datasets
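As a sketch of this step, the same simple linear regression can be fitted to each completed dataset, keeping the slope and its squared standard error from every fit. The three tiny datasets are hypothetical stand-ins for imputations, and the function name is made up for the example.

```python
def slope_and_variance(xs, ys):
    """Least-squares slope of y on x and the estimated variance of that slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    residual_variance = sse / (n - 2)
    return slope, residual_variance / sxx


# Hypothetical completed datasets: they differ only in one imputed y value.
imputed_datasets = [
    ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 4.0, 8.0]),
    ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 8.0]),
    ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 6.0, 8.0]),
]

results = [slope_and_variance(xs, ys) for xs, ys in imputed_datasets]
for slope, variance in results:
    print(f"slope = {slope:.2f}, squared standard error = {variance:.3f}")
```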


The overall estimate of your parameter (Q-bar) is its mean across the m imputations
The within-imputation variance (U-bar) of the Q parameter is the mean of the variances across the m imputations
The between-imputation variance (B) of the Q parameter is the variance of Q across the m imputations
The total variance of Q is a function of U-bar and B: T = U-bar + (1 + 1/m) B. The standard error used for test statistics is the square root of T
The degrees of freedom (v) are adjusted for the amount of information lost to missing data
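The pooling rules fit in a few lines. This sketch uses the classic large-sample degrees of freedom from Rubin (1987), v = (m - 1)(1 + U-bar / ((1 + 1/m) B))^2; software may apply further small-sample adjustments that are not shown. The function name and the example numbers are hypothetical.

```python
import math


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputations using Rubin's rules.

    estimates: the parameter estimate Q from each imputation
    variances: the squared standard error U of Q from each imputation
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                                  # pooled estimate
    u_bar = sum(variances) / m                                  # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)      # between-imputation variance
    total = u_bar + (1 + 1 / m) * b                             # total variance T
    if b == 0:
        df = math.inf  # no between-imputation variation, no adjustment needed
    else:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return {"estimate": q_bar, "within": u_bar, "between": b,
            "total": total, "se": math.sqrt(total), "df": df}


# Hypothetical slopes and squared standard errors from m = 3 imputations.
pooled = pool_rubin([2.0, 2.2, 2.4], [0.10, 0.12, 0.14])
print(pooled)
```

The pooled standard error is larger than the square root of U-bar alone, because the disagreement between imputations (B) is added on top of the ordinary sampling variance.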

Pooled results (table panels: Pooled results, No missing data, Casewise exclusion)

Conclusions
  When you have missing data, think about WHY they are missing
  Missing data handled improperly can bias your conclusions
  Multiple imputation is one good way of handling missing data
  Caveat: Multiple imputation is complex, so do some reading before you do it