How to handle missing data values


How to handle missing data values
How not to do it
How to do it better
How to do it properly
HG, CMM Bristol

Information about missing data techniques
www.missingdata.org.uk (free to register)
Software:
REALCOM-IMPUTE: free, 2-level, general but slow
STATJR: general n-level (2-level version free), fast
To illustrate the concepts, let's consider a simple example.
11/11/2018

Simple regression: y on x
$y_i = \alpha + \beta x_i + e_i$, with $(y, x)$ bivariate normal.
For example (* marks a missing value):
y      x
31.5   *
22.3   3.2
1.9    .
So let's generate a dataset, large enough that we can illustrate matters without having to go through tedious simulations.

The dataset
The regression is $y = 0.5x$ with residual variance $\sigma^2 = 0.75$ (where $\beta = \mathrm{cov}(x, y)/\mathrm{var}(x)$).
Simulate 100,000 pairs of values and estimate the regression. We get $\hat{\beta} = 0.502$ (0.00274), $\hat{\sigma}^2 = 0.751$.
Set about 20% of the y's missing at random.
Set about 20% of the x's missing, but not at random: $\Pr(x \text{ missing}) \propto |y|$.
We'll apply some popular (intuitive?) procedures.
Note that if both values are missing (~4% of cases) there is no information, so delete these, leaving 96,028 cases.
The fundamental idea is that of imputation.
"This silence for my sin you did impute, / Which shall be most my glory, being dumb" (W.S., Sonnet 83).
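A dataset of this kind can be generated along the following lines. This is only a sketch: the slide does not give the means and variances or the exact missingness mechanism, so the unit variance for x, the scaling of Pr(x missing), the seed and all names (`ols`, `y_miss`, `x_miss`, `keep`) are assumptions, and the numbers it produces will not match those quoted on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed (assumption)
n = 100_000

# Assumption: x ~ N(0, 1) and e ~ N(0, 0.75), giving y = 0.5 x + e
x = rng.normal(0.0, 1.0, n)
y = 0.5 * x + rng.normal(0.0, np.sqrt(0.75), n)

def ols(x, y):
    """Slope, its standard error and residual variance for y regressed on x."""
    xc = x - x.mean()
    beta = (xc * (y - y.mean())).sum() / (xc ** 2).sum()
    resid = (y - y.mean()) - beta * xc
    sigma2 = (resid ** 2).sum() / (len(x) - 2)
    se = np.sqrt(sigma2 / (xc ** 2).sum())
    return beta, se, sigma2

beta_full, se_full, sigma2_full = ols(x, y)

# About 20% of y's missing, independently of everything else
y_miss = rng.random(n) < 0.20

# About 20% of x's missing, with probability proportional to |y|
# (assumed scaling: 0.2 * |y| / mean|y|, clipped to [0, 1])
p_x = np.clip(0.20 * np.abs(y) / np.abs(y).mean(), 0.0, 1.0)
x_miss = rng.random(n) < p_x

# Cases with both values missing carry no information, so drop them
keep = ~(y_miss & x_miss)
print(beta_full, se_full, sigma2_full, keep.sum())
```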

What not to do - 1
Calculate the mean of the observed values and substitute it for each missing value.
We get estimates (standard error in brackets): $\hat{\beta} = 0.455$ (0.00275), $\hat{\sigma}^2 = 0.632$.
This is biased. Note also that the standard error is wrong: ~36% of values are 'imputed' and not actually observed, so the estimated standard error is too small. The correct standard error is elusive.
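Mean substitution can be sketched as follows, under an assumed data-generating setup (the helper `simulate`, its missingness scaling and the seed are hypothetical, not taken from the slides). The sketch also shows why the method distorts things: every filled-in value sits exactly at the mean, so the variance of each variable shrinks.

```python
import numpy as np

def simulate(n, seed):
    """Hypothetical helper: y = 0.5 x + e, ~20% of y and ~20% of x set to NaN."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 0.5 * x + rng.normal(scale=np.sqrt(0.75), size=n)
    y_na = rng.random(n) < 0.2
    # Assumed mechanism: Pr(x missing) proportional to |y|
    x_na = rng.random(n) < np.clip(0.2 * np.abs(y) / np.abs(y).mean(), 0, 1)
    keep = ~(x_na & y_na)  # drop cases with both missing
    return np.where(x_na, np.nan, x)[keep], np.where(y_na, np.nan, y)[keep]

x_obs, y_obs = simulate(100_000, seed=1)

# Substitute the observed mean for every missing value
x_imp = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)
y_imp = np.where(np.isnan(y_obs), np.nanmean(y_obs), y_obs)

beta_mean = np.cov(x_imp, y_imp)[0, 1] / np.var(x_imp, ddof=1)
print(beta_mean, np.nanvar(x_obs), np.var(x_imp))
```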

What not to do - 2
Prediction imputation: use the observed 'complete' records to predict $y|x$ and $x|y$, and plug in the predicted values for the missing ones.
We get estimates (standard error in brackets): $\hat{\beta} = 0.573$ (0.00264), $\hat{\sigma}^2 = 0.591$.
The parameters are still biased, as is the standard error.
Now use just the complete records: $\hat{\beta} = 0.569$ (0.00357), $\hat{\sigma}^2 = 0.851$.
This is popular because it is simple, but there is still a large bias because the complete records are not a random sample.
If, however, the complete cases are a random sample, we get $\hat{\beta} = 0.500$ (0.00341), $\hat{\sigma}^2 = 0.748$.
So the estimates are now unbiased, but the standard error is larger than for the full data because the sample is smaller.
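Both procedures on this slide can be sketched in a few lines. The setup is assumed rather than taken from the slides (the helpers `simulate` and `fit`, the missingness scaling and the seed are hypothetical), so the estimates will differ from those quoted.

```python
import numpy as np

def simulate(n, seed):
    """Hypothetical helper: y = 0.5 x + e, ~20% of y and ~20% of x set to NaN."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 0.5 * x + rng.normal(scale=np.sqrt(0.75), size=n)
    y_na = rng.random(n) < 0.2
    # Assumed mechanism: Pr(x missing) proportional to |y|
    x_na = rng.random(n) < np.clip(0.2 * np.abs(y) / np.abs(y).mean(), 0, 1)
    keep = ~(x_na & y_na)
    return np.where(x_na, np.nan, x)[keep], np.where(y_na, np.nan, y)[keep]

def fit(u, v):
    """Least-squares intercept and slope of v on u."""
    slope = np.cov(u, v)[0, 1] / np.var(u, ddof=1)
    return v.mean() - slope * u.mean(), slope

x, y = simulate(100_000, seed=2)
cc = ~np.isnan(x) & ~np.isnan(y)   # complete records

a_yx, b_yx = fit(x[cc], y[cc])     # y | x
a_xy, b_xy = fit(y[cc], x[cc])     # x | y

# Prediction imputation: plug in the fitted value, with no noise added
y_imp = np.where(np.isnan(y), a_yx + b_yx * x, y)
x_imp = np.where(np.isnan(x), a_xy + b_xy * y, x)

beta_cc = b_yx                     # complete-case slope
_, beta_pred = fit(x_imp, y_imp)   # slope after prediction imputation
print(beta_cc, beta_pred)
```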

How to do it better - 3
For plug-in and prediction imputation, $\sigma_y^2$, $\sigma_x^2$ and $\sigma_{xy}$ are biased.
So for regression imputation let's add a random variable to each imputed (predicted) value, drawn from the regression residual distributions, i.e. from $f(y|x)$ and $f(x|y)$.
We now get estimates (standard error in brackets): $\hat{\beta} = 0.502$ (0.00279), $\hat{\sigma}^2 = 0.754$.
The bias has now virtually gone, but the standard error is still too small because it takes no account of the fact that the imputed values are derived from the data and not observed.
We call this random regression imputation.
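A minimal sketch of random regression imputation follows, assuming normal residuals and a hypothetical data-generating helper (`simulate`, `fit`, the missingness scaling and the seed are not from the slides). The only change from prediction imputation is the residual draw added to each predicted value.

```python
import numpy as np

def simulate(n, seed):
    """Hypothetical helper: y = 0.5 x + e, ~20% of y and ~20% of x set to NaN."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 0.5 * x + rng.normal(scale=np.sqrt(0.75), size=n)
    y_na = rng.random(n) < 0.2
    # Assumed mechanism: Pr(x missing) proportional to |y|
    x_na = rng.random(n) < np.clip(0.2 * np.abs(y) / np.abs(y).mean(), 0, 1)
    keep = ~(x_na & y_na)
    return np.where(x_na, np.nan, x)[keep], np.where(y_na, np.nan, y)[keep], rng

def fit(u, v):
    """Intercept, slope and residual SD of the least-squares line of v on u."""
    slope = np.cov(u, v)[0, 1] / np.var(u, ddof=1)
    intercept = v.mean() - slope * u.mean()
    resid_sd = np.std(v - intercept - slope * u, ddof=2)
    return intercept, slope, resid_sd

x, y, rng = simulate(100_000, seed=3)
y_na, x_na = np.isnan(y), np.isnan(x)
cc = ~x_na & ~y_na                      # complete records

a_yx, b_yx, s_yx = fit(x[cc], y[cc])    # y | x
a_xy, b_xy, s_xy = fit(y[cc], x[cc])    # x | y

y_pred = a_yx + b_yx * x
x_pred = a_xy + b_xy * y

# Prediction plus a draw from the (assumed normal) residual distribution
y_imp = np.where(y_na, y_pred + rng.normal(0.0, s_yx, y.size), y)
x_imp = np.where(x_na, x_pred + rng.normal(0.0, s_xy, x.size), x)

_, beta_rri, sd_rri = fit(x_imp, y_imp)
print(beta_rri, sd_rri ** 2)
```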

How to do it better - 4
Hot decking, as enjoyed by survey analysts.
For each missing x, take the observed y associated with it, say y*, and find the set of records whose y values are 'similar' to y*. (For a missing y, do the same with the roles of x and y exchanged, using x*.)
There is an issue about how to define 'similar'. In the present case we shall take all those y's in the range $y^* \pm 0.1$, but we can do sensitivity analyses.
From that set of records select one at random or, if you want to be sophisticated, sample according to the distance from y*. Its x value then becomes the imputed value.
When it works, the results are similar to those from random regression imputation.
In practice the choice of range is crucial, and with several variables we may not be able to find suitable pools of records from which to select at random.
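The donor-pool idea can be sketched as below. The function `hot_deck`, the use of complete records as donors, and the data-generating helper `simulate` are all assumptions made for illustration; a missing value with an empty donor pool is simply left missing, and donors are drawn uniformly rather than by distance.

```python
import numpy as np

def hot_deck(target, match, width, rng):
    """Fill NaNs in `target` with the value of a random donor whose
    `match` value lies within +/- width of the recipient's `match` value."""
    donors = ~np.isnan(target) & ~np.isnan(match)
    order = np.argsort(match[donors])
    d_match = match[donors][order]
    d_target = target[donors][order]
    need = np.isnan(target)
    # Donor pool for each recipient is the slice [lo, hi) of the sorted donors
    lo = np.searchsorted(d_match, match[need] - width, side="left")
    hi = np.searchsorted(d_match, match[need] + width, side="right")
    count = hi - lo
    pick = lo + np.floor(rng.random(need.sum()) * count).astype(int)
    pick = np.minimum(pick, len(d_target) - 1)  # guard for empty pools
    filled = target.copy()
    filled[need] = np.where(count > 0, d_target[pick], np.nan)
    return filled

def simulate(n, seed):
    """Hypothetical helper: y = 0.5 x + e, ~20% of y and ~20% of x set to NaN."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 0.5 * x + rng.normal(scale=np.sqrt(0.75), size=n)
    y_na = rng.random(n) < 0.2
    x_na = rng.random(n) < np.clip(0.2 * np.abs(y) / np.abs(y).mean(), 0, 1)
    keep = ~(x_na & y_na)
    return np.where(x_na, np.nan, x)[keep], np.where(y_na, np.nan, y)[keep], rng

x, y, rng = simulate(100_000, seed=4)
x_hd = hot_deck(x, y, 0.1, rng)   # missing x: donors with similar y
y_hd = hot_deck(y, x, 0.1, rng)   # missing y: donors with similar x
print(np.isnan(x).sum(), np.isnan(x_hd).sum())
```

Rerunning with a different `width` is one way to carry out the sensitivity analysis mentioned above.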

How to do it better - 4 ctd.
So: when missingness is not random, the only one of these procedures that gives unbiased parameter estimates is random imputation, but its standard errors are incorrect.
When missingness is random, complete case analysis and random imputation are both unbiased; the former is inefficient and the latter gives an incorrect standard error.

How to do it properly
Known as multiple imputation, this basically does a random imputation but repeats it independently n times, where n is a suitably large number: traditionally 5, but more realistically up to 20. An MCMC chain is typically used.
We therefore obtain n estimates of $\beta$ and $\sigma^2$, and these are averaged, using Rubin's rules, to obtain final values together with consistent standard error estimates.
For n = 5 we get $\hat{\beta} = 0.503$ (0.00290), $\hat{\sigma}^2 = 0.752$.
We now have unbiased estimates with a correct standard error, and we see that the procedure is relatively efficient.
This is then the basis for a more general implementation (multilevel, with mixed variable types) as in REALCOM and STATJR.
Finally, a fully Bayesian procedure has been developed that is fast and very general, and it will also become available in STATJR.
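The repeat-and-combine structure can be sketched as follows. This is not the MCMC procedure used by REALCOM or STATJR: to reflect uncertainty in the imputation model, the sketch swaps in a bootstrap resample of the complete records before each imputation. The helpers `simulate` and `ols`, the missingness scaling and the seed are likewise assumptions.

```python
import numpy as np

def simulate(n, seed):
    """Hypothetical helper: y = 0.5 x + e, ~20% of y and ~20% of x set to NaN."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 0.5 * x + rng.normal(scale=np.sqrt(0.75), size=n)
    y_na = rng.random(n) < 0.2
    x_na = rng.random(n) < np.clip(0.2 * np.abs(y) / np.abs(y).mean(), 0, 1)
    keep = ~(x_na & y_na)
    return np.where(x_na, np.nan, x)[keep], np.where(y_na, np.nan, y)[keep], rng

def ols(u, v):
    """Intercept, slope, residual SD and slope variance for v regressed on u."""
    uc = u - u.mean()
    slope = (uc * v).sum() / (uc ** 2).sum()
    intercept = v.mean() - slope * u.mean()
    s2 = ((v - intercept - slope * u) ** 2).sum() / (len(u) - 2)
    return intercept, slope, np.sqrt(s2), s2 / (uc ** 2).sum()

x, y, rng = simulate(100_000, seed=5)
cc = np.flatnonzero(~np.isnan(x) & ~np.isnan(y))   # complete records
m = 5                                              # number of imputations
betas, variances = [], []
for _ in range(m):
    b = rng.choice(cc, size=cc.size, replace=True)  # bootstrap (swapped in for MCMC)
    a1, b1, s1, _ = ols(x[b], y[b])                 # y | x
    a2, b2, s2, _ = ols(y[b], x[b])                 # x | y
    # Random regression imputation using this replicate's parameters
    y_k = np.where(np.isnan(y), a1 + b1 * x + rng.normal(0, s1, x.size), y)
    x_k = np.where(np.isnan(x), a2 + b2 * y + rng.normal(0, s2, x.size), x)
    _, beta_k, _, var_k = ols(x_k, y_k)             # fit the analysis model
    betas.append(beta_k)
    variances.append(var_k)

betas, variances = np.array(betas), np.array(variances)
beta_mi = betas.mean()                   # pooled estimate
within = variances.mean()                # average within-imputation variance
between = betas.var(ddof=1)              # between-imputation variance
total = within + (1 + 1 / m) * between   # Rubin's rules
se_mi = np.sqrt(total)
print(beta_mi, se_mi)
```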

A general approach
We have a multilevel model of interest (MOI) with a response (possibly more than one) and covariates, possibly at several levels.
We take all the variables at each level and make a multivariate response model, with 'complete' variables either as responses or as covariates.
We finish up with a multilevel multivariate response model, and at each higher level we allow the responses at that level to correlate with random effects derived from a lower level.
At this point we can include 'auxiliaries' that are not in the MOI but might be associated with the propensity to be missing, thus improving our ability to satisfy the missing at random (MAR) assumption, i.e. that values are randomly missing given the other variables in the MOI. This assumption is needed.
Within an MCMC chain we produce n 'complete' data sets and fit the MOI to each one. Then we combine:

Combining MOIs using Rubin's rules
Take the average of the estimates.
Between-imputation variance of the estimates.
Average of the (within-imputation) variances.
Combine.
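In their standard form, Rubin's rules for a single parameter can be written as below. The notation is assumed here rather than taken from the slide: $\hat{\theta}_k$ is the estimate from the k-th of n completed data sets and $V_k$ is its estimated variance.

```latex
\begin{align*}
\bar{\theta} &= \frac{1}{n}\sum_{k=1}^{n}\hat{\theta}_k
  && \text{(average of the estimates)} \\
B &= \frac{1}{n-1}\sum_{k=1}^{n}\bigl(\hat{\theta}_k-\bar{\theta}\bigr)^2
  && \text{(between-imputation variance)} \\
W &= \frac{1}{n}\sum_{k=1}^{n}V_k
  && \text{(average of the variances)} \\
T &= W + \Bigl(1+\frac{1}{n}\Bigr)B
  && \text{(combined variance)}
\end{align*}
```

The final estimate is $\bar{\theta}$ with standard error $\sqrt{T}$.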

Handling a mixture of variable types
All of this so far assumes normality. What if we also have categorical data?
Key reference: Goldstein, H., Carpenter, J., Kenward, M. and Levin, K. (2009). Multilevel models with multivariate mixed response types. Statistical Modelling, 9(3), 173-197.
The method essentially works by assuming underlying normal distributions that generate discrete variables via thresholds, e.g. a probit model for binary data.
STATJR software: the 2-level version is freely downloadable from CMM.
Note that while STATA (MICE) can handle mixed variable types, it cannot handle higher level variables and doesn't have a strong methodological foundation.
Most other software assumes normality throughout.

Further developments
Existing MI methods cannot properly handle interactions, including power terms.
New development: Goldstein, H., Carpenter, J. R. and Browne, W. J. (2013), JRSSA, doi: 10.1111/rssa.12022.
This allows interactions and polynomials: at each iteration of an MCMC chain we fit the imputation model (IM) and the MOI together, based on the joint likelihood.
This gives a single MOI chain that can be used for inferences in the usual ways.
It avoids a 2-stage procedure and allows sensitive data diagnostics.