Multiple imputation using ICE: A simulation study on a binary response. Jochen Hardt, Kai Görgen. 6th German Stata Meeting, Berlin, June 27th, 2008. Göteborg.

Presentation transcript:

Multiple imputation using ICE: A simulation study on a binary response
Jochen Hardt, Kai Görgen
6th German Stata Meeting, Berlin, June 27th, 2008
Göteborg University; University of Mainz; Bernstein Center for Computational Neuroscience, Berlin

Almost all sociological and medical data have missing values, typically in the range of 0.5 % to 5 % per variable.
Many statistical procedures can only use cases without missing values.
What we already know about substituting missing values:
1) With a small proportion of missing values, everything is easy.
2) Large samples are easy.

Overview
- Missingness at random
- A very simple example
  - Analysis of complete cases
  - Imputation of means
  - Single regression imputation
  - Multiple imputation: hotdeck
  - Multiple imputation: chained equations
- A not so simple example
- Multiple imputation by chained equations in real data

Background I
The literature distinguishes between data that are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Rubin, 1996). MCAR means that the pattern of missing values is entirely random and does not depend on any variable, whether or not it is in the analysis. MAR is an intuitively somewhat misleading label, because it allows strong dependencies in the pattern of missing values. If, for example, in a set of variables all data for men are missing and all data for women are observed, the dataset is still MAR as long as gender is included as a variable. The formal definition is that values are missing at random given all information available in the dataset.

Background II
MCAR usually does not apply to data in the social sciences. MAR seems quite plausible for many datasets, but the definition has the disadvantage that it can never be tested on any given dataset: it is always possible that some unobserved variables cause the pattern of missing values, at least partially. MNAR means that such an unknown process in the data creates the missing values. For socially undesirable behaviour such as lying, stealing or betraying, for example, it is plausible to assume that missing values reflect higher rather than lower levels of such behaviour, but an exact modelling of the answering process is mostly not possible. One of the most prominent MNAR questions is the one about income, which has high rates of missing values, usually in the range of 20 % to 50 %.
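The three mechanisms can be made concrete with a small simulation. The following Python sketch is not from the talk: the function name `make_missing`, the variables `x` (may go missing) and `z` (always observed, standing in for something like gender), and the chosen probabilities are all illustrative assumptions.

```python
import random


def make_missing(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows in which x is set to None under one mechanism.

    rows: list of dicts with keys 'x' (may go missing) and 'z' (always observed).
    The probabilities below are illustrative assumptions, not from the talk.
    """
    rng = random.Random(seed)
    out = []
    for row in rows:
        if mechanism == "MCAR":
            p = rate                                   # same chance for every case
        elif mechanism == "MAR":
            p = min(1.0, 2 * rate) if row["z"] == 1 else 0.0   # depends only on observed z
        elif mechanism == "MNAR":
            p = min(1.0, 2 * rate) if row["x"] > 0 else 0.0    # depends on x itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        new_row = dict(row)
        if rng.random() < p:
            new_row["x"] = None
        out.append(new_row)
    return out


def share_missing(data):
    """Fraction of cases in which x is missing."""
    return sum(r["x"] is None for r in data) / len(data)


data_rng = random.Random(7)
rows = [{"x": data_rng.gauss(0, 1), "z": data_rng.randint(0, 1)} for _ in range(5000)]
mcar = make_missing(rows, "MCAR")
mar = make_missing(rows, "MAR")
mnar = make_missing(rows, "MNAR")

for name, data in (("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)):
    print(name, "share of x missing:", round(share_missing(data), 3))
```

In the MAR and MNAR variants the overall share of missing values is similar, but only in the MAR case can the pattern be explained by a variable that is still present in the dataset.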

A very simple example:
reg Y X, both standardized continuous variables
Y = 1*X + 1*error, n = 50
i = 3 %, 8 %, 13 %, ..., 68 % of the X values are set to missing; for each i, 200 replications were made.
[Figure: scatterplot of y against x]

The old solution: take only the cases without missing values.
Works acceptably but wastes information, particularly in multivariate analyses.
[Figure: estimate for beta ± sd, and standard deviation for beta, against percent missing in x]
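The simulation design of the simple example, combined with complete case analysis, can be sketched in a few lines of Python. This is a minimal sketch and not the authors' Stata code: it assumes standard normal X and error, assumes the values are deleted completely at random, and the names `ols_slope` and `simulate_complete_case` are made up for illustration.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Slope of the least-squares regression of y on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def simulate_complete_case(n=50, frac_missing=0.28, reps=200, seed=1):
    """Mean and sd of the slope over reps replications, using complete cases only."""
    rng = random.Random(seed)
    n_missing = round(n * frac_missing)
    slopes = []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]          # assumption: standard normal X
        y = [xi + rng.gauss(0, 1) for xi in x]           # Y = 1*X + 1*error
        missing = set(rng.sample(range(n), n_missing))   # assumption: deleted at random
        keep = [i for i in range(n) if i not in missing]
        slopes.append(ols_slope([x[i] for i in keep], [y[i] for i in keep]))
    return statistics.fmean(slopes), statistics.stdev(slopes)


for frac in (0.03, 0.28, 0.68):
    mean_b, sd_b = simulate_complete_case(frac_missing=frac)
    print(f"{frac:.0%} missing: mean beta = {mean_b:.3f}, sd = {sd_b:.3f}")
```

Under these assumptions the mean estimate should stay close to the true value of 1, while the spread across replications grows as fewer complete cases remain.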

The 2nd solution: mean substitution
Quite stable estimate, with a stronger increase in sd than in complete case analysis.
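A small worked example helps to see what mean substitution does in the simplest possible case. The sketch below is an illustration, not the simulation from the talk: it uses a single predictor, one large sample, values deleted completely at random, and a hypothetical helper `fit_with_se`. In this one-predictor setting the slope after mean substitution is algebraically the same as the complete case slope, because the filled-in cases sit exactly at the mean of x and carry no leverage, while the reported standard error changes.

```python
import math
import random
import statistics


def fit_with_se(xs, ys):
    """OLS of y on x; returns (slope, reported standard error of the slope)."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return slope, math.sqrt(rss / (n - 2) / sxx)


rng = random.Random(3)
n = 5000                                             # assumption: one large sample
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]
missing = set(rng.sample(range(n), int(0.4 * n)))    # assumption: 40 % deleted at random

obs = [i for i in range(n) if i not in missing]
x_obs_mean = statistics.fmean([x[i] for i in obs])
x_filled = [x_obs_mean if i in missing else x[i] for i in range(n)]

slope_cc, se_cc = fit_with_se([x[i] for i in obs], [y[i] for i in obs])
slope_ms, se_ms = fit_with_se(x_filled, y)

print(f"complete cases:    slope = {slope_cc:.4f}, se = {se_cc:.4f}")
print(f"mean substitution: slope = {slope_ms:.4f}, se = {se_ms:.4f}")
```

The identity of the two slopes holds only for this single-predictor case; with several predictors and missing values in more than one variable the estimates generally differ.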

The 3rd solution: substitution by regression
Overestimation of the effect when the response is included in the imputation model.
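The overestimation can be reproduced with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' code: X and the error are standard normal, half of the X values are deleted completely at random, and the missing X values are predicted deterministically from the response Y. The helper name `fit_line` is hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares regression of y on x; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


rng = random.Random(11)
n = 20000                                        # assumption: large sample to show the bias clearly
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]           # true beta = 1
missing = set(rng.sample(range(n), n // 2))      # assumption: 50 % of X deleted at random
obs = [i for i in range(n) if i not in missing]

# Imputation model: regress X on the response Y among the observed cases.
a, b = fit_line([y[i] for i in obs], [x[i] for i in obs])
x_filled = [a + b * y[i] if i in missing else x[i] for i in range(n)]

_, slope_cc = fit_line([x[i] for i in obs], [y[i] for i in obs])
_, slope_ri = fit_line(x_filled, y)

print(f"complete cases:        beta = {slope_cc:.3f}")
print(f"regression imputation: beta = {slope_ri:.3f}")
```

Because the imputed points lie exactly on the regression line of X on Y, they add association without adding noise, which pulls the estimate above the true value of 1 (towards roughly 4/3 in this particular setup).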

Hotdeck imputation
[Screenshot not reproduced; annotation on the slide: "Typo: 1 of course"]

Number 4: Multiple imputation - hotdeck
Considerably more variance due to imputation; breakdown at about 50 % missing values (m = 5, 4 variables).

Multiple Imputation by Chained Equations: ICE

Multiple Imputation
- A random subset of the data is drawn.
- A value for each missing entry of variable X1 is estimated via (linear, logistic, ordered, etc.) regression.
- The closest observed values to that estimate are chosen and replace the missing entries.
- The program switches to X2, and so on.
- The cycle over all variables is repeated ten times.
- The procedure finishes when m datasets have been created.
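The steps above can be sketched for two continuous variables as follows. This is a simplified illustration, not the implementation of ICE: it assumes a bootstrap sample as the "random subset", a simple linear regression as the imputation model, and a single nearest donor in the matching step. The names `fit_line`, `pmm_step` and `chained_imputation` are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares regression of y on x; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def pmm_step(target, predictor, missing_idx, rng):
    """Re-impute target at missing_idx from predictor by matching on predictions."""
    observed = [i for i in range(len(target)) if i not in missing_idx]
    boot = [rng.choice(observed) for _ in observed]      # random subset of observed cases
    a, b = fit_line([predictor[i] for i in boot], [target[i] for i in boot])
    for i in missing_idx:
        pred = a + b * predictor[i]
        # Donor: the observed case whose own prediction is closest to pred.
        donor = min(observed, key=lambda j: abs(a + b * predictor[j] - pred))
        target[i] = target[donor]


def chained_imputation(x1, x2, m=5, cycles=10, seed=0):
    """Return m completed copies of (x1, x2); None marks a missing entry."""
    rng = random.Random(seed)
    miss1 = {i for i, v in enumerate(x1) if v is None}
    miss2 = {i for i, v in enumerate(x2) if v is None}
    obs1 = [v for v in x1 if v is not None]
    obs2 = [v for v in x2 if v is not None]
    datasets = []
    for _ in range(m):
        # Starting values: random draws from the observed values of each variable.
        c1 = [rng.choice(obs1) if v is None else v for v in x1]
        c2 = [rng.choice(obs2) if v is None else v for v in x2]
        for _cycle in range(cycles):
            pmm_step(c1, c2, miss1, rng)                 # impute X1 from X2
            pmm_step(c2, c1, miss2, rng)                 # switch to X2
        datasets.append((c1, c2))
    return datasets


data_rng = random.Random(42)
x1_full = [data_rng.gauss(0, 1) for _ in range(200)]
x2_full = [v + data_rng.gauss(0, 1) for v in x1_full]
x1 = [None if data_rng.random() < 0.2 else v for v in x1_full]
x2 = [None if data_rng.random() < 0.2 else v for v in x2_full]

datasets = chained_imputation(x1, x2)
print(len(datasets), "completed datasets created")
```

Because every imputed value is borrowed from an observed case, the completed data never contain values outside the observed range, which is one reason this matching step is popular for skewed or bounded variables.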

Multiple Imputation: Analysis
- In each dataset a (regression) analysis is performed.
- Results are combined according to Rubin's rules: (a) parameters; (b) variances (within, between, total).
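The combination step can be written out explicitly. The sketch below applies the standard form of Rubin's rules to one parameter: the pooled estimate is the mean of the m estimates, the within variance is the mean of the m squared standard errors, the between variance is the sample variance of the m estimates, and the total variance is within + (1 + 1/m) * between. The function name `pool_rubin` and the example numbers are made up for illustration.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m estimates of one parameter and their squared standard errors.

    Returns (pooled estimate, within, between, total variance, pooled se).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)            # (a) pooled parameter
    within = statistics.fmean(variances)           # (b) average within-imputation variance
    between = statistics.variance(estimates)       # between-imputation variance, divisor m - 1
    total = within + (1 + 1 / m) * between         # total variance
    return q_bar, within, between, total, math.sqrt(total)


# Hypothetical results from m = 3 imputed datasets.
betas = [1.0, 2.0, 3.0]
squared_ses = [0.5, 0.5, 0.5]
q_bar, within, between, total, pooled_se = pool_rubin(betas, squared_ses)
print(f"beta = {q_bar:.3f}, within = {within:.3f}, between = {between:.3f}, total = {total:.3f}")
```

The factor (1 + 1/m) accounts for the fact that only a finite number of imputed datasets is used.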

No 5 finally: Multiple Imputation by Chained Equations - ICE
Stable estimates with small variances (m = 5, 4 variables).

Preliminary summary on the very simple example
- Analysis of complete cases: not bad when there are only a few variables
- Imputation of means: not bad for continuous variables
  - don't impute the mode; take the mean for categorical variables, too
  - no inflation of the ß's when no values are replaced in the response
- Regression imputation: don't include the response in the model
- Multiple imputation, hotdeck: Stata's version is not recommendable
- Multiple imputation by chained equations: very good
Let's have a look at a not so simple example.

One binary response (suicide attempts) is predicted by 20 continuous variables plus 5 discrete variables.
Response: lifetime suicide attempt, 0 = no (83 %), 1 = yes (17 %); N = 505.
[Table of predictors with columns Var, ß, sd not reproduced; X1: maternal love]

[Figure: ICE estimate for beta against percent missing in x; 4 variables in the model, MCAR]

[Figure: ICE estimate for beta against percent missing in x; 4 variables in the model, MCAR]

[Figure: ICE estimate for beta against percent missing in x; 4 variables in the model, MCAR]

[Figure: ICE estimate for beta against percent missing in x; 11 variables in the model, MCAR]

[Figure: ICE estimate for beta against percent missing in x; 25 variables in the model, MCAR]

[Figure: the same done with MICE in R; estimate for beta against percent missing in x; 11 variables in the model, MCAR]

[Figure: single regression substitution, estimate for beta against percent missing in x; 10 variables in the model (response excluded), MCAR]

[Figure: mean substitution, estimate for beta against percent missing in x; MCAR]

[Figure: ICE estimate for beta against percent missing in x; 11 variables in the model, MNAR]

[Figure: single regression imputation, estimate against percent missing in x; 10 variables in the model, MNAR]

All non-linear effects are biased downward by every method. The example shows an interaction coefficient estimated with ICE, 11 variables in the model, MCAR.
[Figure: interaction coefficient against percent missing in x]

Summary
- In large samples we can substitute considerably higher proportions of missing values than in small ones.
- Multiple imputation with ICE performs well in all situations (as far as we examined).
- Having more variables in the imputation model leads to better estimates, i.e. smaller sd's.
- With binary responses, ICE may report extreme sd's when the number of variables grows high or the number of cases is low. Then we have gone too far.
- Single regression imputation performs quite well under certain conditions.
- Non-linear effects get lost with all methods.