Multiple Imputation Using Stata


Multiple Imputation Using Stata Chuck Huber, PhD StataCorp chuber@stata.com University of Michigan January 30, 2018

Outline
- Example Dataset
- Missing Data Mechanisms
- What is multiple imputation?
- Multiple imputation in Stata
- Why use multiple imputation?

Example Dataset

Example Dataset
The objective is to examine the relationship between smoking and heart attacks, adjusting for age, body mass index, educational status, and gender. We want to perform a logistic regression of heart attack (attack) with the other variables as regressors.

Complete Case Analysis

Mean Substitution?

Mean Substitution?
Complete Case Analysis (N=132) vs. Mean Substitution (the comparison table is not preserved in this transcript)
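The question mark in the slide title is deserved. Filling every gap with the observed mean leaves the mean unchanged but shrinks the spread of the variable, so standard errors come out too small. The sketch below illustrates this on made-up BMI values. The data and the helper name `mean_substitute` are assumptions for illustration, not part of the presentation.

```python
import statistics

def mean_substitute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical BMI values; None marks a missing measurement.
bmi = [21.5, None, 27.0, 31.2, None, 24.8, 29.1, None]

observed = [v for v in bmi if v is not None]
filled = mean_substitute(bmi)

# The filled-in points sit exactly at the mean, so they add nothing to the
# sum of squared deviations while inflating the sample size.
sd_observed = statistics.stdev(observed)
sd_filled = statistics.stdev(filled)
print(f"SD of observed values:      {sd_observed:.2f}")
print(f"SD after mean substitution: {sd_filled:.2f}")
```

The second standard deviation is always smaller than the first whenever at least one value is missing, which is the variance understatement that multiple imputation is designed to avoid.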

Missing Data Mechanisms
- Missing Completely At Random (MCAR)
- Missing At Random (MAR)
- Missing Not At Random (MNAR)

Missing Completely At Random (MCAR)
Definition: Missing data are MCAR if the reason for missing data is unrelated to the observed or unobserved (missing) data. That is, missing values are a simple random sample of all data values.
Examples: Subjects withdraw from a study for reasons unrelated to the study. Data are missing because of equipment failures or data-recording errors.

Missing Completely At Random (MCAR)
- Missing values are not related to the variable with missing data
- Missing values are not related to other observed variables

Missing At Random (MAR)
Definition: Missing data are MAR if the reason for missing data is unrelated to the unobserved (missing) data but may depend on the observed data. That is, missing values are not a simple random sample of all data values.
Examples: In a study of blood pressure, subjects withdraw from the study because of severe side effects caused by a high dosage of a treatment. In a study of income, respondents with low education might be less inclined to report their income.

Missing At Random (MAR)
- Missing values are not related to the variable with missing data, once the observed variables are taken into account
- Missing values are related to other observed variables

Missing Not At Random (MNAR)
Definition: Missing data are MNAR if the reason for missing data is related to the unobserved (missing) data.
Examples: In a study of income, respondents with low or high income might be less inclined to report their income. In a study of depression, respondents who are depressed might be less likely to report that they are depressed.

Missing Not At Random (MNAR)
- Missing values are related to the variable with missing data
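To make the three mechanisms concrete, here is a small simulation of the income example. It is a sketch under invented assumptions: the income model, the missingness probabilities, and the seed are all hypothetical, and the MNAR rule drops only low incomes for simplicity. It compares the complete-case mean of income with the true mean under each mechanism.

```python
import random
import statistics

rng = random.Random(2018)  # arbitrary seed, fixed for reproducibility

n = 20000
# Hypothetical population: income (in thousands) rises with education.
education = [rng.gauss(13, 3) for _ in range(n)]
income = [20 + 3 * e + rng.gauss(0, 10) for e in education]

def complete_case_mean(prob_missing):
    """Mean income among respondents whose income was reported."""
    kept = [y for e, y in zip(education, income)
            if rng.random() >= prob_missing(e, y)]
    return statistics.mean(kept)

true_mean = statistics.mean(income)

# MCAR: everyone has the same chance of a missing income.
mcar = complete_case_mean(lambda e, y: 0.3)
# MAR: missingness depends only on education, which is always observed.
mar = complete_case_mean(lambda e, y: 0.6 if e < 12 else 0.1)
# MNAR: missingness depends on the unreported income itself.
mnar = complete_case_mean(lambda e, y: 0.6 if y < 50 else 0.1)

print(f"true {true_mean:.1f}  MCAR {mcar:.1f}  MAR {mar:.1f}  MNAR {mnar:.1f}")
```

Under MCAR the complete-case mean stays close to the truth. Under MAR and MNAR it drifts upward because low earners are underrepresented. The difference is that the MAR bias can be corrected with the observed education variable, while the MNAR bias cannot be corrected from the observed data alone.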

Checking MCAR vs MAR
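The content of this slide is not preserved in the transcript. One common informal check, offered here as an assumption rather than as the presenter's method, is to compare an always-observed variable between records with and without the missing value. The sketch uses simulated ages and a hypothetical missing-bmi indicator; every name and number in it is made up.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary seed

n = 20000
age = [rng.gauss(55, 12) for _ in range(n)]  # hypothetical, fully observed

def age_gap(prob_missing):
    """Mean age where bmi is missing minus mean age where bmi is observed."""
    missing, present = [], []
    for a in age:
        (missing if rng.random() < prob_missing(a) else present).append(a)
    return statistics.mean(missing) - statistics.mean(present)

# MCAR: bmi is missing with the same probability for everyone.
mcar_gap = age_gap(lambda a: 0.2)
# MAR: older subjects are much more likely to have bmi missing.
mar_gap = age_gap(lambda a: 0.5 if a > 60 else 0.05)

print(f"age gap under MCAR: {mcar_gap:.2f} years")
print(f"age gap under MAR:  {mar_gap:.2f} years")
```

A clear gap is evidence against MCAR. The absence of a gap does not prove MCAR, and no check of this kind can separate MAR from MNAR, because that distinction depends on values that were never observed.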

What is multiple imputation?
Multiple imputation (MI) is a flexible, simulation-based statistical technique for handling missing data. Multiple imputation consists of three steps:
1. Imputation step. M imputations (completed datasets) are generated under some chosen imputation model.
2. Completed-data analysis (estimation) step. The desired analysis is performed separately on each imputation (m = 1, …, M). This is called completed-data analysis and is the primary analysis to be performed once missing data have been imputed.
3. Pooling step. The results obtained from M completed-data analyses are combined into a single multiple-imputation result.

Notation and some terminology
- Original data are the data containing missing values.
- With a slight abuse of terminology, by an imputation we mean a copy of the original data in which missing values are imputed.
- M is the number of imputations.
- m (= 0, …, M) refers to the original or imputed data: m = 0 means original data and m > 0 means imputed data. m = 1 means the first imputation, m = 2 means the second imputation, etc.

The Imputation Step Original Data (m=0) Copy of Data (m = 1)

The Imputation Step bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female)

The Imputation Step bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + rnormal()

The Imputation Step
Original Data
bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + rnormal()
bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + rnormal()

The Imputation Step
Original Data
bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + 1.7
bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) + 0.9
bmi_new = 26.5 + 1.7(attack) - .47(smokes) - .03(age) - .31(female) - 2.1
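The imputation step shown on these slides can be sketched as: copy the original data M times and, in each copy, fill every missing bmi with the regression prediction plus a fresh random draw. The coefficients below are the ones on the slides. The records, the function names, and the seed are hypothetical. This follows the slides' simplified picture with a standard normal draw. As far as the Stata documentation describes it, mi impute regress additionally draws the regression parameters themselves and scales the noise by the residual variance, which this sketch does not attempt.

```python
import copy
import random

def predict_bmi(row):
    """Deterministic part of the imputation model (coefficients from the slides)."""
    return (26.5 + 1.7 * row["attack"] - 0.47 * row["smokes"]
            - 0.03 * row["age"] - 0.31 * row["female"])

def impute(data, M, seed):
    """Return M completed copies of data; each missing bmi gets prediction + noise."""
    rng = random.Random(seed)
    completed = []
    for _ in range(M):
        copy_m = copy.deepcopy(data)  # the original data (m = 0) is left untouched
        for row in copy_m:
            if row["bmi"] is None:
                row["bmi"] = predict_bmi(row) + rng.gauss(0, 1)
        completed.append(copy_m)
    return completed

# Hypothetical records; None marks a missing bmi.
original = [
    {"attack": 1, "smokes": 1, "age": 50, "female": 0, "bmi": None},
    {"attack": 0, "smokes": 0, "age": 40, "female": 1, "bmi": 24.3},
]

imputations = impute(original, M=3, seed=12345)
for m, data_m in enumerate(imputations, start=1):
    print(m, round(data_m[0]["bmi"], 2), data_m[1]["bmi"])
```

The observed bmi is identical in every copy, while the imputed bmi differs from copy to copy. That variation across copies is what later carries the missing-data uncertainty into the pooled standard errors.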

The Estimation Step
Original Data
logistic attack smokes age bmi_new hsgrad female
logistic attack smokes age bmi_new hsgrad female
logistic attack smokes age bmi_new hsgrad female

The Pooling Step
The within-imputation variance (W_m) is calculated for each imputed dataset m during the estimation step. The between-imputation variance (B) is calculated during the pooling step. The total variance (T) is then:
$T = \frac{1}{M}\sum_{m=1}^{M} W_m + \left(1 + \frac{1}{M}\right) B$
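For a single coefficient, Rubin's combination rules reduce to a few lines of arithmetic: average the M estimates, average their squared standard errors to get the within-imputation part, take the sample variance of the estimates as B, and combine. The sketch below uses made-up estimates and variances, and the function name `pool` is hypothetical.

```python
import statistics

def pool(estimates, variances):
    """Combine M completed-data results for one coefficient.

    estimates: the coefficient estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns the pooled estimate and the total variance T.
    """
    M = len(estimates)
    q_bar = statistics.mean(estimates)   # pooled point estimate
    W = statistics.mean(variances)       # average within-imputation variance
    B = statistics.variance(estimates)   # between-imputation variance (divisor M - 1)
    T = W + (1 + 1 / M) * B
    return q_bar, T

# Hypothetical results from M = 3 imputed datasets.
q_bar, T = pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, round(T, 4))  # 2.0 1.8333
```

The pooled standard error is the square root of T. When the imputations disagree strongly, B is large and the standard error grows accordingly.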

Main features of Stata’s mi command
Stata’s mi suite of commands performs all three steps of multiple imputation:
- Create imputed datasets, each with the missing values filled in (mi impute)
- Fit your model on each imputed dataset (mi estimate)
- Collect all the model fits and apply Rubin’s combination rules to form “mi-adjusted” parameter estimates and standard errors (mi estimate)

Multiple Imputation Using Stata
- The mi Control Panel
- Examining and setting up mi data
- Univariate imputation
- Estimation
- Testing
- Prediction

The mi Control Panel

Examine Missing Data

The Imputation Step NOTE: We’re using only 5 imputations to keep things simple, but you should use at least 20.

The Imputation Step
Three new variables were created by mi set and mi impute:
- _mi_id: an identification number for records within an imputed dataset
- _mi_miss: an indicator for missing values of the imputed variable
- _mi_m: the number (m) of each imputed dataset (m=0 is the original data)

Data Management

The Estimation Step

The Estimation and Pooling Step

Testing Coefficients

Predictions

Why use multiple imputation?
The objective of MI is not to predict missing values as close as possible to the true ones but to handle missing data in a way that results in valid statistical inference (Rubin 1996).

Why use multiple imputation?
- It is more flexible than fully parametric methods, e.g. maximum likelihood or purely Bayesian analysis.
- It can be more efficient than listwise deletion (complete-case analysis) and can avoid potential bias.
- It accounts for missing-data uncertainty and thus, unlike single-imputation methods, does not underestimate the variance of estimates.

Statistical validity of MI
MI yields statistically valid inference if the imputation method used is proper in the sense of Rubin (1987, 118–119). Loosely speaking, the imputation mechanism that produces the imputations must maintain the existing characteristics of the data and incorporate adequate variability (uncertainty) induced by the unobserved data.

Summary
- MI is a stochastic method. Remember to set the random-number seed so that you can reproduce the same point estimates later.
- MI preserves all available data and thus can be more efficient than complete-case analysis. It can also avoid potential bias when complete cases differ from incomplete cases.
- Unlike fully parametric methods, MI can easily be applied to a wide range of analyses.
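The point about the seed is easy to demonstrate. The imputed values come from a random-number generator, so the same seed reproduces them exactly and a different seed does not. In Stata this is the role of set seed or the rseed() option of mi impute. The sketch below shows the same idea with hypothetical residual draws, and the function name `draw_residuals` is made up.

```python
import random

def draw_residuals(seed, k=3):
    """Draw k standard normal residuals from a generator with the given seed."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(k)]

first = draw_residuals(seed=2018)
again = draw_residuals(seed=2018)   # same seed, same draws
other = draw_residuals(seed=2019)   # different seed, different draws

print(first == again)
print(first == other)
```

Recording the seed alongside the analysis is what lets someone else regenerate the same imputations and therefore the same pooled estimates.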

Summary
- MI separates the stochastic imputation step from the analysis step: the imputer and the analyst can be different people!
- In Stata, use mi impute for imputation and mi estimate for analysis.
- Use the mi Control Panel to guide you through all the phases of MI.

For more information
Files
- 09_multiple_imputation.do
- heart.dta
Videos
- Multiple imputation in Stata®: Setup, imputation, estimation--regression imputation
- Multiple imputation in Stata®: Setup, imputation, estimation--predictive mean matching
- Multiple imputation in Stata®: Setup, imputation, estimation--logistic regression

Thanks for letting me hang out with you today! Questions? You can contact me anytime at chuber@stata.com