1 Multiple Imputation : Handling Interactions Michael Spratt.



2 Introduction. Missing data is a considerable problem. Complete case analysis will generally lead to systematic bias. Some assumption about the missingness has to be made: the most commonly used is Missing at Random (MAR), while Missing Not at Random (MNAR) analyses use different assumptions. This talk discusses analysis when the MAR assumption is made.

3 Introduction : MAR. Under MAR, the probability of a value being missing does not depend on the missing data itself, given the observed data and the model parameters. Unlike in an MNAR analysis, we do not have to model the missingness mechanism explicitly.

4 Introduction : MAR and multiple imputation. The most common approach is to impute the missing data and save multiple imputed datasets; each imputed dataset differs (slightly) because the imputation is stochastic. The substantive analysis is then carried out on each of the imputed datasets, and the individual results are combined (using Rubin’s Rules) to obtain combined imputation estimates and standard errors.
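
The combining step can be illustrated with a minimal sketch of Rubin's Rules for a single scalar parameter. The function name `pool_rubin` and its arguments are hypothetical; the inputs are assumed to be one point estimate and one standard error per imputed dataset.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool per-dataset estimates of one parameter using Rubin's Rules.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two estimates, each with a standard error")
    pooled = mean(estimates)                       # average of the m estimates
    within = mean([se ** 2 for se in std_errors])  # mean within-imputation variance
    between = variance(estimates)                  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between         # total variance
    return pooled, math.sqrt(total)
```

For example, `pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])` pools three hypothetical analyses; the pooled standard error is larger than any single one because it also carries the between-imputation spread.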

5 MICE/ICE for imputation assuming MAR. MICE (Multiple Imputation using Chained Equations) has been widely used for imputation. It is sometimes called FCS (fully conditional specification) and handles general missingness patterns (it does not have to assume monotone missingness). It is implemented in the MICE package in R (van Buuren et al.), the ICE command in Stata (Royston) and IVEWARE (Raghunathan et al.); task-specific versions could potentially be written in other programs, e.g. WinBUGS. Ref: Multiple imputation of missing blood pressure covariates in survival analysis. Van Buuren, Boshuizen, Knook. Statistics in Medicine 1999; 18(6): 681–94.

6 MICE/ICE for imputation assuming MAR. $X_1, X_2, X_3, \ldots, X_n$ are partially observed; $Z_{obs}$ represents the set of fully observed variables. The chained equations are: $X_1 \sim f(X_2, X_3, X_4, \ldots, X_n, Z_{obs})$; $X_2 \sim f(X_1, X_3, X_4, \ldots, X_n, Z_{obs})$; $X_3 \sim f(X_1, X_2, X_4, \ldots, X_n, Z_{obs})$; etc. This is comparable to a Gibbs sampler, but with much shorter chains, which on termination produce an imputed dataset.
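
The cycling through conditional models can be sketched for the smallest case: two partially observed continuous variables and no fully observed Z. This is a simplified illustration, not the algorithm used by ICE or the R MICE package. The function names are hypothetical, each conditional model is an assumed simple linear regression, and imputed values are drawn with residual noise only (a full implementation would also draw the regression parameters).

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5


def redraw(target, predictor, missing, rng):
    """Fit target ~ predictor on rows where target was observed, then redraw the rest."""
    obs = [(p, t) for p, t, m in zip(predictor, target, missing) if not m]
    a, b, sd = fit_line([p for p, _ in obs], [t for _, t in obs])
    return [rng.gauss(a + b * p, sd) if m else t
            for p, t, m in zip(predictor, target, missing)]


def chained_imputation(x1, x2, cycles=10, seed=0):
    """Return one imputed copy of (x1, x2); None marks a missing value."""
    rng = random.Random(seed)
    miss1 = [v is None for v in x1]
    miss2 = [v is None for v in x2]
    # Start by filling each gap with the observed mean of its variable.
    fill1 = mean(v for v in x1 if v is not None)
    fill2 = mean(v for v in x2 if v is not None)
    cur1 = [fill1 if m else v for v, m in zip(x1, miss1)]
    cur2 = [fill2 if m else v for v, m in zip(x2, miss2)]
    for _ in range(cycles):
        cur1 = redraw(cur1, cur2, miss1, rng)  # X1 ~ f(X2)
        cur2 = redraw(cur2, cur1, miss2, rng)  # X2 ~ f(X1)
    return cur1, cur2
```

Calling `chained_imputation` with several different `seed` values would give the multiple imputed datasets; observed values are never altered, only the missing entries are redrawn.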

7 Interactions in the Analysis Model. A useful practical guide to using imputation to perform analysis in the presence of missing data is Multiple imputation: current perspectives (Kenward and Carpenter, Statistical Methods in Medical Research 16: 199–218). Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls (Sterne, White, Carpenter et al., BMJ 2009;338:b2393) also contains useful guidance. The imputation model should be at least as rich as the substantive model, and it should preserve the structure of the data.

8 MICE/ICE for imputation. For most datasets where the distributional assumptions are met, MICE/ICE has been shown in practice to work well for MAR data. More care is needed when models contain structures such as interactions, multi-level structure, non-linearity etc. In particular, the structure of the substantive model should be reflected in the imputation model. This talk focuses on interactions.

9 Why omitting interactions in the imputation may cause problems. Take as an example 3 binary variables X, Y, Z. We are interested in a substantive analysis, in the presence of missing data, of the logistic regression of Y on X and Z with an interaction: $\operatorname{logit} P(Y=1 \mid X, Z) = \beta_0 + \beta_x x + \beta_z z + \beta_{xz} xz$. We initially have a full [X,Y,Z] dataset, which then becomes subject to missingness (MAR mechanisms). We would like the parameter estimates obtained after MAR missingness followed by imputation to be the same as the full-data estimates.

10 Why omitting interactions in the imputation may cause problems. The coefficients of the logistic model are the same as the corresponding coefficients of the log-linear model. Logistic: $\operatorname{logit} P(Y=1 \mid X, Z) = \beta_0 + \beta_x x + \beta_z z + \beta_{xz} xz$. Log-linear: $\log \mu_{xyz} = \lambda_0 + \lambda_x x + \lambda_y y + \lambda_z z + \lambda_{xy} xy + \lambda_{xz} xz + \lambda_{yz} yz + \lambda_{xyz} xyz$. Examining the bias of $\beta_x$ is equivalent to examining the bias of $\lambda_{xy}$; the same holds for $\beta_z$ and $\lambda_{yz}$, and for $\beta_{xz}$ and $\lambda_{xyz}$.
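
The correspondence between the two sets of coefficients can be checked with a short derivation. The notation is an assumption: $\beta$ denotes the logistic coefficients, $\lambda$ the log-linear coefficients, and $\mu_{xyz}$ the expected count in cell $(x, y, z)$ with 0/1 coding. Because the odds of $Y=1$ given $(x, z)$ equal $\mu_{x1z} / \mu_{x0z}$, every log-linear term that does not involve $y$ cancels:

```latex
\begin{aligned}
\operatorname{logit} P(Y=1 \mid x, z)
  &= \log \mu_{x1z} - \log \mu_{x0z} \\
  &= \lambda_y + \lambda_{xy}\,x + \lambda_{yz}\,z + \lambda_{xyz}\,xz ,
\end{aligned}
\qquad\text{so}\qquad
\beta_0 = \lambda_y,\quad
\beta_x = \lambda_{xy},\quad
\beta_z = \lambda_{yz},\quad
\beta_{xz} = \lambda_{xyz}.
```

Under these assumptions, an imputation procedure that shrinks $\lambda_{xyz}$ towards zero shrinks the interaction $\beta_{xz}$ of the substantive model by the same amount.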

11 Why omitting interactions in the imputation may cause problems. Omitting interaction terms in the full conditional models will lead to interactions in the log-linear model being underestimated and hence P(X, Y, Z) being incorrectly estimated. This can also be seen by looking at the number of parameter estimates needed. If just X is subject to missingness, we need to be able to estimate P(X | Y, Z) P(Y, Z); 4 parameter estimates are needed for P(X | Y, Z). This cannot be done with a chained equation without interaction, $X = \alpha + \beta_y y + \beta_z z$, as there are only 3 free parameters. If X and Y are subject to missingness, we need to be able to estimate P(X, Y | Z) P(Z); 8 parameter estimates are in general needed for P(X, Y | Z). This cannot be done with chained equations without interaction, $x = \alpha_x + \beta_{xy} y + \beta_{xz} z$ and $y = \alpha_y + \beta_{yx} x + \beta_{yz} z$, as there are only 6 free parameters.
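
The first parameter count (X missing only) can be illustrated numerically. The sketch below assumes the main-effects imputation model is a logistic regression of X on Y and Z; the function names and the probability table `saturated` are hypothetical. Any table generated by the 3-parameter model has a zero interaction contrast on the logit scale, so it cannot reproduce a general 4-parameter table for P(X=1 | Y, Z).

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(v):
    return 1 / (1 + math.exp(-v))


def interaction_contrast(p):
    """p[(y, z)] = P(X=1 | Y=y, Z=z); returns the interaction on the logit scale."""
    return logit(p[1, 1]) - logit(p[1, 0]) - logit(p[0, 1]) + logit(p[0, 0])


def main_effects_table(alpha, beta_y, beta_z):
    """The four conditional probabilities implied by a model with no interaction."""
    return {(y, z): expit(alpha + beta_y * y + beta_z * z)
            for y in (0, 1) for z in (0, 1)}


# Hypothetical saturated table: four freely chosen probabilities.
saturated = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}
```

Here `interaction_contrast(main_effects_table(-1.0, 0.5, 0.8))` is zero up to rounding for any choice of the three coefficients, whereas `interaction_contrast(saturated)` is not, which is the missing fourth parameter.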

12 Passive Imputation. Interactions are needed in the imputation model. Both the Stata program ICE and the R MICE package support passive imputation: the interaction term is recalculated from the main effects after every MICE cycle and can then be used in the subsequent chained equations in the cycle for the imputation of the other variable(s). Other possible approaches: Von Hippel, “How to impute interactions, squares and other transformed variables”, Sociological Methodology 39, describes a less established alternative to passive imputation. It is also worth noting that, where a categorical variable is fully observed, an alternative method of imputation is to split the data by the values of that variable and impute each subset separately.
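
The control flow of passive imputation can be sketched as follows. This is not the ICE or R MICE interface: `run_cycles`, `product_of` and the dictionary layout are hypothetical, and refreshing the derived term after every single variable update is one reading of the description above. The point is only that the interaction column is never imputed directly; it is recomputed from the current main effects so that later conditional models see a consistent value.

```python
def product_of(a, b):
    """Build a formula that recomputes an interaction column as data[a] * data[b]."""
    return lambda data: [u * v for u, v in zip(data[a], data[b])]


def run_cycles(data, imputers, derived, cycles=5):
    """Run chained-equation cycles with passive updates.

    data:     dict mapping column name -> list of current values
    imputers: dict mapping column name -> callable(data) returning the re-imputed column
    derived:  dict mapping column name -> callable(data) returning the recomputed column
    """
    for _ in range(cycles):
        for name, impute in imputers.items():
            data[name] = impute(data)        # active step: re-impute one variable
            for d_name, formula in derived.items():
                data[d_name] = formula(data)  # passive step: refresh derived terms
    return data
```

A call such as `run_cycles(data, imputers, {"xz": product_of("x", "z")})` keeps the column `xz` equal to the product of the current `x` and `z` throughout, provided the supplied imputers leave observed values unchanged.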

13 Simulation Structure. 1. We created a [X,Z] dataset. 2. We created Y stochastically given X and Z. 3. We stochastically created missingness (MAR) in 1, 2 or 3 variables. 4. Using a number of imputation models, we did the imputation and performed the substantive analysis. Steps 2-4 were repeated 100 times and parameter estimates and standard errors were recorded. We tabulated the median of the parameter estimates, the median of the confidence intervals, and the coverage of the original data-generation parameter by the confidence intervals.
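
Steps 1 to 3 can be sketched for the all-binary case. The function name, the marginal probabilities of 0.5 for X and Z, and the outcome coefficients passed in `beta` are assumptions for illustration; the missingness model for Z, logit(Z is missing) = -2 + X + Y, is the one quoted in the binary scenarios of this talk. Because that probability depends only on X and Y, which stay observed here, the mechanism is MAR.

```python
import math
import random


def expit(v):
    return 1 / (1 + math.exp(-v))


def simulate_once(n, beta, seed=0):
    """Generate one dataset with MAR missingness in Z (None marks a missing value).

    beta = (b0, bx, bz, bxz): hypothetical coefficients of the logistic model for Y.
    """
    rng = random.Random(seed)
    b0, bx, bz, bxz = beta
    rows = []
    for _ in range(n):
        # Step 1: create X and Z (independent Bernoulli(0.5) is an assumption).
        x = int(rng.random() < 0.5)
        z = int(rng.random() < 0.5)
        # Step 2: create Y stochastically given X and Z, with an interaction.
        y = int(rng.random() < expit(b0 + bx * x + bz * z + bxz * x * z))
        # Step 3: MAR missingness in Z, driven only by the observed X and Y.
        z_missing = rng.random() < expit(-2 + x + y)
        rows.append({"x": x, "y": y, "z": None if z_missing else z})
    return rows
```

Step 4 would then impute each generated dataset under the competing imputation models, fit the substantive logistic regression, and repeat over replications to summarise bias and coverage.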

14 Simulations. We examined the effect of interactions on the analysis of imputed data in a series of simulation scenarios involving 3 variables: regression with outcome Y and covariates X and Z. The simulation scenarios ranged from all 3 variables being binary, through 2 binary variables and one normal variable, and one binary variable and 2 normal variables, to 3 normal variables. In each case, varying combinations of outcome and covariates were complete or incomplete. We present a subset of the simulation scenarios.

15 All variables binary; X and Z incomplete Dataset : 20,000 observations, Y generated stochastically logit(Y) = 0.5 × X × Z × X × Z Data divided into 2 sections with Bernoulli distribution (p = 0.5) [splitting allows missingness to be MAR] Two stratified MAR patterns : –logit(Z is missing) = -2 + X + Y (In one section of data) –logit(X is missing) = × Z × Y (Other section of data) Imputation then substantive analysis performed In a second simulation scenario there were 3 stratified MAR patterns : –P(Z missing | X, Y) (In section 1 of data) –P(X missing | Y, Z) (In section 2 of data) –P(X and Z jointly missing | Y) (In section 3 of data)

16 All variables binary; X and Z incomplete

17 All variables binary; X, Z and Y incomplete Data generated stochastically logit(Y) = 0.5 × X × Z × X × Z Data divided randomly into 3 sections with equal probability 3 stochastic stratified MAR patterns : –logit(Z is missing) = -2 + X + Y (In section 1 of data) –logit(X is missing) = × Z × Y (In section 2 of data) –logit(Y is missing) = × Z × X (In section 3 of data) In a second simulation scenario there were 6 stratified MAR patterns : –P(Z missing | X, Y) (In section 1 of data) –P(X missing | Y, Z) (In section 2 of data) –P(Y missing | X, Z) (In section 3 of data) –P(X and Y jointly missing | Z) (In section 4 of data) –P(X and Z jointly missing | Y) (In section 5 of data) –P(Y and Z jointly missing | X) (In section 6 of data)

18 All variables binary; X, Z and Y incomplete

19 Y continuous, X and Z binary; X, Z and Y incomplete Data generated stochastically, this time Y is continuous Y ~ 0.45 × X × Z × X × Z + N(0, 1) Data divided randomly into 3 sections with equal probability 3 stochastic stratified MAR patterns : logit(Z is missing) = × X + Y (In section 1 of data) logit(X is missing) = × Z × Y (In section 2 of data) logit(Y is missing) = × Z × X (In section 3 of data)

20 Y continuous, X and Z binary; X, Z and Y incomplete

21 Further simulations. In further simulations similar results were obtained where the distributional assumptions of the imputation models were adhered to. In each case, omitting an interaction in a chained equation produced biased results; all 2-way interactions had to be included. Starting with a trivariate normal distribution and introducing a slight interaction (which results in slight non-normality) also gave imputed estimates closest to the full-data estimates when the full interactions were introduced into the imputation model.

22 Conclusions. In general, the imputation model should reflect the structure of the substantive analysis and should be at least as rich as the analysis model. To reflect the structure of the substantive model, the imputation model should not exclude its interactions, and should also include any corresponding interactions involving the outcome variable.

23 Acknowledgements. This work was done in collaboration with Jonathan Sterne, Kate Tilling and James Carpenter. Helpful comments and suggestions from Paul Clarke are gratefully acknowledged.