Missing Data.
What do we mean by missing data?
Missing observations which were intended to be collected but were:
–Never collected
–Lost accidentally
–Wrongly collected and so deleted
Can affect outcomes and/or explanatory variables
Effect of Missing Data
Can cause:
–Biased estimates, e.g. means and regression parameters
–Biased standard errors, resulting in incorrect P-values and confidence intervals (CI)
Missing data mechanism
1. Missing Completely At Random: MCAR
–Missingness does not depend on observed or unobserved values
–E.g. FBC missing because a tube of blood is accidentally broken
–E.g. BP missing due to a broken machine
Missing data mechanism
2. Missing At Random: MAR
–Missingness depends on observed data, but not on the unobserved data
–E.g. year olds are less likely to respond to a follow-up postal questionnaire because they are more likely to change address several times
Missing data mechanism
3. Missing Not At Random: MNAR
–Given all available observed information, the probability of being missing still depends on the unobserved data
–E.g. a patient misses an appointment because they feel ill. This illness (e.g. flu) is related to the measurement intended to be made (e.g. temperature)
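The three mechanisms can be illustrated with a small simulation. This is a minimal sketch on made-up data, not part of the original analysis: `x` stands for an always-observed variable, `y` for a variable that may go missing (true mean 0), and the missingness probabilities (0.4, 0.7, 0.1) are arbitrary assumptions chosen for illustration.

```python
import random
import statistics


def complete_case_means(n=20000, seed=1):
    """Simulate y under MCAR, MAR and MNAR and return the complete-case mean of y."""
    rng = random.Random(seed)
    kept = {"MCAR": [], "MAR": [], "MNAR": []}
    for _ in range(n):
        x = rng.gauss(0, 1)      # always observed
        y = x + rng.gauss(0, 1)  # may be missing; its true mean is 0

        # MCAR: the chance of being missing ignores both x and y.
        if rng.random() >= 0.4:
            kept["MCAR"].append(y)

        # MAR: the chance of being missing depends only on the observed x.
        p_missing_mar = 0.7 if x > 0 else 0.1
        if rng.random() >= p_missing_mar:
            kept["MAR"].append(y)

        # MNAR: the chance of being missing depends on the unobserved y itself.
        p_missing_mnar = 0.7 if y > 0 else 0.1
        if rng.random() >= p_missing_mnar:
            kept["MNAR"].append(y)

    return {name: statistics.fmean(values) for name, values in kept.items()}


means = complete_case_means()
for name, value in means.items():
    print(f"{name}: complete-case mean of y = {value:.3f}")
```

In this setup the complete-case mean stays close to 0 under MCAR and is pulled below 0 under both MAR and MNAR. Under MAR the bias can in principle be corrected using `x`; under MNAR it cannot be corrected from the observed data alone.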
The Assumptions
–Cannot tell from the data at hand whether the missing values are MAR or MNAR
–Can distinguish between MCAR and MAR
–MAR can be made more plausible by looking at associations between missingness and the observed values of explanatory variables
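One simple way to look for such associations is to compare the proportion of missing values across levels of a fully observed variable. The sketch below uses a hypothetical helper, `missing_rate_by_group`, and a handful of made-up records in which `x2` is always observed and `x1` is sometimes missing (`None`).

```python
def missing_rate_by_group(records, group_key, target_key):
    """Proportion of records with a missing target, per level of an observed variable."""
    counts = {}
    for rec in records:
        total, missing = counts.get(rec[group_key], (0, 0))
        is_missing = 1 if rec[target_key] is None else 0
        counts[rec[group_key]] = (total + 1, missing + is_missing)
    return {group: missing / total for group, (total, missing) in counts.items()}


# Made-up records for illustration only.
records = [
    {"x2": 0, "x1": 45.0}, {"x2": 0, "x1": None},
    {"x2": 0, "x1": 50.0}, {"x2": 0, "x1": 39.0},
    {"x2": 1, "x1": None}, {"x2": 1, "x1": None},
    {"x2": 1, "x1": 61.0}, {"x2": 1, "x1": None},
]

rates = missing_rate_by_group(records, "x2", "x1")
for group, rate in sorted(rates.items()):
    print(f"x2 = {group}: proportion of x1 missing = {rate:.2f}")
```

Clearly different rates between groups suggest that missingness depends on observed data, which argues against MCAR. Similar rates do not rule out MNAR, since that depends on values that were never seen.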
Simple methods to handle missing data
Complete Case (CC) analysis
Mean imputation
Regression imputation
Stochastic imputation
Problem: treating imputed values as if they had been observed makes results too certain
Multiple Imputation (MI)
Under the MAR assumption, gives less biased estimates and SEs than CC
Covers many different data structures
Never the absolute best thing to do
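The "too certain" problem is easy to see with mean imputation. The following is a small sketch on made-up numbers, assuming five observed values and five missing ones: filling the gaps with the observed mean leaves the mean unchanged but shrinks the standard deviation and the standard error.

```python
import math
import statistics

# Made-up example: five observed values, five values missing.
observed = [4.0, 6.0, 8.0, 10.0, 12.0]
n_missing = 5

# Mean imputation: replace every missing value with the observed mean.
mean_observed = statistics.fmean(observed)
imputed = observed + [mean_observed] * n_missing

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

# Standard error of the mean, treating imputed values as if they were real data.
se_observed = sd_observed / math.sqrt(len(observed))
se_imputed = sd_imputed / math.sqrt(len(imputed))

print(f"mean: observed {mean_observed:.2f}, after imputation {statistics.fmean(imputed):.2f}")
print(f"SD:   observed {sd_observed:.2f}, after imputation {sd_imputed:.2f}")
print(f"SE:   observed {se_observed:.2f}, after imputation {se_imputed:.2f}")
```

The smaller standard error after imputation does not reflect any new information; it comes only from counting the filled-in values as observations.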
Multiple Imputation (MI)
[Figure: data tables with columns ID, x1 and x2, with missing values shown as "?"]
Key Idea behind Imputation
Express our uncertainty about missing data by creating ‘m’ imputed data sets
Analyse each of these in the usual way
Combine estimates using particular rules (Rubin’s rules)
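The combining step can be sketched as follows. This is a minimal illustration of Rubin's rules for a single parameter; the function name `rubin_combine` and the input numbers are made up. The pooled estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import math
import statistics


def rubin_combine(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and one variance per estimate")
    pooled = statistics.fmean(estimates)       # pooled point estimate
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between     # total variance
    return pooled, math.sqrt(total)


# Made-up estimates and squared standard errors from m = 5 imputed data sets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]

pooled_estimate, pooled_se = rubin_combine(estimates, variances)
print(f"pooled estimate = {pooled_estimate:.3f}, pooled SE = {pooled_se:.3f}")
```

For ratio measures such as hazard ratios, pooling is normally done on the log scale (the regression coefficients) and the result is exponentiated afterwards.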
Two variables: X1 and X2
–X1 missing in some records
–X2 not missing, observed in every unit
Learn the relationship between X1 and X2
Complete the data set by drawing the missing observations from the distribution of X1 | X2
Example 1
Longitudinal breast cancer study
–Outcome: early death or disease recurrence
–Explanatory variables: age, meno, tam
Cox regression
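A rough sketch of this idea, assuming a normal linear regression of X1 on X2 and simulated data: the hypothetical function `impute_x1_given_x2` fits the regression on complete records and fills each missing X1 with a random draw around its predicted value. It is a simplification, because proper multiple imputation also draws the regression parameters themselves to reflect uncertainty about the fitted relationship.

```python
import math
import random
import statistics


def impute_x1_given_x2(x1, x2, rng):
    """Fill None entries of x1 with draws from a fitted normal linear model of x1 on x2."""
    pairs = [(a, b) for a, b in zip(x1, x2) if a is not None]
    mean1 = statistics.fmean(a for a, _ in pairs)
    mean2 = statistics.fmean(b for _, b in pairs)

    # Least-squares fit on the complete records: learn the X1-X2 relationship.
    sxx = sum((b - mean2) ** 2 for _, b in pairs)
    sxy = sum((a - mean1) * (b - mean2) for a, b in pairs)
    slope = sxy / sxx
    intercept = mean1 - slope * mean2

    # Residual standard deviation, used as the noise level for the draws.
    rss = sum((a - (intercept + slope * b)) ** 2 for a, b in pairs)
    sigma = math.sqrt(rss / (len(pairs) - 2))

    # Keep observed values; draw each missing value from X1 | X2.
    return [
        a if a is not None else rng.gauss(intercept + slope * b, sigma)
        for a, b in zip(x1, x2)
    ]


# Simulated data for illustration only: x2 fully observed, about 30% of x1 deleted.
rng = random.Random(0)
x2 = [rng.gauss(50, 10) for _ in range(200)]
x1_full = [2 + 0.5 * b + rng.gauss(0, 1) for b in x2]
x1 = [None if rng.random() < 0.3 else a for a in x1_full]

m = 5
imputed_sets = [impute_x1_given_x2(x1, x2, rng) for _ in range(m)]
print(f"{sum(a is None for a in x1)} missing values filled in each of {m} data sets")
```

Each of the m completed data sets keeps the observed values and differs only in the imputed ones, which is what carries the uncertainty about the missing data into the analysis.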
How much is missing?
Variables with no mv's: id meno rectime censrec _st _d _t _t0 lnt
Variable | type    obs    mv    variable label
age      | float                age, years
tam      | byte                 hormonal therapy
N: 686
CC Analysis
Cox regression -- Breslow method for ties
No. of subjects = 452          Number of obs = 452
No. of failures = 193
Time at risk =
                               LR chi2(3) = 5.15
Log likelihood =               Prob > chi2 =
_t   | Haz. Ratio   Std. Err.   z   P>|z|   [95% Conf. Interval]
age  |
tam  |
meno |
MI in Practice
Stata: ICE
–Multiple Imputation by Chained Equations (MICE)
Univariate imputation - uvis
Multivariate imputation - ice
[Figure: density plots, graphs by agemiss; x-axis: Age (years); y-axis: Density]
MI Analysis
mim: stcox age tam meno
Multiple-imputation estimates (stcox)
Imputations = 5
Minimum obs = 686
Minimum dof =
_t   | Haz. Rat.   Std. Err.   t   P>|t|   [95% Conf. Int.]   FMI
age  |
tam  |
meno |
Summary
Most studies will have missing data
MI is suitable under MAR and MCAR, where it gives less biased estimates and SEs
MI is a useful tool for dealing with missing data