Missing Data in Epidemiology: Issues & Approaches


1 Missing Data in Epidemiology: Issues & Approaches
N. Birkett, September 4, 2014 Presented to EPI8166 (PhD seminar course)

2 There are known knowns;
There are things we know we know. We also know there are known unknowns; That is to say we know there are some things we do not know. But there are also unknown unknowns; The ones we don’t know we don’t know.
U.S. Secretary of Defense, Donald H. Rumsfeld, Department of Defense news briefing, February 12, 2002

3 Example (1)
                       # getting disease   # total   Cumulative Incidence
Female                 300                 1,000     0.3
Male                   200                 1,000     0.2
Total                  500                 2,000     0.25
RR = 1.5
Now, assume 50% of the females refuse to give you information about their final outcome (decline that question but continue in the study).
                       # getting disease   # total   Cumulative Incidence
Female: missing data   150                 500       0.3
Female: valid data     150                 500       0.3
Male                   200                 1,000     0.2
Total (valid data)     350                 1,500     0.233
RR = 1.5

4 Example (2)
We are missing the outcome status on 50% of the females
Using available data, we find:
  Overall estimate of the rate of disease is biased
  The RR for risk in females compared to males is OK
Why?
  Subjects missing the outcome status are a random subset of all females
    Female-specific incidence risk is correct
  Prevalence of female sex is lower in study ‘complete cases’
    Fails to reflect the 50:50 distribution of sex in the target population
    External validity

5 Example (3)
                       # getting disease   # total   Cumulative Incidence
Female                 300                 1,000     0.3
Male                   200                 1,000     0.2
Total                  500                 2,000     0.25
RR = 1.5
Now, assume 50% of the females refuse to give you information about their final outcome. BUT only people not getting outcome refuse.
                       # getting disease   # total   Cumulative Incidence
Female: missing data   0                   500       0.0
Female: valid data     300                 500       0.6
Male                   200                 1,000     0.2
Total (valid data)     500                 1,500     0.33
RR = 3.0

6 Example (4)
The chance the outcome data is missing depends on the true status of the outcome
Using available data, we find:
  Overall estimate of the rate of disease is biased
  The RR for risk in females compared to males is biased
Why?
  Female-specific incidence risk is biased
    Over-estimated
  Prevalence of female sex is lower in study ‘complete cases’
    Fails to reflect the 50:50 distribution of sex in the target population
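The arithmetic behind Examples (1) to (4) can be checked with a few lines of Python. This is a minimal sketch: the counts come from the slides, while the function and variable names (`summarize`, `random_loss`, `outcome_dependent_loss`) are made up for illustration.

```python
def summarize(female, male):
    """Each argument is (cases, total) among subjects whose outcome was observed."""
    ci_female = female[0] / female[1]
    ci_male = male[0] / male[1]
    overall = (female[0] + male[0]) / (female[1] + male[1])
    return {
        "ci_female": ci_female,
        "ci_male": ci_male,
        "overall": overall,
        "rr": ci_female / ci_male,
    }

# Full data: 300/1,000 females and 200/1,000 males get the disease.
full = summarize(female=(300, 1000), male=(200, 1000))

# Examples (1)-(2): a random half of the females have no outcome recorded.
random_loss = summarize(female=(150, 500), male=(200, 1000))

# Examples (3)-(4): only disease-free females refuse, so all 300 cases remain.
outcome_dependent_loss = summarize(female=(300, 500), male=(200, 1000))

for label, result in [
    ("full data", full),
    ("random loss", random_loss),
    ("outcome-dependent loss", outcome_dependent_loss),
]:
    print(label, {key: round(value, 3) for key, value in result.items()})
```

With random loss, the overall incidence changes but the RR is preserved. With outcome-dependent loss, both the female-specific incidence and the RR change.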

7 Why missing data matters (1)
All studies have missing data
  People drop out of studies
  People decline one of several questionnaires
  People decline to complete certain questions (e.g. income)
  People miss questions (pages get stuck together)
  Lab tests fail
  Biological levels are ‘below threshold of detection’
Missing data is usually not the focus of a study
In many cases, missing data is just ignored

8 Why missing data matters
Failing to adjust properly for missing data can cause serious problems.
  Introduce potential bias in parameter estimation
  Weaken the generalizability of the results
Ignoring cases with missing data leads to the loss of information
  Decreases statistical power
  Increases standard errors
Failing to adjust data properly for missing values can make the data unsuitable for a statistical procedure
  Can also make the statistical analyses vulnerable to violations of assumptions

9 Levels of missing data Data can be missing at two ‘levels’
Unit-level non-response
  A subject included in the study declines to take part and provides no information at all.
  Serious issue in much research
  Mainly affects external generalizability
  Not the focus of further discussions
Item-level non-response
  Subject participates in the study
  Fails to provide information for some items
    Applies a skip sequence wrongly
    Two pages get stuck together

10 Types of missing data patterns (1)
Three patterns are generally recognized: Missing Completely at Random (MCAR) Missing at Random (MAR) Missing not at Random (MNAR or NMAR)

11 Types of missing data patterns (2)
Missing Completely at Random (MCAR) The probability of a data value being missing is independent of all observed and non-observed data. Missing data is a random sample of all data Observed data is an unbiased estimator of the results from total data Complete-case (listwise deletion) methods work fine Can identify MCAR by comparing cases with and without missing data Example Biosamples collected for genotyping Some results are missing because the instrument failed for one batch of samples

12 Types of missing data patterns (3)
Missing at Random (MAR) The probability of a data value being missing is related to observed data but not to non-observed data. Can be analyzed using Multiple Imputation methods or likelihood-based methods Example Looking at prognostic value of SNPs for sub-types of breast cancer Eligible subjects with advanced stage breast cancer (III/IV) were more likely to be missing SNP information Subjects with advanced disease are less cooperative with the study. Conditional on disease stage, the probability of missing the SNP is unrelated to the value of the SNP.

13 Types of missing data patterns (4)
Missing Not at Random (MNAR or NMAR) The probability of a data value being missing is related to the unobserved values. e.g. high values are more likely to be missing than low values Can be analyzed using Multiple Imputation methods or likelihood-based methods much more complex to use requires modeling the process yielding the missing values Example Looking at study which requires measurement of tumor size. Smaller tumors are less likely to have size recorded Harder to measure size of small tumors Requires more complex methods (e.g. MRI or PET scanning). Probability of size being missing relates to the size of the tumor
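A small simulation can make the three mechanisms concrete. The sketch below is illustrative only: the data-generating model (a covariate `x` that is always observed, an outcome `y = x + noise`) and the missingness probabilities are assumptions chosen for the demonstration, not values from the presentation.

```python
import random

random.seed(2014)
n = 20000

# x is always observed; y depends on x and may be missing.
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]

def mean_of_observed(values, p_missing):
    """Delete each value with its own probability, then average what is left."""
    kept = [v for v, p in zip(values, p_missing) if random.random() >= p]
    return sum(kept) / len(kept)

true_mean = sum(y) / n

# MCAR: every y has the same chance of being missing.
mcar_mean = mean_of_observed(y, [0.35] * n)

# MAR: the chance of missing y depends only on the observed covariate x.
mar_mean = mean_of_observed(y, [0.6 if xi > 0 else 0.1 for xi in x])

# MNAR: the chance of missing y depends on the unobserved value of y itself.
mnar_mean = mean_of_observed(y, [0.6 if yi > 0 else 0.1 for yi in y])

print(f"true {true_mean:.3f}  MCAR {mcar_mean:.3f}  "
      f"MAR {mar_mean:.3f}  MNAR {mnar_mean:.3f}")
```

In this setup the complete-case mean is close to the truth only under MCAR. Under MAR it is biased too, but the bias can be removed by conditioning on `x`, which is what multiple imputation and likelihood-based methods exploit. Under MNAR no observed variable explains the missingness.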

14 Another classification of Patterns
Univariate missing data Data are missing on only one variable in the analysis set Monotonic missing data You can rearrange the data so the following is true: If a subject is missing data on variable ‘i’, then they are missing data on all variables after that Longitudinal study with drop-outs. Arbitrary missing data Doesn’t meet the above conditions.
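The pattern classification above can be checked mechanically. The following is a minimal sketch, assuming records are equal-length tuples with `None` marking a missing value; the function name `classify_pattern` is hypothetical.

```python
def classify_pattern(rows):
    """Return 'complete', 'univariate', 'monotone' or 'arbitrary'."""
    n_vars = len(rows[0])
    missing_counts = [
        sum(row[j] is None for row in rows) for j in range(n_vars)
    ]
    if not any(missing_counts):
        return "complete"
    if sum(count > 0 for count in missing_counts) == 1:
        return "univariate"
    # A monotone pattern has nested missingness, so ordering the variables
    # from least to most often missing must give "once missing, always
    # missing" within every row.
    order = sorted(range(n_vars), key=lambda j: missing_counts[j])
    for row in rows:
        seen_missing = False
        for j in order:
            if row[j] is None:
                seen_missing = True
            elif seen_missing:
                return "arbitrary"
    return "monotone"

# Longitudinal study with drop-outs: visits 1, 2, 3.
dropout = [(1.0, 2.0, 3.0), (1.0, 2.0, None), (1.0, None, None)]
# Two subjects each skip a different question.
scattered = [(1.0, None, 3.0), (None, 2.0, 3.0)]

print(classify_pattern(dropout))    # monotone
print(classify_pattern(scattered))  # arbitrary
```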

15 Ignorability And now, some confusing terminology
Rubin introduced the term ‘ignorability’ If data is MCAR or MAR, then the mechanism which produces the missing data is not important and can be ignored in analysis. He called this ‘Ignorability’ This does not mean that the missing data can be ignored!

16 Missing data in the literature (1)
Peng et al (2006) Education & psychology journals 36% had no missing data 48% had missing data 16% were unclear 97% used listwise deletion or pairwise deletion methods.

17 Missing data in the literature (2)
Klebanoff & Cole (2008)
Looked at the use of multiple imputation methods
2 years of articles from Amer J Epidem, Annals Epi, Epidemiology & Int J Epidem
1,105 original research articles
16 papers (1.4%) used one of:
  Multiple Imputation (n=12)
  Inverse probability weighting
  Expectation-maximization algorithm
99 papers had ‘imput’ as text

18 Missing data in the literature (3)
Desai et al (2011)
Focused on molecular epidemiology studies in Cancer Epidemiology, Biomarkers and Prevention
15 month period ( )
278 eligible articles
95% either had missing data or excluded cases with missing data
Only 23 papers (13%) used missing data methods for analysis
  9 dealt with ‘assays below detection limit’
  7 used single imputation
  7 used ‘missing data indicators’
26 (14%) reported differences between subjects with and without missing data.

19 Missing data methods (n=23, 12%)
All articles (n=278)
  Had missing data (n=184, 66%)
    Used CC only (n=161, 88%)
    Missing data methods (n=23, 12%)
      Beyond limits of detection (n=9)
      Single imputation (n=7)
      Missing value indicators (n=7)
  Required data for eligibility (n=81, 29%)
  Population defined by biomarker (n=11, 4%)
  Just nothing missing (n=2)

20 Methods to handle missing data (1)
Need to decide on a model for missing data MCAR MAR MNAR If MNAR, how is the data related to the unobserved value? Set a statistical model for the full data Commonly assumed to be multivariate normal Limiting, especially for categorical data Some other form

21 Methods to handle missing data (2)
Complete Case (Listwise deletion)
Pairwise deletion (e.g. PROC CORR)
Corrected complete case method
  Weighted regression model with complete cases
  Weights related to inverse of probability that a case is complete
Fill the contingency table
  Allocate subjects with missing values of a row/column to cells in proportion to the complete cases.
Replacement with the frequency or mean of complete cases
  For categorical variables, create multiple variables (one per level)
  Impute the percent of the group at each level
Indicator variable for missing data

22 Methods to handle missing data (3)
Simple/Single imputation
Multiple imputation
Full MLE methods
  SAS can use FIML (Full Information Maximum Likelihood)
  Assumes multivariate normality and MAR
  Linked to Structural Equation Modeling (PROC CALIS)
Reweighting estimation equations
  Used in complex survey studies
  Sample weights are adjusted to reflect missing data patterns.

23 Complete Case (listwise deletion)
Subjects missing any values for any variable included in the analysis or model are excluded.
Most commonly used method (‘the default’)
  Usually used without any thought to missing data patterns, etc.
Acceptable if data is MCAR
  Leads to loss of sample size and reduced power/precision
Often produces reasonable results, especially if the amount of missing data is small
Can be strongly biased if data is MAR
Methodological results from multiple papers and theory

24 Pairwise deletion
Similar to casewise deletion BUT, only subjects with missing data for variables involved in the specific analysis are subject to exclusion.
Consider a case where x1 is missing some data but x2 and x3 are complete. Suppose the analysis looks at these two models:
  Model 1: Y = B2 * x2 + B3 * x3
  Model 2: Y = B1 * x1 + B2 * x2 + B3 * x3
In the complete case method, subjects missing x1 will be excluded for both models.
Pairwise deletion: All subjects would be used in model 1; some cases would be excluded in model 2.
Leads to different sub-sets being used for different analyses
  Complicates interpretation.
PROC CORR in SAS uses this approach
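The difference in analysis sets between the two deletion rules can be sketched as follows. The five records are invented for illustration, and `usable` is a hypothetical helper; only the variable names Y, x1, x2, x3 come from the slide.

```python
# Five made-up subjects; x1 is missing (None) for two of them.
records = [
    {"y": 1.0, "x1": 0.5,  "x2": 2.0, "x3": 1.0},
    {"y": 2.0, "x1": None, "x2": 1.0, "x3": 0.0},
    {"y": 1.5, "x1": 0.7,  "x2": 2.5, "x3": 1.0},
    {"y": 3.0, "x1": None, "x2": 0.5, "x3": 0.0},
    {"y": 2.5, "x1": 0.2,  "x2": 1.5, "x3": 1.0},
]

def usable(rows, variables):
    """Keep rows with no missing value among the listed variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

models = {
    "model 1": ["y", "x2", "x3"],
    "model 2": ["y", "x1", "x2", "x3"],
}

# Complete case: one analysis set, defined by every variable in any model.
all_variables = sorted({v for vs in models.values() for v in vs})
complete_case_n = len(usable(records, all_variables))

# Pairwise deletion: each model keeps whoever has data for its own variables.
pairwise_n = {name: len(usable(records, vs)) for name, vs in models.items()}

print("complete case n:", complete_case_n)
print("pairwise n:", pairwise_n)
```

Here the complete-case set has 3 subjects for both models, while pairwise deletion fits model 1 on 5 subjects and model 2 on 3, so the two models describe different sub-sets.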

25 Corrected Complete Case Method
Subjects missing any values for any variable included in analysis or model are excluded. Regression models use weighted regression. Weights are computed to reflect the inverse of the probability that a subject will have complete data. Works OK if data is MAR but can be seriously biased if not true. Figuring out the weights is difficult Finding SE’s can be difficult Results from Vach et al, 1991
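The weighting idea can be illustrated with the counts from Example (1), where half of the females (chosen at random) have no outcome recorded. This sketch applies inverse-probability weights to a simple proportion rather than to a regression model, and it assumes the probability of being a complete case is known for each sex; in practice that probability has to be estimated, which is the difficult part noted above.

```python
# Complete cases from Example (1): sex -> (cases, subjects with an outcome).
complete_cases = {"female": (150, 500), "male": (200, 1000)}

# Assumed probability that a subject's outcome is observed, by sex.
p_complete = {"female": 0.5, "male": 1.0}

def incidence(cases_by_group, weights):
    """Weighted cumulative incidence across groups."""
    numerator = sum(
        cases * weights[group] for group, (cases, _) in cases_by_group.items()
    )
    denominator = sum(
        total * weights[group] for group, (_, total) in cases_by_group.items()
    )
    return numerator / denominator

unit_weights = {group: 1.0 for group in complete_cases}
ipw_weights = {group: 1.0 / p for group, p in p_complete.items()}

unweighted = incidence(complete_cases, unit_weights)  # complete-case estimate
weighted = incidence(complete_cases, ipw_weights)     # each female counts twice

print(f"unweighted {unweighted:.3f}, weighted {weighted:.3f}")
```

The unweighted complete-case estimate is 0.233; weighting each observed female by 1/0.5 restores the full-population value of 0.25.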

26 Fill the Contingency Table
Under MAR, the distribution of subjects with missing data across the 4 cells in a contingency table is the same as the distribution of the complete cases. Modify the Contingency table by allocating ‘counts’ of missing subjects to the table Similar to the ‘corrected complete case’ method. Leads to non-integer counts in the cells Computing variance is tricky because standard formulae don’t work Logistic Regression needs integer counts in the tables. Results from Vach et al, 1991

27 Replacement with the frequency or mean of complete cases
Really a type of single imputation For each subject with missing data, replace the missing value by the mean of the complete cases For categorical data, define indicator variables 0/1 if there is valid data If data is missing, use the proportion of the complete cases with that level of the variable. Leads to indicator variables which have non-integer components. Strongly biased method, even with MAR more biased than Complete Case method Henry et al, 2013

28 Indicator variable for missing data
Treat ‘missing values’ as if they are a valid response to the questionnaire Assign them a code value Example (Do you drink alcohol?): Yes: 1 No: 2 Missing: 3 Analysis is done using three levels 2 dummy variables This is a very bad method which is strongly biased.

29 Indicator variable for missing data
Commonly used and commonly taught in epidemiology courses. Studied by multiple authors (Vach, Greenland) Very strongly biased in every study, including theoretical analyses Consider two situations: Variable is the main effect of interest:

30 Full Population data
           Cases   Controls
Exp +ve    140     60
Exp -ve    60      140
Total      200     200
OR = 5.44
Now, assume 30% of data is missing, MCAR. Define the ‘missing data’ indicator variable
           Cases   Controls
Exp +ve    98      42
Exp -ve    42      98
Missing    60      60
Total      200     200
What is OR of Exp +ve to Exp -ve? It is still 5.44 = (98 × 98) / (42 × 42)

31 Confounding example (1)
So, we gain nothing by defining the missing category. But, suppose the missing data is in a confounder. Here is the population data. Crude table is as before (OR=5.44):
Level 1    Cases   Controls
Exp +ve    50      10
Exp -ve    50      90
Total      100     100
OR = 9.0
Level 2    Cases   Controls
Exp +ve    90      50
Exp -ve    10      50
Total      100     100
OR = 9.0
Adjusted OR would be 9.0: strong confounding

32 Confounding example (2)
Now, 30% of the data on the confounder is missing. We create the missing value indicator level. Means we now have three 2x2 tables for our confounding analysis.
Level 1             Cases   Controls
Exp +ve             35      7
Exp -ve             35      63
Total               70      70
OR = 9.0
Level 2             Cases   Controls
Exp +ve             63      35
Exp -ve             7       35
Total               70      70
OR = 9.0
Level 3: Missing    Cases   Controls
Exp +ve             42      18
Exp -ve             18      42
Total               60      60
OR = 5.44

33 Confounding example (3)
When there is no missing data, the OR’s are as follows. Clearly, there is confounding with the adjusted OR being 9.0
              No missing data   With missing indicator
Stratum 1     9.0               9.0
Stratum 2     9.0               9.0
Stratum 3     N/A               5.44
Crude         5.44              5.44
Adjusted OR   9.0               around 8.0
When we have the missing indicator in the data, the adjusted OR is not 9.0 but around 8. Very strongly biased.
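The confounding example can be reproduced numerically. Two things in this sketch are assumptions rather than facts from the slides: the cell counts are a reconstruction chosen so that the stratum-specific ORs (9.0, 9.0, 5.44), the crude OR (5.44) and the 30% missingness all match the stated values, and the adjusted OR is computed with the Mantel-Haenszel estimator because the slides do not say which estimator was used.

```python
def odds_ratio(table):
    """table = ((a, b), (c, d)): rows Exp+/Exp-, columns cases/controls."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio across 2x2 strata."""
    numerator = sum(a * d / (a + b + c + d) for (a, b), (c, d) in strata)
    denominator = sum(b * c / (a + b + c + d) for (a, b), (c, d) in strata)
    return numerator / denominator

# Reconstructed population tables for the two confounder levels.
full_strata = [((50, 10), (50, 90)), ((90, 50), (10, 50))]

# 30% of confounder values missing (MCAR), with 'missing' as a third stratum.
indicator_strata = [
    ((35, 7), (35, 63)),
    ((63, 35), (7, 35)),
    ((42, 18), (18, 42)),  # confounder missing
]

crude = odds_ratio(((140, 60), (60, 140)))
adjusted_full = mantel_haenszel_or(full_strata)
adjusted_indicator = mantel_haenszel_or(indicator_strata)

print(f"crude {crude:.2f}")
print(f"adjusted, no missing data {adjusted_full:.2f}")
print(f"adjusted, missing indicator {adjusted_indicator:.2f}")
```

Under these assumptions the adjusted OR is 9.0 with complete data and about 7.5 with the missing-indicator stratum, pulled toward the crude value of 5.44. This matches the direction of the slide's "around 8"; the exact figure depends on the estimator.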

34 Indicator Variable for Missing Data
This method has no role in handling missing data Is strongly biased, even with MCAR data. One core requirement for any method to address missing data is that it gives the ‘right’ answer for MCAR data.

35 Single Imputation (1)
Replace a missing value with an estimate of what the value should have been
Various methods are possible
  Overall mean
  Group-specific mean
  Last observation carried forward (in follow-up studies)
  An extreme value (e.g. missing = heavy alcohol use)
  Regression modeling
    Works best with monotonic missing data.
    To impute Yj, regress Yj on Y1 to Yj-1 for all subjects with valid data for Yj
    This gives a group of Betas with SE’s.
    Select a value of each beta at random from the distributions.
    For single imputation, you often use the actual estimated Beta values
    Use the regression equation to estimate the mean value of Yj for subjects with missing data.
  Hot-deck imputation
  MCMC methods

36 Single Imputation (2)
Hard to generate valid variance estimates
Greenland found regression-based single imputation to be subject to serious errors in the face of mis-specified models.
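One reason variance estimates are hard to trust after single imputation is that filling in a fixed value makes the data look less variable than they are. The sketch below shows this for overall-mean imputation; the five observed values and the number of missing values are invented for illustration.

```python
import statistics

observed = [4.0, 6.0, 8.0, 10.0, 12.0]
n_missing = 5  # hypothetical number of subjects with no recorded value

# Overall-mean imputation: every missing value becomes the observed mean.
fill_value = statistics.mean(observed)
imputed = observed + [fill_value] * n_missing

var_observed = statistics.variance(observed)  # sample variance, n - 1 divisor
var_imputed = statistics.variance(imputed)

print(f"mean: observed {statistics.mean(observed):.2f}, "
      f"imputed {statistics.mean(imputed):.2f}")
print(f"variance: observed {var_observed:.2f}, imputed {var_imputed:.2f}")
```

The mean is unchanged, but the sample variance falls from 10.0 to about 4.4 while the apparent sample size doubles, so standard errors computed from the imputed data set are too small.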

37 Multiple Imputation (1)
MI handles missing data in three steps: Impute missing data ‘m’ times to produce ‘m’ complete data sets; Analyze each data set using a standard statistical procedure; Combine the ‘m’ results into one using formulae from Rubin (1987) or Schafer (1997). Most MI methods assume MAR Multivariate normality If the assumptions are met, and if these three steps are done correctly, multiple imputation produces estimates that have nearly optimal statistical properties. They are: Consistent (and, hence, approximately unbiased in large samples), Asymptotically efficient (almost), and Asymptotically normal.
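Step three, combining the ‘m’ results, is commonly done with Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. A minimal sketch follows; the five estimates and variances are invented numbers standing in for the output of step two, and `pool_rubin` is a hypothetical helper name.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # spread across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total, math.sqrt(total)

# Hypothetical results from analysing m = 5 imputed data sets.
estimates = [0.40, 0.50, 0.45, 0.55, 0.60]
variances = [0.010, 0.012, 0.011, 0.009, 0.013]

pooled, total_variance, pooled_se = pool_rubin(estimates, variances)
print(f"pooled estimate {pooled:.3f}, SE {pooled_se:.3f}")
```

The between-imputation term is what carries the uncertainty due to the missing data; a single imputation has no such term, which is why its standard errors are too small.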

38 Multiple Imputation (2)
One common method uses regression models in step #1
Three kinds of variables are included in an imputation regression model:
  Variables that are of theoretical interest,
  Variables that are associated with the missing mechanism, &
  Variables that are correlated with the variables with missing data.
Consider adding interaction terms for continuous variables.
Bayesian ideas can be used in step #1
  Regression based
    Set a prior distribution for the regression parameters and error term
    Fit model to generate posterior distribution
    Select at random from posterior distribution to generate several imputation equations

39 Multiple Imputation (2)
Bayesian ideas can be used in step #1 MCMC (Markov Chain Monte Carlo) Divide sample into subsets with the same missing data for variables e.g. Group #1: missing x1 & x2 Group #2: missing x1, x3 & x4 Fit regression models within each pattern of missingness Impute using these models Uses full data set to update means, variances and covariances Make a random selection from the posterior distribution of these parameters Update the regression models Repeat FCS (Fully Conditional Specification) Similar to above but handles categorical data better No strong theoretical justification

40 Multiple Imputation (3)
Most MI models assume variables are multivariate normal
  Issues arise with categorical variables
  Can treat as continuous and then round to generate a suitable categorical value
  Round based on the normal approximation to the binomial distribution
Most studies find MI methods to be the most valid of missing variable methods
Some issues/questions
  How many replicates (multiples) to include?
  What variables to include in model?
  How to handle non-normal variables, including categorical variables?
  Software limitations

41 Full MLE methods (1) Suppose we have a data set and we want to fit a regression model (could be linear, logistic, etc.) With no missing data, we use Maximum Likelihood methods n observations on k variables: Based on regression model assumptions, the likelihood of the data can be given as: θ is the set of parameters to estimate We find the values of θ to make ‘L’ as big as possible
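The formulas on this slide did not survive in the transcript. A standard way to write the complete-data setup is sketched below; the notation (y_{ij} for the value of variable j on subject i, f for the joint density implied by the regression model) is an assumption and may differ from the original slide.

```latex
% Data: n independent observations on k variables
%   (y_{i1}, y_{i2}, \dots, y_{ik}), \quad i = 1, \dots, n
% Likelihood of the data under the model with parameters \theta:
L(\theta) = \prod_{i=1}^{n} f(y_{i1}, y_{i2}, \dots, y_{ik};\ \theta)
% Maximum likelihood estimate:
\hat{\theta} = \arg\max_{\theta} L(\theta)
```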

42 Full MLE methods (2) What if we have some missing data?
Suppose y1 & y2 have missing data which is MAR For a subject with missing values, we can not generate the likelihood contribution since we don’t know y1 & y2 Instead, consider all possible values which they might have, combined with the probability of those values. Add up the likelihood contribution for every possible value: Substitute this into the MLE equation and estimate ‘θ’
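The formula for this step is also missing from the transcript. In standard notation (an assumption, with y_{ij} the value of variable j on subject i and f the joint density under the model), the contribution of a subject i who is missing y1 and y2 is obtained by integrating, or summing for discrete variables, over the unobserved values:

```latex
% Contribution of a subject i with y_{i1} and y_{i2} missing:
L_i(\theta) = \int\!\!\int f(y_{i1}, y_{i2}, y_{i3}, \dots, y_{ik};\ \theta)
              \, dy_{i1}\, dy_{i2}
% Overall likelihood: complete subjects contribute the full density,
% incomplete subjects contribute the marginal density above.
L(\theta) = \prod_{i \in \text{complete}} f(y_{i1}, \dots, y_{ik};\ \theta)
            \times \prod_{i \in \text{incomplete}} L_i(\theta)
```

Under MAR this observed-data likelihood can be maximized without modeling the missingness mechanism itself, which is the sense in which the mechanism is ‘ignorable’.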

43 Full MLE methods (3)
FIML is one way to do this in SAS
  Part of PROC CALIS
  Assumes multivariate normality
MPlus
  Software which can handle non-linear models
  Can use various regression models
    Logistic
    Poisson
    Tobit
    Cox
    Etc.

44 Reweighting estimation equations
Discussed by Henry et al (2013) Applies to complex surveys Differential probability of selection from target population Analysis requires ‘weights’ to adjust for this Standard weights are proportional to the inverse of the probability of selection With missing data, complete case analysis leads to different subsets for each set of variables weights are incorrect Adjust each weight to account for probability of being a complete case Do analysis using new weights and complete cases only Henry shows it produces very good estimates Limited area of application

45 Summary Missing data can be very important
More than 5-10% of data missing is considered a potential source of serious bias Need to consider the model which produces the missing data ‘ad hoc’ methods are poor and should not be used Multiple Imputation or Full MLE methods give excellent results in most situations If missing data is MNAR, need to consider the model which gives rise to the missing data If missingness is strongly related to value of variable, problem is complex

46 One suggested approach (1)
Describe target population Clearly describe derivation of analytic data set Describe population characteristics of analytic data set, including missing values Describe differences in population characteristics for subjects with valid and missing data for key variables § adapted from Desai et al Cancer Epidemiol Biomarkers Prev; 20(8), 2011

47 One suggested approach (2)
Investigate possible assumptions for missing data Assume MCAR if no data to suggest it is violated & no mechanism to generate MNAR Assume MAR if MCAR is not acceptable, no mechanism to generate MNAR & candidate ancillary variables exist Assume MNAR if a priori knowledge exists that missing data are related to unknown values Conduct a CC analysis

48 One suggested approach (3)
Choose an additional analysis as appropriate For MAR, Use Multiple Imputation with suitable ancillary variables For MNAR, Use Multiple Imputation, Need to model the method which generated the missing data. If a variable is limited by sensitivity of a lab detection device, Use a likelihood-based method Implement the additional analysis Include all potential ancillary variables Use SAS if you can postulate a joint distribution for ancillary variables Use STATA or R (fully conditional method).

49 One suggested approach (4)
Perform sensitivity analyses Do both CC & MI Use different subsets of ancillary variables for MI Use different models for MNAR missing generation Interpret the results If all analyses give same results, this is easy If they differ, need to present a more complex result in the paper.
