1 Missing Data and Selection Bias
Danielle J. Harvey, PhD; Paul K. Crane, MD, MPH. Many thanks to Elena Erosheva, PhD, for permission to use her slides from FH 2006.

2 Disclaimer Funding for this conference was made possible, in part by Grant R13 AG from the National Institute on Aging. The views expressed do not necessarily reflect the official policies of the Department of Health and Human Services; nor does mention by trade names, commercial practices, or organizations imply endorsement by the U.S. Government. Drs. Harvey and Crane have no conflicts of interest to report.

3 Outline - Missing Data Why should we worry about missing data?
Patterns of missingness. Types of missingness: Missing completely at random (MCAR), Missing at random (MAR), Not missing at random (MNAR). How to deal with missing data: Some basic ad hoc methods, Imputation and multiple imputation, Model-based procedures.

4 Why should we worry? Missing values: intended measurements that were not taken, were lost, or are otherwise unavailable. Missing data in longitudinal studies result in: Unbalanced data (technical difficulties); Uncertainty about whether the missingness affects our findings. One scenario of what may go wrong: A mixed-effects (growth) model assumes each person’s observed record is a random sample from that person’s underlying true trajectory. Missing data may violate this assumption, in which case estimates may be biased and inferences invalid.

5 Missing data patterns Univariate nonresponse (some records on one variable are missing). Multivariate nonresponse (the same cases are missing records on a set of variables). Two sets of variables never jointly observed. General nonresponse (no clear pattern). Monotone missingness: attrition in longitudinal studies.

6 Types of missingness - MCAR
Data are called missing completely at random (MCAR) if missingness does not depend on the values of the data, missing or observed. Example: Y1 is age and Y2 is income, and the probability that Y2 is missing is constant.

7 Types of missingness - MAR
Data are called missing at random (MAR) if missingness depends only on the components that are observed, and not on the components that are missing. Example: Y1 is age and Y2 is income, and the probability that Y2 is missing depends on Y1.

8 Types of missingness - NMAR
Data are called not missing at random (NMAR) if missingness depends on the missing values. Example: Y1 is age and Y2 is income, and the probability that Y2 is missing varies according to income for those with the same age. NMAR missingness may be the most likely case in practice. The MAR assumption might be made more plausible by collecting additional data that are predictive both of the missing variable values and the probability of being missing.
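
The three mechanisms can be made concrete with a small simulation of the age/income example. The sketch below is illustrative only: the income model, the logistic missingness functions, and all numeric constants are assumptions that do not come from the slides. It compares the complete-case mean of income (Y2) with the full-data mean under each mechanism.

```python
import math
import random

def simulate(n=20000, seed=0):
    """Return (age, income) pairs; income rises with age (hypothetical model)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        age = rng.uniform(30, 70)
        income = 20 + 1.0 * age + rng.gauss(0, 10)  # in thousands, made up
        data.append((age, income))
    return data

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def observed_mean(data, p_missing, seed=1):
    """Mean income among cases that remain after applying a missingness rule."""
    rng = random.Random(seed)
    kept = [inc for age, inc in data if rng.random() >= p_missing(age, inc)]
    return sum(kept) / len(kept)

data = simulate()
full_mean = sum(inc for _, inc in data) / len(data)

# MCAR: constant probability that income is missing.
mcar = observed_mean(data, lambda age, inc: 0.3)
# MAR: probability depends only on the observed variable (age).
mar = observed_mean(data, lambda age, inc: logistic((age - 50) / 5))
# NMAR: probability depends on the (possibly missing) income itself.
nmar = observed_mean(data, lambda age, inc: logistic((inc - 70) / 5))

print(f"full data: {full_mean:.1f}")
print(f"MCAR: {mcar:.1f}  MAR: {mar:.1f}  NMAR: {nmar:.1f}")
```

In this setup the complete-case mean stays close to the full-data mean only under MCAR. Under MAR the bias can in principle be removed by conditioning on age; under NMAR it cannot be removed using the observed data alone.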

9 More missing data terminology
NMAR is also known as informative or nonignorable missingness, and sometimes as MNAR. MAR is also known as noninformative or ignorable missingness. MCAR is a special case of MAR.

10 How to deal with missing data
When working with missing values, the goal is valid and efficient inference about a population of interest (not recovery of the missing observations). Hence, the treatment of missingness should be chosen in conjunction with the research problem and the model, estimation, or testing procedure. When missing values occur for reasons beyond our control, we must make assumptions (usually untestable) about the processes that create them.

11 Basic methods Method: Case deletion (listwise deletion or complete-case analysis). Properties: It’s simple! Valid only under MCAR. If not MCAR, complete cases may be unrepresentative of the full population. Can be very inefficient. Reasonable choice with small amounts of missing data.

12 Basic methods Method: Available-case analysis.
Special case: Pairwise deletion/inclusion for covariance estimation. Properties: Uses all available data for each parameter. May use different subsets of units for different parameters. Measures of uncertainty are difficult to compute.
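
As a minimal sketch of pairwise deletion, the helper below computes each covariance from whichever rows have both variables observed. The function name `pairwise_cov` and the toy data are hypothetical, chosen only to show that different parameters end up based on different subsets of units.

```python
def pairwise_cov(rows, i, j):
    """Sample covariance of columns i and j over rows where both are observed.

    Returns (covariance, number of rows used). None marks a missing value.
    """
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    n = len(pairs)
    mean_i = sum(a for a, _ in pairs) / n
    mean_j = sum(b for _, b in pairs) / n
    cov = sum((a - mean_i) * (b - mean_j) for a, b in pairs) / (n - 1)
    return cov, n

# Toy data set with three variables; only two rows are fully observed.
rows = [
    (1.0, 2.0, None),
    (2.0, None, 1.0),
    (3.0, 6.0, 2.0),
    (4.0, 8.0, None),
    (5.0, None, 5.0),
    (6.0, 12.0, 4.0),
]

cov01, n01 = pairwise_cov(rows, 0, 1)  # uses 4 rows
cov02, n02 = pairwise_cov(rows, 0, 2)  # uses 4 different rows
cov12, n12 = pairwise_cov(rows, 1, 2)  # uses only the 2 complete rows
print(n01, n02, n12)
```

Because each entry of the covariance matrix rests on a different sample, there is no single sample size to plug into standard-error formulas, which is one reason uncertainty is hard to quantify with this method.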

13 Basic methods Method: Reweighting.
Idea: Incomplete cases are removed, and complete cases are weighted so that the modified sample closely resembles the full sample on fully observed variables. Properties: No need to model data generation process. Need to model probabilities of response. Can eliminate biases due to observed variables. Easy for univariate and monotone patterns.
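
One common way to carry out reweighting is inverse-probability weighting within strata of a fully observed variable. The following sketch assumes a single binary stratum (older vs. younger), response probabilities that depend only on that stratum, and made-up income distributions; none of these specifics are from the slides.

```python
import random

def make_sample(n=20000, seed=0):
    """Simulate (stratum, income-or-None); response depends on stratum only."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        old = rng.random() < 0.5                    # fully observed stratum
        income = rng.gauss(80 if old else 50, 10)   # hypothetical values
        p_respond = 0.3 if old else 0.9             # assumed response model
        observed = rng.random() < p_respond
        sample.append((old, income if observed else None))
    return sample

def ipw_mean(sample):
    """Weighted mean of complete cases, weights = 1 / estimated P(respond)."""
    # Step 1: estimate the response probability within each stratum.
    rate = {}
    for stratum in (False, True):
        members = [y for s, y in sample if s == stratum]
        rate[stratum] = sum(y is not None for y in members) / len(members)
    # Step 2: weight each complete case by the inverse of that probability.
    num = sum(y / rate[s] for s, y in sample if y is not None)
    den = sum(1 / rate[s] for s, y in sample if y is not None)
    return num / den

sample = make_sample()
complete = [y for _, y in sample if y is not None]
naive = sum(complete) / len(complete)   # unweighted complete-case mean
weighted = ipw_mean(sample)             # reweighted complete-case mean
print(f"naive: {naive:.1f}  weighted: {weighted:.1f}  (population mean is 65)")
```

The weights only correct for imbalance on the observed stratum, which matches the slide's point that reweighting removes bias due to observed variables and nothing more.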

14 Single imputation Method: Single imputation.
Special cases: unconditional means, conditional means based on a regression model, last observation carried forward, draws from a conditional distribution. Properties: Underestimates uncertainty in estimates; ignores the fact that imputed values are guesses. Complexity varies, depending on special case. Often distorts correlations.
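
The claims about understated uncertainty and distorted correlations are easy to check for the simplest special case, unconditional mean imputation. This is a sketch on simulated data; the sample size, the 40% MCAR missingness rate, and the helper names `sd` and `pearson` are assumptions made for illustration.

```python
import math
import random

def sd(values):
    """Sample standard deviation."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [xi + rng.gauss(0, 1) for xi in x]          # y is correlated with x
missing = [rng.random() < 0.4 for _ in y]       # 40% of y missing, MCAR

observed_y = [yi for yi, m in zip(y, missing) if not m]
mean_y = sum(observed_y) / len(observed_y)
# Unconditional mean imputation: every missing y gets the observed mean.
imputed_y = [mean_y if m else yi for yi, m in zip(y, missing)]

sd_observed = sd(observed_y)
sd_imputed = sd(imputed_y)
corr_full = pearson(x, y)
corr_imputed = pearson(x, imputed_y)
print(f"SD of y: observed {sd_observed:.2f}, after imputation {sd_imputed:.2f}")
print(f"corr(x, y): full data {corr_full:.2f}, after imputation {corr_imputed:.2f}")
```

Filling in a constant shrinks the spread of the imputed variable and attenuates its correlation with other variables, even when the data are MCAR.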

15 Single imputation Method: Single imputation.
Single imputation is a reasonable choice when a small total percentage of the data is missing but listwise deletion would discard a substantial number of cases.

16 Multiple imputation Method: Multiple imputation.
Idea: Each missing value is replaced by m simulated values; m versions of the data are analyzed; results are combined for valid inferences. Properties: Complete-data techniques used for analyses. m=20 is usually sufficient. Imputation model is needed (often Normal). The theory is Bayesian (good for small sample performance). Better than single imputation but more work.
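
The combining step is usually done with Rubin's rules: average the m point estimates, and add the average within-imputation variance to an inflated between-imputation variance. A minimal sketch follows, assuming each of the m analyses has already produced an estimate and its squared standard error; the function name `pool` and the example numbers are hypothetical.

```python
import math
import statistics

def pool(estimates, variances):
    """Combine m completed-data results with Rubin's rules.

    estimates: point estimates from the m imputed data sets
    variances: their squared standard errors
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                  # pooled point estimate
    within = sum(variances) / m                 # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance
    total = within + (1 + 1 / m) * between      # total variance
    return q_bar, math.sqrt(total)

# Made-up results from m = 5 imputed data sets.
estimates = [10.0, 12.0, 11.0, 9.0, 13.0]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]
q_bar, se = pool(estimates, variances)
print(q_bar, se)
```

The pooled standard error is larger than any single completed-data standard error (2.0 here) because the between-imputation term carries the uncertainty about the missing values that single imputation ignores.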

17 Model-based methods Also known as likelihood-based.
Likelihood: how likely are the observed data given the model and the parameter values? General theory: Maximum likelihood (ML) estimates. ML estimates have attractive statistical properties (asymptotically unbiased and asymptotically efficient). Formally, there is no difference between likelihood-based inference for incomplete data and likelihood-based inference for complete data.

18 Model-based methods Likelihood maximization:
Expectation-Maximization (EM) algorithm: fill in the missing data with the best current guesses, re-estimate parameters, update the missing data. Newton-Raphson and Fisher scoring numerical methods. Special or custom-made software is required. Best performance in large samples. Often MAR assumption is employed.
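
To illustrate the EM idea, here is a sketch for one simple case: a bivariate normal model where Y1 is always observed and Y2 is missing for some cases under MAR. The simulated data, the missingness model, and the function names are assumptions for illustration. Note that for this model the E-step fills in expected values of Y2 and of Y2 squared (not just a single best guess), so that the variance is not understated.

```python
import math
import random

def simulate(n=5000, seed=0):
    """y2 depends on y1; y2 is more likely to be missing when y1 is large."""
    rng = random.Random(seed)
    y1, y2 = [], []
    for _ in range(n):
        a = rng.gauss(0, 1)
        b = 1.0 + 0.8 * a + rng.gauss(0, 0.6)   # true mean of y2 is 1.0
        p_missing = 1.0 / (1.0 + math.exp(-2.0 * a))
        y1.append(a)
        y2.append(None if rng.random() < p_missing else b)
    return y1, y2

def em_bivariate(y1, y2, iters=200):
    """EM estimates for a bivariate normal with missing entries (None) in y2."""
    n = len(y1)
    mu1 = sum(y1) / n
    s11 = sum((a - mu1) ** 2 for a in y1) / n
    obs = [b for b in y2 if b is not None]
    mu2 = sum(obs) / len(obs)                    # starting values from
    s22 = sum((b - mu2) ** 2 for b in obs) / len(obs)  # the complete cases
    s12 = 0.0
    for _ in range(iters):
        beta = s12 / s11
        resid_var = s22 - s12 ** 2 / s11
        sum_y2 = sum_y2sq = sum_y1y2 = 0.0
        for a, b in zip(y1, y2):
            if b is None:
                # E-step: conditional mean and second moment of y2 given y1.
                e = mu2 + beta * (a - mu1)
                e2 = e * e + resid_var
            else:
                e, e2 = b, b * b
            sum_y2 += e
            sum_y2sq += e2
            sum_y1y2 += a * e
        # M-step: update parameters from the expected sufficient statistics.
        mu2 = sum_y2 / n
        s22 = sum_y2sq / n - mu2 ** 2
        s12 = sum_y1y2 / n - mu1 * mu2
    return mu1, mu2, s11, s12, s22

y1, y2 = simulate()
observed = [b for b in y2 if b is not None]
cc_mean = sum(observed) / len(observed)          # complete-case mean of y2
mu1, mu2_em, s11, s12, s22 = em_bivariate(y1, y2)
print(f"complete-case mean of y2: {cc_mean:.2f}, EM estimate: {mu2_em:.2f}")
```

In this simulation the complete-case mean of Y2 is biased downward because high-Y1 cases are under-represented, while the EM estimate uses the Y1 values of incomplete cases and lands near the true mean, which relies on the MAR assumption holding.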

19 Methods that do not assume MAR
Two approaches: 1. Selection Models: - Specify a model for missingness. - Have intuitive appeal. - Parameters may be poorly identified. - Too unstable for scientific applications. 2. Pattern-Mixture Models: - Classify missing data patterns into groups. - Describe observed data within each group. - Likelihood behavior is better. - Difficult to interpret scientifically.

20 Longitudinal data: Final words
Attrition (vs. intermediate missingness): - Simpler interpretation for missing data processes. - Easier to formulate models. - Much of the literature is restricted to it. - Most common form of missingness. - Data can often be “reduced” to it.

21 Longitudinal data: Final words
Investigators’ interests may be - Pragmatic (mirror what has been observed). - Explanatory (what could be observed if there were no missing data). Accordingly, one will be describing - Conditional response (given that the subject remains in the trial). - Marginal response (irrespective of whether we were able to record it or not). Note: marginal representation is not meaningful when dropout means that the response cannot subsequently exist.

22 Final words All approaches rely on untestable assumptions about the relation between the measurement process and the dropout process. Hence, it is advisable to always perform a sensitivity analysis. “All statistical models are wrong, but some are useful” (George Box).
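
One simple form of sensitivity analysis, not described in the slides and offered here only as an illustrative sketch, is a delta adjustment: assume the missing values have a mean that differs from the observed mean by some offset delta, and see how the overall estimate moves as delta varies. The function name and the data are hypothetical.

```python
def delta_adjusted_mean(observed, n_missing, delta):
    """Overall mean if the missing cases average (observed mean + delta).

    delta = 0 corresponds to assuming the missing cases look like the
    observed ones; nonzero delta represents a departure from that assumption.
    """
    n_obs = len(observed)
    mean_obs = sum(observed) / n_obs
    mean_missing = mean_obs + delta
    return (n_obs * mean_obs + n_missing * mean_missing) / (n_obs + n_missing)

# Made-up scores for 5 participants who stayed; 5 others dropped out.
observed = [24.0, 27.0, 25.0, 28.0, 26.0]
n_missing = 5

# Dropouts assumed to score 0, 2, or 4 points lower than completers.
results = {d: delta_adjusted_mean(observed, n_missing, d) for d in (0.0, -2.0, -4.0)}
for delta, estimate in results.items():
    print(f"delta = {delta:+.1f}: estimated mean = {estimate:.1f}")
```

If the substantive conclusion holds across the range of delta values considered plausible, it is less dependent on the untestable assumption about dropouts.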

23 Final Words - ROS ROS is a good study for understanding/testing impacts of missing data: Essentially no missing data. Can create specific missing data patterns in the data. Conduct inferences based on the “new” data. Compare results with those from the complete data.

24 Selected References Schafer, J., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2), . Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data. John Wiley & Sons. Verbeke, G., & Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. Springer-Verlag. Singer, J. D., & Willett, J. B. (2003). Applied Longitudinal Data Analysis. Oxford University Press. Diggle, P. J., Heagerty, P., Liang, K.-Y., & Zeger, S. L. (2002). Analysis of Longitudinal Data. Oxford University Press. Weiss, R. (2005). Modeling Longitudinal Data. Springer.

25 Outline Considerations of data from autopsy samples
Missing data more generally: Covariates; Outcomes.

26 Autopsied individuals are not representative
Seems simplistic, but they are dead. Not all dead people receive an autopsy. There are factors associated with willingness to consent to an autopsy. In the ROS, 100% of the sample consented to autopsy, so the autopsied sample does not differ from the sample of dead people due to factors associated with consent. So we are primarily concerned with death leading to non-representativeness of the autopsied sample.

27 Consider a time point… At any given time, a subset of the overall cohort will have died, and the rest of the cohort is alive. Amount of follow-up time for the dead cohort is known, but for the alive cohort it is truncated. Length of time prior to death for the alive cohort is not known, nor is age at death.

28 Consider a participant alive at time t
At time t+1, they have either: Another year of age and another year of neuropsychological test results, and missing neuropathological data, OR No further neuropsychological test results, but non-missing neuropathological data.

29 Early vs. late in the study
Eventually all cohort members will die. At that time, the autopsied sample will be representative of the entire cohort. As deaths accumulate, representativeness of the autopsied sample increases. Early in the study, the autopsied sample looks very different from the study cohort as a whole. Representativeness is thus a time-varying phenomenon. Ignoring this issue (or mis-handling the analyses) may result in conclusions that change from the beginning to the end of a study.

30 Age makes it more complicated
The strength of association between neuropathological findings and dementia status (and with cognitive tests) is reduced in older people compared with younger people. So as a year passes, either: The participant dies and there is a slightly stronger relationship between our observed neuropathology and the trajectory of cognitive functioning, or The participant doesn’t die and the relationship between our unobserved neuropathology and the trajectory of cognitive functioning is slightly weaker.

31 What factors may be involved in selection?
Factors associated with mortality: Cardiovascular health / disease / risk factors; Cancer, other comorbidities; Dementia.

32 Preliminary analyses As a first step, we should check whether the autopsied sample is similar to the whole sample on factors we care about (Table 1 type analyses). Also look at trajectories of executive functioning and memory in those who have died and those who have not died. Think about generalizability.

33 More complicated analyses
Can we weight each observation at each time point based on the probability that neuropathological data are observed and that neuropsychological data are not observed (i.e., the probability that the person is alive)? Weights are not uniform across time within person, as baseline predictors of mortality become less relevant as time passes. Exercise is left for the reader (i.e., I have no idea how to operationalize this). Bayesian flavor?

