MISSING DATA AND DROPOUT

What is missing data? Missing data arise in longitudinal studies whenever one or more of the sequences of measurements are incomplete, in the sense that some intended measurements are not obtained.

Missing data Let Y denote the complete response vector, which can be partitioned into two sub-vectors: (i) Y(o), the measurements observed, and (ii) Y(m), the measurements that are missing.

Missing data If there were no missing data, we would have observed the complete response vector Y. Instead, we get to observe only Y(o).

What is the problem? The main problem that arises with missing data is that the distribution of the observed data may not be the same as the distribution of the complete data.

Consider the following simple illustration: Suppose we intend to measure subjects at 6 months (Y1) and 12 months (Y2) post treatment. All of the subjects return for measurement at 6 months, but many do not return at 12 months.

If subjects fail to return for measurement at 12 months because they are not well (say, values of Y2 are low), then the distribution of the observed Y2's will over-represent higher values relative to the distribution of Y2 in the population of interest, so the observed mean will be biased upward.

In general, the situation may often be quite complex, with some missingness unrelated to either the observed or unobserved response, some related to the observed, some related to the unobserved, and some to both.

Monotone missing data A particular pattern of missingness that is common in longitudinal studies is ‘dropout’ or ‘attrition’. This is where an individual is observed from baseline up until a certain point in time; thereafter, no more measurements are made.

Study Dropout Possible reasons for dropout: Recovery Lack of improvement or failure Undesirable side effects External reasons unrelated to specific treatment or outcome Death

Examples In clinical trials, missing data can arise from a variety of circumstances: Late entrants: If the study has staggered entry, at any interim analysis some individuals may have only partial response data. Usually, this sort of missing data does not introduce any bias.

Dropout: Individuals may drop out of a clinical trial because of side effects or lack of efficacy. Usually, this type of missing data is of concern, especially if dropout is due to lack of efficacy. Dropout due to lack of efficacy suggests that those who drop out come from the lower end of the spectrum. Dropout due to side effects may or may not be a problem, depending upon the relationship between side effects and the outcome of interest.

Intermittent vs Dropout Missing Data An important feature is whether the pattern of missing values is dropout (monotone) or intermittent (non-monotone). In a dropout pattern, some subjects withdraw prematurely, i.e. any missing value is never followed by an observed value. In an intermittent pattern, an observed value can occur even after a missing value.
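
A small sketch, using a hypothetical wide-format data set, that classifies each subject's missingness pattern as complete, dropout (monotone), or intermittent (non-monotone):

import numpy as np
import pandas as pd

wide = pd.DataFrame(
    [[1.2, 1.5, 1.7, 1.9],          # complete
     [1.0, 1.1, np.nan, np.nan],    # dropout: a missing value is never followed by an observed one
     [0.9, np.nan, 1.3, np.nan]],   # intermittent: an observed value appears after a missing one
    columns=["y1", "y2", "y3", "y4"],
)

def pattern(row):
    observed = row.notna().to_numpy()
    if observed.all():
        return "complete"
    first_missing = np.argmax(~observed)
    # monotone dropout: once a value is missing, nothing later is observed
    return "dropout" if not observed[first_missing:].any() else "intermittent"

print(wide.apply(pattern, axis=1))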

Examples of intermittent missing data?

Type of missing data A hierarchy of three different types of missing data mechanisms can be distinguished: Data are missing completely at random (MCAR) when the probability that an individual value will be missing is independent of Y(o) and Y(m).

Data are missing at random (MAR) when the probability that an individual value will be missing is independent of Y(m) (but may depend on Y(o)). Missing data are nonignorable (missing not at random, MNAR) when the probability that an individual value will be missing depends on Y(m).

Note: Under the first two mechanisms (MCAR and MAR), the missing data mechanism is often referred to as being ‘ignorable’. If missingness depends only on covariates X, then technically it is MCAR; this is sometimes referred to as covariate-dependent non-response.

Thus, in general, if non-response depends on covariates, X, it is harmless and the same as MCAR provided you always condition on the covariates (i.e., incorporate the covariate in the analysis). This type of missingness is only a problem if you do not condition on X.
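
A minimal simulation sketch of this point, assuming a hypothetical linear model for Y given X and a missingness probability that depends only on X: a complete-case regression that conditions on X recovers the true coefficients, while the unconditional mean of the observed Y is biased.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000
x = rng.normal(0, 1, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, n)          # assumed true model: Y = 2 + 3X + e

p_miss = 1 / (1 + np.exp(-(x - 0.5)))            # P(missing) depends only on X
observed = rng.uniform(size=n) > p_miss          # keep Y where it is not missing

# Complete-case estimates of intercept and slope (least squares on observed rows)
X_obs = np.column_stack([np.ones(observed.sum()), x[observed]])
beta_cc, *_ = np.linalg.lstsq(X_obs, y[observed], rcond=None)
print("complete-case estimates:", beta_cc)       # close to (2, 3)

# By contrast, the unconditional mean of the observed Y is biased, because
# high-X (hence high-Y) subjects are more likely to be missing.
print("mean of observed Y:", y[observed].mean(), "vs population mean:", y.mean())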

Example Suppose that systolic blood pressures of N participants are recorded in January (X). Some of them have a second reading in February (Y), but others do not. Table 1 shows simulated data for N = 30 participants drawn from a bivariate normal population with means μx = μy = 125, standard deviations σx = σy = 25, and correlation ρ = 0.60.

Three different missingness mechanisms for the February values were then imposed. Under the first, the February values were deleted completely at random (MCAR). Under the second, the February value was recorded only for those whose January measurements exceeded 140 (X > 140), a level used for diagnosing hypertension; this is MAR but not MCAR. Under the third, the February value was recorded only for those whose February measurements exceeded 140 (Y > 140).

This could happen, for example, if all individuals returned in February, but the staff person in charge decided to record the February value only if it was in the hypertensive range. This third mechanism is an example of MNAR. (Other MNAR mechanisms are possible; e.g., the February measurement may be recorded only if it is substantially different from the January reading.)

Notice that as we move from MCAR to MAR to MNAR, the observed Y values become an increasingly select and unusual group relative to the population; the sample mean increases, and the standard deviation decreases. This phenomenon is not a universal feature of MCAR, MAR, and MNAR, but it does happen in many realistic examples.
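
A short simulation sketch of this example, assuming the same population values (means 125, standard deviations 25, correlation 0.60) and a large sample so that the pattern is easy to see: as we move from MCAR to MAR to MNAR, the mean of the observed Y rises and its standard deviation shrinks.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
mean = [125, 125]
cov = [[25**2, 0.60 * 25 * 25],
       [0.60 * 25 * 25, 25**2]]               # sd = 25, correlation = 0.60
x, y = rng.multivariate_normal(mean, cov, size=n).T

mechanisms = {
    "MCAR": rng.uniform(size=n) < 0.5,        # Y recorded for a random half
    "MAR":  x > 140,                          # Y recorded only if X > 140
    "MNAR": y > 140,                          # Y recorded only if Y > 140
}
print(f"population: mean Y = {y.mean():6.1f}, sd Y = {y.std():5.1f}")
for name, recorded in mechanisms.items():
    y_obs = y[recorded]
    print(f"{name:10s}: mean Y = {y_obs.mean():6.1f}, sd Y = {y_obs.std():5.1f}")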

Methods of Handling Missing Data Complete Case Methods: These methods omit all cases with missing values at any measurement occasion. Drawbacks: Can result in a very substantial loss of information, which has an impact on precision and power. Can give severely biased results if the complete cases are not a random sample of the population of interest, i.e. complete case methods require the MCAR assumption.
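
A minimal pandas sketch of a complete-case analysis on a small hypothetical data set (the variable names y1-y3 are placeholders): every subject with any missing measurement is dropped before the analysis.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "y1": [110.0, 130.0, 125.0, 150.0, 118.0],
    "y2": [112.0, np.nan, 128.0, np.nan, 120.0],
    "y3": [115.0, np.nan, np.nan, 155.0, 119.0],
})

complete = df.dropna()                 # keep only fully observed subjects
print(f"{len(complete)} of {len(df)} subjects retained")
print(complete.mean())                 # estimates use the complete cases only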

All Available Case Methods: This is a general term for a variety of different methods that use the available information to estimate means and covariances (the latter based on all available pairs of cases). In general, these methods are more efficient than complete case methods (and can be fully efficient in some cases).

Drawbacks: The sample base of cases changes across measurement occasions. Available case methods require the MCAR assumption (in some cases, MAR).
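
For comparison, a sketch of an all-available-case analysis on the same kind of hypothetical data: means use every observed value of each variable, and each covariance uses all pairs of cases for which both values are observed (pandas does this pairwise deletion by default).

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "y1": [110.0, 130.0, 125.0, 150.0, 118.0],
    "y2": [112.0, np.nan, 128.0, np.nan, 120.0],
    "y3": [115.0, np.nan, np.nan, 155.0, 119.0],
})

print(df.mean())                # per-variable means from all available values
print(df.cov())                 # pairwise-complete covariances
print(df.count())               # note: the sample base differs by variable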

Single Imputation Methods: These are methods that fill in the missing values. Once imputation is done, the analysis is straightforward.

Drawbacks: They systematically underestimate variances and covariances. Treating imputed values as if they were real data leads to standard errors that are too small (multiple imputation addresses this problem). Simple schemes such as mean imputation can also produce biased estimates of variances and associations even under MCAR.
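
A small sketch, assuming MCAR deletion and simple mean imputation, illustrating why single imputation understates uncertainty: the filled-in values shrink the variance, and a naive standard error that treats the imputed values as real data is too small.

import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(125, 25, size=500)
miss = rng.uniform(size=y.size) < 0.4            # 40% missing completely at random
y_obs = y[~miss]

y_imputed = y.copy()
y_imputed[miss] = y_obs.mean()                   # mean imputation

n = y.size
print("sd, observed cases only :", y_obs.std(ddof=1))
print("sd, mean-imputed data   :", y_imputed.std(ddof=1))   # too small
# Naive SE of the mean treats the imputed values as if they were real data:
print("naive SE with imputation:", y_imputed.std(ddof=1) / np.sqrt(n))
print("SE from observed cases  :", y_obs.std(ddof=1) / np.sqrt(y_obs.size))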

Last Value Carried Forward: Set the response equal to the last observed value (or sometimes the ‘worst’ observed value). In general, LVCF is not recommended!
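
A sketch of the mechanics of LVCF in pandas, using hypothetical long-format data (shown only for illustration, since, as noted above, LVCF is generally not recommended): within each subject, a missing visit is filled with the most recent observed value.

import numpy as np
import pandas as pd

long = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "visit": [1, 2, 3, 1, 2, 3],
    "y":     [140.0, 135.0, np.nan, 150.0, np.nan, np.nan],
})

long["y_lvcf"] = long.groupby("id")["y"].ffill()   # carry the last observation forward
print(long)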

Likelihood-based Methods: At least in principle, maximum likelihood estimation for incomplete data is the same as for complete data and provides valid estimates and standard errors under more general circumstances than the complete case, available case, or single imputation methods above. That is, under clearly stated assumptions, likelihood-based methods have optimal statistical properties.

For example, if missing data are ‘ignorable’ (MCAR/MAR), likelihood-based methods (e.g. PROC MIXED) simply maximize the marginal distribution of the observed responses. If missing data are ‘non-ignorable’, likelihood-based inference must also explicitly (and correctly) model the non-response process. However, with ‘non-ignorable’ missingness the methods are very sensitive to unverifiable assumptions.
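
A sketch of a likelihood-based analysis under MAR dropout, using hypothetical longitudinal data and a linear mixed model from statsmodels (analogous in spirit to PROC MIXED); all data-generating values are assumptions. Fitting the model to the observed rows maximizes the marginal likelihood of the observed responses, which is valid here because the dropout depends only on previously observed values.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_subj, n_vis = 200, 4
ids = np.repeat(np.arange(n_subj), n_vis)
time = np.tile(np.arange(n_vis), n_subj)
b0 = rng.normal(0, 5, n_subj)                                  # random intercepts
y = 125 + b0[ids] - 2.0 * time + rng.normal(0, 5, ids.size)    # true time slope = -2

# Monotone MAR dropout: the chance of dropping out before the next visit
# depends on the value just observed (lower values -> more dropout).
y_mat = y.reshape(n_subj, n_vis)
keep = np.ones_like(y_mat, dtype=bool)
for t in range(1, n_vis):
    p_drop = 1 / (1 + np.exp((y_mat[:, t - 1] - 118) / 4))
    keep[:, t] = keep[:, t - 1] & (rng.uniform(size=n_subj) > p_drop)

df = pd.DataFrame({"id": ids, "time": time,
                   "y": np.where(keep.ravel(), y, np.nan)}).dropna()
fit = smf.mixedlm("y ~ time", data=df, groups=df["id"]).fit()
print(fit.params["time"])     # close to the true slope of -2 despite the dropout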

Weighting Methods: Base estimation on observed data, but weight the data to account for missing data. Basic idea: some sub-groups of the population are under-represented in the observed data, therefore weight these up to compensate for under-representation.

For example, with dropout, can estimate the weights as a function of the individual’s covariates and responses up until the time of dropout. This approach is valid provided the model for dropout is correct, i.e. provided the correct weights are available.
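
A sketch of inverse-probability weighting for a hypothetical two-visit study in which dropout before the second visit depends only on the observed baseline value Y1; the dropout model and parameter values are assumptions. The probability of remaining in the study is estimated from the observed data, and each completer is weighted by the inverse of that probability.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 20_000
y1 = rng.normal(125, 25, n)
y2 = 125 + 0.6 * (y1 - 125) + rng.normal(0, 20, n)

# Dropout before visit 2 depends only on the observed baseline Y1 (MAR)
p_stay = 1 / (1 + np.exp(-(y1 - 125) / 15))
stay = rng.uniform(size=n) < p_stay

# Step 1: logistic model for the probability of staying, given Y1
design = sm.add_constant(y1)
p_hat = sm.Logit(stay.astype(float), design).fit(disp=0).predict(design)

# Step 2: weight completers by 1 / estimated probability of being observed
w = 1.0 / p_hat[stay]
ipw_mean = np.sum(w * y2[stay]) / np.sum(w)

print("population mean of Y2 :", y2.mean())
print("unweighted completers :", y2[stay].mean())   # biased upward
print("IPW estimate          :", ipw_mean)          # close to the population mean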

Multiple imputation: Each missing value is replaced by a list of m > 1 simulated values. Each of the m completed data sets is analyzed in the same fashion by a complete-data method. The results are then combined by simple arithmetic to obtain overall estimates and standard errors that reflect missing-data uncertainty as well as finite-sample variation.
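
A simplified sketch of the imputation and analysis steps, assuming the blood-pressure setting above (Y missing for some subjects, X fully observed) and a stochastic regression imputation; for brevity it does not propagate the uncertainty in the fitted regression coefficients, which a fully 'proper' multiple imputation would do.

import numpy as np

rng = np.random.default_rng(4)
n, m = 500, 20
x = rng.normal(125, 25, n)
y = 125 + 0.6 * (x - 125) + rng.normal(0, 20, n)
missing = rng.uniform(size=n) < 1 / (1 + np.exp(-(x - 140) / 10))   # MAR: depends on X

# Fit Y ~ X on the completers once; sigma is the residual standard deviation
X_obs = np.column_stack([np.ones((~missing).sum()), x[~missing]])
beta, res, *_ = np.linalg.lstsq(X_obs, y[~missing], rcond=None)
sigma = np.sqrt(res[0] / (X_obs.shape[0] - 2))

estimates, variances = [], []
for _ in range(m):
    # Impute each missing Y from the fitted regression plus random noise
    y_imp = y.copy()
    y_imp[missing] = beta[0] + beta[1] * x[missing] + rng.normal(0, sigma, missing.sum())
    # Complete-data analysis: estimate the mean of Y and its squared standard error
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / n)

# The m pairs (estimate, variance) are then combined with Rubin's rules (see below).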

The simplest method for combining the results of m analyses is Rubin’s (1987) method for a scalar (one-dimensional) parameter. Suppose that Q represents a population quantity (e.g., a regression coefficient) to be estimated. Let Q* denote the estimate of Q, and U* its squared standard error, that one would use if no data were missing.

The method assumes that the sample is large enough so that (Q* − Q)/sqrt(U*) has approximately a standard normal distribution, so that Q* ± 1.96 sqrt(U*) has about 95% coverage.

Of course, we cannot compute Q* and U*; rather, we have m different versions of them, [Q*(j), U*(j)], j = 1, ..., m. Rubin’s (1987) overall estimate is simply the average of the m estimates, Qbar = (1/m) Σ Q*(j).

The uncertainty in Qbar has two parts: the average within-imputation variance, Ubar = (1/m) Σ U*(j), and the between-imputations variance, B = (1/(m − 1)) Σ (Q*(j) − Qbar)^2.

The total variance is a modified sum of the two components, T = Ubar + (1 + 1/m) B, and the square root of T is the overall standard error.
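
A sketch implementing these combining rules as a small function; the example estimates and squared standard errors passed to it are hypothetical, and the function could equally be applied to the (estimates, variances) lists produced by the imputation sketch above.

import numpy as np

def rubin_pool(estimates, variances):
    """Combine m complete-data estimates and their squared SEs with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)        # Q*(j), j = 1..m
    u = np.asarray(variances, dtype=float)        # U*(j) = squared standard errors
    m = q.size
    q_bar = q.mean()                              # overall estimate Qbar
    u_bar = u.mean()                              # within-imputation variance Ubar
    b = q.var(ddof=1)                             # between-imputations variance B
    t = u_bar + (1 + 1 / m) * b                   # total variance T
    return q_bar, np.sqrt(t)                      # estimate and overall standard error

# Hypothetical example with m = 5 complete-data results
q_hat, se = rubin_pool([132.1, 131.4, 133.0, 131.8, 132.5],
                       [1.10, 1.05, 1.12, 1.08, 1.11])
print(f"pooled estimate = {q_hat:.2f}, standard error = {se:.3f}")
print(f"95% interval    = ({q_hat - 1.96 * se:.2f}, {q_hat + 1.96 * se:.2f})")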