Analysis of Mismeasured Data David Yanez Department of Biostatistics University of Washington July 5, 2005 Biost/Stat 579
Outline Background Examples Assessing the extent of the bias Approaches to data analysis
Background Loose Definition - Measurement error is the difference between a measured value and its true value. It typically results from shortcomings in the measurement process (e.g., equipment, short-term variation, recall, human error). Misconceptions: Contrary to common belief, bias due to measurement error does not diminish as the sample size increases, and it does not always attenuate estimates toward the null.
Examples Nurses’ Health Study Investigate the association between breast cancer and (alcohol, nutrition) intake. Cardiovascular Health Study Investigate the association between carotid IMT (wall thickness) and CVD risk factors (smoking status, systolic bp, diabetes). Predicting MI risk using cholesterol, systolic bp, carotid IMT, age, gender, race, smoking status, alcohol and fat intake. Investigate 3-year change in carotid IMT and CVD risk factors.
Example Illustration of an additive measurement error model. Filled circles are the true (Y, X) data, and the steeper line is the OLS fit to those data. Empty circles are the observed data (Y, W), and the flatter (attenuated) line is the OLS fit to them. Model: $Y = X + e$, $e \sim (0, 0.25)$, $X \sim (0, 1)$, $W = X + U$, $U \sim (0, 1)$.
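The attenuation in this illustration can be reproduced with a short simulation. The sketch below is illustrative only: it assumes normal distributions and reads the second entry of each (0, ·) as a variance, neither of which the slide states. The function name, sample size, and seed are arbitrary choices.

```python
import numpy as np

def simulate_attenuation(n=100_000, seed=0):
    """Simulate Y = X + e, W = X + U and return the OLS slopes on X and on W."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n)   # true predictor, variance 1 (normality assumed)
    e = rng.normal(0.0, 0.5, n)   # sd 0.5, i.e. variance 0.25
    u = rng.normal(0.0, 1.0, n)   # measurement error, variance 1
    y = x + e
    w = x + u                     # observed, error-prone predictor
    slope_on_x = np.polyfit(x, y, 1)[0]   # polyfit returns [slope, intercept]
    slope_on_w = np.polyfit(w, y, 1)[0]
    return slope_on_x, slope_on_w

slope_true, slope_obs = simulate_attenuation()
print(f"OLS slope using X: {slope_true:.3f}")
print(f"OLS slope using W: {slope_obs:.3f}")
```

With these variances the reliability ratio is 1/(1 + 1) = 0.5, so the slope fitted on W should land near half the slope fitted on X, no matter how large n is made.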
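Example – Simple Regression The above illustration is an example of attenuation bias. We have $W = X + U$, $U \sim (0, \sigma_u^2)$, $\mathrm{cov}(X, U) = 0$, so the OLS slope from regressing Y on W is $\beta_1^* = \beta_1 \sigma_x^2 / (\sigma_x^2 + \sigma_u^2) = \lambda \beta_1$ ($\lambda$ = reliability ratio). What if the observed variable, W, was not unbiased for X (e.g., dietary intake of saturated fat)? $W = \gamma_0 + \gamma_1 X + U$, $U \sim (0, \sigma_u^2)$, $\mathrm{cov}(X, U) = \sigma_{xu}$. With e uncorrelated with X and U, $\mathrm{cov}(Y, W) = \beta_1 (\gamma_1 \sigma_x^2 + \sigma_{xu})$ and $\mathrm{Var}(W) = \gamma_1^2 \sigma_x^2 + 2 \gamma_1 \sigma_{xu} + \sigma_u^2$, so $\beta_1^* = \beta_1 (\gamma_1 \sigma_x^2 + \sigma_{xu}) / (\gamma_1^2 \sigma_x^2 + 2 \gamma_1 \sigma_{xu} + \sigma_u^2)$. Residual variances are also adversely affected. We have Var(Y|W) > Var(Y|X).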
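A simulation can serve as a rough check on the slope under a biased, correlated error model and on the inflation of the residual variance. This is a sketch under assumptions that are not in the slide: (X, U) bivariate normal, a normal e uncorrelated with both, and arbitrary illustrative parameter values. `naive_slope` and `residual_variance` are hypothetical helpers; the first simply evaluates cov(Y, W)/Var(W) in closed form.

```python
import numpy as np

def naive_slope(beta1, gamma1, var_x, var_u, cov_xu):
    """Closed-form OLS slope of Y on W = gamma0 + gamma1*X + U: cov(Y, W) / Var(W)."""
    cov_yw = beta1 * (gamma1 * var_x + cov_xu)
    var_w = gamma1**2 * var_x + 2.0 * gamma1 * cov_xu + var_u
    return cov_yw / var_w

def residual_variance(pred, y):
    """Residual variance from a simple OLS fit of y on pred."""
    slope, intercept = np.polyfit(pred, y, 1)
    return np.var(y - (intercept + slope * pred))

# Illustrative (assumed) parameter values
beta1, gamma0, gamma1 = 1.0, 0.5, 0.8
var_x, var_u, cov_xu = 1.0, 0.5, -0.2

rng = np.random.default_rng(42)
n = 200_000
cov_matrix = [[var_x, cov_xu], [cov_xu, var_u]]
xu = rng.multivariate_normal([0.0, 0.0], cov_matrix, size=n)
x, u = xu[:, 0], xu[:, 1]
y = beta1 * x + rng.normal(0.0, 0.5, n)   # e has variance 0.25
w = gamma0 + gamma1 * x + u               # biased, error-prone measure of X

slope_formula = naive_slope(beta1, gamma1, var_x, var_u, cov_xu)
slope_sim = np.polyfit(w, y, 1)[0]
resvar_x = residual_variance(x, y)
resvar_w = residual_variance(w, y)

print(f"slope (closed form): {slope_formula:.3f}, slope (simulated): {slope_sim:.3f}")
print(f"residual variance given X: {resvar_x:.3f}, given W: {resvar_w:.3f}")
```

Setting `gamma1 = 1` and `cov_xu = 0` in `naive_slope` recovers the classical reliability-ratio result.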
Example – Multiple Regression Suppose a single predictor in a multiple regression is measured with error (e.g., carotid IMT). $Y = \beta_0 + \beta_1 X + \beta_2 A + \beta_3 G + e$, $W = X + U$, $U \sim (0, \sigma_u^2)$, $\mathrm{cov}(X, U) = 0$. One can show that the OLS estimates for $\beta_1$, $\beta_2$ and $\beta_3$ will be biased, i.e., $\beta_1^* = \lambda_1 \beta_1$, where $\lambda_1 = \sigma_{x|A,G}^2 / (\sigma_{x|A,G}^2 + \sigma_u^2)$, and $\beta_j^* = \beta_j + \beta_1 (1 - \lambda_1) \gamma_j$, $j = 2, 3$, where $E[X|A,G] = \gamma_0 + \gamma_2 A + \gamma_3 G$. The coefficients for age (A) and gender (G) will be biased unless they are uncorrelated with the true carotid IMT (i.e., $\gamma_j = 0$).
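The way error in one predictor leaks into the coefficients of the error-free covariates can be seen in a small simulation. The sketch below uses made-up parameter values, a standardized normal "age" and a binary "gender", and normal errors; all of these are assumptions for illustration, not values from any study. It compares the naive OLS coefficients with the limits $\lambda_1 \beta_1$ and $\beta_j + \beta_1 (1 - \lambda_1) \gamma_j$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Illustrative (assumed) parameters
beta = np.array([0.0, 1.0, 0.5, -0.3])   # beta0, beta1 (X), beta2 (A), beta3 (G)
gamma0, gamma2, gamma3 = 0.0, 0.6, 0.4   # E[X | A, G] = gamma0 + gamma2*A + gamma3*G
var_x_given_ag = 1.0                     # Var(X | A, G)
var_u = 1.0                              # measurement error variance

a = rng.normal(0.0, 1.0, n)                  # standardized age
g = rng.integers(0, 2, n).astype(float)      # binary gender indicator
x = gamma0 + gamma2 * a + gamma3 * g + rng.normal(0.0, np.sqrt(var_x_given_ag), n)
y = beta[0] + beta[1] * x + beta[2] * a + beta[3] * g + rng.normal(0.0, 0.5, n)
w = x + rng.normal(0.0, np.sqrt(var_u), n)   # error-prone version of X

# Naive OLS of Y on (1, W, A, G)
design = np.column_stack([np.ones(n), w, a, g])
naive, *_ = np.linalg.lstsq(design, y, rcond=None)

# Large-sample limits of the naive slopes
lam1 = var_x_given_ag / (var_x_given_ag + var_u)
predicted = np.array([
    lam1 * beta[1],
    beta[2] + beta[1] * (1.0 - lam1) * gamma2,
    beta[3] + beta[1] * (1.0 - lam1) * gamma3,
])

print("naive slopes (W, A, G):    ", np.round(naive[1:], 3))
print("predicted limits (W, A, G):", np.round(predicted, 3))
```

In this setup the coefficient on A is pushed away from its true value and the coefficient on G is pulled toward zero, even though neither A nor G is measured with error.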
Assessing the extent of the bias Data sources: Internal subsets of the primary data. External or independent studies. Validation data – subset in which X is observed directly. Replication data – replicates of W are available. Instrumental data – a variable T is observed in addition to W. Internal data are preferred to external data. Assumptions about data transportability need to be made when comparing data from different studies. Validation data are preferred to replication or instrumental data.
Assessing the extent of the bias Is the error model known? Typically not, but plausible models could be used to assess the amount of error and the direction of the bias. Example: Study of the association between breast cancer and alcohol and fat intake. Under-reporting fat or alcohol intake may reduce the amount of measurement error bias (assuming the true association is positive). Example: Study of the association between STDs and number of sexual partners. Over-reporting the number of partners may increase the amount of measurement error bias (assuming the true association is positive). In both of the above examples, the observed association will be attenuated toward the null.
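One plausible error model for systematic mis-reporting is proportional reporting, $W = \gamma_1 X + U$ with U uncorrelated with X, for which the naive slope is $\beta_1 \gamma_1 \sigma_x^2 / (\gamma_1^2 \sigma_x^2 + \sigma_u^2)$. The sketch below evaluates that expression for a few values of $\gamma_1$. The model, the helper name `observed_slope`, and all numbers are assumptions chosen for illustration; the ordering shown need not hold for other values (for example, with a much larger error variance).

```python
def observed_slope(beta1, gamma1, var_x, var_u):
    """Naive OLS slope when W = gamma1*X + U with cov(X, U) = 0 (assumed model)."""
    return beta1 * gamma1 * var_x / (gamma1**2 * var_x + var_u)

beta1, var_x, var_u = 1.0, 1.0, 0.25   # illustrative values, true slope is positive

scenarios = [("under-reporting", 0.8), ("unbiased", 1.0), ("over-reporting", 1.2)]
slopes = {label: observed_slope(beta1, g1, var_x, var_u) for label, g1 in scenarios}

for label, slope in slopes.items():
    print(f"{label:>15}: naive slope = {slope:.3f} (true slope = {beta1})")
```

For these values all three naive slopes fall below the true slope, with under-reporting the least attenuated and over-reporting the most.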
Approaches to data analysis Bias correction methods: Method of Moments: Components of bias known (e.g., $\sigma_u^2$, $\lambda$): simple. Bias components unknown: estimate the bias terms, then compute SE estimates (bootstrap, sandwich). Corrected estimating equations (Huang, Wang, 2000). Related methods: Regression calibration (Carroll et al. 1995, Ch. 3). The choice of method depends in part on the type of auxiliary data available and the assumptions one is willing to make.
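As one concrete instance of a method-of-moments correction with unknown bias components, the sketch below assumes replication data: two measurements W1 and W2 per subject with independent classical errors of equal variance. Under that assumption $\sigma_u^2$ can be estimated from the within-subject differences, the reliability of the replicate mean follows, and a bootstrap gives an SE. The function names and the simulated data are hypothetical, and this is not presented as the specific estimator of any cited reference.

```python
import numpy as np

def mom_corrected_slope(y, w1, w2):
    """Method-of-moments corrected slope using two replicates of W (classical error assumed)."""
    wbar = (w1 + w2) / 2.0
    var_u_hat = np.var(w1 - w2, ddof=1) / 2.0      # Var(W1 - W2) = 2 * sigma_u^2
    var_wbar = np.var(wbar, ddof=1)
    # The replicate mean carries error variance sigma_u^2 / 2
    lam_hat = (var_wbar - var_u_hat / 2.0) / var_wbar
    naive = np.cov(wbar, y, ddof=1)[0, 1] / var_wbar
    return naive / lam_hat, naive, lam_hat

def bootstrap_se(y, w1, w2, n_boot=200, seed=1):
    """Nonparametric bootstrap SE of the corrected slope (resampling subjects)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)
        estimates[b] = mom_corrected_slope(y[idx], w1[idx], w2[idx])[0]
    return estimates.std(ddof=1)

# Simulated illustration (assumed values): true slope 1, sigma_x^2 = 1, sigma_u^2 = 1
rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 0.5, n)
w1 = x + rng.normal(0.0, 1.0, n)
w2 = x + rng.normal(0.0, 1.0, n)

corrected, naive, lam_hat = mom_corrected_slope(y, w1, w2)
se = bootstrap_se(y, w1, w2)
print(f"naive slope: {naive:.3f}, estimated reliability: {lam_hat:.3f}")
print(f"corrected slope: {corrected:.3f} (bootstrap SE {se:.3f})")
```

The bootstrap resamples whole subjects so that the uncertainty in the estimated reliability is carried into the SE of the corrected slope.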
Approaches to data analysis Sensitivity analyses: In the absence of auxiliary data, one could specify a range of values for components of bias to see whether the significance of association changes. Example: Association between change in carotid IMT versus age, gender, diabetes, smoking and baseline IMT.
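A sensitivity analysis of this kind can be sketched by recomputing moment-corrected coefficients over a grid of assumed error variances. In the sketch below, `w` stands in for an error-prone covariate (such as baseline IMT) and `a` for an error-free one (such as age); the data are simulated, and the helper name, grid, and parameter values are all hypothetical. Only point estimates are shown; judging significance would also require SEs, for example from a bootstrap.

```python
import numpy as np

def corrected_coefs(y, w, z, var_u):
    """Slopes for (w, z) after subtracting an assumed error variance var_u from Var(w)."""
    p = np.column_stack([w, z])
    pc = p - p.mean(axis=0)
    yc = y - y.mean()
    n = len(y)
    s_pp = pc.T @ pc / (n - 1)   # sample covariance matrix of the predictors
    s_py = pc.T @ yc / (n - 1)   # sample covariances of predictors with y
    s_pp[0, 0] -= var_u          # moment correction applied to the error-prone predictor
    return np.linalg.solve(s_pp, s_py)

# Simulated illustration (assumed values): A has no direct effect on Y
rng = np.random.default_rng(3)
n = 200_000
a = rng.normal(0.0, 1.0, n)
x = 0.6 * a + rng.normal(0.0, 1.0, n)
y = 1.0 * x + 0.0 * a + rng.normal(0.0, 0.5, n)
w = x + rng.normal(0.0, 1.0, n)   # true error variance is 1 in this simulation

grid = (0.0, 0.25, 0.5, 1.0)      # assumed error variances; 0.0 gives the naive fit
sensitivity = {v: corrected_coefs(y, w, a, v) for v in grid}

for v, coefs in sensitivity.items():
    print(f"assumed var_u = {v:4.2f}: slope on W = {coefs[0]:.3f}, slope on A = {coefs[1]:.3f}")
```

In this simulation the naive fit attributes a spurious effect to A, which shrinks toward zero as the assumed error variance approaches the value used to generate the data.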
Approaches to data analysis Analysis of data conditional on observed variables (similar to analysis of incomplete data). This may be the analysis of interest (e.g., prediction of carotid IMT). Exercise caution in interpreting results. Observed associations may differ greatly from associations of the unobserved variables. Sensitivity analysis may be useful in guessing bounds on the degree of association (IMT analysis). Study designs (e.g., randomized trials) can, to some extent, remedy some ills caused by measurement error.