Published by Kerry Fields. Modified over 9 years ago.
1
Missing Data Estimation in Longitudinal Research: It’s not Cheating! It’s Essential!
Todd D. Little, University of Kansas
Director, Quantitative Training Program
Director, Center for Research Methods and Data Analysis
Director, Undergraduate Social and Behavioral Sciences Methodology Minor
Member, Developmental Psychology Training Program
crmda.KU.edu
Workshop presented 03-29-2011 @ Society for Research in Child Development
2
Road Map
- Learn about the different types of missing data
- Learn about ways in which the missing data process can be recovered
- Understand why imputing missing data is not cheating
- Learn why NOT imputing missing data is more likely to lead to errors in generalization!
- Learn about intentionally missing designs
- Introduce a simple method for significance testing
- Discuss imputation with large longitudinal datasets
3
Key Considerations
- Recoverability
  - Is it possible to recover what the sufficient statistics would have been had there been no missing data? (sufficient statistics = means, variances, and covariances)
  - Is it possible to recover what the parameter estimates of a model would have been had there been no missing data?
- Bias
  - Are the sufficient statistics/parameter estimates systematically different from what they would have been had there not been any missing data?
- Power
  - Do we have the same or similar rates of power (1 – Type II error rate) as we would without missing data?
4
Types of Missing Data
- Missing Completely at Random (MCAR)
  - No association with unobserved variables (selective process) and no association with observed variables
- Missing at Random (MAR)
  - No association with unobserved variables, but may be related to observed variables
  - Random in the statistical sense of predictable
- Non-random (Selective) Missing (NMAR)
  - Some association with unobserved variables and maybe with observed variables
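The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variable names, effect sizes, and missingness probabilities are assumptions chosen for the example, not values from the workshop. It compares the full-sample mean of an outcome `y` with the mean of the cases that remain observed when missingness is unrelated to anything (MCAR), depends on an observed covariate `x` (MAR), or depends on `y` itself (NMAR).

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=1):
    """Return (full-sample mean of y, complete-case mean of y) for one mechanism."""
    rng = random.Random(seed)
    all_y = []
    observed = []
    for _ in range(n):
        x = rng.gauss(0, 1)                 # fully observed covariate
        y = 0.6 * x + rng.gauss(0, 0.8)     # outcome that may go missing
        if mechanism == "MCAR":
            p_miss = 0.3                    # unrelated to x or y
        elif mechanism == "MAR":
            p_miss = 0.6 if x > 0 else 0.1  # depends only on the observed x
        elif mechanism == "NMAR":
            p_miss = 0.6 if y > 0 else 0.1  # depends on the value that goes missing
        else:
            raise ValueError(mechanism)
        all_y.append(y)
        if rng.random() >= p_miss:          # keep y with probability 1 - p_miss
            observed.append(y)
    return statistics.fmean(all_y), statistics.fmean(observed)

results = {m: simulate(m) for m in ("MCAR", "MAR", "NMAR")}
for name, (full, cc) in results.items():
    print(f"{name}: full-sample mean = {full:.3f}, complete-case mean = {cc:.3f}")
```

Under these assumed settings the complete-case mean stays close to the full-sample mean for MCAR and is pulled downward for MAR and NMAR. In the MAR case the cause of missingness (`x`) is still in the data, which is what makes that bias recoverable in principle.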
5
Effects of imputing missing data
6
Effects of imputing missing data

                          No Association with           An Association with
                          Observed Variable(s)          Observed Variable(s)
No Association with       MCAR                          MAR
Unobserved/Unmeasured     Fully recoverable             Partly to fully recoverable
Variable(s)               Fully unbiased                Less biased to unbiased

An Association with       NMAR                          MAR/NMAR
Unobserved/Unmeasured     Unrecoverable                 Partly recoverable
Variable(s)               Biased (same bias as          Same to unbiased
                          not estimating)
7
Effects of imputing missing data

                          No Association with           An Association with           An Association with
                          ANY Observed Variable         Analyzed Variables            Unanalyzed Variables
No Association with       MCAR                          MAR                           MAR
Unobserved/Unmeasured     Fully recoverable             Partly to fully recoverable   Partly to fully recoverable
Variable(s)               Fully unbiased                Less biased to unbiased       Less biased to unbiased

An Association with       NMAR                          MAR/NMAR                      MAR/NMAR
Unobserved/Unmeasured     Unrecoverable                 Partly to fully recoverable   Partly to fully recoverable
Variable(s)               Biased (same bias as          Same to unbiased              Same to unbiased
                          not estimating)

Statistical Power: Will always be greater when missing data is imputed!
8
Modern Missing Data Analysis: MI or FIML
- In 1978, Rubin proposed Multiple Imputation (MI)
  - An approach especially well suited for use with large public-use databases.
  - First suggested in 1978 and developed more fully in 1987.
  - MI primarily uses the Expectation Maximization (EM) algorithm and/or the Markov Chain Monte Carlo (MCMC) algorithm.
- Beginning in the 1980s, likelihood approaches developed.
  - Multiple group SEM
  - Full Information Maximum Likelihood (FIML).
    - An approach well suited to more circumscribed models
9
Full Information Maximum Likelihood
- FIML maximizes the casewise log-likelihood (equivalently, minimizes the casewise -2 log-likelihood) of the available data, using an individual mean vector and covariance matrix for every observation.
- Since each observation’s mean vector and covariance matrix is based on its own unique response pattern, there is no need to fill in the missing data.
- The individual log-likelihood functions are then summed to create a combined likelihood function for the whole data frame.
- Individual likelihood functions with greater amounts of missing data are given less weight in the final combined likelihood function than those with a more complete response pattern, thus controlling for the loss of information.
- Formally, the function that FIML is maximizing is the sum of these casewise log-likelihoods.
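The slide's own formula is not reproduced in this transcript. As a point of reference, and assuming multivariate normality, the casewise log-likelihood is conventionally written as below. The notation is the standard textbook one and may differ from the slide's: x_i holds the variables observed for case i, μ_i and Σ_i are the elements of the model-implied mean vector and covariance matrix that correspond to those observed variables, and K_i is a constant that depends only on how many variables case i has observed.

```latex
\log L_i \;=\; K_i \;-\; \tfrac{1}{2}\log\lvert\Sigma_i\rvert
          \;-\; \tfrac{1}{2}\,(x_i-\mu_i)^{\top}\,\Sigma_i^{-1}\,(x_i-\mu_i),
\qquad
\log L(\mu,\Sigma) \;=\; \sum_{i=1}^{N}\log L_i
```

Because Σ_i shrinks to the observed rows and columns for each case, a case with fewer observed variables contributes a lower-dimensional term, which is one way to read the statement that such cases carry less weight.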
10
Multiple Imputation
- Multiple imputation involves generating m imputed datasets (usually between 20 and 100), running the analysis model on each of these datasets, and combining the m sets of results to make inferences.
  - By filling in m separate estimates for each missing value we can account for the uncertainty in that datum’s true population value.
- Data sets can be generated in a number of ways, but the two most common approaches are through an MCMC simulation technique such as Tanner & Wong’s (1987) Data Augmentation algorithm or through bootstrapping likelihood estimates, such as the bootstrapped EM algorithm used by Amelia II.
  - SAS uses data augmentation to pull random draws from a specified posterior distribution (i.e., stationary distribution of EM estimates).
- After m data sets have been created and the analysis model has been run on each separately, the resulting estimates are commonly combined with Rubin’s Rules (Rubin, 1987).
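The combining step can be sketched in a few lines. The function below is a minimal illustration of Rubin's rules for a single parameter; the function name `pool_rubin` and the example slopes and squared standard errors are hypothetical, made up for the illustration rather than taken from the workshop data.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = statistics.fmean(estimates)     # pooled point estimate
    w = statistics.fmean(variances)         # within-imputation variance
    b = statistics.variance(estimates)      # between-imputation variance (m - 1 denominator)
    t = w + (1 + 1 / m) * b                 # total variance
    return {"estimate": q_bar, "within": w, "between": b, "total": t, "se": math.sqrt(t)}

# Hypothetical slopes and squared standard errors from m = 5 imputed datasets.
pooled = pool_rubin([0.50, 0.54, 0.47, 0.52, 0.49],
                    [0.010, 0.011, 0.009, 0.010, 0.010])
print(pooled)
```

The pooled standard error is larger than the square root of the average within-imputation variance because the between-imputation term carries the uncertainty that the missing values add.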
11
Fraction Missing
- Fraction Missing is a measure of efficiency lost due to missing data. It is the extent to which parameter estimates have greater standard errors than they would have had all data been observed.
- It is a ratio of variances, built from these quantities:
  - Between-imputation variance
  - Estimated parameter variance in the complete data set
  - Total parameter variance taking into account missingness
12
Fraction Missing
- Fraction of Missing Information (asymptotic formula)
- Varies by parameter in the model
- Is typically smaller for MCAR than MAR data
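The slide's formula is not reproduced in this transcript. For orientation, the expressions usually given for the fraction of missing information under Rubin's rules are shown below; the symbols are assumptions of this note rather than the slide's notation: W is the within-imputation variance, B the between-imputation variance, m the number of imputations, and T the total variance.

```latex
T = W + \left(1+\tfrac{1}{m}\right)B, \qquad
r = \frac{\left(1+\tfrac{1}{m}\right)B}{W}, \qquad
\nu = (m-1)\left(1+\tfrac{1}{r}\right)^{2}, \qquad
\gamma = \frac{r + 2/(\nu+3)}{r+1}
\;\xrightarrow[\;m\to\infty\;]{}\; \frac{B}{W+B}
```

The limiting form on the right is the ratio of between-imputation variance to total variance, which matches the "ratio of variances" description of Fraction Missing.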
13
Estimate Missing Data With SAS

Obs  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6
  1     65     95     95    100     23     25     25     27
  2     10     10     40     25     25     27     28     27
  3     95    100    100    100     27     29     29     28
  4     90    100    100    100     30     30     27     29
  5     30     80     90    100     23     29     29     30
  6     40     50      .      .     28     27      3      3
  7     40     70    100     95     29     29     30     30
  8     95    100    100    100     28     30     29     30
  9     50     80     75     85     26     29     27     25
 10     55    100    100    100     30     30     30     30
 11     50    100    100    100     30     27     30     24
 12     70     95    100    100     28     28     28     29
 13    100    100    100    100     30     30     30     30
 14     75     90    100    100     30     30     29     30
 15      0      5     10      .      3      3      3      .
14
PROC MI

    PROC MI data=sample out=outmi seed=37851 nimpute=100;
      EM maxiter=1000;
      MCMC initial=em(maxiter=1000);
      VAR BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
    RUN;

- out= Designates output file for imputed data
- nimpute= # of imputed datasets (default is 5)
- Var Variables to use in imputation
15
PROC MI output: Imputed dataset

Obs  _Imputation_  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6
  1             1     65     95     95    100     23     25     25     27
  2             1     10     10     40     25     25     27     28     27
  3             1     95    100    100    100     27     29     29     28
  4             1     90    100    100    100     30     30     27     29
  5             1     30     80     90    100     23     29     29     30
  6             1     40     50     21     12     28     27      3      3
  7             1     40     70    100     95     29     29     30     30
  8             1     95    100    100    100     28     30     29     30
  9             1     50     80     75     85     26     29     27     25
 10             1     55    100    100    100     30     30     30     30
 11             1     50    100    100    100     30     27     30     24
 12             1     70     95    100    100     28     28     28     29
 13             1    100    100    100    100     30     30     30     30
 14             1     75     90    100    100     30     30     29     30
 15             1      0      5     10      8      3      3      3      2
16
What to Say to Reviewers:
- I pity the fool who does not impute – Mr. T
- If you compute you must impute – Johnnie Cochran
- Go forth and impute with impunity – Todd Little
- If math is God’s poetry, then statistics are God’s elegantly reasoned prose – Bill Bukowski
17
Missing Data and Estimation: Missingness by Design
- Assess all persons, but not all variables at each time of measurement
  - McArdle, Graham
  - Have core battery for all participants, but divide sample into groups and each group has additional measures
- Control entry into study, to estimate and control for retesting effects
  - Randomly assign participants to their entry into a longitudinal study and to the occasions of assessment
  - Likely to be key in providing unbiased estimates of growth or change
18
3-Form Intentionally Missing Design

Form  Common Variables  Variable Set A   Variable Set B   Variable Set C
1     ¼ of Variables    ¼ of Variables   ¼ of Variables   None
2     ¼ of Variables    ¼ of Variables   None             ¼ of Variables
3     ¼ of Variables    None             ¼ of Variables   ¼ of Variables
19
3-Form Protocol II

Form  Common Variables  Variable Set A     Variable Set B     Variable Set C
1     Marker Variables  1/3 of Variables   1/3 of Variables   None
2     Marker Variables  1/3 of Variables   None               1/3 of Variables
3     Marker Variables  None               1/3 of Variables   1/3 of Variables
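The assignment logic behind a 3-form design is simple enough to sketch. In the example below the set labels `X`, `A`, `B`, `C`, the helper names, and the nine participant IDs are all hypothetical: `X` stands for the common (or marker) variables that everyone receives, and each form leaves out exactly one of the rotating sets.

```python
import random

# Each form omits exactly one rotating set; the common set X goes to everyone.
FORMS = {1: ("X", "A", "B"), 2: ("X", "A", "C"), 3: ("X", "B", "C")}

def assign_forms(participant_ids, seed=0):
    """Randomly assign participants to the three forms in near-equal numbers."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: 1 + i % 3 for i, pid in enumerate(ids)}

def planned_pattern(assignment):
    """Map each participant to the variable sets they are administered."""
    return {pid: FORMS[form] for pid, form in assignment.items()}

assignment = assign_forms(range(1, 10))
pattern = planned_pattern(assignment)
for pid in sorted(pattern):
    print(pid, assignment[pid], pattern[pid])
```

Because assignment to forms is random, the resulting missingness is unrelated to participant characteristics. Every pair of variable sets also appears together on at least one form, so all covariances between sets remain estimable.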
20
Expansions of 3-Form Design (Graham, Taylor, Olchowski, & Cumsille, 2006)
21
Expansions of 3-Form Design (Graham, Taylor, Olchowski, & Cumsille, 2006)
22
2-Method Planned Missing Design
23
Controlled Enrollment
24
Growth-Curve Design

Group  Time 1   Time 2   Time 3   Time 4   Time 5
1      x        x        x        x        x
2      x        x        x        x        missing
3      x        x        x        missing  x
4      x        x        missing  x        x
5      x        missing  x        x        x
6      missing  x        x        x        x
25
Growth Curve Design II

Group  Time 1 Time 2 Time 3 Time 4 Time 5
1      xxxxx
2      xxx missing
3      xx x
4      x xx
5       xxx
6      xx x
7      x x x
8       xx x
9      x xx
10     missing x xx
11     missing xxx
27
Efficiency of Planned Missing Designs
28
Combined Elements
29
The Sequential Designs
30
Transforming to Accelerated Longitudinal
31
Transforming to Episodic Time
32
Simple Significance Testing with MI
- Generate multiply imputed datasets (m).
- Calculate a single covariance matrix on all N*m observations.
  - By combining information from all m datasets, this matrix should represent the best estimate of the population associations.
- Run the analysis model on this single covariance matrix and use the resulting estimates as the basis for inference and hypothesis testing.
  - The fit function from this approach should be the best basis for making inferences about model fit and significance.
- Using a Monte Carlo Simulation, we test the hypothesis that this approach is reasonable.
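The first two steps can be sketched as follows. This is a minimal illustration, not the workshop's own code: the function name `stacked_covariance` and the tiny two-variable example are hypothetical, and the covariance is computed by hand so the block needs no external libraries.

```python
def stacked_covariance(imputed_datasets):
    """Covariance matrix computed on all N*m stacked rows.

    imputed_datasets: list of m datasets, each a list of N rows of equal length.
    """
    rows = [row for dataset in imputed_datasets for row in dataset]
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        for j in range(p):
            for k in range(p):
                cov[j][k] += (row[j] - means[j]) * (row[k] - means[k])
    return [[value / (n - 1) for value in line] for line in cov]

# Two hypothetical imputed copies (m = 2) of a three-case, two-variable dataset;
# only the last case's second value was missing, so it differs across copies.
imputations = [
    [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0]],
    [[1.0, 2.0], [2.0, 4.0], [3.0, 7.0]],
]
cov = stacked_covariance(imputations)
print(cov)
```

When a matrix like this is passed to an SEM program, the sample size would presumably be set to the original N rather than N*m so that standard errors and fit statistics are not inflated by the stacking; the slide does not spell this out, so treat that as an assumption.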
33
Population Model
[Path diagram: Factor A with indicators A1 to A10 and Factor B with indicators B1 to B10]
Note: These are fully standardized parameter estimates
RMSEA = .047, CFI = .967, TLI = .962, SRMR = .021
34
Change in Chi-squared Test
Correlation Matrix Technique

Condition     PRB
10% Missing   -2.95%
30% Missing    4.39%
50% Missing    6.08%

www.Quant.KU.edu
35
Imputing with Large Datasets
- Create a BLOCK of variables that contains as much information about the dataset as possible and has no missing data
  - Reduce the data by creating scale averages
  - Reduce the data by estimating a set of principal components
  - Use both approaches
  - Impute missingness in the block.
  - Create product terms by key potential moderators and powered terms.
  - Reduce the data again
  - This block can be the auxiliary variables block in FIML estimation
- In a sequential set of steps impute the item-level data in groups of similar types of items
  - Use the BLOCK of variables in each set of multiple imputations.
  - Select the item-level data based on similarity of constructs.
  - Use as many items as possible.
- Save, sort, and merge the imputed datasets.
- Use the super matrix approach to analyze.
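Two of the block-building steps, scale averages and product/powered terms, can be sketched as below. The helper names, the variable names (`dep`, `anx`, `age`), and the values are hypothetical. The principal-components reduction and the imputation of the block itself are left out of the sketch, and `expand_block` assumes the block it receives already has no missing values.

```python
def scale_average(items):
    """Mean of the items a person answered; None if every item is missing."""
    present = [v for v in items if v is not None]
    return sum(present) / len(present) if present else None

def expand_block(block, moderators):
    """Add squared terms and products with the named moderator columns."""
    expanded = dict(block)
    for name, value in block.items():
        expanded[f"{name}^2"] = value ** 2
        for mod in moderators:
            if mod != name:
                # Sorted key so that a*b and b*a land on the same entry.
                key = "*".join(sorted((name, mod)))
                expanded[key] = value * block[mod]
    return expanded

# One hypothetical person: two scales averaged over available items, plus age.
person = {
    "dep": scale_average([2, None, 4]),
    "anx": scale_average([1, 3, None]),
    "age": 10.0,
}
block = expand_block(person, moderators=["age"])
print(block)
```

The expanded block grows quickly, which is presumably why the slide follows this step with "Reduce the data again".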
36
Thanks for your attention! Questions?
37
Update
Dr. Todd Little is currently at Texas Tech University
- Director, Institute for Measurement, Methodology, Analysis and Policy (IMMAP)
- Director, “Stats Camp”
- Professor, Educational Psychology and Leadership
- Email: yhat@ttu.edu
- IMMAP (immap.educ.ttu.edu)
- Stats Camp (Statscamp.org)