Non-response and missing data in longitudinal surveys
Traditional ways of handling attrition and missing data
- Weighting is typically used for attrition.
- The sample design and initial non-response provide the basic weights.
- For several waves, define ‘typical’ pathways and provide weights for each one, e.g. LSYP may require 12 or more.
- For item non-response, use ‘hot deck’ single imputation.
Problems with weighting procedures
- Inefficient: can only use complete data for each combination of variables analysed.
- Restrictive, since weights are only provided for the chosen ‘pathways’.
- Possibly inconsistent results, through different weights for different analyses.
- Not very transparent to use.
- Problematic for ‘structurally missing’ items.
Problems with hot deck imputation
- Not theoretically based.
- Selection of ‘matched’ cases may not always be possible, especially in multilevel data.
- Single imputation does not allow easy computation of standard errors.
Multiple imputation: briefly and simply
- Consider the model of interest (MOI).
- We turn this into a multivariate response model and obtain residual estimates (from an MCMC chain) for the cases where x or y is missing.
- Use these to ‘fill in’ the missing values and produce a complete data set.
- Do this (independently) n times (e.g. n = 20).
- Fit the MOI to each data set and combine the results according to the combination rules to get estimates and standard errors.
- Note that at the imputation stage we can use auxiliary data.
- Note also that we can handle attrition as missing data.
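The combination step can be sketched as follows, assuming the "rules" referred to are the standard Rubin's rules for pooling one parameter across the completed data sets. The function name `combine_rubin` and the example numbers are hypothetical, chosen only for illustration.

```python
import math
from statistics import mean, variance


def combine_rubin(estimates, std_errors):
    """Pool one parameter across m completed data sets using Rubin's rules.

    estimates  : point estimates of the parameter, one per completed data set
    std_errors : the matching standard errors, one per completed data set
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two data sets and matching lengths")

    pooled = mean(estimates)                       # average of the m estimates
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(estimates)                  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between         # total variance of the pooled estimate
    return pooled, math.sqrt(total)


# Hypothetical results from fitting the MOI to three completed data sets.
pooled, pooled_se = combine_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(pooled, pooled_se)
```

The between-imputation term is what a single imputation cannot supply: it carries the extra uncertainty due to the missing values into the standard error.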
What not to do
- Omit all records with missing data: inefficient.
- In categorical data, use an extra category for missing: biased.
- Plug in the mean over the non-missing values: biased.
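One symptom of the problem with plugging in the mean can be seen on a toy example: the filled-in variable has less spread than the observed values, so anything that depends on its variance is distorted. The data and the helper `mean_impute` below are hypothetical, and this sketch illustrates only the loss of variance, not every source of bias.

```python
from statistics import mean, variance


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical toy variable with two missing values.
values = [2.0, None, 4.0, 6.0, None, 8.0]
observed = [v for v in values if v is not None]
filled = mean_impute(values)

# The sample variance of the filled-in variable is smaller than that of
# the observed values, because every imputed point sits exactly at the mean.
print(variance(observed), variance(filled))
```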
Multiple imputation in MLwiN
- Existing methods assume normality.
- For multilevel data, they cannot handle level 2 variables with missing data.
- They cannot handle discrete variables with missing data.
- REALCOM-IMPUTE links REALCOM with MLwiN and can handle level 2 and discrete variables.
- It works by transforming discrete variables to normality using a ‘latent variable’ model, so that all response variables have a joint multivariate normal distribution, and then applies MI theory.
Partially observed data values
- Where we have a prior (estimated) probability distribution (PD) for a missing discrete variable value, we simply insert an extra MCMC step that accepts the ‘standard’ MI value with a probability that is just the probability given by the PD.
- A corresponding step is used for normal data.
- This uses all of the data efficiently: no data are discarded so long as it is possible to assign a PD.
- It may also reduce ‘partial response bias’.
- Several completed data sets are produced and combined as in standard MI.
- These procedures are computationally intensive, but once the completed data sets are produced they can be used for many different models, so long as a model uses only variables that were involved in the imputation procedure.
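The extra acceptance step for a discrete variable might look like the minimal sketch below. It is not REALCOM-IMPUTE code: the function `draw_with_prior`, its arguments, and the choice to redraw a candidate after a rejection are all assumptions made for illustration. Here `model_probs` stands in for the ‘standard’ MI draw and `prior_pd` for the prior PD.

```python
import random


def draw_with_prior(model_probs, prior_pd, rng, max_tries=10_000):
    """Draw one category for a partially observed discrete value.

    model_probs : dict mapping category -> probability under the standard
                  imputation model (hypothetical stand-in for the MI draw)
    prior_pd    : dict mapping category -> prior probability for this value
    rng         : a random.Random instance
    A candidate is drawn from the imputation model and accepted with
    probability prior_pd[candidate]; on rejection a new candidate is drawn
    (this retry rule is an assumption of the sketch).
    """
    categories = list(model_probs)
    weights = [model_probs[c] for c in categories]
    for _ in range(max_tries):
        candidate = rng.choices(categories, weights=weights, k=1)[0]
        if rng.random() < prior_pd.get(candidate, 0.0):
            return candidate
    raise RuntimeError("no candidate accepted; check that the PD overlaps the model")


rng = random.Random(0)
model_probs = {"a": 0.5, "b": 0.5}   # hypothetical imputation-model probabilities
prior_pd = {"a": 0.9, "b": 0.1}      # hypothetical prior PD for this value
draws = [draw_with_prior(model_probs, prior_pd, rng) for _ in range(20_000)]
print(draws.count("a") / len(draws))
```

Under this retry rule the accepted values follow a distribution proportional to the product of the model probability and the prior probability, so a category the PD rules out (probability 0) is never imputed.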
References
- Goldstein, H., Carpenter, J., Kenward, M. and Levin, K. (2009). Multilevel models with multivariate mixed response types. Statistical Modelling (to appear). Gives methodological background.
- Handling attrition and non-response in longitudinal data. International Journal of Longitudinal and Life Course Studies, April 2009. http://www.journal.longviewuk.com/index.php/llcs (discusses issues for longitudinal studies in detail).
Sampling weights
- Consider a 2-level model.
- Define level 2 weights, level 1 weights within the j-th level 2 unit, and final level 1 weights.
- We use a weight-based variable as the level 1 random part explanatory variable, instead of the constant (= 1).
- This will be used for imputation and for the MOI.
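One common way to write these weights for a 2-level model is sketched below. The notation is an assumption based on a widely used formulation of weighted multilevel models, not necessarily the slide's own: $w_j$ is the level 2 weight, $w_{i|j}$ the level 1 weight within the j-th level 2 unit, $w_{ij}$ the final level 1 weight, $N$ the total number of level 1 units, and $z_{ij}$ the variable that replaces the constant in the level 1 random part.

```latex
% Assumed notation (not taken from the source):
%   w_j      level 2 weight for unit j
%   w_{i|j}  level 1 weight for unit i within level 2 unit j
%   w_{ij}   final (composite) level 1 weight, scaled to sum to N
%   z_{ij}   level 1 random part explanatory variable
\begin{align*}
w_{ij} &= \frac{N \, w_{i|j} \, w_j}{\sum_{j} \sum_{i} w_{i|j} \, w_j}, \\
z_{ij} &= w_{ij}^{-1/2} \quad \text{(used in place of the constant } 1\text{)} .
\end{align*}
```

With this choice, the level 1 variance contribution for a unit is scaled by $z_{ij}^2 = 1 / w_{ij}$, so units with larger weights are treated as carrying more information.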