The Centre for Longitudinal Studies Missing Data Strategy

Slides:



Advertisements
Similar presentations
Handling attrition and non- response in longitudinal data Harvey Goldstein University of Bristol.
Advertisements

Handling Missing Data on ALSPAC
Non response and missing data in longitudinal surveys.
Treatment of missing values
1 QOL in oncology clinical trials: Now that we have the data what do we do?
Improving health worldwide George B. Ploubidis The role of sensitivity analysis in the estimation of causal pathways from observational.
Some birds, a cool cat and a wolf

1 Unsupervised Learning With Non-ignorable Missing Data Machine Learning Group Talk University of Toronto Monday Oct 4, 2004 Ben Marlin Sam Roweis Rich.
How to deal with missing data: INTRODUCTION
Partially Missing At Random and Ignorable Inferences for Parameter Subsets with Missing Data Roderick Little Rennes
Missing Data.. What do we mean by missing data? Missing observations which were intended to be collected but: –Never collected –Lost accidently –Wrongly.
FINAL REPORT: OUTLINE & OVERVIEW OF SURVEY ERRORS
Multiple imputation using ICE: A simulation study on a binary response Jochen Hardt Kai Görgen 6 th German Stata Meeting, Berlin June, 27 th 2008 Göteborg.
1 Multiple Imputation : Handling Interactions Michael Spratt.
Practical Missing Data Analysis in SPSS (v17 onwards) Peter T. Donnan Professor of Epidemiology and Biostatistics.
1 Introduction to Survey Data Analysis Linda K. Owens, PhD Assistant Director for Sampling & Analysis Survey Research Laboratory University of Illinois.
G Lecture 11 G Session 12 Analyses with missing data What should be reported?  Hoyle and Panter  McDonald and Moon-Ho (2002)
Handling Attrition and Non- response in the 1970 British Cohort Study Tarek Mostafa Institute of Education – University of London.
Imputation for Multi Care Data Naren Meadem. Introduction What is certain in life? –Death –Taxes What is certain in research? –Measurement error –Missing.
1 Introduction to Survey Data Analysis Linda K. Owens, PhD Assistant Director for Sampling & Analysis Survey Research Laboratory University of Illinois.
SW 983 Missing Data Treatment Most of the slides presented here are from the Modern Missing Data Methods, 2011, 5 day course presented by the KUCRMDA,
© John M. Abowd 2007, all rights reserved General Methods for Missing Data John M. Abowd March 2007.
1 G Lect 13W Imputation (data augmentation) of missing data Multiple imputation Examples G Multiple Regression Week 13 (Wednesday)
The Impact of Missing Data on the Detection of Nonuniform Differential Item Functioning W. Holmes Finch.
Missing Values Raymond Kim Pink Preechavanichwong Andrew Wendel October 27, 2015.
A REVIEW By Chi-Ming Kam Surajit Ray April 23, 2001 April 23, 2001.
Tutorial I: Missing Value Analysis
INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011.
Mediation: The Causal Inference Approach David A. Kenny.
A framework for multiple imputation & clustering -Mainly basic idea for imputation- Tokei Benkyokai 2013/10/28 T. Kawaguchi 1.
DATA STRUCTURES AND LONGITUDINAL DATA ANALYSIS Nidhi Kohli, Ph.D. Quantitative Methods in Education (QME) Department of Educational Psychology 1.
Multiple Imputation in Finite Mixture Modeling Daniel Lee Presentation for MMM conference May 24, 2016 University of Connecticut 1.
Research and Evaluation Methodology Program College of Education A comparison of methods for imputation of missing covariate data prior to propensity score.
HANDLING MISSING DATA.
Survival-time inverse-probability- weighted regression adjustment
Missing data: Why you should care about it and what to do about it
Missing Data and Selection Bias
Theme (v): Managing change
Handling Attrition and Non-response in the 1970 British Cohort Study
Understanding Non Response in the 1958 Birth Cohort
MISSING DATA AND DROPOUT
Maximum Likelihood & Missing data
Introduction to Survey Data Analysis
Multiple Imputation.
Multiple Imputation Using Stata
How to handle missing data values
Dealing with missing data
Does cognitive ability in childhood predict fertility
Presenter: Ting-Ting Chung July 11, 2017
Discrete Event Simulation - 4
The bane of data analysis
Missing Data Imputation in the Bayesian Framework
The European Statistical Training Programme (ESTP)
Missing Data Mechanisms
Non response and missing data in longitudinal surveys
Analysis of missing responses to the sexual experience question in evaluation of an adolescent HIV risk reduction intervention Yu-li Hsieh, Barbara L.
Multiple Regression – Split Sample Validation
New Techniques and Technologies for Statistics 2017  Estimation of Response Propensities and Indicators of Representative Response Using Population-Level.
Chapter 4: Missing data mechanisms
The European Statistical Training Programme (ESTP)
What do Samples Tell Us Variability and Bias.
Sample Sizes for IE Power Calculations.
Rachael Bedford Mplus: Longitudinal Analysis Workshop 23/06/2015
Clinical prediction models
Chapter 13: Item nonresponse
A handbook on validation methodology. Metrics.
Missing data: Is it all the same?
Considerations for the use of multiple imputation in a noninferiority trial setting Kimberly Walters, Jie Zhou, Janet Wittes, Lisa Weissfeld Joint Statistical.
Imputation Strategies When a Continuous Outcome is to be Dichotomized for Responder Analysis: A Simulation Study Lysbeth Floden, PhD1 Melanie Bell, PhD2.
Presentation transcript:

The Centre for Longitudinal Studies Missing Data Strategy George B. Ploubidis, Tarek Mostafa & Brian Dodgeon

Outline CLS Applied Statistical Methods Missing data Rubin’s classification CLS Missing Data Strategy

CLS Applied Statistical Methods Applied methodological work which aims to reduce bias from the three major challenges in observational longitudinal data: _Missing data _Measurement error _Causal inference Interdisciplinary approach: Applying in the CLS data methods/ideas from Statistics/Biostatistics, Epidemiology, Econometrics, Psychometrics and Computer Science

Missing data Selection bias, in the form of incomplete or missing data, is unavoidable in longitudinal surveys Smaller samples, incomplete histories, lower statistical power Unbiased estimates cannot be obtained without properly addressing the implications of incompleteness Statistical methods available to exploit the richness of longitudinal data to address bias

Sample size in the 1958 cohort as % of the original sample

The 10% rule (of thumb)

Rubin’s framework A useful first step is to place the 1958 cohort data within Rubin’s framework A simple Directed Acyclic Graph (DAG) Y is an outcome X is an exposure (assumed complete/no missing) RY is binary indicator with R = 1 denoting whether a respondent has a missing value on Y

Missing Completely At Random - MCAR

Missing Completely At Random - MCAR There are no systematic differences between the missing values and the observed values There isn’t any association between observed or unobserved variables and non response Partially testable, since we can find out whether variables available in our data are associated with missingness However, if we fail to find such associations, we cannot be certain that unmeasured variables are not associated with the probability of non- response

Complete Case Analysis (CCA) OK when not much missing data (<10%) or FICO <10% Valid under MCAR – Assumes that complete records do not differ from incomplete But! CCA can be unbiased (but obviously less efficient) in specific scenarios even if the complete records are systematically different - not true that CCA is always biased if data are not MCAR Given the exposures/covariates X in the substantive model Outcome Y missing/complete exposures X, probability of missing on Y independent of observed values of Y Exposure X missing/outcome Y complete, probability of missing on X independent of Y or observed values of X (Daniel, Kenward, Cousens & De Stavola, 2012, Stat Methods Med Res)

Complete Case Analysis (CCA) In longitudinal studies, usually there are incomplete records on both exposure and outcome (on confounders and mediators too) In most scenarios missing data >10% Auxiliary variables/predictors of response not in the substantive model of interest may be available In the majority of research scenarios in the 1958 cohort CCA will probably be biased

Missing At Random DAG

Missing At Random - MAR Systematic differences between the missing values and the observed values can be explained by observed data Given the observed data, the reasons for missingness do not depend on unobserved variables Pr (RY) = Pr (RY |X) MAR methods: Multiple Imputation (various forms of), Full Information Maximum Likelihood, Inverse Probability Weighting, Fully Bayesian methods, Linear Increments, Doubly Robust Methods (IPW +MI) All methods assume that all/most important drivers of missingness are available Which variables?

Missing Not At Random - DAG

Missing Not At Random - MNAR Even after accounting for all observed information, differences remain between the missing values and the observed values Unobserved variables are responsible for missingness Pr (RY) = Pr (RY |Y,X) Untestable! Selection models and/or pattern mixture models Both approaches make unverifiable distributional assumptions! Choice depends on the complexity of the substantive question Choice between MAR and MNAR models not straightforward

Rubin’s framework and representativeness/balanced samples MCAR: No selection, sample is “representative”/balanced MAR: Observed variables account for selection. Given these, sample is representative/balanced MNAR: Observed variables do not account for selection (selection is due to unobservables too) MAR and MNAR are untestable, but if a “gold standard” for the target population exists (ONS survey for example), we could test whether after accounting for selection with auxiliary variables the distribution of target variables is similar to that observed in a ONS survey Even when distributions are similar the target variables can still be MNAR, but the bias (for this specific variable) is probably negligible

What happens in the 1958 cohort? We know that the missing data generating mechanism is not MCAR For the majority of research scenarios CCA will either be biased and/or inefficient CCA probably OK if outcome and exposure up to age 11 Missing data generating mechanism is either MAR on MNAR Both untestable – rely on unverifiable assumptions MAR vs MNAR related to omitted variable bias/unmeasured confounding bias In the majority of research scenarios in the 1958 cohort a principled approach to the analysis of incomplete records is needed

CLS Missing Data Strategy Applied methodological work A simple idea - Maximise the plausibility of the MAR assumption Exploit the richness of longitudinal data to address sources of bias In the 1958 cohort (and any study) the information that maximises the plausibility of MAR is finite We can identify the variables that are associated with non response Auxiliary variables – not in the substantive model Sounds straightforward, but it’s not (see Tarek’s talk)

How to turn MNAR into MAR

A data driven approach to maximise the plausibility of MAR Data driven approach to identify predictors of non response in all waves of the CLS studies Substantive interest: Understanding non response Is early life more important, or it’s all about what happened in the previous wave? Can we maximise the plausibility of MAR with sets of early life variables, or later waves are needed too? Are the drivers of non response similar between cohorts? The goal is to understand non response and in the process identify auxiliary variables that can be used in realistically complex models that assume MAR AV’s to be used in addition to the variables in the substantive model and predictors of item non response (if item non response >10%)

MAR vs MNAR Some missing data patterns/variables may be MNAR even after the introduction of auxiliary variables Non monotone patterns are more likely to be MNAR (Robins & Gill, 1997) We assume that after the introduction of AV’s our data is either MAR, or not far from being MAR, so bias is negligible Reasonable assumption - Richness of longitudinal data Can’t be sure! Our results will inform sensitivity analyses for departures from MAR

Outputs We will not make available imputed datasets Technical report, peer reviewed papers and user guide Stata code on how to use auxiliary variables (see Brian’s talk today) Transparent assumptions so users can make an informed choice Dynamic process, the results will be updated when new waves or other data become available (paradata for example)

Thank you for your attention!