Testing the performance of the two-fold FCS algorithm for multiple imputation of longitudinal clinical records


Testing the performance of the two-fold FCS algorithm for multiple imputation of longitudinal clinical records
Catherine Welch (1), Irene Petersen (1), Jonathan Bartlett (2), Ian White (3), Richard Morris (1), Louise Marston (1), Kate Walters (1), Irwin Nazareth (1) and James Carpenter (2)
(1) Department of Primary Care and Population Health, UCL
(2) Department of Medical Statistics, LSHTM
(3) MRC Biostatistics, Cambridge
Funding: MRC

The Health Improvement Network (THIN) primary care database
GP records: 9 million patients over 15 years in 450 practices
Powerful data source for research into coronary heart disease (CHD)
Studies complicated by missing data
Up to 38% of health indicator measurements are missing in newly registered patients (1)
(1) Marston et al, 2010, Pharmacoepidemiology and Drug Safety

Partially observed data in THIN
Missing data never intended to be recorded
Data recorded at irregular intervals
Non-monotone missingness pattern

Multiple Imputation (MI) and THIN
Most MI methods are designed for cross-sectional data
Need to impute both continuous and discrete variables at many time points
  – Standard ICE using Stata struggles with this
New method developed by Nevalainen et al (2009, Statistics in Medicine)
  – Two-fold fully conditional specification (FCS) algorithm
  – Imputes each time point separately
  – Uses information recorded before and after the time point

A graphical illustration of the two-fold FCS algorithm
[Figure: within-time iteration and among-time iteration; Nevalainen et al, 2009, Statistics in Medicine]
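
The nesting of the two kinds of iteration can be sketched as a visiting order. This is a minimal sketch and not the authors' implementation: the function name `two_fold_order` and its arguments are hypothetical, and it assumes the among-time iteration is the outer sweep over time points while the within-time iteration cycles over the variables at a single time point. It only enumerates the loop structure; no imputation model is fitted.

```python
from typing import Iterator, List, Tuple


def two_fold_order(
    time_points: List[int],
    variables: List[str],
    n_among: int,
    n_within: int,
) -> Iterator[Tuple[int, int, int, str]]:
    """Yield (among_iter, time_point, within_iter, variable) in visiting order.

    Hypothetical helper: it lists the order of visits only, it does not impute.
    """
    for among in range(1, n_among + 1):            # among-time iteration: sweep over all time points
        for t in time_points:                       # one time point at a time
            for within in range(1, n_within + 1):   # within-time iteration at this time point
                for var in variables:               # chained equations: one variable after another
                    yield (among, t, within, var)


# Toy usage: 2 time points, 2 variables, 1 among-time and 2 within-time iterations.
order = list(two_fold_order([2000, 2001], ["sbp", "weight"], n_among=1, n_within=2))
print(len(order))   # 2 time points * 2 within-time iterations * 2 variables = 8
print(order[0])     # (1, 2000, 1, 'sbp')
```

In a real implementation each visit would fit a conditional model for `var` at time `t` and draw imputations from it.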

Algorithm validation
Nevalainen et al
  – Proposed the two-fold FCS approach
  – Validated algorithm using data sampled from a case-control study
  – 3 time points included with a linear substantive model
Our previous work
Imputed data had accurate coefficients and acceptable level of variation in these settings

Simulation
Before we apply the algorithm to THIN we want to test it in a complex setting similar to THIN
Test algorithm in simulation study:
  – Create 1000 full datasets
  – Remove values
  – Apply two-fold FCS algorithm
  – Fit regression model for risk of CHD to the full data, complete case data and imputed data
  – Compare results
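
The final "compare results" step usually comes down to summarising each coefficient across the simulated datasets against its known true value. The sketch below shows one way to do that; the helper name `summarise_estimates` and the numbers are made up for illustration and are not taken from the study.

```python
import math
from statistics import mean, stdev
from typing import Dict, List


def summarise_estimates(estimates: List[float], true_value: float) -> Dict[str, float]:
    """Summarise one coefficient across simulated datasets (hypothetical helper)."""
    avg = mean(estimates)
    emp_se = stdev(estimates)  # empirical SE: spread of the estimates across datasets
    return {
        "mean": avg,
        "bias": avg - true_value,                               # average estimate minus truth
        "empirical_se": emp_se,
        "mc_error_of_mean": emp_se / math.sqrt(len(estimates)),  # Monte Carlo error of the mean
    }


# Toy numbers, not results from the study.
summary = summarise_estimates([0.48, 0.52, 0.50, 0.47, 0.53], true_value=0.50)
print(summary)
```

Running the same summary for the full, complete case and imputed analyses gives directly comparable bias and standard error figures.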

Advantages of using simulated data
We know the original distributions, so we can compare them with the distributions of the imputed data and test for bias
Create different scenarios to test the algorithm
Design data so it is close to THIN data

Simple dataset
5000 men, 10 years of data
CHD diagnosis from 2000: yes/no
Age: 5 year age bands
Smoking status recorded in 2000
  – smokers, ex- and non-smokers
Anti-hypertensive drug prescription: yes/no
Systolic blood pressure (mmHg)
Weight (kg)
Townsend score quintile: 1 (least deprived) to 5 (most deprived)
Registration: indicates if patient registered in 1999

Results from exponential regression model
Outcome: time to CHD
Exposures in year 2000: age, Townsend score quintile, weight, blood pressure, smoking status, anti-hypertensive drug treatment, registration in 1999
Analysis of 1000 datasets
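
For reference, an exponential regression model for time to CHD can be written in its standard proportional-hazards form. The symbols below are notation introduced here, not taken from the slides, and the slides do not state which parametrisation was used. Here $t_i$ is the follow-up time of patient $i$, $d_i = 1$ if CHD was diagnosed (0 otherwise) and $x_i$ holds the exposures in year 2000.

```latex
\begin{aligned}
h(t \mid x_i) &= \exp(\beta_0 + \beta^\top x_i), \\
S(t \mid x_i) &= \exp\{-t \exp(\beta_0 + \beta^\top x_i)\}, \\
\ell(\beta_0, \beta) &= \sum_i \left[ d_i (\beta_0 + \beta^\top x_i) - t_i \exp(\beta_0 + \beta^\top x_i) \right].
\end{aligned}
```

Under this form each element of $\beta$ is a log hazard ratio, which is presumably what the slides report as a log risk ratio.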

Generated data results
Results of fitting exponential regression model
Columns: THIN data (log risk ratio); full simulated data (log risk ratio, SE)
Rows: anti-hypertensive drug treatment; systolic blood pressure (mmHg); weight (kg); smoking status (non-smoker as reference, ex-smoker, current smoker)
Adjusted for age, registration in 1999 and Townsend score quintile

70% missing completely at random (MCAR) missingness mechanism
Missing data on blood pressure, weight, smoking
In THIN:
  – % missing in any given year, e.g. 70% missing is equivalent to a health indicator being recorded approximately every 3 years
  – If one variable is missing other variables are also more likely to be missing
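
An MCAR mechanism deletes each value with the same probability, regardless of any observed or unobserved data. The sketch below illustrates this on made-up weights; `make_mcar` is a hypothetical helper, and it deletes values independently, so it does not reproduce the THIN feature that variables tend to be missing together.

```python
import random
from typing import List, Optional


def make_mcar(values: List[float], p_missing: float, rng: random.Random) -> List[Optional[float]]:
    """Set each value to None with probability p_missing, independently of all data (MCAR)."""
    return [None if rng.random() < p_missing else v for v in values]


rng = random.Random(2000)                            # fixed seed so the toy run is repeatable
weights = [70.0 + 0.01 * i for i in range(10_000)]   # made-up weights in kg
observed = make_mcar(weights, p_missing=0.7, rng=rng)

frac_missing = sum(v is None for v in observed) / len(observed)
print(round(frac_missing, 2))          # close to 0.70

# With yearly records and 70% missing, a value is recorded in about 30% of years,
# i.e. on average once every 1 / (1 - 0.7) years.
print(round(1 / (1 - 0.7), 1))         # 3.3
```

The last line is the arithmetic behind "approximately every 3 years".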

70% MCAR results
Columns: THIN data (log risk ratio); simulated data: full data (log risk ratio, SE) and complete case (log risk ratio, SE)
Rows: anti-hypertensive drug treatment; systolic blood pressure (mmHg); weight (kg); smoking status (non-smoker as reference, ex-smoker, current smoker)
Adjusted for age, registration in 1999 and Townsend score quintile

Two-fold FCS algorithm
Stata ICE: series of chained equations
3 among-time iterations, 10 within-time iterations
Produce 3 imputed datasets
1 year time window
[Diagram: time points i-3, i-2, i-1, i, i+1, i+2, i+3]
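
The time window decides which neighbouring time points contribute predictors when time point i is imputed. The sketch below assumes that a window of width w means the w time points before and the w time points after i, truncated at the ends of follow-up; `window_time_points` is a hypothetical helper, not part of Stata ICE.

```python
from typing import List


def window_time_points(i: int, width: int, first: int, last: int) -> List[int]:
    """Time points other than i that lie within `width` steps of i and inside [first, last]."""
    return [t for t in range(i - width, i + width + 1) if t != i and first <= t <= last]


# 10 yearly time points, numbered 0..9 here for illustration.
print(window_time_points(4, width=1, first=0, last=9))   # [3, 5]
print(window_time_points(0, width=1, first=0, last=9))   # [1]  (no earlier time point exists)
print(window_time_points(4, width=3, first=0, last=9))   # [1, 2, 3, 5, 6, 7]
```

A wider window brings in more information per imputation model at the cost of more predictors in each model.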

Imputing time-independent variables
Algorithm designed to impute time-dependent variables and does not account for imputing time-independent variables
Smoking status in 2000 is a time-independent variable
Need to extend algorithm for this

Imputing time-independent variables
For each among-time iteration, time-independent variables are imputed first
The algorithm then cycles through the time points with smoking status included as an auxiliary variable
[Diagram label: Impute time-independent variables]

Results following imputation
We would expect to see similar log risk ratios to the THIN data
The standard errors for variables with no missing data will be close to those from the full data
The standard errors for variables with missing data will be smaller than in the complete case analysis but will not recover to the size of those from the full data
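
The slides do not say how the estimates from the 3 imputed datasets were combined; the standard approach is Rubin's rules, sketched below under that assumption. The function name `pool_rubin` and the input numbers are made up for illustration.

```python
from statistics import mean, variance
from typing import List, Tuple


def pool_rubin(estimates: List[float], std_errors: List[float]) -> Tuple[float, float]:
    """Pool one coefficient over m imputed datasets using Rubin's rules.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(se ** 2 for se in std_errors)   # average within-imputation variance
    between = variance(estimates)                  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5


# Made-up log risk ratios and SEs from m = 3 imputed datasets (not study results).
est, se = pool_rubin([0.40, 0.50, 0.60], [0.10, 0.10, 0.10])
print(round(est, 3), round(se, 3))
```

The between-imputation term reflects uncertainty about the missing values, which is why pooled standard errors for partially observed variables are not expected to shrink all the way to the full-data values.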

Results following imputation
Columns: THIN data (log risk ratio); simulated data: full data (log risk ratio, SE), complete case (log risk ratio, SE) and imputed data (log risk ratio, SE)
Rows: anti-hypertensive drug treatment; systolic blood pressure (mmHg); weight (kg); smoking status (non-smoker as reference, ex-smoker, current smoker)
Adjusted for age, registration in 1999 and Townsend score quintile


Correlations
Previous results imply accurate imputations for missing data in 2000
Alternative method required to assess the other time points:
  – Assess correlations between measurements recorded at different times
We would like to maintain the correlation structure in the generated and imputed data at all time points
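
One way to carry out this check is to compute the correlation between every pair of time points, once in the generated data and once in the imputed data, and compare the two matrices. The sketch below uses made-up blood pressures; `pearson` and `correlation_between_times` are hypothetical helpers written for this illustration.

```python
from math import sqrt
from typing import List, Optional, Sequence


def pearson(x: Sequence[Optional[float]], y: Sequence[Optional[float]]) -> float:
    """Pearson correlation using only patients with both values present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)


def correlation_between_times(wide: List[List[Optional[float]]]) -> List[List[float]]:
    """Correlation matrix between time points; wide[t] holds one value per patient at time t."""
    return [[pearson(a, b) for b in wide] for a in wide]


# Made-up systolic blood pressures for 4 patients at 3 time points.
sbp = [
    [120.0, 135.0, 150.0, 142.0],
    [122.0, 137.0, 152.0, 144.0],   # every patient +2 mmHg: correlation 1 with the first year
    [150.0, 135.0, 120.0, 128.0],   # reversed ordering: correlation -1 with the first year
]
corr = correlation_between_times(sbp)
print(round(corr[0][1], 3), round(corr[0][2], 3))
```

If imputation preserves the longitudinal structure, the matrix from the imputed data should be close to the one from the generated data at every pair of time points.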

Correlations

Increase time window
Increased the time window to 2 and 3 years
This slightly improves the estimates of coefficients and SE
[Diagram: time points i-3, i-2, i-1, i, i+1, i+2, i+3 with the 2 year and 3 year time windows marked]

Increase time window

In summary
The two-fold FCS algorithm gives unbiased imputations with:
  – 70% missing data
  – Exponential regression model, and
  – MCAR missingness mechanism
The correlation structure is maintained as the time window increases

Discussion
Algorithm effective because there is at least one measurement during follow-up
Same results with missing at random (MAR)
Future work…
  – Introduce censoring
  – Change smoking status to be time-dependent
  – Interactions