29th TRF 2003, Denver, July 14th, 2003. www.kiprc.uky.edu. Jenny H. Qin and Mike Singleton, Kentucky CODES, Kentucky Injury Prevention & Research Center, University.

Presentation transcript:

Performing Sensitivity Analyses of Imputed Missing Values
Jenny H. Qin and Mike Singleton, Kentucky CODES, Kentucky Injury Prevention & Research Center, University of Kentucky
29th TRF 2003, Denver, July 14th

Multiple Imputation Publications
- Multiple Imputation in Public Health Research
- Handling Missing Data in Nursing Research with Multiple Imputation
- Application of Multiple Imputation in Medical Studies: from AIDS to NHANES
- NHTSA: Transitioning to Multiple Imputation! A new Method to Impute Missing BAC values in FARS

Questions???
- May I use MI to deal with missing data problems for my data sets?
- How can I believe that MI will give me better analysis results?
- What should I do to get good results from MI?

Answers: Sensitivity Analyses on Imputed Values
A sensitivity analysis tests whether our study results are sensitive to the assumptions (missing data mechanism), data conditions (missing data rate), and choices (imputation model or number of imputations) made in obtaining them.

MI Process (flow diagram): Data Set of Interest -> Proc MI -> imputed data sets (Set 1, Set 2, Set 3, ..., Set n) -> Analysis Model -> one set of results per imputed data set -> Proc MIANALYZE -> Results. The diagram numbers the four factors examined in the sensitivity analyses: (1) Missing Data Mechanism, (2) Missing Data Rate, (3) Imputation Model, (4) Proc MI Options.

CODES Application
Research Question: What was the relationship between driving under the influence of drugs and/or alcohol and being killed or hospitalized in a crash, for motorcycle riders in Kentucky in 2001?
Outcome (Dependent Variable): Killed or Hospitalized (K/H)
Risk Factor Candidates (Independent Variables): age, gender, suspected DUI, posted speed limit, helmet use, fixed object, head-on collision, collision time, rural vs. urban

Analysis Model
Logistic Regression Model: K/H = β0 + β1*DUI + β2*Speed + β3*Fixed + β4*Head-On
Total records in our study data set: 1,226
Records with missing values: 14 (1.1%)

Results for the Gold Standard
Parameter   OR (95% CI)   Estimate   SE   P
DUI         2.51 ( )
Speed       1.58 ( )
Fixed       1.70 ( )
Head-on     1.70 ( )
This Gold Standard result is the benchmark against which all other results are compared.
Conclusion: the odds of being killed or hospitalized for motorcyclists with DUI are 2.5 times the odds for motorcyclists without DUI, when other factors are controlled.
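In a logistic regression the coefficients are on the log-odds scale, so an odds ratio and its interval come from exponentiating the estimate and its limits. The sketch below shows that conversion, assuming a Wald-type 95% interval (z = 1.96). The function name and the input values are hypothetical and are not the study's estimates.

```python
import math

def odds_ratio_with_ci(estimate, se, z=1.96):
    """Convert a logistic regression coefficient and its standard error
    into an odds ratio with a Wald-type confidence interval."""
    return (
        math.exp(estimate),           # point estimate of the odds ratio
        math.exp(estimate - z * se),  # lower confidence limit
        math.exp(estimate + z * se),  # upper confidence limit
    )

# Hypothetical coefficient and SE, chosen only for illustration.
or_, lower, upper = odds_ratio_with_ci(0.92, 0.25)
print(f"OR = {or_:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

The interval is symmetric around the estimate on the log-odds scale, which is why it looks lopsided once it is exponentiated to the odds ratio scale.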

Imputation Model
Analysis Model: K/H = β0 + β1*DUI + β2*Speed + β3*Fixed + β4*Head-On
Imputation Model: K/H DUI Speed Fixed Head-On
Note: The imputation model does not have to be identical to the analysis model, but it should include at least all of the analysis covariates. You can add any additional variables that are correlated with the variables that have missing values.

SA: Missing Data Mechanism (factor 1 in the MI process diagram). Mechanisms compared: MCAR, MAR, NMAR.

SA: Missing Data Mechanism
Missing Completely At Random (MCAR)
- Definition: the missing data values are a simple random sample of all data values.
- We simulated this condition by using SAS Proc SurveySelect to pick a random sample from the study data set, then setting DUI = missing for the selected cases.
Missing At Random (MAR)
- Definition: the probability of a missing value on a variable is unrelated to the value of that variable, after controlling for other variables in the analysis.
- We simulated this condition by setting DUI = missing for riders aged 46 or older.
Not Missing At Random (NMAR)
- Definition: the probability of a missing value on a variable is related to the value of that variable, even after controlling for other variables in the analysis.
- We simulated this condition by setting DUI = missing for uninjured riders who were not suspected of DUI (DUI='NO').
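The three simulation rules can be sketched in a few lines. The study itself used SAS (Proc SurveySelect for the MCAR sample). The Python below is only an illustration, and the record fields (`age`, `injured`, `dui`), the function names, and the toy data are assumptions, not the study's variables.

```python
import random

def make_mcar(records, rate, seed=0):
    """MCAR: blank DUI for a simple random sample of the records."""
    rng = random.Random(seed)
    n_missing = round(rate * len(records))
    chosen = set(rng.sample(range(len(records)), n_missing))
    return [dict(r, dui=None) if i in chosen else dict(r)
            for i, r in enumerate(records)]

def make_mar(records, min_age=46):
    """MAR: missingness depends on an observed variable (age), not on DUI."""
    return [dict(r, dui=None) if r["age"] >= min_age else dict(r)
            for r in records]

def make_nmar(records):
    """NMAR: missingness depends on the DUI value itself."""
    return [dict(r, dui=None) if (not r["injured"] and r["dui"] == "NO") else dict(r)
            for r in records]

# Toy records, invented for illustration only.
riders = [
    {"age": 23, "injured": True, "dui": "YES"},
    {"age": 51, "injured": False, "dui": "NO"},
    {"age": 34, "injured": False, "dui": "NO"},
    {"age": 60, "injured": True, "dui": "NO"},
]

print([r["dui"] for r in make_mar(riders)])
print([r["dui"] for r in make_nmar(riders)])
```

Each function returns new records and leaves the input untouched, so the same complete data set can serve as the benchmark for all three mechanisms.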

Created 3 data sets from the study data set with different missing data mechanisms but the same percentage of missing DUI values (25%).
Table columns: MCAR, MAR, and NMAR (each with 25% missing on DUI), reporting E, SE, and P for each.
Table rows: Intercept, DUI, Speed, Fixed, Head-on.

Sensitivity analysis on missing data mechanism: what is the result when the missing data mechanism (1) is different, while the missing data rate (2, 25%), the imputation model (3), and the Proc MI options (4) stay the same?

Conclusions of SA on Missing Data Mechanism
- Even with the simplest imputation model, MI produced results consistent with the Gold Standard when the missing data mechanism was MCAR or MAR, but not when it was NMAR.
- Under NMAR we would estimate the odds ratio of death or hospitalization for riders suspected of DUI to be 1.78 ( ), while the Gold Standard estimate is 2.51 ( ).

SA: Missing Data Rate (factor 2 in the MI process diagram). Rates compared: 6%, 25%, 50%.

SA: Missing Data Rate
- Data sets with MCAR: test with 6%, 25%, and 50% of DUI values missing
- Data sets with MAR: test with 6%, 25%, and 50% of DUI values missing

Create 3 data sets with MCAR from the study data set, with 6%, 25%, and 50% of DUI values missing respectively.
Table columns: MCAR with 6%, 25%, and 50% missing on DUI, reporting E, SE, and P for each.
Table rows: Intercept, DUI, Speed, Fixed, Head-on.

Create 3 data sets with MAR from the study data set, with 6%, 25%, and 50% of DUI values missing respectively.
Table columns: MAR with 6%, 25%, and 50% missing on DUI, reporting E, SE, and P for each.
Table rows: Intercept, DUI, Speed, Fixed, Head-on.

Sensitivity analysis on missing data rate: what is the result when the missing data rate (2) is different, while the missing data mechanism (1, MCAR or MAR), the imputation model (3), and the Proc MI options (4) stay the same?

Conclusions of SA on Missing Data Rate
- For both missing data mechanisms, the 50% missing case produced the DUI parameter estimate farthest from the Gold Standard estimate, as well as the widest 95% CI.
- However, for MCAR the difference from the Gold Standard estimate was -7%, whereas for MAR it was 42%.
- In addition, the 95% CI for 50% MCAR was 19% wider than the Gold Standard 95% CI, whereas for 50% MAR it was 106% wider.
- This shows that the simplest imputation model is not sufficient to handle very high missing data rates.

SA: Imputation Model (factor 3 in the MI process diagram). Models compared: Model1, Model2, Model3, Model4.

SA: Imputation Model
Data set with MAR and 50% of DUI values missing. Tests on the following 4 imputation models:
- Model1: K/H DUI Speed Fixed Head-on (Model1 = analysis model; it is the simplest imputation model)
- Model2: Model1 + age_group + colltime (categorical)
- Model3: Model1 + age_group + hour (continuous)
- Model4: Model1 + age_group + hour_normal (continuous)
We add age and collision time to help predict DUI in Model2, Model3, and Model4.

Use 4 different imputation models to do MI on the same data set with MAR, 50% missing on DUI.
Table columns: Model 2, Model 3, and Model 4 (each with 50% missing on DUI), reporting E, SE, and P for each.
Table rows: Intercept, DUI, Speed, Fixed, Head-on.

Sensitivity analysis on imputation model: what is the result when the imputation model (3) is different, while the missing data mechanism (1, MAR), the missing data rate (2, 50%), and the Proc MI options (4) stay the same?

Conclusions of SA on Imputation Models
- Models 2, 3, and 4 are all improvements over Model 1, and produced DUI parameter estimates and 95% CI widths close to those of the Gold Standard.
- So even with 50% missing values (MAR), we are able to get a good result by using a richer imputation model.
- The higher the percentage of missing values (MAR) in your data set, the more important it is to include additional predictors in the imputation model.

Comparison of No MI and Model 4 to the Gold Standard (figure; series shown: No MI, G.S., MI)

SA: Proc MI: Number of Imputations (factor 4 in the MI process diagram). Values compared: N=0, N=2, N=5, N=10, N=20.

SA: Proc MI: Number of Imputations
Data set with MAR and 50% of DUI values missing; use Model4 to do MI. Test different numbers of imputations:
- N=0
- N=2
- N=5
- N=10
- N=20
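One common way to reason about the number of imputations, which the slides do not show, is Rubin's approximate relative efficiency, 1 / (1 + γ/m), where γ is the fraction of missing information and m is the number of imputations. The sketch below evaluates it for the positive values of N tested here. The value γ = 0.5 is an assumption for illustration: the fraction of missing information is not the same quantity as the 50% missing data rate.

```python
def relative_efficiency(fraction_missing_info, m):
    """Approximate efficiency of m imputations relative to infinitely many
    (Rubin's formula): 1 / (1 + gamma / m)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 / (1.0 + fraction_missing_info / m)

# gamma = 0.5 is an assumed value, not a figure from the study.
for m in (2, 5, 10, 20):
    print(m, round(relative_efficiency(0.5, m), 3))
```

Under this assumption the efficiency is already above 0.9 at m = 5, and further imputations add progressively less.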

Use the same imputation model (Model4) but different numbers of imputations to do MI on the same data set with MAR, 50% missing on DUI.
Table columns: N=5, N=10, and N=20 (each with 50% missing on DUI), reporting E, SE, and P for each.
Table rows: Intercept, DUI, Speed, Fixed, Head-on.

Sensitivity analysis on number of imputations: what is the result when the number of imputations (4) is different, while the missing data mechanism (1, MAR), the missing data rate (2, 50%), and the imputation model (3) stay the same?

Conclusions of SA on Number of Imputations
- In our example, N=5 to 10 is enough to get good results for a data set with 50% MAR on DUI.
- With No MI (complete cases only), we would conclude that motorcyclists with DUI had 4.2 (2.1, 8.4) times the odds of being killed or hospitalized compared with motorcyclists without DUI.
- From the Gold Standard, however, the OR is 2.5 (1.5, 4.0).
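The gap between the complete-case result and the Gold Standard can be put in numbers the same way the missing-data-rate slide does: as a percent difference in the estimate and in the interval width. The sketch below applies that arithmetic to the two odds ratios quoted above. It works on the odds ratio scale, whereas the earlier percentages were computed on parameter estimates, and the helper names are illustrative.

```python
def percent_difference(value, reference):
    """Percent difference of value from reference."""
    return 100.0 * (value - reference) / reference

def ci_width(lower, upper):
    """Width of a confidence interval."""
    return upper - lower

# Odds ratios and intervals as quoted on the slide.
gold = {"or": 2.5, "ci": (1.5, 4.0)}
no_mi = {"or": 4.2, "ci": (2.1, 8.4)}

or_diff = percent_difference(no_mi["or"], gold["or"])
width_diff = percent_difference(ci_width(*no_mi["ci"]), ci_width(*gold["ci"]))

print(f"OR difference from Gold Standard: {or_diff:.0f}%")
print(f"CI width difference from Gold Standard: {width_diff:.0f}%")
```

On these figures the complete-case odds ratio is roughly 68% higher than the Gold Standard and its interval is roughly 152% wider.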

Summary: Answers
May I use MI to deal with missing data problems for my data sets?
- It seems a good idea to try MI. How well it works depends on the missing data mechanisms of the variables with missing values in your data sets (however, even our results with MI for NMAR were better than No MI).
How can I believe that MI will give me better analysis results?
- We found that using MI on our example gave us much better analysis results than No MI (complete cases only).
How can I get better analysis results by using MI?
- Understand the relationships between variables in your data sets
- Know the missing data mechanisms of the variables
- Determine the percent of missing information
- Build a reasonable imputation model
- Use Proc MI options wisely

Missing Data Problems Everywhere
Q1. I like Denver. Q2. I like TRF. Q3. I liked the talk. Q4. I will use the MI.
Poll Results (columns: Like Denver, Like TRF, Liked the Talk, Use MI): Y Y Y Y Missing (left session early) Y Missing (too nice to say “NO”) N Y N Y Y Y N N Missing (not sure yet) N Missing (daydreaming) Y Y Missing (fell asleep) Y Missing N N N N N Y Y

Acknowledgment
Special thanks to Dr. Mike McGlincy, who gave us helpful suggestions during our study of sensitivity analyses on imputed values and insightful comments on the analysis results.

Thank You

Questions?

Can We Improve Analysis Results for NMAR by Using a More Complex Imputation Model?
- Model5 = Model1 + age + hour + gender + safety
- Model4 = Model1 + age + hour
- Model1 = K/H + DUI + Speed + Fixed + Head-on
- No MI = Complete cases only

Multiple Imputation inference involves three distinct phases:
1. The missing data are filled in m times to generate m complete data sets (using the imputation model)
2. The m complete data sets are analyzed by using standard procedures (using the analysis model)
3. The results from the m complete data sets are combined for the inference
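The third phase is usually done with Rubin's combining rules, which is the job Proc MIANALYZE performs in the workflow described in these slides. The sketch below shows those rules for a single parameter, assuming one estimate and one standard error per imputed data set. The function name and the input values are hypothetical.

```python
import math
from statistics import mean, variance

def pool_estimates(estimates, standard_errors):
    """Combine m estimates and standard errors with Rubin's rules.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need matching lists with at least 2 imputations")
    pooled = mean(estimates)                           # average of the m estimates
    within = mean([se ** 2 for se in standard_errors]) # average within-imputation variance
    between = variance(estimates)                      # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between             # total variance
    return pooled, math.sqrt(total)

# Hypothetical results from m = 3 imputed data sets.
estimate, se = pool_estimates([0.8, 1.0, 1.2], [0.3, 0.3, 0.3])
print(f"pooled estimate = {estimate:.3f}, pooled SE = {se:.3f}")
```

The between-imputation term is what carries the uncertainty due to the missing data: if every imputed data set gave the same estimate, the pooled standard error would reduce to the ordinary one.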

Statistical Assumptions for Multiple Imputation
1. The MI procedure assumes that the data are from a continuous multivariate distribution. It also assumes that the data are from a multivariate normal distribution when the MCMC method is used.
- According to Schafer’s MI FAQ page, MI tends to be quite forgiving of departures from the normality assumption. For example, when working with binary or ordered categorical variables, it is often acceptable to impute under a normality assumption and then round off the continuous imputed values to the nearest category.
- Variables whose distributions are heavily skewed may be transformed to approximate normality and then transformed back to their original scale after imputation.
2. Proc MI and Proc MIANALYZE assume that the missing data are Missing At Random (MAR).
- MCAR is unlikely for real-world crash data sets.
- NMAR may be shifted toward MAR by using a richer imputation model to help predict missing values, because crash data sets include many related variables that can help predict each other.
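The two workarounds under the first assumption are simple post-processing steps. The sketch below illustrates them for a binary variable coded 0/1 and for a positive, right-skewed variable. The 0/1 coding, the choice of a log transform, and the function names are assumptions made for illustration; the slides do not say which transformation was used.

```python
import math

def round_to_binary(imputed_value):
    """Map a continuous imputed value to the nearer of the categories 0 and 1."""
    return 1 if imputed_value >= 0.5 else 0

def to_log_scale(value):
    """Transform a positive, right-skewed value before imputation."""
    return math.log(value)

def from_log_scale(value):
    """Transform an imputed value back to the original scale."""
    return math.exp(value)

# Continuous imputations for a 0/1 variable, invented for illustration.
print([round_to_binary(v) for v in (0.73, -0.20, 1.40, 0.31)])

# Round trip for a skewed variable: impute on the log scale, report on the original.
print(from_log_scale(to_log_scale(37.0)))
```

Values imputed outside the 0 to 1 range are pulled back to the nearest valid category, which is the "round off" step described above.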
