A framework for multiple imputation & clustering - Mainly basic idea for imputation -
Tokei Benkyokai 2013/10/28 T. Kawaguchi



Outline
1. JFMC41
2. Introduction to imputation of missing data
3. Framework for Multiple Imputation for Cluster Analysis (American Journal of Epidemiology, Feb 27, 2013)

JFMC41: time course
Main objective
To find genetic factors associated with the response (side effects) to an anticancer drug.
Features
Side effects over 12 courses of chemotherapy were observed in 486 colorectal cancer patients.
The main outcomes are peripheral neuropathy and allergy, based on patient self-report.
A four-grade classification is widely used to rate the degree of numbness.
Some samples are censored.

JFMC41
Question 1. How should this kind of data be analyzed? Can we make a guideline?
Examples of hypotheses and solutions
1. Patients who reach a bad status at least once may differ genetically from the others. Ex) Max grade = 3 vs. the others, analyzed by logistic regression.
2. The status may be associated linearly. Ex) The max grade over the 12 courses is the outcome, analyzed by ordinal logistic regression.
3. The timing of the bad status may be associated with genetic background. Ex) Survival analysis, encoding the event ("die") as {Max grade = 3}.
4. There may be a good pathological classification. Ex) Clustering into early, late, robust, ... groups, analyzed by (multiple) logistic regression.
Question 2. How can pathology be classified from the data automatically? How should the missing data be handled?


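Algorithm
1. Create r datasets by multiple imputation.
2. Determine the best k (number of clusters) in each dataset.
3. From the best-k results, decide which variables to include in the final model.
4-7. Cluster all datasets again using the same model.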

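Evaluation (1)
Distribution of the fit across the datasets when k clusters are used ⇒ which k is best?
For k = 2, how many times each variable was selected across the datasets ⇒ which set of variables is best?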

Evaluation (2)
Shows how many times out of r each sample is assigned to its final cluster ⇒ evaluates the number of clusters, consistency, etc.

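Evaluation (3)
Differences in the distribution of each variable between clusters, shown side by side for the raw data (the relationship between the final cluster assignment and the raw data) and for the imputed data (results of all datasets combined) ⇒ overall assessment of each variable's contribution and of the effect of imputation.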

Introduction to Imputation

Patterns of missingness
MCAR: Missing Completely At Random
MAR: Missing At Random
MNAR: Missing Not At Random
Y: a variable that has missing values
R: a flag indicating whether Y is missing or not (0/1)
X: a variable included in the analysis
[Diagram: relationships among Y, R, X and W under each pattern]
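
The three patterns can be illustrated with a small simulation. This is a minimal sketch in base R; the sample size, the 0.7 slope, the logistic missingness models and all variable names are assumptions chosen for illustration and do not come from the slides.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)                                   # X: fully observed variable
y <- 0.7 * x + rnorm(n, sd = sqrt(1 - 0.7^2))   # Y: variable that will get missing values

# R = 1 means Y is missing. The patterns differ only in what drives R.
r_mcar <- rbinom(n, 1, 0.3)                     # MCAR: unrelated to X and Y
r_mar  <- rbinom(n, 1, plogis(-1 + 1.5 * x))    # MAR: depends on the observed X only
r_mnar <- rbinom(n, 1, plogis(-1 + 1.5 * y))    # MNAR: depends on Y itself

# Mean of the observed Y under each pattern (the true mean is 0)
obs_mean <- c(
  MCAR = mean(y[r_mcar == 0]),
  MAR  = mean(y[r_mar == 0]),
  MNAR = mean(y[r_mnar == 0])
)
print(round(obs_mean, 2))

# Under MAR, conditioning on X removes the problem: the regression of Y on X
# fitted to the observed cases still recovers the slope.
slope_mar <- coef(lm(y ~ x, subset = r_mar == 0))["x"]
print(round(slope_mar, 2))
```

In this setup the observed mean is close to the truth only under MCAR, while the regression on X stays usable under MAR, which is why methods that condition on X can handle MAR but not MNAR.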

Including auxiliary variables
The association between Y (the variable with missing values) and R (the missingness flag) can be removed by including an auxiliary variable (like a covariate).

Type of imputation
Type | Method | How to | MCAR | MAR | MNAR
Delete | Listwise | Delete missing data by sample | ○ | x | x
Delete | Pairwise | Delete missing data by sample in each analysis | ○ | x | x
Impute values | Single imputation (primitive) | Make a single dataset by substituting the mean, the median, or a value from a regression model fitted after deleting the missing values | ○ | x | x
Impute values | Single imputation | Make a single dataset by substituting values from a mixed model that accounts for missingness (FIML) | ○ | ○ | △
Impute values | Multiple imputation | Impute many datasets by using MCMC etc. | ○ | ○ | △

Maximum Likelihood Method
Density function of the multivariate normal distribution
μ: mean (n x 1 vector)
Σ: variance-covariance matrix (n x n matrix)
n: number of variables
Find the μ and Σ that maximize the probability of obtaining the dataset given μ and Σ (the log likelihood).
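
The formulas on this slide did not survive in the transcript. For reference, the standard multivariate normal density and the corresponding log likelihood are written out below. The record index i, the number of records N and the record vector x_i are notation assumed here; μ, Σ and n follow the slide.

```latex
f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = (2\pi)^{-n/2} \, |\boldsymbol{\Sigma}|^{-1/2}
    \exp\!\left( -\tfrac{1}{2}
      (\mathbf{x} - \boldsymbol{\mu})^{\top}
      \boldsymbol{\Sigma}^{-1}
      (\mathbf{x} - \boldsymbol{\mu}) \right)

\log L(\boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sum_{i=1}^{N} \log f(\mathbf{x}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = -\frac{N n}{2} \log(2\pi)
    - \frac{N}{2} \log |\boldsymbol{\Sigma}|
    - \frac{1}{2} \sum_{i=1}^{N}
      (\mathbf{x}_i - \boldsymbol{\mu})^{\top}
      \boldsymbol{\Sigma}^{-1}
      (\mathbf{x}_i - \boldsymbol{\mu})
```

When a record has missing items, FIML evaluates the same density using only the observed components of that record, with the matching sub-vector of μ and sub-matrix of Σ, so no record has to be dropped.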

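Full Information Maximum Likelihood Method (FIML)
[Figure: complete records and records with a missing item; the mean estimated by FIML shown on the slide did not survive extraction]
No bias for the item with no missing values (IQ), because no record is eliminated.
Reduced bias for the item with missing values (aptitude test), because IQ etc. are included.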

FIML only
- Variance is underestimated (because the estimated values lie on the regression line).
FIML + uncertainty (random error)
- Called 'stochastic regression imputation' (an error term is added).
- But the variance is still underestimated.
Multiple imputation
- Make several datasets.
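
The effect of the error term can be checked numerically. The following is a minimal sketch in base R that uses an ordinary regression fitted to the observed cases as a stand-in for the FIML model; the simulated data, the 40% MCAR missingness and all names are assumptions for illustration.

```r
set.seed(2)
n <- 5000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)     # true variance of y is 0.36 + 0.64 = 1
miss <- rbinom(n, 1, 0.4) == 1        # 40% of y missing completely at random
y_obs <- ifelse(miss, NA, y)

fit <- lm(y_obs ~ x)                  # fitted on the observed cases only
pred <- predict(fit, newdata = data.frame(x = x))
sigma_hat <- summary(fit)$sigma       # residual standard deviation

# Deterministic regression imputation: imputed values lie on the regression line
y_reg <- ifelse(miss, pred, y_obs)
# Stochastic regression imputation: add a random residual to each prediction
y_sto <- ifelse(miss, pred + rnorm(n, sd = sigma_hat), y_obs)

vars <- c(true = var(y), regression = var(y_reg), stochastic = var(y_sto))
print(round(vars, 2))
```

In this sketch the added residual restores the spread of the variable itself. A single completed dataset still treats imputed values as if they had been observed, so standard errors computed from it are too small; this is presumably the remaining underestimation the slide refers to, and it is what multiple imputation addresses.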

Multiple Imputation (MI)
Original data with missing values
-> Imputation step -> N imputed datasets
-> Statistical analysis -> N imputed parameter sets (Param set 1, Param set 2, ..., Param set N)
-> Integration step -> single set of parameters

Algorithm of imputation step
1. Calculate initial values of the mean vector and variance-covariance matrix.
2. Estimate the parameters of the regression equation for each variable.
3. Impute missing values using the equations from step 2 as a stochastic regression model (adding random error).
4. Calculate the mean vector and variance-covariance matrix from the data completed in step 3.
5. Simulate a new mean vector and variance-covariance matrix by adding random values to those from step 4.
6. Iterate steps 2-5 several times (based on the 'data augmentation method', a type of Gibbs sampler).
The early iterations are omitted (burn-in); the imputed datasets are obtained from the later iterations.
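
The loop above can be sketched in base R for the simplest case. This is a simplified variant, not the slide's exact procedure: it assumes a single incomplete variable y and one complete predictor x, so the simulated parameters are regression coefficients and a residual variance rather than a full mean vector and variance-covariance matrix. The function name `impute_da`, its arguments and the simulated data are hypothetical.

```r
set.seed(3)
n <- 300
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[rbinom(n, 1, plogis(-0.5 + x)) == 1] <- NA    # MAR: missingness depends on x

impute_da <- function(x, y, m = 5, burn_in = 50, thin = 10) {
  miss <- is.na(y)
  X <- cbind(1, x)
  y_cur <- y
  y_cur[miss] <- mean(y, na.rm = TRUE)          # step 1: crude starting values
  XtX_inv <- solve(crossprod(X))
  df <- length(y) - ncol(X)
  out <- vector("list", m)
  for (it in seq_len(burn_in + m * thin)) {
    # steps 2 and 4: estimate the parameters from the completed data
    beta_hat <- drop(XtX_inv %*% crossprod(X, y_cur))
    rss <- sum((y_cur - drop(X %*% beta_hat))^2)
    # step 5: simulate new parameters around the estimates
    sigma2 <- rss / rchisq(1, df)
    beta <- beta_hat + drop(t(chol(sigma2 * XtX_inv)) %*% rnorm(ncol(X)))
    # step 3: stochastic regression imputation with the simulated parameters
    y_cur[miss] <- drop(X[miss, , drop = FALSE] %*% beta) +
      rnorm(sum(miss), sd = sqrt(sigma2))
    # step 6: skip the burn-in, then keep every `thin`-th completed dataset
    if (it > burn_in && (it - burn_in) %% thin == 0) {
      out[[(it - burn_in) / thin]] <- data.frame(x = x, y = y_cur)
    }
  }
  out
}

imputed <- impute_da(x, y, m = 5)
```

Spacing the retained datasets by `thin` iterations is a design choice made here to reduce dependence between successive imputations; the slides only mention omitting the burn-in.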

Posterior/integration step
Estimated parameter value
- Mean of the estimates from the m imputed datasets
Standard error
- Between-imputation variance (reflects the uncertainty due to the missing data)
- Within-imputation variance
m: number of imputed datasets
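
The formulas for this step are not preserved in the transcript. The sketch below applies the standard pooling rules (Rubin's rules): the pooled estimate is the mean of the m estimates, the within-imputation variance is the mean of the squared standard errors, the between-imputation variance is the sample variance of the m estimates, and the total variance is within + (1 + 1/m) * between. The function name `pool_rubin` and the input numbers are hypothetical.

```r
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)                # pooled estimate: mean over the m datasets
  w <- mean(se^2)                   # within-imputation variance
  b <- var(est)                     # between-imputation variance
  t_var <- w + (1 + 1 / m) * b      # total variance
  c(estimate = q_bar, within = w, between = b, se = sqrt(t_var))
}

# Hypothetical estimates and standard errors from m = 5 imputed datasets
est <- c(1.0, 1.2, 0.9, 1.1, 0.8)
se  <- c(0.20, 0.20, 0.20, 0.20, 0.20)
pooled <- pool_rubin(est, se)
print(round(pooled, 3))
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates disagree across datasets, which is how the uncertainty due to missing data enters the final result.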

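R code for Multiple Imputation
> library(mice)                            # multiple imputation by chained equations
> data(sleep, package="VIM")               # example data containing missing values
> imp <- mice(sleep, m=20)                 # imputation step: 20 imputed datasets
> fit <- with(imp, lm(Dream~Span+Gest))    # analysis step: fit the model in each dataset
> pooled <- pool(fit)                      # integration step: pool the 20 fits
> summary(pooled)
[Output: a table with rows (Intercept), Span and Gest and columns est, se, t, df, Pr(>|t|), lo 95, hi 95, nmis, fmi and lambda; the numeric values did not survive extraction.]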