Published by Preston Briggs. Modified over 8 years ago.
1
A framework for multiple imputation & clustering
- Mainly basic idea for imputation -
Tokei Benkyokai, 2013/10/28
T. Kawaguchi
2
Outline
1. JFMC41
2. Introduction of imputation of missing data
3. Framework for Multiple Imputation for Cluster Analysis (American Journal of Epidemiology, Feb 27, 2013)
3
JFMC41
Time course
Main objective: to find genetic factors associated with the response (side effects) to anticancer drugs.
Features:
- Side effects over 12 courses of chemotherapy are observed in 486 colorectal cancer patients.
- The main outcomes are peripheral neuropathy and allergy, based on patient self-report.
- A four-grade classification is widely used to determine the degree of numbness.
- Several censored samples exist.
4
JFMC41
Question 1. How should this kind of data be analyzed? Can we make a guideline?
Examples of hypotheses and solutions:
1. Patients who reach a bad status at least once may differ genetically from the others. Ex) Max grade = 3 vs. the others, analyzed by logistic regression.
2. The status may be associated linearly. Ex) Max grade over the 12 courses as the outcome, analyzed by ordinal logistic regression.
3. The timing of the bad status may be associated with genetic background. Ex) Survival analysis with the event (death) encoded as {Max grade = 3}.
4. There may be a good pathological classification. Ex) Clustering into early, late, robust, … groups, analyzed by (multiple) logistic regression.
Question 2. How can pathology be classified from the data automatically? How should the missing data be handled?
5
6
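Algorithm
1. Create r datasets by multiple imputation.
2. Determine the best k (number of clusters) for each dataset.
3. From the best-k results, decide which variables to include in the final model.
4-7. Re-run the clustering on all datasets using the same model.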
7
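Evaluation (1)
- Distribution of the fit across the datasets when k clusters are used ⇒ which k is best
- For k = 2, how many times each variable was selected across the datasets ⇒ which set of variables is best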
8
Evaluation (2)
Shows how many times out of r each sample is assigned to its final cluster ⇒ evaluates the number of clusters, consistency, etc.
9
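Evaluation (3)
Differences in the distribution of each variable between clusters, shown side by side for the raw data (relationship between the final cluster assignment and the raw data) and for the imputed data (results of all sets combined) ⇒ overall evaluation of each variable's contribution and of the effect of imputation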
10
Introduction of Imputation
11
Pattern of missing
MCAR: Missing Completely At Random
MAR: Missing At Random
MNAR: Missing Not At Random
Y: a variable that has missing values
R: flag indicating whether Y is missing or not (0/1)
X: a variable included in the analysis
[Figure: example columns of Y and R (0/1 flags), and a diagram relating X, Y, and W]
http://www4.ocn.ne.jp/~murakou/missing_data.pdf
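The three patterns can be illustrated with a small simulation. The following is a minimal sketch in base R, not taken from the slides: the variable names, the sample size, and the logistic missingness probabilities are all assumptions chosen for illustration. It compares the mean of the observed Y under each mechanism with the mean of the full Y.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)              # fully observed variable X
y <- 0.7 * x + rnorm(n)    # variable Y that will receive missing values

# R = 1 means "Y is missing", following the 0/1 flag on the slide
r_mcar <- rbinom(n, 1, 0.3)                   # MCAR: unrelated to X and Y
r_mar  <- rbinom(n, 1, plogis(-1 + 1.5 * x))  # MAR: depends only on observed X
r_mnar <- rbinom(n, 1, plogis(-1 + 1.5 * y))  # MNAR: depends on Y itself

# Mean of Y among the records where Y is observed (R = 0)
obs_mean <- function(r) mean(y[r == 0])

means <- c(full = mean(y),
           mcar = obs_mean(r_mcar),
           mar  = obs_mean(r_mar),
           mnar = obs_mean(r_mnar))
print(round(means, 3))
```

In this sketch the observed mean stays close to the full mean only under MCAR; under MAR and MNAR the high values of Y are missing more often, so the observed mean is shifted downward.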
12
Including auxiliary variable
Including an auxiliary variable (like a covariate) removes the association between Y (the variable with missing values) and R (the missing flag).
13
Type of imputation
(○ = appropriate, △ = partly appropriate, x = not appropriate)

type | method | how to | MCAR | MAR | MNAR
delete | listwise | delete missing data by sample | ○ | x | x
delete | pairwise | delete missing data by sample in each analysis | ○ | x | x
impute values | single imputation (primitive) | make a single dataset by substituting the mean, the median, or a value from a regression model fitted after deleting the missing values | ○ | x | x
impute values | single imputation | make a single dataset by substituting values from a mixed model that takes the missingness into account (FIML) | ○ | ○ | △
impute values | multiple imputation | impute many datasets by using MCMC etc. | ○ | ○ | △
14
Maximum Likelihood Method
Density function for the multivariate normal distribution:
- μ: mean (n x 1 vector)
- Σ: variance-covariance matrix (n x n matrix)
- n: number of variables
Find the μ and Σ that maximize the probability of obtaining the dataset, conditioned on μ and Σ (the log likelihood).
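For reference, the standard multivariate normal density and the corresponding log likelihood can be written as follows. The symbols μ, Σ, and n follow the slide; N (the number of records) and y_i (the i-th record) are notation assumed here.

```latex
f(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = (2\pi)^{-n/2} \, |\boldsymbol{\Sigma}|^{-1/2}
    \exp\!\left( -\tfrac{1}{2}
      (\mathbf{y} - \boldsymbol{\mu})^{\top}
      \boldsymbol{\Sigma}^{-1}
      (\mathbf{y} - \boldsymbol{\mu}) \right)

\log L(\boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sum_{i=1}^{N} \log f(\mathbf{y}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = -\frac{Nn}{2} \log(2\pi)
    - \frac{N}{2} \log |\boldsymbol{\Sigma}|
    - \frac{1}{2} \sum_{i=1}^{N}
      (\mathbf{y}_i - \boldsymbol{\mu})^{\top}
      \boldsymbol{\Sigma}^{-1}
      (\mathbf{y}_i - \boldsymbol{\mu})
```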
15
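Full Information Maximum Likelihood Method (FIML)
[Figure: table of complete records and records with a missing item; values legible on the slide: 148, 109.8]
- No bias for the item without missing values (IQ), because no record is eliminated.
- Reduced bias for the item with missing values (aptitude test), because IQ and other variables are included.
- Mean = 110.9 by FIML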
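FIML is usually written as a record-by-record version of the multivariate normal log likelihood, in which each record contributes only through the variables it actually has. A sketch in that standard form, with notation assumed here rather than taken from the slide: n_i is the number of observed variables in record i, and μ_i and Σ_i are the parts of μ and Σ that correspond to those variables.

```latex
\log L(\boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sum_{i=1}^{N} \left[
      -\frac{n_i}{2} \log(2\pi)
      - \frac{1}{2} \log |\boldsymbol{\Sigma}_i|
      - \frac{1}{2}
        (\mathbf{y}_i - \boldsymbol{\mu}_i)^{\top}
        \boldsymbol{\Sigma}_i^{-1}
        (\mathbf{y}_i - \boldsymbol{\mu}_i)
    \right]
```

A record with every variable observed has μ_i = μ and Σ_i = Σ, so with no missing data this reduces to the ordinary log likelihood.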
16
- FIML only: variance is underestimated (because the estimated values lie on the regression equation)
- FIML + uncertainty (random error): called 'stochastic regression imputation', by adding an error term; but the variance is still underestimated
- Multiple imputation: make several datasets
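The effect of placing imputed values exactly on the regression line can be seen in a small simulation. This is a minimal sketch in base R under stated assumptions: it uses ordinary least-squares regression imputation in place of FIML, MCAR missingness for simplicity, and made-up variable names and parameters.

```r
set.seed(42)
n <- 2000
x <- rnorm(n, mean = 100, sd = 15)
y <- 20 + 0.8 * x + rnorm(n, sd = 10)
miss <- rbinom(n, 1, 0.4) == 1          # MCAR missing flags for y
y_obs <- ifelse(miss, NA, y)

fit <- lm(y_obs ~ x)                    # lm() drops the rows with NA by default
pred <- predict(fit, newdata = data.frame(x = x[miss]))
sigma_hat <- summary(fit)$sigma         # residual standard deviation

# Deterministic regression imputation: values lie on the regression line
y_det <- y_obs
y_det[miss] <- pred

# Stochastic regression imputation: add a random error term
y_sto <- y_obs
y_sto[miss] <- pred + rnorm(sum(miss), sd = sigma_hat)

vars <- c(true = var(y), deterministic = var(y_det), stochastic = var(y_sto))
print(round(vars, 1))
```

In this sketch the deterministic version shrinks the variance of y, and adding the error term restores it. A single stochastic imputation still treats the imputed values as if they had been observed, so the uncertainty of estimates (standard errors) remains too small, which is the gap that multiple imputation addresses.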
17
Multiple Imputation (MI)
Original data with missing values
→ Imputation step → N imputed datasets
→ Statistical analysis → N imputed parameter sets (Param set 1, Param set 2, …, Param set N)
→ Integration step → single set of parameters
18
Algorithm of imputation step
1. Calculate initial values of the mean vector and variance-covariance matrix.
2. Estimate the parameters of the regression equation for each variable.
3. Impute the missing values by using the equations from step 2 as a stochastic regression model (adding random error).
4. Calculate the mean vector and variance-covariance matrix from the result of step 3.
5. Simulate a new mean vector and variance-covariance matrix by adding random values to the result of step 4.
6. Iterate steps 2-5 several times (based on the 'data augmentation method', a type of Gibbs sampler).
The early iterations are omitted (burn-in); the imputed datasets are obtained from the later iterations.
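The loop above can be sketched for the simplest case: one variable with missing values and one fully observed predictor. This is a simplified illustration in base R, not the presenter's implementation. Instead of perturbing the mean vector and variance-covariance matrix, it draws the regression coefficients and residual variance directly from their posterior under a noninformative prior, which is an assumption made here. All names and settings are hypothetical.

```r
set.seed(7)
n <- 300
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
miss <- rbinom(n, 1, 0.3) == 1

y_cur <- y
y_cur[miss] <- mean(y[!miss])      # step 1: crude starting values

X <- cbind(1, x)                   # design matrix: intercept and slope
XtX_inv <- solve(crossprod(X))     # x is complete, so this never changes
n_iter <- 200
burn_in <- 100
draws <- matrix(NA_real_, nrow = n_iter, ncol = 2)

for (it in seq_len(n_iter)) {
  # step 2: regression estimated from the currently completed data
  beta_hat <- XtX_inv %*% crossprod(X, y_cur)
  rss <- sum((y_cur - X %*% beta_hat)^2)

  # analogue of steps 4-5: draw new parameters around the estimates
  sigma2 <- rss / rchisq(1, df = n - 2)
  beta <- beta_hat + t(chol(sigma2 * XtX_inv)) %*% rnorm(2)

  # step 3: stochastic regression imputation with the drawn parameters
  mu_mis <- as.vector(X[miss, , drop = FALSE] %*% beta)
  y_cur[miss] <- mu_mis + rnorm(sum(miss), sd = sqrt(sigma2))

  draws[it, ] <- as.vector(beta)
}

# Discard the burn-in iterations before summarising the draws
post <- draws[-seq_len(burn_in), , drop = FALSE]
print(colMeans(post))
```

After the loop, `y_cur` holds one imputed dataset. To obtain several datasets for multiple imputation, one would keep the state of `y_cur` at well-separated iterations after burn-in, or rerun the chain.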
19
Posterior/integration step
- Estimated parameter value: mean of the r sets
- Standard error: combines the between-imputation variance (reflects the uncertainty due to missing data) and the within-imputation variance
- m: number of datasets
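The integration step can be sketched with Rubin's rules, assuming that is what the slide's formulas showed: the pooled estimate is the mean of the m estimates, and the total variance is W + (1 + 1/m) B, where W is the mean of the squared standard errors (within) and B is the variance of the estimates across datasets (between). The function name `pool_rubin` and its arguments are hypothetical.

```r
# Pool m estimates and their standard errors with Rubin's rules
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)               # pooled estimate
  w <- mean(se^2)                  # within-imputation variance
  b <- var(est)                    # between-imputation variance (divides by m - 1)
  total <- w + (1 + 1 / m) * b     # total variance
  c(estimate = q_bar, se = sqrt(total), within = w, between = b)
}

# Three imputed datasets gave these estimates and standard errors
res <- pool_rubin(est = c(1.0, 1.2, 0.8), se = c(0.5, 0.5, 0.5))
print(res)
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates differ across datasets, which is how the uncertainty from the missing data enters the final result.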
20
R code for Multiple Imputation
> library(mice)
> data(sleep, package="VIM")               # example data with missing values
> imp <- mice(sleep, m=20)                 # imputation step: 20 imputed datasets
> fit <- with(imp, lm(Dream~Span+Gest))    # fit the model in each dataset
> pooled <- pool(fit)                      # integration step
> summary(pooled)
                     est          se         t       df     Pr(>|t|)
(Intercept)  2.533814912 0.248338753 10.203059 55.64119 2.353673e-14
Span        -0.005025580 0.011658356 -0.431071 56.66285 6.680523e-01
Gest        -0.003793599 0.001459708 -2.598876 55.00831 1.198209e-02
                   lo 95         hi 95 nmis        fmi      lambda
(Intercept)  2.036261806  3.0313680183   NA 0.05344903 0.020026315
Span        -0.028374045  0.0183228856    4 0.04021465 0.006925097
Gest        -0.006718909 -0.0008682897    4 0.06044704 0.026896558