Exercise 1
In the ISwR data set alkfos, do a PCA of the placebo and Tamoxifen groups separately, then together.
Plot the first two principal components of the whole group with color coding for the treatment and control subjects.
For this and other parts to this assignment, omit the patients with missing data.
Conduct a linear discriminant analysis of the two groups using the 7 variables. How well can you predict the treatment? Is this the usual kind of analysis you would see?
Use logistic regression to predict the group based on the measurements. Compare the in-sample error rates.
Use cross-validation with repeated training subsamples of 30/35 and test sets of size 5/35. What can you now conclude about the two methods?
May 28, 2013, SPH 247 Statistical Analysis of Laboratory Data
Exercise 2
In the ISwR data set alkfos, cluster the data based on the 7 measurements using hclust(), kmeans(), and Mclust().
Compare the 2-group clustering with the placebo/Tamoxifen classification.
> alkfos2 <- na.omit(alkfos)   # omits missing values
> pc1 <- prcomp(alkfos2[alkfos2[,1]==1,2:8],scale=T)
> pc2 <- prcomp(alkfos2[alkfos2[,1]==2,2:8],scale=T)
> pc.all <- prcomp(alkfos2[,2:8],scale=T)
[prcomp output: standard deviations and rotation matrix with columns PC1 to PC7; numeric values not preserved]
> plot(pc.all)
> plot(pc.all$x,col=alkfos2[,1])
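The scree plot from plot(pc.all) can be complemented by the proportion of variance each component explains. The following is a minimal sketch on synthetic data, not the alkfos measurements: the matrix x, its column names v1 to v7, and the simulated group effect are all assumptions made so the example runs with base R alone. The same two lines of arithmetic would apply to pc.all$sdev.

```r
# Synthetic stand-in for the 7 measurements (an assumption, not the alkfos data):
# 35 subjects, two groups, one shared subject-level signal plus noise.
set.seed(1)
n <- 35
grp <- rep(1:2, length.out = n)
subject.level <- rnorm(n, mean = 100 + 20 * (grp == 2), sd = 15)
x <- sapply(1:7, function(j) subject.level + rnorm(n, sd = 5))
colnames(x) <- paste0("v", 1:7)   # hypothetical variable names

# PCA on standardized variables
pc <- prcomp(x, scale. = TRUE)

# Proportion of total variance carried by each component, and the running total
var.explained <- pc$sdev^2 / sum(pc$sdev^2)
print(round(var.explained, 3))
print(round(cumsum(var.explained), 3))
```

If the first one or two proportions dominate, a plot of the first two principal components shows most of the structure in the seven variables.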
> library(MASS)
> alkfos.lda <- lda(alkfos2[,2:8],grouping=alkfos2[,1])
> alkfos.lda
Call:
lda(alkfos2[, 2:8], grouping = alkfos2[, 1])
[lda output: prior probabilities of groups, group means for c0 to c24, and coefficients of the linear discriminant LD1; numeric values not preserved]
> plot(alkfos.lda)
> alkfos.pred <- predict(alkfos.lda)
> table(alkfos2$grp,alkfos.pred$class)
34 in 35 correct.
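The in-sample error rate can be read off the confusion table as one minus the diagonal share. The following is a small sketch on synthetic data, assuming two hypothetical predictors a and b in place of the seven alkfos measurements; only lda(), predict(), and table() are used, as in the slides.

```r
library(MASS)

# Synthetic two-group data (an assumption, not the alkfos data)
set.seed(2)
n <- 35
grp <- rep(1:2, length.out = n)
x <- data.frame(a = rnorm(n, mean = 2 * grp),   # shifts with group
                b = rnorm(n))                   # pure noise

fit <- lda(x, grouping = grp)
pred <- predict(fit)$class    # in-sample predicted classes

# Rows are the true groups, columns the predicted groups
conf <- table(actual = grp, predicted = pred)
print(conf)

# In-sample error rate: share of subjects off the diagonal
err <- 1 - sum(diag(conf)) / sum(conf)
print(err)
```

Because the same subjects are used to fit and to evaluate, this rate tends to be optimistic, which is the motivation for cross-validation.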
> alkfos.glm <- glm(as.factor(grp) ~ 1,data=alkfos2,family=binomial)
> step(alkfos.glm,scope=formula(~ c0+c3+c6+c9+c12+c18+c24),steps=2)
Start:  AIC=49.11
as.factor(grp) ~ 1
[step output: table of candidate terms (Df, Deviance, AIC); values not preserved]
Step:  AIC=42.47
as.factor(grp) ~ c6
[step output: table of candidate terms (Df, Deviance, AIC); values not preserved]
Step:  AIC=30.28
as.factor(grp) ~ c6 + c0
We limited step() to two steps to avoid a model with undetermined coefficients. Once the predictions are perfect (which happens with three or more variables in this case), the likelihood has no finite maximum, so the coefficients cannot be estimated and competing models cannot be distinguished.
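The problem of perfect prediction can be reproduced on a toy example. The sketch below uses made-up data (x and y are assumptions, unrelated to alkfos) in which a single predictor separates the two classes exactly; glm() then issues warnings and returns a slope that is large and essentially arbitrary.

```r
# Toy data: y is 1 exactly when x is positive, so x separates the classes perfectly.
x <- c(-5:-1, 1:5)
y <- as.integer(x > 0)

# Collect the warnings from glm() instead of printing them.
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)                        # warnings raised during fitting
print(coef(fit))                   # slope is large; it would keep growing with more iterations
print(max(abs(fitted(fit) - y)))   # fitted probabilities are essentially 0 or 1
```

With the deviance already near zero, adding further variables cannot improve the fit in any meaningful way, so AIC comparisons between such models carry little information.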
alkfos.lda.cv <- function(ncv,ntrials)
{
  require(MASS)
  require(ISwR)                        # provides the alkfos data set
  data(alkfos)
  alkfos2 <- na.omit(alkfos)
  n1 <- dim(alkfos2)[1]
  nwrong <- 0
  npred <- 0
  for (i in 1:ntrials)
  {
    test <- sample(n1,ncv)             # random test set of size ncv
    test.set <- data.frame(alkfos2[test,2:8])
    train.set <- data.frame(alkfos2[-test,2:8])
    lda.ap <- lda(train.set,alkfos2[-test,1])
    lda.pred <- predict(lda.ap,test.set)
    nwrong <- nwrong + sum(lda.pred$class != alkfos2[test,1])
    npred <- npred + ncv
  }
  print(paste("total number classified = ",npred,sep=""))
  print(paste("total number wrong = ",nwrong,sep=""))
  print(paste("percent wrong = ",100*nwrong/npred,"%",sep=""))
}
alkfos.glm.cv <- function(ncv,ntrials)
{
  require(MASS)
  require(ISwR)                        # provides the alkfos data set
  data(alkfos)
  alkfos2 <- na.omit(alkfos)
  alkfos2$grp <- as.factor(alkfos2$grp)
  n1 <- dim(alkfos2)[1]
  nwrong <- 0
  npred <- 0
  for (i in 1:ntrials)
  {
    test <- sample(n1,ncv)             # random test set of size ncv
    test.set <- alkfos2[test,]
    train.set <- alkfos2[-test,]
    glm.ap <- glm(grp ~ 1,data=train.set,family=binomial)
    glmstep.ap <- step(glm.ap,scope=formula(~ c0+c3+c6+c9+c12+c18+c24),steps=2,trace=0)
    glm.pred <- predict(glmstep.ap,newdata=test.set,type="response")
    grp.pred <- (glm.pred > 0.5)+1     # probability above 0.5 maps to group 2, otherwise group 1
    nwrong <- nwrong + sum(grp.pred != test.set$grp)
    npred <- npred + ncv
  }
  print(paste("total number classified = ",npred,sep=""))
  print(paste("total number wrong = ",nwrong,sep=""))
  print(paste("percent wrong = ",100*nwrong/npred,"%",sep=""))
}
Results of Cross Validation
LDA has 1 error in 35 in sample (2.9%)
Cross-validated seven-fold, this is 720/10000 = 7.2%
Stepwise logistic regression with two variables has 3 errors in 35 in sample (8.6%)
Cross-validated seven-fold, this is 1830/10000 = 18.3%
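The four error rates above can be set side by side to see how optimistic the in-sample figures are. This is only arithmetic on the counts reported on this slide; the data frame rates and the column name optimism are hypothetical names chosen for illustration.

```r
# Counts taken from the slide: in-sample errors out of 35,
# cross-validated errors out of 10000 test predictions.
rates <- data.frame(
  method    = c("LDA", "stepwise logistic (2 variables)"),
  in.sample = c(1, 3) / 35,
  cross.val = c(720, 1830) / 10000
)

# How many times larger the cross-validated rate is than the in-sample rate
rates$optimism <- rates$cross.val / rates$in.sample

print(data.frame(method    = rates$method,
                 in.sample = round(100 * rates$in.sample, 1),
                 cross.val = round(100 * rates$cross.val, 1),
                 optimism  = round(rates$optimism, 2)))
```

On these counts the cross-validated error is more than twice the in-sample error for both methods, and LDA remains the better of the two under cross-validation.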
> ap.hc <- hclust(dist(alkfos2[,2:8]))
> plot(ap.hc)
> cutree(ap.hc, 2)
> table(cutree(ap.hc, 2),alkfos2$grp)
> table(kmeans(alkfos2[,2:8],2)$cluster,alkfos2$grp)
> library(mclust)
> Mclust(alkfos2[,2:8])
'Mclust' model object:
 best model: ellipsoidal, equal shape (VEV) with 6 components
> table(Mclust(alkfos2[,2:8])$class,alkfos2$grp)
> table(Mclust(alkfos2[,2:8],G=2)$class,alkfos2$grp)
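Cluster numbers are arbitrary, so a cross-table of clusters against groups has to be read up to relabeling. The sketch below, on synthetic two-group data (an assumption, not the alkfos measurements), defines a hypothetical helper agreement() that scores a 2-cluster solution against a 2-group label under the better of the two possible matchings. It uses only hclust(), cutree(), and kmeans() from base R.

```r
# Synthetic data with two well-separated groups (an assumption, not the alkfos data)
set.seed(4)
n <- 35
grp <- rep(1:2, length.out = n)
x <- cbind(rnorm(n, mean = 4 * grp), rnorm(n, mean = 4 * grp))

hc.cl <- cutree(hclust(dist(x)), 2)
km.cl <- kmeans(x, centers = 2, nstart = 10)$cluster

# Hypothetical helper: share of subjects on which a 2-cluster solution
# agrees with a 2-group label, taking the better of the two labelings.
agreement <- function(cluster, truth) {
  tab <- table(cluster, truth)
  stopifnot(all(dim(tab) == c(2, 2)))
  on.diag <- sum(diag(tab))
  max(on.diag, sum(tab) - on.diag) / sum(tab)
}

print(table(hc.cl, grp))
print(agreement(hc.cl, grp))
print(agreement(km.cl, grp))
```

A value near 0.5 means the clustering is unrelated to the grouping, while a value near 1 means the clusters recover the groups. The helper is restricted to the 2-by-2 case and would need a proper matching step for more clusters.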