Download presentation
Presentation is loading. Please wait.
Published byDominick Harris Modified over 9 years ago
1
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 1 Stats 330: Lecture 25
2
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 2 Plan of the day In today’s lecture we discuss prediction and present a logistic regression case study. Topics covered will be Prediction in logistic regression In-sample and out-of-sample error rates Cross-validation and bootstrap estimates of error rates Sensitivity and specificity ROC curves Then, a case study
3
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 3 Housekeeping Error in slide 34 in lecture 23: Function is now influenceplots Bug in ROC.curve – download replacement from web page
4
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 4 Prediction Suppose we have fitted a logistic model and we want to use the model to predict new cases. If a new case presents with explanatory variables x, how do we predict the y-value, 0 or 1? Work out the estimated log-odds for the case Work out probability: Prob = exp(log-odds)/(1+exp(log.odds)) Predict –Y=1 if prob >= 0.5 (equivalently log.odds >=0) –Y=0 if prob < 0.5 (equivalently log.odds <0)
5
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 5 Estimating the prediction error Prediction error is the probability of a wrong classification ( 0’s predicted as 1’s, 1’s predicted as 0’s) As in linear regression, using the training data to estimate these proportions tends to give an optimistic estimate We can use cross-validation or the bootstrap to improve the estimate –see the case study
6
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 6 Sensitivity and specificity Sensitivity: probability of predicting a 1 when the case is truly a 1: the “true positive rate” Specificity: probability of predicting a 0 when the case is truly a 0: the “true negative rate” (1- specificity is called the “false positive rate”) Ideally, want both to be close to 1 We would like to know what these would be for new data – use cross-validation and the bootstrap as for normal regression
7
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 7 Calculating sensitivity and specificity Model predicts Failure (0)Success (1) ActualFailure (0)100200 Success ( 1)250600 Specificity = 100/(100+200) = 33% Sensitivity = 600/(600+250) = 70% In-sample error rate = (200+250)/1150
8
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 8 ROC curves We have predicted a “success” (Y=1) if the log-odds are positive. We can generalize this to predict a success if log-odds >=c for some constant c If c is large and –ve, almost every case will be predicted as a success (1) –Sensitivity close to 1, specificity close to 0 If c is large and +ve, almost every case will be predicted as a failure (0) –Sensitivity close to 0, specificity close to 1 Allows a trade-off between sensitivity and specificity As c varies, the sensitivity and specificity change. ROC curve is a plot of the points (1-specificity, sensitivity) as c changes.
9
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 9
10
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 10 ROC curves - cont False positive rate True positive rate False positive rate True positive rate Perfect prediction True positive rate Worst case prediction Predictor no help
11
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 11 Area under the curve For a perfect predictor, the area under the ROC curve (AUC) is 1. If the predictor is independent of the response, the sensitivity and specificity are both 0.5. AUC curve serves as a measure of how good the model is at predicting.
12
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 12 Case study The data comes from the University of Massachusetts AIDS Research Unit IMPACT study, a medical study performed in the US in the early 90’s. The study aimed to evaluate two different treatments for drug addiction. Reference: Hosmer and Lemeshow, Applied Logistic Regression (2 nd Ed), p28
13
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 13 List of variables Variable DescriptionCodes/ValuesName Identification Code1-575ID Age at EnrollmentYearsAGE Beck Depression Score 0-54BECK IV Drug Use History 1 = Never, IVHX at Admission 2 = Previous, 3 = Recent No of prior Treatments 0-40NDRUGTX Subject's Race0 = White, RACE 1 = Other Treatment Duration0=short, 1 = LongTREAT Treatment Site0 = A, 1 = B SITE Remained Drug Free 1 = Yes, 0 = NoDFREE
14
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 14 The variables The response DFREE is binary: records if subject is drug-free after conclusion of treatment. There is a mix of categorical and continuous explanatory variables Categorical: IVHX, RACE, TREAT, SITE Continuous: AGE, BECK, NDRUGTX.
15
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 15 Questions Is the longer treatment more effective? Did Site A deliver the program more effectively than site B? What other variables have an effect on successful rehabilitation of addicts? Can we predict who is likely to be drug- free in 12 months?
16
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 16 Analysis strategy Preliminary plots, tables Variable selection Model fitting Interpretation of coefficients Evaluation as a predictor of recovery from addiction.
17
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 17 Preliminary plots
18
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 18 Preliminary plots (2)
19
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 19 Preliminary plots (3) Seems like number of previous drug treatments have an effect Seems like factors IVHX (Recent IV drug use), SITE (Site A or Site B) and TREAT (short or long treatment) have an effect
20
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 20 Preliminary fits (1) Call: glm(formula = DFREE ~. - IVHX - ID + factor(IVHX), family = binomial, data = drug.df) Deviance Residuals: Min 1Q Median 3Q Max -1.3465 -0.8091 -0.6326 1.1834 2.4231 Don’t include ID!
21
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 21 Preliminary fits (2) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.4111283 0.5983427 -4.030 5.59e-05 *** AGE 0.0504143 0.0174057 2.896 0.00377 ** BECK 0.0002759 0.0107982 0.026 0.97961 NDRUGTX -0.0615329 0.0256441 -2.399 0.01642 * RACE 0.2260262 0.2233685 1.012 0.31159 TREAT 0.4424802 0.1992922 2.220 0.02640 * SITE 0.1489209 0.2176062 0.684 0.49375 factor(IVHX)2 -0.6036962 0.2875974 -2.099 0.03581 * factor(IVHX)3 -0.7336591 0.2549893 -2.877 0.00401 ** (Dispersion parameter for binomial family taken to be 1) Null deviance: 653.73 on 574 degrees of freedom Residual deviance: 619.25 on 566 degrees of freedom AIC: 637.25 Number of Fisher Scoring iterations: 4
22
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 22 Preliminary conclusions Important variables seem to be AGE, NDRUGTX, TREAT, IVHX Data are ungrouped, can’t assess goodness of fit with residual deviance No extremely large residuals
23
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 23 Hosmer-Lemeshow test > HLstat(drug.glm) Value of HL statistic = 5.05 P-value = 0.752 No evidence of a bad fit using this test
24
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 24 Variable selection (1) > anova(drug.glm, test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: DFREE Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 574 653.73 AGE 1 1.40 573 652.33 0.24 BECK 1 0.57 572 651.76 0.45 NDRUGTX 1 14.07 571 637.69 0.0001760 RACE 1 3.06 570 634.63 0.08 TREAT 1 4.96 569 629.67 0.03 SITE 1 1.07 568 628.60 0.30 factor(IVHX) 2 9.35 566 619.25 0.01
25
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 25 Variable selection (2) Step: AIC= 632.59 DFREE ~ NDRUGTX + IVHX + AGE + TREAT Call: glm(formula = DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family = binomial, data = drug.df) Degrees of Freedom: 574 Total (i.e. Null); 569 Residual Null Deviance: 653.7 Residual Deviance: 620.6 AIC: 632.6
26
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 26 Sub-model > sub.glm<-glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, family=binomial, data=drug.df) > summary(sub.glm) Call: glm(formula = DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, family = binomial, data = drug.df) Deviance Residuals: Min 1Q Median 3Q Max -1.2598 -0.8051 -0.6284 1.1401 2.4574
27
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 27 Sub-model (ii) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.33276 0.54838 -4.254 2.1e-05 *** NDRUGTX -0.06376 0.02563 -2.488 0.012844 * factor(IVHX)2 -0.62366 0.28470 -2.191 0.028484 * factor(IVHX)3 -0.80561 0.24453 -3.294 0.000986 *** AGE 0.05259 0.01721 3.056 0.002244 ** TREAT 0.45134 0.19860 2.273 0.023048 * (Dispersion parameter for binomial family taken to be 1) Null deviance: 653.73 on 574 degrees of freedom Residual deviance: 620.59 on 569 degrees of freedom AIC: 632.59 Number of Fisher Scoring iterations: 4 All variables significant, but use caution
28
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 28 Do we need interaction terms? > sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family=binomial, data=drug.df) > sub2.glm<-glm(DFREE ~ NDRUGTX*IVHX + AGE*IVHX + AGE*TREAT + NDRUGTX*TREAT, family=binomial, data=drug.df) > > anova(sub.glm, sub2.glm, test="Chisq") Analysis of Deviance Table Model 1: DFREE ~ NDRUGTX + IVHX + AGE + TREAT Model 2: DFREE ~ NDRUGTX * IVHX + AGE * IVHX + AGE * TREAT + NDRUGTX * TREAT Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 569 620.59 2 563 616.16 6 4.42 0.62 Big p-value so interactions not required Model with interactions
29
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 29 Do we need to transform? par(mfrow=c(1,2)) sub.gam<-gam(DFREE ~ s(NDRUGTX) + factor(IVHX) + s(AGE) + TREAT, family=binomial, data=drug.df) plot(sub.gam)
30
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 30 Transforming Suggests a possible quadratic in NDRUGTX: > subq.glm<-glm(DFREE ~ poly(NDRUGTX,2) + IVHX + AGE + TREAT, family=binomial, data=drug.df) > summary(subq.glm) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.70763 0.56715 -4.774 1.80e-06 *** poly(NDRUGTX, 2)1 -7.24501 2.93274 -2.470 0.01350 * poly(NDRUGTX, 2)2 4.21319 2.69624 1.563 0.11814 IVHX2 -0.59018 0.28608 -2.063 0.03912 * IVHX3 -0.76015 0.24725 -3.074 0.00211 ** AGE 0.05458 0.01730 3.154 0.00161 ** TREAT 0.44379 0.19904 2.230 0.02577 * But term is not significant, so we stick with no transformation
31
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 31 Diagnostics Pt 85 7, 471
32
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 32 Influence of 7, 471, 85 None 7 85 471 All (Intercept) -2.333 -2.295 -2.222 -2.447 -2.293 NDRUGTX -0.064 -0.084 -0.065 -0.075 -0.100 IVHX2 -0.624 -0.595 -0.635 -0.680 -0.662 IVHX3 -0.806 -0.783 -0.785 -0.795 -0.747 AGE 0.053 0.053 0.049 0.057 0.054 TREAT 0.451 0.434 0.441 0.479 0.450 Effect on coefficients of removing cases: None seem particularly influential! We will not delete them
33
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 33 Over-dispersion qsub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family=quasibinomial, data=drug.df) > summary(qsub.glm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.33276 0.55435 -4.208 2.99e-05 *** NDRUGTX -0.06376 0.02591 -2.461 0.01414 * IVHX2 -0.62366 0.28780 -2.167 0.03065 * IVHX3 -0.80561 0.24720 -3.259 0.00118 ** AGE 0.05259 0.01740 3.023 0.00262 ** TREAT 0.45134 0.20076 2.248 0.02495 * --- (Dispersion parameter for quasibinomial family taken to be 1.021892) Very close to 1 so no overdispersion
34
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 34 Interpretation Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.33276 0.54838 -4.254 2.1e-05 *** NDRUGTX -0.06376 0.02563 -2.488 0.012844 * IVHX2 -0.62366 0.28470 -2.191 0.028484 * IVHX3 -0.80561 0.24453 -3.294 0.000986 *** AGE 0.05259 0.01721 3.056 0.002244 ** TREAT 0.45134 0.19860 2.273 0.023048 * As the number of prior treatments goes up, the probability of a drug-free recovery goes down The probability of a drug-free recovery for persons with no IV drug use is more than for persons with previous IV drug use The probability of a drug-free recovery for persons with previous IV drug use is more than for persons with recent IV drug use. The probability of a drug-free recovery goes up with age The probability of a drug-free recovery is higher for the long treatment
35
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 35 Interpreting p-values after model selection We have seen that this is not valid, as model selection changes the distribution of the estimated coefficients. We can use the bootstrap to examine the revised distribution Leave TREAT in the model, use forward selection to select the other variables.
36
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 36 Procedure Draw a bootstrap sample Do forward selection, record value of regression coef for TREAT (forced to be in every model) Repeat 200 times, draw histogram of the results
37
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 37 R code # bootstrap sample n = dim(drug.df)[1] B=200 beta.boot = numeric(B) for(b in 1:B){ ni = rmultinom(1, n,prob=rep(1/n,n)) newdata = drug.df[rep(1:n,ni),] drug.boot.glm = glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + BECK + TREAT + RACE + SITE, family=binomial, data= newdata) chosen = step(drug.boot.glm, list(lower = DFREE ~ TREAT, upper= formula(drug.boot.glm)), direction = “forward", trace=0) k = match("TREAT",names(coef(chosen))) beta.boot[b] = summary(chosen)$coefficients[k,1] }
38
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 38 Histogram > mean(beta.boot) [1] 0.4540209 > sd(beta.boot) [1] 0.2019222 > z.val = mean(beta.boot)/ sd(beta.boot) > 2*(1-pnorm(z.val)) [1] 0.02454468 Compare Beta = 0.45134 SE = 0.19860 P-value = 0.023048
39
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 39 Prediction Sensitivity: chance the model predicts a successful recovery (drug-free at end of program), when one will actually occur Specificity: chance the model predicts a failure (return to drug use before end of program), when one actually will occur
40
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 40 R code sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family=binomial, data=drug.df) > pred = predict(sub.glm, type="response") > predcode = ifelse(pred<0.5, 0,1) > table(drug.df$DFREE,predcode) predicted 0 1 Actual 0 426 2 1 144 3 Sensitivity = 3/147 = 0.02040816 Specificity = 426/428 = 0.9953271 Error rate = 146/575 = 0.2539130 Proportion correctly classified = 429/575 = 0.746087
41
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 41 ROC curve ROC.curve(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data= drug.df) # in the R330 package
42
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 42 Prediction (2) Use 10-fold cross-validation –Split data into 10 parts –Calculate sensitivity and specificity for each part, using model fitted to the remaining parts –Average results –Repeat for different splits, average repeats E.g. for one part
43
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 43 CV and bootstrap Results > cross.val(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, drug.df) Mean Specificity = 0.9908424 Mean Sensitivity = 0.02854005 Mean Correctly classified = 0.7446491 > err.boot(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data= drug.df) $err [1] 0.2539130 $Err [1] 0.2552974 A poor classifier, but this doesn’t mean that the model fits poorly – there are very few cases with fitted probs over 0.5, and many with fitted probabilities between 0.2 and 0.5. We expect a moderate number of these to be misclassified, as some events (being drug free) with probs 0.2 to 0.5 have occurred. Bootstrap estimate Training set estimate
44
© Department of Statistics 2012 STATS 330 Lecture 25: Slide 44 Overall conclusions Model seems to fit well Strong evidence that longer treatments are better No apparent difference between sites Age and prior IV drug use affect recovery Model predicts poorly for the covariates in the data set – effectively always predicts patients will not be drug free
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.