Stats 330: Lecture 25
© Department of Statistics 2012

Slide 2: Plan of the day

In today's lecture we discuss prediction and present a logistic regression case study. Topics covered:
- Prediction in logistic regression
- In-sample and out-of-sample error rates
- Cross-validation and bootstrap estimates of error rates
- Sensitivity and specificity
- ROC curves
Then, a case study.

Slide 3: Housekeeping
- Error in slide 34 of lecture 23: the function is now called influenceplots
- Bug in ROC.curve: download the replacement from the web page

Slide 4: Prediction

Suppose we have fitted a logistic model and we want to use the model to predict new cases. If a new case presents with explanatory variables x, how do we predict the y-value, 0 or 1?
- Work out the estimated log-odds for the case
- Work out the probability: prob = exp(log-odds)/(1 + exp(log-odds))
- Predict:
  - Y = 1 if prob >= 0.5 (equivalently, log-odds >= 0)
  - Y = 0 if prob < 0.5 (equivalently, log-odds < 0)
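A minimal R sketch of this rule, assuming a fitted binomial glm called fit and a data frame newcases of new explanatory variables (both hypothetical names):

log.odds <- predict(fit, newdata = newcases, type = "link")  # estimated log-odds
prob <- exp(log.odds)/(1 + exp(log.odds))                    # convert to probabilities
y.pred <- ifelse(prob >= 0.5, 1, 0)                          # classify at 0.5

Equivalently, predict(fit, newdata = newcases, type = "response") returns the probabilities directly.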

Slide 5: Estimating the prediction error
- The prediction error is the probability of a wrong classification (0's predicted as 1's, 1's predicted as 0's)
- As in linear regression, using the training data to estimate these proportions tends to give an optimistic estimate
- We can use cross-validation or the bootstrap to improve the estimate; see the case study

Slide 6: Sensitivity and specificity
- Sensitivity: the probability of predicting a 1 when the case is truly a 1 (the "true positive rate")
- Specificity: the probability of predicting a 0 when the case is truly a 0 (the "true negative rate"; 1 - specificity is called the "false positive rate")
- Ideally, we want both to be close to 1
- We would like to know what these would be for new data: use cross-validation and the bootstrap, as for normal regression
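For concreteness, a small helper function (hypothetical, not part of the course's R330 package) that computes both rates from 0/1 vectors of actual and predicted values:

sens.spec <- function(actual, predicted) {
  c(sensitivity = sum(predicted == 1 & actual == 1)/sum(actual == 1),  # true positive rate
    specificity = sum(predicted == 0 & actual == 0)/sum(actual == 0))  # true negative rate
}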

Slide 7: Calculating sensitivity and specificity

                          Model predicts
                      Failure (0)  Success (1)
Actual  Failure (0)       100          200
        Success (1)       250          600

Specificity = 100/(100 + 200) = 33%
Sensitivity = 600/(250 + 600) = 70%
In-sample error rate = (200 + 250)/1150 = 39%
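The same figures can be checked in R, entering the counts from the table above:

confusion <- matrix(c(100, 200,     # rows = actual (0, 1)
                      250, 600),    # columns = predicted (0, 1)
                    nrow = 2, byrow = TRUE)
specificity <- confusion[1, 1]/sum(confusion[1, ])                 # 100/300  = 0.33
sensitivity <- confusion[2, 2]/sum(confusion[2, ])                 # 600/850  = 0.70
error.rate <- (confusion[1, 2] + confusion[2, 1])/sum(confusion)   # 450/1150 = 0.39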

Slide 8: ROC curves
- We have predicted a "success" (Y = 1) if the log-odds are positive. We can generalize this to predict a success if log-odds >= c, for some constant c.
- If c is large and negative, almost every case will be predicted as a success (1): sensitivity close to 1, specificity close to 0.
- If c is large and positive, almost every case will be predicted as a failure (0): sensitivity close to 0, specificity close to 1.
- This allows a trade-off between sensitivity and specificity: as c varies, the sensitivity and specificity change.
- The ROC curve is a plot of the points (1 - specificity, sensitivity) as c changes.
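A sketch of this construction (the ROC.curve function used later automates it; fit and y are hypothetical names for the fitted model and the 0/1 response):

eta <- predict(fit, type = "link")     # fitted log-odds
cs <- sort(unique(eta))                # candidate cut-offs c
roc <- t(sapply(cs, function(cut.off) {
  pred <- ifelse(eta >= cut.off, 1, 0)
  c(fpr = mean(pred[y == 0] == 1),     # 1 - specificity
    tpr = mean(pred[y == 1] == 1))     # sensitivity
}))
plot(roc[, "fpr"], roc[, "tpr"], type = "l",
     xlab = "False positive rate", ylab = "True positive rate")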

Slide 9: [Figure: an example ROC curve]

Slide 10: ROC curves (continued)

[Figure: three ROC curves, each plotting true positive rate against false positive rate: "Perfect prediction", "Worst case prediction" and "Predictor no help"]

Slide 11: Area under the curve
- For a perfect predictor, the area under the ROC curve (AUC) is 1.
- If the predictor is independent of the response, the ROC curve lies along the diagonal and the AUC is 0.5.
- The AUC therefore serves as a measure of how good the model is at predicting.
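Using the (fpr, tpr) points from the sketch above, the AUC can be approximated with the trapezoidal rule:

o <- order(roc[, "fpr"])
fp <- c(0, roc[o, "fpr"], 1)   # pad with the (0, 0) and (1, 1) endpoints
tp <- c(0, roc[o, "tpr"], 1)
auc <- sum(diff(fp) * (head(tp, -1) + tail(tp, -1))/2)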

Slide 12: Case study

The data come from the University of Massachusetts AIDS Research Unit IMPACT study, a medical study performed in the US in the early 1990s. The study aimed to evaluate two different treatments for drug addiction.

Reference: Hosmer and Lemeshow, Applied Logistic Regression (2nd ed.), p. 28.

Slide 13: List of variables

Variable description       Codes/values                  Name
Identification code        1-575                         ID
Age at enrollment          Years                         AGE
Beck depression score      0-54                          BECK
IV drug use history        1 = Never, 2 = Previous,      IVHX
  at admission             3 = Recent
No. of prior treatments    0-40                          NDRUGTX
Subject's race             0 = White, 1 = Other          RACE
Treatment duration         0 = Short, 1 = Long           TREAT
Treatment site             0 = A, 1 = B                  SITE
Remained drug free         1 = Yes, 0 = No               DFREE

Slide 14: The variables
- The response DFREE is binary: it records whether the subject is drug-free after the conclusion of treatment.
- There is a mix of categorical and continuous explanatory variables:
  - Categorical: IVHX, RACE, TREAT, SITE
  - Continuous: AGE, BECK, NDRUGTX

Slide 15: Questions
- Is the longer treatment more effective?
- Did site A deliver the program more effectively than site B?
- What other variables have an effect on the successful rehabilitation of addicts?
- Can we predict who is likely to be drug-free in 12 months?

Slide 16: Analysis strategy
- Preliminary plots and tables
- Variable selection
- Model fitting
- Interpretation of coefficients
- Evaluation as a predictor of recovery from addiction

Slide 17: Preliminary plots

[Figure: exploratory plots of the data]

Slide 18: Preliminary plots (2)

[Figure: further exploratory plots]

Slide 19: Preliminary plots (3)
- The number of previous drug treatments (NDRUGTX) seems to have an effect.
- The factors IVHX (IV drug use history), SITE (site A or site B) and TREAT (short or long treatment) also seem to have an effect.

Slide 20: Preliminary fits (1)

Call:
glm(formula = DFREE ~ . - IVHX - ID + factor(IVHX), family = binomial,
    data = drug.df)

Deviance Residuals:
    Min      1Q  Median      3Q     Max

Don't include ID!

Slide 21: Preliminary fits (2)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)                                  e-05 ***
AGE                                               **
BECK
NDRUGTX                                            *
RACE
TREAT                                              *
SITE
factor(IVHX)2                                      *
factor(IVHX)3                                     **

(Dispersion parameter for binomial family taken to be 1)
Null deviance:      on 574 degrees of freedom
Residual deviance:  on 566 degrees of freedom
AIC:
Number of Fisher Scoring iterations: 4

Slide 22: Preliminary conclusions
- The important variables seem to be AGE, NDRUGTX, TREAT and IVHX.
- The data are ungrouped, so we can't assess goodness of fit with the residual deviance.
- There are no extremely large residuals.

Slide 23: Hosmer-Lemeshow test

> HLstat(drug.glm)
Value of HL statistic = 5.05
P-value =

No evidence of a bad fit using this test.
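HLstat is a course (R330) function. A minimal sketch of the standard calculation behind this kind of test, assuming the usual ten "deciles of risk" groups with distinct break points:

p <- fitted(drug.glm)
g <- 10
# group the cases by deciles of fitted probability
grp <- cut(p, quantile(p, seq(0, 1, length = g + 1)), include.lowest = TRUE)
obs <- tapply(drug.df$DFREE, grp, sum)   # observed successes per group
expd <- tapply(p, grp, sum)              # expected successes per group
n.g <- tapply(p, grp, length)            # group sizes
HL <- sum((obs - expd)^2/(expd * (1 - expd/n.g)))
p.value <- 1 - pchisq(HL, df = g - 2)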

Slide 24: Variable selection (1)

> anova(drug.glm, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit
Response: DFREE
Terms added sequentially (first to last)

             Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL
AGE
BECK
NDRUGTX
RACE
TREAT
SITE
factor(IVHX)

Slide 25: Variable selection (2)

Step:  AIC=
DFREE ~ NDRUGTX + IVHX + AGE + TREAT

Call: glm(formula = DFREE ~ NDRUGTX + IVHX + AGE + TREAT,
          family = binomial, data = drug.df)

Degrees of Freedom: 574 Total (i.e. Null); 569 Residual
Null Deviance:
Residual Deviance:      AIC: 632.6

Slide 26: Sub-model

> sub.glm <- glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT,
                 family = binomial, data = drug.df)
> summary(sub.glm)

Call:
glm(formula = DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT,
    family = binomial, data = drug.df)

Deviance Residuals:
    Min      1Q  Median      3Q     Max

Slide 27: Sub-model (ii)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)                                   e-05 ***
NDRUGTX                                            *
factor(IVHX)2                                      *
factor(IVHX)3                                    ***
AGE                                               **
TREAT                                              *

(Dispersion parameter for binomial family taken to be 1)
Null deviance:      on 574 degrees of freedom
Residual deviance:  on 569 degrees of freedom
AIC:
Number of Fisher Scoring iterations: 4

All variables significant, but use caution.

Slide 28: Do we need interaction terms?

> sub.glm <- glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT,
                 family = binomial, data = drug.df)
> # model with interactions
> sub2.glm <- glm(DFREE ~ NDRUGTX*IVHX + AGE*IVHX + AGE*TREAT + NDRUGTX*TREAT,
                  family = binomial, data = drug.df)
> anova(sub.glm, sub2.glm, test = "Chisq")
Analysis of Deviance Table

Model 1: DFREE ~ NDRUGTX + IVHX + AGE + TREAT
Model 2: DFREE ~ NDRUGTX * IVHX + AGE * IVHX + AGE * TREAT + NDRUGTX * TREAT
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)

Big p-value, so interactions are not required.

Slide 29: Do we need to transform?

library(mgcv)   # for gam(); assuming the mgcv implementation of gam
par(mfrow = c(1, 2))
sub.gam <- gam(DFREE ~ s(NDRUGTX) + factor(IVHX) + s(AGE) + TREAT,
               family = binomial, data = drug.df)
plot(sub.gam)

Slide 30: Transforming

The gam plots suggest a possible quadratic in NDRUGTX:

> subq.glm <- glm(DFREE ~ poly(NDRUGTX, 2) + IVHX + AGE + TREAT,
                  family = binomial, data = drug.df)
> summary(subq.glm)
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)                                       e-06 ***
poly(NDRUGTX, 2)1                                      *
poly(NDRUGTX, 2)2
IVHX2                                                  *
IVHX3                                                 **
AGE                                                   **
TREAT                                                  *

But the quadratic term is not significant, so we stick with no transformation.

Slide 31: Diagnostics

[Figure: diagnostic plots, with points 7, 471 and 85 flagged]

Slide 32: Influence of points 7, 471 and 85

Effect on the coefficients of removing these cases:

              None   All
(Intercept)
NDRUGTX
IVHX2
IVHX3
AGE
TREAT

None seem particularly influential! We will not delete them.

Slide 33: Over-dispersion

> qsub.glm <- glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT,
                  family = quasibinomial, data = drug.df)
> summary(qsub.glm)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 e-05 ***
NDRUGTX                                          *
IVHX2                                            *
IVHX3                                           **
AGE                                             **
TREAT                                            *
---
(Dispersion parameter for quasibinomial family taken to be )

The estimated dispersion parameter is very close to 1, so there is no evidence of overdispersion.

Slide 34: Interpretation

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)                                   e-05 ***
NDRUGTX                                            *
IVHX2                                              *
IVHX3                                            ***
AGE                                               **
TREAT                                              *

- As the number of prior treatments goes up, the probability of a drug-free recovery goes down.
- The probability of a drug-free recovery is higher for persons with no IV drug use than for persons with previous IV drug use.
- The probability of a drug-free recovery is higher for persons with previous IV drug use than for persons with recent IV drug use.
- The probability of a drug-free recovery goes up with age.
- The probability of a drug-free recovery is higher for the long treatment.
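These statements can be made quantitative by converting the coefficients to odds ratios (a sketch, using the sub-model fitted earlier):

# multiplicative effect of each variable on the odds of a drug-free recovery
round(exp(coef(sub.glm)), 2)
# e.g. the entry for TREAT is the odds ratio for the long versus the short treatment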

Slide 35: Interpreting p-values after model selection
- We have seen that this is not valid, as model selection changes the distribution of the estimated coefficients.
- We can use the bootstrap to examine the revised distribution.
- Leave TREAT in the model, and use forward selection to select the other variables.

Slide 36: Procedure
- Draw a bootstrap sample
- Do forward selection, and record the value of the regression coefficient for TREAT (which is forced to be in every model)
- Repeat 200 times, and draw a histogram of the results

Slide 37: R code

n <- dim(drug.df)[1]
B <- 200
beta.boot <- numeric(B)
for (b in 1:B) {
  # draw a bootstrap sample of the rows of drug.df
  ni <- rmultinom(1, n, prob = rep(1/n, n))
  newdata <- drug.df[rep(1:n, ni), ]
  # start from the model containing TREAT only...
  drug.boot.glm <- glm(DFREE ~ TREAT, family = binomial, data = newdata)
  # ...and select forward up to the full model, keeping TREAT throughout
  chosen <- step(drug.boot.glm,
                 scope = list(lower = DFREE ~ TREAT,
                              upper = DFREE ~ NDRUGTX + factor(IVHX) + AGE +
                                              BECK + TREAT + RACE + SITE),
                 direction = "forward", trace = 0)
  # record the coefficient of TREAT in the chosen model
  k <- match("TREAT", names(coef(chosen)))
  beta.boot[b] <- summary(chosen)$coefficients[k, 1]
}

Slide 38: Histogram

[Histogram of the 200 bootstrap coefficients for TREAT]

> mean(beta.boot)
[1]
> sd(beta.boot)
[1]
> z.val <- mean(beta.boot)/sd(beta.boot)
> 2*(1 - pnorm(z.val))
[1]

Compare: Beta =    SE =    P-value =

Slide 39: Prediction
- Sensitivity: the chance the model predicts a successful recovery (drug-free at the end of the program) when one will actually occur.
- Specificity: the chance the model predicts a failure (a return to drug use before the end of the program) when one will actually occur.

Slide 40: R code

> sub.glm <- glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT,
                 family = binomial, data = drug.df)
> pred <- predict(sub.glm, type = "response")
> predcode <- ifelse(pred < 0.5, 0, 1)
> table(drug.df$DFREE, predcode)
        predicted
actual     0    1
     0   426    2
     1   144    3

Sensitivity = 3/147 = 0.020
Specificity = 426/428 = 0.995
Error rate = 146/575 = 0.254
Proportion correctly classified = 429/575 = 0.746

Slide 41: ROC curve

ROC.curve(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT,
          data = drug.df)   # in the R330 package

[Figure: ROC curve for the sub-model]

Slide 42: Prediction (2)

Use 10-fold cross-validation:
- Split the data into 10 parts
- Calculate the sensitivity and specificity for each part, using the model fitted to the remaining parts
- Average the results
- Repeat for different splits, and average over the repeats
A sketch of one repeat is shown below.
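This sketch carries out one repeat by hand (the cross.val function on the next slide automates the whole scheme; the seed is arbitrary):

set.seed(330)
n <- nrow(drug.df)
fold <- sample(rep(1:10, length = n))   # random split into 10 parts
sens <- spec <- numeric(10)
for (k in 1:10) {
  # fit to the other 9 parts, predict the held-out part
  fit.k <- glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT,
               family = binomial, data = drug.df[fold != k, ])
  test <- drug.df[fold == k, ]
  pred <- ifelse(predict(fit.k, newdata = test, type = "response") >= 0.5, 1, 0)
  sens[k] <- mean(pred[test$DFREE == 1] == 1)   # sensitivity on the held-out part
  spec[k] <- mean(pred[test$DFREE == 0] == 0)   # specificity on the held-out part
}
mean(sens); mean(spec)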

Slide 43: CV and bootstrap results

> cross.val(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, drug.df)
Mean Specificity =
Mean Sensitivity =
Mean Correctly classified =

> err.boot(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data = drug.df)
$err    # training-set estimate
[1]
$Err    # bootstrap estimate
[1]

A poor classifier, but this doesn't mean that the model fits poorly: there are very few cases with fitted probabilities over 0.5, and many with fitted probabilities between 0.2 and 0.5. We expect a moderate number of these to be misclassified, since some events (being drug-free) with probabilities of 0.2 to 0.5 have in fact occurred.

Slide 44: Overall conclusions
- The model seems to fit well.
- There is strong evidence that longer treatments are better.
- There is no apparent difference between sites.
- Age and prior IV drug use affect recovery.
- The model predicts poorly for the covariates in the data set: it effectively always predicts that patients will not be drug-free.