
© Department of Statistics 2012 STATS 330 Lecture 22: Slide 1 Stats 330: Lecture 22

Slide 2: Plan of the day
In today's lecture we continue our discussion of the logistic regression model. Topics covered:
– Multiple logistic regression
– Grouped data in multiple logistic regression
– Deviances
– Models and submodels
Reference: Coursebook, sections 5.2.3 and 5.2.4

Slide 3: Multiple logistic regression
The logistic regression model is easily extended to handle more than one explanatory variable. For k explanatory variables x_1, …, x_k and binary response Y, the model is

P(Y = 1) = exp(β_0 + β_1 x_1 + … + β_k x_k) / (1 + exp(β_0 + β_1 x_1 + … + β_k x_k))
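As a quick numeric check of this formula, here is a minimal sketch (in Python rather than the lecture's R, with hypothetical coefficients β_0 = −1, β_1 = 0.5, β_2 = 2 that are not taken from any fitted model):

```python
import math

def logistic_prob(beta0, betas, xs):
    # P(Y = 1) = exp(eta) / (1 + exp(eta)), where eta is the linear predictor
    eta = beta0 + sum(b * x for b, x in zip(betas, xs))
    return math.exp(eta) / (1 + math.exp(eta))

# With beta0 = -1, beta1 = 0.5, beta2 = 2 and x = (2, 0): eta = 0, so p = 0.5
p = logistic_prob(-1.0, [0.5, 2.0], [2.0, 0.0])
```

Whatever the coefficients, the result always lies strictly between 0 and 1, which is the point of the logistic form.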

Slide 4: Odds and log-odds form
In odds form the model is

odds = P(Y = 1) / P(Y = 0) = exp(β_0 + β_1 x_1 + … + β_k x_k),

and taking logs gives the log-odds (logit) form

log[ P(Y = 1) / P(Y = 0) ] = β_0 + β_1 x_1 + … + β_k x_k

Slide 5: Interpretation of coefficients
As before, a unit increase in x_j multiplies the odds by exp(β_j), and adds β_j to the log-odds.

Slide 6: Grouped and ungrouped data in multiple LR
To group two individuals in multiple LR, the individuals must have the same values for all the covariates. Each distinct set of covariate values is called a covariate pattern. If there are m distinct covariate patterns, we record for each pattern the number of individuals having that pattern (n) and the number of "successes" (r).
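The grouping step itself is simple bookkeeping. A sketch (in Python for illustration, on made-up toy records, not the lecture data):

```python
from collections import defaultdict

# Ungrouped records: (x1, x2, y) with binary response y.
# Toy data: two distinct covariate patterns.
records = [
    (1, 0, 1), (1, 0, 0), (1, 0, 1),   # pattern (1, 0): n = 3, r = 2
    (2, 1, 0), (2, 1, 1),              # pattern (2, 1): n = 2, r = 1
]

counts = defaultdict(lambda: [0, 0])   # covariate pattern -> [n, r]
for *xs, y in records:
    key = tuple(xs)
    counts[key][0] += 1                # n: individuals with this pattern
    counts[key][1] += y                # r: "successes" among them

grouped = {k: tuple(v) for k, v in counts.items()}
```

Here m = 2 covariate patterns, each carrying its (n, r) pair.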

Slide 7: Log-likelihood
For grouped data, the log-likelihood is

l(β) = Σ_{i=1}^{m} [ r_i log π_i + (n_i − r_i) log(1 − π_i) ] + constant,

where (x_{i1}, …, x_{ik}) is the i-th covariate pattern and logit(π_i) = β_0 + β_1 x_{i1} + … + β_k x_{ik}. The β's are chosen to maximise this expression, using IRLS (iteratively reweighted least squares).

Slide 8: Log-likelihood for ungrouped data
For ungrouped data, the log-likelihood is

l(β) = Σ_{i=1}^{N} [ y_i log π_i + (1 − y_i) log(1 − π_i) ],

where y_i is the 0/1 response for individual i. Again, the β's are chosen to maximise this expression. The two forms give equivalent results: the same estimated β's.
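The equivalence is easy to verify numerically: for individuals sharing a covariate pattern, the ungrouped terms sum to the grouped kernel r log π + (n − r) log(1 − π); the grouped form differs only by the log-binomial-coefficient constant, which does not involve the β's. A sketch (in Python for illustration, with made-up probabilities):

```python
import math

def ungrouped_loglik(ys, ps):
    # sum over individuals of y*log(p) + (1 - y)*log(1 - p)
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for y, p in zip(ys, ps))

def grouped_loglik(groups):
    # sum over covariate patterns of r*log(p) + (n - r)*log(1 - p);
    # the constant log-binomial-coefficient term is omitted
    return sum(r * math.log(p) + (n - r) * math.log(1 - p)
               for n, r, p in groups)

# Made-up example: pattern A has n = 3, r = 2, p = 0.3;
# pattern B has n = 2, r = 1, p = 0.6
ys = [1, 1, 0, 1, 0]
ps = [0.3, 0.3, 0.3, 0.6, 0.6]
groups = [(3, 2, 0.3), (2, 1, 0.6)]
```

Both functions return the same value on these inputs, so maximising either over the β's gives the same fit.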

Slide 9: Example: kyphosis risk factors
Kyphosis is a curvature of the spine that may be the result of spinal surgery. A study was conducted to determine risk factors for this condition. The variables are:
– Kyphosis: binary factor (absent = no kyphosis, present = kyphosis)
– Age: continuous, age in months
– Start: continuous, vertebra level of the surgery
– Number: continuous, number of vertebrae involved

Slide 10: Data

   Kyphosis Age Number Start
1    absent  71      3     5
2    absent 158      3    14
3   present 128      4     5
4    absent   2      5     1
5    absent   1      4    15
6    absent   1      2    16
7    absent  61      2    17
8    absent  37      3    16
9    absent 113      2    16
10  present  59      6    12
…
81 cases in all

Slide 11: Caution
In this data set Kyphosis is not a binary variable with values 0 and 1, but rather a factor with 2 levels, "absent" and "present":

> levels(kyphosis.df$Kyphosis)
[1] "absent"  "present"

NB: if we fit a regression with Kyphosis as the response, we are modelling the probability that Kyphosis is "present". In general, R takes the first level of the factor to mean "failure" (in this case "absent", i.e. Y = 0) and combines all the other levels into "success" (in this case "present", i.e. Y = 1).

© Department of Statistics 2012 STATS 330 Lecture 22: Slide 12 plot(kyphosis.df$age,kyphosis.df$Kyphosis)

© Department of Statistics 2012 STATS 330 Lecture 22: Slide 13 plot(gam(Kyphosis~s(Age) + Number + Start, family=binomial, data=kyphosis.df))

Slide 14: Fitting (i)
Age seems important; fit it as a quadratic:

> kyphosis.glm <- glm(Kyphosis ~ Age + I(Age^2) + Start + Number, family=binomial, data=kyphosis.df)
> summary(kyphosis.glm)

Slide 15: Fitting (ii)
Call:
glm(formula = Kyphosis ~ Age + I(Age^2) + Start + Number, family = binomial, data = kyphosis.df)

Deviance Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                    *
Age                                            *
I(Age^2)                                       *
Start                                         **
Number

(Dispersion parameter for binomial family taken to be 1)

Null deviance: … on 80 degrees of freedom
Residual deviance: … on 76 degrees of freedom
AIC: …

Slide 16: Points arising
– Start and Age are clearly significant, and Age is needed as a quadratic.
– What is the deviance? How do we judge goodness of fit? Is there an analogue of R²?
– What is a dispersion parameter? What is Fisher scoring?
To answer these, we first need to explain the deviance.

Slide 17: Deviance
Recall that our model had two parts:
– the binomial assumption: r_i is Bin(n_i, π_i);
– the logistic assumption: the logit of π is linear in the covariates.
If we only assume the first part, we have the most general model possible, since we put no restriction on the probabilities. The likelihood L is then a function of the π's, one for each covariate pattern:

L(π_1, …, π_m) ∝ Π_{i=1}^{m} π_i^{r_i} (1 − π_i)^{n_i − r_i}

Slide 18: Deviance (cont.)
Ignoring terms not depending on the π's, the log-likelihood is

l(π_1, …, π_m) = Σ_{i=1}^{m} [ r_i log π_i + (n_i − r_i) log(1 − π_i) ]

The maximum value of this log-likelihood occurs when π_i = r_i / n_i (if r_i = 0 or r_i = n_i, use the convention 0 log 0 = 0). Call this maximum value L_max.

Slide 19: Deviance (cont.)
L_max represents the biggest possible value of the likelihood, for the most general model. Now consider the logistic model, where the form of the probabilities is specified by the logistic function. Let L_mod be the maximum value of the likelihood for this model. The deviance for the logistic model is defined as

Deviance = 2 (log L_max − log L_mod)
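This definition can be sketched directly from the two maximised log-likelihoods (in Python for illustration, with made-up grouped counts and fitted probabilities, not the lecture data):

```python
import math

def saturated_loglik(groups):
    # log L_max: each pattern's probability estimated by r/n,
    # with the convention 0*log(0) = 0
    total = 0.0
    for n, r in groups:
        p = r / n
        if p > 0:
            total += r * math.log(p)
        if p < 1:
            total += (n - r) * math.log(1 - p)
    return total

def model_loglik(groups, probs):
    # log L_mod for given fitted probabilities
    return sum(r * math.log(p) + (n - r) * math.log(1 - p)
               for (n, r), p in zip(groups, probs))

# Made-up grouped data (n, r) and hypothetical fitted probabilities
groups = [(10, 3), (10, 7)]
fitted = [0.35, 0.65]
deviance = 2 * (saturated_loglik(groups) - model_loglik(groups, fitted))
```

If the fitted probabilities equal the saturated estimates r/n exactly, the deviance is zero; any other fit gives a positive deviance.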

Slide 20: Deviance (cont.)
Intuitively, the better the logistic model fits, the closer L_mod is to L_max, and the smaller the deviance should be. How small is small? If m is small and the n_i's are large, then when the logistic model is true, the deviance has approximately a chi-squared distribution with m − k − 1 degrees of freedom, where
– m = number of covariate patterns
– k = number of covariates

Slide 21: Deviance (cont.)
Thus, if the deviance is less than the upper 95% point of the appropriate chi-squared distribution, the logistic model fits well. In this sense, the deviance is an analogue of R². NB: this only applies to grouped data, when m is small and the n_i's are large. Other names for the deviance: model deviance, residual deviance (R).

Slide 22: Null deviance
At the other extreme, the most restrictive model is one where all the probabilities π_i are the same (i.e. they don't depend on the covariates). The deviance for this model is called the null deviance. Intuitively, if none of the covariates is related to the binary response, the model deviance won't be much smaller than the null deviance.

Slide 23: Graphical interpretation
On the 2 × log-likelihood scale, 2 log L_NULL ≤ 2 log L_MOD ≤ 2 log L_MAX:
– Residual deviance = 2 log L_MAX − 2 log L_MOD
– Null deviance = 2 log L_MAX − 2 log L_NULL

Slide 24: Example: budworm data
Batches of 20 moths were subjected to increasing doses of a poison; "success" = death. The data are grouped: for each of 6 doses (1.0, 2.0, 4.0, 8.0, 16.0, 32.0 mg) and each sex (male and female), we have 20 moths, giving m = 12 covariate patterns.

Slide 25: Example: budworm data

sex dose  r  n
  0  1.0  1 20
  0  2.0  4 20
  0  4.0  9 20
  0  8.0 13 20
  0 16.0 18 20
  0 32.0 20 20
  1  1.0  0 20
  1  2.0  2 20
  1  4.0  6 20
  1  8.0 10 20
  1 16.0 12 20
  1 32.0 16 20

Sex: 0 = male, 1 = female

Slide 26: 3 models
– Null model: the probabilities π_i are constant, all equal to π say. The estimate of this common value is total deaths / total moths = sum(r)/sum(n) = 111/240 = 0.4625.
– Logistic model: probabilities estimated using the fitted logistic model.
– Maximal model: probabilities estimated by r_i / n_i.
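The null and maximal estimates need no model fitting at all. A sketch (in Python rather than R, assuming the classic budworm counts, which sum to the 111 deaths out of 240 quoted above):

```python
# Assumed budworm counts: deaths r out of n = 20 moths at doses
# 1, 2, 4, 8, 16, 32 mg, males first, then females.
r = [1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16]
n = [20] * 12

# Null model: one common probability, estimated by total deaths / total moths
null_prob = sum(r) / sum(n)

# Maximal model: each covariate pattern gets its own estimate r_i / n_i
max_probs = [ri / ni for ri, ni in zip(r, n)]
```

The logistic-model probabilities sit between these two extremes and come from `predict(budworm.glm, type="response")` as on the next slide.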

Slide 27: Probabilities under the 3 models
> max.mod.probs <- budworm.df$r/budworm.df$n
> budworm.glm <- glm(cbind(r, n-r) ~ sex + dose, family=binomial, data=budworm.df)
> logist.mod.probs <- predict(budworm.glm, type="response")
> null.mod.probs <- sum(budworm.df$r)/sum(budworm.df$n)
> cbind(max.mod.probs, logist.mod.probs, null.mod.probs)
   max.mod.probs logist.mod.probs null.mod.probs
   …

Slide 28: Calculating logits
> max.logit <- log((budworm.df$r + 0.5)/(budworm.df$n - budworm.df$r + 0.5))
> model.logit <- predict(budworm.glm)
> cbind(max.logit, model.logit)
If the model is true, the empirical logits (max.logit) should be linear in dose and close to the fitted logits (model.logit).
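The 0.5 added to each count is what keeps the empirical logit finite at the extremes. A sketch of this adjustment (in Python for illustration):

```python
import math

def empirical_logit(r, n):
    # add 0.5 to each count so the logit stays finite when r = 0 or r = n
    return math.log((r + 0.5) / (n - r + 0.5))

# r = 20 deaths out of n = 20: the plain logit log(r/(n - r)) would be infinite,
# but the adjusted version gives log(20.5/0.5) = log(41)
logit_all_dead = empirical_logit(20, 20)
```

At r = n/2 the adjustment cancels and the empirical logit is exactly 0, as it should be.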

Slide 29: Plotting logits
Logit = log(prob/(1 − prob)); the maximal-model logit is log((r/n)/(1 − r/n)) = log(r/(n − r)).
[Figure: empirical logits plotted against dose with the fitted lines. The points are not linear in dose: a poor fit!]

Slide 30: Calculating the likelihoods
L_MAX = … × 10^-7,  2 log L_MAX = …
L_MOD = …,          2 log L_MOD = …
L_NULL = …,         2 log L_NULL = …

Slide 31: Calculating the deviances
> summary(budworm.glm)
Call:
glm(formula = cbind(r, n - r) ~ sex + dose, family = binomial, data = budworm.df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                              …e-06 ***
sex                                      …     **
dose                                     …e-12 ***

(Dispersion parameter for binomial family taken to be 1)

Null deviance: … on 11 degrees of freedom
Residual deviance: 27.968 on 9 degrees of freedom
AIC: …

Residual deviance = 2 log L_MAX − 2 log L_MOD = 27.968
Null deviance = 2 log L_MAX − 2 log L_NULL = …

Slide 32: Goodness of fit
The n_i's are reasonably large and m is small, so we can interpret the residual deviance as a measure of fit:
> 1 - pchisq(27.968, 9)
[1] …
The p-value is very small: not a good fit! (As we suspected from the plot.) In fact, log(dose) works better.

© Department of Statistics 2012 STATS 330 Lecture 22: Slide 33 > logdose.glm<-glm( cbind(r, n-r) ~ sex + log(dose), family=binomial, data = budworm.df) > summary(logdose.glm) glm(formula = cbind(r, n - r) ~ sex + log(dose), family = binomial, data = budworm.df) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) e-10 *** sex ** log(dose) e-16 *** Null deviance: on 11 degrees of freedom Residual deviance: on 9 degrees of freedom AIC: > 1-pchisq( 6.757,9) [1] > Big reduction in deviance, was P-value now large Improvement!