© Department of Statistics 2012 STATS 330 Lecture 24: Slide 1 Stats 330: Lecture 24

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 2
Plan of the day
In today’s lecture we continue our discussion of logistic regression.
Topics:
– Binary ANOVA
– Variable selection
– Under/over dispersion
Coursebook reference: Section 5.2.6

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 3
Binary ANOVA
– Categorical variables are included in logistic regressions in just the same way as in linear regression: by means of “dummy variables”.
– Interpretation is similar, but in terms of log-odds rather than means.
– A model which fits a separate probability to every possible combination of factor levels is a maximal model, with zero deviance.
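To make the dummy-variable coding concrete, here is a minimal sketch (the factor levels match the plum data introduced on the next slide; the grid of combinations is constructed purely for illustration):

> # model.matrix() shows the dummy variables glm() actually fits.
> # With R's default treatment coding, the first level of each factor
> # (here "long" and "autumn") is the baseline absorbed into the intercept.
> combos <- expand.grid(length = c("long", "short"), time = c("autumn", "spring"))
> model.matrix(~ length * time, data = combos)
  (Intercept) lengthshort timespring lengthshort:timespring
1           1           0          0                      0
2           1           1          0                      0
3           1           0          1                      0
4           1           1          1                      1

Each coefficient in the fit on Slide 6 multiplies one of these columns, so the interaction coefficient is the extra log-odds adjustment that applies only to short cuttings planted in spring.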

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 4
Example
The plum tree data: see the coursebook. For another example, see Tutorial 8.
The data concern the survival of plum tree cuttings. There are two categorical explanatory variables, each at 2 levels: planting time (spring, autumn) and cutting length (long, short). For each of these 4 combinations, 240 cuttings were planted, and the number surviving recorded.

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 5
Data
  length   time  r   n
1   long autumn  …  240
2   long spring  …  240
3  short autumn  …  240
4  short spring  …  240

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 6
Fitting
> plum.glm <- glm(cbind(r, n-r) ~ length*time, family=binomial, data=plum.df)
> summary(plum.glm)

Call:
glm(formula = cbind(r, n - r) ~ length * time, family = binomial, data = plum.df)

Deviance Residuals:
[1] 0 0 0 0

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)                   …          …       …    …e-06 ***
lengthshort                   …          …       …    …e-06 ***
timespring                    …          …       …    …e-11 ***
lengthshort:timespring        …          …       …        …

Null deviance: …e+02 on 3 degrees of freedom
Residual deviance: …e-14 on 0 degrees of freedom
AIC: …

Zero residuals! Zero deviance and df!

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 7
Points to note
– The model length*time fits a separate probability to each of the 4 covariate patterns.
– Thus it is fitting the maximal model, which has zero deviance by definition.
– This causes all the deviance residuals to be zero.
– The fitted probabilities are just the ratios r/n, as the sketch below verifies.
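This is easy to check in R; a minimal sketch using the plum.glm fit from Slide 6:

> # For the maximal model the fitted values reproduce the data exactly:
> # the fitted probabilities equal the raw proportions r/n, and the fitted
> # logits (predict() on the default link scale) equal the observed logits.
> cbind(fitted = fitted(plum.glm), raw = with(plum.df, r/n))
> cbind(fit.logit = predict(plum.glm), obs.logit = with(plum.df, log(r/(n - r))))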

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 8
Fitted logits
                Length = long                Length = short
Time = autumn   (Intercept)                  (Intercept) + lengthshort
Time = spring   (Intercept) + timespring     (Intercept) + lengthshort + timespring + lengthshort:timespring

> predict(plum.glm)
[1] …

Coefficients:
                       Estimate
(Intercept)                   …
lengthshort                   …
timespring                    …
lengthshort:timespring        …

Logit = log((r/n)/(1 - r/n)) = log(r/(n-r))

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 9
Interaction plot
> attach(plum.df)
> interaction.plot(length, time, log(r/(n-r)))
Lines almost parallel, indicating no interaction on the log-odds scale.

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 10
Anova
> anova(plum.glm, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit
Response: cbind(r, n - r)
Terms added sequentially (first to last)

            Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                            3          …
length       1        …         2          …     …e-11
time         1        …         1     2.2938     …e-24
length:time  1   2.2938         0     0.0000    0.1299

> 1 - pchisq(2.294, 1)
[1] 0.1299
Interaction not significant

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 11
Final model: interpretation and fitted probabilities
> plum2.glm <- glm(cbind(r, n-r) ~ length + time, family=binomial, data=plum.df)
> summary(plum2.glm)

Call:
glm(formula = cbind(r, n - r) ~ length + time, family = binomial, data = plum.df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)        …          …       …    …e-09 ***
lengthshort        …          …       …    …e-12 ***
timespring         …          …       …  < 2e-16 ***

Null deviance: … on 3 degrees of freedom
Residual deviance: 2.2938 on 1 degrees of freedom
AIC: …

> 1 - pchisq(2.2938, 1)
[1] 0.1299

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 12
Final model: interpretation and fitted probabilities
            Estimate Std. Error z value Pr(>|z|)
(Intercept)        …          …       …    …e-09 ***
lengthshort        …          …       …    …e-12 ***
timespring         …          …       …  < 2e-16 ***

Prob of survival is less for short cuttings (coeff < 0).
Prob of survival is less for spring planting (coeff < 0).

Null deviance: … on 3 degrees of freedom
Residual deviance: 2.2938 on 1 degrees of freedom
AIC: …

Deviance of 2.2938 on 1 df: the p-value is 0.1299, so there is no evidence of lack of fit: the no-interaction model fits well.

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 13
Fitted Probabilities
                Length = long   Length = short
Time = autumn         …               …
Time = spring         …               …

> predict(plum2.glm, type="response")
[1] …

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 14
Variable selection
– Variable selection proceeds as in ordinary regression: use anova and stepwise methods.
– AIC is also defined for logistic regression:
  AIC = Deviance + 2 × (number of parameters)
– Pick the model with the smallest AIC.
– Calculate it with the AIC function, as sketched below.
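One caveat when computing this by hand (a minimal sketch, using plum2.glm from Slide 11): R’s AIC() is −2 × log-likelihood + 2 × (number of parameters), which differs from deviance + 2 × (number of parameters) by a constant that is the same for every model fitted to the same data, so both versions rank models identically.

> # Two ways to compare models by AIC:
> AIC(plum2.glm)                                    # R's definition
> deviance(plum2.glm) + 2 * length(coef(plum2.glm)) # the slide's definition
> # The two differ only by a model-independent constant (the saturated-model
> # log-likelihood terms), so either one picks out the same best model.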

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 15
Example: lizard data
Site preferences of 2 species of lizard, grahami and opalinus.
We want to investigate the effect of
– perch height
– perch diameter
– time of day
on the probability that a lizard caught at a site will be grahami.

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 16
Data
   length height  time  r  n
1   short    low early  …  …
2   short   high early  …  …
3    long    low early  …  …
4    long   high early  …  …
5   short    low   mid  …  …
6   short   high   mid  …  …
7    long    low   mid  …  …
8    long   high   mid  …  …
9   short    low  late  …  …
10  short   high  late  …  …
11   long    low  late  …  …
12   long   high  late  5 10

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 17
Eyeball analysis
> plot.design(lizard.df, y = log(lizard.df$r/(lizard.df$n - lizard.df$r)),
+             ylab = "mean of logits")
Proportion of grahami lizards is higher when perches are short and high, and in the earlier part of the day.

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 18
Model selection
The full model is cbind(r,n-r) ~ time*length*height, so fit this first. Then use anova and stepwise methods to select a simpler model if appropriate.

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 19
anova
> lizard.glm <- glm(cbind(r,n-r) ~ time*length*height,
+                   family=binomial, data=lizard.df)
> anova(lizard.glm, test="Chisq")
                   Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                                  11          …
time                2        …         9          …         …
length              1        …         8          …     …e-05
height              1        …         7          …     …e-04
time:length         2        …         5          …         …
time:height         2        …         3          …         …
length:height       1        …         2          …         …
time:length:height  2        …         0     0.0000         …

Suggests model cbind(r,n-r) ~ time + length + height

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 20
stepwise
> null.model <- glm(cbind(r,n-r) ~ 1, family=binomial, data=lizard.df)
> step(null.model, formula(lizard.glm), direction="both")

Call: glm(formula = cbind(r, n - r) ~ height + time + length,
      family = binomial, data = lizard.df)

Coefficients:
(Intercept)   heightlow   timelate   timemid   lengthshort
          …           …          …         …             …

Degrees of Freedom: 11 Total (i.e. Null); 7 Residual
Null Deviance: …
Residual Deviance: …    AIC: 64.09

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 21
Summary
> model2 <- glm(cbind(r,n-r) ~ time + length + height, family=binomial, data=lizard.df)
> summary(model2)

Call:
glm(formula = cbind(r, n - r) ~ time + length + height, family = binomial, data = lizard.df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)        …          …       …    …e-07 ***
timelate           …          …       …        …  ***
timemid            …          …       …        …
lengthshort        …          …       …        …  **
heightlow          …          …       …        …  ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: … on 11 degrees of freedom
Residual deviance: … on 7 degrees of freedom

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 22
AIC
> AIC(glm(cbind(r,n-r) ~ time*length*height,
+         family=binomial, data=lizard.df))
[1] …
> AIC(glm(cbind(r,n-r) ~ time*length*height - time:length:height,
+         family=binomial, data=lizard.df))
[1] …
> AIC(glm(cbind(r,n-r) ~ time + length + height + time:length + time:height,
+         family=binomial, data=lizard.df))
[1] …
> AIC(glm(cbind(r,n-r) ~ time + length + height,
+         family=binomial, data=lizard.df))
[1] 64.09

Suggests cbind(r,n-r) ~ time + length + height

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 23
Diagnostics
> par(mfrow=c(2,2))
> plot(model2, which=1:4)
No major problems.

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 24
Conclusions
– Strong suggestion that grahami are relatively more numerous in mornings/midday.
– Strong suggestion that grahami are relatively more numerous on short perches.
– Strong suggestion that grahami are relatively more numerous on high perches.

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 25
Over/under dispersion
– The variance of the binomial B(n,p) distribution is np(1-p), which is always less than the mean np.
– Sometimes the individuals having the same covariate pattern in a logistic regression may be correlated. This will result in the variance being greater than np(1-p) (if the correlation is positive) or less than np(1-p) (if the correlation is negative).

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 26
Over/under-dispersion
– If this is the case, we say the data are over-dispersed (if the variance is greater) or under-dispersed (if the variance is less).
– Consequence: standard errors will be wrong.
– Quick and dirty remedy: analyse as a binomial, but allow the “scale factor” to be arbitrary. This models the variance as φ np(1-p), where φ is the “scale factor” (for the binomial, the scale factor is always 1). A way to estimate φ is sketched below.
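A common estimate of φ is the Pearson chi-square statistic divided by the residual degrees of freedom; this is essentially what family = quasibinomial uses for its standard errors. A minimal sketch, assuming the selected lizard model has been saved as model2 (as on Slide 21):

> # Estimate the scale factor phi from the Pearson residuals:
> # values well above 1 suggest over-dispersion, well below 1 under-dispersion.
> phi <- sum(residuals(model2, type = "pearson")^2) / df.residual(model2)
> phi
> # Under the quasibinomial fit the coefficient estimates are unchanged and
> # the standard errors are the binomial ones multiplied by sqrt(phi);
> # compare the two summaries on Slide 28.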

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 27
Over-dispersed model
> model3 <- glm(cbind(r,n-r) ~ time + length + height,
+               family=quasibinomial, data=lizard.df)
> summary(model3)

Call:
glm(formula = cbind(r, n - r) ~ time + length + height, family = quasibinomial, data = lizard.df)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        …          …       …        …  **
timelate           …          …       …        …  *
timemid            …          …       …        …
lengthshort        …          …       …        …  *
heightlow          …          …       …        …  *
---
(Dispersion parameter for quasibinomial family taken to be …)

Null deviance: … on 11 degrees of freedom
Residual deviance: … on 7 degrees of freedom

© Department of Statistics 2012 STATS 330 Lecture 24: Slide 28
Comparison
Quasibinomial
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        …          …       …        …  **
timelate           …          …       …        …  *
timemid            …          …       …        …
lengthshort        …          …       …        …  *
heightlow          …          …       …        …  *

Binomial
            Estimate Std. Error z value Pr(>|z|)
(Intercept)        …          …       …    …e-07 ***
timelate           …          …       …        …  ***
timemid            …          …       …        …
lengthshort        …          …       …        …  **
heightlow          …          …       …        …  ***

The estimates are identical, but the quasibinomial standard errors are larger, so the evidence for each coefficient is weaker than the binomial fit suggests.