Logistic Regression Jeff Witmer 30 March 2016

Categorical Response Variables
Examples:
  Whether or not a person smokes (binary response)
  Success of a medical treatment (binary response)
  Opinion poll responses (ordinal response)

Example: Height predicts Gender
Y = Gender (0 = Male, 1 = Female)
X = Height (inches)
Try an ordinary linear regression:

> regmodel = lm(Gender ~ Hgt, data=Pulse)
> summary(regmodel)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     ...       ...     ...    <2e-16 ***
Hgt             ...       ...     ...    <2e-16 ***

Ordinary linear regression is used a lot, and is taught in every intro stat class. Logistic regression is rarely taught or even mentioned in intro stats, mostly because of inertia; we now have the computing power and software to implement it.

π = Proportion of “Success”

In ordinary regression the model predicts the mean Y for any combination of predictors. What’s the “mean” of a 0/1 indicator variable? It is the proportion of 1’s.

Goal of logistic regression: predict the “true” proportion of success, π, at any value of the predictor.

Binary Logistic Regression Model

Y = binary response, X = quantitative predictor
π = proportion of 1’s (yes, success) at any X

Equivalent forms of the logistic regression model:
Logit form: log(π/(1−π)) = β0 + β1X
Probability form: π = e^(β0+β1X) / (1 + e^(β0+β1X))
N.B.: This is the natural log (aka “ln”).
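The two forms are easy to check numerically. A minimal R sketch (the helper names logit and invlogit are ours, not from the slides):

logit <- function(p) log(p / (1 - p))           # logit form: log odds
invlogit <- function(x) exp(x) / (1 + exp(x))   # probability form
invlogit(logit(0.25))                           # returns 0.25: the two forms invert each other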

Logit Function

Binary Logistic Regression via R

> logitmodel = glm(Gender ~ Hgt, family=binomial, data=Pulse)
> summary(logitmodel)
Call:
glm(formula = Gender ~ Hgt, family = binomial)
Deviance Residuals:
    Min    1Q   Median    3Q    Max
    ...   ...     ...    ...    ...
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     ...       ...     ...   ...e-14 ***
Hgt             ...       ...     ...   ...e-14 ***
---

The fitted value at each height estimates the proportion of females at that Hgt.

Call: glm(formula = Gender ~ Hgt, family = binomial, data = Pulse)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     ...       ...     ...   ...e-14 ***
Hgt             ...       ...     ...   ...e-14 ***
---
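Predicted proportions for new heights come from predict() with type = "response"; the height 70 below is just an arbitrary illustration:

predict(logitmodel, newdata = data.frame(Hgt = 70), type = "response")   # estimated proportion of females at 70 inches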

> plot(fitted(logitmodel)~Pulse$Hgt)

> with(Pulse, plot(Hgt, jitter(Gender, amount=0.05)))
> curve(exp(... + ...*x)/(1 + exp(... + ...*x)), add=TRUE)   # fitted intercept and slope go in the blanks
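Rather than typing the fitted coefficients by hand, you can pull them straight from the model object:

b <- coef(logitmodel)                                   # fitted intercept and slope
curve(exp(b[1] + b[2]*x) / (1 + exp(b[1] + b[2]*x)), add = TRUE)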

Example: Golf Putts

Length   3     4     5     6     7
Made     84    88    61    61    44
Missed   17    31    47    64    90
Total    101   119   108   125   134

Build a model to predict the proportion of putts made (success) based on length (in feet).
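A sketch of entering these counts and fitting the model directly from the table (the data frame mirrors the deck's later Putts2 data, minus the Trials column):

Putts2 <- data.frame(Length = 3:7,
                     Made   = c(84, 88, 61, 61, 44),
                     Missed = c(17, 31, 47, 64, 90))
putt.fit <- glm(cbind(Made, Missed) ~ Length, family = binomial, data = Putts2)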

Logistic Regression for Putting

Call: glm(formula = Made ~ Length, family = binomial, data = Putts1)
Deviance Residuals:
    Min    1Q   Median    3Q    Max
    ...   ...     ...    ...    ...
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     ...       ...     ...    <2e-16 ***
Length          ...       ...     ...    <2e-16 ***
---

Linear part of logistic fit

Probability Form of Putting Model

Odds

Definition: odds = π/(1−π) is the odds of “Yes.”

Fair die

Event      Prob   Odds
roll a 2   1/6    1/5  [or 1/5:1 or 1:5]
even #     1/2    1    [or 1:1]
X > 2      2/3    2    [or 2:1]

When x increases by 1, π might increase by .072 on one part of the curve and by .231 on another, but the odds always increase by the same factor, here 2.718 (= e, since the slope on the logit scale in this illustration is 1).

Odds

The logistic model assumes a linear relationship between the predictors and log(odds).
⇒ Logit form of the model: log(π/(1−π)) = β0 + β1X

Odds Ratio

A common way to compare two groups is to look at the ratio of their odds:
OR = odds1 / odds2 = [π1/(1−π1)] / [π2/(1−π2)]
Note: the odds ratio (OR) is similar to relative risk, RR = π1/π2. So when π is small, OR ≈ RR.
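A quick numeric check of the OR ≈ RR approximation, with made-up small probabilities:

p1 <- 0.02; p2 <- 0.01
p1 / p2                      # RR = 2
(p1/(1-p1)) / (p2/(1-p2))    # OR = 2.02, close to RR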

When X is replaced by X + 1, the odds e^(β0+β1X) are replaced by e^(β0+β1(X+1)). So the ratio of new to old odds is e^(β0+β1(X+1)) / e^(β0+β1X) = e^(β1).

Example: TMS for Migraines

Transcranial Magnetic Stimulation vs. Placebo

Pain Free?   TMS   Placebo
YES          39    22
NO           61    78
Total        100   100

The odds of getting relief are 2.27 times higher using TMS than placebo.
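The same odds ratio worked out in R from the counts in the table:

odds.tms     <- 39/61    # 0.639
odds.placebo <- 22/78    # 0.282
odds.tms / odds.placebo  # 2.27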

Logistic Regression for TMS data

> lmod = glm(cbind(Yes,No) ~ Group, family=binomial, data=TMS)
> summary(lmod)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     ...       ...     ...   ...e-07 ***
GroupTMS        ...       ...     ...      ...  **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: ... on 1 degrees of freedom
Residual deviance: ... on 0 degrees of freedom
AIC: ...
Note: e^(GroupTMS coefficient) = 2.27 = odds ratio

Chi-Square Test for 2-way table vs. Binary Logistic Regression

> datatable = rbind(c(39,22), c(61,78))
> datatable
     [,1] [,2]
[1,]   39   22
[2,]   61   78
> chisq.test(datatable, correct=FALSE)
Pearson's Chi-squared test
data: datatable
X-squared = ..., df = 1, p-value = ...

> lmod = glm(cbind(Yes,No) ~ Group, family=binomial, data=TMS)
> summary(lmod)
Call: glm(formula = cbind(Yes, No) ~ Group, family = binomial)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     ...       ...     ...   ...e-07 ***
GroupTMS        ...       ...     ...      ...  **

A Single Binary Predictor for a Binary Response

Response variable: Y = Success/Failure
Predictor variable: X = Group #1 / Group #2

Method #1: Binary logistic regression
Method #2: Z-test, compare two proportions
Method #3: Chi-square test for 2-way table

All three “tests” are essentially equivalent, but the logistic regression approach allows us to mix other categorical and quantitative predictors in the model.
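Method #2 in R with the TMS counts; the X-squared it reports is the square of the two-proportion z statistic and matches the chi-square test above:

prop.test(c(39, 22), c(100, 100), correct = FALSE)   # two-proportion test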

Putting Data

Odds using data from 6 feet = 61/64 = 0.953
Odds using data from 5 feet = 61/47 = 1.298
⇒ Odds ratio (6 ft to 5 ft) = 0.953/1.298 = 0.73
The odds of making a putt from 6 feet are 73% of the odds of making from 5 feet.

Golf Putts Data

Length   3       4       5       6       7
Made     84      88      61      61      44
Missed   17      31      47      64      90
Total    101     119     108     125     134
Odds     4.941   2.839   1.298   0.953   0.489

Golf Putts Data

Length   3       4       5       6       7
Made     84      88      61      61      44
Missed   17      31      47      64      90
Total    101     119     108     125     134
Odds     4.941   2.839   1.298   0.953   0.489
OR               0.575   0.457   0.734   0.513

Interpreting “Slope” using Odds Ratio

When we increase X by 1, the ratio of the new odds to the old odds is e^(β1).
⇒ i.e., the odds are multiplied by e^(β1).
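In R this is one line; PuttModel here is the putting fit shown later in the deck:

exp(coef(PuttModel)["Length"])   # estimated odds multiplier per extra foot of putt length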

Odds Ratios for Putts

From samples at each distance:
4 to 3 feet: 0.575   5 to 4 feet: 0.457   6 to 5 feet: 0.734   7 to 6 feet: 0.513

From the fitted logistic model, the estimated odds ratio is e^(b1), the same value for every one-foot increase. In a logistic model, the odds ratio is constant when changing the predictor by one.

Example: 2012 vs 2014 congressional elections
How does the % of the vote won by Obama relate to a Democrat winning a House seat? See the script elections 12, 14.R.

Example: 2012 vs 2014 congressional elections
How does the % of the vote won by Obama relate to a Democrat winning a House seat? In 2012 a Democrat had a decent chance even if Obama got only 50% of the vote in the district. In 2014 that was less true.

There is an easy way to graph logistic curves in R.
> library(TeachingDemos)
> with(elect, plot(Obama12, jitter(Dem12, amount=.05)))
> logitmod14 = glm(Dem14 ~ Obama12, family=binomial, data=elect)
> Predict.Plot(logitmod14, pred.var="Obama12", add=TRUE,
+   plot.args = list(lwd=3, col="black"))

R Logistic Output

> PuttModel = glm(Made ~ Length, family=binomial, data=Putts1)
> summary(PuttModel)
Call: glm(formula = Made ~ Length, family = binomial)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     ...       ...     ...    <2e-16 ***
Length          ...       ...     ...    <2e-16 ***
---
Null deviance: ... on 586 degrees of freedom
Residual deviance: ... on 585 degrees of freedom

> anova(PuttModel)
Analysis of Deviance Table
        Df Deviance Resid. Df Resid. Dev
NULL                      586        ...
Length   1      ...       585        ...

Two forms of logistic data

1. Response variable Y = Success/Failure or 1/0: the “long form,” in which each case is a row in a spreadsheet (e.g., Putts1 has 587 cases). This is often called “binary response” or “Bernoulli” logistic regression.

2. Response variable Y = number of successes for a group of data with a common X value: the “short form” (e.g., Putts2 has 5 cases: putts of 3 ft, 4 ft, ..., 7 ft). This is often called “binomial counts” logistic regression.
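A sketch of collapsing the long form into the short form (assuming Made is coded 0/1 in Putts1):

# one row per Length, with binomial counts of successes and failures
Putts2 <- aggregate(cbind(Made = Made, Missed = 1 - Made) ~ Length,
                    data = Putts1, FUN = sum)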

> str(Putts1)
'data.frame': 587 obs. of 2 variables:
 $ Length: int ...
 $ Made  : int ...
> longmodel = glm(Made ~ Length, family=binomial, data=Putts1)
> summary(longmodel)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     ...       ...     ...    <2e-16 ***
Length          ...       ...     ...    <2e-16 ***
---
Null deviance: ... on 586 degrees of freedom
Residual deviance: ... on 585 degrees of freedom

> str(Putts2)
'data.frame': 5 obs. of 4 variables:
 $ Length: int ...
 $ Made  : int ...
 $ Missed: int ...
 $ Trials: int ...
> shortmodel = glm(cbind(Made,Missed) ~ Length, family=binomial, data=Putts2)
> summary(shortmodel)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     ...       ...     ...    <2e-16 ***
Length          ...       ...     ...    <2e-16 ***
---
Null deviance: ... on 4 degrees of freedom
Residual deviance: ... on 3 degrees of freedom

The coefficient estimates match the long-form fit; only the deviances and their degrees of freedom differ.

Binary Logistic Regression Model

Y = binary response, X = single predictor
π = proportion of 1’s (yes, success) at any x

Equivalent forms of the logistic regression model:
Logit form: log(π/(1−π)) = β0 + β1X
Probability form: π = e^(β0+β1X) / (1 + e^(β0+β1X))

Binary Logistic Regression Model

Y = binary response. With a single predictor X, π = proportion of 1’s (yes, success) at any x; with multiple predictors X1, X2, …, Xk, π = proportion of 1’s at any x1, x2, …, xk.

Equivalent forms of the logistic regression model:
Logit form: log(π/(1−π)) = β0 + β1X1 + … + βkXk
Probability form: π = e^(β0+β1X1+…+βkXk) / (1 + e^(β0+β1X1+…+βkXk))
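In R the call simply gains more terms; the data frame dat and predictor names below are hypothetical:

multimodel <- glm(Y ~ X1 + X2 + X3, family = binomial, data = dat)
summary(multimodel)   # each coefficient is a slope on the log-odds scale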

Interactions in logistic regression

Consider Survival in an ICU as a function of SysBP (BP for short) and Sex.

> intermodel = glm(Survive ~ BP*Sex, family=binomial, data=ICU)
> summary(intermodel)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     ...       ...     ...      ...
BP              ...       ...     ...      ...  **
Sex             ...       ...     ...      ...
BP:Sex          ...       ...     ...      ...
Null deviance: ... on 199 degrees of freedom
Residual deviance: ... on 196 degrees of freedom

Rep = red, Dem = blue Lines are very close to parallel; not a significant interaction

Generalized Linear Model

(1) What is the link between Y and β0 + β1X?
    (a) Regular reg: identity
    (b) Logistic reg: logit
    (c) Poisson reg: log
(2) What is the distribution of Y given X?
    (a) Regular reg: Normal (Gaussian)
    (b) Logistic reg: Binomial
    (c) Poisson reg: Poisson
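In R, each link/distribution pairing is selected through the family argument of glm(); Y, X, and dat below are hypothetical:

glm(Y ~ X, family = gaussian, data = dat)   # identity link, Normal errors
glm(Y ~ X, family = binomial, data = dat)   # logit link, Binomial counts
glm(Y ~ X, family = poisson,  data = dat)   # log link, Poisson counts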

C-index, a measure of concordance Med school acceptance: predicted by MCAT and GPA? Med school acceptance: predicted by coin toss??

> library(Stat2Data)
> data(MedGPA)
> str(MedGPA)
> GPA10 = MedGPA$GPA*10
> Med.glm3 = glm(Acceptance ~ MCAT + GPA10, family=binomial, data=MedGPA)
> summary(Med.glm3)
> Accept.hat = fitted(Med.glm3) > .5   # classify at the .5 cutoff
> with(MedGPA, table(Acceptance, Accept.hat))
          Accept.hat
Acceptance FALSE TRUE
         0   ...  ...
         1   ...  ...
The two cells where prediction matches outcome total 41 correct out of 55.

Now consider that there were 30 successes and 25 failures. There are 30*25 = 750 possible pairs. We hope that the predicted Pr(success) is greater for the success than for the failure in a pair; if so, the pair is “concordant.”

> with(MedGPA, table(Acceptance, Accept.hat))
          Accept.hat
Acceptance FALSE TRUE
         0   ...  ...
         1   ...  ...
C-index = % of concordant pairs
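The C-index can also be computed by hand from the fitted probabilities. A minimal sketch (ties among pairs are ignored here, while lrm counts each tie as half):

p <- fitted(Med.glm3)               # predicted Pr(accept) for each of the 55 students
acc <- MedGPA$Acceptance == 1       # 30 successes, 25 failures
mean(outer(p[acc], p[!acc], ">"))   # share of the 30*25 = 750 pairs that are concordant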

The R package rms has a command, lrm, that does logistic regression and gives the C-index.

> # C-index work using the MedGPA data
> library(rms)   # after installing the rms package
> m3 = lrm(Acceptance ~ MCAT + GPA10, data=MedGPA)
> m3
lrm(formula = Acceptance ~ MCAT + GPA10)

                 Model Likelihood    Discrimination    Rank Discrim.
                    Ratio Test          Indexes           Indexes
Obs          55     LR chi2   ...     R2     ...        C      ...
d.f.          2     Pr(> chi2) <...   g      ...        Dxy    ...
max |deriv| 2e-07                     gr     ...        gamma  ...
                                      gp     ...        tau-a  ...
                                      Brier  ...

           Coef   S.E.   Wald Z   Pr(>|Z|)
Intercept   ...    ...     ...       ...
MCAT        ...    ...     ...       ...
GPA10       ...    ...     ...       ...

Suppose we scramble the cases. Then the C-index should be about 1/2, like coin tossing.

> newAccept = sample(MedGPA$Acceptance)   # scramble the acceptances
> m1new = lrm(newAccept ~ MCAT + GPA10, data=MedGPA)
> m1new
lrm(formula = newAccept ~ MCAT + GPA10)

                 Model Likelihood    Discrimination    Rank Discrim.
                    Ratio Test          Indexes           Indexes
Obs          55     LR chi2   ...     R2     ...        C      ...
d.f.          2     Pr(> chi2) ...    g      ...        Dxy    ...
max |deriv| 1e-13                     gr     ...        gamma  ...
                                      gp     ...        tau-a  ...
                                      Brier  ...

           Coef   S.E.   Wald Z   Pr(>|Z|)
Intercept   ...    ...     ...       ...
MCAT        ...    ...     ...       ...
GPA10       ...    ...     ...       ...