Logistic Regression Hal Whitehead BIOL4062/5062
- Categorical data
- Logistic regression on binary data
- Odds ratio
- Logits
- Probit regression
- With many categories
Categorical data
- Examples: sex, species, morph, physiological state
Categorical vs Continuous (independent variable => dependent variable)
- Continuous => Continuous: Linear regression
- Categorical => Continuous: ANOVA
- Categorical => Categorical: Log-linear models
- Continuous => Categorical: Logistic regression
- {Also: Continuous + Categorical => Categorical}
Logistic Regression on Binary Data
- Two categories (data are proportions)
- Want to work out the probability of being in a category: P
- Logistic regression: P = e^Z / (1 + e^Z), where Z = β0 + β1 · X1 + …
Logistic Regression
- Z = β0 + β1 · X1 + …
- If Z is large and positive: P ~ 1.0
- If Z is large and negative: P ~ 0.0
- Fit β0, β1 using maximum likelihood
- X’s can be categorical as well as continuous
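The limiting behaviour of P can be checked numerically. The sketch below assumes the standard logistic function P = e^Z / (1 + e^Z); the function name `logistic` is illustrative and not taken from the slides.

```python
import math

def logistic(z: float) -> float:
    """Map a linear predictor Z to a probability P = e^Z / (1 + e^Z)."""
    # Two branches so that exp() never overflows when |z| is large.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative Z gives P near 0, Z = 0 gives P = 0.5,
# large positive Z gives P near 1.
for z in (-10.0, 0.0, 10.0):
    print(f"Z = {z:6.1f}  ->  P = {logistic(z):.5f}")
```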
Logistic Regression: Outputs
- Estimates of regression coefficients: β0, β1, …
- Significance of regression coefficients and of the overall logistic regression
- Quantile probabilities
- Accuracy of prediction
- Odds ratios
Logistic Regression
- Regression coefficients estimated by maximizing log-likelihood iteratively
- Significance of coefficients indicated by:
  - likelihood ratio test (theoretically best)
  - Wald test (normal approximation)
- Can reduce numbers of independent variables using stepwise elimination
- Or choose “best” model using AIC
Example: Fruit-fly Death
Dose     Dead   Alive
0.01     1      4
0.1      3      2
1.0      2      3
10.0     4      1
100.0    5      0
Logistic Regression
- Z = β0 + β1 · Log(Dose)
- Constant: β0 = 0.56 (P = 0.255)
- Log(Dose): β1 = 0.92
- Overall P = 0.0064
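The iterative maximum-likelihood fit can be sketched with a hand-written Newton-Raphson loop on the fruit-fly counts. This is a minimal illustration, not the software used for the slides; it assumes Log(Dose) means log base 10 (consistent with the "10-fold" interpretation of the odds ratio), and the names `fit` and `log_likelihood` are hypothetical.

```python
import math

# Fruit-fly data from the example: (dose, dead, alive)
DATA = [(0.01, 1, 4), (0.1, 3, 2), (1.0, 2, 3), (10.0, 4, 1), (100.0, 5, 0)]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1):
    """Binomial log-likelihood (without the constant combinatorial term)."""
    total = 0.0
    for dose, dead, alive in DATA:
        p = logistic(b0 + b1 * math.log10(dose))  # assumption: log base 10
        total += dead * math.log(p) + alive * math.log(1.0 - p)
    return total

def fit(n_iter=25):
    """Newton-Raphson for a two-parameter logistic regression."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for dose, dead, alive in DATA:
            x = math.log10(dose)
            n = dead + alive
            p = logistic(b0 + b1 * x)
            resid = dead - n * p
            w = n * p * (1.0 - p)
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^-1 g, with the 2x2 inverse written out
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit()
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, Log(L) = {log_likelihood(b0, b1):.3f}")
```

Under these assumptions the estimates should land close to the reported β0 = 0.56 and β1 = 0.92.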
Model selection using AIC
Model                 Log(L)     AIC
Constant only         -16.825    35.650
Const, dose           -13.112    30.224
Const, dose, dose²    -12.869    31.738
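The AIC values follow from the log-likelihoods via AIC = -2·Log(L) + 2k, where k is the number of fitted parameters. The sketch below recomputes them and also runs a likelihood ratio test of the dose model against the constant-only model. The parameter counts (1, 2, 3) and the dictionary keys are assumptions read off the model descriptions; the chi-square tail for 1 d.f. uses the identity P(χ² > G) = erfc(√(G/2)).

```python
import math

# (Log(L), number of parameters k) for each model in the table
MODELS = {
    "constant": (-16.825, 1),
    "constant + dose": (-13.112, 2),
    "constant + dose + dose^2": (-12.869, 3),
}

def aic(log_l, k):
    """Akaike information criterion: lower is better."""
    return -2.0 * log_l + 2.0 * k

for name, (log_l, k) in MODELS.items():
    print(f"{name:26s} AIC = {aic(log_l, k):.3f}")

best = min(MODELS, key=lambda name: aic(*MODELS[name]))
print("Lowest AIC:", best)

# Likelihood ratio test: dose model vs constant-only model (1 extra parameter)
g = 2.0 * (MODELS["constant + dose"][0] - MODELS["constant"][0])
p_value = math.erfc(math.sqrt(g / 2.0))  # chi-square upper tail, 1 d.f.
print(f"G = {g:.3f}, P = {p_value:.4f}")
```

The likelihood ratio P-value obtained this way is about 0.0064, which appears to be the overall P reported with the fitted coefficients.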
Accuracy of prediction
                   Predicted:
Actual:            Died    Lived
Died               10.6    4.4
Lived              4.4     5.6
Correct            0.7     0.6
Overall correct    0.65
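The non-integer counts suggest that the table holds expected counts, obtained by summing fitted probabilities rather than by thresholding at P = 0.5. That reading is an assumption; the sketch below applies it using the rounded coefficients β0 = 0.56 and β1 = 0.92 and log base 10 of dose.

```python
import math

# Fruit-fly data: (dose, dead, alive); coefficients rounded as on the slide
DATA = [(0.01, 1, 4), (0.1, 3, 2), (1.0, 2, 3), (10.0, 4, 1), (100.0, 5, 0)]
B0, B1 = 0.56, 0.92

died_pred_died = died_pred_lived = 0.0
lived_pred_died = lived_pred_lived = 0.0
for dose, dead, alive in DATA:
    p = 1.0 / (1.0 + math.exp(-(B0 + B1 * math.log10(dose))))
    died_pred_died += dead * p            # actually died, predicted to die
    died_pred_lived += dead * (1.0 - p)   # actually died, predicted to live
    lived_pred_died += alive * p          # actually lived, predicted to die
    lived_pred_lived += alive * (1.0 - p) # actually lived, predicted to live

n_died = died_pred_died + died_pred_lived
n_lived = lived_pred_died + lived_pred_lived
overall = (died_pred_died + lived_pred_lived) / (n_died + n_lived)

print(f"Died : {died_pred_died:5.1f} {died_pred_lived:5.1f}")
print(f"Lived: {lived_pred_died:5.1f} {lived_pred_lived:5.1f}")
print(f"Correct: {died_pred_died / n_died:.2f} {lived_pred_lived / n_lived:.2f}")
print(f"Overall correct: {overall:.2f}")
```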
Odds ratio
- Compares the odds of something happening at two values of the independent variable:
  ω = [P(A)/(1-P(A))] / [P(B)/(1-P(B))]
- “Odds of dying in next 5 years are ω times greater for smokers than non-smokers”
- Log(ω) = β, so ω = e^β: when the independent variable increases by one, the odds of the event are multiplied by the exponential of the regression coefficient
Odds ratio
- Odds ratio for β1 = e^0.92 = 2.5 (95% c.i. 1.2-5.4)
- Odds of dying are 2.5 times greater when dose is 10-fold stronger
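The link between the coefficient and the odds ratio can be checked directly: the ratio of the odds at Log(Dose) = x + 1 to the odds at x equals e^β1 for any x. The sketch uses the rounded fruit-fly coefficients; the helper names `prob` and `odds` are illustrative.

```python
import math

B0, B1 = 0.56, 0.92  # rounded coefficients from the fruit-fly example

def prob(x):
    """Fitted probability of death at x = Log(Dose)."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * x)))

def odds(p):
    return p / (1.0 - p)

x = 0.0  # any starting value gives the same ratio
omega = odds(prob(x + 1.0)) / odds(prob(x))

print(f"odds ratio from probabilities: {omega:.2f}")
print(f"exp(beta1):                    {math.exp(B1):.2f}")
```

The confidence interval is not recomputed here because it needs the standard error of β1, which the slides do not report.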
Example: Matriarchs As Repositories of Social Knowledge in African Elephants
- Play back vocalizations of other elephants to matriarchal groups of elephants
- Do they “bunch”?
McComb et al. Science 2001
Elephant Knowledge
Dependent variable:
- Bunch / not bunch
Independent variables:
- Family [Categorical]
- Age of matriarch
- Mean age of other females
- Number of females in group
- Number of calves in group
- Age of youngest calf
- Presence of adult males
- Association index between group and playback individual
- Interactions: Age of matriarch X ...
Logistic Regression: Elephant Bunching on:
                                          β         d.f.   P
Variables included in final model
  Family                                  -         20     0.029
  Age of matriarch                        -0.514    1      0.005
  Association index                       98.0      1      0.147
  Age of matriarch × association index    -4.31     1      0.011
Variables excluded from final model
  Age of other females                    -0.201    1      0.248
  Females in group                        0.033     1      0.867
  Calves in group                         0.015     1      0.946
  Age of youngest calf                    0.032     1      0.194
  Presence of males                       -0.851    1      0.166
  Other interactions with Age of matriarch
Logistic Regression: Elephant Bunching
[Figure: 55 yr-old matriarchs vs 35 yr-old matriarchs]
“sensitivity of the bunching response to the association index increased with the age of the matriarch”
McComb et al. Science 2001
Logit
- Logistic regression: P = e^Z / (1 + e^Z), with Z = β0 + β1 · X1 + …
- Logit transformation: Logit(P) = Log[P / (1 - P)] = Z = β0 + β1 · X1 + …
- Logit transformation is inverse of logistic function
- Logit differences are logs of odds-ratios
- Logit regression (almost) equivalent to logistic regression
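Both properties of the logit are easy to verify numerically. The sketch assumes the standard definitions Logit(P) = ln[P / (1 - P)] and P = e^Z / (1 + e^Z); the probabilities 0.8 and 0.5 are arbitrary illustrative values.

```python
import math

def logistic(z):
    """P = e^Z / (1 + e^Z)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

# 1. Logit is the inverse of the logistic function.
z = 1.3
print(logit(logistic(z)))  # recovers z up to rounding error

# 2. A difference of logits is the log of an odds ratio.
p_a, p_b = 0.8, 0.5
odds_ratio = (p_a / (1.0 - p_a)) / (p_b / (1.0 - p_b))
print(logit(p_a) - logit(p_b), math.log(odds_ratio))
```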
Probit Regression
- Transforms values in the range [0, 1] using the inverse cumulative normal function
- Useful for proportions (when numbers are not available)
- Type of generalized linear model
[Figure: Probit(Y) plotted against Y]
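The probit transformation itself can be sketched with the Python standard library, whose `statistics.NormalDist` class provides the inverse cumulative normal as `inv_cdf`. The wrapper name `probit` is illustrative. Note that `inv_cdf` only accepts values strictly between 0 and 1, so proportions of exactly 0 or 1 would need separate handling.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def probit(y: float) -> float:
    """Inverse cumulative standard normal of a proportion 0 < y < 1."""
    return STD_NORMAL.inv_cdf(y)

for y in (0.025, 0.2, 0.5, 0.8, 0.975):
    print(f"Y = {y:5.3f}  ->  Probit(Y) = {probit(y):+.3f}")

# The cumulative normal undoes the transformation.
print(STD_NORMAL.cdf(probit(0.2)))
```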
With Many Categories
- Logistic regression for one category against rest
- Canonical Variate Analysis