Logistic Regression: Predicting Dichotomous Data
Predicting a Dichotomy
- Response variable has only two states: male/female, present/absent, yes/no, etc.
- Linear regression fails because it cannot keep the prediction within the bounds of 0 – 1
- Both continuous and non-continuous predictors are possible
Logistic Model
- Explanatory variables are used to predict the probability that the response will be present (male, yes, etc)
- We fit a linear model to the log of the odds that an event will occur
- If the probability that an event will occur is p, then the odds = p/(1-p)
Logits
- logit(p) = log(p/(1-p))
- logit(p) = b0 + b1*x1 + b2*x2 + ...
- So logistic regression is a linear regression on logits (logs of the odds)
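The two equations above can be checked numerically: base R supplies the logit as qlogis() and its inverse as plogis(). A minimal sketch:

```r
# logit(p) = log(p / (1 - p)); base R's qlogis() computes exactly this
p <- 0.8
odds <- p / (1 - p)
stopifnot(all.equal(log(odds), qlogis(p)))

# The inverse logit maps any real value back into (0, 1),
# which is why fitted probabilities stay within bounds
stopifnot(all.equal(plogis(qlogis(p)), p))
```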
Assumptions
- Dichotomous response (only two states possible)
- Outcomes statistically independent
- Model contains all relevant predictors and no irrelevant ones
- Sample sizes of about 50 cases per predictor
Two Approaches
- Data consisting of individual cases with a dichotomous variable
- Grouped data where the number present and number absent are known for each combination of explanatory variables (in practice these will usually be categorical/ordinal)
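The grouped-data approach can be sketched with a small made-up table (the data below are hypothetical, not from Snodgrass). With grouped data, glm() takes a two-column matrix of successes and failures as the response:

```r
# Hypothetical grouped data: counts of present/absent per category
grouped <- data.frame(
  size    = c("small", "medium", "large"),
  present = c(2, 10, 18),
  absent  = c(18, 10, 2)
)

# The response is cbind(successes, failures) rather than a 0/1 variable
fit <- glm(cbind(present, absent) ~ size, family = binomial(logit),
           data = grouped)
fitted(fit)  # estimated probability of "present" for each group
```

Because this toy model has one parameter per group, the fitted probabilities simply reproduce the observed proportions (0.1, 0.5, 0.9).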
Inverting Snodgrass
Instead of testing whether houses inside the white wall are larger than those outside, we can use area to predict whether a house is located inside the wall.
# Use Rcmdr to create a dichotomous variable In
Snodgrass$In <- with(Snodgrass, ifelse(Inside=="Inside", 1, 0))
# Use Rcmdr to bin Area into 10 bins using numbers as labels
Snodgrass$AreaBin <- bin.var(Snodgrass$Area, bins=10, method='intervals',
  labels=FALSE)
# Use Rcmdr to aggregate: compute mean Area and In for each AreaBin
AggregatedData <- aggregate(Snodgrass[,c("Area","In"), drop=FALSE],
  by=list(AreaBin=Snodgrass$AreaBin), FUN=mean)
# Plot raw data
plot(In~Area, data=Snodgrass, las=1)
# Plot means by AreaBin groups
points(AggregatedData[,2:3], type="b", pch=16)
Fitting a Simple Model
- We start with a simple model using Area only
- Statistics | Fit Models | Generalized Linear Model
- In is the response, Area is the explanatory variable
- Family is binomial, Link function is logit
> GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)
> summary(GLM.1)

Call:
glm(formula = In ~ Area, family = binomial(logit), data = Snodgrass)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
    ...     ...     ...     ...     ...

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...     e-06 ***
Area             ...        ...     ...     e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: ...  on 90  degrees of freedom
Residual deviance: ...  on 89  degrees of freedom
AIC: ...

Number of Fisher Scoring iterations: 6
Results
- Slope value for Area is highly significant – Area is a significant predictor of the odds of being inside the white wall
- The residual deviance is less than the degrees of freedom (an indicator that the binomial model fits)
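The residual-deviance check can be sketched in code. The data below are simulated stand-ins (91 cases, like Snodgrass) just to show where the two numbers come from:

```r
# Simulated stand-in data: 91 cases, one continuous predictor
set.seed(42)
x <- runif(91, 20, 470)
y <- rbinom(91, 1, plogis(-2 + 0.01 * x))

fit <- glm(y ~ x, family = binomial(logit))

# Residual deviance at or below its degrees of freedom suggests
# no evidence of lack of fit for the binomial model
c(deviance = deviance(fit), df = df.residual(fit))
```

With one intercept and one slope estimated from 91 cases, the residual degrees of freedom are 91 - 2 = 89, matching the summary output above.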
# Rcmdr command
> GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)
# Typed commands
> x <- seq(20, 470, 5)
> y <- predict(GLM.1, data.frame(Area=x), type="response")
> plot(In~Area, data=Snodgrass, las=2)
> points(AggregatedData[,2:3], type="b", lty=2, pch=16)
> lines(x, y, col="red", lwd=2)
# Rcmdr command
> Snodgrass$Predicted <- with(Snodgrass,
+   factor(ifelse(fitted.GLM.1 < .5, "Outside", "Inside")))
# Use Rcmdr to produce a crosstabulation of Inside and Predicted
> .Table <- xtabs(~Inside+Predicted, data=Snodgrass)
> .Table
         Predicted
Inside    Inside Outside
  Inside      29       9
  Outside      5      48
> (29+48)/(29+9+5+48)
[1] 0.8461538
Predictions are correct 84.6% of the time
Expanding the Model
- Expand the model by adding Total and Types
- Check the results – neither of the new variables is significant, but this could be due to the high correlation between the two (+.94)
- Delete Types and try again
Third Model
- Without Types, Total is now highly significant
- An ANOVA comparing the 2nd and 3rd models shows no significant difference, so the 3rd (simpler) model is preferred
- Also AIC, Akaike's Information Criterion, is lower (which is better)
- The new model is 89% accurate
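The model-comparison step can be sketched with anova() and AIC(). The data and predictor names below are simulated stand-ins, not the Snodgrass variables:

```r
# Simulated stand-in data: x3 plays the role of an irrelevant predictor
set.seed(1)
n  <- 91
x1 <- runif(n); x2 <- runif(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + 3 * x1 + 2 * x2))

m2 <- glm(y ~ x1 + x2 + x3, family = binomial)  # fuller "2nd" model
m3 <- glm(y ~ x1 + x2,      family = binomial)  # simpler "3rd" model

# Likelihood-ratio test: no significant difference -> keep the simpler model
anova(m3, m2, test = "Chisq")
# AIC comparison: the lower value is preferred
AIC(m3, m2)
```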
Akaike Information Criterion
- AIC measures the relative goodness of fit of a statistical model
- Roughly, it describes the tradeoff between the accuracy and the complexity of the model
- A method of comparing different statistical models – generally prefer the model with the lower AIC
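The accuracy/complexity tradeoff has a simple form: AIC = 2k - 2*log-likelihood, where k is the number of estimated parameters. This identity can be verified against R's AIC():

```r
# Verify AIC = 2k - 2*logLik on a toy intercept-only logistic model
set.seed(7)
y <- rbinom(50, 1, 0.4)
fit <- glm(y ~ 1, family = binomial)

k <- attr(logLik(fit), "df")  # number of estimated parameters
stopifnot(all.equal(AIC(fit), 2 * k - 2 * as.numeric(logLik(fit))))
```

Each extra parameter adds 2 to the AIC, so it must buy at least that much improvement in log-likelihood to be worth keeping.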