
1 Logistic Regression Predicting Dichotomous Data

2 Predicting a Dichotomy
– Response variable has only two states: male/female, present/absent, yes/no, etc.
– Linear regression fails because we cannot keep the prediction within the bounds of 0 – 1
– Continuous and non-continuous predictors are possible

3 Logistic Model
– Explanatory variables are used to predict the probability that the response will be present (male, yes, etc.)
– We fit a linear model to the log of the odds that an event will occur
– If the probability that an event will occur is p, then the odds = p/(1 − p)

4 Logits
Equations:
– logit(p) = log(p/(1 − p))
– logit(p) = b0 + b1x1 + b2x2 + ...
So logistic regression is a linear regression of logits (logs of the odds)
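The two equations above can be sketched directly in R. This is a minimal illustration of the logit transform and its inverse, not part of the original deck; the function names `logit` and `inv_logit` are our own.

```r
# The logit link: maps a probability in (0, 1) to the whole real line
logit <- function(p) log(p / (1 - p))       # log of the odds p/(1-p)

# Its inverse: maps a linear predictor back to a probability
inv_logit <- function(x) 1 / (1 + exp(-x))

logit(0.5)              # odds are 1:1, so the logit is 0
inv_logit(logit(0.8))   # the round trip recovers 0.8
```

R's built-in `qlogis()` and `plogis()` compute the same pair of transforms.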

5 Assumptions
– Dichotomous response (only two states possible)
– Outcomes statistically independent
– Model contains all relevant predictors and no irrelevant ones
– Sample sizes of about 50 cases per predictor

6 Two Approaches
– Data consisting of individual cases with a dichotomous variable
– Grouped data, where the number present and the number absent are known for each combination of explanatory variables (in practice these will usually be categorical/ordinal)
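For the grouped-data approach, R's `glm()` accepts a two-column matrix of (successes, failures) as the response. A small sketch with made-up counts (not the Snodgrass data; the `present`/`absent`/`size` vectors are invented for illustration):

```r
# Counts of houses present/absent at each level of a categorical predictor
present <- c(2, 5, 9)
absent  <- c(8, 5, 1)
size    <- factor(c("small", "medium", "large"),
                  levels = c("small", "medium", "large"))

# cbind(successes, failures) is the grouped-data response form for glm()
fit <- glm(cbind(present, absent) ~ size, family = binomial(logit))

fitted(fit)   # fitted probability of "present" in each group
```

With one parameter per group this model is saturated, so the fitted probabilities equal the observed proportions (0.2, 0.5, 0.9).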

7 Inverting Snodgrass
Instead of testing whether houses inside the white wall are larger than those outside, we can use area to predict where a house is located.

8

9 # Use Rcmdr to create a dichotomous variable In (1 = Inside, 0 = Outside)
Snodgrass$In <- with(Snodgrass, ifelse(Inside=="Inside", 1, 0))

# Use Rcmdr to bin Area into 10 bins, using numbers to label the bins
Snodgrass$AreaBin <- bin.var(Snodgrass$Area, bins=10, method='intervals', labels=FALSE)

# Use Rcmdr to aggregate: compute the mean Area and mean In for each AreaBin
AggregatedData <- aggregate(Snodgrass[,c("Area","In"), drop=FALSE],
  by=list(AreaBin=Snodgrass$AreaBin), FUN=mean)

# Plot the raw data
plot(In ~ Area, data=Snodgrass, las=1)

# Plot the means by AreaBin group
points(AggregatedData[,2:3], type="b", pch=16)

10 Fitting a Simple Model
– We start with a simple model using Area only
– Statistics | Fit Models | Generalized Linear Model
– In is the response, Area is the explanatory variable
– Family is binomial, link function is logit

11 > GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)
> summary(GLM.1)

Call:
glm(formula = In ~ Area, family = binomial(logit), data = Snodgrass)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1103  -0.4815  -0.1836   0.2885   2.5706

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.663071   1.818444  -4.764 1.90e-06 ***
Area         0.034760   0.007515   4.626 3.74e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 123.669  on 90  degrees of freedom
Residual deviance:  57.728  on 89  degrees of freedom
AIC: 61.728

Number of Fisher Scoring iterations: 6
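The coefficients in the summary are on the log-odds scale, so exponentiating them gives the multiplicative change in the odds per unit of the predictor. A quick sketch using the Area slope from the output above (the value is copied from the summary, not recomputed from the data):

```r
b_area <- 0.034760   # slope for Area from summary(GLM.1)

exp(b_area)          # odds multiplier per additional unit of area (~1.035)
exp(100 * b_area)    # odds multiplier per 100 units of area (~32)
```

So each additional unit of floor area multiplies the odds of being inside the wall by about 1.035; over 100 units, the odds grow about 32-fold.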

12 Results
– The slope for Area is highly significant: Area is a significant predictor of the odds of being inside the white wall
– The residual deviance is less than the degrees of freedom (an indicator that the binomial model fits)
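The deviance-versus-degrees-of-freedom check can be made formal with a chi-square test: under the null hypothesis that the model fits, the residual deviance is approximately chi-square on its degrees of freedom. A sketch using the values from the summary above:

```r
resid_dev <- 57.728   # residual deviance from summary(GLM.1)
df        <- 89       # its degrees of freedom

# Upper-tail probability: a large p-value means no evidence of lack of fit
pchisq(resid_dev, df, lower.tail = FALSE)
```

Here the deviance is well below its degrees of freedom, so the p-value is large and there is no indication of lack of fit.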

13

14 # Rcmdr command
> GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)

# Typed commands
> x <- seq(20, 470, 5)
> y <- predict(GLM.1, data.frame(Area=x), type="response")
> plot(In ~ Area, data=Snodgrass, las=2)
> points(AggregatedData[,2:3], type="b", lty=2, pch=16)
> lines(x, y, col="red", lwd=2)

# Rcmdr command
> Snodgrass$Predicted <- with(Snodgrass,
+   factor(ifelse(fitted.GLM.1 < .5, "Outside", "Inside")))

# Use Rcmdr to produce a crosstabulation of Inside and Predicted
> .Table <- xtabs(~ Inside + Predicted, data=Snodgrass)
> .Table
         Predicted
Inside    Inside Outside
  Inside      29       9
  Outside      5      48

> (29 + 48)/(29 + 9 + 5 + 48)
[1] 0.8461538

Predictions are correct 84.6% of the time

15 Expanding the Model
– Expand the model by adding Total and Types
– Check the results: neither of the new variables is significant, but this could be due to the high correlation between the two (+.94)
– Delete Types and try again
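The pattern on this slide, where two correlated predictors each look non-significant when entered together, can be reproduced with simulated data. This sketch uses invented variables `x1` and `x2` (not the Snodgrass `Total` and `Types`) purely to illustrate the collinearity effect:

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)            # x2 is nearly a copy of x1
y  <- rbinom(n, 1, plogis(x1))           # outcome actually driven by x1

fit_both <- glm(y ~ x1 + x2, family = binomial(logit))

cor(x1, x2)                              # very high, ~0.99
# With both predictors in, standard errors inflate and the individual
# p-values can be large even though the pair carries real information:
summary(fit_both)$coefficients[, "Pr(>|z|)"]
```

The cure is the same as on the slide: drop one of the correlated pair and refit.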

16 Third Model
– Without Types, Total is now highly significant
– An ANOVA comparing the 2nd and 3rd models shows no significant difference, so the 3rd (simpler) model is preferred
– Also, the AIC (Akaike's Information Criterion) is lower, which is better
– The new model is 89% accurate
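The model-comparison step uses a likelihood-ratio test between nested fits via `anova(..., test="Chisq")`, alongside `AIC()`. A self-contained sketch with simulated data (the models here are stand-ins, not the actual Snodgrass fits):

```r
set.seed(1)
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)                           # a pure-noise extra predictor
y  <- rbinom(n, 1, plogis(x1))

full    <- glm(y ~ x1 + x2, family = binomial(logit))
reduced <- glm(y ~ x1,      family = binomial(logit))

# Likelihood-ratio test of the nested pair; a large Pr(>Chi) means the
# extra term adds nothing, so the simpler model is preferred
anova(reduced, full, test = "Chisq")

AIC(reduced)   # the simpler model typically wins on AIC too
AIC(full)
```

This mirrors the slide: when the chi-square p-value is non-significant and the AIC of the simpler model is lower, keep the simpler model.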

17 Akaike Information Criterion
– AIC measures the relative goodness of fit of a statistical model
– Roughly, it describes the tradeoff between the accuracy and the complexity of the model
– A method of comparing different statistical models: generally prefer the model with the lower AIC
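Concretely, AIC = 2k − 2·log L, where k is the number of estimated parameters and L is the maximized likelihood, which is exactly how the accuracy/complexity tradeoff enters: the −2·log L term rewards fit, the 2k term penalizes extra parameters. A toy check against R's `AIC()` (the tiny `y`/`x` data here are invented for illustration):

```r
y   <- c(0, 1, 0, 1, 1)
x   <- c(1, 2, 3, 4, 5)
fit <- glm(y ~ x, family = binomial(logit))

k <- length(coef(fit))                  # number of estimated parameters (2)

2 * k - 2 * as.numeric(logLik(fit))     # AIC by hand: 2k - 2*logLik
AIC(fit)                                # matches R's built-in value
```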
