Logistic Regression: Modeling with Dichotomous Dependent Variables
A New Type of Model… Dichotomous Dependent Variable: Why did someone vote for Bush or Kerry? Why did residents own or rent their houses? Why do some people drink alcohol and others don’t? What determined if a household owned a car?
Dependent Variable… Is binary, with a yes or a no answer. Can be coded 1 for yes and 0 for no. There are no other valid responses.
Problem: OLS regression does not model the relationship well. A fitted straight line is not bounded by 0 and 1.
Solution: Use a Different Functional Form. The properties we need: (1) the model should be bounded by 0 and 1; (2) the model should estimate a value for the dependent variable in terms of the probability of being in one category or the other, e.g., an owner or a renter, or a Bush voter or a Kerry voter.
Solution, cont. We want to know the probability, p, that a particular case falls in the 0 or the 1 category. We want to derive a model that gives good estimates of that probability, or, put another way, that tells us whether a particular case is likely to be a 0 or a 1.
Solution: A Logistic Curve
The Logistic Function Probability that a case is a 0 or a 1 is distributed according to the logistic function.
Remember probabilities… Probabilities range from 0 to 1. Probability: frequency of being in one category relative to the total of all categories. Example: The probability that the first card dealt in a card game is a queen of hearts is 1/52 (one in 52). It does us no good to “predict” a value of .5 as in the linear regression model.
But can we manipulate probabilities to estimate the logistic function? Steps: (1) convert probabilities to odds, P/(1-P); (2) convert odds to log odds, or logits, ln(P/(1-P)).
Manipulating probabilities to estimate the logistic function
LIST V2 V3 V4 V5 /N=13
Case number      P      1-P    P/(1-P)   ln(P/(1-P))
     1         0.010   0.990     0.010      -4.595
     2         0.050   0.950     0.053      -2.944
     3         0.100   0.900     0.111      -2.197
     4         0.200   0.800     0.250      -1.386
     5         0.300   0.700     0.429      -0.847
     6         0.400   0.600     0.667      -0.405
     7         0.500   0.500     1.000       0.000
     8         0.600   0.400     1.500       0.405
     9         0.700   0.300     2.333       0.847
    10         0.800   0.200     4.000       1.386
    11         0.900   0.100     9.000       2.197
    12         0.950   0.050    19.000       2.944
    13         0.990   0.010    99.000       4.595
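The two conversion steps can be reproduced in a few lines. This is a minimal sketch in Python (the original listing came from a statistics package); the function names odds and logit are chosen here for illustration.

```python
import math

def odds(p):
    """Odds of being a 1: P / (1 - P)."""
    return p / (1.0 - p)

def logit(p):
    """Log odds (logit): ln(P / (1 - P))."""
    return math.log(odds(p))

# The 13 probabilities listed in the table
probabilities = [0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50,
                 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

print(f"{'Case':>4} {'P':>6} {'1-P':>6} {'P/(1-P)':>8} {'ln(P/(1-P))':>12}")
for case, p in enumerate(probabilities, start=1):
    print(f"{case:>4} {p:>6.3f} {1 - p:>6.3f} {odds(p):>8.3f} {logit(p):>12.3f}")
```

Probabilities are squeezed between 0 and 1, but the log odds run symmetrically around 0 and are unbounded in both directions, which is what makes them suitable for a linear equation.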
Logistic Function
Steps…. Log odds = a + bx. Odds = exp(a + bx). Probability = odds / (1 + odds), which follows the logistic function.
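Going the other way, from a linear equation back to a probability, can be sketched as follows. The values of a and b below are hypothetical, chosen only to show the shape of the curve; they are not estimates from any model in these slides.

```python
import math

def probability_from_log_odds(log_odds):
    """Exponentiate the log odds to get odds, then convert odds to P."""
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

def logistic_curve(a, b, x):
    """Probability of a 1 when log odds = a + b*x."""
    return probability_from_log_odds(a + b * x)

# Hypothetical intercept and slope, for illustration only
a, b = -4.0, 0.1

for x in range(0, 81, 20):
    print(f"x = {x:>2}  P = {logistic_curve(a, b, x):.3f}")
```

However large or small a + bx becomes, the result stays between 0 and 1, which is the bounded behavior the linear model could not provide.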
An Example: Determinants of Homeownership. Age of the householder; age of the householder squared; building type; year house was built; householder’s ethnicity; occupational status scale.
Calculating the Model: Maximum Likelihood Estimation (not OLS). The output gives estimates of the b’s, standard errors, t ratios, and p values for coefficients. Coefficients are estimates of the impact of the independent variable on the logit of the dependent variable.
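Maximum likelihood estimation is named here without being shown. As a rough sketch only, the following fits log odds = a + bx for a single predictor by Newton-Raphson steps on the log likelihood. The data are invented toy values, not the homeownership sample, and fit_logit is a hypothetical name; statistical packages typically use more careful versions of the same idea.

```python
import math

def fit_logit(xs, ys, iterations=25):
    """Estimate a and b in log odds = a + b*x by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Predicted probability of a 1 for each case under the current a, b
        ps = [1.0 / (1.0 + math.exp(-(a + b * x))) for x in xs]
        # Gradient of the log likelihood with respect to a and b
        g_a = sum(y - p for y, p in zip(ys, ps))
        g_b = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Information matrix entries (negative second derivatives)
        w = [p * (1.0 - p) for p in ps]
        h_aa = sum(w)
        h_ab = sum(wi * x for wi, x in zip(w, xs))
        h_bb = sum(wi * x * x for wi, x in zip(w, xs))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system for the change in (a, b)
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Toy data (invented for illustration): x is a predictor, y is coded 0 or 1
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [0, 0, 1, 0, 1, 1, 0, 1]

a_hat, b_hat = fit_logit(toy_x, toy_y)
fitted = [1.0 / (1.0 + math.exp(-(a_hat + b_hat * x))) for x in toy_x]
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

Unlike OLS, there is no closed-form solution: the estimates are improved step by step until the log likelihood stops increasing.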
Logistic Regression Model
    Parameter            Estimate     S.E.   t-ratio   p-value
 1  CONSTANT               -6.976    1.501    -4.647     0.000
 2  AGE                     0.250    0.060     4.132     0.000
 3  AGESQ                  -0.002    0.001    -3.400     0.001
 4  BLDGTYP2$_cottage       0.036    0.277     0.131     0.895
 5  BLDGTYP2$_duplex       -1.432    0.328    -4.363     0.000
 6  YEAR                    0.061    0.022     2.757     0.006
 7  GERMAN                  0.706    0.264     2.677     0.007
 8  POLISH                  0.777    0.422     1.841     0.066
 9  OCCSCALE                0.190    0.091     2.074     0.038
Logistic Regression model, cont.
    Parameter            Odds Ratio    Upper    Lower
 2  AGE                     1.284      1.445    1.140
 3  AGESQ                   0.998      0.999    0.997
 4  BLDGTYP2$_cottage       1.037      1.784    0.603
 5  BLDGTYP2$_duplex        0.239      0.454    0.125
 6  YEAR                    1.063      1.109    1.018
 7  GERMAN                  2.026      3.398    1.208
 8  POLISH                  2.175      4.972    0.951
 9  OCCSCALE                1.209      1.446    1.011
Log Likelihood of constants only model = LL(0) = -303.864
2*[LL(N)-LL(0)] = 85.180 with 8 df; Chi-sq p-value = 0.000
McFadden's Rho-Squared = 0.140
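The odds ratios are the exponentiated coefficients from the previous table. The sketch below reproduces a few of them, and also recovers McFadden's rho-squared from the two likelihood figures. The bounds are computed on the assumption that they are 95% intervals of the form exp(b ± 1.96 × S.E.); the output does not state this, and because the published coefficients are rounded to three decimals the results match only approximately.

```python
import math

# (estimate, standard error) copied from the coefficient table
coefficients = {
    "AGE": (0.250, 0.060),
    "BLDGTYP2$_duplex": (-1.432, 0.328),
    "GERMAN": (0.706, 0.264),
    "POLISH": (0.777, 0.422),
    "OCCSCALE": (0.190, 0.091),
}

Z_95 = 1.96  # assumption: bounds are 95% intervals

def odds_ratio_interval(b, se, z=Z_95):
    """Return (odds ratio, upper bound, lower bound) for one coefficient."""
    return math.exp(b), math.exp(b + z * se), math.exp(b - z * se)

for name, (b, se) in coefficients.items():
    ratio, upper, lower = odds_ratio_interval(b, se)
    print(f"{name:<18} {ratio:6.3f} {upper:6.3f} {lower:6.3f}")

# McFadden's rho-squared = 1 - LL(N)/LL(0)
ll_0 = -303.864                 # constants-only log likelihood
lr_chi_square = 85.180          # 2*[LL(N) - LL(0)]
ll_n = ll_0 + lr_chi_square / 2.0
rho_squared = 1.0 - ll_n / ll_0
print(f"LL(N) = {ll_n:.3f}, rho-squared = {rho_squared:.3f}")
```

An interval that includes 1.0, as for POLISH and the cottage building type, corresponds to a coefficient that is not significant at the .05 level.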
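Converting Odds Ratios to Probabilities: Odds = P/(1-P); an odds ratio compares the odds for two groups. For Germans, compared with the omitted category (Americans and other ethnicities) and controlling for other variables, the odds ratio is 2.026: the odds of homeownership for Germans are about twice the odds for the omitted category. Germans are more likely to own houses than Americans. Can we be more specific?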
Calculating Probability of a Case: Log odds of homeownership = -6.976 + .250 Age - .002 Agesquared + .036 cottage - 1.432 duplex + .061 Year + .706 German + .777 Polish + .190 occscale. Plug in values and solve the equation. Exponentiate the result to create the odds. Convert the odds to a probability for the case.
Calculations: Log odds of homeownership = -6.976 + .250 Age - .002 Agesquared + .036 cottage - 1.432 duplex + .061 Year + .706 German + .777 Polish + .190 occscale. For a 40 year old skilled, American born worker, living in a residence built in 1892, the dummy variables (cottage, duplex, German, Polish) are all 0 and drop out: Log odds of homeownership = -6.976 + .250*40 - .002*1600 + .061*5 + .190*3. Log odds = .699.
Calculations, cont. Log odds = .699. Odds = antilog (exponentiation) of .699 = 2.012. Odds = P/(1-P) = 2.012. Solve for P: P = odds/(1 + odds) = 2.012/3.012. The result is .67.
More calculations…. How about a 40 year old German skilled worker in an 1892 residence? Log odds of homeownership = -6.976 + .250 Age - .002 Agesquared + .036 cottage - 1.432 duplex + .061 Year + .706 German + .777 Polish + .190 occscale. Log odds = -6.976 + .250*40 - .002*1600 + .061*5 + .706 + .190*3 = 1.405. Note as well that .699 + .706 = 1.405: on the log odds scale, the coefficient for “German” is added. On the odds scale, the odds ratio multiplies instead: 2.012 * 2.026 (the odds ratio for the variable “German”) = 4.076, which is the antilog of 1.405.
More calculations: Convert the log odds to odds, i.e., take the antilog of 1.405 = 4.076. Odds = 4.076 = P/(1-P). Solve for P: P = 4.076/5.076 = .803. So the probability of homeownership rises from .67 for Americans to .803 for Germans, an increase of about 13 percentage points.
More calculations: For a 30 year old American worker in a residence built in 1892: Log odds = -6.976 + .250*30 - .002*900 + .061*5 + .190*3 = -0.401. Odds = antilog of (-.401) = 0.670. Probability of ownership = .670/1.670 = 0.401.
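The three worked cases can be checked with a short script. This is a sketch using the rounded coefficients as printed; the values year = 5 and occscale = 3 are the codes used in the hand calculations, and the function and argument names are chosen here for illustration.

```python
import math

# Coefficients as printed in the model table
COEF = {
    "constant": -6.976, "age": 0.250, "agesq": -0.002,
    "cottage": 0.036, "duplex": -1.432, "year": 0.061,
    "german": 0.706, "polish": 0.777, "occscale": 0.190,
}

def log_odds(age, year, occscale, cottage=0, duplex=0, german=0, polish=0):
    """Plug one case's values into the fitted equation."""
    return (COEF["constant"] + COEF["age"] * age + COEF["agesq"] * age ** 2
            + COEF["cottage"] * cottage + COEF["duplex"] * duplex
            + COEF["year"] * year + COEF["german"] * german
            + COEF["polish"] * polish + COEF["occscale"] * occscale)

def probability(lo):
    """Exponentiate the log odds, then convert odds to a probability."""
    odds = math.exp(lo)
    return odds / (1.0 + odds)

cases = {
    "American, age 40": log_odds(age=40, year=5, occscale=3),
    "German, age 40": log_odds(age=40, year=5, occscale=3, german=1),
    "American, age 30": log_odds(age=30, year=5, occscale=3),
}

for label, lo in cases.items():
    print(f"{label:<17} log odds = {lo:6.3f}  odds = {math.exp(lo):5.3f}  "
          f"P = {probability(lo):.3f}")
```

The same odds ratio (2.026) produces different changes in probability depending on where a case starts on the curve, which is why the probabilities have to be worked out case by case.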
Classification Table
Model Prediction Success Table
Actual             Predicted Choice         Actual
Choice          Response    Reference        Total
Response         281.647       85.353      367.000
Reference         85.353       58.647      144.000
Pred. Tot.       367.000      144.000      511.000
Correct            0.767        0.407
Success Ind.       0.049        0.125
Tot. Correct       0.666
Sensitivity: 0.767     Specificity: 0.407
False Reference: 0.233     False Response: 0.593
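The summary rows follow from the four cells of the table. The sketch below recomputes them; the formula used for the success index (proportion correct minus that category's share of all cases) is inferred from the printed numbers rather than stated in the output, so treat it as an assumption.

```python
# Cells of the prediction success table (rows = actual, columns = predicted)
resp_resp = 281.647   # actual Response, predicted Response
resp_ref = 85.353     # actual Response, predicted Reference
ref_resp = 85.353     # actual Reference, predicted Response
ref_ref = 58.647      # actual Reference, predicted Reference

actual_resp = resp_resp + resp_ref    # 367
actual_ref = ref_resp + ref_ref       # 144
total = actual_resp + actual_ref      # 511

sensitivity = resp_resp / actual_resp            # Response cases predicted correctly
specificity = ref_ref / actual_ref               # Reference cases predicted correctly
total_correct = (resp_resp + ref_ref) / total
false_reference = resp_ref / actual_resp         # Response cases predicted as Reference
false_response = ref_resp / actual_ref           # Reference cases predicted as Response

# Assumed definition: gain over guessing from the category's overall share
success_resp = sensitivity - actual_resp / total
success_ref = specificity - actual_ref / total

print(f"Sensitivity {sensitivity:.3f}  Specificity {specificity:.3f}")
print(f"Total correct {total_correct:.3f}")
print(f"Success index {success_resp:.3f} {success_ref:.3f}")
print(f"False reference {false_reference:.3f}  False response {false_response:.3f}")
```

The model identifies owners (the Response category) far better than renters: about three in four owners are classified correctly, but fewer than half of the Reference cases are.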
Extending the Logic… Logistic regression can be extended to more than 2 categories for the dependent variable, for multi-response models. Classification tables can be used to understand misclassified cases. Results can be analyzed for patterns across different values of the independent variables.