1
Logistic Regression PLAN 6930
2
Categorical Dependent Variables
OLS regression assumes that the dependent variable is interval-level. But suppose we want to know the determinants of a teenager smoking (smokes vs. doesn't smoke), the determinants of someone having lung cancer (has lung cancer vs. doesn't), or the determinants of homeownership (owner vs. renter). These are all non-interval dependent variables.
3
STATA EXAMPLES
Data: NYCHVS, a triennial survey of approximately 18,000 units, roughly 15,000 of them occupied
Dependent variable: Homeownership
Independent variables: Age, Race, Marital status, Income
4
HOMEOWNER
5
Categorical Dependent Variables
OLS is inappropriate for categorical dependent variables for the following reasons: The errors are not normally distributed. OLS regression assumes normally distributed errors, but the errors cannot be normal when Y takes on only two values, 0 or 1; they follow a binomial distribution instead. This means we cannot use normal-theory inference to test whether the slope equals 0. So how can we determine how certain we are that the slope really differs from 0? Recall that this may not be a problem in a very large sample, because of the Central Limit Theorem.
6
The Problem with OLS
The variances of the error term are no longer homoscedastic. In the presence of heteroscedasticity we cannot use S_b, the standard error of b, for statistical inference. In this case the Central Limit Theorem does not save us either: we cannot appropriately test the hypothesis that b = 0.
7
Error Term
predict homeowner_resid, residual
A normal distribution has kurtosis of 3 and skewness of 0.
8
Error Term
histogram homeowner_resid, normal
9
The Problem with OLS
We may get nonsensical values for the predicted ŷ. If we predict Y for a categorical dependent variable, we are in effect calculating the conditional probability of Y for a given value of X. Example: we predict the probability of a secretary being fired for given values of typing speed. Substituting the typing speed of someone in the sample who types 94 words per minute into the fitted regression equation gives ŷ = 1.14. But a probability must range between 0 and 1, and simply setting values above 1 to 1 and values below 0 to 0 is arbitrary.
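To make the out-of-range problem concrete, here is a minimal Python sketch. The intercept and slope are hypothetical stand-ins (the slide's actual estimates were not preserved), chosen so that 94 words per minute reproduces the slide's prediction of about 1.14:

```python
import math

# Hypothetical coefficients, for illustration only:
# chosen so that x = 94 gives a linear prediction of ~1.14.
A, B = 0.012, 0.012  # intercept, slope

def linear_prob(x):
    """OLS-style linear prediction: can fall outside [0, 1]."""
    return A + B * x

def logistic_prob(x):
    """Logistic prediction: always strictly between 0 and 1."""
    return 1 / (1 + math.exp(-(A + B * x)))

print(linear_prob(94))    # ~1.14, an impossible "probability"
print(logistic_prob(94))  # same linear index, but mapped into (0, 1)
```

The logistic transformation squashes any linear index into the (0, 1) interval, which is why it is preferred here.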
10
Predicted Probability of being Homeowner
predict homeowner_hat Predicted Probability of being Homeowner
11
The Problem with OLS R2 is of dubious validity when the dependent variable is categorical. Recall that R2 is a measure of how well the regression line fits the scatterplot. But with a categorical dependent variable all of the data fall on one of two lines representing 0 or 1. Generally no regression line will fit this data very well.
12
Probability Models
When we have a categorical dependent variable we use probability models:
Logistic: binary, ordinal, multinomial
Probit
Poisson: count data (e.g., the number of times a student raises their hand)
13
Logistic Regression
With a binary (two-outcome) dependent variable we wish to know the probability, or likelihood, of one outcome occurring. Using OLS, however, can lead to probabilities outside the 0-1 range. To circumvent this problem statisticians suggest: using the odds of Y, which are bounded below by 0 but unbounded above; then taking the natural logarithm of the odds, ln(odds), the logit of Y, which is unbounded in both directions. We use ln(odds) so that we have a linear relationship between the dependent and independent variables.
14
The Linear Probability Model
15
Transformation
16
Transformation
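In standard notation, the transformation takes the probability to the odds and the odds to the logit:

```latex
\text{odds} = \frac{p}{1-p},
\qquad
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x
```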
17
Properties of the logit
P ranges from 0 to 1; the logit ranges from -∞ to +∞. We now have a linear relationship between X and Y. The slope measures the change in the logit for a unit change in X.
18
Logistic Regression
The logit is preferable for analyzing dichotomous variables. The probability, odds, and logit are all functions of one another. To go from logit(y) to the odds, exponentiate the logit: odds(y) = e^logit(y). To go to a probability: p(y) = odds(y) / (1 + odds(y)).
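These conversions can be sketched in a few lines of Python (the function names are mine, for illustration):

```python
import math

def logit_to_odds(logit):
    # odds(y) = e^{logit(y)}
    return math.exp(logit)

def odds_to_prob(odds):
    # p(y) = odds(y) / (1 + odds(y))
    return odds / (1 + odds)

def prob_to_logit(p):
    # logit(y) = ln(p / (1 - p))
    return math.log(p / (1 - p))

# Round trip: a logit of 0 is an odds of 1 and a probability of 0.5
print(odds_to_prob(logit_to_odds(0.0)))  # 0.5
```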
19
Estimating the Log-likelihood function
For a given set of X values we observe y taking on the value of 0 or 1. We use maximum likelihood to estimate the function that best describes the relationship between y and X(s)
20
Estimating the Log-likelihood function
Maximum likelihood is an iterative process for estimating the function that best describes the relationship between y and the X(s): start with an initial estimate and repeat until a further iteration no longer improves the likelihood.
21
Estimating the Log-likelihood function
Example (by analogy): given the Fahrenheit-Celsius pairs 32 – 0, 40 – 4, 60 – 16, 70 – 21, we would try various candidate formulas until we arrived at Tc = (5/9)*(Tf - 32). The computer uses an algorithm in the same spirit to solve the likelihood function.
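A minimal sketch of the iterative idea, using simple gradient ascent on the log-likelihood of a tiny made-up dataset (not the homeownership data; real software uses faster algorithms such as Newton-Raphson):

```python
import math

# Made-up binary data for illustration (x = predictor, y = 0/1 outcome)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

def log_likelihood(b0, b1):
    """Log-likelihood of the logistic model at coefficients (b0, b1)."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

# Start with an estimate (0, 0) and repeatedly step uphill
b0, b1, rate = 0.0, 0.0, 0.01
start = log_likelihood(b0, b1)
for _ in range(1000):
    g0 = sum(y - 1 / (1 + math.exp(-(b0 + b1 * x))) for x, y in zip(xs, ys))
    g1 = sum((y - 1 / (1 + math.exp(-(b0 + b1 * x)))) * x for x, y in zip(xs, ys))
    b0 += rate * g0
    b1 += rate * g1

print(log_likelihood(b0, b1) > start)  # True: the iterations improved the fit
```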
22
Interpreting Logistic Regression
Constant: similar to the intercept in OLS
Interpreting regression coefficients (like slopes in OLS)
Statistical significance of coefficients: is there a relationship between the independent and dependent variables?
Statistical significance of the entire model: akin to the F-test in OLS regression
Accuracy of the model: akin to R2 in OLS
23
Constant
The constant simply tells us the logit of y when all the independent variables equal 0.
24
Interpreting the Coefficients
The logit is not easily interpretable. Three ways of interpreting coefficients: odds ratios, predicted probabilities, and marginal changes. We will focus on the first two.
25
Interpreting Logistic Coefficients
Dependent Variable: Homeowner, yes = 1, no = 0 Independent variables Age Income (in 10 thousands) Black Hispanic Asian Other Married Immigrant
26
Logistic Regression
27
Interpreting Logistic Coefficients
Odds Ratio
Income (continuous variable): 1.08. An increment of $10,000 in household income is associated with an 8% increase in the odds of being a homeowner, when other variables are held constant.
Hispanic (dummy variable): .32. The odds of a Hispanic being a homeowner are 68% lower than (or 32% as great as) the odds for whites (the reference category), when other variables are held constant.
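The percentage interpretations above come from a simple transformation of the odds ratio; a small illustrative helper:

```python
def odds_ratio_pct_change(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1) * 100

print(odds_ratio_pct_change(1.08))  # income: about +8% per $10,000
print(odds_ratio_pct_change(0.32))  # Hispanic: about -68% vs. whites
```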
28
Interpreting Logistic Coefficients: Prediction
29
Interpreting Logistic Coefficients: Prediction
30
Interpreting Logistic Coefficients: Prediction
31
Interpreting Logistic Coefficients: Prediction
32
Interpreting Logistic Coefficients
Predicted Probability
Hispanic (dummy variable): Hispanic: 19%; White: 38%. Whites are predicted to have a much higher homeownership rate, even after holding other variables constant.
33
Interpreting Logistic Coefficients
Predicted probabilities of homeownership
Income (continuous variable): $4,200: 20%; $40,000: 26%; $166,000: 50%. The probability of owning one's home increases as income increases, even after holding other variables constant.
34
Interpreting Logistic Coefficients
In Major League Baseball the teams with the best records enter a playoff tournament; the winners play for the championship in the World Series. Team makes the playoffs = 1, did not make the playoffs = 0: a binary outcome.
35
Interpreting Logistic Coefficients
Dependent variable: Team makes the playoffs
Independent variables:
Team earned run average (ranges)
Team on-base percentage (ranges)
Strikeouts by the team's pitchers (ranges 905-1,387)
Home runs hit by the team (ranges)
36
Descriptive Statistics—Used for Predictions
37
Descriptive Statistics—Used for Prediction
38
Interpreting Logistic Coefficients
Dependent variable: Team makes the playoff Interpret Odds Ratios, write a sentence for on base percentage (obp2) & era
39
Interpreting Logistic Coefficients
Dependent variable: Team makes the playoffs. Predicted probabilities of making the playoffs. Write a sentence interpreting the result. If you are a baseball team owner, which seems more important for making the playoffs: a team with a high on-base percentage or a team with a lot of home runs?
40
On Base Percentage
41
Home Runs
42
Statistical Significance of Coefficient
The null hypothesis is that the independent variable has no effect, i.e., that the odds ratio is 1. The logistic regression coefficient estimate approximately follows a normal distribution. For income, the probability of obtaining the parameter estimate we did, if the true parameter were 0, was .000. For Hispanic, the probability was likewise .000. If we use an alpha of .05, do we accept or reject the null hypothesis?
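The test behind these p-values is the usual Wald z test (standard notation, not shown on the slide):

```latex
z = \frac{\hat{\beta}}{\widehat{SE}(\hat{\beta})} \sim N(0,1)
\quad \text{under } H_0\colon \beta = 0
```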
43
Statistical Significance of Coefficient
Interpret the statistical significance of the coefficients in the following table, using our baseball model:
obp2: on-base percentage multiplied by 100
hr: home runs hit per team
so: strikeouts thrown by the team's pitchers
era: earned run average of the team's pitchers
45
Statistical Significance of Entire Model
The log-likelihood ratio compares how accurate our predictions are without the independent variables to how accurate they are with them. The likelihood-ratio statistic has a chi-square distribution, with degrees of freedom equal to the number of independent variables. If the predictions are much better with the independent variables and/or the sample size is large, you are likely to get a statistically significant result.
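In standard notation (the symbols are mine, not from the slide), the likelihood-ratio statistic is:

```latex
LR = -2\left[\ln L(M_{\text{intercept}}) - \ln L(M_{\text{full}})\right] \sim \chi^2_{df},
\quad df = \text{number of independent variables}
```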
46
Statistical Significance of Entire Model
We return to our model predicting homeownership
47
Statistical Significance of Entire Model
Assuming an alpha of .05, what does the probability of observing a likelihood ratio as large as the one listed below tell us?
48
Statistical Significance of Entire Model
Assuming an alpha of .05, what does the probability of observing a likelihood ratio as large as the one listed below tell us? The probability is .0000, meaning we can reject the null hypothesis that the model performs no better than a model without independent variables.
49
Statistical Significance of Entire Model
The precise interpretation is: "the probability of observing a likelihood ratio this large with 8 degrees of freedom is .0000." The probability is not actually 0, but it is very small.
50
Accuracy of model How well does the model fit the data? Analogous to R2 No agreed upon measure of goodness of fit for probability models One method compares the likelihood ratio with independent variables to the likelihood ratio without independent variables “Pseudo R-square” Reported by default in Stata
51
Accuracy of model
How well does the model fit the data? Analogous to R2. There is no agreed-upon measure of goodness of fit for probability models. One measure compares the predicted category of the dependent variable, based on each individual's characteristics (e.g., a 20-year-old male), with the actual category, and looks at the proportion of correct predictions.
52
Accuracy of model Adjusted Count R2
Intuition: With a binary outcome we can try to predict the outcome Example: In our homeownership data 32% are owners If we simply predict that everyone is a renter we will be correct 68% of the time Does our model allow us to improve our predictive accuracy?
53
Accuracy of Model
Comparing actual homeowner status with the model's predictions: accuracy 70%.
54
Accuracy of Model Adjusted count R2
n is the count of the most frequent outcome. The adjusted count R-squared measures the proportion of correct predictions beyond the naïve prediction (always predicting the most frequent outcome).
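Using the accuracy figures from the homeownership example (the naïve rule "everyone rents" is right 68% of the time, the model 70% of the time), a small illustrative calculation:

```python
# Adjusted count R-squared, expressed in proportions:
#   (model accuracy - naive accuracy) / (1 - naive accuracy)
# where the naive rule always predicts the most frequent outcome.
def adjusted_count_r2(model_accuracy, naive_accuracy):
    return (model_accuracy - naive_accuracy) / (1 - naive_accuracy)

print(adjusted_count_r2(0.70, 0.68))  # about 0.06
```

So the model improves on the naïve prediction by about 6% of the room available for improvement.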
55
Accuracy of Model
Model accuracy and fit statistics can be obtained with the "fitstat" command in Stata (you may have to install it first: type "findit fitstat" and click on the installation files).
56
Accuracy of Model
57
Accuracy of Model Interpret the adjusted count R2 below from our baseball model
58
Accuracy of model
McFadden's "Pseudo" R2
Mfull = model with predictors
Mintercept = model without predictors (intercept only)
Note: there are several "pseudo" R2 measures; this is the default produced by Stata.
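In standard notation, McFadden's pseudo R2 is:

```latex
R^2_{\text{McFadden}} = 1 - \frac{\ln L(M_{\text{full}})}{\ln L(M_{\text{intercept}})}
```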
59
Accuracy of model McFadden’s “Pseudo” R2
Ranges from 0 but cannot reach 1. Generally lower than the R2 found in OLS regression. Problem: what is an intuitive explanation of the improvement in the ratio of the log-likelihoods?
60
Tjur's D (the Coefficient of Discrimination)
For each of the two categories of the dependent variable, calculate the mean of the predicted probabilities of an event; then take the difference between those two means. If a model makes good predictions, the cases with events should have high predicted probabilities and the cases without events should have low ones.
Tjur, T. (2009) "Coefficients of determination in logistic regression models—A new proposal: The coefficient of discrimination." The American Statistician 63.
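Tjur's D is easy to compute from predicted probabilities; a short sketch with made-up toy values:

```python
# Tjur's D: mean predicted probability among actual events (y = 1)
# minus mean predicted probability among non-events (y = 0).
def tjur_d(y, p_hat):
    events = [p for yi, p in zip(y, p_hat) if yi == 1]
    non_events = [p for yi, p in zip(y, p_hat) if yi == 0]
    return sum(events) / len(events) - sum(non_events) / len(non_events)

# Toy predictions, for illustration only
y     = [1,   1,   0,   0]
p_hat = [0.8, 0.6, 0.3, 0.1]
print(tjur_d(y, p_hat))  # about 0.5: events score much higher than non-events
```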
62
Accuracy of model
Suggested strategy for presenting model accuracy for logistic regression: present a pseudo R2 along with an explanation of which statistic you are presenting; present the adjusted count R2; or present the coefficient of discrimination (Tjur's D).
63
Rent Regulation Model
64
Nodefic: number of deficiencies (e.g., holes in the wall)
66
Rent Regulation Mildner & Salins argue:
Rent regulation does not serve the poor; rent-regulated buildings are not well maintained; newcomers do not benefit from rent regulation. Consider all the available information in the previous two slides and write a paragraph interpreting the results and drawing conclusions about Mildner and Salins' hypotheses.
67
Testing the addition of an Independent Variable
The rent regulation model includes a measure of maintenance deficiencies Rent regulated buildings, however, are older Perhaps we should control for building age?
68
Testing the addition of an Independent Variable
Tests the log-likelihood statistic from the first (nested) model against the log-likelihood statistic from the second model. The second model includes all the independent variables in the first, plus the additional independent variable(s) we wish to test.
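In standard notation (symbols mine, not from the slide), this nested likelihood-ratio test is:

```latex
LR = -2\left[\ln L(M_{\text{nested}}) - \ln L(M_{\text{full}})\right] \sim \chi^2_{q},
\quad q = \text{number of added variables}
```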
69
Testing the addition of an Independent Variable
Nodefic: number of deficiencies Built: decade building was built. Higher number means older building
70
Testing the addition of an Independent Variable
71
Testing the addition of an Independent Variable
Nodefic: number of deficiencies Built: decade building was built. Higher number means older building
72
Testing the addition of an Independent Variable using Nestreg
Wald chi-square: Df: 1. Probability of observing the Wald chi-square statistic due to chance: Should we add the age of the building to our model?
73
Do Minorities Experience Mortgage Discrimination?
Dependent variable: Mortgage application approved = 1, Mortgage application denied = 0 Mortgage applications in Suffolk County Independent variables: size of loan, year of application, controlling for income, gender, loan amount and neighborhood characteristics Use predicted probabilities to explore relationship between racial composition of neighborhood and probability of loan approval
75
marginsplot, title("Predicted Probability of Approving Loan by Race", size(medium)) ///
    subtitle("Controlling for income, loan amount, year, sex, tract % minority & income", size(medium)) ///
    ytitle("Probability of Approval") ///
    xtitle("Race/Ethnicity", size(small)) ///
    caption("Source: HMDA data") ///
    legend(size(vsmall))
76
margins, at(MinPopPerc = (0 (5) 100))
77
marginsplot, title("Predicted Probability of Approving Loan by Minority % in Tract", size(medium)) ///
    subtitle("Controlling for income, loan amount, year, sex, tract % minority & income", size(medium)) ///
    ytitle("Probability of Approval") ///
    xtitle("Minority % in Tract", size(small)) ///
    caption("Source: HMDA data") ///
    legend(size(vsmall))