Logistic Regression PLAN 6930

Categorical Dependent Variables. OLS regression assumes that the dependent variable is interval. But suppose we want to know the determinants of teenage smoking (dep. var.: smokes, doesn't smoke), of someone having lung cancer (has lung cancer, doesn't have lung cancer), or of homeownership (owner or renter). These are all non-interval dependent variables.

STATA EXAMPLES. Data: NYCHVS, a triennial survey of approximately 18,000 units, roughly 15,000 of them occupied. Dependent variable: homeownership. Independent variables: age, race, marital status, income.

HOMEOWNER
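
The residual diagnostics on the next few slides assume an OLS regression has already been fit. A minimal sketch, with variable names assumed apart from homeowner:

regress homeowner age income black hispanic asian other married immigrant
predict homeowner_resid, residual
summarize homeowner_resid, detail   // reports the skewness and kurtosis discussed below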

Categorical Dependent Variables. OLS is inappropriate for categorical dependent variables for the following reasons: the errors are not normally distributed. In OLS regression we assume the errors to be normally distributed, but they cannot be when Y takes on only two values, 0 or 1; the errors instead follow a binomial distribution. This means we cannot use the usual inference to test whether the slope equals 0. So how could we determine how certain we are that the slope is really different from 0? Recall that this may not be a problem with a very large sample, because of the Central Limit Theorem.

The Problem with OLS. The variances of the error term are no longer homoscedastic. In the presence of heteroscedasticity we again cannot use Sb (the standard error of the slope) for statistical inference, and in this case the Central Limit Theorem does not save us either. We cannot appropriately test the hypothesis that b = 0.

Error Term. predict homeowner_resid, residual. A kurtosis of 3 and a skewness of 0 indicate a normal distribution.

Error Term. histogram homeowner_resid, normal

The Problem with OLS. We may get nonsensical values for the predicted ŷ. If we predict Y for a categorical dependent variable, we are in effect calculating the conditional probability of Y for a given value of X. Example: we are predicting the probability of a secretary being fired for given values of typing speed, and we get ŷ = -1.19 + .0248x. But if we substitute the typing speed of someone in the sample who types 94 words per minute into the regression equation, we get ŷ = -1.19 + .0248(94) = 1.14. A probability, however, must range between 0 and 1, and setting values above 1 to 1 and below 0 to 0 is arbitrary.

Predicted Probability of Being a Homeowner. predict homeowner_hat

The Problem with OLS R2 is of dubious validity when the dependent variable is categorical. Recall that R2 is a measure of how well the regression line fits the scatterplot. But with a categorical dependent variable all of the data fall on one of two lines representing 0 or 1. Generally no regression line will fit this data very well.

Probability Models. When we have a categorical dependent variable we use probability models: logistic (binary, ordinal, or multinomial), probit, and Poisson for count data (e.g., the number of times a student raises their hand).

Logistic Regression. With a binary (two-outcome) dependent variable we wish to know the probability or likelihood of one outcome occurring. Using OLS, however, can lead to probabilities outside the 0-1 range. To circumvent this problem statisticians suggest using the odds of Y, which have no upper bound, and then taking the natural logarithm of the odds, ln(odds), the logit of Y. We use ln(odds) so that we have a linear relationship between the dependent and independent variables.

The Linear Probability Model
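
The equation for this slide did not survive the transcript; the standard linear probability model it names is:

y = b0 + b1x + e, with E(y|x) = P(y = 1|x) = b0 + b1x

so the fitted value ŷ is read directly as a probability, which is why it can stray outside the 0-1 range.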

Transformation
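
The equations behind this slide were lost in the transcript; the standard transformation sequence is:

odds(y) = p/(1 - p)
logit(y) = ln(odds(y)) = ln(p/(1 - p)) = b0 + b1x

and, solving back for the probability,

p = e^(b0 + b1x)/(1 + e^(b0 + b1x))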

Properties of the logit. P ranges from 0 to 1; the logit ranges from -∞ to +∞. We now have a linear relationship between x and y. The slope measures the change in the logit for a unit change in X.

Logistic Regression. The logit is preferable for analyzing dichotomous variables. The probability, the odds, and the logit are all functions of one another. To go from logit(y) to the odds, we exponentiate the logit: odds(y) = e^logit(y). To go to a probability, p(y) = odds(y)/(1 + odds(y)).
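
A quick worked example in Stata, using a made-up logit value of 0.5:

display exp(0.5)                    // odds = e^logit ≈ 1.65
display exp(0.5)/(1 + exp(0.5))     // p = odds/(1 + odds) ≈ .62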

Estimating the Log-likelihood function. For a given set of X values we observe y taking on the value 0 or 1. We use maximum likelihood to estimate the function that best describes the relationship between y and the X(s).

Estimating the Log-likelihood function. Maximum likelihood is an iterative process for estimating the function that best describes the relationship between y and the X(s): start with an estimate and repeat until the next iteration no longer improves the fit.

Estimating the Log-likelihood function. Example: suppose we observe the Fahrenheit-Celsius pairs 32 – 0, 40 – 4, 60 – 16, 70 – 21. We would try various iterations until we arrived at Tc = (5/9)*(Tf-32). The computer uses an algorithm to solve the likelihood function in the same way.
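
In Stata this iterative search is automatic: fitting the model prints one "Iteration n: log likelihood = ..." line per step until the log likelihood stops improving. A minimal sketch, with assumed variable names:

logit homeowner age income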

Interpreting Logistic Regression. The constant: similar to the intercept. The regression coefficients: like slopes in OLS. Statistical significance of the coefficients: is there a relationship between the independent and dependent variables? Statistical significance of the entire model: akin to the F-test in OLS regression. Accuracy of the model: akin to R2 for OLS.

Constant. The constant simply tells us the logit of y when all the independent variables equal 0.

Interpreting the Coefficients. The logit is not easily interpretable. There are three ways of interpreting coefficients: odds ratios, predicted probabilities, and marginal change. We will focus on the first two.

Interpreting Logistic Coefficients. Dependent variable: homeowner, yes = 1, no = 0. Independent variables: age, income (in $10,000s), black, Hispanic, Asian, other, married, immigrant.
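
The odds-ratio output summarized below could be produced with a command along these lines (a sketch; variable names assumed, income in $10,000s). logistic reports odds ratios directly, while logit reports raw coefficients:

logistic homeowner age income black hispanic asian other married immigrant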

Logistic Regression

Interpreting Logistic Coefficients. Odds ratios. Income (continuous variable): 1.08. An increment of $10,000 in household income is associated with an 8% increase in the odds of being a homeowner, when other variables are held constant. Hispanic (dummy variable): .32. The odds of a Hispanic household being a homeowner are 68% lower than (or 32% as great as) those of whites, the reference category, when other variables are held constant.

Interpreting Logistic Coefficients: Prediction
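
Predicted probabilities like those on the following slides can be obtained with margins after the model is fit; a sketch with assumed variable names:

logit homeowner age income black hispanic asian other married immigrant
margins, at(hispanic=(0 1)) atmeans   // P(owner) with hispanic off/on, other variables at their means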

Interpreting Logistic Coefficients. Predicted probability, Hispanic (dummy variable): Hispanic, 19%; white, 38%. Whites are predicted to have a much higher homeownership rate, even after holding other variables constant.

Interpreting Logistic Coefficients. Predicted probabilities of homeownership, income (continuous variable): $4,200, 20%; $40,000, 26%; $166,000, 50%. The probability of owning one's home increases as income increases, even after holding other variables constant.
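
A sketch of how the income predictions above might be generated (income measured in $10,000s, so $4,200 = .42):

margins, at(income=(.42 4 16.6)) atmeans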

Interpreting Logistic Coefficients. In Major League Baseball the teams with the best records enter a playoff tournament, the winners of which play for the championship in the World Series. Team makes the playoffs = 1, did not make the playoffs = 0: a binary outcome.

Interpreting Logistic Coefficients. Dependent variable: team makes the playoffs. Independent variables: team earned run average (range 3.02-5.22), team on-base percentage (range .292-.362), strikeouts by the team's pitchers (range 905-1,387), home runs hit by the team (range 91-257).

Descriptive Statistics—Used for Prediction

Interpreting Logistic Coefficients. Dependent variable: team makes the playoffs. Interpret the odds ratios; write a sentence each for on-base percentage (obp2) and earned run average (era).

Interpreting Logistic Coefficients. Dependent variable: team makes the playoffs. Predicted probabilities of making the playoffs: write a sentence interpreting the result. If you are a baseball team owner, which seems more important for making the playoffs: a team with a high on-base percentage or a team with a lot of home runs?

On Base Percentage

Home Runs
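
Curves like the two just referenced can be drawn with margins and marginsplot; a sketch using the obp2 range from the descriptive statistics (on-base percentage times 100):

margins, at(obp2=(29.2(0.5)36.2))
marginsplot, ytitle("Probability of Making the Playoffs")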

Statistical Significance of Coefficient. The null hypothesis is that the independent variable has no effect, i.e., that the odds ratio is 1. The sampling distribution of a logistic regression coefficient is approximately normal. For income, the probability of obtaining the parameter estimate we did if the true parameter were 0 was .000; for Hispanic it was likewise .000. If we use an alpha of .05, do we accept or reject the null hypothesis?

Statistical Significance of Coefficient. Interpret the statistical significance of the coefficients in the following table using our baseball model. obp2: on-base percentage multiplied by 100. hr: home runs hit per team. so: strikeouts thrown by the team's pitchers. era: earned run average of the team's pitchers.

Statistical Significance of Entire Model. The log-likelihood ratio compares how accurate our predictions are without the independent variables to how accurate they are with them. The likelihood-ratio statistic has a chi-square distribution, with df equal to the number of independent variables. If the predictions are much better with the independent variables and/or the sample size is large, you are likely to get a statistically significant result.
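
The statistic itself is twice the gap between the two log likelihoods:

LR chi-square = 2*(LL of the model with the independent variables - LL of the model without them)

with degrees of freedom equal to the number of independent variables.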

Statistical Significance of Entire Model We return to our model predicting homeownership

Statistical Significance of Entire Model. Assuming an alpha of .05, what does the probability (listed below) of observing a likelihood ratio of 2588.27 tell us?

Statistical Significance of Entire Model. The probability is .0000, meaning we can reject the null hypothesis that the model performs no better than a model without independent variables.

Statistical Significance of Entire Model. The precise interpretation is “the probability of observing a likelihood ratio of 2588.27 with 8 degrees of freedom, if the null hypothesis were true, is .0000.” The probability is not actually 0 but is very small.

Accuracy of model. How well does the model fit the data? This is analogous to R2, but there is no agreed-upon measure of goodness of fit for probability models. One method compares the log likelihood of the model with the independent variables to that of the model without them: the “pseudo R-square,” reported by default in Stata.

Accuracy of model. How well does the model fit the data? Analogous to R2; again, there is no agreed-upon measure of goodness of fit for probability models. Another measure predicts the category of the dependent variable from each individual's characteristics (e.g., a 20-year-old male), compares the prediction to the actual category, and looks at the proportion of correct predictions.

Accuracy of model. Adjusted count R2. Intuition: with a binary outcome we can try to predict the outcome directly. Example: in our homeownership data 32% are owners, so if we simply predict that everyone is a renter we will be correct 68% of the time. Does our model allow us to improve on that predictive accuracy?

Accuracy of Model. (Classification table for homeowner: the naïve predictions vs. the model's; the model's overall accuracy is 70%.)

Accuracy of Model. Adjusted count R2: n is the count of the most frequent outcome. The adjusted count R-squared measures the proportion of correct predictions beyond the naïve prediction.
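
Written out (the standard formulation, e.g. in Long & Freese), with n the count of the most frequent outcome and N the sample size:

adjusted count R2 = (number of correct predictions - n)/(N - n)

Using the figures from the slides above, if the naïve guess is right 68% of the time and the model is right 70% of the time, the adjusted count R2 is (.70 - .68)/(1 - .68) ≈ .06: the model eliminates about 6% of the errors the naïve prediction makes.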

Accuracy of Model. Model accuracy and fit statistics can be obtained with “fitstat” in Stata (you may have to install it with “findit fitstat” and then click on the installation files).
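
A sketch of the workflow (model specification assumed):

findit fitstat   // locate and install the user-written package, if needed
logit homeowner age income black hispanic asian other married immigrant
fitstat          // prints McFadden's R2, count R2, adjusted count R2, and more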

Accuracy of Model

Accuracy of Model Interpret the adjusted count R2 below from our baseball model

Accuracy of model. McFadden's “pseudo” R2: Mfull = the model with predictors, Mintercept = the model without predictors. Note: there are several “pseudo” R2 measures; this is the default produced by Stata.
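
The formula lost from this slide is standard:

McFadden's R2 = 1 - ln L(Mfull)/ln L(Mintercept)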

Accuracy of model. McFadden's “pseudo” R2 starts at 0 but cannot reach 1, and is generally lower than the R2 found in OLS regression. Problem: what is an intuitive explanation of an improvement in the ratio of the log likelihoods?

Tjur's D, the Coefficient of Discrimination. For each of the two categories of the dependent variable, calculate the mean of the predicted probabilities of an event; then take the difference between those two means. If a model makes good predictions, the cases with events should have high predicted values and the cases without events should have low predicted values. Tjur, T. (2009) “Coefficients of determination in logistic regression models—A new proposal: The coefficient of discrimination.” The American Statistician 63: 366-372.
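
A minimal sketch of computing Tjur's D in Stata after a logit (variable names assumed):

logit homeowner age income
predict phat, pr
summarize phat if homeowner == 1, meanonly
scalar mean1 = r(mean)
summarize phat if homeowner == 0, meanonly
scalar mean0 = r(mean)
display "Tjur's D = " mean1 - mean0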

Accuracy of model. Suggested strategy for presenting model accuracy for logistic regression: present a pseudo R2, along with an explanation of which statistic you are presenting, and present either the adjusted count R2 or the coefficient of discrimination (Tjur's D).

Rent Regulation Model

Nodefic: number of deficiencies (e.g., holes in the wall)

Rent Regulation. Mildner & Salins argue: rent regulation does not serve the poor; rent-regulated buildings are not well maintained; newcomers don't benefit from rent regulation. Consider all the available information in the previous two slides and write a paragraph interpreting the results and drawing conclusions about Mildner and Salins' hypotheses.

Testing the addition of an Independent Variable. The rent regulation model includes a measure of maintenance deficiencies. Rent-regulated buildings, however, are older. Perhaps we should control for building age?

Testing the addition of an Independent Variable. This tests the log-likelihood statistic from the first (nested) model against the log-likelihood statistic from the second model. The second model includes all the same independent variables as the first, plus the additional independent variable(s) we wish to test.

Nodefic: number of deficiencies. Built: decade the building was built; a higher number means an older building.
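
A hedged sketch of this nested-model test with lrtest; the dependent variable (here called regulated) and the control variables are assumed:

logit regulated nodefic income
estimates store base
logit regulated nodefic income built
estimates store withage
lrtest base withage   // tests whether adding built improves the model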

Testing the addition of an Independent Variable

Testing the addition of an Independent Variable using nestreg. Wald chi-square: 243.89; df: 1; probability of observing this Wald chi-square statistic by chance: .0000. Should we add the age of the building to our model?
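
nestreg automates the same comparison, fitting the blocks in sequence and testing each addition (a sketch; names assumed as before):

nestreg: logit regulated (nodefic income) (built)

By default nestreg reports Wald tests, which matches the Wald chi-square shown above.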

Do Minorities Experience Mortgage Discrimination? Dependent variable: mortgage application approved = 1, mortgage application denied = 0. Data: mortgage applications in Suffolk County, 2010-2014. Independent variables: race/ethnicity and year of application, controlling for income, gender, loan amount, and neighborhood characteristics. Use predicted probabilities to explore the relationship between the racial composition of the neighborhood and the probability of loan approval.

marginsplot, title("Predicted Probability of Approving Loan by Race", size(medium)) ///
    subtitle("Controlling for income, loan amount, year, sex, tract % minority&Income", size(medium)) ///
    ytitle("Probability of Approval") ///
    xtitle("Race/Ethnicity", size(small)) ///
    caption("Source: 2010-2014 HMDA data") ///
    legend(size(vsmall))

margins, at(MinPopPerc = (0 (5) 100))
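
The at() option steps the tract minority percentage from 0 to 100 in increments of 5; margins then averages the predicted approval probability over the other covariates as observed.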

marginsplot, title("Predicted Probability of Approving Loan by Minority % in Tract", size(medium)) ///
    subtitle("Controlling for income, loan amount, year, sex, tract % minority&Income", size(medium)) ///
    ytitle("Probability of Approval") ///
    xtitle("Race/Ethnicity", size(small)) ///
    caption("Source: 2010-2014 HMDA data") ///
    legend(size(vsmall))