1
Limited Dependent Variables
Nicholas Charron, Associate Professor, Department of Business & Politics, QoG Institute, GU
Required reading: Long, Scott. ”Regression Models for Categorical and Limited Dependent Variables”. Sage Publications.
2
Outline
- Today (30-Jan): Overview of logit & probit regression using a dichotomous dependent variable, interpretation of results. Long, Chapters 1 & 3
- 1-Feb: Limited and categorical dependent variables: ordered & multinomial logit. STATA exercise. Long, Chapters 5 & 6
- 6-Feb: Count outcome variables and censored variables: Poisson, negative binomial models, Tobit, Heckman selection models. Long, Chapters 7 & 8
3
Learning objectives for this part of the course
- To understand the basic purpose and idea behind regression with binary and limited outcome variables & why OLS is inappropriate
- To understand how to set up a model (logit, ologit, mlogit, Poisson, nbreg, etc.), run estimates and interpret results with odds ratios, predicted probabilities and marginal effects using STATA
- To be able to evaluate overall model strength and compare the relative explanatory strength of different model specifications
- To understand some of the potential problems with logistic estimation and be able to diagnose and evaluate such issues
- To be able to clearly present your results for readers outside of your research field in several different ways with tables & visuals
4
Why use Logistic regression?
OLS is great, but often it’s not appropriate for our data. There are many important research topics for which the dependent variable is "limited."
- Binary/dichotomous logistic (or probit) regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote).
- The same applies when the dependent variable is “limited” (non-continuous) or takes on only a few values. We cover this next class.
Very important: the choice to use a logistic model is determined by the dependent variable, not by any independent variables in your model!
5
Some Common examples in Social Sciences
1. Political science / IR:
-Why do some individuals vote for a certain candidate or party?
-Why, and under what circumstances, do two countries go to war with one another?
2. Economics / marketing:
-Why a firm enters a marketplace
-Models explaining why individuals choose to buy a certain product (e.g. Coke instead of Pepsi)
3. Criminology:
-Models explaining why people commit a crime or not
4. Sociology:
-Models on upper secondary school / university graduation
-Marital status
-Health issues (having a disease or not)
6
Why can’t we just use OLS??
Just like OLS linear regression, logistic regression allows us to make linear predictions about the impact of the IVs on the DV, BUT:
The problem: in OLS regression, 𝑌 = 𝛼 + 𝛽𝑋 + 𝜀, the error term 𝜀 is assumed to be uncorrelated with X (exogeneity), to have constant variance at all levels of X (homoskedasticity), and to be normally distributed. Here, however, Y takes only the values 0 and 1.
This does not necessarily bias the coefficients (+ or -), but it will bias (underestimate) the SEs, which results in misleading hypothesis testing, and it produces wrong estimates of the MAGNITUDE of the effect of X on Y, especially at large/small values of X.
Underestimated SEs lead to Type I error:
- Type I: incorrectly rejecting H0 in favor of Ha
- Type II: failing to reject H0 when Ha is true
But, in the early stages of quantitative work, most people just used…
7
1. Initial model: the Linear Probability Model (LPM)
Tries to literally fit a linear estimation (OLS) to a binary outcome: $P_i = \Pr(Y = 1 \mid X) = E(Y \mid X)$
*This is just like running OLS with a binary/limited DV*
8
1. Initial model: the Linear Probability Model (LPM)
- Assumes that the effect of ’X’ on ’Y’ is constant (linear), and thus has the same problems associated with the OLS estimates of the previous slides, plus more issues.
- Places no restrictions on the IVs (they can be dummy, continuous, etc.), just like other binary models (logit, probit).
- The LPM is not used in many contemporary quantitative studies (published ones, anyway…).
For example…
9
Simple Example: 2016 US Election – ’Trump vote’ and Voter Income
number votetrump income 1 110 2 50 3 70 4 80 5 20 6 30 7 60 8 100 9 90 10 35 11 12 65 13 40 14 15 16 17 75 18 160 19 25
Take the votes and incomes of 20 US voters.
DV = vote Trump (1/0)
IV = yearly income (in thousands $)
How to test this relationship?
- Using linear OLS (LPM)
- ’Reverse’ the IV and DV and do a simple t-test of means (no controls)
- Using probit/logit
10
Effect of Income on Vote for Trump
What do we observe? On average, the higher the income, the more likely a voter is to vote for Trump. Running OLS on this DV is what Long calls the ’LPM’, & it gives us Pr(Y=1).
Model interpretation: what is the probability for someone with a $5k yearly income? Someone with a $130k income?
For example, with the intercept α included: at 10k, (α + 0.011*10) = -0.08, or a -8% probability; at 110k, (α + 0.011*110) = 1.02, or a 102% probability.
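The out-of-range predictions can be reproduced with a few lines of arithmetic. This is a small Python sketch rather than Stata output: the slope 0.011 comes from the slide, while the intercept of -0.19 is an assumption, chosen because it is the value that reproduces both printed results (-0.08 and 1.02).

```python
# Linear probability model (LPM): the fitted "probability" is a straight line.
# BETA is the slope shown on the slide.
# ALPHA is an ASSUMPTION inferred from the slide's two results; it is not printed there.
ALPHA = -0.19
BETA = 0.011


def lpm_prediction(income_k):
    """Fitted LPM value for an income given in thousands of dollars."""
    return ALPHA + BETA * income_k


for income in (10, 65, 110):
    fitted = lpm_prediction(income)
    in_range = 0.0 <= fitted <= 1.0
    print(f"income={income}k  fitted Pr(Y=1)={fitted:.2f}  valid probability: {in_range}")
```

Nothing in the linear form stops the fitted value from leaving the 0 to 1 interval, which is the core complaint about the LPM.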
11
Calculating probabilities of Xi in our LPM
For 5k, we get: (α + 0.0086*5), or a -1.5% Pr(Y=1).
For 130k, we get: (α + 0.0086*130) = 1.07, or a 107% Pr(Y=1).
What do we make of these predictions?!?
12
Let’s plot the residuals (Y-yhat) against PR(Y=1) from OLS ’rvfplot’
Why does the plot of residuals versus fitted values (i.e. e versus yhat) look the way it does? Dots on the lower line are people who did NOT vote for Trump; the line above is Trump vote = 1.
13
’yhat’ = predicted value of Yi from our model
Our residuals look like this because, when Y takes only the values 0 and 1:
e = -yhat (when Yi=0), or
e = 1 - yhat (when Yi=1)
Our residual plot has a lower line (for Yi=0 cases) & an upper line (for Yi=1 cases), & both lines slope downward as yhat goes up.
So, if yhat = .2 for example, our ’e’ value can ONLY be -.2 or .8. This seems neither ’normally distributed’ nor independent of X.
So if Y is (0,1), we are left with several problems if using OLS:
1a. ‘e’ for a binomial outcome is thus HETEROSKEDASTIC. In OLS, under homoskedasticity, when we plot y-yhat (residuals) over yhat, what SHOULD it look like? Our simple example shows this is NOT the case (efficiency & bias).
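A short Python sketch of the two-valued residual. It assumes the fitted value yhat equals the true Pr(Y=1), in which case the error variance works out to yhat*(1-yhat), so it changes with yhat. The function name is made up for illustration.

```python
def residuals_and_variance(yhat):
    """Return the two possible residuals for a 0/1 outcome and the implied
    error variance, assuming yhat is the true Pr(Y=1)."""
    e_if_zero = 0 - yhat   # residual on the lower line (Yi = 0)
    e_if_one = 1 - yhat    # residual on the upper line (Yi = 1)
    # Var(e) = (1-yhat)*e_if_zero^2 + yhat*e_if_one^2, which simplifies to yhat*(1-yhat)
    variance = (1 - yhat) * e_if_zero ** 2 + yhat * e_if_one ** 2
    return e_if_zero, e_if_one, variance


for yhat in (0.1, 0.2, 0.5, 0.8):
    low, high, var = residuals_and_variance(yhat)
    print(f"yhat={yhat:.1f}  e in {{{low:+.1f}, {high:+.1f}}}  Var(e)={var:.2f}")
```

The variance peaks at yhat = 0.5 and shrinks toward 0 and 1, so it cannot be constant across X unless X has no effect.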
14
Key problems
1b. OLS assumes $Var(\varepsilon_i) = \sigma^2_{\varepsilon}$, i.e. that ALL values of X have the same error variance; it doesn’t matter if Xi values are high or low.
- Result: errors are not NORMALLY DISTRIBUTED. Where Y is (0,1), errors can only take on 2 values, a clear violation (efficiency).
2. Exogeneity: $E(\varepsilon \mid X) = 0$. For dichotomous DVs, this is ONLY the case if Pr(Y=1) = 0.5. Otherwise our regressors are ENDOGENOUS (i.e. the Xs are correlated with the error term). Thus error terms are incorrect & hypothesis testing pointless.
3. Linearity of betas: we want to know Pr(Y=1), which ranges between 0 and 1. OLS cannot constrain the predicted values, thus we get unrealistic predictions & specify the wrong FUNCTIONAL FORM of X. Plus, is it fair to say the marginal effect of X is constant? (bias)
15
Line clearly doesn’t fit very well
3. For example, can we really expect that the effect of a 10k increase in income on voting for Trump is the same between 60k and 70k as it is between 10k and 20k, or between 2 million and 2,010,000? If going from a 50% Pr(Trump) to a 65% Pr(Trump) between 60k and 70k means that an increase of 10k leads to a 15-point greater probability, then what do we say about the millionaire who has a 99% probability and gets 10k more in income? 114%? Nope.
16
How to ’fix’ this - Link Functions
These ’link’ the actual ’Y’ values to the DV in our statistical models. We take what is called a ”link function” F(Y), which takes Y and makes it Y′: $F(Y) = Y'_i = \beta X + \varepsilon$
These can be logged (as we discussed in OLS) or even square-root DVs in OLS (which transforms the ’real Y’ to a ’logged Y’). But in our case, we want to go from just estimating Y (0,1) to Y as odds or probabilities.
17
Link functions cont.
We would need to transform our dichotomous DV (Y) into a (somewhat) continuous DV, for example, the log of the odds: $Y' \in (-\infty, \infty)$
For dichotomous DVs we need to find a function F(Y) that goes from (0,1), that is normally distributed, and that predicts that as Xi increases, Pr(Y=1) increases (or decreases).
For starters, statisticians discovered that we could use the probability density function (PDF, e.g. the normal bell curve) from which we draw hypothesis tests with Z-scores, etc. If you test the significance of any beta in your OLS model you get a p-value (which corresponds to a Z-score) that ranges from 0 to 1. We simply take the inverse of this (called the ’cumulative distribution function’, CDF), which is also normally distributed.
A link function is the function that links the linear model specified in the design matrix, where columns represent the beta parameters and rows the real parameters.
18
Key concept: Cumulative distribution function (CDF)
Just like with a normal probability density function (PDF), we want to know, given the value of Xi, what Yi is in terms of Pr(Y=1). For this, we use the same logic as with a standardized bell curve (z-scores): a value of -1 implies about a 16% Pr(Y=1), a value of 2 implies a 97.7% Pr(Y=1), etc.
BUT, the effect is NON-linear: little effect on the DV at low/high values, and a strong effect in the middle of the distribution. So the normal PDF is transformed into the CDF to better capture this effect, given that x has a significant effect on Pr(Y=1).
Just like in estimating your DV in OLS, you desire the distribution of Yi to be ’normally distributed’, i.e. à la the PDF. Because of the range of the probability that DV=1, the distribution of probit (and more or less logit) is the inverse of the regular CDF. So we must take the distribution of the PDF (normal bell curve) and use the inverse of this distribution to make the CDF.
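The z-score examples can be checked with Python's standard library, whose `statistics.NormalDist` class provides the standard normal CDF. This is a sketch for intuition only, not part of the Stata workflow; the helper name is made up.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def prob_from_z(z):
    """Area to the left of z under the standard normal curve."""
    return std_normal.cdf(z)


# The two values quoted on the slide
print(f"z=-1 -> {prob_from_z(-1):.3f}")
print(f"z= 2 -> {prob_from_z(2):.3f}")

# Non-linearity: the same 0.5 step in z moves the probability a lot
# in the middle of the distribution and very little in the tails.
for start in (-3.0, 0.0, 3.0):
    change = prob_from_z(start + 0.5) - prob_from_z(start)
    print(f"z from {start:+.1f} to {start + 0.5:+.1f}: change in Pr = {change:.4f}")
```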
19
The Probit Model
20
Leads us to an Alternative: The Probit Model
We see from the CDF that the effect of Xi on Pr(Y=1) is non-linear, but probit wants to estimate it like a linear model (Long Ch. 3). So we adjust for this with:
$\Pr(Y=1 \mid x) = F(x\beta) = \Phi(x\beta)$
- We thus impose the standard normal CDF $\Phi$ on $x\beta$ (i.e. our link function).
- The function gives us the PROBABILITY AREA in the standard normal distribution (like in a bell curve) of Pr(Y=1|Xi).
- The error term is assumed (like in OLS) to have a mean of 0, and a variance of 1.
- So, probit produces z-scores that you can look up, just like with standardized variables, or with hypothesis testing in OLS.
But, to make the model ’linear-like’ and fit the probit distribution, we start with:
$\Pr(Y=1 \mid x) = \Phi(\beta_k X_{ik})$
AND THEN we apply the inverse of $\Phi$ (the CDF):
$\Phi^{-1}(\Pr(Y=1 \mid x)) = \beta_k X_{ik}$
Coefficients are NOT probabilities, but are scaled as the inverse of the standard normal distribution. They are easy to convert into probabilities, though.
Differences in estimates are minimal; the choice of model (logit or probit) is basically up to the researcher. Our ’link function’ is $\Phi^{-1}$. Political science mainly uses logit, while economics mainly uses probit.
21
ex. predicting Y=1 at values of Xi
Instead of OLS, we just run: probit votetrump income in STATA
22
ex. predicting Y=1 at values of Xi
Instead of OLS, we just run ’probit votetrump income’ in STATA.
We get α = −2.31 and β(income) = 0.038.
Let’s say we want to know Pr(voteTrump) for someone with an 80k income. Thus $\Phi^{-1}(\Pr(Trump=1))$ = (−2.31 + 0.038*80) = 0.73, and Φ(0.73) = 0.76, or a 76% likelihood of voting for Trump given an income of 80k.
That was really cool, Nicholas, how did you do that??
1. Look it up on a z-score table, OR
2. In Stata, type: ’display normal(.73)’ (back to this later…)
The number that results from plugging in Xi is already the ’transformed’ figure (i.e. on the scale of the inverse of the CDF): $\Phi^{-1}(p_i)$ = the number on the x-axis of the PDF, which we need to transform into a probability (0 to 1).
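The same lookup can be sketched outside Stata. The Python snippet below uses the two coefficients printed on the slide (−2.31 and 0.038) and the standard library's normal CDF in place of `display normal()`; the function names are made up. Because the printed coefficients are rounded, the result at 80k comes out near 0.77 rather than exactly the slide's 76%.

```python
from statistics import NormalDist

ALPHA = -2.31   # probit constant from the slide
BETA = 0.038    # probit coefficient on income (in thousands) from the slide


def probit_index(income_k):
    """Linear index alpha + beta*income, i.e. the z-score."""
    return ALPHA + BETA * income_k


def probit_prob(income_k):
    """Pr(Y=1): area to the left of the index under the standard normal curve."""
    return NormalDist().cdf(probit_index(income_k))


for income in (65, 80, 160):
    print(f"income={income}k  z={probit_index(income):.2f}  Pr(vote Trump)={probit_prob(income):.4f}")
```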
23
For income levels of 160k, we get Φ(3.77), or 1 − 0.0001 = 99.99%
The p-value gives us how much area in the normal distribution the Xi score covers; that in turn becomes our Pr(Y=1)! So, probit just takes the area to the left of the Xi score. This is distributed normally, and the probability of Y=1 is the AREA covered in a PDF. At 65k, Φ(0.16) for example = 55%; for income levels of 160k, we get Φ(3.77) = 99.99%.
24
Or put in terms of the CDF – this is EXACTLY the same area
Or put in terms of the CDF: this is EXACTLY the same area! (in Stata: ’twoway (connected hat income)’)
25
The Logit Model
26
3rd option: Logit Models
Another nonlinear regression model that forces the output (predicted values) to be between 0 and 1.
Like probit, logit models estimate the probability that your dependent variable is 1 (Y=1|Xi). This is the probability that some event happens, given a certain level of X. You can always reverse this if you want (Pr(Y=0)).
$\text{logit}[\Pr(Y_i=1)] = \log\left(\frac{p}{1-p}\right) = \alpha + \beta X + \epsilon$
- p is the probability that the event Y occurs, p(Y=1)
- p/(1-p) is the odds
- ln[p/(1-p)] is the log odds, or "logit" (which is what is different from probit, which uses Φ)
- Mean(𝜖) = 0, Var(𝜖) = π²/3, whereas for probit they were 0 and 1
Because this is common in political science (and has some extra interpretation advantages, more later) we will focus on logit from here on out, but the results are substantively indistinguishable from probit.
Logit, LP or probit (LP & probit can be tougher to interpret, & are not covered here: they cannot report ‘odds ratios’)
27
Like probit with the CDF (Φ), we need a formula for the logistic transformation (our ‘link function’):
$\log\left(\frac{\Pi}{1-\Pi}\right) = \alpha + \beta X + \epsilon$
- Π/(1−Π) is the odds. As the probability increases (from zero to 1), the odds increase from 0 to infinity. Odds CANNOT be negative.
- The log of the odds ranges from −infinity to +infinity.
- So if β is ‘large’, then as X increases the log of the odds will increase steeply. The steepness of the curve will therefore increase as β gets larger.
28
Odds vs. probability What is the difference??
Really, they express the same thing: the chance that a given outcome will occur.
Simple difference:
- Probability = # of times the event occurred / total number of tries or observations
- Odds = the probability an event will occur / (1 − the probability an event will occur)
*For example, say we want to know Pr(graduate) and we observe that out of 100 students, 80 did and 20 did not.
*The probability of a student graduating from our sample is thus 80/100 = .80 or 80%.
*The odds of a student graduating are .80/.20 = 4/1.
- A probability of 0 = odds of 0, and a probability of 0.5 = odds of 1.00 (’even money’)
- All Pr<0.5 range between 0 and 1 for odds
- All Pr>0.5 range between 1 and ∞
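The graduation example translates directly into code. A minimal Python sketch (the function names are made up for illustration):

```python
import math


def odds_from_prob(p):
    """Odds = p / (1 - p); defined for 0 <= p < 1."""
    return p / (1.0 - p)


def prob_from_odds(odds):
    """Inverse mapping: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


graduated, total = 80, 100
p_graduate = graduated / total
print(f"probability = {p_graduate:.2f}")
print(f"odds        = {odds_from_prob(p_graduate):.2f} to 1")

# Probabilities below 0.5 give odds below 1, probabilities above 0.5 give odds above 1;
# the log of the odds is symmetric around 0.
for p in (0.2, 0.5, 0.8):
    o = odds_from_prob(p)
    print(f"p={p:.1f}  odds={o:.2f}  log-odds={math.log(o):+.3f}")
```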
29
Logit Model Cont.
In comparison to the linear probability model (LPM) estimates, the logistic distribution constrains the estimated probabilities to be between 0 and 1. The estimated probability is defined as (Long p. 49):
$\Pr(Y_i=1 \mid X_i) = 1/[1 + \exp(-\alpha - \beta X_i)]$
- as α + βX increases, p approaches 1
- as α + βX decreases, p approaches 0
- if α + βX = 0, then p = .50
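The three properties of the logistic formula can be verified numerically. A minimal Python sketch, where `eta` stands for the linear index α + βX (the name is an illustration, not notation from Long):

```python
import math


def logistic(eta):
    """Pr(Y=1) for a linear index eta = alpha + beta*X."""
    return 1.0 / (1.0 + math.exp(-eta))


for eta in (-10, -2, 0, 2, 10):
    print(f"alpha + beta*X = {eta:+3d}  ->  p = {logistic(eta):.4f}")
```

However large or small the index gets, the output stays strictly between 0 and 1, which is what the LPM could not guarantee.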
30
Logit vs. Normal curve (e.g. probit): see p. 43 in Long (1997)
- The standard logistic curve is flatter than the normal (probit) distribution since it has a slightly larger variance for the error term (Var(𝜖) = π²/3).
- Logit and probit will ALWAYS have the same signs for the βs (given the same model).
- Coefficients will thus be about 1.7x greater for logit than for probit in the same model.
- Probit will have slightly higher probabilities for Xi around the mean, but logit greater at more extreme values.
Why always the same signs? They meet at the zero mark in exactly the same place on the x-axis.
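How close the two curves are once the 1.7 scaling is applied can be checked numerically. This Python sketch compares the standard normal CDF at z with the logistic CDF at 1.7*z over a grid; the grid and the variable names are illustrative choices.

```python
import math
from statistics import NormalDist

normal_cdf = NormalDist().cdf
SCALE = 1.7  # rule-of-thumb ratio of logit to probit coefficients


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


grid = [i / 10 for i in range(-40, 41)]  # z from -4.0 to 4.0
max_gap = max(abs(normal_cdf(z) - logistic_cdf(SCALE * z)) for z in grid)

print(f"both curves at z=0: {normal_cdf(0):.2f} and {logistic_cdf(0):.2f}")
print(f"largest gap in predicted probability on the grid: {max_gap:.4f}")
```

The gap stays around one percentage point, which is why the choice between logit and probit rarely changes substantive conclusions.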
31
Model fit: OLS vs Logit
So that’s what we want to do, but how do we do it? With OLS we tried to minimize the squares of the residuals (which is why it’s called “least squares”) to get the best-fitting line for each IV regressed onto Y. When the DV is binary, there are only 2 values & the errors won’t be normally distributed. Thus the ‘least squares’ technique does not really seem logical. So instead, for logit and probit, we use something called maximum likelihood to estimate what β and α are.
32
Fitting Logit models.. What’s going on here?
Maximum likelihood (ML) is an iterative process that estimates the best-fitting equation (see Long pp. 25-33).
Iterative? This just means that STATA tries lots of models until we get to a situation where alternative values do not improve the ‘fit’ of the model given our constraints (e.g. the IVs that are in our model).
The ML process is pretty complicated, although very intuitive. The basic idea is that we find the coefficient values that make the observed data most likely (more on that in a bit).
In either case, the coefficients produced (while direction & significance are interesting) for both logit and probit are essentially meaningless, i.e. they are not marginal effects like in OLS. We saw this with the probit example, where we had to look up the score on a Z-score chart.
***Logit requires a somewhat larger sample. Rule of thumb: 20 obs per IV.
Without a statistical program like STATA, this process would be absurdly time consuming.
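To make "iterative" concrete, here is a minimal Python sketch of maximum likelihood for a one-predictor logit using Newton-Raphson steps. The eight data points are hypothetical toy data, not the voter data from the slides, and this is not a description of Stata's internal optimizer; it only illustrates the idea of climbing the log likelihood until it stops improving.

```python
import math

# Hypothetical toy data: x is a predictor, y a 0/1 outcome (not from the slides).
X = [1, 2, 3, 4, 5, 6, 7, 8]
Y = [0, 0, 1, 0, 1, 0, 1, 1]


def log_likelihood(a, b):
    """Logit log likelihood for intercept a and slope b."""
    return sum(y * (a + b * x) - math.log(1 + math.exp(a + b * x)) for x, y in zip(X, Y))


def newton_step(a, b):
    """One Newton-Raphson update of (a, b) using the score and the 2x2 information matrix."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(X, Y):
        p = 1 / (1 + math.exp(-(a + b * x)))
        w = p * (1 - p)
        g0 += y - p            # score with respect to a
        g1 += (y - p) * x      # score with respect to b
        h00 += w
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01
    return a + (h11 * g0 - h01 * g1) / det, b + (h00 * g1 - h01 * g0) / det


a, b = 0.0, 0.0
history = [log_likelihood(a, b)]   # "iteration 0": starting values
for iteration in range(1, 26):
    a, b = newton_step(a, b)
    history.append(log_likelihood(a, b))
    print(f"iteration {iteration}: log likelihood = {history[-1]:.6f}")
    if abs(history[-1] - history[-2]) < 1e-10:   # converged: no further improvement
        break

print(f"estimates: alpha = {a:.3f}, beta = {b:.3f}")
```

The printed log likelihood rises quickly and then flattens, which mirrors the iteration log Stata shows above a logit table.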
33
Logit Regression output: let’s compare & interpret
Logit regression shows that the impact of income on Pr(voteTrump) is positive & significant. BUT we cannot interpret the beta coefficients like in OLS due to our link function; we need to take one of three more ’interpretable’ numbers:
- Predicted probabilities (logit and probit)
- Odds ratios (only logit)
- Marginal effects (logit and probit)
What does the beta produced tell us? Why can’t we really interpret this? Because we’ve imposed our link function (log of the odds) on our equation; the error variance is NOT DISTRIBUTED like the pdf (e.g. with a mean of 0 and s.d. of 1, like in probit).
34
The Effect of Income on Voting for Trump: Predicted Probabilities
Simple interpretation: what is the probability that a voter with an income of 5k, 65k or 130k voted for Trump?
I. Calculate using the formula $\Pr(Y=1) = 1/[1 + \exp(-\alpha - \beta \cdot income)]$, e.g. for 65k: 1/(1 + exp(−(α + 65*0.64))) = 57.1%
Or…
II. Based on the logit regression estimates, we can produce the predicted probability Pr(Y=1) for each voter using the post-estimation STATA command: ’predict y_hat’
Like probit, the range is between 0 and 1. Also, notice that the probability at 80k is pretty much identical to the one we calculated with probit.
35
’twoway (connected y_hat income)’
Or visually… You can see from the scatterplot that when income is 65k per year, Pr(voteTrump) = 57.1%. STATA command: ’twoway (connected y_hat income)’
38
The Effect of Income on Voting for Trump: ODDS RATIOS
Remember, since: ln[Pr(vote Trump)/(1 − Pr(vote Trump))] = α + β·income + e
we interpret the slope on income as the rate of change in the "log odds" as income changes by one unit… anyone know what that means?
So, like taking the predicted probability, we can also take the “odds ratio”.
Remember, probability = 1/[1 + exp(−α − β·income)]
The marginal effect of a change in income on the probability is: Δp/Δincome = f(β·income) + e
Since the exponential function is the inverse of the log, the odds are just p/(1−p) = exp(α + βX), and the odds ratio for a one-unit increase in X is exp(β).
In STATA, we can get the odds ratios just by running: ’logit DV IVs, or’
**NB: the model produces no constant & z-scores
What is meant by the ’unit’ of X??
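One way to see what the odds ratio means: in a logit model, each one-unit increase in X multiplies the odds by the same factor, exp(β), no matter where X starts, even though the change in probability differs. The Python sketch below uses hypothetical coefficients chosen only for illustration; they are not estimates from the slides.

```python
import math

ALPHA = -3.0   # hypothetical constant, for illustration only
BETA = 0.064   # hypothetical logit coefficient on income, for illustration only


def odds(income):
    """Odds of Y=1 implied by the logit model: exp(alpha + beta*X)."""
    return math.exp(ALPHA + BETA * income)


def prob(income):
    return odds(income) / (1 + odds(income))


odds_ratio = math.exp(BETA)
print(f"exp(beta) = {odds_ratio:.4f}")

for income in (10, 65, 130):
    ratio = odds(income + 1) / odds(income)
    change_in_p = prob(income + 1) - prob(income)
    print(f"income {income}->{income + 1}: odds multiply by {ratio:.4f}, "
          f"probability changes by {change_in_p:.4f}")
```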
39
cont.
Odds ratios above 1 imply a positive effect, & below 1 a negative effect. Also, a value of 2 is of equal strength as 0.5, and the same goes for 4 and 0.25, etc. (Long p. 82)
Let’s take another example, using a multivariate analysis, adding gender.
Interpretation:
- Basically, if OR < 1, subtract the value from 1 (1 − OR = 0.212, i.e. 21.2%): ”holding income constant, the odds of voting for Trump decrease by 21.2% for women compared with men” (p is insignificant)
- If OR > 1, subtract ’1’ from the OR value (for income, OR − 1 = 0.065): ”the odds increase by 6.5% (or 0.065) for each increase in 10k”
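The subtraction rule is the same calculation in both directions: the percent change in the odds is (OR − 1) × 100. In the Python sketch below, the odds ratios 1.065 and 0.788 are the values implied by the slide's percentages (6.5% and 21.2%); they are an inference, not figures printed on the slide.

```python
import math


def pct_change_in_odds(odds_ratio):
    """Percent change in the odds for a one-unit increase in the IV."""
    return (odds_ratio - 1.0) * 100.0


# Odds ratios implied by the percentages on the slide (assumed, not printed there)
for label, oratio in (("income (per 10k)", 1.065), ("female", 0.788)):
    print(f"{label}: OR={oratio}  ->  {pct_change_in_odds(oratio):+.1f}% change in odds")

# Equal strength in opposite directions: 2 vs 0.5 and 4 vs 0.25 are the same
# distance from 1 on the log scale.
for up, down in ((2.0, 0.5), (4.0, 0.25)):
    print(f"log({up}) = {math.log(up):+.3f}, log({down}) = {math.log(down):+.3f}")
```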
40
More Interpretation: marginal effects & interactions (pps61-82)
Ok, let’s use a bigger dataset and multivariate analysis: the example data on explaining cases of diabetes in the US.
webuse nhanes2f, clear
Here we are simply interested in knowing whether certain demographic factors (age, gender, race) increase the likelihood of diabetes.
Let’s first describe our data. Download this & follow along. What do we see? (follow with the do-file ’logit interp with margins’)
It is not very often that you will be running just 1 or 2 variables in your model with 16 cases. Let’s try a few examples with a large cross-sectional dataset with multiple variables.
Data from the US on diabetes (Y); we have 4 explanatory variables: age, gender, race & region (one of 4). You can do ’tab region’ to see the regional distribution.
41
Our baseline model: logit diabetes black female age
Ok, we see that getting older (age), being black and being female (almost) significantly increase the odds of having diabetes.
Can we make these results more ”tangible”? Yes we can.
For this we can use our odds ratios, predicted probabilities, etc., & we also want the marginal effects of a certain IV, given a certain level of another IV in the model.
For example, the odds ratio for ’black’ is 2.05, but this is calculated at the ’average’ level of the other variables (e.g. the mean for female & 47 for age).
42
So, in other words, We need more!
Here we will continue using the margins command.
Some things you might want to interpret:
- Predict the probability of the DV with all IVs at their means
- Predict the probability of the DV at various levels of an IV, given all others at their mean or at certain levels (specific predictions)
- The quadratic effect of an IV (e.g. squared variables)
- Interaction effects
- Marginal effects, e.g. show changes in Y as X changes
43
1. Predict the DV when all IV’s at their means
Very simple: what is the probability of having diabetes for a person who has the ’mean value’ of all IVs (simultaneously)?
We can say that the average person has a 3.16% chance of having diabetes, but is this super useful? Maybe a little…
To use ’margins’, we need to specify the type of IV: ’i’ is ordinal/nominal, ’c’ is continuous.
44
2. Predict the DV with 1 IV at different levels, given all others at their mean
How about finding out what the pr(diabetes) is at different ages, holding gender and race constant? No problem. Here we see that the mean values of female & black are held constant, but ’age’ is taken at 20, 50 and 70. Pr(diabetes) at 20 is only 0.6%, while at 70 it is % - showing a non-linear effect.
45
2b. Predict DV plugging in specific values for all IV’s
What about Pr(Y=1) for a black male at age 60? Or a white female at age 21?
46
3. Predicted effects of a quadratic IV, given all others at their mean
It seems like age is non-linearly related to the DV (e.g., a stronger effect at higher levels).
We could generate a squared term and run a regression to check for non-linearity.
But interpreting this is more difficult, since no one (yet) is 70² years old, thus we will over/underestimate the effects in this case. We don’t want to compute the effects of age and age² separately, but together (STATA needs to be told).
But we can get around this in STATA, again using margins:
47
3. Predicted effects with a quadratic IV, given all others at their mean What is the Pr(diabetes=1) for someone at 20? 50? At 70? So the linear age IV gave us higher Pr(Y=1) values at low and high levels but lower at middle values
48
Use the ’#’ between two variables
4. Interaction Effects
Now what if we want to know whether the effect of age on the DV differs for females (vs. males) and/or by race (or among black/white females/males)?
Now we are talking about INTERACTION EFFECTS, and we need to calculate the MARGINAL EFFECT of one variable, given a certain level (or presence, if a dummy) of another.
2 ways to do this in STATA:
- ”gen ageXfemale = age*female”
- Use the ’#’ between two variables
49
Interaction terms interpretation: quick recap
Why interact two Xs? Because we believe that the impact of βX on Y CHANGES based on different levels of βZ.
- Two continuous IVs: it can be that at low levels of βZ we expect βX to be positive, but at high levels of βZ we expect βX to be negative (or much more positive).
- When one IV is a dummy (gender, etc.): it can be that we expect the marginal effect of a continuous variable βX to have one effect on Y when βZ=0 compared to when βZ=1.
50
cont.
When an interaction term is introduced, the interpretation of the constituent terms is altered.
So if $Y = \alpha + \beta_1 A + \beta_2 Z + \beta_3 (A \cdot Z) + e$, where A = dummy (0,1) and Z = continuous:
- We interpret the individual coefficients ($\beta_1$ and $\beta_2$) only when the other variable = 0
- When both are > or < 0: effect of A on Y = $\beta_1 + \beta_3 Z$; effect of Z on Y = $\beta_2 + \beta_3 A$
Remember, we need all constituent terms in the model, and they cannot be interpreted individually unless the other term = 0.
However, with binary DVs the interpretation is a bit tricky because we have to transform Y just to understand it, so it is always best to just let STATA calculate probabilities or odds ratios.
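The conditional-effect logic can be written out for the linear index. In this Python sketch the coefficients are hypothetical values picked for illustration; A is a dummy and Z is continuous, as on the slide. With a binary DV these are effects on the log-odds scale, not on the probability.

```python
# Hypothetical coefficients for Y = ALPHA + B1*A + B2*Z + B3*(A*Z); illustration only.
ALPHA, B1, B2, B3 = -2.0, 0.50, 0.04, -0.02


def linear_index(a, z):
    return ALPHA + B1 * a + B2 * z + B3 * a * z


def effect_of_a(z):
    """Effect of switching the dummy A from 0 to 1, at a given level of Z."""
    return B1 + B3 * z


def effect_of_z(a):
    """Effect of a one-unit increase in Z, for a given value of the dummy A."""
    return B2 + B3 * a


print(f"effect of Z when A=0: {effect_of_z(0):+.3f}  (just B2)")
print(f"effect of Z when A=1: {effect_of_z(1):+.3f}  (B2 + B3)")
for z in (0, 10, 40):
    print(f"effect of A at Z={z:2d}: {effect_of_a(z):+.3f}")
```

B1 on its own describes the effect of A only at Z = 0, which may not even be a value that occurs in the data.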
51
Interaction with age and female
Interaction with age and female? With ’margins’, we don’t even need to generate this. Yes, the interaction term is significant. What does that mean?
52
Looking at gender effects over meaningful age values (sum)
margins, at (age=( ) female=(0 1))
marginsplot
53
4a. Multiple dummy variables
Let’s say you want to have a general idea about how age relates to diabetes, but you don’t want to use exact years (this could also be for GDP p.c. or income, for example). Maybe compare broad age groups: 20-39, and 60+ for example. You can create ’age group dummy variables’ in STATA (and then combine them) or just create an ordinal variable like this:
54
Cont.
Holding gender and race constant, we find that the probability of someone under 40 having diabetes is 0.8%, while for someone over 60 it is about 9%.
55
5. Marginal Effects
“A ME [marginal effect], or partial effect, most often measures the effect on the conditional mean of Y of a change in one of the regressors, say Xk. In the linear regression model, the ME equals the relevant slope coefficient, greatly simplifying analysis. For nonlinear models, this is no longer the case, leading to remarkably many different methods for calculating MEs.” (Cameron and Trivedi 2004: 333)
There are several ways to show that Xi’s effect on Pr(Y=1) depends NOT just on Xi itself, but also on where the other IVs in our model are, and on the level of X:
1. We can see the effect of Xi assuming that all others are held at their means, or absolute means, or something…
2. Another way is to let the other IVs vary and see Xi’s effect across a range of values for the other IVs.
Until now, we have just reported Pr(Y=1) given certain levels of Xi; now let’s focus on what happens when there is a CHANGE in Xi.
- Very common in economics and in good-quality political science publications
- More easily interpretable with OLS (continuous DVs), but there are more ways of calculating them in logit/probit
56
Cont.
1. For categorical (dummy) IVs, the ME shows how Pr(Y=1) changes as Xi changes from 0 to 1 (male to female, white to black, etc.), or:
$ME(X_i) = \Pr(Y=1 \mid X, X_i = 1) - \Pr(Y=1 \mid X, X_i = 0)$
2. For continuous IVs, the ME shows how Pr(Y=1) changes as Xi increases by ONE UNIT (age would be years, for example); ranges from 0-1:
$ME(X_i) = \lim_{\Delta \to 0} [\Pr(Y=1 \mid X, X_i + \Delta) - \Pr(Y=1 \mid X, X_i)] / \Delta$
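Both definitions can be sketched for a logit model. In the Python snippet below the coefficients are hypothetical (not the nhanes2f estimates); for a continuous IV the limit works out to β·p·(1−p), which the code checks against a small finite difference.

```python
import math

# Hypothetical logit coefficients, for illustration only.
ALPHA, B_AGE, B_FEMALE = -6.0, 0.06, 0.3


def prob(age, female):
    """Pr(Y=1) from the logit model."""
    return 1.0 / (1.0 + math.exp(-(ALPHA + B_AGE * age + B_FEMALE * female)))


def me_dummy(age):
    """Discrete change: Pr(Y=1 | female=1) - Pr(Y=1 | female=0) at a given age."""
    return prob(age, 1) - prob(age, 0)


def me_continuous(age, female):
    """Instantaneous effect of age: beta * p * (1 - p)."""
    p = prob(age, female)
    return B_AGE * p * (1.0 - p)


def me_finite_difference(age, female, delta=1e-6):
    """Numerical version of the limit definition."""
    return (prob(age + delta, female) - prob(age, female)) / delta


for age in (20, 50, 70):
    print(f"age {age}: ME of female = {me_dummy(age):.4f}, "
          f"ME of age (female=0) = {me_continuous(age, 0):.5f}")
```

Both marginal effects grow with age here even though the coefficients are fixed, which is why a single number rarely summarizes an effect in logit/probit.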
57
1. Marginal Effects at the Means (MEM) The marginal effect of gender: at mean age
The probability of having diabetes for a 47.5-year-old man is 2.5%, while for a woman of the same age it is 3.7%. Thus there is a 38.7% difference between females and males in having diabetes at age 47.5.
58
2. Average Marginal Effects (AME)
Essentially, AMEs test the effect of ’being black’ and ’being white’ across the whole sample, at all levels of the other IVs, and compare Pr(Y=1) between the two samples. We are not bound by the ’mean of age or female’, for example.
In effect, you are comparing two hypothetical populations, one all white and one all black, that have the exact same values on the other independent variables in the model. Since the only difference between these two populations is their race, race must be the cause of the differences in their likelihood of diabetes: about a 4% increase in having diabetes for blacks in comparison with whites.
You use all of the data, but it may be a bit ‘unrealistic’: is treating men as though they are women, and women as though they are men, really a better way of computing marginal effects?
This is essentially like running an OLS model and getting a coefficient for black (the betas will be pretty close as well).
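The difference between an effect at the means (MEM) and an average marginal effect (AME) can be sketched as follows. The Python snippet uses hypothetical logit coefficients and a six-person hypothetical sample of ages; it is not the nhanes2f data and only illustrates why the two numbers differ.

```python
import math

# Hypothetical logit coefficients and sample, for illustration only.
ALPHA, B_AGE, B_BLACK = -6.0, 0.06, 0.7
AGES = [20, 30, 40, 50, 60, 70]


def prob(age, black):
    return 1.0 / (1.0 + math.exp(-(ALPHA + B_AGE * age + B_BLACK * black)))


def effect_of_black(age):
    """Pr(Y=1 | black=1) - Pr(Y=1 | black=0) for a person of the given age."""
    return prob(age, 1) - prob(age, 0)


# MEM: evaluate the effect once, for a person at the mean age.
mean_age = sum(AGES) / len(AGES)
mem = effect_of_black(mean_age)

# AME: evaluate the effect for every person at their own age, then average.
ame = sum(effect_of_black(age) for age in AGES) / len(AGES)

print(f"MEM (at mean age {mean_age:.0f}): {mem:.4f}")
print(f"AME (averaged over the sample): {ame:.4f}")
```

Because the logistic curve is non-linear, averaging the individual effects is not the same as evaluating the effect at the average person.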
59
3. Marginal Effects at Representative Values (MERV)
- Much better for non-linear effects of the IVs & interactions (next)
- Show a range of values for Xi and show how Pr(Y=1) changes for Zi at these values
So, the MEM for black = 2.9%, while the AME = 4%; we see that each is very dependent on age: black people at 20 are 0.06% more likely, while at 70, 8.8%.
The MERV is a bit more time consuming, but I would argue it is the best to report: most informative.
In a large model, it may be cumbersome to specify representative values for every variable, but you can do so for those of greatest interest.
60
We can also produce this without c.i.s by typing: marginsplot, noci
To show this, however, it’s probably easier to GRAPH the results. In STATA 12, we just do the following to show discrete changes (see do-file).
Here we are interested in the marginal effects relative to ’0’ to test H0.
We can also produce this without confidence intervals by typing: marginsplot, noci
Remember, this is the probability INCREASE of having diabetes in going from male to female, or non-black to black, at various ages, i.e. the ’discrete change’.
61
interaction with female*black, but without the discrete change effects..
What if we want to know the interaction between black & women at different ages? No problem. We can specify this in margins.
Remember, include both the interaction term & the component terms in the logit regression.
**As Berry et al (2010) argue, even if the product term is insignificant, always check the marginal effects for binary models.
Now we are simply reporting the probability of Y=1 given certain characteristics.
62
By the way, running Probit gives us basically the exact same marginal effect..
63
use http://www.ats.ucla.edu/stat/stata/dae/binary.dta, clear
We are interested in admission into graduate school (0/1) based on 3 factors: 1. GRE (Graduate Record Exam scores), 2. GPA (grade point average) and 3. prestige of the undergraduate institution (1-4, lower = higher rank).
- Describe & sum the variables to find out what you have
- Run a baseline logit model with all 3 IVs
- ODDS: what is the change in odds of admission with a 1-unit increase in each IV (GRE, GPA and ‘rank’), holding the others constant?
- Predict the probability of admission for a student with:
  - All variables held at their means
  - A GRE of 700, a GPA of 3.8 and from a high-prestige university (rank=1)
  - A GRE of 200, a GPA of 1.5 and from a low-prestige university (rank=4)
  - A GPA of 1, 2.5 and 4 at mean values of GRE and rank
- Marginal effects: the marginal effect of GPA on Pr(admissions=1) at mean levels of GRE & rank
- Interaction effects: test an interaction model including GRE*GPA. Show a figure with the marginal effect of GRE over select values of GPA. What do you see?
- Total effects: which of the 3 IVs has the greatest total effect on Pr(admissions=1)?
64
Evaluating the Model & Diagnostics
65
Post-Regression: Evaluating the Logit Model
Going past the interpretation of individual βs and taking a look at the model as a whole. There are several things one can do; here are 3 pretty basic tests:
- Model χ²
- ”Pseudo-R²”
- % of ”correct” predicted outcomes in the model
66
1. Judging Model χ² (pps. 87-98 in Long)
LR and Wald χ² tests
A simple formula applies: $LR = -2[\ln L(\alpha) - \ln L(\alpha, \beta)]$
-2*( )=
Put another way, the model χ² test compares the 1st LL with the last one, accounting for degrees of freedom (d.o.f.) in the model (in our case = 4).
Overall model fit H0: the collective variables in the model explain 𝜎² in the DV = 0. **Can we reject?
“Model likelihood ratio” (LR): the first iteration (called iteration 0) is the log likelihood of the "null" or "empty" model; that is, a model with no predictors. At the next iteration, the predictor(s) are included in the model. At each iteration, the log likelihood increases because the goal is to maximize the log likelihood. When the difference between successive iterations is very small, the model is said to have "converged", the iterating is stopped and the results are displayed.
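The LR statistic is simple arithmetic on the first and last log likelihoods. In the Python sketch below the two log-likelihood values are hypothetical placeholders (the slide's own numbers are not recoverable); the 5% critical values of the χ² distribution for 1 to 4 degrees of freedom are standard table values.

```python
# Hypothetical log likelihoods, for illustration only (not from the slides).
LL_NULL = -1999.0    # iteration 0: model with only the constant
LL_MODEL = -1811.0   # final iteration: model with the predictors

# Standard chi-square critical values at the 5% level, by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def likelihood_ratio(ll_null, ll_model):
    """LR = -2 * [lnL(alpha) - lnL(alpha, beta)]."""
    return -2.0 * (ll_null - ll_model)


def reject_null(lr_stat, dof):
    """Reject H0 (the predictors jointly explain nothing) at the 5% level?"""
    return lr_stat > CHI2_CRITICAL_05[dof]


lr = likelihood_ratio(LL_NULL, LL_MODEL)
print(f"LR chi2 = {lr:.1f}")
print(f"reject H0 at 5% with 4 d.o.f.: {reject_null(lr, 4)}")
```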
67
2. Pseudo R²
Remember, ’real’ R² in OLS is very intuitive and can be used to compare across models with different samples, observations and variables. $R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$ where N = # of obs, Y = DV, Y-bar = mean of Y, Y-hat = values predicted by the model. Numerator = sum of squared differences between actual and predicted values: RSS. Denominator = sum of squared differences between actual values and the mean: TSS.
68
2. Pseudo R²: in OLS, what does the R² tell us?
"the proportion of the total variability of the outcome that is accounted for by the model". Pseudo R² is a little different. Many types: McFadden’s, Cox & Snell, Efron, etc. Important: for logit/probit, these do NOT tell us the proportion of variance explained by a model & cannot be compared across datasets, or across models with differing sample sizes. They are useful in assessing which model (e.g. constellation of IV’s) fits the DV best with the same dataset & sample. McFadden’s, for ex., is: $1 - \frac{-2LL(\alpha,\beta)}{-2LL(\alpha)}$. In all cases, a model with a larger Pseudo R² is preferred; thus a greater gap between the 1st & last LL = good. Other Pseudo R² calculations can be produced with ”fitstat” in STATA. This is the pseudo R-squared. Logistic regression does not have an equivalent to the R-squared that is found in OLS regression; however, many people have tried to come up with one. There is a wide variety of pseudo R-squared statistics. Because these statistics do not mean what R-squared means in OLS regression (the proportion of variance explained by the predictors), they should be interpreted with caution. Although pseudo R-squareds cannot be interpreted independently or compared across datasets, they are valid and useful in evaluating multiple models predicting the same outcome on the same dataset. In other words, a pseudo R-squared statistic without context has little meaning. A pseudo R-squared only has meaning when compared to another pseudo R-squared of the same type, on the same data, predicting the same outcome. In this situation, the higher pseudo R-squared indicates which model better predicts the outcome.
69
Example of post-regression ’fitstat’ – logit reports McFadden’s R² (based on the log likelihood); Efron’s and Count R², for example, compare actual values with predicted values. The way in which R-squared is calculated in OLS regression captures how well the model is doing what it aims to do. Different methods of the pseudo R-squared reflect different interpretations of the aims of the model. In evaluating a model, this is something to keep in mind. For example, Efron's R-squared and the Count R-squared evaluate models according to very different criteria: both examine the residuals--the difference between the outcome values and predicted probabilities--but they treat the residuals very differently. Efron's sums the squared residuals and assesses the model based on this sum. Two observations with small differences in their residuals (say, 0.49 vs. 0.51) will have small differences in their squared residuals, and these predictions are considered similar by Efron's. The Count R-squared, on the other hand, assesses the model based solely on what proportion of the residuals are less than .5 in absolute value. Thus, the two observations with residuals 0.49 and 0.51 are considered very differently: the observation with the residual of 0.49 is considered a "correct" prediction while the observation with the residual of 0.51 is considered an "incorrect" prediction. When comparing two logistic models predicting different outcomes, the intention of the models may not be captured by a single pseudo R-squared, and comparing the models with a single pseudo R-squared may be deceptive.
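To make the contrast between the three measures concrete, here is a minimal Python sketch. The four observations and the log likelihoods are hypothetical, and the function names are illustrative only; in practice ’fitstat’ reports these figures after the model is run.

```python
def mcfadden_r2(ll_null, ll_model):
    """McFadden: 1 - LL(alpha, beta) / LL(alpha), based on the log likelihoods."""
    return 1.0 - ll_model / ll_null

def efron_r2(y, p):
    """Efron: 1 - sum of squared residuals / total sum of squares around the mean."""
    mean_y = sum(y) / len(y)
    rss = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss

def count_r2(y, p, cutoff=0.5):
    """Count: share of observations whose predicted class matches the outcome."""
    correct = sum(1 for yi, pi in zip(y, p) if (pi >= cutoff) == (yi == 1))
    return correct / len(y)

# Hypothetical outcomes and predicted probabilities.
# The first two observations have residuals of 0.49 and 0.51.
y = [1, 1, 0, 0]
p = [0.51, 0.49, 0.20, 0.30]

print(round(mcfadden_r2(-250.0, -229.0), 3))  # hypothetical log likelihoods
print(round(efron_r2(y, p), 4))
print(count_r2(y, p))
```

Efron's treats the first two observations almost identically, while the Count measure scores one as correct and the other as incorrect.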
70
3. % ”correct” probabilities predicted
Remember, in STATA, we can assign ’predicted probabilities’ of the likelihood of our DV occurring after running each model: ”predict yhat”. For how many observations did our model ”correctly predict”? For this, we might assume (like the ’Count R²’) that all cases where Pr(DV=1) ≥ 0.5 equal ’yes’ and those < 0.5 equal ’no’. We assign all ’yes’ predicted outcomes a ’1’ and all ’no’s a ’0’ and compare them with the actual outcomes (e.g. the DV). BUT, we don’t just want correctly predicted ’1’s (called ’sensitivity’), we also want correctly predicted ’0’s (called ’specificity’). The “percent correctly predicted” is somewhat useful, but can be misleading. Suppose the data consist of 100 observations with 95 “successes” (i.e., ones) and 5 “failures” (i.e., zeroes). If the logit/probit model predicts that all observations will be “successes” (ones), that would translate to a 95% correct prediction rate. This sounds impressive, but masks the fact that the model failed to correctly predict a single “failure”! For this reason, it is best to report separate “percent correctly predicted” values for “successes” and “failures,” respectively. In the vocabulary of binary choice models, the “percent correctly predicted” successes (1’s) is called the “Sensitivity,” and the “percent correctly predicted” failures (0’s) is called the “Specificity”.
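The 95/5 example above can be checked with a short Python sketch. The function name and the returned keys are illustrative choices, not STATA output labels.

```python
def classification_rates(y, predicted):
    """Overall % correct, sensitivity (correct 1's) and specificity (correct 0's)."""
    tp = sum(1 for a, b in zip(y, predicted) if a == 1 and b == 1)
    tn = sum(1 for a, b in zip(y, predicted) if a == 0 and b == 0)
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    return {
        "correct": (tp + tn) / len(y),
        "sensitivity": tp / n_pos if n_pos else float("nan"),
        "specificity": tn / n_neg if n_neg else float("nan"),
    }

# The example from the text: 95 successes, 5 failures,
# and a model that predicts "success" for everyone.
y = [1] * 95 + [0] * 5
predicted = [1] * 100
rates = classification_rates(y, predicted)
print(rates)
```

The overall rate is 95%, yet specificity is zero, which is why the two rates are best reported separately.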
71
”estat classification”
In small datasets, we can just compare ’yhat’ with the actual DV, but in large ones, we can use STATA. After estimation, use the command: ”estat classification”. Going back to our 16 US voters, we find that our model predicts Trump voters pretty well: the model successfully predicts ’Trump votes’ 77.78% of the time, and non-Trump votes 81.82% of the time. The overall prediction rate is 80%. D = actual Trump votes, ~D = non-Trump votes, + = predicted Trump votes, - = predicted non-Trump votes. *You might want a different cut-off than 0.5: you can adjust the Pr(D) threshold by going into STATA and Statistics / Binary outcomes / Postestimation / Goodness-of-fit after logistic / logit / probit. ***The lower the % of ‘1’s in the sample, the lower the cut-off you’ll want…***
72
What if ’1’s are more rare?
***The lower (higher) the % of ‘1’s in the sample, the lower (higher) the cut-off you’ll want…*** You can adjust the Pr(D) threshold by going into STATA and: Statistics / Binary outcomes / Postestimation / Goodness-of-fit after logistic / logit / probit
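A small Python sketch of why the cut-off matters when ’1’s are rare. The ten outcomes and predicted probabilities are hypothetical; only the logic of re-classifying at a different threshold comes from the slide.

```python
def rates_at_cutoff(y, p, cutoff):
    """Return (sensitivity, specificity) when Pr >= cutoff is classified as a '1'."""
    predicted = [1 if pi >= cutoff else 0 for pi in p]
    tp = sum(1 for a, b in zip(y, predicted) if a == 1 and b == 1)
    tn = sum(1 for a, b in zip(y, predicted) if a == 0 and b == 0)
    return tp / sum(y), tn / (len(y) - sum(y))

# Hypothetical sample in which only 3 of 10 outcomes are '1'.
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p = [0.55, 0.40, 0.30, 0.35, 0.25, 0.20, 0.15, 0.10, 0.10, 0.05]

for cutoff in (0.5, 0.3):
    sens, spec = rates_at_cutoff(y, p, cutoff)
    print(f"cut-off {cutoff}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

In this made-up sample, lowering the cut-off from 0.5 to 0.3 recovers all of the rare ’1’s at the cost of one misclassified ’0’.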
73
Other structural Issues with our model to look out for..
Logit estimation requires similar diagnostic checks. Common issues in x-sec data for logit:
1. Omitted or irrelevant variable bias
2. Functional form of IV’s
3. Multicollinearity
4. Structural breaks
5. Heteroskedasticity
Remember the classical assumptions of OLS?
1. Regression is linear in parameters (no omitted IV’s, proper form, error term)
2. Error term has zero population mean (E(εi)=0)
3. Error term is not correlated with the X’s, ‘exogeneity’: E(εi|X1i,X2i,…,XNi)=0
4. No serial correlation
5. No heteroskedasticity (e.g. constant variance of the error term)
6. No perfect multicollinearity
and (usually): 7. Error term is normally distributed (efficiency, not bias)
74
1a. Omitted Variable Bias
A serious violation that can lead to biased estimates. How to detect? Reflect on your theory – why does X lead to changes in Y? What else (see the literature on your topic) does as well (e.g. ’standard control IV’s’)? What is happening with this? The ‘real’ model is: Pr(Y)= β0+β1X + β2Z + εi, but we estimate instead: Pr(Y)= β0+β1X+ εi
75
Ex. omitted Variable Bias (OVB)
REMEMBER: Both of these conditions must hold for OVB: I. the omitted variable is a determinant of the dependent variable; and II. the omitted variable is correlated with the/an included IV (e.g., E(εi|X1i,X2i,…,XNi) ≠ 0; anything not modeled is in the error term). Where this holds, b1 estimated without including b2 will be biased, because: $E(b_1^*) = \beta_1 + \beta_2 \frac{\sigma_{12}}{\sigma_1^2}$ *so check correlations for Y and all X’s… [Diagram: DV (vote Trump), ethnicity, IV (income)] Can’t solve all issues of OVB with statistics (social science is super complex), but we can remedy some stuff. This also speaks to your theory..
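The bias formula on this slide can be explored numerically. The sketch below simply evaluates the expression E(b1*) = β1 + β2·σ12/σ1² for hypothetical values; none of the numbers come from the Trump model.

```python
def expected_b1_short(beta1, beta2, cov_x1_x2, var_x1):
    """E(b1*) = beta1 + beta2 * sigma_12 / sigma_1^2 when X2 is left out."""
    return beta1 + beta2 * cov_x1_x2 / var_x1

# Hypothetical values: true beta1 = 0.5, omitted variable effect beta2 = 0.8.
biased = expected_b1_short(0.5, 0.8, cov_x1_x2=0.3, var_x1=1.5)
no_correlation = expected_b1_short(0.5, 0.8, cov_x1_x2=0.0, var_x1=1.5)
no_effect = expected_b1_short(0.5, 0.0, cov_x1_x2=0.3, var_x1=1.5)
print(biased, no_correlation, no_effect)
```

The bias term vanishes as soon as either condition fails: no correlation with the included IV (σ12 = 0) or no effect on the DV (β2 = 0).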
76
Can’t ’solve’ OVB with statistics, but some checks IF we have data on additional variables..
2. After ’common sense’ ideas, you can test for omitted variable bias with a Likelihood Ratio test (LR). We also speak in terms of one model being ”nested” in another model. For ex.: $\Pr(voteTrump) = \alpha + \beta_1\,income + e$ is nested in: $\Pr(voteTrump) = \alpha + \beta_1\,income + \beta_2\,white + e$. Formula for the Likelihood Ratio test: $LR = [-2LL(\text{model 1},\ k-q)] - [-2LL(\text{model 2},\ k)]$ ***where LR is distributed χ² with ’q’ d.o.f., with q ≥ 1 omitted IV’s. You must compare two models (thus have the data) to test whether excluding the ’q’ variables from model 1 makes it fit worse than model 2. One model is considered nested in another if the first model can be generated by imposing restrictions on the parameters of the second. Most often, the restriction is that the parameter is equal to zero. In a regression model, restricting a parameter to zero is accomplished by removing the predictor variable from the model. In other words, model 1 (with C & D but without B) is NESTED within model 2 (with C, D & B), and the LR test asks whether model 1 predicts Y as well as model 2.
77
How do we do this??
You must run 2 models, one with and one without the extra variable. They must have the same sample size. Save the results of each, 1 at a time, using: ”estimates store a” and ”estimates store b”. STATA will save the output; then you use: ”lrtest a b”. The test produces a χ² value & a p-value from your d.o.f. (= q). Let’s test this with our Trump model… -The LR test involves estimating two models and comparing them. Fixing one or more parameters to zero, by removing the variables associated with those parameters from the model, will almost always make the model fit less well, so a change in the log likelihood does not necessarily mean the model with more variables fits significantly better. -The LR test compares the log likelihoods of the two models and tests whether this difference is statistically significant. -If the difference is statistically significant, then the less restrictive model (the one with more variables) is said to fit the data significantly better than the more restrictive model. -The test assumes that model a is nested in model b. -d.o.f. = ‘q’, or the number of ‘removed variables’.
78
Model 1
79
Model 2
80
The LR Test: If the χ² value is significant (e.g. p<0.05), then model 2 has ”significantly more explanatory power” than model 1. In our case, the χ² p-value > 0.05, so we can conclude that by excluding ’race’ from the model, we do not have an omitted variable bias problem. Also look at the change in the β’s. Other tests of omitted variable bias via nested models are: the Wald test and the Lagrange multiplier test.
81
1b. Irrelevant Variable Bias
Well, sort of the opposite of OVB. However, it does not lead to BIASED estimates, but can lead to INEFFICIENT estimates – what is the difference? Remember in model 2 (less restrictive) we included the added variable ”white” – what happened? Significance for $\beta_{income}$ was reduced (but estimates were virtually identical) = inefficient. PARSIMONY: If we can show that a model performs just as well with less, drop the irrelevant variable – in this case, $\beta_{white}$ – to maximize efficiency. However, always check correlation coefficients for all variables.
82
2a. Functional Form of Variables
Similar discussion as in OLS. Errors in functional form can result in biased coefficient estimates and poor model fit. How do we know if we have chosen the wrong form? Theory – do you really predict a linear relationship from X to Y? Look at the scatterplots & Pearson correlations – what does the relationship look like? Try several different forms, for example: quadratic (e.g. squared variable), logged variable, interactions. $\Pr(Y) = \alpha + \beta_1 X + e$ vs. $\Pr(Y) = \alpha + \beta_1 X + \beta_2 X^2 + e$. Like in OLS, you can run an initial regression and do a ‘linktest’. If the squared predicted value (’_hatsq’) is significant, that tells you something is probably mis-specified.
83
For ex., we want to know the relationship between ethnic diversity and conflict (ex data.dta on GUL)
Let’s use Collier & Hoeffler’s model (1998): $\Pr(civilconflict_i) = \alpha + \beta_1\,EthnicDiv_i + \beta_2\,Resources_i + \beta_3\,EconDevelopment_i + \beta_4\,population_i + e_i$ We find that diversity and population lead to more conflict on average. Let’s look at ethnic diversity. Interp? What might be wrong about this functional form?
84
$\Pr(civilconflict_i) = \alpha + \beta_1\,ethnicdiv_i + \beta_2\,ethnicdiv_i^2 + e_i$
Collier and Hoeffler (1998) argue this relationship is actually non-linear! So maybe really homogenous states & really diverse states have less conflict than states with moderate diversity – how to test this?? Right: $\Pr(civilconflict_i) = \alpha + \beta_1\,ethnicdiv_i + \beta_2\,ethnicdiv_i^2 + e_i$, aka a ”quadratic function”. Let’s re-run the model in STATA & compare predicted probabilities. Which model is a better fit?? Let’s test…
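A rough Python sketch of what the quadratic specification implies for predicted probabilities. The coefficients are hypothetical (chosen only to produce an inverted U), and the 0-1 scaling of ethnic diversity is likewise an assumption rather than something taken from the conflict data.

```python
import math

def logistic(z):
    """Inverse logit: maps the linear index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def pr_conflict(div, alpha, b_div, b_div2):
    """Pr(conflict) from a logit with a linear and a squared diversity term."""
    return logistic(alpha + b_div * div + b_div2 * div ** 2)

# Hypothetical coefficients: positive linear term, negative squared term.
alpha, b_div, b_div2 = -3.0, 8.0, -8.0

for div in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"diversity = {div:.2f}: Pr(conflict) = {pr_conflict(div, alpha, b_div, b_div2):.3f}")
```

With a negative squared term, the predicted probability peaks at moderate diversity and falls off toward both extremes, which is the pattern the quadratic model is meant to capture.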
85
LR Test: for ”better functional form”
What do we do? Run model 1 (no $ethnicdiv^2$), then ”estimates store a”. Run model 2 (with $ethnicdiv^2$), then ”estimates store b”. Run the LR test: ”lrtest a b”. Ho: the extra variable does not improve the (more limited) original model. What is our conclusion?? Also look at the model stats. Since the χ² of the LR test is greater than the critical value (e.g. significant), we can say that the nested model with greater restrictions is inferior to the less restrictive model (e.g. with dem2) – in other words, the quadratic relationship fits the DV better.
86
We can do similar checks with other variables as well
We can check whether the removal of insignificant variables affects the overall model fit. Also, let’s say we thought the logged oil variable might be a better fit. We can’t do an LR test here (the two models are not nested), but we can compare the Pseudo R², Beta sig., and cases correctly predicted.
87
2b. Testing for outliers in logit regression
Undetected outliers can lead to very misleading results, especially in smaller samples (<50), but it is ALWAYS good to check. OLS has several residual checks, but logit’s are slightly different. Here’s a nice way to detect them visually (let’s use the conflict data again): Deviance Residual (predict dv, dev); Pregibon’s Leverage (predict p, hat). You can just graph these against observation #’s (gen long obsnr = _n) or Pred. Probs (predict yhat). - Again, observations can be ’residual’, ’leverage’ or ’influence’ outliers.. *let’s look at the visuals.. 1. The deviance residual is another type of residual. It measures the disagreement between the maxima of the observed and the fitted log likelihood functions. Since logistic regression uses the maximum likelihood principle, the goal in logistic regression is to minimize the sum of the squared deviance residuals. Therefore, this residual is parallel to the raw residual in OLS regression, where the goal is to minimize the sum of squared residuals. (Pregibon’s dbeta, which STATA produces with ’predict db, dbeta’, is a separate influence statistic.) 2. Another statistic, sometimes called the hat diagonal since technically it is the diagonal of the hat matrix, measures the leverage of an observation, which basically tests the impact on the model if we were to remove certain obs., with 0 being what we’d like to see. It is also sometimes called the Pregibon leverage & if it is high for any obs., then its removal will probably mean that at least one IV will switch between sig. and non-sig. compared with the original model – this is usually when an observation has a very high value for one of the IV’s. Either case – rule of thumb: > 2 should warrant further checks.
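For intuition, the deviance residual of a single 0/1 observation can be computed by hand. This Python sketch uses the standard formula for ungrouped binary data, d = sign(y − p)·sqrt(−2[y·ln p + (1 − y)·ln(1 − p)]); the three cases are hypothetical and the labels A, B, C are placeholders, not countries from the conflict data.

```python
import math

def deviance_residual(y, p):
    """Signed deviance residual for one 0/1 outcome y with predicted probability p."""
    log_lik = y * math.log(p) + (1 - y) * math.log(1 - p)
    sign = 1.0 if y == 1 else -1.0  # sign of (y - p) for 0 < p < 1
    return sign * math.sqrt(-2.0 * log_lik)

# Hypothetical (label, observed outcome, predicted probability) triples.
cases = [("A", 1, 0.03), ("B", 0, 0.20), ("C", 1, 0.70)]

for label, y, p in cases:
    print(label, round(deviance_residual(y, p), 3))

# Rule of thumb from the slide: |deviance residual| > 2 warrants further checks.
flagged = [label for label, y, p in cases if abs(deviance_residual(y, p)) > 2]
print(flagged)
```

Case A, with an observed 1 but a predicted probability of only 0.03, is the kind of observation the rule of thumb picks out.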
88
Pregibon’s Leverage (range of X)
Cont. Here I now show: Deviance Residual; Pregibon’s Leverage (range of X). What do we see & do now? The further away from the ’0’ line, the more deviance and/or leverage. Rule of thumb: obs. > 2 or < -2 (large sample: 3/-3) for deviance; obs. > 2x (3x in large n) the mean leverage. -The largest deviance is CoD & ISR. -The most leverage is BGD & CHN. What do we see from these plots? We see some observations that are far away from most of the other observations. These are the points that need particular attention. For example, Congo (and, somewhat, Botswana) have a very high Pearson residual and they stand out a bit in terms of deviance residual. The observed outcome (conflict) is 1 but the predicted probability is very, very low (meaning that the model predicts the outcome to be 0). This leads to large residuals. But notice that none of the observations (incl. S. Africa) is that bad in terms of leverage (this means that an observation’s IV levels are not in any ‘extreme range’, but that its observed outcome is different from the predicted outcome). That is to say, by not including this particular observation, our logistic regression estimates won't be too different from those of the model that includes it. Let's list the most outstanding observations based on the graphs. What makes Congo stand out? It has a larger Pregibon value, and we see that it is ranked very high on diversity but has no conflict.
89
Examining some possible outliers: clist for our model variables, then compare with means for our sample..
90
What to do about outliers??
Remember, there are 2 broad types of outliers: normal values with ’opposite predictions’ (Congo) – impacts ’fit statistics’; and extreme values of the IV or DV – impacts ’beta estimates’ (leverage). Again, no ”right” answer here, just be aware of whether they exist and how much effect they have on the estimates, BUT: 1. Check for data error! 2. Create an obs. dummy – for example, in the case of Congo: ’gen CoD = 1 if ccode==180’, then ’replace CoD=0 if CoD==.’ *Take out the country, re-run the model & see if there are any differences; run ’lfit’ and compare χ² stats. Report any differences… 3. New functional form (log, standardize). 4. Do nothing, leave them in… What do we want to do with these observations? It really depends. Sometimes, we may be able to go back and correct a data entry error. Sometimes we may have to exclude them. Regression diagnostics can help us to find these problems, but they don't tell us exactly what to do about them.
91
In our case: the model looks pretty similar, although the quadratic term is slightly weaker. We’ve now predicted 85% instead of 84% correctly, however (no surprise…)
92
3. Multicollinearity among IV’s
Multicollinearity impacts a model’s efficiency, but does not BIAS the estimates. What does this mean? Standard errors of the coefficients are ’inflated’. The overall model explains more, but individual coefficients don’t seem ’significant’. Detection methods: check Pearson coefficients among IV’s – any greater than 0.60? Higher pseudo R², but no ’sig. Beta’s’. VIF.
93
”solutions” to a multicollinearity problem
In the previous ex. (with OIL and logOIL), we cannot include both in the model, because they predict each other ”perfectly”. 1. Run a VIF test (variance inflation factor) as a post-test. In logit, we do this with a slightly different command than in OLS (where it is ’estat vif’): first ’findit collin’, then ’collin Y X Z’. We’ll see the VIF and the ’tolerance’ (basically the inverse of the VIF). 2. If two IV’s are highly correlated, try dropping the ’least important’ one and re-running the model. 3. If the variables are more or less trying to capture the same underlying latent concept, try to create a single indicator using Principal Component Analysis (PCA) or Factor Analysis (FA). VIF – the square root of the variance inflation factor tells you how much larger the standard error is, compared with what it would be if that variable were uncorrelated with the other predictor variables in the model. Example: if the variance inflation factor of a predictor variable were 5.27 (√5.27 = 2.3), this means that the standard error for the coefficient of that predictor variable is 2.3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.
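The VIF arithmetic from the example can be reproduced in Python. The helper names are illustrative; `vif_from_r2` uses the standard definition VIF = 1/(1 − R²), where R² comes from regressing one IV on the others, which is not spelled out on the slide.

```python
import math

def vif_from_r2(r2_aux):
    """Standard definition: VIF = 1 / (1 - R^2 of the IV regressed on the other IVs)."""
    return 1.0 / (1.0 - r2_aux)

def se_inflation(vif):
    """How many times larger the standard error is: sqrt(VIF)."""
    return math.sqrt(vif)

def tolerance(vif):
    """Tolerance is the inverse of the VIF."""
    return 1.0 / vif

vif = 5.27  # the value used in the example above
print(round(se_inflation(vif), 1))  # standard error inflation factor
print(round(tolerance(vif), 3))     # tolerance
```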
94
VIF test with logit model
95
4. Structural Breaks: Common if you have any type of spatially diverse sample. More on this during multi-level modeling, but basically, you might have 2+ ’sub-samples’ where the IV’s have a different effect on the DV. With just 2 sub-samples (for ex. individual level data from two regions, schools, etc.) you can run a model for each sub-sample and do a Chow test to see if they are in fact different. If you find differences, you can use dummy variables to control for this. Can lead to heteroskedasticity if not addressed…
96
For ex. Going back to our diabetes data, I might believe that the effects of gender and race might not matter so much on having diabetes in the non- South, as compared with the Southern states (for many, many reasons..) Maybe one region (ex. the South)is ’driving’ the results, and biasing our outlook on the overall relationship. Let’s test this… Ho: coefficients are NOT different across models…
97
5. Heteroskedasticity: a different problem than one finds in OLS
Very common in x-sec data with different sub-groups (clusters of countries, individuals, etc.). You can get consistent Beta’s in OLS even with heteroskedasticity, but in logit it can cause biased estimates. Thus in OLS, the ’robust’ option will help correct for this by providing standard errors that remain valid. But in logit, you will be ”correcting” the standard errors around inconsistent (e.g. biased) estimates, so it is pretty pointless. There are some (but not great) techniques to estimate around it. Best to spot it and ”model it”.
98
Some more literature on this topic:
1. Williams, R. Using heterogeneous choice models to compare logit and probit coefficients across groups. Sociological Methods & Research 37:
2. Allison, Paul. “Comparing Logit and Probit Coefficients Across Groups.” Sociological Methods and Research 28(2):
3. Hauser, Robert M. and Megan Andrew. “Another Look at the Stratification of Educational Transitions: The Logistic Response Model with Partial Proportionality Constraints.” Sociological Methodology 36(1), 1–26.
4. Long, J. Scott and Jeremy Freese. Regression Models for Categorical Dependent Variables Using Stata, Second Edition. College Station, Texas: Stata Press.
99
How to write up your results? My suggestions
Include descriptive statistics (obs, mean, s.d., min & max) in the appendix. Be very clear that your DV is binary (0/1) and NOT continuous, and thus you will use logit (or probit). Logit (or probit/LP) are very standard in the literature. You don’t need to spell out the formula or even defend ’why’ it is better than OLS. For ex.: ”the DV is equal to ’1’ for countries in my sample that have had a civil conflict in any year from , and ’0’ if otherwise.” ”the DV equals ’1’ if the individual voted for Romney in 2012 and ’0’ if otherwise – the logistic regression model is used in this analysis to estimate the factors that impact (civil conflict) voting behavior.”
100
Cont. Set up a table (in excel)
Include the DV name & ’logistic regression’; coefficient estimates (or odds ratios), t or z-stats, the overall model χ² & number of observations. If presenting 2+ models, include the Pseudo R² & the 1st/last log likelihood iteration so you can compare performance.
101
logit hiqual c.meals c.yr_rnd
use See do file ’logit exercise day2’ if you need help.. 1. Describe, and then estimate a high quality school (hiqual) on year-round status (yr_rnd) & % free meals (meals). (Sum these first to see what types of variables they are.) logit hiqual c.meals c.yr_rnd 2. Model specification: do a linktest – what do you see? Is hat2 sig.? What could you do if so? 3. Test for omitted variable bias – % teachers with full accreditation (full). Do an LR test between the two models & compare model statistics – what do you see? 4. Test for multicollinearity for the better of the 2 models – what do you find? 5. Test for outliers – after the regression, produce the deviance and Pregibon leverage variables, as well as yhat. Show scatterplots with each over yhat: ’scatter dv yhat, mlabel(snum)’ and ’scatter p yhat, mlabel(snum)’. -List the observations you find to be outliers – what makes them so? What would you do about them?
102
Part II: Models with discrete dependent variables with 3+ outcomes
103
Topic 2: Ordered and Multinomial Logit/ Probit
Ordered Logit – what is this for?? -an extension of the logistic regression model for binary responses -used when your DV has multiple, ordered categories. For ex.: bond ratings (AAA, AA, A, etc.), grades (MVG, VG, G, etc.), opinion surveys (strongly agree, agree, disagree, strongly disagree), some type of continuous outcome you might want to collapse, such as spending or ’performance’ (high, medium, low), employment (fully, partially, unemployed). We have talked about the analysis of dependent variables that have only two possible values, e.g. lives or dies, wins or loses, gets an A or doesn’t get an A. Of course, many dependent variables of interest will have more than two possible categories. These categories might be unordered (doesn’t move, moves South, moves East) or ordered (high, medium, low; favors more immigration, thinks the level of immigration is about right, favors less immigration). We will briefly discuss techniques for handling each of these. For ordered outcomes, we could treat the categories as independent (non-ordered), but then we lose efficiency in our models, as we need additional parameters (e.g. lose d.f.) to estimate the model.
104
Assumptions of Ordered Logit Models
Maximum likelihood estimation – again, no ’sum of squares’ estimation – this uses an iterative process that converges the model’s log likelihood in comparison to an ’empty model’ (Iteration 0) Number of ordered responses <6. After the DV takes on 6+ values, the model can be run using OLS if distance between categories equal (no ’exact’ cut-off, this is a rule of thumb..)
105
Assumptions of the Ordered logit model (’ologit’ in STATA)
Proportional odds assumption (aka parallel regression): the β’s for one outcome group (low bond rating countries) are the same as for any other group (medium or high bond rating states) – an assumption that increases the efficiency of our estimates. -NOTE, we do NOT need to assume the distance between each interval in Y is the same! (as we would if using OLS) -we start with an observed, ordinal variable (Y) -as in most models of estimation, Y is a function of a latent, unobserved variable Y* -the variable Y* has ”threshold points” (’M’) – the value of Y depends on whether an observation has crossed these thresholds. If Y has 3 groups, then there are 2 cut-offs: $Y_i = 1$ if $Y_i^* \le M_1$; $Y_i = 2$ if $M_1 < Y_i^* \le M_2$; $Y_i = 3$ if $Y_i^* > M_2$. Ex. If we have 3 ordered categories (C=3), M1 and M2 are the ’cut-offs’. Maybe we impose these cut-offs willingly (levels of economic development – low, medium & high = less than 5000$ pc, between , and greater than 15000$ pc) or they are more abstract (’strongly trust, somewhat trust, do not trust’). Basically Y is a ’collapsed’ version of Y* that can take on any number of categories depending on the data and the choice of the researcher.
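The threshold logic can be sketched in Python as a simple lookup. The cut-offs 5000 and 15000 echo the economic development example on the slide; treating Y* as directly observed GDP per capita is a simplification for illustration, since in the model Y* is latent.

```python
from bisect import bisect_left

def observed_category(y_star, cutoffs):
    """Map the latent Y* to Y = 1, 2, ...: Y = m when M_(m-1) < Y* <= M_m."""
    return bisect_left(cutoffs, y_star) + 1

# Cut-offs from the development example (low / medium / high), used here for illustration.
cutoffs = [5000, 15000]

for y_star in (3000, 5000, 9000, 15000, 20000):
    print(y_star, "->", observed_category(y_star, cutoffs))
```

Two cut-offs yield three categories, and an observation changes category only when Y* crosses a threshold, whatever the distance between the thresholds.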
106
Estimating the model
So, as in all statistical models we’ve covered, our latent variable Y* is a function of our right-hand side IV’s plus some level of error: $Y_i^* = \sum_{k=1}^{K} \beta_k X_{ki} + \varepsilon_i = Z_i + \varepsilon_i$. The ologit model will estimate part of this: $Z_i = \sum_{k=1}^{K} \beta_k X_{ki} = E(Y_i^*)$. So Z is basically Y* without the disturbance (not a perfect measure of Y*..). It is on a different scale than Y (e.g. continuous), but our estimates can give us Pr(Y=1, 2,..|X) based on the value of Z. Like binary logit, our link function is the log of the odds (logit), giving us the odds/probability that an observation falls into a given Y category based on its levels of the X’s. Just like in the probit and logit models, Z is continuous (the probabilities derived from it lie between 0 and 1). Y* is a function of the sum of all variables for each observation i. K is the number of IV’s and $\beta_k$ is the slope of the effect of each variable k (its size depends on the scale of X). The error term has a logistic distribution (not normal, as in probit). Thus the model needs to estimate the K β’s and the M-1 threshold parameters (where M is the number of categories of Y). The model estimates Z as the sum of the βX’s.
107
The model cont. In ologit, there is no ’traditional’ intercept, just ’cut-off points’ (M) (like an intercept), and these are different for each level of Y, but the β’s do NOT vary across the levels of Y! The point: we want to estimate the probability that Y (the observed variable) will take on a given value (in this case, 1, 2 or 3). Z helps us estimate the probability that a given observation will fall into a given Y category: $P(Y=1) = \frac{1}{1+\exp(Z_i - M_1)}$; $P(Y=2) = \frac{1}{1+\exp(Z_i - M_2)} - \frac{1}{1+\exp(Z_i - M_1)}$; $P(Y=3) = 1 - \frac{1}{1+\exp(Z_i - M_2)}$. So with the estimated value of Z and the assumed logistic distribution of the error term, we can estimate the probability that an observation will fall into one of the categories of Y.
108
This seems complicated, let’s test a few examples (use http://www. ats
This seems complicated, let’s test a few examples (use clear ) Let’s say we want to estimate ’socio-economic status’ (SES) as a function of test scores and gender: $SES_i = \alpha_{k-1} + \beta_1\,science + \beta_2\,socialstudies + \beta_3\,female + \epsilon_i$ We have 200 obs in our data – let’s see how the summary stats look:
109
-again, coefficients are pretty meaningless..
Ok, we see that higher science & social science scores lead to higher SES & that females, on average, have lower SES. ologit Yvar Xvars -again, coefficients are pretty meaningless.. So, let’s calculate the Pr(Y=1, 2 and 3) for a female who got average test scores on both tests… Getting our ”thresholds”: G1 (low SES): < 2.75; G2 (med. SES): > 2.75 and < 5.10; G3 (high SES): > 5.10. The threshold parameters of and 5.75 (cut-off 1 and cut-off 2) tell us the following. Since there are three possible values for Y (i.e. M = 3), the values for Y are (see above)
110
**Total should add up to 1**
Calculating ’Zi’ for a female with average test scores (from ’sum’) & our Beta estimates from the last slide: Zi = (0.03*51.85(science) *52.405(soc. Sci) – *1(female) Zi = 3.86. This is really great, but what do we do with it?? *Remember the ’cut points’ from the model? & 5.105, we’ll use those… Ah, ok, now we can compute Pr(Y=1, 2 & 3) from our fun formulas! $P(Y=1) = \frac{1}{1+\exp(Z_i - M_1)} = \frac{1}{1+\exp(3.86-2.755)} = .249$; $P(Y=2) = \frac{1}{1+\exp(Z_i - M_2)} - \frac{1}{1+\exp(Z_i - M_1)} = \frac{1}{1+\exp(3.86-5.105)} - \frac{1}{1+\exp(3.86-2.755)} = .528$; $P(Y=3) = 1 - \frac{1}{1+\exp(Z_i - M_2)} = 1 - \frac{1}{1+\exp(3.86-5.105)} = .223$ **Total should add up to 1** So, a female with average test scores has a 24.9%, 52.8% and 22.3% probability of being in the low, medium and high levels of SES, respectively!
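The same calculation can be reproduced in Python from the numbers shown on the slide (Z = 3.86, cut-offs 2.755 and 5.105). The function name is illustrative; because Z is rounded to two decimals here, the results match the slide only up to rounding in the third decimal.

```python
import math

def ologit_probs(z, cutoffs):
    """Category probabilities from cumulative P(Y <= m) = 1 / (1 + exp(z - M_m))."""
    cumulative = [1.0 / (1.0 + math.exp(z - m)) for m in cutoffs] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)  # P(Y = m) = P(Y <= m) - P(Y <= m - 1)
        previous = c
    return probs

# Values taken from the slide: Z for a female with average scores, and the two cut points.
probs = ologit_probs(3.86, [2.755, 5.105])
for label, pr in zip(("low", "medium", "high"), probs):
    print(f"Pr(SES = {label}) = {pr:.3f}")
print("total =", round(sum(probs), 6))
```

Building each category probability as a difference of cumulative probabilities guarantees that the total adds up to 1.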
111
An alternative way.. We can also ask STATA to calculate this for us…
Again, we use the ’margins’ command We get the exact same thing in STATA What about marginal effects of gender at different levels of tests? In ologit, predicted probabilities can be used or odds ratios..
112
Mean & Average Marginal Effects of Gender
We find that a woman with average test scores has a 24.8% probability of ending up with low SES (e.g. ’outcome 1’), while a man has a 16.95% probability. We can calculate the raw difference (see below). Alternatively, we can say that a woman with average test scores is 37.8% more likely than a man with average test scores to end up with low SES. Formula: % difference = { ( |Value 1 - Value 2| ) / [ (Value 1 + Value 2) / 2 ] } x 100 =( )/(( )/2)*100=37.799
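The percent-difference formula is easy to check in Python. Note that the inputs below are the rounded probabilities quoted on the slide, so the result (about 37.6) differs slightly from the 37.799 shown, which presumably used unrounded values.

```python
def percent_difference(value1, value2):
    """% difference = |v1 - v2| / ((v1 + v2) / 2) * 100 (symmetric in v1 and v2)."""
    return abs(value1 - value2) / ((value1 + value2) / 2.0) * 100.0

# Rounded probabilities of low SES from the slide: women 24.8%, men 16.95%.
gap_low_ses = percent_difference(24.8, 16.95)
print(round(gap_low_ses, 1))
```

Because the denominator is the average of the two values, this measure is symmetric: it gives the same answer whichever group is taken as the baseline.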
113
Mean & Average Marginal Effects of Gender
Or, we can do the same for high SES (outcome 3). A woman (ave. test scores) has a probability of 22.4% of ending up with high SES, while a man has 31.8% – an absolute difference of 9.45 percentage points. Or again, we can say that a woman is predicted to have 34.9% LESS of a chance of having high SES than a man, given that test scores are average.. =( )/(( )/2)*100 =
114
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
115
Interactions & marginal effects with ologit
Just like with logit, we can do the marginal effect of, say, gender at different levels of science test scores. Remember to specify the Xvar type! BUT, it is a bit more complicated to report, because we must do it for each Y-level, or choose the 1 we’re most interested in. For now, let’s just see if there’s an interaction – yup! What do you see?
116
Using dydx, we see gap, or marginal effect
Just like with ’margins’ in logit, we can do this many ways (MEM’s, AME’s, etc.). Let’s see the gender gap for the Pr(Y=high SES) for a range of science scores. In the first ’margins’, we see the Pr(Y=3) for both men (0) and women (1) at 5 levels of science scores, from low to high (left column 1-5). For example, with low scores (26): Pr(Y=3|female) = 4.7%, Pr(Y=3|male) = 35.5%. But, with high scores (74): Pr(Y=3|female) = 59.8%, Pr(Y=3|male) = 32.0%. Using dydx, we see the gap, or marginal effect. MEM’s – marginal effects at the means. AME’s – average marginal effects. Or marginal effects at representative values (with either MEM’s or AME’s for the 3rd IV, social science tests). Using dydx, we can see the difference in the gender gap for high SES at different levels of science scores.
117
Graphing out the marginal effect with: ’marginsplot, noci’
Example interpretation: at 26 and 42, the negative difference between female and male is significant at the 95%+ level of confidence, while at 52 and 74 the difference is significant at the 90% level of confidence. The difference between men and women in reaching high SES when test scores are 62 is negligible.
118
Or showing the effects of gender over science scores for all 3 outcomes in 1 figure…
STATA commands to do this:
margins, dydx(female) at(science=( )) predict(outcome(3))
marginsplot, noci saving(ses3)
margins, dydx(female) at(science=( )) predict(outcome(2))
marginsplot, noci saving(ses2)
margins, dydx(female) at(science=( )) predict(outcome(1))
marginsplot, noci saving(ses1)
gr combine ses1.gph ses2.gph ses3.gph
119
Model diagnostics. Just like logit, ologit has similar tests for 'goodness of fit'. Use the LR χ² statistic (& p-value) to test the Ho that all coefficients in the model jointly = 0. You can test nested models (omitted variables) with the LR test. Outliers are detected the same way as in logit. You can use a Chow test to check for structural breaks (sub-groups). BUT ologit requires an extra, & very important, diagnostic: checking the model's key assumption, the parallel regression assumption / proportional odds assumption. If it does not hold, we have to fix it or maybe just run separate models.
120
Two (very similar) tests you can do
1. LR test: first, get the command omodel (type 'findit omodel' in STATA), then re-run your model with it. 2. Brant's test: run the regular model & type 'brant' afterward; this also gives tests for individual IV's. **For both, the Ho is that there is NO difference in the coefficients between models, distributed as a χ² (i.e. we want non-significance). In our case, we have NOT violated the POA.
121
FYI: In small samples (say under 50 or so), you will often violate the proportional/parallel odds assumption, because outlying observations will have a large impact on the model. In this case, the estimates will be biased. To remedy this, you can estimate a GENERALIZED ORDERED LOGIT model with the command 'gologit2' and interpret each coefficient differently for each level of Y. Yup, pretty intensive.
122
For example, let's re-examine our model
Let's say we now believe SES is driven by gender, race, school type (public/private) and type of school program (academic, vocational & general). We run our new model while testing the proportional odds assumption: yup, probably violated! But we now need to see which IV's are violating it, as not all IV's need adjusting. White = 1 if white or Asian, 0 if black or Hispanic. School type = 1 if public, 2 if private. School program: 3 dummy variables, Academic, Vocational and General (omitted group). Interpretation is pretty straightforward: men and whites/Asians on average find themselves in higher SES groups relative to women and Hispanics/blacks. In addition, compared with the general program, students with an academic background tend to occupy higher SES, while vocational and public/private do not play a significant role. But is this correct?? The reason we don't just assume that all IV's are violating the assumption is that we would add unnecessary parameters to the model, which would make it less efficient & in the end much more difficult to interpret.
123
Now run with: gologit2. We see the model is in violation, but which IV's drive this? STATA command: 'gologit2 ses female white schtyp academic vocational, autofit lrforce'. When autofit is specified, gologit2 goes through an iterative process. autofit basically employs a backwards stepwise selection procedure, starting with the least parsimonious model and gradually imposing constraints. You can make the test 'tougher' by specifying autofit(.01), for 99% confidence for example. First, it estimates a totally unconstrained model, the same model as the original gologit. It then does a series of Wald tests on each variable individually to see whether its coefficients differ across equations, i.e. whether the variable meets the parallel lines assumption (with 3 categories the equations compare group 1 vs. 2 & 3, and groups 1 & 2 vs. 3). If the Wald test is statistically insignificant for one or more variables, the variable with the least significant value on the Wald test is constrained to have equal effects across equations. The model is then re-estimated with that constraint, and the process is repeated until there are no more variables that meet the parallel lines assumption. A global Wald test is then done of the final model with constraints versus the original unconstrained model; a statistically insignificant test value indicates that the final model does not violate the parallel lines assumption. As the global Wald test shows, 3 constraints have been imposed in the final model, which corresponds to 3 variables being constrained to meet the parallel lines assumption (academic, female, white: all assumed to have consistent Betas across the levels of SES), while school type and vocational remain unconstrained.
124
The ’new and improved’ model
Ok, now let's re-interpret! The 'low' equation compares low with middle & high, while the 'middle' equation compares low & middle with high (there is no separate equation for high, the top category). Positive coefficients indicate that higher values on the explanatory variable make it more likely that the respondent will be in a higher category of Y than the current one, while negative coefficients indicate that higher values on the explanatory variable increase the likelihood of being in the current or a lower category. Instead of all Betas being unique (totally unconstrained model, 10 unique Betas) we now only estimate 7. We now have new insights: we see that students who went to private school are much more likely to be in a higher category than low SES (but there is no difference between middle and high SES). Same with vocational compared with general education.
125
Ok, a little more interpretation: margins command
What is the probability that a non-white woman with general education in a public school is in low, middle or high SES? Again, with the 'margins' command, we can show this:
margins, at(female=(1) white=(0) schtyp=(1) academic=(0) vocational=(0)) predict(outcome(1))
= 63.6%. For middle SES (just change 'outcome' to 2) = 28.0%. For high SES (just change 'outcome' to 3) = 8.4%. What about the increase/decrease in Pr(Y=1, 2 or 3) based on a change in one IV?
margins, dydx(female) at(white=(0) schtyp=(1) academic=(0) vocational=(0)) predict(outcome(3))
High: -4.8%, Middle: -7.5%, Low: 12.3%. Remember: when you have dummy variables that are mutually exclusive (academic, vocational, general), you can only set ONE of them to 1 at a time. **You can be very creative in interpreting this: highlight changes in your key variables holding others at meaningful values, and discuss changes and c.i.'s at the different levels of Y.
126
For Ologit exercise see GUL:
ologit grad school data.dta ologit ex.pdf ologit ex answers.pdf
127
2. Multinomial Logit. Similar to ordered logit, it is used when our DV takes on more than 2 values but is still limited: 3, 4, 5 categories for example. Unlike ordered logit, the categories of the DV are 'not ordered', but are nominal categories (aka 'categorical'). We are interested in the relative probability of these outcomes using a common set of parameters (IV's). For example, given a set of IV's (education, country/regional origin, parents' income, rural/urban) we might want to know the following: Choice of a foreign language: English, Spanish, Chinese, Swedish. Choice of drink: coffee, Coke, juice, wine. Choice of occupation: police, teacher, or health care worker. Mode of transportation: car, bus, tram, train. Voting for a party or bloc: R-G, Alliansen or S.D.
128
Assumptions of ’mlogit’ models:
A common set of parameters (IV's) can linearly predict the probabilities of the DV's categorical outcomes, but we do not assume the error term is constant across Y outcomes. Unlike ologit, these IV's are CASE SPECIFIC: they have independent effects on each category of the DV (i.e. different Betas across categories, no 'parallel odds assumption'). 'Independence of Irrelevant Alternatives' (IIA, from Arrow's 'impossibility theorem'): the odds of choosing one category of the DV over another do not depend on the presence or absence of other, 'irrelevant' alternatives. **Strong assumption** *Multinomial logit is not appropriate if the assumption is violated. IIA: the choice of A over B does not rely on Z. Can be used with multi-level modeling (but is more common at the individual level).
129
Multinomial Logit Assumptions: IIA
IIA Example 1: Voting for certain parties. **For ex., the odds of someone voting S, V, L, KD or S.D. vs. M do not change if MP is added or taken away. Is the IIA assumption likely met in this election model? Probably not. If MP were removed, those voters would likely vote for V or S. Removal of MP would increase the likelihood of S or V relative to M. IIA Example 2: Consumer preferences. Options: coffee, juice, wine, Coke. Might meet the IIA assumption. Options: coffee, juice, Coke, Pepsi. Won't meet the IIA assumption. Coke & Pepsi are very similar, i.e. substitutable. Removal of Pepsi will drastically change the odds ratios for Coke vs. the others.
130
Multinomial Logit Assumptions: IIA
What to do about this issue? Long and Freese (2006): multinomial and conditional logit models should only be used in cases where the alternatives "can plausibly be assumed to be distinct and weighed independently in the eyes of the decision-maker." Categories should be "distinct alternatives", not substitutes. Theory & argument are very important. Note: there are some formal tests for violation of IIA, but they don't always work well. Be cautious of them. See Long and Freese (2006: 243).
131
age, gender, unemployment, education, region
Let's do a simple example: the Danish election (from 2014 ESS data), testing how support for the EU impacts one's party vote, controlling for other IV's: age, gender, unemployment, education, region. 1st, we have a sample of Danish respondents and lots of parties, but let's group them into 3 groups to reduce violation of IIA ('Votebloc3'): 1 = Red Bloc (A+B+F+Ø+Å), 2 = Blue Bloc (V+I+C+K), 3 = Dansk Folkeparti (DF). For now, other small parties (e.g. below 5%) & 'unsure' respondents are dropped. **Mlogit is more complicated than logit or even ologit, because it needs to estimate more parameters: (K+1) × (Y-1), where K = number of IV's (plus a constant) and Y = number of outcomes in the DV. It will thus require more 'iterations' than the previous 2 models, and the interpretation is a bit more complicated & cumbersome. See GUL for data: what do we observe? The baseline group should be the 'theoretically most interesting' one to compare with, or just the largest group if there is no clear choice.
132
Descriptive Stats (see do.file in GUL for all commands)
133
Simple look (bivariate)
We want to see how voters' feelings about the EU align with their national party choices (euftf: 'EU unification gone too far - not gone far enough', 0-10). Let's look at the summary stats by bloc: it appears DF voters are much more EU-skeptic, followed by Blue, then Red. Does this hold with control variables?
134
How is the model testing this?
The output of an 'mlogit' model always has Y-1 parts (total outcomes minus a baseline category). Let's use Red, the (then) incumbent bloc of parties. **Coefficients are thus all relative to the 'baseline' or 'common reference' (like dummy variables as IVs in a model). They correspond to Y-1 separate equations (in our case 2 equations):
$\ln\frac{\Pr(\text{Vote}=\text{Blue})}{\Pr(\text{Vote}=\text{Red})} = \alpha_{10} + \beta_{11}\,\text{EUftf} + B_{12}\,x_{\text{controls}} + \epsilon$
$\ln\frac{\Pr(\text{Vote}=\text{DF})}{\Pr(\text{Vote}=\text{Red})} = \alpha_{20} + \beta_{21}\,\text{EUftf} + B_{22}\,x_{\text{controls}} + \epsilon$
*** euftf is also a limited, ordinal variable, but with 11 categories (0-10), so let's try it as a continuous IV (we can always use dummies for each category if we want to as well). We thus regress the log of the ratio of the probability of voting for each bloc over the probability of voting for the Red baseline, to see if our predictors significantly explain this ratio in any of our outcomes.
135
Baseline mlogit model What do we see??
Log likelihood & Pseudo R² allow us to compare models. Chi2 tells us model significance. Coefficients give us the change in the log odds of each outcome relative to the baseline (Red) for a one-unit change in our X's.
136
*see word file for clearer pic..**
Notice the 'rrr' option? This produces the 'relative risk ratio', which is read like an odds ratio. What do we find now?? EU support holds. Control variables? What do you notice? Gender, for example. Regional variable: the comparison is with DK01 (Copenhagen). Note: we can change the baseline group with the option baseoutcome(#). The relative risk ratio for a one-unit increase in euftf is 0.92 for voting Blue relative to Red, and 0.66 for DF relative to Red.
137
Sig. test of variable(s)
Post-regression command: with controls, we can test whether the overall effect of euftf (or any other IV) on our DV is significant with 'test' (the Ho is that the effect of the variable in question = 0). Clearly significant overall.
138
An Alternative: Comparison with regular logit
Notice that when we take the DV and reduce it to 1/0, with 1 = Red and 0 otherwise, we find pretty similar results. Thus we see that a simple logit model is, in a sense, embedded in a larger mlogit model. So you COULD, if you wanted, just run several separate logit models (notice that the '0' must be equal to the baseline group in mlogit, or else the results will be a bit different).
139
Interpretation, Interpretation, Interpretation
We find some interesting results: Those most positive toward the EU (want more integration) are most likely to vote Red, least likely to vote DF. Compared with Red, Blue voters are more male, less educated, less urban. The unemployed are most likely to vote Red. No significant difference for gender between Red and DF (but women are less likely to vote Blue). Voters from S. Denmark and Midtjylland are most likely to vote DF or Blue over Red. Voters in the Copenhagen region most likely go for Red. *Can we be more specific? YES WE CAN!! Start with the 'margins' command to get predicted probabilities for each bloc.
140
Commands to do this:
margins, at(euftf=(0(1)10)) predict(outcome(0))
marginsplot, saving(red)
margins, at(euftf=(0(1)10)) predict(outcome(1))
marginsplot, saving(blue)
margins, at(euftf=(0(1)10)) predict(outcome(2))
marginsplot, saving(df)
gr combine red.gph blue.gph df.gph, ycommon
Notice how, at each level of EUftf, the probabilities sum to 1? ***Notice that the baseline you choose is essentially irrelevant for the predicted probabilities***
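Both points on this slide (probabilities summing to 1, and the baseline not mattering for predicted probabilities) follow from how multinomial logit turns linear predictors into probabilities. The Python sketch below is only an illustration: the linear-predictor values are made up, not estimates from the Danish data, and `mlogit_probabilities` is a hypothetical helper name.

```python
import math

def mlogit_probabilities(linear_predictors):
    """Convert one linear predictor per outcome into multinomial-logit probabilities."""
    largest = max(linear_predictors)  # subtract the max for numerical stability
    exponentiated = [math.exp(value - largest) for value in linear_predictors]
    total = sum(exponentiated)
    return [value / total for value in exponentiated]

# Hypothetical linear predictors for (Red, Blue, DF) with Red as baseline (fixed at 0)
relative_to_red = [0.0, 0.3, -1.2]
probs_red_baseline = mlogit_probabilities(relative_to_red)

# Re-express the same model with Blue as baseline: subtract Blue's predictor everywhere
relative_to_blue = [value - relative_to_red[1] for value in relative_to_red]
probs_blue_baseline = mlogit_probabilities(relative_to_blue)

print(probs_red_baseline, sum(probs_red_baseline))
print(probs_blue_baseline, sum(probs_blue_baseline))
```

Changing the baseline shifts every linear predictor by the same constant, which cancels in the ratio, so the coefficients change but the predicted probabilities do not.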
141
Diagnostics with MLogit
Again, like logit (and ologit), we test the significance of the full model with the χ² statistic, and 'improvements' (or omitted/irrelevant variables) with an LR test using the log likelihoods. Again, the Pseudo-R² is meaningless by itself; it is only useful compared to other models with the same sample. BUT, the higher, the better. Detecting outliers is harder in mlogit than in logit. Similar stats apply, however.
142
Testing for omitted or irrelevant IV's: an LR (all IV's) or Wald Chi-Sq test. No factor-variable IV's, however ('i.' & 'c.'). Command: mlogtest, all. You must take the 'i.' out of the DemSat IV, for example. In our case, we see that Urban, Unemployment and income are 'irrelevant'. We can re-test the models with the Pseudo-R² or an LR test after we remove them, to see if we can get away with running a more restricted model with fewer parameters.
143
Testing the IIA & Combining Dependent categories of Y
Hausman test and Small-Hsiao test (in mlogtest, all): both test the Ho that the IIA is NOT violated (i.e. independence of categories). Negative (and positive) insignificant values for the Hausman and Small-Hsiao tests mean we cannot reject the Ho (i.e. IIA has not been violated). **If violated, consider collapsing categories, or running simple logit. Combining categories: 2 similar tests (Wald & LR). With lrcomb, we are basically testing whether none of the IV's significantly affect the odds of outcome 'a' vs. outcome 'b', for all pairs of outcomes; if two outcomes are indistinguishable with respect to the IV's in the model, then we can combine them. **We find that the predictors for the 3 voting blocs are statistically significant.
144
Example of logit and mlogit regression
Charron, N. and A. Bågenholm. Ideology, party systems and corruption voting in European democracies. Electoral Studies 41, 35-49. (On GUL)
145
Topic 3: Event Count Models
Again, we determine the use of an Event Count model by the structure of our DV. So far, we've looked at variables that have normal and binary distributions (OLS and logit). We'll now consider a 3rd type: count ('Poisson'-type) distributions. In this case, the DV is: a FIXED number of outcomes & NOT binary. For ex., it can be units of time (days, years, etc.) or events per unit in a fixed time (individual or geographic unit). Ordinal (more later if your DV is continuous). Non-negative (it can take '0').
146
Some examples: Number of new political parties entering parliament in a given election year. The number of political protests or coups d'état in a country-year. Number of presidential vetoes in a year or mandate period. Number of children in a household. Number of vaccinations a child gets in a year, or doctor visits an adult makes. Number of civic organizations an individual joins or is a member of in a given year.
147
Key characteristics of ’Event Data’
The count of events is non-negative, and events are independent of one another. Counts must be integers (i.e. discrete): they cannot be 2.2 or 3.7, but 2 or 4. Can have a 1-parameter (λ) distribution (mean = VAR). Using a histogram, we see that the distribution of Yi outcomes is usually large at 0 or 1, and diminishes rapidly from the 2nd or 3rd outcome on. The distribution is thus NOT normal ('Gaussian'). For count data we use these models: 1. Poisson 2. Negative binomial
148
1. Poisson Models: Assumptions & workings
Like logit, Poisson is estimated with Maximum Likelihood Estimation (MLE), which finds the values of the parameters that fit the model 'best' (log likelihood). Our 'link function' in this case is the log, applied to the rate Lambda (λ). Goals are to: 1) estimate the change in Pr(Y=n) for a unit change in X. In Poisson regression, the model expresses the log outcome rate as a linear function of a set of predictors (like logit, β's need to be transformed for interpretation). 2) predict the expected count outcome for an observation (like ologit predicts the group). Because of our DV's distribution, the normal/logit curve can't be used; the Poisson distribution fills this gap.
149
Why better than OLS?? - Like our Trump example from logit, OLS will produce a linear estimate of the relationship between βX and Y that can fall below 0 or above our highest count (unrealistic predictions). - OLS assumes the difference is the same between all counts in Y (0 to 1 is the same as 3 to 4); like ologit, Poisson does not. - We will almost always have heteroskedasticity (as there will probably be more VAR in Y outcomes with more observations). - The error term is not normally distributed.
150
Cont. , The Poisson Distribution
The Poisson distribution is the following: $\Pr(Y_i = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$. λ is calculated as the mean of Yi. $e^{-\lambda}$ is the exponential of minus Lambda. k is the count of outcomes in Y whose probability we want. k! is the factorial of k (ex. 4! = 4 × 3 × 2 × 1 = 24). λ is the expected value of Yi (mean of the DV) and also its variance, so: λ = E(Y) = Var(Y). Notice that when λ = 1 the distribution is highly concentrated between 0 and 10; as Lambda increases, what does the distribution look like?
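To make the formula concrete, here is a small Python sketch (an illustration only, separate from the STATA workflow) that evaluates the Poisson probability for a few values of λ and checks numerically that the mean and the variance both equal λ. The helper names `poisson_pmf` and `poisson_mean_and_variance`, and the cut-off of 60 counts, are choices made for this example.

```python
import math

def poisson_pmf(k, lam):
    """Pr(Y = k) for a Poisson distribution with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def poisson_mean_and_variance(lam, max_count=60):
    """Approximate mean and variance by summing over counts 0..max_count."""
    counts = range(max_count + 1)
    mean = sum(k * poisson_pmf(k, lam) for k in counts)
    variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in counts)
    return mean, variance

for lam in (1.0, 4.0, 10.0):
    mean, variance = poisson_mean_and_variance(lam)
    # Probability of observing zero events shrinks quickly as lam grows
    print(lam, round(poisson_pmf(0, lam), 4), round(mean, 4), round(variance, 4))
```

With λ = 1 most of the probability sits on 0, 1 and 2; with λ = 10 the distribution is much more symmetric around 10.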
151
Poisson distributions at different levels of Lambda
λ is equal to the rate of the event (DV). Note that for Poisson, E(Y) = λ, i.e. the mean of the DV = the rate; Lambda may be thought of as the mean of the Poisson distribution. 3 examples with K=20 & λ = 1, 4 & 10. Note that as the rate Lambda (the expected number of times that an event occurs per unit of time) increases, the distribution shifts to the right, approaching the normal distribution. So, if the mean of the distribution (λ) is high enough (e.g. the mean approaches 10), then OLS is ok, and we can generate c.i.'s, Pr(Y=n|Xi), etc. in a similar way as with a normal curve. BUT the data we will discuss will have a mean closer to about 1 or less.
152
Important assumptions of a Poisson Model
The observations are assumed to be independent of one another. The logarithm of the rate of the DV changes linearly with equal-increment increases in the IV's. 'Equidispersion': the mean of the DV = the variance (although this does not happen very often). Breaking this is called 'overdispersion': the VAR in our data is greater than the model assumes. If violated, we can't use Poisson for hypothesis testing. **If outcome cases of Y are not independent, then we will most likely see 'overdispersion', which, if large enough, will lead us to use a Negative Binomial model (more later…)
153
Overdispersion: Causes & Consequences
Possible causes: 1. A poorly fitted model: omitted variables; outliers; wrong functional form of 1+ of our IV's in the model; unaccounted-for heteroskedasticity from structural breaks. 2. $Var(Y_i) > \mu_i$ (variance of our data greater than the mean): very common with individual-level data! Consequences: underestimated SE's (think opposite effect of multicollinearity), p-values that are too small (overstated significance) & poor predictions.
154
Simple Empirical Example – SES school data
H: the number of awards earned by students at one high school is positively related to their score on their final exam in math, and can be predicted by the type of program in which the student was enrolled (e.g., vocational, general or academic). use clear Let’s take a look at the mean of our DV by the 3 academic programs - it looks like ‘academic’ program has a significantly higher level of awards..
155
Our DV: # of Awards
156
Our Poisson model: #Awards = academic program + math score + constant + error. Just like with logit models, we can use the Wald χ² to evaluate the model as a whole. The pseudo R² & log pseudolikelihood can be used to compare different models. We CAN use robust s.e.'s to correct for a slight violation of Mean = VAR. For ordered/categorical IV's: let's first test for the overall significance of 'Programs' on 'Awards' (general = baseline).
157
Important extra model test in Poisson
Before going on to interpret the model's Betas, we need to know whether we've 'chosen correctly' with Poisson: does the Poisson estimation form fit our data?? I.e. is the Poisson distribution appropriate? Otherwise, we might consider ologit. A 'goodness of fit' test (χ²) will let us know if we have a problem: the Ho is that the model's form DOES fit our data, so a rejection of Ho means that Poisson might be the WRONG estimation. Other reasons for rejection would be omitted IV's or incorrect functional forms. Can we reject??
158
Ok, now time to interpret!
Like logit, the Betas are basically meaningless on their own: we can know the significance & direction, but that's about it. But Poisson can give us the 'incident rate ratio' (IRR) = exponentiated Betas (like odds ratios in logit): $\frac{\lambda \mid X_{program}=academic}{\lambda \mid X_{program}=general} = \exp(\beta_{X_{program}}) = \exp(1.08) \approx 2.95$. Ex.: holding math score constant, a student in an academic program (compared with general) has 2.95 times the incident rate. Also, we see that for every one-unit increase in math score, the incident rate increases by 7%, holding program constant.
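The step from a Poisson Beta to an IRR and a percent change is just exponentiation, which can be checked in a few lines. This Python sketch is an illustration only; `incident_rate_ratio` and `percent_change_in_rate` are made-up helper names, and the math-score Beta is backed out from the 7% figure on the slide rather than taken from the STATA output. Exponentiating the rounded coefficient 1.08 gives about 2.94; the slide's 2.95 presumably reflects the unrounded coefficient.

```python
import math

def incident_rate_ratio(beta):
    """IRR for a one-unit increase in X: the exponentiated Poisson coefficient."""
    return math.exp(beta)

def percent_change_in_rate(beta):
    """Percent change in the expected count for a one-unit increase in X."""
    return (math.exp(beta) - 1) * 100

beta_academic = 1.08          # rounded coefficient quoted on the slide
beta_math = math.log(1.07)    # assumption: Beta implied by the slide's 7% increase

print(round(incident_rate_ratio(beta_academic), 2))
print(round(percent_change_in_rate(beta_math), 1))
```

Because the model is multiplicative in the rate, a 10-point rise in math score multiplies the rate by the one-point IRR raised to the 10th power, not by 10 times the percent change.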
159
Now let's try predicted counts…
With math score at its mean (52), we find that the predicted number of awards for someone in an academic program = 0.62, while for someone in a general program it is only 0.21 (i.e. 2.95 times greater for academic, as in our IRR!!). Someone with a vocational background is predicted to have 0.31 awards, holding math score at its mean.
160
Testing the impact of Math Scores
We can get marginal effects of math scores at the mean ('margins, dydx(math) atmeans') or at certain levels. We find that the predicted count of awards for a student with a math score of 35 (near the low end) is 0.13, but at 75 (high) it is 2.17. This seems very meritocratic!!
161
Showing the effect of a categorical IV over other IV values..
To see how math scores interact with academic programs, we can show marginal effects of the programs at various levels of math scores. When math scores are low, program plays no role for awards, BUT as math scores increase, it seems students in the academic program are favored!!
162
Alternatively: a similar way to show this is with predicted counts over 'actual observations'. This shows only the relevant range (but takes slightly more commands than just margins). Stata graph commands:
predict c
separate c, by(prog)
twoway scatter c1 c2 c3 math, connect(l l l) sort
163
Final diagnostic checks of the model
Run 'fitstat' to get model statistics, especially when running more IV's and testing for omitted &/or irrelevant IV's. Overdispersion: since it is VERY rare that the mean of Y = VAR(Y), we should do one last test. It is recommended to run a negative binomial model (with the same DV & IV's) and compare the 'alpha' and the other estimates of the IV's. STATA's 'nbreg' does the LR test for us: the Ho is no difference from Poisson. If we cannot reject it, we choose Poisson, since it is more efficient!!
164
2. Negative Binomial Models (NBM)
These are also 'count' models for limited DV's, very similar to Poisson in both assumptions and interpretation. They also model the rate Lambda through a log link to estimate Pr(Y). The key difference from Poisson is that Var(Y) is assumed to be larger than Mean(Y) (i.e. 'overdispersion'). Also, if we cannot assume that the outcomes of Y are independent from one another, then a NBM might be more appropriate. A matter of efficiency: we prefer Poisson because of its greater efficiency, but there is a clear solution when we violate key model assumptions, so we take NBM instead. Where 'mu' is the poisson distribution over number of Y outcomes (K) squared
165
Cont. Like Poisson, the NBM is estimated by maximum likelihood, but the assumed variance of Y is: $Var(Y) = \lambda + \alpha\lambda^{2}$, where α = the 'dispersion parameter' (set at '0' in Poisson). So instead of one parameter being estimated, there are 2 (which is why it is less 'efficient'). It uses logged rates, so like Poisson we can exponentiate the Betas into incident rate ratios (as with odds ratios in logit). So NBM's are basically a more general type of Poisson model. Key differences: because of the quadratic term in the assumed Var(Y), they are LESS EFFICIENT. Poisson will produce SMALLER s.e.'s for the Beta estimates; in medium to large samples, its estimates are nevertheless consistent (not biased). Following from this, NBM's will give larger expected probabilities for smaller counts (e.g. # of Yi outcomes) compared with Poisson, and slightly larger probabilities for larger counts. Alpha is assumed to be fixed across observations and greater than 0.
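The variance formula can be verified numerically. The Python sketch below is an illustration only: it assumes the parameterization in which Var(Y) = λ + αλ² (the one where α = 0 gives back Poisson), writes the negative binomial probabilities in terms of the mean and α, and checks the mean and variance by summation. The helper names and the values mean = 2 and α = 0.45 are chosen for the example.

```python
import math

def negative_binomial_pmf(k, mean, alpha):
    """Pr(Y = k) for a negative binomial with given mean and dispersion alpha > 0."""
    size = 1.0 / alpha                 # 'size' parameter of the gamma mixing distribution
    prob = size / (size + mean)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(prob) + k * math.log(1.0 - prob))
    return math.exp(log_pmf)

def poisson_pmf_for_comparison(k, mean):
    """Poisson probability with the same mean, for comparing tails."""
    return mean ** k * math.exp(-mean) / math.factorial(k)

def mean_and_variance(pmf, max_count=400):
    """Approximate mean and variance of a count distribution by direct summation."""
    probabilities = [pmf(k) for k in range(max_count + 1)]
    mean = sum(k * p for k, p in enumerate(probabilities))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probabilities))
    return mean, variance

nb_mean, nb_variance = mean_and_variance(lambda k: negative_binomial_pmf(k, 2.0, 0.45))
print(round(nb_mean, 3), round(nb_variance, 3))   # variance should match 2 + 0.45 * 2**2

# Fatter right tail: compare the probability of a large count under each model
print(negative_binomial_pmf(8, 2.0, 0.45) > poisson_pmf_for_comparison(8, 2.0))
```

With the same mean, the negative binomial puts more probability on both zero and on large counts than Poisson does, which is what 'overdispersion' looks like in a histogram.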
166
Example: common Poisson vs. Negative binomial distributions of our data
Notice the variance is larger and the tail of the NB is fatter than the Poisson..
167
Ex. use http://www.ats.ucla.edu/stat/stata/dae/nb_data, clear
Let's say we want to explain student absences from a given school term. Let's look at our DV: 1. 'tabstat DV, stats(mean v n)' 2. 'histogram AbsAdj, discrete freq scheme(s1mono) normal'. Ok, it's pretty clear that the Poisson assumption (mean = Var) does not hold, as the Var of the DV is more than 2*mean. Thus we have 'over-dispersed data', which is right for a NBM.
168
The estimation. Let's get really creative and estimate the effect of the type of school program & student performance (math scores) on the # of absences in a term. What do we learn from this? - Model fit: chi2 says the model is significant. - 'alpha' is used to get our estimate of Var(Y) in ML (0.45); this is like 'modeling overdispersion'. - The LR test's Ho is that our model = Poisson (alpha = 0), which we can reject.
169
Cont. -better Math scores are associated with a decrease in absences
-compared with the general program, academic & vocational students have fewer absences. We can test the overall significance of 'program' with 'test' after the regression. Incident rate ratios (again 'irr') can help. We see that for every one-point increase in math scores, the rate of absences decreases by ca. 0.5% (the coefficient is -.0045). Compared with general, the 'incident rate' of absences is .57 and .38 times as large for academic and vocational students respectively.
170
Interpretation – individual predictions
What is the number of absences predicted by the model for a vocational student with a 79 math score? (79*-.0045) + (1* ) = Like logit, this is on the log scale, so we take the exponent of this: 𝑒 = absences. To check this, we can 'predict c' to get predicted counts, and go in and look at any observation.
171
Or just check for marginal effects - MEM
At mean levels of math scores, predicted number of absences by program: General = 5.07 Academic = 2.91 Vocational = 1.95 ** Vocational/General = 1.952/5.076 = 0.384 =our Incident rate!
172
Marginal Effects of Math scores
Math scores range from 1- 99, so we take predicted number of absences at scores at intervals of 20 points Predicted # of absences for a student with lowest score = 3.4, while only 2.16 for highest But how do our 2 variables interact??
173
margins, at(math=(0(20)100) prog=(1 2 3)) marginsplot, noci
175
Interactions – MERV over only ’relevant ranges’
Here we can graph out the effects of math scores on absences in the 3 programs of study for only 'relevant' obs:
predict c
twoway (line c math if prog==1) (line c math if prog==2) (line c math if prog==3)
Math scores have a pretty constant effect on the DV, regardless of program.
176
NBM vs. Poisson for our example
See how close the Betas are? This shows that Poisson is still a consistent estimator, despite overdispersion. However, what is the difference here? Yes, the s.e.'s are considerably larger in NBM; the smaller s.e.'s in Poisson lead to higher Z-scores and maybe a greater risk of type-1 error. What 'would happen' if we had used Poisson?
177
Brief note: Modelling Gamma distributions for continuous variables
Sometimes, you will not have 'count data', but continuous data (with non-integers) that are non-normally distributed (look at the histogram…). If there is a high concentration of the observations at lower levels and the values are all positive, then you need to take into account the Gamma distribution. Here too, OLS will not be appropriate. In this case, you will need to use a 'generalized linear model' (GLM) with one of 2 link functions: log or inverse (reciprocal). **Results are less intuitive, but report st. dev. changes and show marginsplots as we've done. Source: Dobson, A. J., & Barnett, A. (2008). An introduction to generalized linear models. CRC press.
178
Brief note on ’rare events’
What if you have a binary variable where the outcome (1 or 0) is really unbalanced? Logit is known for 'small sample bias', most common if one of the outcomes has < 50 obs, for ex., or the % of '1's in the DV is < 5%. It is NOT NECESSARILY about the % of '1's per se; it is really about the N. If you have 1% of 1's, and: N = 1,000 (i.e. 10 '1's), not so great, logit probably biased. N = 10,000 (i.e. 100 '1's), probably ok, depending on the # of IV's. N = 100,000 (i.e. 1,000 '1's), fine. In cases where this might be a problem, Allison (2013) recommends using the 'Firth method': findit firthlogit. Run models both with logit and then with firthlogit; if you see noticeable differences, the latter is probably more appropriate.
179
Some other logit models in Brief
'Nested' logit: with 3+ outcomes, when error terms are correlated between 2+ outcomes, or when decisions are made in sequential 'tree-type' stages, e.g. a voter chooses between voting or not voting, then between party A or party B. nlogit in STATA. Conditional logit: useful for 'alternative specific' data; the data needs to be 'stacked'. Ex: data on characteristics of restaurants and individuals. clogit in STATA.
180
Summary review. Sometimes, our DV's will have a limited distribution: 0/1, 0-4, 1-5, categorical responses, etc. This results in many problems for OLS, such as heterogeneity of the error term, which gives biased and unrealistic estimates of our Betas. Like in OLS, we want to make predictions about Pr(Y) given values of Xi, etc., but we need to transform our Y's to probabilities, odds, etc. using LINK FUNCTIONS. For binary variables, our link functions can be logit or probit. Same for ordinal or categorical data. For count data, we use the Poisson and negative binomial distributions, with a log link for the rate Lambda. Remember, none of the Betas produced make intuitive sense on their own, and thus they need to be transformed (odds, probabilities, etc.); margins in STATA is great for this. Also, the choice of any of these models is based on your Dep. Variable!!
181
Modelling Limited Outcome Variables
Sometimes we will have either ’censored’ or ’truncated’ data for our DV. Censored = we have all observations in the dataset, but we don't know the "true" values of some of them (our measure ‘artificially’ limits high/low levels of ‘true’ Y) *ex. A test score might be VG, but some students probably did better than others even within this grade. Truncated = some observations are missing in a way that is related to the values of Y *ex. GDP per capita data that are systematically missing lots of poorer countries.
182
Example from Long (p. 188)
183
1. Truncation - Why does this matter??
The sample is not representative, and the distribution of observed y will be non-normal and skewed relative to the ’real’ Y:
$f(y \mid y > a) = \dfrac{f(y)}{P(y > a)}$
In other words, the distribution of the observed Y is the density of Y rescaled by the probability that Y is above ’a’ (some level of truncation in our data). When the sign is ’>’ we are dealing with ’left truncation’; conversely we could have ’right truncation’, and the sign would be ’<’. We can calculate HOW much truncation affects a sample with the INVERSE MILLS RATIO, which is the ratio of the PDF to the CDF. It tells us by how many s.d.’s the mean of the truncated sample is shifted away from the true Y mean. The less the truncation, the closer our mean is to the true Y mean.
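A minimal Python sketch of the inverse Mills ratio for left truncation, assuming Y is normally distributed with mean mu and standard deviation sigma (an assumption made for this illustration). Here the ratio is written as phi(alpha) / (1 - Phi(alpha)) with alpha = (a - mu) / sigma, which is the same as PDF over CDF evaluated at (mu - a) / sigma; the function names are illustrative.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inverse_mills_left(a, mu, sigma):
    """Inverse Mills ratio when only observations with y > a are kept."""
    alpha = (a - mu) / sigma
    return normal_pdf(alpha) / (1.0 - normal_cdf(alpha))

def truncated_mean_left(a, mu, sigma):
    """E[y | y > a] = mu + sigma * lambda under normality."""
    return mu + sigma * inverse_mills_left(a, mu, sigma)

# Little truncation (a far below the mean) vs. heavy truncation (a above it).
print(truncated_mean_left(-5.0, 0.0, 1.0))
print(truncated_mean_left(1.0, 0.0, 1.0))
```

With the truncation point far below the mean the ratio is close to zero and the observed mean is close to the true mean; as the truncation point moves up, the observed mean is pushed further above it.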
184
What to do? What to do when we have such data for the DV
1. Just use OLS for the available observations: estimates would be less efficient (fewer observations & d.o.f.), and most often biased (the X variable is now related to the error term, so E(error term) ≠ 0)
2. Run a truncated regression model (less bias, more efficiency); for this, we will use a Heckman Selection model (Heckman 1979) **in STATA, you can model either left or right truncation or both!
Let’s take an example: the effect of corruption on gender equality. Our gender equality index runs from 0-1 and clearly has some obs missing.
185
Tobit model – the basics
Tobin (1958) wanted to explain the relationship between household income and household luxury expenditures BUT, he noticed that there were a lot of ’0’s in Y (people buying no luxury goods). He hypothesized that the X’s might have a ’dual effect’: they first explain IF someone spends on luxury goods (e.g. Y = 0 or 1), and then HOW MUCH (e.g. E(Y) as continuous). He came up with a model:
$y_i^* = x_i'\beta + \varepsilon_i$
where $y_i^*$ is a latent variable and the observed $y_i$ equals (for left-censored data!):
$y_i = \begin{cases} y_i^* & \text{if } y_i^* > C \\ C & \text{if } y_i^* \le C \end{cases}$
’C’ is the ’censoring point’ in the data (commonly ’0’, in which case $y_i = 0$ whenever $y_i^* \le 0$). Thus we have an observed Y that equals Y* if the value of Y* is greater than C, but equals C if the value of the unobserved Y* is less than or equal to C.
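The censoring rule can be illustrated with a small simulation. This Python sketch uses made-up parameter values (intercept 1, slope 2, standard normal X and errors, C = 0), which are assumptions for illustration and not estimates from any real data. It generates a latent Y*, censors it at C, and compares the OLS slope on the latent and on the observed outcome.

```python
import random

def ols_slope(x, y):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov_xy / var_x

rng = random.Random(42)  # fixed seed so the simulation is reproducible
C = 0.0                  # censoring point (assumed)

x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
# Latent outcome: y* = 1 + 2x + e (assumed 'true' model)
y_star = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
# Observed outcome: equals y* above C, and C otherwise
y_obs = [ys if ys > C else C for ys in y_star]

slope_latent = ols_slope(x, y_star)
slope_censored = ols_slope(x, y_obs)
share_censored = sum(1 for ys in y_star if ys <= C) / len(y_star)

print("share censored:", round(share_censored, 3))
print("OLS slope on latent y*:", round(slope_latent, 3))
print("OLS slope on censored y:", round(slope_censored, 3))
```

OLS on the censored outcome gives a slope that is pulled toward zero relative to the latent relationship, which is the bias the Tobit model is designed to correct.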
186
Example: is there a relationship between bribery and recent economic growth?
DV: % of reported bribes paid in EU regions. IV: total PPP growth. What do we observe? Yup, several regions have ’0’ in the DV.
187
What to do? What to do when we have such data for the DV
1. Just use OLS for the available observations: estimates would be less efficient (fewer observations & d.o.f.), and most often biased (the X variable is now related to the error term, so E(error term) ≠ 0)
2. Use a probit/logit model (lose much variation if the DV is continuous)
3. Run a Tobit model (for censored data), a Heckman Selection model or a truncated regression model (less bias, more efficiency)