Analyzing dichotomous dummy variables Quantitative Methods
Logistic Regression Analysis Like ordinary regression and ANOVA, logistic regression is part of a category of models called generalized linear models. Generalized linear models were developed to unify various statistical models (linear regression, logistic regression, Poisson regression). We can think of maximum likelihood as a general method for estimating all of these models.
Logistic Regression—when? Logistic regression models are appropriate for dependent variables coded 0/1. We only observe “0” and “1” for the dependent variable—but we think of the dependent variable conceptually as a probability that “1” will occur.
Logistic Regression--examples Some examples: vote for Obama (yes, no); turned out to vote (yes, no); sought medical assistance in the last year (yes, no).
Logistic Regression—why not OLS? Why can’t we use OLS? After all, linear regression is so straightforward, and (unlike other models) actually has a “closed form solution” for the estimates.
Logistic Regression--why not OLS? Three problems with using OLS. First, what is our dependent variable, conceptually? It is the probability that y=1. But we only observe y=0 and y=1. If we use OLS, some predicted values will fall between 0 and 1, which is what we want, but others can be greater than 1 or less than 0. As probabilities, those make no sense.
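The first problem can be illustrated with a minimal sketch. The data below are hypothetical toy values invented only for illustration, and the straight-line fit uses NumPy's least-squares routine as a stand-in for any OLS software.

```python
import numpy as np

# Hypothetical toy data: x is a single predictor, y is the observed 0/1 outcome.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Closed-form OLS fit of y = a + b*x (a "linear probability model").
X = np.column_stack([np.ones_like(x), x])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]

# Predicted "probabilities" at a few values of x, some outside the observed range.
x_new = np.array([-3.0, 0.0, 4.5, 9.0, 12.0])
p_hat = a + b * x_new

for xi, pi in zip(x_new, p_hat):
    flag = "" if 0.0 <= pi <= 1.0 else "  <- not a valid probability"
    print(f"x = {xi:5.1f}  predicted = {pi:6.3f}{flag}")
```

With these toy numbers the fitted line leaves the 0 to 1 range even at the smallest and largest observed x, not only when extrapolating.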
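Logistic Regression--why not OLS? Three problems using OLS. Second problem: there is heteroskedasticity in the model. Think about the meaning of “residual”. The residual is the difference between the observed and the predicted Y. By definition, what will that residual look like at the center of the distribution? By definition, what will that residual look like at the tails of the distribution? Because the observed Y can only be 0 or 1, the variance of the residual is p(1-p), where p is the probability that y=1: it is largest when p is near 0.5 and shrinks toward zero as p approaches 0 or 1, so the error variance is not constant.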
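A short sketch of why the error variance cannot be constant, assuming only that the outcome is a Bernoulli variable with probability p of being 1. The function name and the probabilities shown are illustrative choices.

```python
# Variance of a 0/1 outcome whose probability of being 1 is p.
def bernoulli_variance(p):
    return p * (1.0 - p)

# Illustrative probabilities from one tail, through the center, to the other tail.
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"p = {p:.2f}  residual variance = {bernoulli_variance(p):.4f}")
```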
Logistic Regression--why not OLS? Three problems using OLS. The third problem is substantive. The reality is that many choice functions can be modeled by an S-shaped curve. Therefore (much as when we discussed linear transformations of the X variable), it makes sense to model a non-linear relationship.
Logistic Regression—but similar to OLS.... So. We actually could correct for the heteroskedasticity, and we could transform the equation so that it captured the “non-linear” relationship, and then use linear regression. But what we usually do....
Logistic Regression—but similar to OLS... ...is use logistic regression to predict the probability of the occurrence of an event.
Logistic Regression—s shaped curve
Logistic Regression— S shaped curve and Bernoulli variables Note that the observed dependent variable is a Bernoulli (or binary) variable. But what we are really interested in is predicting the probability that an event occurs (i.e., the probability that y=1).
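The S shape can be seen numerically in a minimal sketch, assuming the standard logistic function 1 / (1 + exp(-z)), where z stands for the linear right hand side of the model. The function name and the grid of z values are illustrative.

```python
import math

def logistic(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# The predicted probability flattens out near 0 and 1 and is steepest around z = 0.
for z in (-6, -4, -2, 0, 2, 4, 6):
    print(f"z = {z:3d}  P(y=1) = {logistic(z):.4f}")
```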
Logistic Regression--advantage Logistic regression is particularly handy because (unlike, say, discriminant analysis) it makes no assumptions about how the independent variables are distributed. They can be continuous or categorical, and they do not have to be normally distributed.
Logistic Regression--exponential values and natural logs Note: “exp” is the exponential function and “ln” is the natural log. They are inverses of each other. When we take the exponential function of any number, we raise e (approximately 2.718) to the power of that number. So exp(3) = e * e * e, which is approximately 20.09. If we take ln(20.09), we get back approximately 3.
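The same check can be run with Python's standard math module, shown here as a small illustration.

```python
import math

print(math.e)                  # the base of the natural log, about 2.718
print(math.exp(3))             # e raised to the power 3, about 20.09
print(math.log(math.exp(3)))   # ln undoes exp, giving back 3 (up to rounding)
```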
Logistic Regression--transformation Note that you can think of logistic regression in terms of transforming the dependent variable so that it fits an S-shaped curve. The odds are the probability that a case will be a 1 divided by the probability that it will not be a 1. The natural log of the odds is the “logit”, and it is a linear function of the x’s (that is, of the right hand side of the model).
Logistic Regression--transformation Note that you can equivalently talk about modelling the probability that y=1 (theta, below), as below (these are the same mathematical expressions):
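The equations from the original slide are not reproduced in this text. The standard forms are sketched here, with theta as the probability that y=1 and b1, x1 as used elsewhere in these notes. The intercept symbol a and the number of predictors k are notational assumptions.

```latex
\ln\!\left(\frac{\theta}{1-\theta}\right) = a + b_1 x_1 + \dots + b_k x_k
\qquad\Longleftrightarrow\qquad
\theta = \frac{e^{a + b_1 x_1 + \dots + b_k x_k}}{1 + e^{a + b_1 x_1 + \dots + b_k x_k}}
       = \frac{1}{1 + e^{-(a + b_1 x_1 + \dots + b_k x_k)}}
```

The left form is linear in the x's on the logit scale. Solving it for theta gives the right form, which is the S-shaped curve.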
Logistic Regression Note that the independent variables are not linearly related to the probability that y=1. However, the independent variables are linearly related to the logit of the dependent variable.
Logistic Regression--recap Logistic regression analysis, in other words, is very similar to OLS regression, just with a transformation of the regression formula. We also use binomial theory to conduct the tests.
Logistic Regression--interpretation With all other variables held constant, there is a constant increase of b1 in logit(p) for every 1-unit increase in x1. But remember that even though the right hand side of the model is linearly related to the logit (that is, to the natural log of the odds), what does it mean for the actual probability that y=1?
Logistic Regression It’s fairly straightforward: the effect on the odds is multiplicative. If b1 takes the value of 2.3 (and we know that exp(2.3) is approximately 10), then if x1 increases by 1, the odds that the dependent variable takes the value of 1 increase roughly tenfold.
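A small sketch of what a tenfold increase in the odds does to the probability itself. The coefficient 2.3 is the value used in the text. The starting probabilities and the helper name odds_to_prob are hypothetical, chosen only for illustration.

```python
import math

b1 = 2.3                          # coefficient from the example in the text
odds_multiplier = math.exp(b1)    # about 10

def odds_to_prob(odds):
    return odds / (1.0 + odds)

# Hypothetical starting probabilities, chosen only for illustration.
for p_before in (0.01, 0.20, 0.50, 0.90):
    odds_before = p_before / (1.0 - p_before)
    odds_after = odds_before * odds_multiplier   # a 1-unit increase in x1
    p_after = odds_to_prob(odds_after)
    print(f"p: {p_before:.2f} -> {p_after:.3f}   odds: {odds_before:.3f} -> {odds_after:.3f}")
```

The odds are always multiplied by the same factor, but the change in probability depends on where you start, which is the S-shaped curve again.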
From probability to odds to log odds Everything starts with the concept of probability. Suppose the probability of success of some event is 0.8. Then the probability of failure is 1 - 0.8 = 0.2. The odds of success are defined as the ratio of the probability of success to the probability of failure. In our example, the odds of success are 0.8 / 0.2 = 4. That is, the odds of success are 4 to 1. If the probability of success is 0.5, that is, a 50-50 chance, then the odds of success are 1 to 1.
From probability to odds to log odds The transformation from probability to odds is a monotonic transformation, meaning that the odds increase as the probability increases, and vice versa. Probability ranges from 0 to 1. Odds range from 0 to positive infinity.
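The chain from probability to odds to log odds can be traced in a few lines. This is a minimal sketch: the function name prob_to_odds is illustrative, and the probabilities include the 0.8 and 0.5 used in the text.

```python
import math

def prob_to_odds(p):
    """Odds of success: probability of success over probability of failure."""
    return p / (1.0 - p)

# Odds grow without bound as p approaches 1. Log odds are negative below p = 0.5.
for p in (0.1, 0.5, 0.8, 0.99):
    odds = prob_to_odds(p)
    print(f"p = {p:.2f}  odds = {odds:7.2f}  log odds = {math.log(odds):6.2f}")
```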
Logistic regression with no predictors In other words, the intercept of the model with no predictors is the estimated log odds of being in the honors class for the whole sample studied. We can also transform the log odds back to a probability.
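A sketch of that round trip, assuming the overall counts implied by the table used in the next section (17 men and 32 women in the honors class, out of 91 men and 109 women). The variable names are illustrative.

```python
import math

# Counts taken from the table discussed in the next section.
n_honors = 17 + 32     # students in the honors class
n_total = 91 + 109     # all students in the sample

p = n_honors / n_total                        # overall proportion in the honors class
intercept = math.log(p / (1.0 - p))           # log odds = intercept of the no-predictor model
p_back = 1.0 / (1.0 + math.exp(-intercept))   # log odds transformed back to a probability

print(f"proportion = {p:.3f}, log odds = {intercept:.4f}, back to probability = {p_back:.3f}")
```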
Logistic regression with a single dichotomous variable In our dataset, what are the odds of a male being in the honors class? What are the odds of a female being in the honors class? We can calculate these odds by hand from the table: for males, the odds of being in the honors class are (17/91) / (74/91) = 17/74 = 0.23; for females, the odds of being in the honors class are (32/109) / (77/109) = 32/77 = 0.42. The ratio of the odds for females to the odds for males is (32/77) / (17/74) = (32 * 74) / (77 * 17) = 1.809. So the odds for males are 17 to 74, the odds for females are 32 to 77, and the odds for females are about 81% higher than the odds for males.
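The hand calculation can be reproduced directly from the four counts quoted in the text. The variable names are illustrative.

```python
# Counts quoted in the text: (in honors class, not in honors class).
male_honors, male_other = 17, 74
female_honors, female_other = 32, 77

odds_male = male_honors / male_other          # 17/74
odds_female = female_honors / female_other    # 32/77
odds_ratio = odds_female / odds_male          # odds for females relative to males

print(f"odds (male)   = {odds_male:.2f}")
print(f"odds (female) = {odds_female:.2f}")
print(f"odds ratio    = {odds_ratio:.3f}")
```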
Now let us relate the odds for males and females to the output of the logistic regression. The intercept of -1.471 is the log odds for males, because male is the reference group (female = 0). Using the odds we calculated above for males, we can confirm this: log(0.23) = -1.47. The coefficient for female is the log of the odds ratio between the female group and the male group: log(1.809) = 0.593. So we can obtain the odds ratio by exponentiating the coefficient for female. Most statistical packages display both the raw regression coefficients and the exponentiated coefficients for logistic regression models.
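The link between the table and the regression output can be checked with a rough sketch that rebuilds one row per student from the quoted counts and fits the model by maximum likelihood. The Newton-Raphson loop below is a hand-written stand-in for whatever statistical package produced the original output, and all variable names are illustrative.

```python
import numpy as np

# Rebuild one row per student from the counts quoted in the text.
# female = 1 for women, 0 for men (the reference group); honors is the 0/1 outcome.
female = np.concatenate([np.zeros(91), np.ones(109)])
honors = np.concatenate([np.ones(17), np.zeros(74), np.ones(32), np.zeros(77)])

X = np.column_stack([np.ones_like(female), female])   # intercept column + predictor
beta = np.zeros(2)

# Newton-Raphson iterations for the logistic log-likelihood.
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))               # current predicted P(honors = 1)
    gradient = X.T @ (honors - p)                     # score vector
    information = X.T @ (X * (p * (1.0 - p))[:, None])  # X'WX with W = diag(p(1-p))
    beta = beta + np.linalg.solve(information, gradient)

intercept, coef_female = beta
print(f"intercept        = {intercept:.3f}")
print(f"coef for female  = {coef_female:.3f}")
print(f"exp(coef female) = {np.exp(coef_female):.3f}")
```

The fitted intercept matches the log odds for males and the exponentiated coefficient matches the odds ratio computed by hand, because a model with one dichotomous predictor reproduces the four cell counts exactly.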
Sources http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm