Class 6: Qualitative Dependent Variable Models. SKEMA Ph.D. programme. Lionel Nesta, Observatoire Français des Conjonctures Economiques.
Structure of the class
1. The linear probability model
2. Maximum likelihood estimations
3. Binary logit models and some other models
4. Multinomial models
5. Ordered multinomial models
6. Count data models
The Linear Probability Model
The linear probability model
When the dependent variable is binary (0/1; for example, Y=1 if the firm innovates, 0 otherwise), OLS is called the linear probability model. How should one interpret β_j? Provided that OLS4, E(u|X)=0, holds true, then:
E(Y|X) = β_0 + β_1X_1 + … + β_kX_k = P(Y=1|X)
The linear probability model
Y follows a Bernoulli distribution with expected value P. This model is called the linear probability model because its expected value, conditional on X, written E(Y|X), can be interpreted as the conditional probability of the occurrence of Y given values of X. β_j measures the variation of the probability of success for a one-unit variation of X_j (ΔX_j = 1).
Limits of the linear probability model (1): non-normality of errors
OLS6: the error term is independent of all RHS variables and follows a normal distribution with zero mean and variance σ². In the LPM the error term can only take two values, u = 1 - Xβ (when Y=1) or u = -Xβ (when Y=0): it follows a Bernoulli-type distribution, not a normal distribution.
[Figure: non-normality of errors, limits of the linear probability model (1)]
Limits of the linear probability model (2): heteroskedastic errors
OLS5: the variance of the error term, u, conditional on the RHS variables, is the same for all values of the RHS variables. Here the error term is itself Bernoulli distributed and its variance, Var(u|X) = P(1-P) = Xβ(1-Xβ), depends on X. Hence it is heteroskedastic.
[Figure: heteroskedastic errors, limits of the linear probability model (2)]
Limits of the linear probability model (3): fallacious predictions
By definition, a probability always lies in the unit interval [0;1], but OLS does not guarantee this condition: predictions may lie outside the bounds [0;1]. Moreover, the marginal effect is constant, since P = E(Y|X) grows linearly with X. This is not very realistic (e.g. the probability of giving birth conditional on the number of children already born).
[Figure: fallacious predictions, limits of the linear probability model (3)]
Limits of the linear probability model (4): a downward bias in the coefficient of determination R²
Observed values are 1 or 0, whereas predictions lie between 0 and 1. Comparing predicted with observed values, the goodness of fit as assessed by the R² is systematically low.
[Figure: fallacious predictions, which lower the R², limits of the linear probability model (4)]
Limits of the linear probability model: summary
1. Non-normality of errors
2. Heteroskedastic errors
3. Fallacious predictions
4. A downward bias in the R²
Overcoming the limits of the LPM
1. Non-normality of errors: increase sample size
2. Heteroskedastic errors: use robust estimators (see the sketch below)
3. Fallacious predictions: perform non-linear or constrained regressions
4. A downward bias in the R²: do not use it as a measure of goodness of fit
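A minimal Stata sketch of points 2 and 3, where the variable names (inno, lrdi, lassets, spe, biotech) are borrowed from the innovation example used later in the class and are assumptions at this stage:

regress inno lrdi lassets spe biotech, vce(robust)
predict p_lpm, xb
count if p_lpm < 0 | p_lpm > 1    // how many predictions fall outside [0;1]?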
Persistent use of the LPM
Although it has limits, the LPM is still used:
1. In the process of data exploration (early stages of the research)
2. It is a good indicator of the marginal effect for the representative observation (at the mean)
3. When dealing with very large samples, least squares can overcome the complications imposed by maximum likelihood techniques (time of computation, endogeneity and panel data problems)
The LOGIT Model
Probability, odds and logit
We need to explain the occurrence of an event: the LHS variable takes two values, y = {0;1}. In fact, we need to explain the probability of occurrence of the event, conditional on X: P(Y=y|X) ∈ [0;1]. OLS estimations are not adequate, because predictions can lie outside the interval [0;1]. We need to transform a real number, say z ∈ ]-∞;+∞[, into P(Y=y|X) ∈ [0;1]. The logistic transformation links a real number z ∈ ]-∞;+∞[ to P(Y=y|X) ∈ [0;1]. It is also called the link function.
The logit link function
P(Y=1|X) = exp(z) / (1 + exp(z))
Let us make sure that the transformation of z lies between 0 and 1: as z goes to -∞, exp(z) goes to 0 and P goes to 0; as z goes to +∞, exp(z)/(1+exp(z)) goes to 1.
The logit model
Hence the probability of any event occurring is: P(Y=1|X) = exp(z)/(1+exp(z)). But what is z?
The odds ratio
The odds ratio is defined as the ratio of the probability to its complement: odds = P(Y=1)/(1-P(Y=1)). Taking the log yields z: z = ln[P(Y=1)/(1-P(Y=1))]. Hence z is the log transform of the odds ratio. This has two important characteristics:
1. z ∈ ]-∞;+∞[ while P(Y=1) ∈ [0;1]
2. The probability is not linear in z (the plot linking z to P(Y=1) is not a straight line)
Probability, odds and logit

P(Y=1)    Odds = p(y=1)/(1-p(y=1))    Ln(odds)
0.01      1/99  = 0.01                -4.60
0.03      3/97  = 0.03                -3.48
0.05      5/95  = 0.05                -2.94
0.20      20/80 = 0.25                -1.39
0.30      30/70 = 0.43                -0.85
0.40      40/60 = 0.67                -0.41
0.50      50/50 = 1.00                 0.00
0.60      60/40 = 1.50                 0.41
0.70      70/30 = 2.33                 0.85
0.80      80/20 = 4.00                 1.39
0.95      95/5  = 19.0                 2.94
0.97      97/3  = 32.33                3.48
0.99      99/1  = 99.0                 4.60
The logit transformation
The preceding table matches levels of probability with the odds ratio. The probability varies between 0 and 1, the odds varies between 0 and +∞, and the log of the odds varies between -∞ and +∞. Notice that the distribution of the log of the odds is symmetrical.
[Figure: logistic probability density distribution]
[Figure: “The probability is not linear in z”]
The logit link function
The whole trick that overcomes the OLS problem is then to posit: z = β_0 + β_1x_1 + … + β_kx_k, so that P(Y=1|X) = exp(Xβ)/(1+exp(Xβ)). But how can we estimate the above equation knowing that we do not observe z?
Maximum likelihood estimations
OLS cannot be of much help here. We will use Maximum Likelihood Estimation (MLE) instead. MLE is an alternative to OLS. It consists of finding the parameter values which are the most consistent with the data we have. In statistics, the likelihood is defined as the joint probability of observing a given sample, given the parameters involved in the generating function. One way to distinguish between OLS and MLE is as follows: OLS adapts the model to the data you have: you only have one model, derived from your data. MLE instead supposes there is an infinity of models, and chooses the model most likely to explain your data.
Likelihood functions
Let us assume that you have a sample of n random observations. Let f(y_i) be the probability that y_i = 1 or y_i = 0. The joint probability of observing the n values of y_i is given by the likelihood function:
L = f(y_1) × f(y_2) × … × f(y_n) = ∏_i f(y_i)
We need to specify the function f(.). It comes from the empirical discrete distribution of an event that can have only two outcomes: a success (y_i = 1) or a failure (y_i = 0). This is the binomial (Bernoulli) distribution. Hence:
f(y_i) = p^(y_i) × (1-p)^(1-y_i)
Knowing p (as the logit), having defined f(.), we come up with the likelihood function:
L = ∏_i [P(Y=1|x_i)]^(y_i) × [1 - P(Y=1|x_i)]^(1-y_i),  with P(Y=1|x_i) = exp(x_iβ)/(1+exp(x_iβ))
Log likelihood (LL) functions
The log transform of the likelihood function (the log likelihood) is much easier to manipulate, and is written:
LL = Σ_i { y_i ln[P(Y=1|x_i)] + (1-y_i) ln[1 - P(Y=1|x_i)] }
The LL function can yield an infinity of values for the parameters β. Given the functional form of f(.) and the n observations at hand, which values of parameters β maximize the likelihood of my sample? In other words, what are the most likely values of my unknown parameters β given the sample I have? Maximum likelihood estimations
Maximum likelihood estimations
However, there is no analytical solution to this non-linear problem. Instead, we rely on an optimization algorithm (Newton-Raphson). The LL function is globally concave and has a maximum. The gradient is used to compute the parameters of interest, and the Hessian is used to compute the variance-covariance matrix. You need to imagine that the computer generates all possible values of β, computes a likelihood value for each (vector of) values, and then chooses the (vector of) β for which the likelihood is highest.
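A minimal sketch of what the software does under the hood, written with Stata's built-in ml command; the evaluator below simply codes the binary logit log likelihood, and the variable names (inno, lrdi, lassets, spe, biotech) are those of the example used later in the class:

program define mylogit_lf
    args lnf xb
    quietly replace `lnf' = ln(invlogit( `xb')) if $ML_y1 == 1
    quietly replace `lnf' = ln(invlogit(-`xb')) if $ML_y1 == 0
end

ml model lf mylogit_lf (inno = lrdi lassets spe biotech)
ml maximize

The result should match the logit command used below, since both maximize the same log likelihood with a Newton-type algorithm.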
Example: Binary Dependent Variable
We want to explore the factors affecting the probability of being a successful innovator (inno = 1). In the sample, 352 firms (81.7%) innovate and 79 (18.3%) do not. The odds of carrying out a successful innovation are about 4 against 1 (as 352/79 = 4.45). The log of the odds is z = 1.494. For the sample (and the population?) of firms, the probability of being innovative is four times higher than the probability of NOT being innovative.
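A quick check of these figures in Stata, assuming the class dataset is in memory (a sketch, not part of the original do-file):

tabulate inno
display 352/79        // odds of innovating, about 4.45
display ln(352/79)    // log odds, about 1.494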
Logistic Regression with STATA
Stata instruction: logit
logit y x1 x2 x3 … xk [if] [weight] [, options]
Options:
- noconstant: estimates the model without the constant
- robust: estimates robust variances, also in case of heteroskedasticity
- if: allows to select the observations we want to include in the analysis
- weight: allows to weight different observations
Logistic Regression with STATA
Let's start and run a constant-only model:
logit inno
[Output: goodness of fit; parameter estimates, standard errors and z values]
Interpretation of Coefficients
What does this simple model tell us? Remember that we need to use the logit formula to transform the logit into a probability:
P(Y=1) = exp(z)/(1+exp(z))
Interpretation of Coefficients
The constant must be interpreted as the log of the odds ratio. Using the logit link function, the average probability to innovate is:
dis exp(_b[_cons])/(1+exp(_b[_cons]))
We find exactly the empirical sample value: 81.7%.
Interpretation of Coefficients
A positive coefficient indicates that the probability of innovation success increases with the corresponding explanatory variable; a negative coefficient implies that the probability to innovate decreases with the corresponding explanatory variable. Warning! One of the problems encountered in interpreting probabilities is their non-linearity: probabilities do not vary in the same way according to the level of the regressors. This is the reason why, in practice, it is usual to compute the probability of the event occurring at the average point of the sample.
Let’s run the more complete model logit inno lrdi lassets spe biotech
Interpretation of Coefficients
Using the sample mean values of lrdi, lassets, spe and biotech, we compute the conditional probability:
P(Y=1|X̄) = exp(X̄β)/(1+exp(X̄β))
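A sketch of two ways to obtain this probability after the logit above; margins is the built-in alternative to the user-written prvalue command used below, and e(sample) restricts the means to the estimation sample:

margins, atmeans
local xb = _b[_cons]
foreach v of varlist lrdi lassets spe biotech {
    quietly summarize `v' if e(sample)
    local xb = `xb' + _b[`v'] * r(mean)
}
display invlogit(`xb')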
Marginal Effects
It is often useful to know the marginal effect of a regressor on the probability that the event (innovation) occurs. As the probability is a non-linear function of the explanatory variables, the change in probability due to a change in one of the explanatory variables is not identical when the other variables are at the mean, the median, the first quartile, etc. prvalue provides the predicted probabilities of a logit model (or any other):
prvalue
prvalue, x(lassets=10) rest(mean)
prvalue, x(lassets=11) rest(mean)
prvalue, x(lassets=12) rest(mean)
prvalue, x(lassets=10) rest(median)
prvalue, x(lassets=11) rest(median)
prvalue, x(lassets=12) rest(median)
Marginal Effects
prchange provides the marginal effect of each explanatory variable for various changes in its value:
prchange [varlist] [if] [in range], x(variables_and_values) rest(stat) fromto
prchange
prchange, fromto
prchange, fromto x(size=10.5) rest(mean)
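In recent versions of Stata, the built-in margins command reports comparable quantities (a sketch; prvalue and prchange come from the user-written SPost package):

margins, dydx(*) atmeans    // marginal effects at the sample mean
margins, dydx(*)            // average marginal effects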
Goodness of Fit Measures
In ML estimations, there is no such measure as the R², but the log likelihood can be used to assess the goodness of fit. Note the following:
- The higher the number of observations, the lower the joint probability, and the further the LL measure goes towards -∞.
- Given the number of observations, the better the fit, the higher the LL measure (since it is always negative, the closer to zero it is).
The philosophy is to compare two models by looking at their LL values: one is meant to be the constrained model, the other one the unconstrained model.
Goodness of Fit Measures
A model is said to be constrained when the observer sets the parameters associated with some variables to zero. A model is said to be unconstrained when the observer relaxes this assumption and allows the parameters associated with these variables to differ from zero. For example, we can compare two models, one with no explanatory variables and one with all our explanatory variables. The one with no explanatory variables implicitly assumes that all parameters are equal to zero. Hence it is the constrained model, because we (implicitly) constrain the parameters to be nil.
The likelihood ratio test (LR test)
The most used measure of goodness of fit in ML estimations is the likelihood ratio. The LR statistic is twice the difference between the log likelihoods of the unconstrained and the constrained models:
LR = 2 × (LL_unconstrained - LL_constrained)
This statistic is distributed χ², with degrees of freedom equal to the number of constrained parameters. If the difference in the LL values is (not) important, it is because the set of explanatory variables brings in (in)significant information. The null hypothesis H0 is that the explanatory variables bring no significant information: H0: β1 = β2 = … = βk = 0. High LR values will lead the observer to reject H0 and accept the alternative hypothesis Ha that the set of explanatory variables does significantly explain the outcome.
The McFadden Pseudo R²
We also use the McFadden pseudo-R² (1973). Its interpretation is analogous to the OLS R². However, it is biased downward and generally remains low. The pseudo-R² also compares the unconstrained and the constrained models:
Pseudo-R² = 1 - LL_unconstrained / LL_constrained
and lies between 0 and 1.
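Both quantities can be recovered from the results Stata stores after a logit, where e(ll) is the log likelihood of the fitted model and e(ll_0) that of the constant-only model (a sketch reproducing what the output header already reports):

logit inno lrdi lassets spe biotech
display 2*(e(ll) - e(ll_0))    // LR chi2 statistic
display 1 - e(ll)/e(ll_0)      // McFadden pseudo-R2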
[Output: goodness of fit measures for the constrained and unconstrained models]
Other usage of the LR test
The LR test can also be generalized to compare any two models, the constrained one being nested in the unconstrained one. Any variable which is added to a model can be tested for its explanatory power as follows:
logit [constrained model]
est store [name1]
logit [unconstrained model]
est store [name2]
lrtest name2 name1
[Output: LR test on the added variable (biotech)]
Quality of predictions
Lastly, one can compare the quality of the predictions with the observed outcome variable (dummy variable). One must assume that when the predicted probability is higher than 0.5, the model predicts that the event will occur (it is the most likely outcome). One can then compare how good the predictions are relative to the actual outcome variable. Stata does this for us:
estat class
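A sketch of the same computation done by hand with a 0.5 threshold, after the logit above (the new variable names are illustrative):

predict phat, pr
generate byte inno_hat = (phat > .5) if phat < .
tabulate inno inno_hat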
[Output: classification table from estat class]
Other Binary Choice models
The logit model is only one way of modeling binary choice. The probit model is another way of modeling binary choice; it is actually more used than the logit model and assumes a normal distribution (not a logistic one) for the z values. The complementary log-log model is used when the occurrence of the event is very rare, the distribution of z being asymmetric.
Probit model: P(Y=1|X) = Φ(Xβ), where Φ(.) is the standard normal cumulative distribution function.
Complementary log-log model: P(Y=1|X) = 1 - exp(-exp(Xβ)).
Likelihood functions and Stata commands
Example:
logit inno rdi lassets spe pharma
probit inno rdi lassets spe pharma
cloglog inno rdi lassets spe pharma
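The three sets of estimates can be stored and displayed side by side with built-in commands (a sketch of how the comparison table below could be assembled):

quietly logit   inno rdi lassets spe pharma
estimates store m_logit
quietly probit  inno rdi lassets spe pharma
estimates store m_probit
quietly cloglog inno rdi lassets spe pharma
estimates store m_cloglog
estimates table m_logit m_probit m_cloglog, b(%9.3f) t stats(N ll)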
[Figure: probability density functions]
[Figure: cumulative distribution functions]
Comparison of models

                      OLS         Logit       Probit      C log-log
Ln(R&D intensity)     [3.90]***   [3.57]***   [3.46]***   [3.13]***
ln(Assets)            [8.58]***   [7.29]***   [7.53]***   [7.19]***
Spe                   [1.11]      [1.01]      [0.98]      [0.76]
Biotech dummy         [7.49]***   [6.58]***   [6.77]***   [6.51]***
Constant              [3.91]**    [6.01]***   [6.12]***   [6.08]***
Observations          431

Absolute t value in brackets (OLS), z value for other models. * 10%, ** 5%, *** 1%.
Comparison of marginal effects (OLS, Logit, Probit, C log-log) for Ln(R&D intensity), ln(Assets), Specialisation and the Biotech dummy. For the logit, probit and cloglog models, marginal effects have been computed for a one-unit variation (around the mean) of the variable at stake, holding all other variables at their sample mean values.
Multinomial LOGIT Models
Multinomial models
Let us now focus on the case where the dependent variable has several outcomes (or is multinomial). For example, innovative firms may need to collaborate with other organizations. One can code this type of interaction as follows:
- Collaborate with universities (modality 1)
- Collaborate with large incumbent firms (modality 2)
- Collaborate with SMEs (modality 3)
- Do it alone (modality 4)
Or, studying firm survival:
- Survival (modality 1)
- Liquidation (modality 2)
- Mergers & acquisitions (modality 3)
Multinomial models
One could first perform three logistic regressions, one for each outcome against all others, where 1 = survival, 2 = liquidation, 3 = M&A (see the sketch below).
Exercise:
1. Open the file mlogit.dta
2. Estimate, for each type of outcome, the conditional probability of the event for the representative firm, using:
- time (log_time)
- size (log_labour)
- firm age (entry_age)
- spin-out (spin_out)
- cohort (cohort_*)
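A sketch of these three separate logits in Stata; the outcome and regressor names are taken from the mlogit example shown later in the class (type_exit, log_time, log_labour, entry_age, entry_spin, cohort_*), and the coding 1/2/3 of type_exit is an assumption here:

use mlogit.dta, clear
forvalues j = 1/3 {
    generate byte out`j' = (type_exit == `j')
    logit out`j' log_time log_labour entry_age entry_spin cohort_*
}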
The need for multinomial models
Multinomial models
First, the sum of all conditional probabilities should add up to unity: P(Y=1|X) + P(Y=2|X) + … + P(Y=k|X) = 1. Second, for k outcomes, we only need to estimate (k-1) modalities. Hence:
P(Y=k|X) = 1 - [P(Y=1|X) + … + P(Y=k-1|X)]
Multinomial logit models
Third, the multinomial model is a simultaneous (as opposed to sequential) estimation model comparing the odds of each modality with respect to all others. With three outcomes, we have:
ln[P(Y=1)/P(Y=2)] = Xβ12 ;  ln[P(Y=1)/P(Y=3)] = Xβ13 ;  ln[P(Y=2)/P(Y=3)] = Xβ23
Multinomial logit models
Note that there is redundancy, since: β23 = β13 - β12. Fourth, the multinomial logit model estimates only (k-1) sets of parameters, with the following constraint: the parameters of the base outcome are set to zero.
Multinomial logit models
With k outcomes, the probability of occurrence of event j reads:
P(Y=j|X) = exp(Xβj) / [1 + Σ_{m=1}^{k-1} exp(Xβm)],  for j = 1, …, k-1
By convention, outcome 0 is the base outcome, whose parameters are set to zero.
Multinomial logit models
Note that the probability of the base outcome is then:
P(Y=0|X) = 1 / [1 + Σ_{m=1}^{k-1} exp(Xβm)]
Binomial logit as multinomial logit
Let us rewrite the probability of the event Y=1:
P(Y=1|X) = exp(Xβ) / [1 + exp(Xβ)]  and  P(Y=0|X) = 1 / [1 + exp(Xβ)]
The binomial logit is a special case of the multinomial logit where only two outcomes are being analyzed.
Likelihood functions
Let us assume that you have a sample of n random observations. Let f(y_i) be the probability that y_i = j. The joint probability of observing the n values of y_i is given by the likelihood function:
L = ∏_i f(y_i)
We need to specify the function f(.). It comes from the empirical discrete distribution of an event that can have several outcomes. This is the multinomial distribution. Hence:
f(y_i) = ∏_j P(Y=j|x_i)^(d_ij),  where d_ij = 1 if y_i = j and 0 otherwise
The maximum likelihood function
The maximum likelihood function reads:
L = ∏_i ∏_j P(Y=j|x_i)^(d_ij),  with P(Y=j|x_i) = exp(x_iβj) / Σ_m exp(x_iβm)
The maximum likelihood function
The log transform of the likelihood yields:
LL = Σ_i Σ_j d_ij ln P(Y=j|x_i)
Multinomial logit models
Stata instruction: mlogit
mlogit y x1 x2 x3 … xk [if] [weight] [, options]
Options:
- noconstant: omits the constant
- robust: controls for heteroskedasticity
- if: selects observations
- weight: weights observations
Multinomial logit models
use mlogit.dta, clear
mlogit type_exit log_time log_labour entry_age entry_spin cohort_*
[Output: goodness of fit; parameter estimates, standard errors and z values. The base outcome, chosen by Stata, is the one with the highest empirical frequency.]
Interpretation of coefficients
The interpretation of coefficients always refers to the base category. Does the probability of being bought out decrease over time? No! Relative to survival, the probability of being bought out decreases over time.
Interpretation of coefficients
The interpretation of coefficients always refers to the base category. Is the probability of being bought out lower for spinoffs? No! Relative to survival, the probability of being bought out is lower for spinoffs.
Interpretation of coefficients
Relative to liquidation, the probability of being bought out is higher for spinoffs:
lincom [boughtout]entry_spin - [death]entry_spin
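After the mlogit above, the predicted probability of each outcome can also be inspected directly (a sketch; the order of the new variables follows the coding of type_exit, and the names below are illustrative):

predict p_survival p_liquidation p_boughtout, pr
summarize p_survival p_liquidation p_boughtout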
Changing base outcome
mcross provides other estimates by changing the base outcome. Mind the new base outcome! Being bought out relative to liquidation: relative to liquidation, the probability of being bought out is higher for spinoffs.
Changing base outcome
mcross provides other estimates by changing the base outcome, and we observe the same results as before.
Independence of irrelevant alternatives (IIA)
The model assumes that each pair of outcomes is independent from all other alternatives: for the choice between any two outcomes, the other alternatives are irrelevant. From a statistical viewpoint, this is tantamount to assuming independence of the error terms across pairs of alternatives. A simple way to test the IIA property is to estimate the model leaving out one modality (the restricted model), and to compare the parameters with those of the complete model:
- If IIA holds, the parameters should not change significantly
- If IIA does not hold, the parameters should change significantly
Independence of irrelevant alternatives (IIA)
H0: The IIA property is valid
H1: The IIA property is not valid
The H statistic (H stands for Hausman) is
H = (β_R - β_U)' [Var(β_R) - Var(β_U)]^(-1) (β_R - β_U)
where R denotes the restricted model and U the complete (unrestricted) model. H follows a χ² distribution with M degrees of freedom (M being the number of parameters).
STATA application: the IIA test
H0: The IIA property is valid
H1: The IIA property is not valid
mlogtest, hausman
[Output: Hausman tests of the IIA assumption, one row per omitted outcome]
Application of the IIA test
mlogtest, hausman
H0: The IIA property is valid
H1: The IIA property is not valid
We compare the parameters of the model “liquidation relative to bought-out” estimated simultaneously with “survival relative to bought-out” with the parameters of the model “liquidation relative to bought-out” estimated without “survival relative to bought-out”.
Application of the IIA test
mlogtest, hausman
H0: The IIA property is valid
H1: The IIA property is not valid
The conclusion is that the outcome survival significantly alters the choice between liquidation and bought-out. In fact, for a company, being bought out must be seen as a way to remain active, at the cost of losing control over economic decisions, notably investment.
Ordered Multinomial LOGIT Models
Ordered multinomial models
Let us now concentrate on the case where the dependent variable is a discrete integer which indicates an intensity. Opinion surveys make extensive use of such so-called Likert scales:
- Obstacles to innovation (scale from 1 to 5)
- Intensity of collaboration (scale from 1 to 5)
- Marketing surveys (does not appreciate (1) to appreciates (7))
- Student grades
- Opinion tests
- Etc.
Ordered multinomial models
Such variables describe an ordered scale, so that one can think of them as indicating the interval in which an unobserved latent variable y* lies:
y = 1 if y* ≤ α1
y = 2 if α1 < y* ≤ α2
…
y = k if y* > α(k-1)
where the αj are unknown bounds to be estimated.
Ordered multinomial models
We assume that the latent variable y* is a linear combination of the set of all explanatory variables:
y*_i = x_iβ + u_i
where u_i follows a cumulative distribution function F(.). The probabilities of each observed outcome y (y ≠ y*) then follow the cdf F(.). Let us look at the probability that y = 1:
P(y_i = 1) = P(y*_i ≤ α1) = P(u_i ≤ α1 - x_iβ) = F(α1 - x_iβ)
Ordered multinomial models
The probability that y = 2 is:
P(y_i = 2) = P(α1 < y*_i ≤ α2) = F(α2 - x_iβ) - F(α1 - x_iβ)
Altogether we have:
P(y_i = j) = F(αj - x_iβ) - F(α(j-1) - x_iβ),  with α0 = -∞ and αk = +∞
[Figure: probability regions in an ordered model, outcomes y=1, y=2, y=3, …, y=k over the distribution of u_i]
The likelihood function
The likelihood function is:
L = ∏_i ∏_j [F(αj - x_iβ) - F(α(j-1) - x_iβ)]^(d_ij),  where d_ij = 1 if y_i = j and 0 otherwise
The likelihood function
If u_i follows a logistic distribution, the (log) likelihood function reads:
LL = Σ_i Σ_j d_ij ln[Λ(αj - x_iβ) - Λ(α(j-1) - x_iβ)],  where Λ(z) = exp(z)/(1+exp(z))
Ordered multinomial logit models
Stata instruction: ologit
ologit y x1 x2 x3 … xk [if] [weight] [, options]
Options:
- noconstant: omits the constant
- robust: controls for heteroskedasticity
- if: selects observations
- weight: weights observations
Ordered multinomial models
use est_var_qual.dta, clear
ologit innovativeness size rdi spe biotech
[Output: goodness of fit; estimated parameters; cutoff points]
Interpretation of coefficients
A positive (negative) sign indicates a positive (negative) relationship between the independent variable and the order (or rank). How does one interpret the cutoff values? The model is y* = xβ + u. What is then the probability that Y = 1, P(Y=1)? It is the probability that the score lies below the first cutoff point:
P(Y=1) = F(cut1 - xβ)
Interpretation of coefficients
What is the probability that Y = 2, P(Y=2)? It is the probability that the score lies between the first and the second cutoff points:
P(Y=2) = F(cut2 - xβ) - F(cut1 - xβ)
STATA computation of predicted probabilities
prvalue computes the predicted probabilities
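These probabilities can also be recovered with built-in post-estimation commands (a sketch, assuming the innovativeness score has five categories; adjust the number of new variable names to the actual number of outcomes):

predict pr1 pr2 pr3 pr4 pr5, pr
summarize pr1-pr5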
Count Data Models Part 1 The Poisson Model
Count data models
Let us now focus on outcomes counting the number of occurrences of a given event: analyzing the number of innovations, the number of patents, of inventions. Again, OLS fails to meet the constraint that the predictions must be nil or positive. To explain count variables, we assume that the dependent variable follows a Poisson distribution.
Poisson models
Let Y be a random count variable. The probability that Y is equal to the integer y_i is given by the Poisson probability density:
P(Y = y_i) = exp(-λ_i) λ_i^(y_i) / y_i!
To introduce the set of explanatory variables into the model, we condition λ_i on them and impose the following log-linear form:
λ_i = exp(x_iβ),  i.e.  ln λ_i = x_iβ
[Figure: Poisson distributions]
The likelihood function
The (log) likelihood function reads:
LL = Σ_i [ -λ_i + y_i·x_iβ - ln(y_i!) ],  with λ_i = exp(x_iβ)
Poisson models
Stata instruction: poisson
poisson y x1 x2 x3 … xk [if] [weight] [, options]
Options:
- noconstant: omits the constant
- robust: controls for heteroskedasticity
- if: selects observations
- weight: weights observations
Poisson models
use est_var_qual.dta, clear
poisson patent lrdi lassets spe biotech
[Output: goodness of fit; estimated parameters]
Interpretation of coefficients
If variables are entered in log, one can interpret the coefficients as elasticities: a one per cent increase in firm size is associated with a 0.47% increase in the expected number of patents.
Interpretation of coefficients
If variables are entered in log, one can interpret the coefficients as elasticities: a one per cent increase in R&D investment is associated with a 0.69% increase in the expected number of patents.
Interpretation of coefficients
If variables are not entered in log, the interpretation changes: the effect is 100 × (e^β - 1). A one-point rise in the degree of specialisation is associated with a 113% increase in the expected number of patents.
Interpretation of coefficients
For dummy variables, the interpretation changes slightly, using the same transformation 100 × (e^β - 1): biotechnology firms have an expected number of patents which is 191% higher than pharmaceutical companies.
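A sketch of this transformation computed from the stored coefficients after the poisson command above (coefficient names follow that command):

display 100*(exp(_b[spe]) - 1)        // one-point rise in specialisation
display 100*(exp(_b[biotech]) - 1)    // biotech firms relative to pharma firms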
Interpretation of coefficients
All variables are very significant … but …
Count Data Models Part 2 Negative Binomial Models
Negative binomial models
Generally, the Poisson model is not valid, due to the presence of overdispersion in the data. This violates the equality between the mean and the variance of the dependent variable implied by the Poisson model. The negative binomial model treats this problem by adding an unobserved heterogeneity term u_i to the log-linear form:
ln λ_i = x_iβ + u_i,  i.e.  λ_i = exp(x_iβ)·exp(u_i)
Negative binomial models
The density of y_i is obtained by integrating out u_i. Assuming that exp(u_i) is distributed Gamma with mean 1 and variance α, the density of y_i reads:
f(y_i | x_i) = [Γ(y_i + 1/α) / (Γ(y_i + 1) Γ(1/α))] · [1/(1 + αλ_i)]^(1/α) · [αλ_i/(1 + αλ_i)]^(y_i)
Likelihood functions
The log likelihood is then:
LL = Σ_i { ln Γ(y_i + 1/α) - ln Γ(1/α) - ln Γ(y_i + 1) + y_i ln[αλ_i/(1 + αλ_i)] - (1/α) ln(1 + αλ_i) }
where α is the overdispersion parameter.
Negative binomial models
Stata instruction: nbreg
nbreg y x1 x2 x3 … xk [if] [weight] [, options]
Options:
- noconstant: omits the constant
- robust: controls for heteroskedasticity
- if: selects observations
- weight: weights observations
Negative binomial models
use est_var_qual.dta, clear
nbreg PAT rdi size spe biotech
[Output: goodness of fit; estimated parameters; overdispersion parameter; overdispersion test]
Interpretation of coefficients
If variables are entered in log, one can still interpret the coefficients as elasticities: a one per cent increase in firm size is associated with a 0.61% increase in the expected number of patents.
Interpretation of coefficients
If variables are entered in log, one can still interpret the coefficients as elasticities: a one per cent increase in R&D investment is associated with a 0.78% increase in the expected number of patents.
Interpretation of coefficients
If variables are not entered in log, the interpretation changes: the effect is 100 × (e^β - 1). A one-point rise in the degree of specialisation is associated with a 129% increase in the expected number of patents.
Interpretation of coefficients
For dummy variables, the interpretation follows the same transformation, 100 × (e^β - 1): biotechnology firms have an expected number of patents which is 352% higher than pharmaceutical companies.
Overdispersion test
We use the LR test to compare the negative binomial model with the Poisson model. The results indicate that the probability of rejecting H0 wrongly is almost nil (H0: alpha = 0). Hence there is overdispersion in the data and, as a consequence, one should use the negative binomial model.
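nbreg reports this LR test of alpha = 0 at the bottom of its output. A sketch of the same comparison made explicit, with variable names taken from the Poisson example above; the force option is needed because the two models are fit by different commands, and nbreg's own test uses a boundary-corrected distribution, so the p-values can differ slightly:

quietly poisson patent lrdi lassets spe biotech
estimates store m_pois
quietly nbreg patent lrdi lassets spe biotech
estimates store m_nb
lrtest m_nb m_pois, force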
Compared with the Poisson estimates, the negative binomial estimates have larger standard errors and lower z values.
Extensions
ML estimators
All models can be extended to a panel context to take full account of unobserved heterogeneity:
- Fixed effects
- Random effects
Heckman models:
- Selection bias
- Two equations, one of which models the probability of being observed
Survival models:
- Discrete time (complementary log-log, logit)
- Continuous time (Cox model)
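A sketch of the Stata counterparts of these extensions; the commands are standard built-ins, but the variable names (firm_id, year, y, x1, x2, d, z1, z2, duration, exit) are placeholders:

xtset firm_id year
xtlogit y x1 x2, fe                  // fixed-effects (conditional) logit
xtlogit y x1 x2, re                  // random-effects logit
xtpoisson y x1 x2, fe                // fixed-effects Poisson
heckman y x1 x2, select(d = z1 z2)   // selection model
stset duration, failure(exit)
stcox x1 x2                          // Cox proportional hazards model
cloglog exit x1 x2                   // discrete-time hazard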