1
Quantitative research methods in business administration
Lecture 3: Multivariate analysis
OLS, endogeneity bias, 2SLS, panel data
Exemplified with SPSS and Stata
Taylan Mavruk
2
1. Multivariate analysis – cross tabulation
Examining the relation between three qualitative variables
1. We want to investigate the "strength" of our already tested bivariate dependencies. This is done by introducing a third variable (a test or control variable).
2. The test variable (z) must be at the same measurement level as our independent variable (x) (nominal or ordinal).
3. The purpose is to be able to answer two questions:
a. What other variables (z1, z2, z3, ...) may offer an explanation of the variance in y?
b. What happens to our original dependency (x → y) if we introduce one or more test variables?
3
Adding a test variable to case 1
Definition: the combination of frequencies of three (qualitative) variables
Example: F7b_a * Stratum * gender dominance
4
Test of independence in cross-tabulation (case 1 continued)
Is the identified dependency between the variables statistically significant, or is it due to chance? We use a hypothesis test for dependency (the Chi²-test):
H0: The company size – sick leave dependency is independent of gender
H1: The company size – sick leave dependency depends on gender
The Chi²-test tells us that we can reject H0, since the likelihood that the difference is due to chance is less than 5%. This tells us that our original hypothesis, that there is a dependency between company size and sick leave, holds for both male- and female-dominated organizations.
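The Chi²-test of independence can be sketched numerically. A minimal numpy example with made-up 2×2 counts (company size by sick leave; the counts are illustrative assumptions, not the survey data from the lecture):

```python
import numpy as np

# Hypothetical 2x2 contingency table: rows = small/large company,
# columns = low/high sick leave (illustrative counts only)
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# expected counts under H0 (independence): row total * column total / n
expected = row_tot @ col_tot / n

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# the 5% critical value for df = 1 is about 3.84
reject_h0 = chi2 > 3.84
```

With these counts the statistic is about 16.7 with 1 degree of freedom, so H0 (independence) would be rejected at the 5% level.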
5
Simple linear regression
Simple linear regression uses one independent variable to explain the dependent variable. Some relationships are too complex to be described using a single independent variable. Multiple regression models use two or more independent variables to describe the dependent variable, which allows them to handle more complex situations. There is no limit to the number of independent variables a model can use (although k < n should hold). Multiple regression has only one dependent variable.
6
What are we studying? (Conditional expectations)
The aim is to study the effect of a variable w on the expected value of y, holding fixed a vector of controls c: the partial effect of changing w on E(y | w, c) (the structural equation), holding c constant. If w is continuous, the partial effect is ∂E(y | w, c)/∂w. If w is a dummy variable, the partial effect is E(y | w = 1, c) − E(y | w = 0, c). Estimating partial effects such as these in practice is difficult, primarily because of the unobservability problem: typically, not all elements of the vector c are observed, and perfectly measured, in your data. Thus, we might have problems such as omitted variables and other sources of endogeneity bias. Ref: Wooldridge (2002)
7
What can we study besides partial effects?
Elasticity of the conditional expected value of y with respect to, say, x1: ∂log E(y | x)/∂log x1. Semi-elasticity: ∂log E(y | x)/∂x1. The elasticity shows how much E(y | x) changes, in percentage terms, in response to a 1% increase in x1. The semi-elasticity shows how much E(y | x) changes, in percentage terms, in response to a one-unit increase in x1. Ref: Wooldridge (2002)
8
The Multiple Regression Model – OLS Estimation
A population model that is linear in parameters: y = β0 + β1 x1 + β2 x2 + … + βk xk + u, where y, x1, …, xk are observable variables, u is the unobservable random disturbance term, and β0, β1, …, βk are the parameters that we want to estimate. Properties of the error term allow us to decide whether OLS is the appropriate estimator. For OLS to consistently estimate the coefficients (also called the parameters), in the asymptotic sense, (i) E(u) = 0 and (ii) Cov(xj, u) = 0 for all j should hold.
10
The Regression Model Assumptions (continued)
Mean of zero assumption: the mean of the error terms is equal to 0.
Constant variance assumption: the variance of the error terms, σ², is the same for every combination of values of x1, x2, …, xk.
Normality assumption: the error terms follow a normal distribution for every combination of values of x1, x2, …, xk.
Independence assumption: the values of the error terms are statistically independent of each other.
Exogeneity: Cov(x, u) = 0.
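Under these assumptions, OLS recovers the population parameters in large samples. A minimal numpy sketch on simulated data (the true parameters 1.0, 2.0 and −0.5 are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
u = rng.normal(size=n)               # mean-zero error, uncorrelated with x1, x2

y = 1.0 + 2.0 * x1 - 0.5 * x2 + u    # true parameters: beta0=1, beta1=2, beta2=-0.5

# OLS: regress y on a constant, x1 and x2
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With n = 5000, the estimated coefficients land close to the assumed true values.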
11
R² and Adjusted R²
1. Total variation: Σ(yi − ȳ)²
2. Explained variation: Σ(ŷi − ȳ)²
3. Unexplained variation: Σ(yi − ŷi)²
4. Total variation is the sum of explained and unexplained variation
5. R² is the ratio of explained variation to total variation
12
The Adjusted R²
Adding an independent variable to a multiple regression will raise R² (or leave it unchanged). R² will rise slightly even if the new variable has no relationship to y. The adjusted R² corrects this tendency in R². As a result, it gives a better estimate of the importance of the independent variables.
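The variance decomposition and both R² measures can be sketched directly. In this simulated example (all coefficients are illustrative assumptions), the second regressor is deliberately irrelevant, so the adjusted R² comes out below R²:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)  # second regressor irrelevant

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

sst = ((y - y.mean()) ** 2).sum()       # total variation
sse = ((y_hat - y.mean()) ** 2).sum()   # explained variation
ssr = ((y - y_hat) ** 2).sum()          # unexplained variation

r2 = sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalizes extra regressors
```

The identity total = explained + unexplained holds exactly for OLS with an intercept.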
13
Multiple Correlation Coefficient R
R is simply the square root of R². With simple linear regression, r takes on the sign of b1. With multiple regression there are multiple bi's, so R is always positive. To interpret the direction of the relationship between the x's and y, you must look at the sign of the appropriate bi coefficient.
14
Testing the model – The Overall F Test
To test H0: β1 = β2 = … = βk = 0 versus Ha: at least one of β1, β2, …, βk ≠ 0.
The test statistic is F(model) = (explained variation / k) / (unexplained variation / (n − (k + 1))).
Reject H0 in favor of Ha if F(model) > Fα or p-value < α.
Fα is based on k numerator and n − (k + 1) denominator degrees of freedom.
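The overall F statistic can be computed from the same variance decomposition. A minimal sketch on simulated data (true slopes and the quoted critical value are illustrative assumptions; the 5% critical value of F(2, 97) is roughly 3.09):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)   # slopes are nonzero

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

sse = ((y_hat - y.mean()) ** 2).sum()   # explained variation
ssr = ((y - y_hat) ** 2).sum()          # unexplained variation

# F(model) = (explained / k) / (unexplained / (n - (k + 1)))
f_stat = (sse / k) / (ssr / (n - (k + 1)))

# approximate 5% critical value for F with (2, 97) degrees of freedom
reject_h0 = f_stat > 3.09
```

Because the true slopes are far from zero, the F statistic is large and H0 is rejected.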
15
Testing the Significance of an Independent Variable
A variable in a multiple regression model is not likely to be useful unless there is a significant relationship between it and y. To test significance, we use the null hypothesis H0: βj = 0 versus the alternative hypothesis Ha: βj ≠ 0.
16
Testing the Significance of an Independent Variable (continued)
Alternative     Reject H0 if      p-value
Ha: βj > 0      t > tα            area under the t distribution to the right of t
Ha: βj < 0      t < −tα           area under the t distribution to the left of t
Ha: βj ≠ 0      |t| > tα/2 *      twice the area under the t distribution to the right of |t|
* That is, t > tα/2 or t < −tα/2
17
Testing the Significance of an Independent Variable (continued)
If we can reject H0: βj = 0 at the 0.05 level of significance, we have strong evidence that the independent variable xj is significantly related to y. If we can reject H0: βj = 0 at the 0.01 level, we have very strong evidence. The smaller the significance level α at which H0 can be rejected, the stronger the evidence that xj is significantly related to y.
18
Endogeneity bias
The assumption E(u) = 0 is innocuous, as the intercept would pick up a non-zero mean in u. The crucial assumption is zero covariance, Cov(x, u) = 0. If this assumption does not hold, say because x1 is correlated with u, we say that x1 is endogenous. Why is endogeneity a problem? Remember our aim: we try to estimate the partial effect consistently. This implies that, in the probability limit, the beta estimated from a sample should equal the population beta. This is not the case in the presence of endogeneity, so we cannot use OLS. NB: consistency vs. efficiency.
19
Sources of Endogeneity
In principle, the problem of endogeneity may arise whenever economists make use of non-experimental data, because in that setting you can never be totally certain what is driving what. Endogeneity typically arises in one of three ways:
Omitted variables: Economic theory says we should control for one or more additional variables in our model, but, typically because we do not have the data, we cannot. Folding the omitted variable into the error term, the OLS slope on x1 converges to β1 plus a second term proportional to the covariance between x1 and the omitted variable. OLS will be a consistent estimator if this second term is zero; if it is not, the OLS estimate will probably be biased upward.
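Omitted-variable bias is easy to see in a simulation. A minimal numpy sketch (all coefficients are illustrative assumptions): the "short" regression omitting the confounder overstates the effect, while the "long" regression recovers it:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
c = rng.normal(size=n)                       # the control we "cannot observe"
x = 0.8 * c + rng.normal(size=n)             # x is positively correlated with c
y = 1.0 * x + 1.0 * c + rng.normal(size=n)   # true partial effect of x is 1.0

# short regression omitting c: slope = Cov(x, y) / Var(x), biased upward here
b_short = np.cov(x, y)[0, 1] / x.var(ddof=1)

# long regression including c recovers the true partial effect
X = np.column_stack([np.ones(n), x, c])
b_long = np.linalg.lstsq(X, y, rcond=None)[0][1]
```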
20
Sources of Endogeneity
Measurement error: When we collect survey data we try to make sure the information we get from the respondents conforms as closely as possible to the variables we have in mind for our analysis, yet it is inevitable that measurement errors creep into the data. Measurement errors may well result in (econometric) endogeneity bias. Consider the model y = β0 + β1 x1* + u. Now suppose we observe a noisy measure x1 = x1* + v; our model now looks like y = β0 + β1 x1 + (u − β1 v). Observe that the measurement error v is correlated with x1 and enters the residual, so the OLS estimate of β1 will be inconsistent. Measurement error leads to three interesting results.
21
Sources of Endogeneity
1. Attenuation bias (the iron law of econometrics): if σv² > 0, then plim β̂1 will always be closer to zero than β1.
2. Signal-to-noise ratio: the severity of the attenuation bias depends on the ratio σx*²/σv²; if this ratio is large, the attenuation bias will be small, and vice versa.
3. The sign of plim β̂1 will always be the same as that of the structural parameter β1.
What we learn from measurement error is that the sign of the coefficient will not change (asymptotically).
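Attenuation bias can be sketched in a simulation. With the assumed variances below (signal and noise both 1), the plim of the slope on the mismeasured regressor is the true slope times σx*²/(σx*² + σv²) = 2.0 × 0.5 = 1.0:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
x_true = rng.normal(size=n)                  # true regressor, variance 1
y = 2.0 * x_true + rng.normal(size=n)        # true slope is 2.0 (assumed)

x_obs = x_true + rng.normal(size=n)          # noisy measure, noise variance 1

# OLS slope on the mismeasured regressor: attenuated toward zero, same sign
b_obs = np.cov(x_obs, y)[0, 1] / x_obs.var(ddof=1)
```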
22
Sources of Endogeneity
Simultaneity: Simultaneity arises when at least one of the explanatory variables is determined simultaneously along with the dependent variable. Consider the model y1 = α1 y2 + u1 and y2 = α2 y1 + u2. The problem here is that y2 both determines y1 and depends on it. More specifically, because u1 affects y1 through the first equation, and y1 in turn affects y2, it follows that y2 will be correlated with u1. So, what can we do to overcome the endogeneity bias?
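Simultaneity bias can also be demonstrated by simulation. A minimal sketch with assumed structural parameters α1 = α2 = 0.5 (for these values the plim of the naive OLS slope is 0.8, not the true 0.5):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000
u1 = rng.normal(size=n)
u2 = rng.normal(size=n)
a1, a2 = 0.5, 0.5                  # structural parameters (assumed for illustration)

# solve the simultaneous system y1 = a1*y2 + u1, y2 = a2*y1 + u2
y1 = (u1 + a1 * u2) / (1 - a1 * a2)
y2 = (a2 * u1 + u2) / (1 - a1 * a2)

# OLS of y1 on y2 is inconsistent: y2 is correlated with u1
a1_ols = np.cov(y2, y1)[0, 1] / y2.var(ddof=1)
```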
23
The proxy variable (OLS) solution to the omitted variables problem
Consider a model with an unobservable variable q that we want to pull out of the error term and include in the model as an independent variable. We do not observe q, but we can find a proxy variable z for q (highly correlated with q). Thus, including z in the model might reduce or eliminate the bias. There are two formal requirements for a proxy variable:
1. The proxy variable must be redundant in the structural equation.
2. Once we condition on the proxy, the correlation between the omitted variable and each xj should go to zero. Writing q = θ0 + θ1 z + r, we require that r is uncorrelated with z and with x1, …, xk.
24
Instrumental variables (IV) estimation
Consider a linear population model y = β0 + β1 x1 + … + βk xk + u, where E(u) = 0, but xk might be endogenous for any of the reasons discussed earlier. The error term includes an unobservable that is correlated with xk, so Cov(xk, u) ≠ 0, and thus OLS estimation of the model generally results in inconsistent estimates of all the coefficients in the model. The method of IV provides a general solution to the endogeneity bias by using an instrument z1 for the endogenous variable. The instrument has to satisfy two conditions:
a) The instrument is exogenous (valid): Cov(z1, u) = 0. This is often referred to as an exclusion restriction; z1 is excluded from the structural equation.
b) The instrument is informative, or relevant. This means that the instrument must be correlated with xk, conditional on all the exogenous variables in the model.
25
Instrumental variables (IV) estimation
Assume that there is a linear relationship between xk and the other variables: xk = δ0 + δ1 x1 + … + δk−1 xk−1 + θ1 z1 + r, where r is a mean-zero error term uncorrelated with all the variables on the RHS. For z1 to be a relevant, or informative, instrument we require that θ1 ≠ 0. The reduced-form equation for the dependent variable can be written by substituting this expression for xk into the structural equation. Given the assumptions we made, we now know that the reduced-form residual is uncorrelated with all the explanatory variables on the RHS. This means that the reduced-form equation can be estimated consistently using OLS.
26
An example of the proxy variable approach
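The proxy idea can be sketched numerically (a hypothetical setup, not the lecture's SPSS/Stata example: think of q as unobserved "ability" and z as a noisy test score). Including the proxy removes most, though not necessarily all, of the omitted-variable bias:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20000
q = rng.normal(size=n)                        # unobserved variable (e.g. "ability")
x = 0.8 * q + rng.normal(size=n)              # regressor correlated with q
y = 1.0 * x + 1.0 * q + rng.normal(size=n)    # true partial effect of x is 1.0
z = q + 0.3 * rng.normal(size=n)              # proxy: noisy measure of q

ones = np.ones(n)

def slope_on_x(Xmat, yv):
    """Return the estimated coefficient on x (second column of Xmat)."""
    return np.linalg.lstsq(Xmat, yv, rcond=None)[0][1]

b_omit = slope_on_x(np.column_stack([ones, x]), y)      # q omitted: biased
b_proxy = slope_on_x(np.column_stack([ones, x, z]), y)  # proxy included
```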
27
Assume we wish to estimate the following equation: An example of the IV approach (2SLS)
28
Assume we wish to estimate the following equation: First-stage regression of the IV approach
29
Assume we wish to estimate the following equation: Second stage regression of the IV approach
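The two stages of 2SLS can be sketched on simulated data (a hypothetical setup standing in for the lecture's Stata output; all coefficients are illustrative assumptions). The instrument z shifts x but is unrelated to the confounder, so 2SLS recovers the true effect while OLS does not:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
z = rng.normal(size=n)                        # instrument: exogenous and relevant
c = rng.normal(size=n)                        # unobserved confounder
x = 0.7 * z + 0.8 * c + rng.normal(size=n)    # x is endogenous (depends on c)
y = 1.0 * x + 1.0 * c + rng.normal(size=n)    # true effect of x is 1.0

ones = np.ones(n)

# OLS is inconsistent here because Cov(x, c) != 0
b_ols = np.linalg.lstsq(np.column_stack([ones, x]), y, rcond=None)[0][1]

# first stage: regress x on the instrument, keep the fitted values
g0, g1 = np.linalg.lstsq(np.column_stack([ones, z]), x, rcond=None)[0]
x_hat = g0 + g1 * z

# second stage: regress y on the first-stage fitted values
b_2sls = np.linalg.lstsq(np.column_stack([ones, x_hat]), y, rcond=None)[0][1]
```

In practice the second-stage standard errors must be corrected for the generated regressor; canned routines (e.g. Stata's ivregress 2sls) do this automatically.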
30
Linear Panel Data Analysis
Repeated cross-sectional data. Panel data: combined time-series and cross-section data (N, T). Individual-specific, time-invariant, unobserved heterogeneity. Balanced vs. unbalanced panel data: if the panel is unbalanced for reasons that are not random, then we may need to use a sample selection model.
31
Linear Panel Data Analysis
Larger sample size than a single cross-section: more precise estimates (i.e. lower standard errors). Panel data enable you to solve an omitted variables problem (i.e. an endogeneity problem). Panel data also enable you to estimate dynamic equations (lagged dependent variables on the RHS).
32
Fixed effects (within) estimator (FE): time-demeaned data
First-differenced estimator (FD)
Least squares dummy variables (LSDV)
Assumptions of FE and LSDV: the unobserved term alpha may be correlated with X; strict exogeneity.
Assumptions of FD: the unobserved term alpha may be correlated with X; exogeneity up to the first lag of the error term.
FE or FD? If T = 2 then FE = FD. If T > 2 these estimators differ only because of sampling error. If the FD error term is serially uncorrelated, use FD. If the FE error term is i.i.d., use FE.
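The within (FE) estimator can be sketched by hand on a simulated balanced panel (all parameters are illustrative assumptions): time-demeaning wipes out the individual effect alpha, so FE is consistent even though alpha is correlated with X, while pooled OLS is not:

```python
import numpy as np

rng = np.random.default_rng(8)
N, T = 500, 4
alpha = rng.normal(size=(N, 1))                    # individual fixed effects
x = 0.9 * alpha + rng.normal(size=(N, T))          # x correlated with alpha
y = 1.0 * x + alpha + rng.normal(size=(N, T))      # true slope is 1.0

# pooled OLS slope ignores alpha and is inconsistent here
b_pols = np.cov(x.ravel(), y.ravel())[0, 1] / x.ravel().var(ddof=1)

# within (FE) estimator: subtract each individual's time mean, then OLS
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
b_fe = (x_dm * y_dm).sum() / (x_dm ** 2).sum()
```

This is the same computation that, for example, Stata's xtreg, fe performs (apart from degrees-of-freedom corrections to the standard errors).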
33
Pooled OLS (POLS) and the Random Effects Estimator (RE)
Assumptions of POLS: the unobserved term alpha is uncorrelated with X. Allows correlation with some lagged X (X is predetermined). Homoskedasticity and no serial correlation (use clustered standard errors).
Assumptions of RE (the combination of the strongest assumptions): the unobserved term alpha is uncorrelated with X; strict exogeneity.
RE is a GLS estimator: the data are transformed so that the errors fulfill the assumptions of the classical linear regression model.
34
Model selection
FE or RE? Test for non-zero correlation between the unobserved effect and X (Hausman test). FE is consistent regardless of whether alpha is correlated with X or not; RE requires that the correlation is zero. So: the null hypothesis in the Hausman test is that both models are consistent; the alternative hypothesis favors the FE model.
RE or POLS? The Breusch–Pagan test: under the null hypothesis, the variance of the unobserved term is 0, so use POLS; the alternative favors RE. This can also be seen as a test of an i.i.d. error term in POLS.