Quantitative research methods in business administration Lecture 3 Multivariate analysis OLS, ENDOGENEITY BIAS, 2SLS Panel Data Exemplified by SPSS and Stata Taylan Mavruk

1. Multivariate analysis – cross tabulation
Examining the relation between three qualitative variables
1. In this case we want to investigate the "strength" of our already tested bivariate dependencies. This is done by introducing a third variable (a test or control variable).
2. The test variable (z) must be on the same data level as our independent variable (X) (nominal or ordinal).
3. The purpose is to be able to answer two questions:
a. What other variables (z1, z2, z3, …) may offer an explanation for the variance in Y?
b. What will happen to our original dependency (X → Y) if we introduce one or more test variables?

Adding a test variable to case 1
Definition: the combination of frequencies of three (qualitative) variables
Example: F7b_a * Stratum * gender dominance

Test of independence in cross-tabulation
Case 1 continued
Is the identified dependency between the variables statistically significant, or is it due to chance? We use a hypothesis test for dependency (the Chi²-test):
H0: The company size – sick leave dependency is independent of gender
H1: The company size – sick leave dependency is dependent on gender
The Chi²-test tells us that we can reject H0, since the likelihood that the difference is due to chance is less than 5%. This tells us that our original hypothesis, that there is a dependency between company size and sick leave, holds for both male- and female-dominated organizations.
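The mechanics of the Chi²-test can be sketched with a small, entirely made-up 2×2 table (these counts are not the lecture's F7b_a * Stratum survey data); Python is used here only so the arithmetic is easy to verify, since the lecture itself runs this via SPSS Crosstabs.

```python
# Chi-square test of independence for a 2x2 cross-tabulation.
# The counts are made up for illustration only.
observed = [[30, 20],   # e.g. small firms: low / high sick leave
            [10, 40]]   # e.g. large firms: low / high sick leave

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # expected count under H0 (independence): row total * column total / n
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

# With (2-1)*(2-1) = 1 degree of freedom, the 5% critical value is 3.841
reject_h0 = chi2 > 3.841
```

Here the statistic far exceeds the critical value, so H0 (independence) is rejected, mirroring the decision rule described above.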

Simple linear regression
Simple linear regression uses one independent variable to explain the dependent variable
Some relationships are too complex to be described using a single independent variable
Multiple regression models use two or more independent variables to describe the dependent variable
This allows multiple regression models to handle more complex situations
There is no limit to the number of independent variables a model can use (although k < n should hold)
Multiple regression still has only one dependent variable

What are we studying? (Conditional expectations)
The aim is to study the effect of a variable w on the expected value of y, holding fixed a vector of controls c: E(y | w, c), the structural equation; i.e., the partial effect of changing w holding c constant.
If w is continuous, the partial effect is given by ∂E(y | w, c)/∂w.
If w is a dummy variable, the partial effect is E(y | w = 1, c) − E(y | w = 0, c).
Estimating partial effects such as these in practice is difficult, primarily because of the unobservability problem: typically, not all elements of the vector c are observed, and perfectly measured, in your data. Thus, we might have problems such as omitted variables, and other sources of endogeneity bias.
Ref: Wooldridge (2002)

What can we study besides partial effects?
Elasticity of the conditional expected value of y with respect to, say, x1: (∂E(y | x)/∂x1) · (x1/E(y | x)).
Semi-elasticity: (∂E(y | x)/∂x1) · (1/E(y | x)).
The elasticity shows how much E(y | x) changes, in percentage terms, in response to a 1% increase in x1. The semi-elasticity shows how much E(y | x) changes, in percentage terms, in response to a one-unit increase in x1.
Ref: Wooldridge (2002)
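A quick numeric check of the elasticity interpretation, using a hypothetical log-log relationship (the coefficient values are made up): in a model log(y) = b0 + b1·log(x1), b1 is the elasticity, so a 1% increase in x1 should change y by approximately b1 percent.

```python
import math

# Hypothetical log-log model log(y) = b0 + b1*log(x1); b1 is the elasticity.
b0, b1 = 1.0, 0.7

def expected_y(x1):
    return math.exp(b0 + b1 * math.log(x1))

# A 1% increase in x1 changes y by approximately b1 percent:
pct_change = (expected_y(1.01) - expected_y(1.0)) / expected_y(1.0) * 100
# pct_change comes out very close to 0.7, i.e. the elasticity b1
```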

The Multiple Regression Model – OLS Estimation
A population model that is linear in parameters:
y = β0 + β1x1 + β2x2 + … + βkxk + u,
where y, x1, …, xk are observable variables, u is the unobservable random disturbance term, and β0, β1, …, βk are parameters that we want to estimate.
Properties of the error term allow us to decide whether OLS is the appropriate estimator. For OLS to consistently estimate the coefficients (also called the parameters) – an asymptotic property – assumptions i) E(u) = 0 and ii) Cov(xj, u) = 0 for all j should hold.

The Regression Model Assumptions Continued
Mean of zero assumption: the mean of the error terms is equal to 0
Constant variance assumption: the variance of the error terms, σ², is the same for every combination of values of x1, x2, …, xk
Normality assumption: the error terms follow a normal distribution for every combination of values of x1, x2, …, xk
Independence assumption: the values of the error terms are statistically independent of each other
Exogeneity: Cov(x, e) = 0

R² and Adjusted R²
1. Total variation: Σ(yi − ȳ)²
2. Explained variation: Σ(ŷi − ȳ)²
3. Unexplained variation: Σ(yi − ŷi)²
4. Total variation is the sum of explained and unexplained variation
5. R² is the ratio of explained variation to total variation
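The decomposition above can be verified on a tiny made-up data set; this is a Python sketch of what SPSS or Stata report in their model-summary tables, fitting a simple OLS line and splitting the variation in y into its explained and unexplained parts.

```python
# Variation decomposition for a simple OLS fit, with made-up data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# OLS slope and intercept for y = b0 + b1*x
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

total = sum((yi - y_bar) ** 2 for yi in y)                     # total variation
explained = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained variation
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation

r2 = explained / total
k = 1                                   # one independent variable
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Note that total = explained + unexplained holds exactly (up to floating-point error), and the adjusted R² formula used here is the standard 1 − (1 − R²)(n − 1)/(n − k − 1).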

The Adjusted R²
Adding an independent variable to a multiple regression will raise R² (or leave it constant)
R² will rise slightly even if the new variable has no relationship to y
The adjusted R², defined as 1 − (1 − R²)(n − 1)/(n − k − 1), corrects this tendency in R²
As a result, it gives a better estimate of the importance of the independent variables

Multiple Correlation Coefficient R
The multiple correlation coefficient R is just the square root of R²
With simple linear regression, r would take on the sign of b1
With multiple regression there are multiple bj's
For this reason, R is always positive
To interpret the direction of the relationship between the x's and y, you must look at the sign of the appropriate bj coefficient

Testing the model – The Overall F Test
To test H0: β1 = β2 = … = βk = 0 versus Ha: at least one of β1, β2, …, βk ≠ 0
The test statistic is F = (Explained variation / k) / (Unexplained variation / (n − (k + 1)))
Reject H0 in favor of Ha if F(model) > Fα or p-value < α
* Fα is based on k numerator and n − (k + 1) denominator degrees of freedom
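Because R² is the ratio of explained to total variation, the overall F statistic can equivalently be computed from R² alone; a small sketch with made-up numbers (R², n and k below are illustrative, not from the lecture's data):

```python
# Overall F statistic computed from R^2 (all numbers are made up).
# F = (explained/k) / (unexplained/(n-(k+1))) = (R^2/k) / ((1-R^2)/(n-(k+1)))
r2, n, k = 0.65, 50, 3
f_stat = (r2 / k) / ((1 - r2) / (n - (k + 1)))
# The 5% critical value F_0.05(3, 46) is roughly 2.8, so an F statistic
# this large would lead us to reject H0 that all slopes are zero.
```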

Testing the Significance of an Independent Variable
A variable in a multiple regression model is not likely to be useful unless there is a significant relationship between it and y
To test significance, we use the null hypothesis H0: βj = 0 versus the alternative hypothesis Ha: βj ≠ 0

Testing the Significance of an Independent Variable continued
Alternative      Reject H0 if       p-value
Ha: βj > 0       t > tα             area under the t distribution to the right of t
Ha: βj < 0       t < −tα            area under the t distribution to the left of t
Ha: βj ≠ 0       |t| > tα/2 *       twice the area under the t distribution to the right of |t|
* That is, t > tα/2 or t < −tα/2

Testing the Significance of an Independent Variable continued
If we can reject H0: βj = 0 at the 0.05 level of significance, we have strong evidence that the independent variable xj is significantly related to y
If we can reject H0: βj = 0 at the 0.01 level of significance, we have very strong evidence that xj is significantly related to y
The smaller the significance level α at which H0 can be rejected, the stronger the evidence that xj is significantly related to y

Endogeneity bias
The assumption E(u) = 0 is innocuous, as the intercept would pick up a non-zero mean in u. The crucial assumption is zero covariance. If this assumption does not hold, say because x1 is correlated with u, we say that x1 is endogenous.
Why is endogeneity a problem? Remember our aim: we try to estimate the partial effect consistently. This implies that in the probability limit the beta estimated from a sample should mirror (equal) the beta in the population. This is not the case in the presence of endogeneity, so we cannot use OLS.
OBS! Consistency vs. efficiency

Sources of Endogeneity
In principle, the problem of endogeneity may arise whenever economists make use of non-experimental data, because in that setting you can never be totally certain what is driving what. Endogeneity typically arises in one of three ways:
Omitted variables: economic theory says we should control for one or more additional variables in our model but, typically because we do not have the data, we cannot. If the true model is y = β0 + β1x1 + β2x2 + u and we omit x2, then the OLS estimator of β1 converges to β1 + β2 · Cov(x1, x2)/Var(x1). OLS will be a consistent estimator if the second term is zero; if it is not, the OLS estimate will be biased (upward when β2 and Cov(x1, x2) have the same sign).
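A short simulation makes the omitted-variable bias concrete. All parameter values below are made up for illustration: the true model has two regressors, but the "short" regression on x1 alone absorbs part of x2's effect.

```python
import random

# Simulated omitted-variable bias (all parameter values are made up).
# True model: y = 1 + 2*x1 + 3*x2 + u, with x2 correlated with x1.
random.seed(1)
n = 10_000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]   # Cov(x1, x2) = 0.5
y = [1 + 2 * a + 3 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# "Short" regression of y on x1 alone (OLS slope via demeaned moments)
m1, my = sum(x1) / n, sum(y) / n
b1_short = sum((a - m1) * (c - my) for a, c in zip(x1, y)) / sum(
    (a - m1) ** 2 for a in x1)
# plim of the short slope = beta1 + beta2*Cov(x1,x2)/Var(x1) = 2 + 3*0.5 = 3.5,
# so b1_short lands near 3.5 rather than the true 2.
```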

Sources of Endogeneity
Measurement error: when we collect survey data we try to make sure the information we get from the respondents conforms as closely as possible to the variables we have in mind for our analysis, yet it is inevitable that measurement errors creep into the data. Measurement errors may well result in (econometric) endogeneity bias. Consider the following model:
y = β0 + β1x1* + u.
Now, suppose we observe a noisy measure x1 = x1* + v; our model now looks like:
y = β0 + β1x1 + (u − β1v).
Observe that the measurement error v is correlated with x1 and enters the residual, so the OLS estimate of β1 will be inconsistent. Measurement error leads to three interesting results.

Sources of Endogeneity
1. Attenuation bias (the "iron law of econometrics"): with classical measurement error, the OLS estimator converges to β1 · Var(x1*)/(Var(x1*) + Var(v)), so the estimate will always be closer to zero than β1.
2. Signal-to-noise ratio: the severity of the attenuation bias depends on the ratio Var(x1*)/Var(v); if this ratio is large, the attenuation bias will be small, and vice versa.
3. The sign of the probability limit will always be the same as that of the structural parameter β1.
What we learn from measurement error is that the sign of the coefficient will not change (asymptotically).
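The attenuation result can be checked by simulation. The data-generating process below is made up: the true slope is 2, and with equal signal and noise variances the attenuation factor is 1/2, so OLS on the noisy regressor should land near 1.

```python
import random

# Simulated classical measurement error (parameter values are made up).
# We observe x = x_star + v instead of the true regressor x_star.
random.seed(2)
n = 10_000
beta = 2.0
x_star = [random.gauss(0, 1) for _ in range(n)]   # Var(x*) = 1
v = [random.gauss(0, 1) for _ in range(n)]        # Var(v) = 1
x = [a + b for a, b in zip(x_star, v)]
y = [beta * a + random.gauss(0, 0.5) for a in x_star]

mx, my = sum(x) / n, sum(y) / n
b_hat = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum(
    (a - mx) ** 2 for a in x)
# Attenuation factor Var(x*)/(Var(x*)+Var(v)) = 1/2, so b_hat sits near 1.0:
# well below the true beta = 2.0, but with the same (positive) sign.
```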

Sources of Endogeneity
Simultaneity: simultaneity arises when at least one of the explanatory variables is determined simultaneously along with the dependent variable. Consider the following model:
y1 = α1y2 + β1z1 + u1
y2 = α2y1 + β2z2 + u2.
The problem here is that y2 both determines y1 and depends on y1. More specifically, because u1 affects y1 through the first equation, and y1 in turn affects y2 through the second equation, it follows that y2 will be correlated with u1.
So, what can we do to overcome the endogeneity bias?

The Proxy variable (OLS) solution to the omitted variables problem
Consider the model y = β0 + β1x1 + … + βkxk + γq + v; here q is an unobservable variable that we want to pull out of the error term and include in the model as an independent variable. We do not observe q, but suppose we can find a proxy variable z for q (highly correlated with q). Then including z in the model might reduce or eliminate the bias. There are two formal requirements for a proxy variable:
1. The proxy variable must be redundant in the structural equation: E(y | x1, …, xk, q, z) = E(y | x1, …, xk, q).
2. The proxy variable should make the correlation between the omitted variable q and each xj go to zero, once we condition on z. Writing q = θ0 + θ1z + r, we require that E(r | x1, …, xk, z) = 0.

Instrumental variables (IV) estimation
Consider a linear population model y = β0 + β1x1 + … + βkxk + u, where E(u) = 0 but xk might be endogenous for any of the reasons discussed earlier. The error term then includes an unobservable that is correlated with xk, so Cov(xk, u) ≠ 0, and thus OLS estimation of the model generally results in inconsistent estimates of all the coefficients in the model.
The method of IV provides a general solution to the endogeneity bias by using an instrument z for the endogenous variable. The instrument has to satisfy two conditions:
a) The instrument is exogenous (valid): Cov(z, u) = 0. This is often referred to as an exclusion restriction; z is excluded from the structural equation.
b) The instrument is informative (relevant): the instrument must be correlated with xk, conditional on all the exogenous variables in the model.

Instrumental variables (IV) estimation
Assume that there is a linear relationship between xk, the exogenous variables and the instrument:
xk = δ0 + δ1x1 + … + δk−1xk−1 + θ1z1 + r,
where r is a mean-zero error term uncorrelated with all the variables on the RHS. For z1 to be a relevant, or informative, instrument we require that θ1 ≠ 0.
The reduced-form equation for the dependent variable is obtained by substituting this expression for xk into the structural equation. Given the assumptions we made, the residual of the reduced form is uncorrelated with all the explanatory variables on the RHS. This means that the reduced-form equation can be estimated consistently using OLS.

An example of the proxy variable approach

Assume we wish to estimate the following equation: An example of the IV approach (2SLS)

Assume we wish to estimate the following equation: First-stage regression of the IV approach

Assume we wish to estimate the following equation: Second stage regression of the IV approach
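The slides' own example equations are not reproduced in this transcript, so the two stages can instead be illustrated on a simulated model with one endogenous regressor and one instrument (the data-generating process below is entirely made up). This is a hand-rolled Python sketch of what `ivregress 2sls` does in Stata.

```python
import random

# Two-stage least squares by hand for one endogenous regressor x and one
# instrument z; all parameter values are made up for illustration.
random.seed(3)
n = 10_000
beta = 1.5
u = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
# x is endogenous (it loads on u), but z shifts x and is unrelated to u.
x = [0.8 * zi + 0.5 * ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [beta * xi + ui for xi, ui in zip(x, u)]

def ols_slope(a, b):
    """OLS slope of b on a (with intercept, via demeaned moments)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / sum(
        (ai - ma) ** 2 for ai in a)

b_ols = ols_slope(x, y)            # inconsistent: picks up Cov(x, u)
first_stage = ols_slope(z, x)      # first stage: regress x on z
# Fitted values from the first stage (intercept omitted: since ols_slope
# demeans, a constant shift does not affect the second-stage slope).
x_hat = [first_stage * zi for zi in z]
b_2sls = ols_slope(x_hat, y)       # second stage: regress y on fitted x
```

With this design the OLS slope is biased upward (it absorbs Cov(x, u) > 0), while the 2SLS slope recovers the true β of 1.5 up to sampling error.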

Linear Panel Data Analysis
Repeated cross-sectional data
Panel data: combined time series and cross-section data (N, T)
Individual-specific, time-invariant, unobserved heterogeneity
Balanced vs. unbalanced panel data: if the panel is unbalanced for reasons that are not random, then we may need a sample selection model.

Linear Panel Data Analysis
Larger sample size than a single cross-section: more precise estimates (i.e., lower standard errors).
Panel data enable you to solve an omitted variables problem (i.e., an endogeneity problem).
Panel data also enable you to estimate dynamic equations (lagged dependent variables on the RHS).

Fixed effects (within) estimator (FE): time-demeaned data
First-differenced estimator (FD)
Least squares dummy variables (LSDV)
Assumptions of FE and LSDV: the unobserved term alpha may be correlated with X; strict exogeneity.
Assumptions of FD: the unobserved term alpha may be correlated with X; exogeneity up to the first lag of the error term.
FE or FD? If T = 2 then FE = FD. If T > 2 the estimates will differ only because of sampling error. If the FD error term is serially uncorrelated, use FD; if the FE error term is i.i.d., use FE.
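The within transformation can be sketched on a tiny made-up panel: time-demean y and x within each unit, then run OLS on the demeaned data. This is a Python illustration of what `xtreg, fe` does in Stata; the unit effects and slope below are invented so the answer is exact.

```python
# Fixed-effects (within) estimator on a toy panel (3 units, T = 3).
# True model: y = alpha_i + 2*x (no noise), with unit effects
# alpha_A = 5, alpha_B = -2, alpha_C = 0 (all values made up).
panel = {
    "A": [(1, 7.0), (2, 9.0), (3, 11.0)],
    "B": [(1, 0.0), (2, 2.0), (4, 6.0)],
    "C": [(0, 0.0), (1, 2.0), (2, 4.0)],
}

xd, yd = [], []
for obs in panel.values():
    mx = sum(x for x, _ in obs) / len(obs)
    my = sum(y for _, y in obs) / len(obs)
    for x, y in obs:
        xd.append(x - mx)   # demeaning wipes out the time-invariant alpha_i
        yd.append(y - my)

# Pooled OLS slope on the demeaned data (no intercept needed: means are zero)
beta_fe = sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)
# beta_fe recovers the true slope 2 exactly, despite the unit effects,
# which a pooled OLS of y on x alone would not do.
```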

Pooled OLS (POLS)
Random effects estimator (RE)
Assumptions of POLS: the unobserved term alpha is uncorrelated with X. Allows correlation with some lagged X (X is predetermined). Homoskedasticity and no serial correlation (use clustered standard errors).
Assumptions of RE (the combination of strongest assumptions): the unobserved term alpha is uncorrelated with X; strict exogeneity.
RE is a GLS estimator: the model is transformed so that the errors fulfill the assumptions of the classical linear regression model.

Model selection
FE or RE? Test for non-zero correlation between the unobserved effect and X (Hausman test). FE is consistent regardless of whether alpha is correlated with X or not; RE requires that the correlation is zero. So: the null hypothesis in the Hausman test is that both models are consistent; the alternative hypothesis favors the FE model.
RE or POLS? The Breusch-Pagan test: under the null hypothesis the variance of the unobserved term is 0, in which case use POLS; the alternative favors RE. This can also be seen as a test of an i.i.d. error term in POLS.