1
Class 5 Multiple Regression Lionel Nesta Observatoire Français des Conjonctures Economiques Lionel.nesta@ofce.sciences-po.fr SKEMA Ph.D programme 2010-2011
2
Introduction to Regression Typically, the social scientist deals with multiple and complex webs of interactions between variables. An immediate and appealing extension of simple linear regression is to enlarge the set of explanatory variables. Multiple regression includes several explanatory variables in the empirical model.
4
OLS chooses the parameter estimates so as to minimize the sum of squared errors.
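For reference, since the slide's formula is not reproduced here, the least-squares objective with k explanatory variables can be written as:
\[
\min_{\hat\beta_0,\dots,\hat\beta_k}\ \sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_k x_{ik}\right)^2
\]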
5
Multivariate Least Square Estimator Usually, the multivariate model is written in matrix notation, with the following least-squares solution:
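The standard matrix statement of the model and its least-squares solution, restated here because the slide's equations are not reproduced, is:
\[
y = X\beta + u, \qquad \hat\beta = (X'X)^{-1}X'y
\]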
6
Assumption OLS1: Linearity. The model is linear in its parameters. It is possible to apply non-linear transformations to the variables (e.g. the log of x), but not to the parameters: OLS cannot estimate a model that is non-linear in its parameters.
7
Assumption OLS2: Random Sampling. The n observations are a random sample of the whole population. There is no selection bias in the sample, so the results pertain to the whole population. All observations are independent from one another (no serial or cross-sectional correlation).
8
Assumption OLS3: No perfect Collinearity. No independent variable is constant: each has a non-zero variance which, together with the variance of the dependent variable, is used to compute the parameters. There are no exact linear relationships among the independent variables.
9
Assumption OLS4: Zero Conditional Mean. The error term u has an expected value of zero given any values of the independent variables (IVs). In this case, all independent variables are exogenous; otherwise, at least one IV suffers from an endogeneity problem.
10
Sources of endogeneity: wrong specification of the model; omitted variable correlated with a RHS variable; measurement errors in the RHS variables; mutual causation between LHS and RHS (simultaneity).
11
Assumption OLS5: Homoskedasticity. The variance of the error term u, conditional on the RHS variables, is the same for all values of the RHS. Otherwise we speak of heteroskedasticity.
12
Assumption OLS6: Normality of the error term. The error term is independent of all RHS variables and follows a normal distribution with zero mean and variance σ².
13
Assumptions OLS OLS1 Linearity OLS2 Random Sampling OLS3 No perfect Collinearity OLS4 Zero Conditional Mean OLS5 Homoskedasticity OLS6 Normality of error term
14
Theorem 1. Under OLS1 - OLS4, OLS is unbiased: the expected value of the estimated parameters equals the true unknown parameter values β.
15
Theorem 2. Under OLS1 – OLS5, the variance of the OLS estimator is given by the formula below, where R²_j is the R-squared from regressing x_j on all other independent variables. But how can we measure σ², the variance of the error term?
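The formula referred to is the standard sampling-variance expression for the OLS slope estimates:
\[
\operatorname{Var}(\hat\beta_j) = \frac{\sigma^2}{SST_j\,(1 - R_j^2)}, \qquad SST_j = \sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2
\]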
16
Theorem 3. Under OLS1 – OLS5, the standard error of the regression is defined as shown below. It is also called the standard error of the estimate or the root mean squared error (RMSE).
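With n observations and k regressors, the standard definition is:
\[
\hat\sigma^2 = \frac{SSR}{n - k - 1}, \qquad \hat\sigma = \sqrt{\frac{SSR}{n - k - 1}}
\]
where SSR is the sum of squared residuals.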
17
Standard Error of Each Parameter Combining theorems 2 and 3 yields:
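Replacing σ by its estimate gives the standard error of each parameter:
\[
\operatorname{se}(\hat\beta_j) = \frac{\hat\sigma}{\sqrt{SST_j\,(1 - R_j^2)}}
\]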
18
Theorem 4 (Gauss-Markov). Under assumptions OLS1 – OLS5, the OLS estimators are the Best Linear Unbiased Estimators (BLUE) of the parameters β. Assumptions OLS1 – OLS5 are known as the Gauss-Markov assumptions: under them, OLS is the best linear estimation method. The estimates are unbiased (OLS1-4) and have the smallest variance among linear unbiased estimators (OLS5).
19
Theorem 5. Under assumptions OLS1 – OLS6, the standardised OLS estimates follow a t distribution:
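The standard statement is:
\[
\frac{\hat\beta_j - \beta_j}{\operatorname{se}(\hat\beta_j)} \sim t_{n-k-1}
\]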
20
Extension of Theorem 5: Inference. We can define the 95% confidence interval of β as shown below. If the 95% CI does not include 0, then β is significantly different from 0.
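The usual form of the 95% confidence interval is:
\[
\hat\beta_j \pm t_{0.025,\,n-k-1}\,\operatorname{se}(\hat\beta_j)
\]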
21
Student t Test for H0: β_j = 0. We are also in a position to make inferences about β_j. H0: β_j = 0 versus H1: β_j ≠ 0. Decision rule: accept H0 if |t| < t_{α/2}; reject H0 if |t| ≥ t_{α/2}.
22
Summary OLS1 Linearity OLS2 Random Sampling OLS3 No perfect Collinearity OLS4 Zero Conditional Mean OLS5 Homoskedasticity OLS6 Normality of error term T1 Unbiasedness T2-T4 BLUE T5 β ~ t
23
The knowledge production function Application 1: seminal model
24
Application 1: baseline model
25
Application 2: Changing specification The knowledge production function
26
Application 2: Changing specification
27
The knowledge production function Application 3: Adding variables
29
Qualitative variables used as independent variables
30
Qualitative variables as indep. variables Qualitative variables Dummy variables Generating dummy variables using STATA Interpretation of coefficients in OLS Interaction effects between continuous and dummy var.
31
Qualitative variables Qualitative variables provide information on discrete characteristics. The number of categories taken by qualitative variables is generally small. These can be numerical values, but each number denotes an attribute, a characteristic. A qualitative variable may have several categories. Two categories: male, female. Three categories: nationality (French, German, Turkish). More than three categories: sectors (car, chemical, steel, electronic equip., etc.)
32
Qualitative variables There are several ways to code qualitative variables with n categories: using one categorical variable, or producing n - 1 dummy variables. A dummy variable is a variable which takes the values 0 or 1. Dummy variables are also called binary or dichotomous variables.
33
Coding using one categorical variable Two categories: we generate a categorical variable called "gender" set to 1 if the observation is a female, 2 if the observation is a male. Three categories: we generate a categorical variable called "country" set to 1 if the observation is French, 2 if the observation is German, 3 if the observation is Turkish. More than three categories: we generate a categorical variable called "sector" set to 1 if the observation is in the car industry, 2 for the chemical industry, 3 for the steel industry, 4 for the electronic equipment industry, etc. This requires the use of labels in order to know which category a given number refers to.
34
Labelling variables Labelling is tedious, boring and uninteresting, but there are clear consequences when one must interpret the results. label variable: describes a variable, qualitative or quantitative, e.g. label variable asset "real capital". label define: defines a label (the meaning of the numbers), e.g. label define firm_type 1 "biotech" 0 "Pharma". label values: applies the label to a given variable, e.g. label values type firm_type.
35
Example of labelling (Stata do-file excerpt):
* CREATION OF INDUSTRY LABELS
egen industrie = group(isic_oecd)
#delimit ;
label define induscode 1 "Text. Habill. & Cuir" 2 "Bois" 3 "Pap. Cart. & Imprim."
  4 "Coke Raffin. Nucl." 5 "Chimie" 6 "Caoutc. Plast." 7 "Aut. Prod. min."
  8 "Métaux de base" 9 "Travail des métaux" 10 "Mach. & Equip." 11 "Bureau & Inform."
  12 "Mach. & Mat. Elec." 13 "Radio TV Telecom." 14 "Instrum. optique"
  15 "Automobile" 16 "Aut. transp." 17 "Autres";
#delimit cr
label values industrie induscode
36
Exercise 1. Open SKEMA_BIO.dta 2. Create variable firm_type from type 3. Label variable firm_type 4. Define a label for firm_type and apply it (a solution sketch follows).
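A possible solution sketch. It assumes that type is already coded 0/1 (0 = pharma, 1 = biotech); the label name and label text below are illustrative, not given in the slides:
* sketch: assumes type is coded 0/1 (0 = pharma, 1 = biotech)
use SKEMA_BIO.dta, clear
generate firm_type = type
label variable firm_type "firm type (pharma vs biotech)"
label define ftype 0 "Pharma" 1 "Biotech"
label values firm_type ftype
tabulate firm_type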
37
Dummy variables Coding categorical variables using dummy variables only. Two categories: we generate one dummy variable "female" set to 1 if the obs. is a female, 0 otherwise, and one dummy variable "male" set to 1 if the obs. is a male, 0 otherwise. But one of the dummy variables is simply redundant: when female = 0, then necessarily male = 1 (and vice versa). Hence with two categories, we only need one dummy variable.
38
Dummy variables Coding categorical variables using dummy variables only. Three categories: we generate one dummy variable "France" set to 1 if the obs. is French, 0 otherwise; one dummy variable "Germany" set to 1 if the obs. is German, 0 otherwise; and one dummy variable "Turkey" set to 1 if the obs. is Turkish, 0 otherwise. But one of the dummy variables is simply redundant: when France = 0 and Germany = 0, then Turkey = 1. For a variable with n categories, we must create n - 1 dummy variables, each representing one particular category.
39
Generation of dummies with STATA Using the if condition:
generate DEU = 0
replace DEU = 1 if country=="GERMANY"
generate LDF = 1 if size > 100
replace LDF = 0 if size < 101
Avoiding the use of the if condition:
generate FRA = country=="FRANCE"
generate LDF = size > 100
40
Generation of dummies with STATA With n categories and n being large, generating dummy variables can become really tedious. The tabulate command has a very convenient extension, since it will generate n dummy variables at once: tabulate varcat, gen(v_). For example, tabulate country, gen(c_) will create n dummy variables, with n being the number of countries in the dataset, c_1 being the first country, c_2 the second, c_3 the third, etc.
41
Reading coefficients of dummy variables Remember! A coefficient tells us the increase in y associated with a one-unit increase in x, other things held constant (ceteris paribus). Suppose the knowledge production function has "y" as the number of patents and "biotech" as a dummy variable set to 1 for biotech firms, 0 otherwise.
42
Reading coefficients of dummy variables If the firm is a biotech company, then the dummy variable "biotech" is equal to unity; if the firm is a pharma company, then "biotech" is equal to zero. The two implied regression lines are shown below.
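A sketch of the two implied equations, assuming a specification with log R&D as the only other regressor (the exact model is not reproduced in the slides):
\[
\ln(PAT_i) = \beta_0 + \beta_1 \ln(R\&D_i) + \delta\,biotech_i + u_i
\]
\[
biotech_i = 1:\ E[\ln(PAT_i)] = (\beta_0 + \delta) + \beta_1 \ln(R\&D_i), \qquad biotech_i = 0:\ E[\ln(PAT_i)] = \beta_0 + \beta_1 \ln(R\&D_i)
\]
so δ shifts the intercept for biotech firms relative to pharma firms.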
43
Reading coefficients of dummy variables The coefficient reads as the variation in the dependent variable when the dummy variable is set to 1, relative to the situation where the dummy variable is set to 0. With two categories, I must introduce one dummy variable. With three categories, I must introduce two dummy variables. With n categories, I must introduce (n-1) dummy variables.
44
Exercise 1. Regress the following model 2. Predict the number of patents for both biotech and pharma companies 3. Produce descriptive statistics of PAT for each type of company using the command table 4. What do you observe?
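A hedged sketch of the exercise. The exact model is not reproduced here, so the regressor names (lrdi, lassets) and the dummy name (biotech) are assumptions based on earlier slides, and the table syntax shown is the pre-Stata 17 contents() form:
* assumed specification; PAT = patent count, biotech = 0/1 dummy
regress PAT lrdi lassets biotech
predict PAT_hat, xb
* mean observed and predicted patents by firm type
table biotech, contents(mean PAT mean PAT_hat)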
45
Reading coefficients of dummy variables For semi-logarithmic forms (log Y), the coefficient β must be read as an approximation of the percentage change in Y associated with a one-unit change in the explanatory variable. This approximation is acceptable for small β (β < 0.1). When β is large (β ≥ 0.1), the exact percentage change in Y is: 100 × (e^β – 1)
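For instance, with β = 0.5 the approximation would suggest a 50% increase in Y, whereas the exact change is 100 × (e^0.5 – 1) ≈ 64.9%, which is why the exact formula matters for large coefficients.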
46
Application 4: dummy variable The knowledge production function
47
Application 4: dummy variable
48
Application 4: dummy variable [Figure: ln(PAT) plotted against firm size]
49
Application 5: Interacting variables The knowledge production function
50
Application 5: Interacting variables
51
Application 5: Interacting variables [Figure: ln(PAT) plotted against firm size]
52
Specification Tests
53
The knowledge production function Specification Tests for Multiple OLS
54
Critical probability α such that Pr(H_a | H_0) = α. Student t test: concerning the significance of one parameter. Fisher F test: concerning the significance of several parameters simultaneously (Wald test). Non-linear restriction test: testing for a non-linear relationship between parameters.
55
Specification Tests for Multiple OLS: testing linear combinations of parameters. Concerning one parameter only, H0: the coefficient on lassets equals 0.30: test lassets = 0.30. Test on several parameters, H0: the coefficient on lassets equals 0.30 and the coefficient on lrdi equals 0.70: test (lassets = 0.3) (lrdi = 0.7). H0: the lrdi coefficient is twice the lassets coefficient: test lrdi = 2*lassets. H0: the two coefficients sum to one: test lrdi + lassets = 1, or lincom _b[lrdi] + _b[lassets] - 1.
56
Specification Tests for Multiple OLS: testing non-linear combinations of parameters. Test on several parameters, H0: the product of the two coefficients equals 0.2: testnl _b[lrdi] * _b[lassets] = 0.2, or nlcom _b[lrdi] * _b[lassets] - 0.2.
57
Review of Assumptions
OLS assumption | Consistency when violated | Efficiency when violated | Test
OLS1 Linearity | --- | --- | ---
OLS2 Random Sampling | Biased β | None | None. Redo sampling & estimation
OLS3 No perfect Collinearity | --- | --- | ---
OLS4 Zero Conditional Mean | Biased β | Poorly estimated variance of β | Link test; Omitted Variable test
OLS5 Homoskedasticity | None | Underestimated variance of β | Breusch-Pagan test
OLS6 Normality of error term | None | Lack of reliability of the t test for β | Shapiro-Wilk test
58
Specification Tests for Multiple OLS: specification tests on the validity of assumptions. Hypothesis OLS5: homoskedasticity of residuals. Rule of thumb using graphs: Stata instruction rvfplot. White test: Stata instruction estat imtest. Breusch-Pagan test: Stata instruction estat hettest.
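A sketch of the three checks run after a regression; the model line below is an assumption (the variable names lpat, lrdi and lassets are taken from earlier slides):
* assumed knowledge production function
regress lpat lrdi lassets
* rule of thumb: residual-versus-fitted plot
rvfplot, yline(0)
* Breusch-Pagan test
estat hettest
* White test (information matrix test with the white option)
estat imtest, white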
59
Specification Tests for Multiple OLS Specification tests on the validity of assumptions Hypothesis OLS5 : Homoskedasticity of residuals: rvfplot
60
Specification Tests for Multiple OLS Specification tests on the validity of assumptions Hypothesis OLS5 : Homoskedasticity of residuals: estat imtest
61
Specification Tests for Multiple OLS Specification tests on the validity of assumptions Hypothesis OLS5 : Homoskedasticity of residuals: estat hettest
62
Specification Tests for Multiple OLS: specification tests on the validity of assumptions. Hypothesis OLS6: normality of residuals. Rule of thumb using graphs, Stata instructions: predict res, residual and kdensity res, normal. Formally, using the Shapiro-Wilk test, Stata instructions: predict res, residual and swilk res.
63
Specification tests on the validity of assumptions Hypothesis OLS6 : Normality of residuals: kdensity Specification Tests for Multiple OLS
64
Specification tests on the validity of assumptions Hypothesis OLS6 : Normality of residuals Specification Tests for Multiple OLS
65
Specification Tests for Multiple OLS: specification tests on the validity of assumptions. There are no omitted variables (OLS4 on endogeneity). Link test, Stata instruction linktest: regress the DV on the prediction and its squared value; the variable _hat must be significant, but not _hatsq. Ramsey RESET test, Stata instruction ovtest: regress the DV on powers (up to the 4th) of the fitted values or, with the rhs option, on powers of the RHS variables.
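A sketch of both checks; note that linktest itself re-estimates, so it is run after the RESET tests (or the original regression must be re-run). The regression line is an assumption:
* assumed specification
regress lpat lrdi lassets
* Ramsey RESET: powers of the fitted values, then of the RHS variables
estat ovtest
estat ovtest, rhs
* link test: _hat should be significant, _hatsq should not
linktest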
66
Specification tests on the validity of assumptions There is no omitted variables (OLS4 on endogeneity): linktest Specification Tests for Multiple OLS
67
Specification tests on the validity of assumptions There is no omitted variables (OLS4 on endogeneity): ovtest Specification Tests for Multiple OLS
68
Exercise 1. Regress the following model 2. Assuming OLS1-3 to be correct, test OLS4-6 and conclude: OLS4 on specification using linktest and ovtest; OLS5 on homoskedasticity using imtest and hettest; OLS6 on normality of errors using kdensity and the swilk test. (A possible sequence is sketched below.)
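One possible sequence, with the regression line itself an assumption since the model is not reproduced here; linktest is left for last because it replaces the stored estimation results:
* assumed specification
regress lpat lrdi lassets biotech
* OLS4: specification / omitted variables
estat ovtest
* OLS5: homoskedasticity
estat imtest, white
estat hettest
* OLS6: normality of the errors
predict res, residuals
kdensity res, normal
swilk res
* OLS4: link test (run last, since it re-estimates and overwrites e())
linktest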