Chapter 6 Multiple Linear Regression Analysis
Learning Objectives
Understand the goals of multiple linear regression analysis
Understand the “holding all other variables constant” condition in multiple linear regression analysis
Understand the multiple linear regression assumptions required for OLS to be BLUE
Interpret multiple linear regression output in Excel
Assess the goodness-of-fit of the estimated sample regression function
Perform hypothesis tests for the overall significance of the estimated sample regression function
Perform hypothesis tests for the individual significance of an estimated slope coefficient
Perform hypothesis tests for the joint significance of a subset of estimated slope coefficients
Perform the Chow test for structural differences between two subsets of data
The Multiple Regression Model
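Idea: Examine the linear relationship between one dependent variable, y, and two or more independent variables, x1, x2, …, xk.
Population model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon\), where \(\beta_0\) is the y-intercept, \(\beta_1, \dots, \beta_k\) are the population slopes, and \(\varepsilon\) is the random error.
Estimated multiple regression model: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_k x_k\), where \(\hat{y}\) is the estimated (or predicted) value of y, \(\hat{\beta}_0\) is the estimated intercept, and \(\hat{\beta}_1, \dots, \hat{\beta}_k\) are the estimated slope coefficients.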
A Visual Depiction of the Estimated Sample Multiple Linear Regression Function
A Visual Depiction of the Predicted Value of y and the Calculated Residual for a Given Observation
A Visual Depiction of the Predicted Values of y and the Calculated Residuals for Multiple Observations
How are the Multiple Linear Regression Estimates Obtained?
Minimize the sum of squared residuals, \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\).
Unlike simple linear regression, there is no formula in summation notation for the intercept and slope coefficient estimates.
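The minimization can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the routine Excel runs internally: it solves the normal equations (X'X)b = X'y, which are the first-order conditions for minimizing the sum of squared residuals. The function name `ols` and the tiny data set are hypothetical, and the data are built so that y is exactly 1 + 2·x1 + 3·x2.

```python
def ols(X, y):
    """Return OLS estimates [b0, b1, ..., bk] by solving (X'X) b = X'y."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of ones for the intercept
    p = len(rows[0])
    # Normal equations A b = c, where A = X'X and c = X'y.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[piv][col]) < 1e-12:
            raise ValueError("X'X is singular (perfect multicollinearity)")
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for j in range(col, p):
                A[r][j] -= f * A[col][j]
            c[r] -= f * c[col]
    # Back substitution.
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, p))) / A[i][i]
    return b


# Hypothetical data: y = 1 + 2*x1 + 3*x2 with no error term.
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
coefs = ols(X, y)
residuals = [yi - (coefs[0] + coefs[1] * x1 + coefs[2] * x2) for (x1, x2), yi in zip(X, y)]
```

Because the hypothetical y contains no error, the estimates recover the intercept and both slopes and every residual is essentially zero; with real data the residuals would be nonzero but their sum of squares would still be the smallest attainable.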
Understand the “Holding All Other Independent Variables Constant” Condition
The idea behind holding all other factors constant (or ceteris paribus) is that we want to isolate the effects of a specific x on the dependent variable without any other factors changing the independent variable or the dependent variable.
A Venn Diagram of the Estimated Linear Relationship between y and x1 Assuming No Factors in the Error Term Affect x1
A Venn Diagram of the Estimated Multiple Linear Relationship between y, x1, and x2
A Comparison of the Estimated Marginal Effect of x1 on y in the Given Venn Diagrams
This is omitted variable bias. It is the bias in the estimated coefficient on x1 that results from x2 not being included in the model and x2 being related to both x1 and y.
When is Omitted Variable Bias Not Present?
If x2 is not related to y, then x2 is not in the error term and does not have to be held constant when x1 changes.
If x2 is not related to x1, then x2 will not change when x1 changes.
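A small numerical sketch can make the omitted variable bias concrete. The data and helper names below are hypothetical. The multiple regression slopes are obtained by "partialling out" (regressing y on the part of x1 that is not explained by x2), which is one way to see what holding x2 constant means. In any sample, the slope from the regression that omits x2 equals the multiple regression slope on x1 plus the slope on x2 times the slope from regressing x2 on x1.

```python
def mean(v):
    return sum(v) / len(v)


def slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals from a simple regression of y on x (with an intercept)."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Hypothetical data: x2 is related to x1, and y = 1 + 2*x1 + 3*x2 exactly.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

short_slope = slope(x1, y)             # x2 omitted from the model
b1_long = slope(residuals(x2, x1), y)  # slope on x1 holding x2 constant
b2_long = slope(residuals(x1, x2), y)  # slope on x2 holding x1 constant
delta = slope(x1, x2)                  # how x2 moves with x1
bias = short_slope - b1_long           # equals b2_long * delta
```

If either `b2_long` (x2 unrelated to y) or `delta` (x2 unrelated to x1) were zero, the product `b2_long * delta` would be zero and the bias would disappear, which matches the two conditions stated above.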
Understand the Multiple Linear Regression Assumptions Required for OLS to be the Best Linear Unbiased Estimator
Assumptions Required for OLS to be Unbiased
Assumption S1: The model is linear in the parameters.
Assumption S2: The data are collected through independent, random sampling.
Assumption S3: The data are not perfectly multicollinear.
Assumption S4: The error term has zero mean.
Assumption S5: The error term is uncorrelated with each independent variable and all functions of each independent variable.
Additional Assumption Required for OLS to be BLUE
Assumption S6: The error term has constant variance.
Note that these assumptions are theoretical and typically can’t be proven or disproven.
Assumption S1: Linear in the Parameters
This assumption states that for OLS to be unbiased, the population model must be correctly specified as linear in the parameters.
When is Assumption S1 Violated?
If the population regression model is non-linear in the parameters.
If the population model is not specified correctly, i.e. if the true model differs from the model that is actually estimated.
Assumption S2: The Data are Collected through Simple Random Sampling
This assumption states that for OLS to be unbiased, the data must be obtained through simple random sampling. This assumption ensures that the observations are statistically independent of each other across the units of observations.
When is Assumption S2 Violated?
If the data are time series data, such as GDP and interest rates for the US collected over time. In this circumstance, observations from one time period are likely related to observations in previous time periods.
If there is some type of selection bias in the sampling, for example if individuals choose whether to enter a job training program or go to college, or if the response rate for a survey is low.
Assumption S3: The Data are Not Perfectly Multicollinear
This assumption states that for OLS to be unbiased, no independent variable can take the same value for every observation, or \(\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \neq 0\) for j = 1, …, k. This assumption also states that no independent variable is an exact linear combination of the other independent variables. This assumption ensures that the slope estimators are defined. In practice, this assumption is typically violated only when the model falls into the dummy variable trap.
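The dummy variable trap can be checked numerically. The sketch below uses hypothetical data and helper names: it builds X'X for a design that contains an intercept plus a dummy for each of two groups, and shows that its determinant is zero, so the normal equations have no unique solution. Dropping one dummy restores a nonzero determinant.

```python
def det(m):
    """Determinant by cofactor expansion (adequate for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, value in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * value * det(minor)
    return total


def gram(columns):
    """X'X for a design matrix supplied as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


# Hypothetical sample of five people.
intercept = [1, 1, 1, 1, 1]
male = [1, 0, 1, 0, 0]
female = [1 - m for m in male]  # male + female equals the intercept column

trap = det(gram([intercept, male, female]))  # intercept plus both dummies
ok = det(gram([intercept, male]))            # one dummy dropped
```

Here `trap` is exactly zero because the two dummies add up to the intercept column, while `ok` is nonzero, so estimates can be computed once one category is left out as the base group.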
Assumption S4: The Error Term has Zero Mean
This assumption states that for OLS to be unbiased, the average value of the population error term is zero, or \(E(\varepsilon) = 0\). This assumption will hold as long as an intercept is included in the model. This is because if the average value of the error term equals a value other than zero, then the intercept will change accordingly.
Assumption S5: The Error Term is Not Correlated with Each Independent Variable or Any Function of Each Independent Variable
This assumption states that for OLS to be unbiased, the error term is uncorrelated with each independent variable and all functions of each independent variable: \(E(\varepsilon \mid x_{ij}) = 0\). This is read as the expected value of ε given xij is equal to 0.
How to Determine if Assumption S5 Violated?
Think of all the factors that affect the dependent variable that are not specified in the model. For the salary vs. education example, variables that are in the error term include experience, ability, job type, gender, and many other factors. If any of these factors, say ability, is related to any of the independent variables, say education, then assumption S5 is violated. Note that the error term is never observed, so determining whether S5 is violated is only a thought experiment.
The Importance of S1 through S5
If assumptions S1 through S5 hold, then the OLS estimates are unbiased. These assumptions are less likely to be violated in multiple linear regression analysis than in simple linear regression analysis, but for non-experimental data (i.e. the type of data economists use) they almost always fail, and therefore the OLS estimates are typically biased.
Assumption S6: The Error Term has Constant Variance
This assumption states that the error term has a constant variance, or in equation form \(\mathrm{Var}(\varepsilon \mid x_1, \dots, x_k) = \sigma^2\). This is called homoskedasticity. If this assumption fails, then the error term is heteroskedastic, i.e. the error term has a non-constant variance.
How to Determine if Assumption S6 Violated?
Create a scatter plot of y against each x and decide whether the points are scattered in a constant manner around the line. Heteroskedasticity does not have to look like the graph on the right on the next slide; there just has to be a non-constant distribution of the data points along the line. Chapter 9 gives more in-depth coverage of this topic.
Visual Depiction of Homoskedasticity versus Heteroskedasticity
The Importance of S1 through S6
If assumptions S1 through S6 hold, then the OLS estimates are BLUE, or the Best Linear Unbiased Estimators. In this instance, Best means minimum variance. This means that among all linear unbiased estimators of the population slopes and population intercept, the OLS estimates have the lowest variance. As before, in regression analysis in economics these assumptions rarely hold.
Interpret Multiple Linear Regression in Excel: Data Set
Scatter Diagrams
From these scatter diagrams it is evident that both square feet and bedrooms have a positive linear association with the price of a house.
Interpret Multiple Linear Regression in Excel: Regression output
Estimated Sample Regression Function :
Interpret Multiple Linear Regression in Excel: Interpreting the Output
Estimated Sample Regression Function:
\(\hat{\beta}_0\): On average, if square feet and bedrooms are 0, then the predicted house price is $89,
\(\hat{\beta}_1\): On average, holding bedrooms constant, if square footage increases by one square foot then the predicted price of the house increases by $56.11.
\(\hat{\beta}_2\): On average, holding square footage constant, if the number of bedrooms increases by one then the predicted price of the house increases by $30,
Interpret Multiple Linear Regression in Excel: Obtaining a Predicted Value
Estimated Sample Regression Function:
Suppose we wish to predict the price of a house with 2,000 square feet and 3 bedrooms. The predicted price of the house is $293,309.71.
Assess the Goodness of Fit of the Sample Multiple Linear Regression Function: R2
The R2 means that 67.48% of the variation in housing price can be explained by square feet and bedrooms.
Assess the Goodness of Fit of the Sample Multiple Linear Regression Function: Adjusted R2
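The adjusted R2 imposes a penalty for adding additional explanatory variables: \(\bar{R}^2 = 1 - \frac{\text{USS}/(n-k-1)}{\text{TotalSS}/(n-1)}\). The penalty is that as k goes up, n − k − 1 goes down, so the numerator USS/(n − k − 1) goes up and the adjusted R2 goes down (if USS is held constant).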
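The penalty can be illustrated with a short calculation. This sketch uses the equivalent form 1 − (1 − R2)(n − 1)/(n − k − 1) and the R2 of 67.48% reported for the housing example. The sample size n = 10 is an assumption, inferred from the 7 denominator degrees of freedom reported for the F-test with k = 2; the function name is hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k slope coefficients and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


r2 = 0.6748  # R2 from the housing example
n = 10       # assumed: inferred from n - k - 1 = 7 with k = 2

adj_two_regressors = adjusted_r2(r2, n, k=2)
# Same R2 (that is, same USS) but one more explanatory variable:
adj_three_regressors = adjusted_r2(r2, n, k=3)
```

Holding R2 fixed, moving from two to three explanatory variables lowers the adjusted R2, which is the penalty described above.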
Assess the Goodness of Fit of the Sample Multiple Linear Regression Function: Standard Error of the Regression Model
The standard error of the regression can also be calculated by taking the square root of the MSUnexplained.
Perform Hypothesis Tests for the Overall Significance of the Sample Regression Function
F-Test for Overall Significance of the Model
Shows if there is a linear relationship between any of the independent variables considered together and the dependent variable, y
Use the F test statistic
Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects y)
F-Statistic for Overall Significance
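Test statistic: \(F = \frac{\text{ExplainedSS}/k}{\text{UnexplainedSS}/(n-k-1)} = \frac{\text{MSExplained}}{\text{MSUnexplained}}\), where F has (numerator) D1 = k and (denominator) D2 = (n − k − 1) degrees of freedom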
Rejection Rules for the F-Test for the Overall Significance of the Regression Model
Critical Value: Reject H0 if F-Stat > Fα, k, n-k-1
P-value: Reject H0 if p-value < α (the p-value for this test is found under Significance F in the ANOVA table in Excel)
F-Test for Overall Significance
[Excel ANOVA output: the F-statistic has 2 and 7 degrees of freedom; the p-value for the F-test appears under Significance F]
F-Test for Overall Significance
H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05, df1 = 2, df2 = 7
Critical Value: F.05, 2, 7 = 4.737
Test Statistic: the F-statistic from the ANOVA table in the Excel output
Rejection Rule: Reject H0 if F-stat > 4.737 or reject H0 if p-value < .05
Conclusion: Because the F-statistic is greater than 4.737 (or alternatively because the p-value < .05), we reject H0 and conclude that at least one of square footage or bedrooms affects the price of a house.
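As a rough check on this conclusion, the overall F-statistic can be recomputed from the reported R2, because dividing the explained and unexplained sums of squares by the total sum of squares gives F = (R2/k) / ((1 − R2)/(n − k − 1)). The sketch below assumes n = 10 (implied by df2 = 7 with k = 2) and uses the rounded R2 of 0.6748, so the result may differ slightly from the value in the Excel output; the function name is hypothetical.

```python
def overall_f(r2, n, k):
    """Overall F-statistic computed from R-squared."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


f_stat = overall_f(0.6748, n=10, k=2)  # n = 10 is an inferred assumption
critical_value = 4.737                 # F.05, 2, 7 as stated on the slide
reject_h0 = f_stat > critical_value
```

Under these assumptions the computed statistic exceeds 4.737, which agrees with the decision to reject H0.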
Are Individual Independent Variables Significant?
Use t-tests of individual variable slopes
Shows if there is a linear relationship between the variable xi and y
Hypotheses:
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (linear relationship does exist between xi and y)
Are Individual Independent Variables Significant?
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (linear relationship does exist between xi and y)
Three ways to test this hypothesis: (1) Confidence Interval (2) Critical Value (3) p-value
Using a Confidence Interval to test Individual Statistical Significance
H0: βi = 0 (no linear relationship between xi and y) H1: βi ≠ 0 (linear relationship exists between xi and y) Reject H0 if 0 is not within the confidence interval. The α is 1 – the confidence level. The confidence level is usually 95% so α = .05
Confidence Interval Estimate for the Slope
Confidence interval for the population slope β1 (the effect of changes in square feet on house prices): \(\hat{\beta}_1 \pm t_{\alpha/2,\,n-k-1} \cdot se(\hat{\beta}_1)\)
Decision: This confidence interval includes 0, so we fail to reject H0 and conclude that square feet does not have a statistically significant effect on the price of a house at the 5% level. The interval is different from the Excel output due to rounding.
Using Critical Values to test Individual Statistical Significance
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (linear relationship exists between xi and y)
Test Statistic: \(t = \hat{\beta}_i / se(\hat{\beta}_i)\)
Rejection Rule: Reject H0 if |t-statistic| > t α/2, n-k-1
Using Critical Values to Test for Individual Significance of Square Feet (x1)
Rejection Rule: Reject H0 if |t-statistic| > 2.36
Decision: Because the |t-statistic| is less than 2.36, we fail to reject H0 and conclude that square feet does not have a statistically significant effect on the price of a house at the 5% level.
Using p-values to Test Individual Statistical Significance
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (linear relationship exists between xi and y)
Test Statistic: \(t = \hat{\beta}_i / se(\hat{\beta}_i)\) (usually the p-value is found on the Excel output)
Rejection Rule: Reject H0 if p-value < α
Using p-value to Test for Individual Significance of Square Feet (x1)
Rejection Rule: Reject H0 if p-value < .05
Decision: Because the p-value is greater than .05, we fail to reject H0 and conclude that square feet does not have a statistically significant effect on the price of a house at the 5% level.
Things to note about the different methods for tests of individual significance
All three methods yield the same conclusions.
To test for individual significance of bedrooms instead of square footage, follow the same process but use the row below square footage.
Using any of the three methods, we see that bedrooms is also statistically insignificant at the 5% level.
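The agreement between the confidence interval method and the critical value method can be seen in a short sketch. The coefficient and standard error values below are hypothetical (the slide's own numbers are not reproduced here), the function name is illustrative, and 2.36 is the critical value used on the slides for 7 degrees of freedom.

```python
def slope_test(b, se, t_crit):
    """Two-sided test of H0: beta = 0 by critical value and by confidence interval."""
    t_stat = b / se
    lower, upper = b - t_crit * se, b + t_crit * se
    reject_by_t = abs(t_stat) > t_crit
    reject_by_ci = not (lower <= 0 <= upper)  # reject when 0 lies outside the interval
    return t_stat, (lower, upper), reject_by_t, reject_by_ci


# Hypothetical estimate that is small relative to its standard error.
t_a, ci_a, by_t_a, by_ci_a = slope_test(b=50.0, se=30.0, t_crit=2.36)

# Hypothetical estimate that is large relative to its standard error.
t_b, ci_b, by_t_b, by_ci_b = slope_test(b=90.0, se=30.0, t_crit=2.36)
```

In the first case both methods fail to reject H0, and in the second both reject it. The interval excludes 0 exactly when the estimate is more than `t_crit` standard errors from 0, so the two decisions cannot disagree.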
What is multicollinearity?*
Multicollinearity is when two of the independent variables are highly linearly related. Note that multicollinearity is not perfect multicollinearity. Perfect multicollinearity implies that the correlation coefficient is 1 in absolute value. Multicollinearity means that the correlation coefficient between two independent variables is high but not perfect.
*Note: This material is not covered in the textbook
Venn Diagram Explanation of Multicollinearity
What are the Implications of Multicollinearity?
Unlike perfect multicollinearity OLS estimates can still be obtained. OLS estimates are still unbiased. Standard errors are large because there is very little information that goes into the estimation of each of the slopes.
Perform Hypothesis Tests for the Joint Significance of a Subset of Slope Coefficients
The original regression model is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\). After testing for individual significance, x2 and x3 are individually statistically insignificant at the 5% level. The researcher would like to know if x2 and x3 are jointly statistically significant.
Perform Hypothesis Tests for a Subset of Explanatory Variables
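This is an F-Test for joint statistical significance
Hypothesis:
H0: β2 = β3 = 0 (no linear relationship)
H1: at least one of β2 or β3 explains y
Unrestricted model (the original model): \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\)
Restricted model (the model with the null hypothesis imposed, in this case β2 = β3 = 0): \(y = \beta_0 + \beta_1 x_1 + \varepsilon\)
F-Statistic for Joint Significance
Test statistic: \(F = \frac{(\text{UnexplainedSS}_{restricted} - \text{UnexplainedSS}_{unrestricted})/q}{\text{UnexplainedSS}_{unrestricted}/(n-k-1)}\), where q is the number of restrictions (the number of equal signs in the null hypothesis, in this case 2), or \(F = \frac{(R^2_{unrestricted} - R^2_{restricted})/q}{(1 - R^2_{unrestricted})/(n-k-1)}\)
Rejection Rules for the F-Test for the Joint Significance of a Subset of Slope Coefficients
Critical Value: Reject H0 if F-Stat > Fα, q, n-k-1
For this test, it is necessary to run two regressions: the unrestricted regression and the restricted regression.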
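The two-regression procedure can be sketched as follows. All sums of squares below are hypothetical numbers chosen for easy arithmetic, and the function names are illustrative. The sketch computes the joint F-statistic once from the unexplained sums of squares and once from the R2 values, where k is the number of slope coefficients in the unrestricted model and each R2 equals 1 − UnexplainedSS/TotalSS.

```python
def joint_f_from_uss(uss_restricted, uss_unrestricted, q, n, k):
    """Joint F-statistic from the unexplained sums of squares of the two regressions."""
    return ((uss_restricted - uss_unrestricted) / q) / (uss_unrestricted / (n - k - 1))


def joint_f_from_r2(r2_restricted, r2_unrestricted, q, n, k):
    """The same statistic written in terms of the two R-squared values."""
    return ((r2_unrestricted - r2_restricted) / q) / ((1 - r2_unrestricted) / (n - k - 1))


# Hypothetical results from the two regressions (same dependent variable, same TotalSS).
total_ss = 1000.0
uss_unrestricted = 300.0  # all three independent variables included
uss_restricted = 400.0    # the two tested variables dropped
q, n, k = 2, 10, 3

f_uss = joint_f_from_uss(uss_restricted, uss_unrestricted, q, n, k)
f_r2 = joint_f_from_r2(1 - uss_restricted / total_ss, 1 - uss_unrestricted / total_ss, q, n, k)
```

Both versions give the same statistic. With these hypothetical inputs it is well below a 5% critical value such as F.05, 2, 6 = 5.143, so H0 would not be rejected.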
For the Housing Price Example
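The original model is \(\text{price} = \beta_0 + \beta_1\,\text{lot size} + \beta_2\,\text{square feet} + \beta_3\,\text{bedrooms} + \varepsilon\). The UnexplainedSS from this regression is the UnexplainedSSunrestricted.
Using the p-values, lot size is individually statistically significant at the 5% level but square feet and bedrooms are statistically insignificant at the 5% level.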
Testing if Square Feet and Bedrooms are Jointly Equal to 0
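Hypothesis:
H0: β2 = β3 = 0 (no joint linear relationship)
H1: at least one of β2 or β3 explains y
Restricted model (the model with the null hypothesis imposed, in this case β2 = β3 = 0): \(\text{price} = \beta_0 + \beta_1\,\text{lot size} + \varepsilon\). The UnexplainedSS from this regression is the UnexplainedSSrestricted.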
F-Statistic for Joint Significance
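Test statistic: \(F = \frac{(\text{UnexplainedSS}_{restricted} - \text{UnexplainedSS}_{unrestricted})/2}{\text{UnexplainedSS}_{unrestricted}/6}\)
Reject H0 if F-Stat > F.05, 2, 6
Reject H0 if F-Stat > 5.143
Decision: Because the F-statistic is not greater than 5.143, we fail to reject H0 and conclude that square feet and bedrooms do not jointly have a statistically significant effect on house price.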
F-Statistic for Joint Significance Using R2
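Test statistic: \(F = \frac{(R^2_{unrestricted} - R^2_{restricted})/2}{(1 - R^2_{unrestricted})/6}\)
Reject H0 if F-Stat > F.05, 2, 6
Reject H0 if F-Stat > 5.143
Decision: Because the F-statistic is not greater than 5.143, we fail to reject H0 and conclude that square feet and bedrooms do not jointly have a statistically significant effect on house price.
Notice that we obtained the same F-Statistic using SSUnexplained as we did using R2.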
Chow Test
Use to test if there are statistical differences between two groups such as men and women, those who have graduated from college and those who haven’t, etc.
For the Chow test run three regressions:
The entire data set all together; the USS is the UnexplainedSSrestricted
One subset of the data (e.g. only the men); the USS is the UnexplainedSS1
The other subset of the data (e.g. only the women); the USS is the UnexplainedSS2
The Hypothesis, Test Statistic, and Rejection Rule
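H0: There are no differences between the two groups
H1: There is at least one difference between the two groups
Test Statistic: \(F = \frac{[\text{UnexplainedSS}_{restricted} - (\text{UnexplainedSS}_1 + \text{UnexplainedSS}_2)]/(k+1)}{(\text{UnexplainedSS}_1 + \text{UnexplainedSS}_2)/(n - 2(k+1))}\)
Rejection Rule: Reject H0 if F-Stat > Fα, k+1, n−2(k+1), where n is the total number of observations in the two groups combined
If the null hypothesis is rejected, then we conclude that a difference exists between the two groups in the intercepts, the slopes, or both.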
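The three-regression Chow calculation can be sketched in code. The sums of squares and sample size below are hypothetical, and the function name is illustrative. The sketch uses the standard Chow degrees of freedom: k + 1 in the numerator (one restriction for the intercept and one for each slope) and n − 2(k + 1) in the denominator, where n is the total number of observations across both groups.

```python
def chow_f(uss_pooled, uss_1, uss_2, n, k):
    """Chow F-statistic plus its numerator and denominator degrees of freedom."""
    uss_separate = uss_1 + uss_2  # unexplained SS when each group has its own regression
    num_df = k + 1
    den_df = n - 2 * (k + 1)
    f_stat = ((uss_pooled - uss_separate) / num_df) / (uss_separate / den_df)
    return f_stat, num_df, den_df


# Hypothetical results: pooled regression, then men only, then women only.
f_stat, num_df, den_df = chow_f(uss_pooled=500.0, uss_1=180.0, uss_2=220.0, n=46, k=2)
```

The resulting statistic would then be compared with the critical value from an F-table at `num_df` and `den_df` degrees of freedom.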
Creating a Confidence Interval Around a Prediction in Multiple Linear Regression *
The formula for the confidence interval is \(\hat{y} \pm t_{\alpha/2,\,n-k-1} \cdot se(\hat{y})\), where \(\hat{y}\) is the predicted value, \(t_{\alpha/2,\,n-k-1}\) is the critical value from the t-table, and \(se(\hat{y})\) is the standard error of the prediction. The only component that we don’t know how to obtain is the standard error of the prediction.
*Note: This material is not covered in the textbook
Finding the Standard Error of the Prediction
There is not a straightforward formula for the standard error of the prediction like there is in simple linear regression. To find this standard error we need to create new variables and run an additional regression. To create the new variables, for each observation and each independent variable, subtract the value of that independent variable at which you want to predict.
Original Regression Results
The Housing Price Example from Before:
Estimated Sample Regression Function : Suppose we wish to predict the price of a house with 2,000 square feet and 3 bedrooms. Say we want to put a confidence interval about this prediction.
An Example of How to Find the Standard Error of the Prediction
Create two new variables in Excel by subtracting 2,000 from each square feet observation and 3 from each bedroom observation and then run a regression with price as the dependent variable and the two new variables that were just created as the independent variables.
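The reason this works can be seen in a small sketch: after the shift, the slopes are unchanged and the new intercept equals the predicted value at 2,000 square feet and 3 bedrooms, so the intercept's standard error in the regression output is the standard error of the mean prediction. The data below are hypothetical (prices in thousands of dollars), and the closed-form two-regressor fit is used only to keep the example self-contained; it is not how Excel computes the regression.

```python
def fit_two_regressors(x1, x2, y):
    """OLS intercept and slopes for y on x1 and x2 using deviation-from-mean sums."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    d = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / d
    b2 = (s11 * s2y - s12 * s1y) / d
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Hypothetical housing data.
sqft = [1500, 1800, 2100, 2400, 2000, 2600]
beds = [2, 3, 3, 4, 3, 4]
price = [200.0, 250.0, 270.0, 330.0, 265.0, 340.0]

b0, b1, b2 = fit_two_regressors(sqft, beds, price)
prediction = b0 + b1 * 2000 + b2 * 3

# New variables: subtract the values we want to predict at.
sqft_shifted = [s - 2000 for s in sqft]
beds_shifted = [b - 3 for b in beds]
c0, c1, c2 = fit_two_regressors(sqft_shifted, beds_shifted, price)
```

Here `c0` matches `prediction`, while `c1` and `c2` match the original slopes.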
Example of Making New Independent Variables:
Same Dependent variable
Excel Regression Results to Find a 95% Prediction Interval for a Mean Value
Predicted Value: $293,309.71 (note this is the same value we found earlier)
Standard Error of the Mean Prediction: $22,932.24
Confidence interval around the prediction: the 95% confidence interval for the mean is ($239,083.58, $347,535.84)
Excel Regression Results to find a 95% prediction interval for an individual value
Predicted Value: $293,309.71
Critical Value = 2.36
Standard Error of an Individual Prediction: \(\sqrt{22{,}932.24^2 + (\text{standard error of the regression})^2}\) = $61,850.01
Confidence interval around a prediction: the 95% prediction interval for an individual is 293,309.71 ± (2.36)(61,850.01) = 293,309.71 ± 145,966.02 = (147,343.69, 439,275.73)
Notice how much bigger the interval is than before.
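The interval arithmetic can be reproduced directly from the figures on this slide. The sketch also backs out the standard error of the regression implied by the two standard errors, assuming the usual relationship in which the squared standard error of an individual prediction equals the squared standard error of the mean prediction plus the squared standard error of the regression. That implied value is derived here, not quoted from the Excel output.

```python
from math import sqrt

predicted = 293_309.71
se_mean = 22_932.24        # standard error of the mean prediction
se_individual = 61_850.01  # standard error of an individual prediction
t_crit = 2.36              # critical value used on the slide

margin = t_crit * se_individual
interval = (predicted - margin, predicted + margin)

# Implied standard error of the regression (an inference, see the assumption above).
implied_ser = sqrt(se_individual ** 2 - se_mean ** 2)
```

The computed `interval` matches the prediction interval above to the cent, and nearly all of its width comes from the regression's standard error, not from uncertainty about the estimated coefficients.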
How to Test if Two Coefficient Estimates are Equal*
Say the original regression model is \(y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + \varepsilon_i\) and you want to test if \(\beta_1\) is equal to \(\beta_2\). This is a t-test and is difficult to obtain in Excel.
H0: β1 = β2 or β1 - β2 = 0
H1: β1 ≠ β2 or β1 - β2 ≠ 0
*Note: This material is not covered in the textbook
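How to Obtain the Test in Excel
Set β1 - β2 = θ and solve for β1, so that β1 = β2 + θ.
Substitute β2 + θ for β1 in the regression model and isolate the parameters: \(y_i = \beta_0 + (\beta_2 + \theta) x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + \varepsilon_i = \beta_0 + \theta x_{1,i} + \beta_2 (x_{1,i} + x_{2,i}) + \beta_3 x_{3,i} + \varepsilon_i\)
Create a new variable (x1,i + x2,i) and regress y on x1,i, (x1,i + x2,i), and x3,i. The t-test and the p-value for the t-test are in the row with x1,i.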
Original Regression
Say we want to test if the coefficients on bedrooms and bathrooms are equal.
Point estimate of θ: 32,416.39 – (-4,257.54) = $36,673.93
Create a New Variable (Bedrooms + Bathrooms)
Dependent variable Independent Variables
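Excel Regression Results to Test if Two Coefficient Estimates are Equal
Point Estimate of \(\hat{\theta}\) = $36,673.93
Standard Error of \(\hat{\theta}\) = $67,466.01
t-stat for this test = Point Estimate / Standard Error
p-value: found in the row for x1,i in the Excel output
We fail to reject H0 and conclude that β1 and β2 are not statistically different.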
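The t-statistic for this test follows from the point estimate and standard error reported above. The sketch below computes it and compares it with 1.96, the large-sample two-sided 5% critical value. Every t-table critical value at the 5% level is at least that large, so a statistic below 1.96 in absolute value cannot lead to rejection, whatever the degrees of freedom of this regression are.

```python
theta_hat = 36_673.93  # point estimate of beta1 - beta2 from the slide
se_theta = 67_466.01   # its standard error from the slide

t_stat = theta_hat / se_theta  # test statistic under H0: theta = 0

# 1.96 is a lower bound on every two-sided 5% t critical value.
fail_to_reject = abs(t_stat) < 1.96
```

The statistic is roughly half a standard error from zero, which is consistent with the decision not to reject H0.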
