Lecture 14: Omitted Explanatory Variables, Multicollinearity, and Irrelevant Explanatory Variables Omitted Explanatory Variables: “Too Few” Explanatory.

1 Lecture 14: Omitted Explanatory Variables, Multicollinearity, and Irrelevant Explanatory Variables
Review
  Unbiased Estimation Procedures
    Mean of an Estimate’s Probability Distribution
    Variance of an Estimate’s Probability Distribution
  Correlated and Independent Variables
    Correlation Coefficient
Omitted Explanatory Variables: “Too Few” Explanatory Variables
  A Puzzle: Baseball Attendance
  Goal of Multiple Regression Analysis
  Omitted Explanatory Variables, Proxy Effect, and Bias
  Resolving the Baseball Attendance Puzzle
Multicollinearity: Highly Correlated Explanatory Variables
  Perfectly Correlated Explanatory Variables
  Highly Correlated Explanatory Variables
  Earmarks of Multicollinearity
Irrelevant Explanatory Variables: “Too Many” Explanatory Variables

2 Why Is the Mean of the Estimate’s Probability Distribution Important?
Unbiased Estimation Procedure
A mean describes the center of its probability distribution.
Formally, an estimation procedure is unbiased whenever the mean of the estimate’s probability distribution equals the actual value: Mean[Estimate] = Actual Value.
Intuition (Relative Frequency Interpretation of Probability): Mean[Estimate] = Average of the estimate’s numerical values after many, many repetitions = Actual Value.
Conceptually, an estimation procedure is unbiased whenever it does not systematically underestimate or overestimate the actual value.
If the probability distribution is symmetric we have even more intuition: in one repetition, the chances that the estimate is too low equal the chances that the estimate is too high.
[Figure: Probability Distribution of Estimate, centered on the Actual Value; horizontal axis: Estimate]

3 Why Is the Variance (Spread) of the Estimate’s Probability Distribution Important?
When the Estimation Procedure Is Unbiased
Reliability of the Estimate: When the estimation procedure is unbiased, the variance (spread) of the estimate’s probability distribution indicates how reliable the estimate is; that is, how likely the numerical value of the estimate is to be “close to” the actual value.
  Variance large → Small probability that the numerical value of the estimate from one repetition of the experiment will be close to the actual value → Estimate is unreliable
  Variance small → Large probability that the numerical value of the estimate from one repetition of the experiment will be close to the actual value → Estimate is reliable
[Figure: Probability Distributions of Estimates around the Actual Value, one panel with large variance and one with small variance; horizontal axis: Estimate]

4 Review: Independent and Correlated Variables
Two variables are:
  independent (uncorrelated) whenever the value of one variable does not help us predict the value of the other;
  correlated whenever the value of one variable does help us predict the value of the other.
Correlation Coefficient: Ranges from −1 to +1, indicating how correlated two variables are.
  = 0: Independent (uncorrelated); knowing the value of one variable does not help us predict the value of the other.
  > 0: Positive correlation; typically, when the value of one variable is high, the value of the other variable will be high.
  < 0: Negative correlation; typically, when the value of one variable is high, the value of the other variable will be low.
Omitted Explanatory Variables: “Too Few” Explanatory Variables
A Puzzle: AL Baseball Attendance in the Summer of 1996 (588 MLB Games)
  Attendance_t         Attendance
  DateDay_t            Day of game
  DateMonth_t          Month of game
  DH_t                 Dummy variable; 1 if designated hitter allowed, 0 otherwise
  HomeGamesBehind_t    Games behind of the home team
  HomeNetWins_t        Net wins (wins less losses) of the home team before game
  HomeSalary_t         Player salaries of the home team
  PriceTicket_t        Average ticket price
  VisitSalary_t        Player salaries of the visiting team

5 Model 1: Attendance depends on ticket price only.
Model: Attendance_t = β_Const + β_Price PriceTicket_t + e_t
Theory: PriceTicket, β_Price < 0
  Dependent Variable: Attendance
  Explanatory Variable(s)   Estimate    SE          t-Statistic   Prob
  PriceTicket               1896.611    142.7238    13.28868      0.0000
  Const                     3688.911    1839.117    2.005805      0.0453
  Number of Observations    585
Critical Result: Higher ticket prices lead to higher attendance, the opposite of what the theory predicts. Back to the drawing board.
Model 2: Attendance depends on ticket price and salary of home team.
Model: Attendance_t = β_Const + β_Price PriceTicket_t + β_HomeSalary HomeSalary_t + e_t
Theory: PriceTicket, β_Price < 0; HomeSalary, β_HomeSalary > 0
  Dependent Variable: Attendance
  Explanatory Variable(s)   Estimate    SE          t-Statistic   Prob
  PriceTicket               −590.7836   184.7231    −3.198211     0.0015
  HomeSalary                783.0394    45.23955    17.30874      0.0000
  Const                     9246.429    1529.658    6.044767      0.0000
  Number of Observations    585
Critical Results: Higher ticket prices decrease attendance. Higher home team salaries (star player attraction) increase attendance.
Puzzle: How can we explain the dramatic change in the ticket price coefficient estimate and why is it important?
→ EViews

6 Doubts Grow Over Flu Vaccine in Elderly, by Brenda Goodman, NYT 9-1-2008
The influenza vaccine, which has been strongly recommended for people over 65 for more than four decades, is losing its reputation as an effective way to ward off the virus in the elderly. A growing number of immunologists and epidemiologists say the vaccine probably does not work very well for people over 70 …
The latest blow was a study in The Lancet last month that called into question much of the statistical evidence for the vaccine’s effectiveness.
The study found that people who were healthy and conscientious about staying well were the most likely to get an annual flu shot. … [others] are less likely to get to their doctor’s office or a clinic to receive the vaccine.
Dr. David K. Shay of the Centers for Disease Control and Prevention, a co-author of a commentary that accompanied Dr. Jackson’s study, agreed that these measures of health … “were not incorporated into early estimations of the vaccine’s effectiveness” and could well have skewed the findings.
Question: Should government programs to subsidize flu shots be continued?
  Are those who get a flu shot more or less likely to get the flu? Less.
  Are those who are conscientious about staying well more or less likely to get the flu? Less.
  Are those who are conscientious about staying well more or less likely to get a flu shot? More.
  If being conscientious about staying well is NOT considered in judging the effectiveness of flu shots, will the effectiveness of flu shots be overestimated or underestimated? Overestimated.

7 Goal of Multiple Regression Analysis
Multiple Regression Analysis: Attempts to separate out, to sort out, to isolate, the influence of each individual explanatory variable.
Preview: Omitting an explanatory variable from a regression will bias the OLS estimation procedure whenever two conditions are met. Bias results when the omitted variable:
  influences the dependent variable;
  is correlated with an included variable.
When these two conditions are met, the coefficient estimate of the included explanatory variable fails to separate out its individual influence; instead the coefficient estimate captures two effects:
  Direct Effect: The influence that the included explanatory variable itself actually has on the dependent variable; this is the effect we want the coefficient estimate to capture.
  Proxy Effect: The influence that the omitted explanatory variable has on the dependent variable because the included variable is acting as a proxy for the omitted variable.
Claim: The proxy effect leads to bias.
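The direct and proxy effects can be written as one line of algebra. The following is a standard textbook result rather than something stated on the slide, sketched under the assumption that the true model contains two explanatory variables, x1 and x2, that x2 is omitted, and that the values of the explanatory variables are treated as fixed. The symbols \tilde{b}_{x1} (the coefficient estimate from the regression that omits x2) and \delta (the slope from regressing x2 on x1) are introduced here for illustration only.

```latex
\[
\operatorname{Mean}[\tilde{b}_{x1}]
  = \underbrace{\beta_{x1}}_{\text{direct effect}}
  + \underbrace{\beta_{x2}\,\delta}_{\text{proxy effect}},
\qquad
\delta = \frac{\sum_t (x_{1,t} - \bar{x}_1)(x_{2,t} - \bar{x}_2)}
              {\sum_t (x_{1,t} - \bar{x}_1)^2}
\]
```

The proxy term disappears when the omitted variable does not influence the dependent variable (\beta_{x2} = 0) or when it is uncorrelated with the included variable (\delta = 0), which are exactly the two conditions listed above.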

8 Econometrics Lab (Lab 14.1)
Model: y_t = β_Const + β_x1 x1_t + β_x2 x2_t + e_t
Question: What if the second explanatory variable were omitted from the regression (only x1 included)?
x1 Up → y Up: Direct Effect (β_x1 > 0). The direct effect is what we want the coefficient estimate to capture, the influence that x1 has on y.
Simulation settings, and what typically happens to x2 and y when x1 goes up:
  Actual Coef1   Actual Coef2   Corr Parameter   Chain of effects
  2              5              .00              β_x2 > 0, Uncorrelated → x2 Unaffected → y Unaffected → No Proxy Effect
  2              0              .30              β_x2 = 0, Positive Correlation → x2 Up → y Unaffected → No Proxy Effect
  2              5              .30              β_x2 > 0, Positive Correlation → x2 Up → y Up → Proxy Effect
Simulation results (mean of Coef1 estimates; share of estimates below and above the actual value):
  Unbiased:                                        ≈2.0; ≈50% below, ≈50% above
  Biased (Actual Coef2 = 5, Corr Parameter = .30): ≈3.5; ≈28% below, ≈72% above
Summary:
  Does the omitted variable influence    Is the omitted variable correlated    Estimation procedure for
  the dependent variable?                with the included variable?           the included variable
  Yes                                    No                                    Unbiased
  No                                     Yes                                   Unbiased
  Yes                                    Yes                                   Biased
Conclusion: Bias results from the proxy effect.
What does bias mean, and what does it not mean?
  What bias DOES mean: The estimation procedure systematically overestimates or underestimates the actual value.
  What bias DOES NOT mean: The value of the estimate in a single repetition will be less than or greater than the actual value with certainty.
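The lab experiment can be imitated with a few lines of NumPy. This is a minimal sketch, not the lab's actual implementation: the function name `simulate_coef1`, the sample size, the number of repetitions, the unit variances of x1 and x2, the constant of 1, and the error standard deviation of 1 are all assumptions made for illustration. Only the actual coefficients (2 and 5 or 0) and the correlation parameters (.00 and .30) come from the slide.

```python
import numpy as np

def simulate_coef1(beta1=2.0, beta2=5.0, corr=0.30, include_x2=False,
                   n_obs=50, n_reps=2000, seed=0):
    """Return the OLS estimates of beta1 from repeated simulated samples."""
    rng = np.random.default_rng(seed)
    # Assumed: x1 and x2 have unit variances, so their covariance equals corr.
    cov = np.array([[1.0, corr], [corr, 1.0]])
    estimates = np.empty(n_reps)
    for rep in range(n_reps):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n_obs)
        e = rng.normal(0.0, 1.0, size=n_obs)           # assumed error term
        y = 1.0 + beta1 * x[:, 0] + beta2 * x[:, 1] + e
        columns = [np.ones(n_obs), x[:, 0]]            # constant and x1
        if include_x2:
            columns.append(x[:, 1])                    # add x2 only on request
        X = np.column_stack(columns)
        coefs, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
        estimates[rep] = coefs[1]                      # coefficient on x1
    return estimates

# The three settings from the lab, with x2 omitted from the regression.
means_only_x1 = {}
for beta2, corr in [(5.0, 0.00), (0.0, 0.30), (5.0, 0.30)]:
    est = simulate_coef1(beta2=beta2, corr=corr, include_x2=False)
    means_only_x1[(beta2, corr)] = est.mean()
    print(f"Coef2={beta2}, corr={corr:.2f}: mean of Coef1 estimates = {est.mean():.2f}")

# Same troublesome setting, but with both explanatory variables included.
mean_both = simulate_coef1(beta2=5.0, corr=0.30, include_x2=True).mean()
print(f"Both included (Coef2=5, corr=.30): mean of Coef1 estimates = {mean_both:.2f}")
```

Under these assumptions the mean of the estimates should come out near 2 in the two settings without a proxy effect and near 2 + 5 × .30 = 3.5 when x2 both matters and is correlated with x1. The shares of estimates above and below the actual value depend on the assumed error variance, so they will not reproduce the lab's percentages.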

9 Resolving the Puzzle
Focus on this regression and consider home team salary.
  Dependent Variable: Attendance
  Explanatory Variable(s)   Estimate    SE          t-Statistic   Prob
  PriceTicket               1896.611    142.7238    13.28868      0.0000
  Const                     3688.911    1839.117    2.005805      0.0453
  Number of Observations    585
How might we explain the positive estimate?
Is there good reason to believe that home team salary affects attendance? Yes: β_HomeSalary > 0.
Is there good reason to believe that ticket price and home team salary are correlated? Yes.
  Correlation Matrix
                 PriceTicket   HomeSalary
  PriceTicket    1.000000      0.777728
  HomeSalary     0.777728      1.000000
When only PriceTicket is included:
  PriceTicket Up → Direct Effect (β_Price < 0) → Attendance Down
  PriceTicket Up → Positive Correlation → Typically, HomeSalary Up → Proxy Effect (β_HomeSalary > 0) → Attendance Up
As a consequence of the proxy effect, when HomeSalary is omitted from a regression, the procedure to estimate the coefficient of PriceTicket is biased up; typically, the procedure pushes the coefficient estimate up, in the positive direction. This could explain the positive coefficient estimate in the regression.
Second Lesson (Lab 14.2): Is the estimation procedure biased when both explanatory variables are included in the regression?
  Actual Coef1   Actual Coef2   Correlation Parameter   Mean of Coef1 Estimates
  2              5              .30                     ≈2.0
Answer: When all explanatory variables that affect the dependent variable are included in the regression, the procedure is unbiased.
→ EViews

10 Question: What happens when we add home team salaries?
  Dependent Variable: Attendance
  Explanatory Variable(s)   Estimate    SE          t-Statistic   Prob
  PriceTicket               −590.7836   184.7231    −3.198211     0.0015
  HomeSalary                783.0394    45.23955    17.30874      0.0000
  Const                     9246.429    1529.658    6.044767      0.0000
  Number of Observations    585
Answer: The estimate for the ticket price coefficient is now negative, as the theory suggests.
Summary: Omitted Variable Phenomenon. Omitting an explanatory variable from a regression will bias the OLS estimation procedure whenever two conditions are met. Bias results when the omitted explanatory variable:
  influences the dependent variable;
  is correlated with an included explanatory variable.
When these two conditions are met, the coefficient estimate of the included explanatory variable captures two effects:
  Direct Effect: The effect that the included explanatory variable actually has on the dependent variable;
  Proxy Effect: The effect that the omitted explanatory variable has on the dependent variable because the included explanatory variable acts as a proxy for the omitted explanatory variable.
The proxy effect causes the bias.
Question: So, what should we do?
Answer: Recall the second lesson our simulation taught us. To eliminate the possibility of bias, include all relevant explanatory variables.
Preview: But now we shall see that if two explanatory variables are highly correlated, a different problem can arise: Multicollinearity.
→ EViews

11 Multicollinearity: Highly Correlated Explanatory Variables
Perfectly Correlated Explanatory Variables
Multiple Regression Analysis: Attempts to separate out, to sort out, to isolate, the influence of each individual explanatory variable.
Generate a new variable, the price of tickets expressed in terms of cents: PCents = 100 × PriceTicket
  Correlation Matrix
                 PriceTicket   PCents
  PriceTicket    1.000000      1.000000
  PCents         1.000000      1.000000
CorrCoef = 1.0
Estimate the following model: Attendance_t = β_Const + β_Price PriceTicket_t + β_PCents PCents_t + e_t
Error message: “Near singular matrix.”
Question: Why couldn’t you estimate this model?
The explanatory variables PriceTicket and PCents contain precisely the same information. Knowing the value of one variable allows us to determine the value of the other variable exactly.
  Explanatory variables perfectly correlated → Both variables contain precisely the same information → Impossible to separate out their individual effects
We are asking the software to do the impossible.
→ EViews
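One way to see why the software gives up is to look at the rank of the matrix of explanatory variables. The sketch below uses made-up ticket prices (the uniform range of 5 to 20 dollars and the sample size of 100 are assumptions, not the lecture's data) and NumPy's `matrix_rank`; it illustrates the idea and is not a reproduction of what EViews does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 100

# Hypothetical ticket prices in dollars (assumed values, not the lecture's data).
price_ticket = rng.uniform(5.0, 20.0, size=n_obs)
p_cents = 100.0 * price_ticket            # same information, expressed in cents

# Columns: constant, PriceTicket, PCents.
X = np.column_stack([np.ones(n_obs), price_ticket, p_cents])

rank = np.linalg.matrix_rank(X)
corr = np.corrcoef(price_ticket, p_cents)[0, 1]

print(f"columns in X: {X.shape[1]}, rank of X: {rank}")
print(f"correlation coefficient: {corr:.6f}")
```

With three columns but only two independent pieces of information, X'X cannot be inverted, so no unique set of OLS coefficient estimates exists.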

12 Multicollinearity: Highly Correlated Explanatory Variables
Question: What happens if the explanatory variables, while not perfectly correlated, are highly correlated?
Model: Attendance_t = β_Const + β_Price PriceTicket_t + β_HomeSalary HomeSalary_t + β_HomeNW HomeNetWins_t + β_HomeGB HomeGamesBehind_t + e_t
Team Quality Theory: β_HomeNW > 0      H0: β_HomeNW = 0    H1: β_HomeNW > 0
Division Race Theory: β_HomeGB < 0     H0: β_HomeGB = 0    H1: β_HomeGB < 0
  2009 Final Standings, AL East
  Team                 Wins   Losses   Net Wins   Games Behind
  New York Yankees     103    59       44         0
  Boston Red Sox       95     67       28         8
  Tampa Bay Rays       84     78       6          19
  Toronto Blue Jays    75     87       −12        28
  Baltimore Orioles    64     98       −34        39
  Correlation Matrix
                     HomeNetWins   HomeGamesBehind
  HomeNetWins        1.000000      −0.962037
  HomeGamesBehind    −0.962037     1.000000
→ EViews

13 Model: Attendance_t = β_Const + β_Price PriceTicket_t + β_HomeSalary HomeSalary_t + β_HomeNW HomeNetWins_t + β_HomeGB HomeGamesBehind_t + e_t
Team Quality Theory: β_HomeNW > 0      H0: β_HomeNW = 0    H1: β_HomeNW > 0
Division Race Theory: β_HomeGB < 0     H0: β_HomeGB = 0    H1: β_HomeGB < 0
  Dependent Variable: Attendance
  Explanatory Variable(s)   Estimate    SE          t-Statistic   Prob
  PriceTicket               −437.1603   190.4236    −2.295725     0.0220
  HomeSalary                667.5796    57.89922    11.53003      0.0000
  HomeNetWins               60.53364    85.21918    0.710329      0.4778
  HomeGamesBehind           −84.38767   167.1067    −0.504993     0.6138
  Const                     11868.58    2220.425    5.345184      0.0000
  Number of Observations    585
Focus on HomeNetWins: The coefficient estimate is positive; it has the expected sign.
Prob[Results IF H0 True]: What is the probability that the estimate from one regression would equal 60.5 or more if H0 were true (that is, if β_HomeNW actually equaled 0, if HomeNetWins actually has no effect on Attendance)?
Prob[Results IF H0 True] ≈ .24. At the traditional significance levels we cannot reject the null hypothesis that HomeNetWins has no effect on Attendance.
Similarly, for HomeGamesBehind, the coefficient estimate is negative, the expected sign.
Prob[Results IF H0 True] ≈ .31. We cannot reject the null hypothesis that HomeGamesBehind has no effect on Attendance.
→ EViews

14 Next, perform a Wald test:
H0: β_HomeNW = 0 and β_HomeGB = 0          Both HomeNetWins and HomeGamesBehind HAVE NO effect on Attendance
H1: β_HomeNW ≠ 0 and/or β_HomeGB ≠ 0       HomeNetWins and/or HomeGamesBehind HAVE an effect on Attendance
  Wald Test
                              Degrees of Freedom
                 Value        Numerator   Denominator   Prob
  F-statistic    5.046779     2           580           0.0067
Prob[Results IF H0 True]: What is the probability that the F-statistic would be 5.047 or more if H0 were true (that is, if neither HomeNetWins nor HomeGamesBehind has an effect on Attendance)?
Prob[Results IF H0 True] = .0067. It is unlikely that the null hypothesis is true; it is unlikely that both HomeNetWins and HomeGamesBehind HAVE NO effect on Attendance.
Can we reject H0? Yes, at the 1 percent significance level.
Paradox: When both HomeNetWins and HomeGamesBehind are included
  Student-t tests: Cannot reject the null hypothesis that HomeNetWins has no effect. Cannot reject the null hypothesis that HomeGamesBehind has no effect. → Individually, neither HomeNetWins nor HomeGamesBehind appears to affect Attendance.
  Wald test: Can reject the null hypothesis that both HomeNetWins and HomeGamesBehind have no effect. → HomeNetWins and/or HomeGamesBehind do appear to affect Attendance.
→ EViews
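For linear restrictions of this kind, the F-statistic can be computed by comparing the sum of squared residuals of the regression with both variables against the regression without them, under the usual homoskedastic OLS assumptions. The sketch below does this on simulated data; the helper names `ssr` and `joint_f_statistic`, the sample size of 400, the correlation of .99, the coefficients, and the error standard deviation are hypothetical choices for illustration and are unrelated to the attendance data. The degrees of freedom follow the same logic as on the slide, where 580 = 585 observations less 5 estimated coefficients.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    return float(resid @ resid)

def joint_f_statistic(X_unrestricted, X_restricted, y):
    """F-statistic for H0: every coefficient dropped in the restricted model is 0."""
    n_obs, k = X_unrestricted.shape
    q = k - X_restricted.shape[1]      # number of restrictions (numerator df)
    ssr_u = ssr(X_unrestricted, y)
    ssr_r = ssr(X_restricted, y)
    f_stat = ((ssr_r - ssr_u) / q) / (ssr_u / (n_obs - k))
    return f_stat, q, n_obs - k

# Simulated example with two highly correlated explanatory variables (assumed values).
rng = np.random.default_rng(1)
n_obs = 400
x1 = rng.normal(size=n_obs)
x2 = 0.99 * x1 + np.sqrt(1.0 - 0.99**2) * rng.normal(size=n_obs)
y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(scale=5.0, size=n_obs)

const = np.ones(n_obs)
X_both = np.column_stack([const, x1, x2])   # unrestricted: constant, x1, x2
X_none = const[:, None]                     # restricted: constant only

f_stat, df_num, df_den = joint_f_statistic(X_both, X_none, y)
print(f"F-statistic = {f_stat:.3f} with ({df_num}, {df_den}) degrees of freedom")
```

Turning the F-statistic into Prob[Results IF H0 True] requires the F distribution, which is left out here to keep the sketch dependent on NumPy only.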

15 “Earmarks” of Multicollinearity: Highly Correlated Explanatory Variables
Explanatory variables are highly correlated.
  Correlation of Explanatory Variables
                     HomeNetWins   HomeGamesBehind
  HomeNetWins        1.000000      −0.962037
  HomeGamesBehind    −0.962037     1.000000
Regression with both explanatory variables:
  Student-t tests do not allow us to reject the null hypothesis that the coefficient of each individual variable equals 0; when considering each explanatory variable individually, we cannot reject the hypothesis that each individually has no influence.
  A Wald test allows us to reject the null hypothesis that the coefficients of both explanatory variables equal 0; when considering both explanatory variables together, we can reject the hypothesis that they have no influence.
Regression Including Only One of the Highly Correlated Explanatory Variables: When only one of the highly correlated explanatory variables is included we get “good” results.
  Include only HomeNetWins
  Dependent Variable: Attendance
  Explanatory Variable(s)   Estimate    SE          t-Statistic   Prob
  PriceTicket               −449.2097   188.8016    −2.379268     0.0177
  HomeSalary                672.2967    57.10413    11.77317      0.0000
  HomeNetWins               100.4166    31.99348    3.138658      0.0018
  Const                     11107.66    1629.863    6.815087      0.0000
  Include only HomeGamesBehind
  Dependent Variable: Attendance
  Explanatory Variable(s)   Estimate    SE          t-Statistic   Prob
  PriceTicket               −433.4971   190.2726    −2.278295     0.0231
  HomeSalary                670.8518    57.69106    11.62835      0.0000
  HomeGamesBehind           −194.3941   62.74967    −3.097931     0.0020
  Const                     12702.16    1884.178    6.741486      0.0000
→ EViews

16 Econometrics Lab (Lab 14.3)
Model: y_t = β_Const + β_x1 x1_t + β_x2 x2_t + e_t
Question: What if both explanatory variables were included in the regression?
  Actual Coef1   Correlation Coefficient   Mean of Coef1 Estimates   Var of Coef1 Estimates
  2              .00                       ≈2.0                      ≈6.5
  2              .30                       ≈2.0                      ≈7.2
  2              .60                       ≈2.0                      ≈10.1
  2              .90                       ≈2.0                      ≈34.2
Question: Does multicollinearity cause bias?
Answer: No.
Question: What problem does multicollinearity produce?
Answer: Variance of the coefficient estimates is high. This means that on any one repetition of the experiment we cannot be confident that the coefficient estimate is close to the actual value. The estimates are not reliable.
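The pattern in the lab table, an unchanged mean with a variance that grows as the correlation rises, can be reproduced qualitatively in a short simulation. This is a sketch under assumed settings: the function name `coef1_mean_and_variance`, the sample size of 30, the 2000 repetitions, the actual second coefficient of 5, the unit variances of x1 and x2, and the error standard deviation of 3 are illustrative choices, so the variances will not match the lab's numbers.

```python
import numpy as np

def coef1_mean_and_variance(corr, n_obs=30, n_reps=2000, seed=0):
    """Mean and variance of the OLS estimate of beta_x1 when x1 and x2 are both included."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, corr], [corr, 1.0]])   # assumed unit variances
    estimates = np.empty(n_reps)
    for rep in range(n_reps):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n_obs)
        e = rng.normal(0.0, 3.0, size=n_obs)     # assumed error term
        y = 1.0 + 2.0 * x[:, 0] + 5.0 * x[:, 1] + e
        X = np.column_stack([np.ones(n_obs), x]) # constant, x1, x2
        coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates[rep] = coefs[1]                # coefficient on x1
    return estimates.mean(), estimates.var()

results = {corr: coef1_mean_and_variance(corr) for corr in (0.00, 0.30, 0.60, 0.90)}
for corr, (mean, var) in results.items():
    print(f"corr={corr:.2f}  mean of Coef1 estimates={mean:.2f}  variance={var:.3f}")
```

The mean should stay close to the actual value of 2 at every correlation, while the variance should rise sharply as the correlation approaches 1.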

17 Summary: Understanding Multicollinearity
Multiple Regression Analysis: Attempts to separate out, to sort out, to isolate, the influence of each individual explanatory variable.
  Explanatory variables perfectly correlated → Both variables contain the same information → Impossible to separate out their individual effects
  Explanatory variables highly correlated → Difficult to separate out their individual effects → Large variances
Multicollinearity Phenomenon: If the explanatory variables are highly correlated, it is difficult to separate out the individual effects even though each of the explanatory variables actually affects the dependent variable. This difficulty evidences itself by large variances of the estimates’ probability distributions.

18 Irrelevant Explanatory Variables: “Too Many” Explanatory Variables (Lab 14.4)
Model: y_t = β_Const + β_x1 x1_t + β_x2 x2_t + e_t
Relevant Explanatory Variable: x1. β_x1 = 2
Irrelevant Explanatory Variable: x2. β_x2 = 0
                                          Only x1 Included              Both x1 and x2 Included
  Actual    Correlation     Mean of Coef1    Var of Coef1      Mean of Coef1    Var of Coef1
  Coef1     Coefficient     Estimates        Estimates         Estimates        Estimates
  2.0       .00             ≈2.0             ≈6.4              ≈2.0             ≈6.5
  2.0       .30             ≈2.0             ≈6.4              ≈2.0             ≈7.2
  2.0       .60             ≈2.0             ≈6.4              ≈2.0             ≈10.1
  2.0       .90             ≈2.0             ≈6.4              ≈2.0             ≈34.2
Consider the coefficient estimate of the relevant explanatory variable. When an irrelevant explanatory variable is included:
  Is the ordinary least squares (OLS) estimation procedure still unbiased? Yes.
  What happens to the variance of the probability distribution? The variance increases.
  What happens to the variance of the coefficient estimate’s probability distribution even if the irrelevant variable is uncorrelated with a relevant variable? Why?
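The variance columns can be checked against a standard textbook formula, which is not stated on the slide. The sketch assumes that the values of the explanatory variables are treated as fixed, that \sigma^2 denotes the variance of the error term, and that r_{12} denotes the sample correlation between x1 and x2; these symbols are introduced here for illustration. Because x2 is irrelevant (\beta_{x2} = 0), omitting it leaves the error variance unchanged, so only the factor 1 - r_{12}^2 differs between the two cases.

```latex
\[
\operatorname{Var}[b_{x1} \mid \text{only } x1]
  = \frac{\sigma^2}{\sum_t (x_{1,t} - \bar{x}_1)^2},
\qquad
\operatorname{Var}[b_{x1} \mid \text{both } x1 \text{ and } x2]
  = \frac{\sigma^2}{\left(1 - r_{12}^2\right)\sum_t (x_{1,t} - \bar{x}_1)^2}
\]
\[
\frac{1}{1 - r_{12}^2}:\qquad
r_{12} = .30 \;\Rightarrow\; 1.10,\qquad
r_{12} = .60 \;\Rightarrow\; 1.56,\qquad
r_{12} = .90 \;\Rightarrow\; 5.26
\]
```

The ratios in the table are 7.2/6.4 ≈ 1.13, 10.1/6.4 ≈ 1.58, and 34.2/6.4 ≈ 5.34, which are close to these factors. An exact match should not be expected, since the lab's correlation setting and the sample correlation in any one repetition need not coincide.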

