Lecture 15 Preview: Other Regression Statistics and Pitfalls

Two-Tailed Confidence Intervals
  Confidence Interval Approach: Which Theories Are Consistent with the Data?
  A Confidence Interval Example: Television Growth Rates
  Calculating Confidence Intervals with Statistical Software
Coefficient of Determination (Goodness of Fit), R-Squared (R2)
Pitfalls
  Explanatory Variable Has the Same Value for All Observations
  One Explanatory Variable Is a Linear Combination of Other Explanatory Variables
  Dependent Variable Is a Linear Combination of Explanatory Variables
  Outlier Observations
  Dummy Variable Trap
Two-Tailed Confidence Intervals

Two Approaches to Theory and Data Analysis
Our approach thus far has gone from the theory to the data:
  First, we develop the theory.
  Second, we analyze the data to determine whether the data are consistent with the theory.
The confidence interval approach reverses this process by going from the data to theories:
  First, we analyze the data.
  Second, we determine which theories are consistent with the data.

Two-Tailed Confidence Intervals and Significance Levels
The “size” of the confidence interval and its significance level sum to 100 percent.
Our example: 95 percent confidence interval, so the significance level is 5 percent.

Confidence Interval Approach: The Conceptual Steps
Step 1: Use the ordinary least squares estimation procedure to estimate the model’s parameters.
Step 2: Consider a specific theory. Is the theory consistent with the data? Does the theory lie within the confidence interval?
  Step 2a: Based on the theory, construct the null and alternative hypotheses. The null hypothesis reflects the theory.
  Step 2b: Compute Prob[Results IF H0 True].
  Step 2c: Do we reject the null hypothesis?
    Yes: Reject the theory. The data are not consistent with the theory. The theory does not lie within the two-tailed confidence interval.
    No: The data are consistent with the theory. The theory does lie within the two-tailed confidence interval.
Television Use Growth Rate Confidence Intervals

Step 1: Analyze the data. Use the ordinary least squares (OLS) estimation procedure to estimate the model’s parameters.

Model:
  Dependent variable: LogUsersTV
  Explanatory variables: Year, CapitalHuman, CapitalPhysical, GdpPC, and Auth

Dependent Variable: LogUsersTV
Explanatory Variable(s)    Estimate     SE          t-Statistic    Prob
Year                        0.022989    0.015903     1.445595      0.1487
CapitalHuman                0.036302    0.001915    18.95567       0.0000
CapitalPhysical             0.001931    0.000510     3.789394      0.0002
GdpPC                       0.058877    0.012338     4.772051      0.0000
Auth                        0.063345    0.012825     4.939278      0.0000
Const                     -44.95755    31.77155     -1.415025      0.1575
Number of Observations      742

Step 2: 0.0 Percent Growth Rate Theory. Is the 0.0 percent television growth rate theory consistent with the data? Does 0.0 lie within the confidence interval?

Step 2a: Based on the theory, construct the null and alternative hypotheses.
0.0 Percent Growth Rate Theory: After accounting for all other explanatory variables, time has no effect on television use; that is, after accounting for all other explanatory variables, the annual growth rate of television use equals 0.0 percent.
  H0: βYear = .000
  H1: βYear ≠ .000
Step 2b: Compute Prob[Results IF H0 True].
  OLS estimation procedure unbiased and H0 true: Mean[bYear] = βYear = .000
  Standard error: SE[bYear] = .0159
  Number of observations and number of parameters: DF = 742 - 6 = 736
  Regression results for Year: Estimate 0.022989, SE 0.015903, t-Statistic 1.445595, Prob 0.1487
  On the t-distribution with mean .000, the estimate .023 lies .023 above the mean. The probability of an estimate at least that far from the mean is .1487/2 in each tail.
  Prob[Results IF H0 True] = .1487

Question: Can we use the Prob column?
Answer: Yes. The Prob column reports the two-tailed probability for the null hypothesis that the coefficient equals 0, which is exactly the null hypothesis here.

Step 2c: Do we reject the null hypothesis?
  95 percent confidence interval, 5 percent significance level.
  Prob[Results IF H0 True] = .1487 > .05: Do not reject H0.
  The data are consistent with the theory: .000 does lie within the 95 percent confidence interval.
Television Use Confidence Interval Approach Continued: Apply Step 2 Again to a Different Theory

Regression results for Year (dependent variable LogUsersTV, 742 observations): Estimate 0.022989, SE 0.015903, t-Statistic 1.445595, Prob 0.1487

Step 2: -1.0 Percent Growth Rate Theory. Is the -1.0 percent television growth rate theory consistent with the data? Does -1.0 lie within the confidence interval?

Step 2a: Based on the theory, construct the null and alternative hypotheses.
-1.0 Percent Growth Rate Theory: After accounting for all other explanatory variables, the annual growth rate of television use is -1.0 percent; that is, βYear equals -.010.
  H0: βYear = -.010
  H1: βYear ≠ -.010
Step 2b: Compute Prob[Results IF H0 True].
  OLS estimation procedure unbiased and H0 true: Mean[bYear] = βYear = -.010
  Standard error: SE[bYear] = .0159
  Number of observations and number of parameters: DF = 742 - 6 = 736
  On the t-distribution with mean -.010, the estimate .023 lies .033 above the mean.
  Right tail probability = .019
  Left tail probability = .019
  Prob[Results IF H0 True] = .019 + .019 ≈ .038

Question: Can we use the Prob column?
Answer: No. The Prob column (0.1487 for Year) is based on the null hypothesis that the coefficient equals 0. Here the null hypothesis is that the coefficient equals -.010, so the tail probabilities must be computed from the t-distribution directly.

Step 2c: Do we reject the null hypothesis?
  Prob[Results IF H0 True] ≈ .038 < .05: Reject H0.
  The data are not consistent with the theory: -.010 does not lie within the 95 percent confidence interval.
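The Step 2b calculation can be checked numerically for any hypothesized coefficient value. The sketch below is only an illustration in Python, not the EViews or lab procedure: the function names are hypothetical, and the tail areas come from a Simpson's rule integration of the t-density rather than from a statistics library. It uses the Year estimate and standard error from the regression results, a null value of .000 for the first theory, and a null value of -.010 for the second (the value that puts the estimate .033 from the mean with .019 in each tail).

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_prob(t_stat, df, steps=2000):
    """Probability of a t-statistic at least as far from 0 as t_stat."""
    b = abs(t_stat)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    center_area = total * h / 3        # area between 0 and |t| (Simpson's rule)
    return 1.0 - 2.0 * center_area     # both tails, by symmetry

def prob_results_if_h0_true(estimate, se, null_value, df):
    """Two-tailed Prob[Results IF H0 True] for H0: coefficient = null_value."""
    t_stat = (estimate - null_value) / se
    return two_tailed_prob(t_stat, df)

DF = 742 - 6  # observations minus estimated parameters
prob_zero = prob_results_if_h0_true(0.022989, 0.015903, 0.000, DF)
prob_minus_one = prob_results_if_h0_true(0.022989, 0.015903, -0.010, DF)

print(round(prob_zero, 4), prob_zero > 0.05)            # do not reject if True
print(round(prob_minus_one, 3), prob_minus_one > 0.05)  # reject if False
```

Only the null value changes between the two calls, which is why the Prob column can be read off directly for the first theory but not for the second.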
Observations and Two Questions
95 percent confidence interval: Significance Level = 5% = .05

The 0 percent growth rate theory lies within the 95 percent confidence interval, but the -1 percent theory does not.
Question 1: Based on a 5 percent significance level, .05, what is the lowest growth rate theory that is consistent with the data? That is, what is the lower bound of the two-tailed 95 percent confidence interval?

The 4 percent growth rate theory lies within the 95 percent confidence interval, but the 6 percent theory does not.
Question 2: Based on a 5 percent significance level, .05, what is the highest growth rate theory that is consistent with the data? That is, what is the upper bound of the two-tailed 95 percent confidence interval?
Calculating Confidence Intervals with Statistical Software

Getting started in EViews:
  Run the appropriate regression.
  In the Equation window: Click View, Coefficient Diagnostics, and Confidence Intervals.
  In the Confidence Intervals window: Enter the confidence levels you wish to compute. (By default the values .90, .95, and .99 are entered.)
  Click OK.

95 Percent Interval Estimates
Dependent Variable: LogUsersTV
Explanatory Variable(s)    Estimate      Lower        Upper
Year                        0.022989     -0.008231     0.054209
CapitalHuman                0.036302      0.032542     0.040061
CapitalPhysical             0.001931      0.000931     0.002932
GdpPC                       0.058877      0.034656     0.083099
Auth                        0.063345      0.038167     0.088522
Const                     -44.95755    -107.3312      17.41612
Number of Observations      742

For Year: lower bound = -.0082, upper bound = .0542.
At the 95 percent confidence level, the data are consistent with all the growth rate theories that lie between -.82 and 5.42 percent.
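The bounds reported by the software can be reproduced from the coefficient estimate and its standard error: the interval is the estimate plus or minus a t critical value times the standard error. The following is a minimal sketch in Python under that assumption, not a description of how EViews computes the interval internally. The function names are hypothetical, and the critical value is found by a bisection search on a numerically integrated t-distribution.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_prob(t_stat, df, steps=2000):
    """Probability of a t-statistic at least as far from 0 as t_stat."""
    b = abs(t_stat)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * (total * h / 3)

def t_critical(conf_level, df):
    """Bisection search for c such that the two tails beyond +/-c hold 1 - conf_level."""
    alpha = 1.0 - conf_level
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tailed_prob(mid, df) > alpha:
            lo = mid   # tails still too large: move outward
        else:
            hi = mid
    return (lo + hi) / 2

def confidence_interval(estimate, se, df, conf_level=0.95):
    """Two-tailed confidence interval: estimate +/- critical value * SE."""
    c = t_critical(conf_level, df)
    return estimate - c * se, estimate + c * se

# Year coefficient from the LogUsersTV regression, DF = 742 - 6
lower, upper = confidence_interval(0.022989, 0.015903, 742 - 6)
print(round(lower, 4), round(upper, 4))
```

Every hypothesized growth rate between the two bounds yields Prob[Results IF H0 True] above .05, and every value outside them yields a probability below .05.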
Coefficient of Determination (Goodness of Fit), R-Squared (R2)

Theory: Additional studying increases quiz scores.
Theory: βx > 0
Model: yt = βConst + βx xt + et
  xt = Minutes studied
  yt = Quiz score
Hypotheses: H0: βx = 0    H1: βx > 0

R-squared represents the portion of y’s squared deviations from its mean that is explained:
  R2 = Explained Squared Deviations from the Mean / Actual Squared Deviations from the Mean

Mean of y = (66 + 87 + 90)/3 = 243/3 = 81

                     Actual y         Actual        Esty                 Explained y      Explained
                     Deviation        Squared       Equals               Deviation        Squared
Student   xt   yt    from Mean        Deviation     63 + 1.2xt           from Mean        Deviation
1          5   66    66 - 81 = -15    225           63 + 1.2×5 = 69      69 - 81 = -12    144
2         15   87    87 - 81 = 6       36           63 + 1.2×15 = 81     81 - 81 = 0        0
3         25   90    90 - 81 = 9       81           63 + 1.2×25 = 93     93 - 81 = 12     144
Sum                                   342                                                 288

R2 = 288/342 = .84
84% of y’s squared deviations are explained.

Dependent Variable: y
Explanatory Variable(s)    Estimate     SE          t-Statistic    Prob
x                           1.200000    0.519615    2.309401       0.2601
Const                      63.00000     8.874120    7.099296       0.0891
Number of Observations      3
R-squared                   .842105

Confidence in Theory: Prob[Results IF H0 True] = .2601/2 ≈ .13

Claim: R-squared does not help us assess the theory.
Now suppose a second quiz produces exactly the same results as the first, so there are six observations.

                          Actual y         Actual        Esty                 Explained y      Explained
Quiz/                     Deviation        Squared       Equals               Deviation        Squared
Student   xt   yt         from Mean        Deviation     63 + 1.2xt           from Mean        Deviation
1/1        5   66         66 - 81 = -15    225           63 + 1.2×5 = 69      69 - 81 = -12    144
1/2       15   87         87 - 81 = 6       36           63 + 1.2×15 = 81     81 - 81 = 0        0
1/3       25   90         90 - 81 = 9       81           63 + 1.2×25 = 93     93 - 81 = 12     144
2/1        5   66         66 - 81 = -15    225           63 + 1.2×5 = 69      69 - 81 = -12    144
2/2       15   87         87 - 81 = 6       36           63 + 1.2×15 = 81     81 - 81 = 0        0
2/3       25   90         90 - 81 = 9       81           63 + 1.2×25 = 93     93 - 81 = 12     144
Sum                                        684                                                 576

Intuition: Should this increase or decrease your confidence in the theory? Increase.

What about R2?
Mean of y = (66 + 87 + 90 + 66 + 87 + 90)/6 = 81
R2 = Explained Squared Deviations from the Mean / Actual Squared Deviations from the Mean = 576/684 = 288/342 = .84
84% of y’s squared deviations are explained, exactly as before.

Dependent Variable: y
Explanatory Variable(s)    Estimate     SE          t-Statistic    Prob
x                           1.200000    0.259808     4.618802      0.0099
Const                      63.00000     4.437060    14.19859       0.0001
Number of Observations      6
R-squared                   .842105

Confidence in Theory: Prob[Results IF H0 True] = .0099/2 ≈ .005

Does R2 help us assess theories? No. Our confidence in the theory rose sharply (Prob fell from about .13 to about .005), yet R2 did not change.
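The quiz example is small enough to recompute by hand or in a few lines of code. The sketch below, in Python with a hypothetical helper named simple_ols, applies the standard one-variable OLS formulas to the three-observation data and to the same data entered twice. It is meant only as a check on the arithmetic in the tables above.

```python
import math

def simple_ols(x, y):
    """One-variable OLS: returns (const, slope, r_squared, slope_se)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    const = mean_y - slope * mean_x
    fitted = [const + slope * xi for xi in x]
    explained = sum((f - mean_y) ** 2 for f in fitted)    # explained squared deviations
    actual = sum((yi - mean_y) ** 2 for yi in y)          # actual squared deviations
    ssr = sum((yi - f) ** 2 for yi, f in zip(y, fitted))  # sum of squared residuals
    slope_se = math.sqrt(ssr / (n - 2) / sxx)
    return const, slope, explained / actual, slope_se

minutes = [5, 15, 25]
scores = [66, 87, 90]

const3, slope3, r2_3, se3 = simple_ols(minutes, scores)
const6, slope6, r2_6, se6 = simple_ols(minutes * 2, scores * 2)  # same data twice

print(slope3, round(r2_3, 6), round(se3, 6))
print(slope6, round(r2_6, 6), round(se6, 6))
```

The slope and R-squared are identical in the two runs, while the standard error of the slope is cut in half, which is what drives the t-statistic up and Prob[Results IF H0 True] down.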
Pitfalls

Multiple Regression Analysis: Attempts to separate out, to sort out, to isolate the individual influence that each explanatory variable has on the dependent variable.
The coefficient estimate of an explanatory variable allows us to estimate by how much the dependent variable changes when that explanatory variable changes while all other explanatory variables remain constant.

Dependent variable: Attendance
Explanatory variables: PriceTicket and HomeSalary

Dependent Variable: Attendance
Explanatory Variable(s)    Estimate      SE          t-Statistic    Prob
PriceTicket               -590.7836     184.7231     -3.198211      0.0015
HomeSalary                 783.0394      45.23955    17.30874       0.0000
Const                     9246.429     1529.658       6.044767      0.0000
Number of Observations     585

Estimated Equation: EstAttendance = 9,246 - 591PriceTicket + 783HomeSalary
Pitfall: Explanatory Variable Has the Same Value for All Observations

Consider the variable DH:
  DH  Dummy variable; 1 if designated hitter permitted; 0 otherwise
Our workfile includes only American League games in 1996. Since interleague play did not begin until 1997 and all American League games allow designated hitters, the variable DH will equal 1 for all observations:
  DH = 1 for all observations

Dependent variable: Attendance
Explanatory variables: PriceTicket, HomeSalary, and DH

The following error message appears: “Near singular matrix.”
This is the software’s way of saying that it cannot perform the calculations that we requested. That is, we are asking the software to do the impossible.

Intuition: At the most basic level, to determine how an explanatory variable affects the dependent variable, the explanatory variable’s values must vary.
Example: Reaction Time Depends on Caffeine
  Caffeine up: Reaction time faster
  Caffeine down: Reaction time slower
  Together, these observations suggest a relationship.
If there is no variation in caffeine consumption, then we cannot observe the effect that caffeine has on reaction time. More generally, if there is no variation in the explanatory variable, then we cannot observe the effect that the explanatory variable has on the dependent variable.
In this case, there is no variation in DH. Consequently, we cannot observe the effect that DH has on Attendance. We are asking the software to do the impossible.
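The intuition can be seen in the one-variable OLS slope formula, which divides by the explanatory variable's sum of squared deviations from its mean. When the variable never varies, that sum is zero and the division cannot be carried out. The sketch below is a simplified one-variable analogue in Python, not the multiple regression EViews attempts. The function name and the four attendance and price figures are made up for illustration.

```python
def slope_estimate(x, y):
    """OLS slope for a one-variable regression of y on x (with a constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_sq_dev_x = sum((xi - mean_x) ** 2 for xi in x)
    if sum_sq_dev_x == 0:
        # No variation in x: the slope formula would divide by zero.
        raise ValueError("explanatory variable has the same value for all observations")
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / sum_sq_dev_x

# Hypothetical data for four games (attendance in thousands, price in dollars)
attendance = [31, 28, 35, 30]
price = [12, 14, 10, 13]   # varies across games
dh = [1, 1, 1, 1]          # designated hitter permitted in every game

price_slope = slope_estimate(price, attendance)
print("Price slope:", price_slope)

try:
    slope_estimate(dh, attendance)
    dh_failed = False
except ValueError as err:
    dh_failed = True
    print("Cannot estimate DH effect:", err)
```

In the multiple regression setting the same problem appears as a singular matrix, because a variable that always equals 1 is indistinguishable from the constant.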
Pitfall: One Explanatory Variable Is a Linear Combination of Other Explanatory Variables

Review: Include both the ticket price in terms of dollars and the ticket price in terms of cents as explanatory variables:
  PCents = 100 × PriceTicket
NB: The ticket price in terms of cents is a linear combination of the ticket price in terms of dollars.

Dependent variable: Attendance
Explanatory variables: PriceTicket, PCents, and HomeSalary

The following error message appears: “Near singular matrix.”
This is the software’s way of saying that it cannot perform the calculations that we requested. That is, we are asking the software to do the impossible.

Review: Multiple regression analysis attempts to separate out, to sort out, to isolate the individual influence that each explanatory variable has on the dependent variable.
Intuition: The information contained in PCents is redundant. PCents and PriceTicket contain the same information. The software cannot separate out the individual influence of the two explanatory variables, PriceTicket and PCents, because they contain redundant information.
In fact, any linear combination of explanatory variables produces this problem. To illustrate this, consider the following regression:

Dependent variable: Attendance
Explanatory variables: PriceTicket, HomeSalary, and VisitSalary

Dependent Variable: Attendance
Explanatory Variable(s)    Estimate      SE          t-Statistic    Prob
PriceTicket               -586.5197     179.5938     -3.265813      0.0012
HomeSalary                 791.1983      44.00477    17.97983       0.0000
VisitSalary                163.4448      27.73455     5.893181      0.0000
Const                     3528.987     1775.648       1.987437      0.0473
Number of Observations     585

Now, generate a new variable, TotalSalary:
  TotalSalary = HomeSalary + VisitSalary
NB: TotalSalary is a linear combination of HomeSalary and VisitSalary.

Dependent variable: Attendance
Explanatory variables: PriceTicket, HomeSalary, VisitSalary, and TotalSalary

The following error message appears: “Near singular matrix.”
This is the software’s way of saying that it cannot perform the calculations that we requested. That is, we are asking the software to do the impossible.

Intuition: The information contained in TotalSalary is redundant. The information contained in TotalSalary is already included in HomeSalary and VisitSalary. The software cannot separate out the individual influence of the three “Salary” explanatory variables because they contain redundant information.
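The “Near singular matrix” message refers to the matrix of cross products of the explanatory variables (X'X in matrix notation), which OLS must invert. When one column is a linear combination of others, the determinant of that matrix is zero and no inverse exists. The sketch below illustrates this in Python with exact rational arithmetic. The five salary observations and the helper names are hypothetical, and this is not how EViews performs the check.

```python
from fractions import Fraction

def cross_product_matrix(columns):
    """X'X: sums of products for every pair of explanatory-variable columns."""
    return [[sum(Fraction(a) * Fraction(b) for a, b in zip(ci, cj))
             for cj in columns] for ci in columns]

def determinant(matrix):
    """Determinant by Gaussian elimination with exact fractions."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # no usable pivot: the matrix is singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det                  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

# Hypothetical salaries (millions of dollars) for five games
const = [1, 1, 1, 1, 1]
home_salary = [20, 19, 25, 30, 22]
visit_salary = [18, 27, 21, 24, 29]
total_salary = [h + v for h, v in zip(home_salary, visit_salary)]

det_without_total = determinant(cross_product_matrix([const, home_salary, visit_salary]))
det_with_total = determinant(
    cross_product_matrix([const, home_salary, visit_salary, total_salary]))

print(det_without_total)  # nonzero: the coefficients can be estimated
print(det_with_total)     # zero: TotalSalary adds no new information
```

With floating-point data the determinant typically comes out as a tiny nonzero number rather than exactly zero, which is consistent with the word “near” in the error message.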
Pitfall: Dependent Variable Is a Linear Combination of Explanatory Variables

Dependent variable: TotalSalary
Explanatory variables: HomeSalary and VisitSalary

Dependent Variable: TotalSalary
Explanatory Variable(s)    Estimate     SE          t-Statistic    Prob
HomeSalary                 1.000000     8.58E-17    1.17E+16       0.0000
VisitSalary                1.000000     8.61E-17    1.16E+16       0.0000
Const                      0.000000     4.24E-15                   1.0000
Number of Observations     585

The estimates of the coefficients and constant reveal the definition of TotalSalary (the estimate for the constant is effectively 0):
  TotalSalary = HomeSalary + VisitSalary
Furthermore, the standard errors are very small, approximately 0. In fact, they are precisely equal to 0, but they are not reported as 0’s as a consequence of how digital computers process numbers.
The regression printout suggests that we are dealing with an “identity,” something that is true by definition.
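Standard errors on the order of 1E-17 rather than exactly 0 are typical of binary floating-point arithmetic, in which most decimal fractions cannot be stored exactly. The snippet below is a generic Python illustration of that rounding behavior with made-up values, not a reproduction of the EViews computation.

```python
# Hypothetical values chosen because they expose binary rounding
home_salary = 0.1
visit_salary = 0.2
total_salary = 0.3   # equals home + visit on paper

residual = (home_salary + visit_salary) - total_salary
print(residual)          # a tiny number, not exactly 0
print(residual == 0.0)   # False
```

A quantity that is zero by definition can therefore surface in a printout as a very small nonzero number.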
Pitfall: Outlier Observations

What is an outlier? An observation that is uniquely different from the others.

Dependent variable: Attendance
Explanatory variables: PriceTicket and HomeSalary

Observation   Month   Day   Home Team   Visiting Team   Home Team Salary
1             6       1     Milwaukee   Cleveland       20.23200
2             6       1     Oakland     New York        19.40450

Dependent Variable: Attendance
Explanatory Variable(s)    Estimate      SE          t-Statistic    Prob
PriceTicket               -590.7836     184.7231     -3.198211      0.0015
HomeSalary                 783.0394      45.23955    17.30874       0.0000
Const                     9246.429     1529.658       6.044767      0.0000
Number of Observations     585

What if the home team salary were entered incorrectly in the first observation?

Observation   Month   Day   Home Team   Visiting Team   Home Team Salary
1             6       1     Milwaukee   Cleveland       20232.00
2             6       1     Oakland     New York        19.40450

Dependent Variable: Attendance
Explanatory Variable(s)    Estimate      SE          t-Statistic    Prob
PriceTicket               1896.379      142.8479     13.27552       0.0000
HomeSalary                  -0.088467     0.484536   -0.182580      0.8552
Const                     3697.786     1841.286       2.008263      0.0451
Number of Observations     585

Even though we changed a single data entry in one of nearly six hundred observations, the coefficient estimates change dramatically. This illustrates how sensitive OLS estimates are to outliers.
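The same sensitivity shows up in a toy data set. The sketch below, in Python, fits a one-variable regression to six made-up observations, then refits after one salary value is mistyped by a factor of roughly a thousand, mimicking the data-entry error above. The numbers and the helper simple_ols are hypothetical and are not drawn from the baseball workfile.

```python
def simple_ols(x, y):
    """One-variable OLS: returns (const, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: home team salary (millions) and attendance (thousands)
home_salary = [10, 15, 20, 25, 30, 35]
attendance = [18, 22, 25, 31, 34, 40]

clean_const, clean_slope = simple_ols(home_salary, attendance)

# Mistype the first salary: 10 becomes 10000
mistyped_salary = home_salary.copy()
mistyped_salary[0] = 10000
bad_const, bad_slope = simple_ols(mistyped_salary, attendance)

print("Slope with correct data: ", round(clean_slope, 4))
print("Slope with one bad entry:", round(bad_slope, 4))
```

Because OLS minimizes squared residuals, a single observation far from the rest dominates the sums of squared deviations and pulls the fitted line toward itself. Inspecting the data for implausible values before estimating is a cheap safeguard.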
Dummy Variable Trap

Model 1: Salaryt = βConst + βSexF1 SexF1t + βE Experiencet + et
  SexF1t = 1 if female; 0 if male

Dependent Variable: Salary
Explanatory Variable(s)    Estimate       SE          t-Statistic    Prob
SexF1                     -2240.053      3051.835     -0.734002      0.4638
Experience                 2447.104       163.3812    14.97787       0.0000
Const                     42237.61       3594.297     11.75129       0.0000
Number of Observations      200

EstSalary = 42,238 - 2,240SexF1 + 2,447Experience

For men: SexF1 = 0
  EstSalaryMen = 42,238 - 0 + 2,447Experience
  EstSalaryMen = 42,238 + 2,447Experience
For women: SexF1 = 1
  EstSalaryWomen = 42,238 - 2,240 + 2,447Experience
  EstSalaryWomen = 39,998 + 2,447Experience

Figure: Salary (vertical axis) versus Experience (horizontal axis). Two parallel lines with Slope = 2,447: EstSalaryMen with InterceptMen = 42,238 and EstSalaryWomen with InterceptWomen = 39,998.

Question: How many parameter estimates did we use to estimate the values of the 2 intercepts?
Answer: 2, bConst (42,238) and bSexF1 (-2,240).

Dummy Variable Trap: A model in which there are more parameters representing the intercepts than there are intercepts.
Model 2: Salaryt = βConst + βSexM1 SexM1t + βE Experiencet + et
  SexM1t = 1 if male; 0 if female

EstSalary = bConst + bSexM1 SexM1 + bE Experience

Question: Can we determine the values of bConst and bSexM1 in this model from the intercepts we calculated from Model 1?
  InterceptMen = 42,238    InterceptWomen = 39,998

For men: SexM1 = 1
  EstSalaryMen = bConst + bSexM1 + bE Experience
  InterceptMen = bConst + bSexM1
  42,238 = bConst + bSexM1
For women: SexM1 = 0
  EstSalaryWomen = bConst + bE Experience
  InterceptWomen = bConst
  39,998 = bConst

Unknowns = 2, Equations = 2. Can we solve for the two unknowns? Yes.
  bConst = 39,998
  bSexM1 = 42,238 - bConst = 42,238 - 39,998 = 2,240

Dependent Variable: Salary
Explanatory Variable(s)    Estimate       SE          t-Statistic    Prob
SexM1                      2240.053      3051.835      0.734002      0.4638
Experience                 2447.104       163.3812    14.97787       0.0000
Const                     39997.56       2575.318     15.53112       0.0000
Number of Observations      200
Model 3: Salaryt = βSexM1 SexM1t + βSexF1 SexF1t + βE Experiencet + et

EstSalary = bSexM1 SexM1 + bSexF1 SexF1 + bE Experience

Question: Can we determine the values of bSexM1 and bSexF1 in this model from the intercepts we calculated from Model 1?
  InterceptMen = 42,238    InterceptWomen = 39,998

For men: SexM1 = 1 and SexF1 = 0
  EstSalaryMen = bSexM1 + bE Experience
  InterceptMen = bSexM1
  42,238 = bSexM1
For women: SexM1 = 0 and SexF1 = 1
  EstSalaryWomen = bSexF1 + bE Experience
  InterceptWomen = bSexF1
  39,998 = bSexF1

Unknowns = 2, Equations = 2. Can we solve for the unknowns? Yes.
  bSexM1 = 42,238
  bSexF1 = 39,998

Dependent Variable: Salary
Explanatory Variable(s)    Estimate       SE          t-Statistic    Prob
SexF1                     39997.56       2575.318     15.53112       0.0000
SexM1                     42237.61       3594.297     11.75129       0.0000
Experience                 2447.104       163.3812    14.97787       0.0000
Number of Observations      200
Model 4: Salaryt = βConst + βSexM1 SexM1t + βSexF1 SexF1t + βE Experiencet + et

EstSalary = bConst + bSexM1 SexM1 + bSexF1 SexF1 + bE Experience

Question: Can we determine the values of bConst, bSexM1, and bSexF1 in this model from the intercepts we calculated from Model 1?
  InterceptMen = 42,238    InterceptWomen = 39,998

For men: SexM1 = 1 and SexF1 = 0
  EstSalaryMen = bConst + bSexM1 + bE Experience
  InterceptMen = bConst + bSexM1
  42,238 = bConst + bSexM1
For women: SexM1 = 0 and SexF1 = 1
  EstSalaryWomen = bConst + bSexF1 + bE Experience
  InterceptWomen = bConst + bSexF1
  39,998 = bConst + bSexF1

Unknowns = 3 (bConst, bSexM1, and bSexF1), Equations = 2. Can we solve for the three unknowns? No.

When we try to run this regression we are asking the software to do the impossible. That is why we get the “Near singular matrix” error message.

Dummy Variable Trap: A model in which there are more parameters representing the intercepts than there are intercepts.
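With two equations and three unknowns, Model 4 has no unique solution: any value of bConst can be paired with values of bSexM1 and bSexF1 that reproduce both intercepts exactly. The short Python sketch below makes this concrete. The function name and the particular trial values of bConst are arbitrary choices for illustration.

```python
INTERCEPT_MEN = 42238
INTERCEPT_WOMEN = 39998

def model4_intercepts(b_const, b_sex_m1, b_sex_f1):
    """Intercepts implied by Model 4 for men and for women."""
    return b_const + b_sex_m1, b_const + b_sex_f1

candidates = []
for b_const in (0, 10000, 39998, 42238):   # arbitrary trial values
    b_sex_m1 = INTERCEPT_MEN - b_const     # solves 42,238 = bConst + bSexM1
    b_sex_f1 = INTERCEPT_WOMEN - b_const   # solves 39,998 = bConst + bSexF1
    candidates.append((b_const, b_sex_m1, b_sex_f1))

for estimates in candidates:
    print(estimates, "->", model4_intercepts(*estimates))
```

Every row implies the same pair of intercepts, so the data cannot favor one set of estimates over another. Setting bConst to 0 gives Model 3, setting it to 39,998 gives Model 2, and setting it to 42,238 gives Model 1. Dropping any one of the three intercept parameters removes the ambiguity.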