Part 2: Model and Inference 2-1/49 Regression Models Professor William Greene Stern School of Business IOMS Department Department of Economics
Part 2: Model and Inference 2-2/49 Regression and Forecasting Models Part 2 – Inference About the Regression
Part 2: Model and Inference 2-3/49 The Linear Regression Model 1. The linear regression model 2. Sample statistics and population quantities 3. Testing the hypothesis of no relationship
Part 2: Model and Inference 2-4/49 A Linear Regression Predictor: Box Office = b 0 + b 1 Buzz
Part 2: Model and Inference 2-5/49 Data and Relationship We suggested that the relationship between box office and internet buzz is a straight line, Box Office = b 0 + b 1 Buzz. Note the obvious inconsistency in the figure: the observed points do not lie on a line, so the equation alone is not the relationship. How do we reconcile the equation with the data?
Part 2: Model and Inference 2-6/49 Modeling the Underlying Process A model that explains the process that produces the data that we observe: Observed outcome = the sum of two parts (1) Explained: The regression line (2) Unexplained (noise): The remainder Regression model The “model” is the statement that part (1) is the same process from one observation to the next. Part (2) is the randomness that is part of real world observation.
Part 2: Model and Inference 2-7/49 The Population Regression THE model: A specific statement about the parts of the model (1) Explained: Explained Box Office = β 0 + β 1 Buzz (2) Unexplained: The rest is “noise, ε.” Random ε has certain characteristics Model statement Box Office = β 0 + β 1 Buzz + ε
Part 2: Model and Inference 2-8/49 The Data Include the Noise
Part 2: Model and Inference 2-9/49 The Data Include the Noise Regression line: β 0 + β 1 Buzz. For the highlighted point: Box Office = 41, β 0 + β 1 Buzz = 10, ε = 31
Part 2: Model and Inference 2-10/49 Model Assumptions y i = β 0 + β 1 x i + ε i β 0 + β 1 x i is the ‘regression function’ Contains the ‘information’ about y i in x i Unobserved because β 0 and β 1 are not known for certain ε i is the ‘disturbance.’ It is the unobserved random component Observed y i is the sum of the two unobserved parts.
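To make the model statement concrete, the following Python sketch generates artificial data from y i = β 0 + β 1 x i + ε i. The parameter values, the uniform distribution for x, and the function name `simulate` are illustrative assumptions; none of them come from the box office data.

```python
import random

def simulate(n, beta0, beta1, sigma, seed=0):
    """Draw n observations from y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)      # regressor, drawn independently of eps
        eps = rng.gauss(0.0, sigma)    # disturbance: mean zero, std. dev. sigma
        y = beta0 + beta1 * x + eps    # observed outcome = explained part + noise
        data.append((x, y, eps))
    return data

# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, SIGMA = -15.0, 70.0, 13.0
sample = simulate(n=62, beta0=BETA0, beta1=BETA1, sigma=SIGMA)

x0, y0, eps0 = sample[0]
print("explained part:", BETA0 + BETA1 * x0)
print("noise:", eps0)
print("observed y:", y0)
```

In real data only x and y are observed; the simulation keeps ε only so that the split of y into its two unobserved parts can be inspected.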
Part 2: Model and Inference 2-11/49 Regression Model Assumptions About ε i Random Variable (1) The regression is the mean of y i for a particular x i. ε i is the deviation of y i from the regression line. (2) ε i has mean zero. (3) ε i has variance σ 2. ‘Random’ Noise (4) ε i is unrelated to any values of x i (no covariance) – it’s “random noise” (5) ε i is unrelated to any other observations on ε j (not “autocorrelated”) (6) Normal distribution - ε i is the sum of many small influences
Part 2: Model and Inference 2-12/49 Regression Model
Part 2: Model and Inference 2-13/49 Conditional Normal Distribution of
Part 2: Model and Inference 2-14/49 A Violation of Point (4) c = β 0 + β 1 q + ε? Electricity Cost Data
Part 2: Model and Inference 2-15/49 A Violation of Point (5) - Autocorrelation Time Trend of U.S. Gasoline Consumption
Part 2: Model and Inference 2-16/49 No Obvious Violations of Assumptions Auction Prices for Monet Paintings vs. Area
Part 2: Model and Inference 2-17/49 Samples and Populations Population (Theory) y i = β 0 + β 1 x i + ε i Parameters β 0, β 1 Regression β 0 + β 1 x i Mean of y i | x i Disturbance, ε i Expected value = 0 Standard deviation σ No correlation with x i Sample (Observed) y i = b 0 + b 1 x i + e i Estimates, b 0, b 1 Fitted regression b 0 + b 1 x i Predicted y i |x i Residuals, e i Sample mean 0, Sample std. dev. s e Sample Cov[x,e] = 0
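The sample-side properties in the comparison above (residuals with sample mean 0 and sample Cov[x,e] = 0) hold by construction for least squares. A minimal sketch that checks this numerically follows; the five data points and the helper name `fit_line` are made up for illustration.

```python
def fit_line(x, y):
    """Least squares intercept b0 and slope b1 for y = b0 + b1*x + e."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Hypothetical data, not the box office sample.
x = [0.1, 0.3, 0.4, 0.6, 0.9]
y = [2.0, 9.0, 25.0, 31.0, 58.0]

b0, b1 = fit_line(x, y)
e = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]   # residuals

n = len(x)
xbar = sum(x) / n
mean_e = sum(e) / n
cov_xe = sum((xi - xbar) * ei for xi, ei in zip(x, e)) / (n - 1)

print("mean of residuals:", mean_e)    # zero up to rounding error
print("sample Cov[x, e]:", cov_xe)     # zero up to rounding error
```

These two results are arithmetic facts about the residuals e i. They do not show that the unobserved disturbances ε i satisfy the corresponding population assumptions.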
Part 2: Model and Inference 2-18/49 Disturbances vs. Residuals ε = y − β 0 − β 1 Buzz; e = y − b 0 − b 1 Buzz
Part 2: Model and Inference 2-19/49 Standard Deviation of Residuals The standard deviation of ε i = y i − β 0 − β 1 x i is σ, where σ = √E[ε i 2 ] (the mean of ε i is zero). The sample b 0 and b 1 estimate β 0 and β 1, and the residual e i = y i − b 0 − b 1 x i estimates ε i. Use √[(1/N)Σe i 2 ] to estimate σ? Close, but not quite: use s e = √[Σe i 2 /(N−2)]. Why N−2? Because two parameters (β 0, β 1 ) were estimated, for the same reason that N−1 is used to compute a sample variance.
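A small worked example of the N−2 divisor, as a sketch: the four data points are invented so that the arithmetic can be followed by hand (b 0 = 0, b 1 = 1.9, Σe i 2 = 0.70).

```python
import math

# Hypothetical data chosen for easy hand calculation.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
    (xi - xbar) ** 2 for xi in x
)
b0 = ybar - b1 * xbar

residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
sse = sum(e * e for e in residuals)      # sum of squared residuals

naive = math.sqrt(sse / n)               # divides by N: too small on average
s_e = math.sqrt(sse / (n - 2))           # divides by N-2: the usual estimator of sigma

print("b0, b1:", b0, b1)
print("sum of squared residuals:", sse)
print("sqrt(SSE/N):", naive)
print("s_e = sqrt(SSE/(N-2)):", s_e)
```

With only four observations the two divisors give visibly different answers. In a sample of 62 the difference between dividing by 62 and by 60 is small.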
Part 2: Model and Inference 2-21/49 Linear Regression Sample Regression Line
Part 2: Model and Inference 2-22/49 Residuals
Part 2: Model and Inference 2-23/49 Regression Computations
Part 2: Model and Inference 2-26/49 Results to Report
Part 2: Model and Inference 2-27/49 The Reported Results
Part 2: Model and Inference 2-28/49 Estimated equation
Part 2: Model and Inference 2-29/49 Estimated coefficients b 0 and b 1
Part 2: Model and Inference 2-30/49 Sum of squared residuals, Σ i e i 2
Part 2: Model and Inference 2-31/49 S = s e = estimated std. deviation of ε
Part 2: Model and Inference 2-32/49 Interpreting σ (Estimated by s e ) Remember the empirical rule: about 95% of observations lie within the mean ± 2 standard deviations. (We show (b 0 + b 1 x) ± 2s e below.) This point is 2.2 standard deviations from the regression. Only 3.2% of the 62 observations (2 of them) lie outside the bounds. (We will refine this later.)
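The count of observations outside the band (b 0 + b 1 x) ± 2s e can be computed directly. The sketch below does so for simulated data. The data-generating values, the seed, and the helper names are assumptions for illustration, so the count it reports will not match the box office figure.

```python
import math
import random

def fit_line(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
        (xi - xbar) ** 2 for xi in x
    )
    return ybar - b1 * xbar, b1

def outside_band(x, y, width=2.0):
    """Count observations whose residual exceeds width * s_e in absolute value."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(e * e for e in resid) / (n - 2))
    count = sum(1 for e in resid if abs(e) > width * s_e)
    return count, count / n

# Simulated sample of 62 observations (hypothetical parameters).
rng = random.Random(1)
x = [rng.uniform(0.0, 1.0) for _ in range(62)]
y = [-15.0 + 70.0 * xi + rng.gauss(0.0, 13.0) for xi in x]

count, share = outside_band(x, y)
print(f"{count} of {len(x)} observations ({share:.1%}) lie outside the band")
```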
Part 2: Model and Inference 2-33/49 No Relationship: β 1 = 0. Relationship: β 1 ≠ 0. How to Distinguish These Cases Statistically? y i = β 0 + β 1 x i + ε i
Part 2: Model and Inference 2-34/49 Assumptions (Regression) The equation linking “Box Office” and “Buzz” is stable: E[Box Office | Buzz] = β 0 + β 1 Buzz. Another sample of movies, say 2012, would obey the same fundamental relationship.
Part 2: Model and Inference 2-35/49 Sampling Variability Samples 0 and 1 are a random split of the 62 observations. Sample 1: Box Office = Buzz Sample 0: Box Office = Buzz
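The random-split experiment can be imitated on simulated data. In the sketch below, 62 artificial observations are divided at random into two halves and a line is fitted to each half. The parameter values and seed are assumptions, so the estimates are not those of the box office samples.

```python
import random

def fit_line(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
        (xi - xbar) ** 2 for xi in x
    )
    return ybar - b1 * xbar, b1

# One simulated "population draw" of 62 observations (hypothetical parameters).
rng = random.Random(2)
x = [rng.uniform(0.0, 1.0) for _ in range(62)]
y = [-15.0 + 70.0 * xi + rng.gauss(0.0, 13.0) for xi in x]

# Random split into two subsamples of 31 observations each.
idx = list(range(62))
rng.shuffle(idx)
half0, half1 = idx[:31], idx[31:]

fit0 = fit_line([x[i] for i in half0], [y[i] for i in half0])
fit1 = fit_line([x[i] for i in half1], [y[i] for i in half1])

print("Sample 0: b0 = %.2f, b1 = %.2f" % fit0)
print("Sample 1: b0 = %.2f, b1 = %.2f" % fit1)
```

Both halves come from the same underlying relationship, yet the two fitted lines differ. That difference is sampling variability.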
Part 2: Model and Inference 2-36/49 Sampling Distributions
Part 2: Model and Inference 2-37/49 n = N-2 Small sample Large sample
Part 2: Model and Inference 2-38/49 Standard Error of Regression Slope Estimator
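For the simple regression model the standard textbook expression for the standard error of the slope is SE(b 1 ) = s e / √Σ(x i − x̄) 2, with s e computed using the N−2 divisor. The sketch below evaluates it on five invented data points (b 1 = 0.8, Σe i 2 = 3.6, Σ(x i − x̄) 2 = 10).

```python
import math

# Hypothetical data chosen for easy hand calculation.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)                       # variation in x
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))   # sum of squared residuals
s_e = math.sqrt(sse / (n - 2))                                # estimate of sigma
se_b1 = s_e / math.sqrt(sxx)                                  # standard error of the slope

print("b1:", b1)
print("s_e:", s_e)
print("SE(b1):", se_b1)
```

The formula shows the two ways to pin down the slope more precisely: less noise (smaller s e ) or more spread in x (larger Σ(x i − x̄) 2, which also grows with N).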
Part 2: Model and Inference 2-39/49 Internet Buzz Regression Regression Analysis: BoxOffice versus Buzz The regression equation is BoxOffice = Buzz Predictor Coef SE Coef T P Constant Buzz S = R-Sq = 42.4% R-Sq(adj) = 41.4% Analysis of Variance Source DF SS MS F P Regression Residual Error Total Range of Uncertainty for b is 72.72 − 1.96(10.94) to 72.72 + 1.96(10.94) = [51.27 to 94.17]. If you use 2.00 from the t table, the limits would be [50.8 to 94.6].
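The interval arithmetic can be reproduced in a few lines. The sketch assumes a slope estimate of 72.72, which is the midpoint of the reported interval [51.27, 94.17], together with the standard error 10.94 shown on the slide. The function name `interval` is illustrative.

```python
def interval(b, se, crit):
    """Range of uncertainty b +/- crit * SE(b)."""
    half_width = crit * se
    return b - half_width, b + half_width

B1 = 72.72    # slope estimate, inferred as the midpoint of the reported interval
SE = 10.94    # standard error of the slope, from the slide

lo_z, hi_z = interval(B1, SE, 1.96)   # large-sample (normal) critical value
lo_t, hi_t = interval(B1, SE, 2.00)   # t-table value quoted on the slide

print("using 1.96: [%.2f, %.2f]" % (lo_z, hi_z))
print("using 2.00: [%.2f, %.2f]" % (lo_t, hi_t))
```

The 1.96 interval agrees with the reported one up to rounding of the inputs. Neither interval comes close to containing zero.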
Part 2: Model and Inference 2-40/49 Some computer programs report confidence intervals automatically; Minitab does not.
Part 2: Model and Inference 2-41/49 Uncertainty About the Regression Slope Hypothetical Regression Fuel Bill vs. Number of Rooms The regression equation is Fuel Bill = Number of Rooms Predictor Coef SE Coef T P Constant Rooms S = R-Sq = 72.2% R-Sq(adj) = 72.0% This is b 1, the estimate of β 1 This “Standard Error,” (SE) is the measure of uncertainty about the true value. The “range of uncertainty” is b ± 2 SE(b). (Actually 1.96, but people use 2)
Part 2: Model and Inference 2-42/49 Sampling Distributions and Test Statistics
Part 2: Model and Inference 2-43/49 t Statistic for Hypothesis Test
Part 2: Model and Inference 2-44/49 Alternative Approach: The P value Hypothesis: β 1 = 0 The ‘P value’ is the probability that you would have observed evidence against this hypothesis at least as strong as the evidence you did observe, if the null hypothesis were true. P = Prob(|t| would be this large | β 1 = 0) If the P value is less than the Type I error probability (usually 0.05) you have chosen, you will reject the hypothesis. Interpret: If the hypothesis were true, it is ‘unlikely’ that I would have observed this evidence.
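A rough sketch of the P value calculation follows. It uses the standard normal distribution as a large-sample approximation, which is an assumption made here to stay within the Python standard library; the exact calculation uses the t distribution with N−2 degrees of freedom. The inputs 72.72 and 10.94 are the slope (inferred from the midpoint of the interval reported earlier) and its standard error.

```python
from statistics import NormalDist

def two_sided_p_normal(t):
    """Approximate P value: Prob(|Z| >= |t|) for a standard normal Z."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

t_buzz = 72.72 / 10.94            # t statistic for H0: beta1 = 0
p_buzz = two_sided_p_normal(t_buzz)

print("t statistic:", round(t_buzz, 3))
print("approximate P value:", p_buzz)
print("reject H0 at the 0.05 level:", p_buzz < 0.05)
```

A t statistic of exactly 1.96 gives an approximate P value of 0.05, which is why comparing |t| with 1.96 and comparing P with 0.05 lead to the same decision in large samples.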
Part 2: Model and Inference 2-45/49 P value for hypothesis test
Part 2: Model and Inference 2-46/49 Intuitive approach: Does the confidence interval contain zero? Hypothesis: β 1 = 0 The confidence interval contains the set of plausible values of β 1 based on the data and the test. If the confidence interval does not contain 0, reject H 0 : β 1 = 0.
Part 2: Model and Inference 2-47/49 More General Test
Part 2: Model and Inference 2-49/49 Summary: Regression Analysis Investigate: Is the coefficient in a regression model really nonzero? Testing procedure: Model: y = β 0 + β 1 x + ε. Hypothesis: H 0 : β 1 = B (B = 0 for the test of no relationship). Rejection region: the least squares coefficient b 1 is far from B. Test: α level for the test = 0.05 as usual. Compute t = (b 1 – B)/StandardError. Reject H 0 if |t| is above the critical value: 1.96 if large sample, the value from the t table if small sample. Equivalently, reject H 0 if the reported P value is less than the α level. Degrees of freedom for the t statistic are N−2.
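The summary procedure can be written as a short function. This is a sketch: the name `t_test` and the default critical value are illustrative, the slope 72.72 is inferred from the midpoint of the interval reported earlier, and the alternative hypothesized value B = 60 is invented to show a case where H 0 is not rejected.

```python
def t_test(b1, se, B=0.0, crit=1.96):
    """Return (t, reject) for H0: beta1 = B, rejecting when |t| exceeds crit."""
    t = (b1 - B) / se
    return t, abs(t) > crit

B1, SE = 72.72, 10.94   # slope estimate (inferred) and its standard error
CRIT = 2.00             # t-table value used in the slides for N - 2 = 60

t_zero, reject_zero = t_test(B1, SE, B=0.0, crit=CRIT)    # no relationship
t_sixty, reject_sixty = t_test(B1, SE, B=60.0, crit=CRIT) # hypothetical B

print("H0: beta1 = 0  -> t = %.3f, reject: %s" % (t_zero, reject_zero))
print("H0: beta1 = 60 -> t = %.3f, reject: %s" % (t_sixty, reject_sixty))
```

Taking the absolute value of t makes the test two-sided, so a slope far below B leads to rejection just as a slope far above it does.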