Copyright © 2011 Pearson Education, Inc. The Simple Regression Model Chapter 21
21.1 The Simple Regression Model
How can we test the CAPM (Capital Asset Pricing Model) for Berkshire Hathaway stock?
Formulate the simple regression with the percentage change in Berkshire Hathaway stock as y and the percentage change in the value of the whole stock market as x.
Use inference related to regression: standard errors, confidence intervals, and hypothesis tests.
21.1 The Simple Regression Model
Simple Regression Model (SRM): a model for the association in the population between an explanatory variable x and a response y.
Consider the data to be a sample from a population.
21.1 The Simple Regression Model
Linear on Average
The equation of the SRM describes how the conditional mean of Y depends on X. The SRM shows that these means lie on a line with intercept $\beta_0$ and slope $\beta_1$:
$\mu_{y|x} = E(Y \mid X = x) = \beta_0 + \beta_1 x$
21.1 The Simple Regression Model
Deviations from the Mean
The deviations of the responses around the conditional mean $\mu_{y|x}$ are called errors. The error is denoted by $\varepsilon$, and $E(\varepsilon) = 0$.
21.1 The Simple Regression Model
Deviations from the Mean
The SRM makes three assumptions about $\varepsilon$:
1. Independent. Errors are independent of each other.
2. Equal variance. All errors have the same variance, $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$.
3. Normal. The errors are normally distributed.
21.1 The Simple Regression Model
Data Generating Process
Let Y denote monthly sales of a company and let X denote its spending on advertising (both in thousands of dollars). Assume the following population model:
21.1 The Simple Regression Model
Data Generating Process
The SRM assumes a normal distribution at each x.
21.1 The Simple Regression Model
Data Generating Process
Eventually the data shown below are observed.
21.1 The Simple Regression Model
Data Generating Process
The true regression line is a characteristic of the population, not the observed data. The SRM is a model and offers a simplified view of reality.
21.1 The Simple Regression Model
Simple Regression Model (SRM)
Observed values of the response Y are linearly related to the values of the explanatory variable X by the equation
$y = \beta_0 + \beta_1 x + \varepsilon$, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$.
The observations are independent of one another, have equal variance around the regression line, and are normally distributed around the regression line.
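The data generating process described by the SRM can be sketched as a small simulation. This is only an illustration: the function name simulate_srm and the parameter values (intercept 500, slope 2, error SD 45) are hypothetical choices for the sketch and are not taken from the slides.

```python
import random

def simulate_srm(xs, beta0, beta1, sigma, seed=0):
    """Draw one response per x from y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical population parameters (illustrative only, not from the slides)
xs = [100 + 5 * i for i in range(40)]
ys = simulate_srm(xs, beta0=500.0, beta1=2.0, sigma=45.0)

# The errors are the deviations of the responses from the true line
errors = [y - (500.0 + 2.0 * x) for x, y in zip(xs, ys)]
```

In practice the true line and the errors are never observed; only the pairs (x, y) are, which is why the slope and intercept must be estimated.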
21.2 Conditions for the SRM
Conditions for the SRM – Checklist
Is the association between y and x linear?
Have lurking variables been ruled out?
Are the errors evidently independent?
Are the variances of the residuals similar?
Are the residuals nearly normal?
21.2 Conditions for the SRM
Conditions for the SRM – CAPM Example
The linearity condition is satisfied; there is no pattern in the residuals.
Data are shifted to the right because of two outliers (well-known declines in the market).
21.2 Conditions for the SRM
Conditions for the SRM – CAPM Example
No obvious lurking variable (according to CAPM theory).
The similar variances condition is satisfied. Check the plot of residuals versus x for any fan-shaped pattern (none visible).
21.2 Conditions for the SRM
Conditions for the SRM – CAPM Example
Evidently independent. No dependence is apparent in the timeplot of the residuals.
21.2 Conditions for the SRM
Conditions for the SRM – CAPM Example
The residuals are not normally distributed, so check the sample size condition (satisfied here) in order to rely on the Central Limit Theorem (CLT).
21.2 Conditions for the SRM
Modeling Process
Before looking at plots, ask two questions:
1. Does a linear relationship make sense?
2. Is the relationship free of lurking variables?
Then begin working with data.
21.2 Conditions for the SRM
Modeling Process
Plot y versus x and verify a linear association.
Fit the least squares line and obtain residuals.
Plot the residuals versus x.
If the data are a time series, construct a timeplot of the residuals.
Inspect the histogram and quantile plot of the residuals.
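The fitting and residual steps of the modeling process can be sketched with the Python standard library. The function names fit_line and residuals and the five data points are hypothetical and serve only to illustrate the least squares calculation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return the least squares intercept b0 and slope b1."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

def residuals(xs, ys, b0, b1):
    """Return the observed response minus the fitted value for each point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Hypothetical data (illustrative only)
xs_demo = [1, 2, 3, 4, 5]
ys_demo = [2.2, 4.1, 5.8, 8.3, 9.9]
b0, b1 = fit_line(xs_demo, ys_demo)
res = residuals(xs_demo, ys_demo, b0, b1)
```

The list res is what would then be plotted against x, against time, and in a histogram or quantile plot to check the conditions.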
21.3 Inference in Regression
Parameters and Estimates for SRM
21.3 Inference in Regression
Standard Errors
Standard errors describe the sample-to-sample variability of $b_0$ and $b_1$.
The estimated standard error of $b_1$ is
$se(b_1) = \frac{s_e}{\sqrt{n-1}} \cdot \frac{1}{s_x}$
21.3 Inference in Regression
Estimated Standard Error of $b_1$
Influenced by:
Standard deviation of the residuals. As it increases, the standard error increases.
Sample size. As it increases, the standard error decreases.
Standard deviation of x. As it increases, the standard error decreases.
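The three influences can be checked numerically from the formula se(b1) = s_e / (sqrt(n - 1) * s_x). The sketch below uses a hypothetical helper se_slope and made-up input values; it only evaluates the formula and does not fit any data.

```python
import math

def se_slope(s_e, n, s_x):
    """Estimated standard error of the slope: s_e / (sqrt(n - 1) * s_x)."""
    return s_e / (math.sqrt(n - 1) * s_x)

# Hypothetical baseline: residual SD 2, sample size 101, SD of x equal to 4
base = se_slope(s_e=2.0, n=101, s_x=4.0)

noisier = se_slope(s_e=4.0, n=101, s_x=4.0)   # more residual variation: larger se
larger_n = se_slope(s_e=2.0, n=401, s_x=4.0)  # more data: smaller se
wider_x = se_slope(s_e=2.0, n=101, s_x=8.0)   # more spread in x: smaller se
```

More variation in x gives the line a wider base to rest on, which is why it makes the slope estimate more precise.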
21.3 Inference in Regression
Software Results for CAPM Example
21.3 Inference in Regression
Confidence Intervals
The 95% confidence interval for $\beta_1$ is $b_1 \pm t_{0.025,\,n-2} \, se(b_1)$.
The 95% confidence interval for $\beta_0$ is $b_0 \pm t_{0.025,\,n-2} \, se(b_0)$.
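A confidence interval of this form is an estimate plus or minus a t critical value times a standard error. The sketch below assumes the critical value is supplied by the user (for example from a t table with n - 2 degrees of freedom); the function name conf_interval and the numbers are hypothetical.

```python
def conf_interval(estimate, std_error, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width

# Hypothetical slope estimate and standard error (illustrative only).
# A critical value of about 1.99 is appropriate for roughly 80 residual
# degrees of freedom; for moderately large samples it is close to 2.
lower, upper = conf_interval(estimate=0.75, std_error=0.08, t_crit=1.99)
```

The same function applies to the intercept by passing b0 and se(b0) instead.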
21.3 Inference in Regression
Confidence Intervals – CAPM Example
The 95% confidence interval for $\beta_1$ is
The 95% confidence interval for $\beta_0$ is
21.3 Inference in Regression
Hypothesis Tests
To test $H_0: \beta_1 = 0$ use $t = b_1 / se(b_1)$.
To test $H_0: \beta_0 = 0$ use $t = b_0 / se(b_0)$.
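The test statistic is the estimate divided by its standard error. The sketch below also returns a two-sided p-value, but note the assumption: it uses the standard normal distribution in place of the t distribution with n - 2 degrees of freedom, which is only a reasonable stand-in when the sample is large. The function names are hypothetical.

```python
import math

def t_statistic(estimate, std_error):
    """Ratio used to test H0: parameter = 0."""
    return estimate / std_error

def approx_two_sided_p(t):
    """Two-sided p-value from the normal approximation (large-sample assumption)."""
    # erfc(|t| / sqrt(2)) equals 2 * P(Z > |t|) for a standard normal Z
    return math.erfc(abs(t) / math.sqrt(2.0))

# Hypothetical estimate and standard error (illustrative only)
t = t_statistic(estimate=0.5, std_error=0.1)
p = approx_two_sided_p(t)
```

For small samples the exact t distribution gives somewhat larger p-values than this approximation, so statistical software should be preferred there.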
21.3 Inference in Regression
Hypothesis Tests – CAPM Example
The t-statistic of 9.29 and its small p-value indicate that the slope is significantly different from zero.
The t-statistic of 4.11 and its small p-value indicate that the intercept is significantly different from zero.
4M Example 21.1: LOCATING A FRANCHISE OUTLET
Motivation
Does traffic volume affect gasoline sales? How much more gasoline can be expected to be sold at a franchise location with an average of 40,000 drive-bys compared to one with an average of 32,000 drive-bys?
4M Example 21.1: LOCATING A FRANCHISE OUTLET
Method
Use sales data from a recent month obtained from 80 franchise outlets. The 95% confidence interval for 8,000 times the estimated slope will indicate how much more gas is expected to sell at the busier location.
4M Example 21.1: LOCATING A FRANCHISE OUTLET
Method
Association is linear; no obvious lurking variable.
4M Example 21.1: LOCATING A FRANCHISE OUTLET
Mechanics
4M Example 21.1: LOCATING A FRANCHISE OUTLET
Mechanics
Residual plot confirms similar variances.
4M Example 21.1: LOCATING A FRANCHISE OUTLET
Mechanics
Residuals appear normally distributed.
4M Example 21.1: LOCATING A FRANCHISE OUTLET
Mechanics
The 95% confidence interval for $\beta_1$ is approximately 0.188 to 0.285 gallons/car (the endpoints implied by dividing the interval below by 8,000). Hence, a difference of 8,000 cars in daily traffic volume implies a difference in average daily sales of approximately 1,507 to 2,281 more gallons per day.
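Scaling a confidence interval for the slope by a difference in x gives an interval for the corresponding difference in the mean response. A minimal sketch follows; the helper scale_interval is hypothetical, and the per-car endpoints are back-calculated from the stated 1,507 and 2,281 gallons rather than read from the regression output.

```python
def scale_interval(lower, upper, k):
    """Multiply both endpoints by k, keeping them ordered even if k is negative."""
    a, b = k * lower, k * upper
    return min(a, b), max(a, b)

# Per-car endpoints back-calculated from the stated result (an assumption)
slope_lower = 1507 / 8000
slope_upper = 2281 / 8000

# Difference in traffic between the two sites: 40,000 - 32,000 cars
gallons_lower, gallons_upper = scale_interval(slope_lower, slope_upper, 40_000 - 32_000)
```

This works because multiplying an estimate by a constant multiplies its standard error by the same constant, so the interval scales endpoint by endpoint.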
4M Example 21.1: LOCATING A FRANCHISE OUTLET
Message
Based on a sample of 80 gas stations, we expect that a station located at a site with 40,000 drive-bys will sell on average from 1,507 to 2,281 more gallons of gas daily than a location with 32,000 drive-bys.
21.4 Prediction Intervals
Leveraging the SRM
Prediction interval: an interval designed to hold a fraction (usually 95%) of the values of the response for a given value of x.
A prediction interval differs from a confidence interval because it makes a statement about the location of a new observation rather than a parameter of a population.
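21.4 Prediction Intervals
Leveraging the SRM
The 95% prediction interval for $y_{new}$ is
$\hat{y}_{new} \pm t_{0.025,\,n-2} \, se(\hat{y}_{new})$
where $\hat{y}_{new} = b_0 + b_1 x_{new}$ and
$se(\hat{y}_{new}) = s_e \sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{(n-1)\, s_x^2}}$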
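The prediction interval can be computed directly from the data once the least squares line is fitted. The sketch below is a standard-library illustration under the SRM assumptions; the function name, the eight data points, and the user-supplied critical value are all hypothetical.

```python
import math
from statistics import mean

def prediction_interval(xs, ys, x_new, t_crit):
    """Return (lower, y_hat, upper) for a new response at x_new."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)  # equals (n - 1) * s_x^2
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))          # SD of the residuals
    y_hat = b0 + b1 * x_new
    se_pred = s_e * math.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / sxx)
    return y_hat - t_crit * se_pred, y_hat, y_hat + t_crit * se_pred

# Hypothetical data; 2.447 is the 0.025 upper-tail t value for n - 2 = 6 df
xs_demo = [1, 2, 3, 4, 5, 6, 7, 8]
ys_demo = [2.3, 3.8, 6.1, 8.2, 9.7, 12.4, 13.8, 16.1]
near = prediction_interval(xs_demo, ys_demo, x_new=4.5, t_crit=2.447)
far = prediction_interval(xs_demo, ys_demo, x_new=20.0, t_crit=2.447)
```

Comparing near and far shows the interval widening as x_new moves away from the mean of the observed x values, which is one reason extrapolation deserves caution.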
21.4 Prediction Intervals
Leveraging the SRM
A simple approximation for a 95% prediction interval is $\hat{y}_{new} \pm 2 s_e$.
Prediction intervals are reliable within the range of observed data. They are also sensitive to the assumptions of constant variance and normality.
4M Example 21.2: MANAGING NATURAL RESOURCES
Motivation
In managing commercial fishing fleets, the level of effort (number of boat-days) is assumed to influence the size of the catch. What is the predicted crab catch in a season with 7,500 days of effort?
4M Example 21.2: MANAGING NATURAL RESOURCES
Method
Use regression with Y equal to the catch near Vancouver Island from 1980 to 2007, measured in thousands of pounds of Dungeness crabs, and X equal to the level of effort (total number of days by boats catching Dungeness crabs).
4M Example 21.2: MANAGING NATURAL RESOURCES
Method
Linear association is evident.
4M Example 21.2: MANAGING NATURAL RESOURCES
Mechanics
4M Example 21.2: MANAGING NATURAL RESOURCES
Mechanics
Evidently independent.
4M Example 21.2: MANAGING NATURAL RESOURCES
Mechanics
Similar variances confirmed.
4M Example 21.2: MANAGING NATURAL RESOURCES
Mechanics
Nearly normal condition could be satisfied.
4M Example 21.2: MANAGING NATURAL RESOURCES
Mechanics
The t-statistic (and p-value) indicate that the slope is significantly different from zero.
The predicted catch in a year with x = 7,500 days of effort is 1,173.24 thousand pounds.
The 95% prediction interval is from 908.44 to 1,438.11 thousand pounds.
4M Example 21.2: MANAGING NATURAL RESOURCES
Message
There is a statistically significant linear association between days of effort and total catch. On average, each additional day of effort (per boat) increases the harvest by about 160 pounds. In a season with 7,500 days of effort, there is an expected total harvest of 1,173,240 pounds. There is a 95% probability that the catch will be between 908,440 and 1,438,110 pounds.
Best Practices
Verify that your model makes sense, both visually and substantively.
Consider other possible explanatory variables.
Check the conditions, in the listed order.
Best Practices (Continued)
Use confidence intervals to express what you know about the slope and intercept.
Check the assumptions of the SRM carefully before using prediction intervals.
Be careful when extrapolating.
Pitfalls
Don’t overreact to residual plots.
Do not mistake varying amounts of data for unequal variances.
Do not confuse confidence intervals with prediction intervals.
Do not expect that $r^2$ and $s_e$ must improve with a larger sample.
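The last pitfall can be illustrated with a small simulation: s_e estimates the error standard deviation of the population, so it settles near that value rather than shrinking as the sample grows. The model below (intercept 1, slope 0.5, error SD 2, x uniform on 0 to 10) and the function name residual_sd are hypothetical choices for the sketch.

```python
import math
import random
from statistics import mean

def residual_sd(n, sigma=2.0, seed=1):
    """Simulate n points from a hypothetical SRM and return s_e from the least squares fit."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [1.0 + 0.5 * x + rng.gauss(0.0, sigma) for x in xs]
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

small = residual_sd(50)     # s_e from a small sample
large = residual_sd(5000)   # s_e from a much larger sample
```

Both values should land near the true error SD of 2. What improves with a larger sample is the precision of the estimates (the standard errors of b0 and b1), not the scatter of the data around the line.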