Essentials of Business Statistics: Communicating with Numbers By Sanjiv Jaggia and Alison Kelly Copyright © 2014 by McGraw-Hill Higher Education. All rights reserved.
12-2 Chapter 12: Regression Analysis
Learning Objectives
LO 12.1 Estimate the simple linear regression model and interpret the coefficients.
LO 12.2 Estimate the multiple linear regression model and interpret the coefficients.
LO 12.3 Calculate and interpret the standard error of the estimate.
LO 12.4 Calculate and interpret the coefficient of determination $R^2$.
LO 12.5 Differentiate between $R^2$ and adjusted $R^2$.
LO 12.6 Conduct tests of individual significance.
LO 12.7 Conduct a test of joint significance.
The Simple Linear Regression Model With regression analysis, we explicitly assume that one variable, called the response variable, is influenced by other variables, called the explanatory variables. Using regression analysis, we may predict the response variable given values for our explanatory variables. Regression Analysis LO 12.1 Estimate the simple linear regression model and interpret the coefficients.
12-4 If the value of the response variable is uniquely determined by the values of the explanatory variables, we say that the relationship is deterministic. In most fields of research, however, relevant factors are omitted and the explanatory variables do not fully determine the response; we then say that the relationship is inexact. In regression analysis, we include a random error term that acknowledges that the actual relationship between the response and explanatory variables is not deterministic.
12-5 The simple linear regression model is defined as $y = \beta_0 + \beta_1 x + \varepsilon$, where $y$ and $x$ are, respectively, the response and explanatory variables and $\varepsilon$ is the random error term. The coefficients $\beta_0$ and $\beta_1$ are the unknown parameters to be estimated.
12-6 By fitting the model to our data, we obtain the sample regression equation $\hat{y} = b_0 + b_1 x$, where $\hat{y}$ is the predicted value of the response variable, $b_0$ is the estimate of $\beta_0$, and $b_1$ is the estimate of $\beta_1$. Because the predictions are rarely exact, we call the difference between the actual and predicted values the residual, $e = y - \hat{y}$.
12-7 Figure: scatterplot of debt payments against income with the sample regression line superimposed. Debt payments rise with income. The vertical distance between $y$ and $\hat{y}$ represents the residual, $e$.
12-8 The two parameters $\beta_0$ and $\beta_1$ are estimated by minimizing the sum of squared residuals, $\sum e_i^2$. The slope coefficient is first estimated as $b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$. Then the intercept is computed as $b_0 = \bar{y} - b_1 \bar{x}$.
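A minimal Python sketch of the least-squares estimates, assuming the usual closed-form solution for one explanatory variable: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. The function name `fit_simple_ols` and the five data points are made up for illustration; they are not the debt-payments data from the slides.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1), the intercept and slope that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope is estimated first
    b0 = y_bar - b1 * x_bar    # intercept follows from the slope
    return b0, b1


# Made-up illustrative data (hypothetical, not from the textbook).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_simple_ols(x, y)
# Residual e = y - y_hat for every observation.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
# prints: b0 = 2.20, b1 = 0.60
```

For these points the residuals sum to zero (up to rounding), which holds for any least-squares fit that includes an intercept.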
The Multiple Regression Model If there is more than one explanatory variable available, we can use multiple regression. For example, we analyzed how debt payments are influenced by income, but ignored the possible effect of unemployment. Multiple regression allows us to explore how several variables influence the response variable. Regression Analysis LO 12.2 Estimate the multiple linear regression model and interpret the coefficients.
12-10 Suppose there are $k$ explanatory variables. The multiple linear regression model is defined as $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$, where $x_1, x_2, \dots, x_k$ are the explanatory variables and the $\beta_j$ values are the unknown parameters that we will estimate from the data. As before, $\varepsilon$ is the random error term.
12-11 The sample multiple regression equation is $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$. In multiple regression, the interpretation of the slopes $b_1$ through $b_k$ changes slightly, as they show “partial” influences. For example, if there are $k = 3$ explanatory variables, $b_1$ estimates the change in $\hat{y}$ for a one-unit increase in $x_1$, holding $x_2$ and $x_3$ constant.
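The slides leave the computation of $b_0, b_1, \dots, b_k$ to software. As a rough sketch of what that software does, one standard approach is to solve the normal equations $(X'X)b = X'y$, where $X$ holds a column of ones followed by the explanatory variables. The helper names `solve_linear_system` and `fit_multiple_ols` and the data below are hypothetical, chosen only for illustration; a real analysis would normally rely on Excel or a statistics library.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def fit_multiple_ols(xs, y):
    """xs is a list of k explanatory-variable columns; returns [b0, b1, ..., bk]."""
    n = len(y)
    cols = [[1.0] * n] + [[float(v) for v in col] for col in xs]
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve_linear_system(xtx, xty)


# Made-up illustrative data with k = 2 explanatory variables (hypothetical).
x1 = [1, 2, 3, 4, 5, 6]
x2 = [3, 1, 4, 1, 5, 9]
y = [5, 4, 9, 6, 11, 16]

b = fit_multiple_ols([x1, x2], y)
print("b0, b1, b2 =", [round(v, 3) for v in b])
```

Here `b[1]` is read as a partial influence: the estimated change in the predicted response for a one-unit increase in `x1` with `x2` held constant.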
Goodness-of-Fit Measures Regression Analysis LO 12.3 Calculate and interpret the standard error of the estimate. We will introduce three measures to judge how well the sample regression fits the data: the standard error of the estimate, the coefficient of determination, and the adjusted $R^2$.
12-13 To compute the standard error of the estimate, we start with the sum of squares due to error, $SSE = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2$. Dividing SSE by the appropriate degrees of freedom, $n - k - 1$, yields the mean squared error, $MSE = \frac{SSE}{n - k - 1}$.
12-14 The square root of the MSE is the standard error of the estimate, $s_e = \sqrt{MSE} = \sqrt{\frac{SSE}{n - k - 1}}$. In general, the less dispersion around the regression line, the smaller the $s_e$, which implies a better fit of the model to the data.
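The chain from SSE to MSE to $s_e$ can be sketched in a few lines of Python. The function name `standard_error_of_estimate` and the data are assumptions for illustration; the fitted values use intercept 2.2 and slope 0.6, which are the least-squares estimates for these five made-up points.

```python
import math


def standard_error_of_estimate(y, y_hat, k):
    """Return (SSE, MSE, s_e) for a fit with k explanatory variables."""
    n = len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # sum of squared residuals
    mse = sse / (n - k - 1)                                 # divide by degrees of freedom
    return sse, mse, math.sqrt(mse)


# Made-up illustrative data (hypothetical, not from the textbook).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * xi for xi in x]  # least-squares fit for these points

sse, mse, s_e = standard_error_of_estimate(y, y_hat, k=1)
print(f"SSE = {sse:.2f}, MSE = {mse:.2f}, s_e = {s_e:.3f}")
# prints: SSE = 2.40, MSE = 0.80, s_e = 0.894
```

Because $s_e$ is in the same units as the response variable, it is most useful for comparing models fitted to the same response.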
12-15 LO 12.4 Calculate and interpret the coefficient of determination $R^2$. The coefficient of determination, commonly referred to as $R^2$, is another goodness-of-fit measure that is easier to interpret than the standard error of the estimate. $R^2$ quantifies the fraction of the sample variation in the response variable that is explained by the sample regression equation.
12-16 The coefficient of determination is computed as $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$, where $SSR = \sum (\hat{y}_i - \bar{y})^2$ and $SST = \sum (y_i - \bar{y})^2$. SST, the total sum of squares, denotes the total variation in the response variable. It can be broken down into two components, $SST = SSR + SSE$: the variation explained by the regression equation (the sum of squares due to regression, SSR) and the unexplained variation (the sum of squares due to error, SSE).
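A short Python sketch of the decomposition of total variation and the resulting $R^2$. The function name `sums_of_squares` and the data are hypothetical; the fitted values again use the least-squares intercept 2.2 and slope 0.6 for these made-up points. Note that SST = SSR + SSE holds for least-squares fits that include an intercept, not for arbitrary fitted values.

```python
def sums_of_squares(y, y_hat):
    """Return (SST, SSR, SSE) for actual values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    return sst, ssr, sse


# Made-up illustrative data (hypothetical, not from the textbook).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * xi for xi in x]  # least-squares fit for these points

sst, ssr, sse = sums_of_squares(y, y_hat)
r2 = ssr / sst
print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}, R^2 = {r2:.2f}")
# prints: SST = 6.0, SSR = 3.6, SSE = 2.4, R^2 = 0.60
```

In this example 60% of the sample variation in y is explained by the fitted line.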
12-17 LO 12.5 Differentiate between $R^2$ and adjusted $R^2$. Adding explanatory variables never lowers $R^2$ and almost always raises it, even when the added variables are unimportant and should not be in the model. The adjusted $R^2$ tries to balance the raw explanatory power against the desire to include only important predictors.
12-18 The adjusted $R^2$ is computed as $\text{Adjusted } R^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$. The adjusted $R^2$ penalizes $R^2$ for adding explanatory variables. As with our other goodness-of-fit measures, we typically let the computer calculate the adjusted $R^2$. It is shown directly below the $R^2$ in the Excel regression output.
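To see the penalty at work, the following Python sketch applies the standard adjustment, $1 - (1 - R^2)(n - 1)/(n - k - 1)$, to two hypothetical models. The function name `adjusted_r2` and the $R^2$ values are invented for illustration: a second explanatory variable is assumed to nudge $R^2$ from 0.60 to 0.61 with $n = 5$ observations.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need more than k + 1 observations")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical values: the second variable barely improves R^2.
one_var = adjusted_r2(0.60, n=5, k=1)
two_vars = adjusted_r2(0.61, n=5, k=2)
print(f"{one_var:.3f} {two_vars:.3f}")
# prints: 0.467 0.220
```

Although $R^2$ rises slightly, the adjusted $R^2$ falls, which suggests the added variable does not earn its place in the model.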
Tests of Significance Inference with Regression Models LO 12.6 Conduct tests of individual significance. With two explanatory variables (Income and Unemployment) to choose from, we can formulate three possible linear models. Model 1: $\text{Debt} = \beta_0 + \beta_1 \text{Income} + \varepsilon$. Model 2: $\text{Debt} = \beta_0 + \beta_1 \text{Unemployment} + \varepsilon$. Model 3: $\text{Debt} = \beta_0 + \beta_1 \text{Income} + \beta_2 \text{Unemployment} + \varepsilon$.
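12-20 Consider our standard multiple regression model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$. In general, we can test whether $\beta_j$ is equal to, greater than, or less than some hypothesized value $\beta_{j0}$. This test could have one of three forms: two-tailed, $H_0: \beta_j = \beta_{j0}$ versus $H_A: \beta_j \neq \beta_{j0}$; right-tailed, $H_0: \beta_j \leq \beta_{j0}$ versus $H_A: \beta_j > \beta_{j0}$; left-tailed, $H_0: \beta_j \geq \beta_{j0}$ versus $H_A: \beta_j < \beta_{j0}$.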
12-21 The test statistic follows a $t$ distribution with $df = n - k - 1$ degrees of freedom. It is calculated as $t_{df} = \frac{b_j - \beta_{j0}}{se(b_j)}$, where $se(b_j)$ is the standard error of the estimator $b_j$.
12-22 By far the most common hypothesis test for an individual coefficient is to test whether its value differs from zero, that is, $H_0: \beta_j = 0$ versus $H_A: \beta_j \neq 0$. To see why, consider our model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$. If $\beta_j = 0$, the explanatory variable $x_j$ drops out of the model, so $x_j$ has no linear influence on the response variable.
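A Python sketch of the test of individual significance for the slope, restricted to the simple regression case ($k = 1$) so that the standard error has a short closed form, $se(b_1) = s_e / \sqrt{\sum (x_i - \bar{x})^2}$. With more than one explanatory variable, $se(b_j)$ is normally read from software output. The function name `slope_t_statistic` and the data are hypothetical.

```python
import math


def slope_t_statistic(x, y, beta_10=0.0):
    """Return (t, df) for H0: beta_1 = beta_10 in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    df = n - 2  # n - k - 1 with k = 1
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / df)        # standard error of the estimate
    se_b1 = s_e / math.sqrt(s_xx)    # standard error of the slope when k = 1
    return (b1 - beta_10) / se_b1, df


# Made-up illustrative data (hypothetical, not from the textbook).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

t_stat, df = slope_t_statistic(x, y)
print(f"t = {t_stat:.3f} with df = {df}")
# prints: t = 2.121 with df = 3
```

For a two-tailed test at the 5% level with 3 degrees of freedom, the critical value from a t table is 3.182, so with these few made-up points the slope is not significantly different from zero.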
12-23 LO 12.7 Conduct a test of joint significance. In addition to conducting tests of individual significance, we may also want to test the joint significance of all $k$ explanatory variables at once. The competing hypotheses for a test of joint significance are $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ versus $H_A$: at least one $\beta_j \neq 0$.
12-24 The test statistic for a test of joint significance is $F_{(df_1, df_2)} = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n - k - 1)}$, where MSR and MSE are, respectively, the mean square regression and the mean square error. The numerator degrees of freedom, $df_1$, equal $k$, while the denominator degrees of freedom, $df_2$, equal $n - k - 1$. Fortunately, statistical software generally reports the value of $F_{(df_1, df_2)}$ and its p-value as standard output, so computation by hand is rarely necessary.
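For readers who want to see the arithmetic that the software performs, here is a minimal Python sketch of the joint-significance statistic, computed as MSR divided by MSE with MSR = SSR / k and MSE = SSE / (n - k - 1). The function name `joint_f_statistic` and the data are hypothetical; the fitted values use the least-squares intercept 2.2 and slope 0.6 for these made-up points.

```python
def joint_f_statistic(y, y_hat, k):
    """Return (F, df1, df2), where F = MSR / MSE."""
    n = len(y)
    y_bar = sum(y) / n
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    df1, df2 = k, n - k - 1
    msr = ssr / df1
    mse = sse / df2
    return msr / mse, df1, df2


# Made-up illustrative data (hypothetical, not from the textbook).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * xi for xi in x]  # least-squares fit for these points

f_stat, df1, df2 = joint_f_statistic(y, y_hat, k=1)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
# prints: F(1, 3) = 4.50
```

With a single explanatory variable the joint test and the individual two-tailed test of the slope coincide: the F statistic equals the square of the slope's t statistic.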