
Multiple Linear Regression and Correlation Analysis (Chapter 14). ©2008 The McGraw-Hill Companies, Inc. McGraw-Hill/Irwin.


1 Multiple Linear Regression and Correlation Analysis. Chapter 14. ©2008 The McGraw-Hill Companies, Inc. McGraw-Hill/Irwin.

2 GOALS
– Describe the relationship between several independent variables and a dependent variable using multiple regression analysis.
– Set up, interpret, and apply an ANOVA table.
– Compute and interpret the multiple standard error of estimate, the coefficient of multiple determination, and the adjusted coefficient of multiple determination.
– Conduct a test of hypothesis to determine whether the set of regression coefficients differs from zero.
– Conduct a test of hypothesis on each of the regression coefficients.
– Use residual analysis to evaluate the assumptions of multiple regression analysis.
– Evaluate the effects of correlated independent variables.
– Use and understand qualitative independent variables.
– Understand and interpret the stepwise regression method (skip).
– Understand and interpret possible interaction among independent variables (skip).

3 1. Multiple Regression Analysis. The general multiple regression equation with k independent variables is: Y' = a + b1X1 + b2X2 + ... + bkXk. We use more than one independent variable to explain or predict the dependent variable, Y. Again, the least squares criterion is used to develop this equation. Because determining b1, b2, etc. by hand is very tedious, a software package such as Excel or MINITAB is recommended.
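The slides rely on Excel/MINITAB for the estimates; as a rough stand-in, here is a minimal Python sketch of the least squares fit. The statsmodels library and the synthetic data are illustrative assumptions, not the chapter's own tool or data.

```python
# A minimal least-squares fit with k = 3 independent variables.
# statsmodels and the made-up data are assumptions for illustration;
# the chapter itself works from Excel/MINITAB output.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))                 # 20 cases, 3 predictors
y = 5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=20)

X_design = sm.add_constant(X)                # prepend the intercept column a
model = sm.OLS(y, X_design).fit()            # least squares estimates
print(model.params)                          # [a, b1, b2, b3]
```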

4 Multiple Regression Analysis. For two independent variables, the general form of the multiple regression equation is: Y' = a + b1X1 + b2X2. X1 and X2 are the independent variables. a is the Y-intercept. b1 is the net change in Y for each unit change in X1, holding X2 constant. It is called a partial (or net) regression coefficient. Graphically, the relationship is portrayed as a plane (as shown in Chart 14-1).

5 Regression Plane for a 2-Independent-Variable Linear Regression Equation

6 Multiple Linear Regression - Example. One of the questions most frequently asked by prospective home buyers is: If we purchase this home, how much do we expect to pay to heat it during the winter? A real estate agency has been asked to develop a guideline regarding heating costs for single-family homes. Three variables are thought to relate to the heating costs: (1) the mean daily outside temperature, (2) the number of inches of insulation in the attic, and (3) the age in years of the furnace (boiler). To investigate, the research department selected a random sample of 20 recently sold homes. It collected the heating cost for each home as well as the outside temperature, the inches of insulation, and the age of the furnace.

7 [Table 14-2: heating cost, outside temperature, attic insulation, and furnace age for the 20 sampled homes; data not transcribed]

8 Multiple Linear Regression – Excel Example

9 The Multiple Regression Equation – Interpreting the Regression Coefficients. The regression coefficient for outside temperature is -4.583. The coefficient is negative, showing an inverse relationship between heating cost and temperature: as the outside temperature increases, the heating cost decreases. The numeric value of the regression coefficient provides more information. If temperature increases by 1 degree, holding the other two independent variables constant, we estimate a decrease of $4.583 in heating cost. So if the mean temperature is 25 degrees in Boston and 35 degrees in Philadelphia, with insulation and age of furnace being the same, we expect the heating cost to be $45.83 less in Philadelphia. The insulation variable also shows an inverse relationship: the more insulation in the attic, the lower the heating cost. So the negative sign for this coefficient is logical. For each additional inch of insulation, we expect the heating cost to decline by $14.83 per month, regardless of the outside temperature or the age of the furnace. The age of the furnace variable shows a direct relationship: with an older furnace, the cost to heat the home increases. Specifically, for each additional year of furnace age, we expect the heating cost to increase by $6.10 per month.

10 Applying the Model for Estimation. What is the estimated heating cost for a home if the mean outside temperature is 30 degrees, there are 5 inches of insulation in the attic, and the furnace is 10 years old? (A sketch of the substitution follows.)
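A sketch of the calculation, using the slopes quoted on the previous slide. The intercept is not shown in this transcript, so the value below is a labeled placeholder, not the fitted value.

```python
# Y' = a + b1*(temperature) + b2*(insulation) + b3*(furnace age).
# Slopes are from the slide; the intercept a is a hypothetical
# placeholder, since it is not given in this transcript.
a = 427.19                       # placeholder -- replace with the fitted intercept
b1, b2, b3 = -4.583, -14.83, 6.10

cost = a + b1 * 30 + b2 * 5 + b3 * 10
print(f"Estimated heating cost: ${cost:.2f} per month")
```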

11 2. How well does the regression equation fit the data? Several measures (methods) are used to describe how effectively the independent variables explain the variation of the dependent variable:
– (1) Multiple Standard Error of Estimate
– (2) ANOVA Table
– (3) Coefficient of (Multiple) Determination
– (4) Adjusted Coefficient of (Multiple) Determination

12 (1) Multiple Standard Error of Estimate. The multiple standard error of estimate is a measure of the effectiveness of the regression equation. It is based on the sum of the squared deviations from the regression line (the residuals), Σ(Y - Y')². It is measured in the same units as the dependent variable (if Y is measured in dollars, the standard error is also in dollars). The formula is: s(Y.12...k) = √[ Σ(Y - Y')² / (n - (k + 1)) ].
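A small sketch of this computation; the observed and fitted values below are made up for illustration.

```python
# s(Y.12...k) = sqrt( sum((Y - Y')^2) / (n - (k + 1)) )
import numpy as np

y     = np.array([250.0, 360.0, 165.0, 43.0, 92.0, 200.0])   # observed (made up)
y_hat = np.array([258.0, 351.0, 140.0, 60.0, 101.0, 195.0])  # fitted   (made up)
n, k = len(y), 3

sse = np.sum((y - y_hat) ** 2)          # sum of squared residuals
s = np.sqrt(sse / (n - (k + 1)))
print(round(s, 2))                      # in the same units (dollars) as Y
```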


14 (2) The ANOVA Table. The ANOVA table reports the total variation in the dependent variable (SST), which is divided into two components:
– The explained (or regression) variation (SSR), which is accounted for by the set of independent variables.
– The unexplained (or residual/error) variation (SSE), which is not accounted for by the independent variables.
The degrees of freedom for the explained (regression) variation is the number of independent variables, k.
– The degrees of freedom for the error variation is n - k - 1.
– Each mean square is obtained by dividing the sum of squares (SS) by its matching df.
The ANOVA table can be used to evaluate the regression result.
– (E.g.) s(Y.123) = √MSE, or F = MSR/MSE (discussed later).
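The decomposition can be sketched directly from observed and fitted values. The numbers are made up; note that SST = SSR + SSE holds exactly for a least squares fit with an intercept, which the comment relies on.

```python
import numpy as np

y     = np.array([250.0, 360.0, 165.0, 43.0, 92.0, 200.0])   # observed (made up)
y_hat = np.array([258.0, 351.0, 140.0, 60.0, 101.0, 195.0])  # fitted   (made up)
n, k = len(y), 3

sst = np.sum((y - y.mean()) ** 2)       # total variation
sse = np.sum((y - y_hat) ** 2)          # unexplained (residual) variation
ssr = sst - sse                         # explained variation (exact for LS fits)

msr, mse = ssr / k, sse / (n - k - 1)   # mean squares: SS divided by df
print(ssr, sse, msr / mse)              # MSR/MSE is the global F (discussed later)
print(np.sqrt(mse))                     # s(Y.123) = sqrt(MSE)
```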

15 Minitab – the ANOVA Table

16 (3) Coefficient of Multiple Determination (R²). Characteristics of the coefficient of multiple determination: 1. It is symbolized by a capital R squared, that is, written as R², because it behaves like the square of a correlation coefficient. It is computed from the ANOVA table as R² = SSR/SST. 2. It ranges from 0 to 1. A value near 0 means little association between the set of independent variables and the dependent variable; a value near 1 means a strong association. It is easy to interpret, compare, and understand. 3. It cannot assume negative values, since it is a squared value.

17 Minitab – the ANOVA Table

18 (4) Adjusted Coefficient of Determination. The number of independent variables in a multiple regression equation affects the size of the coefficient of determination (R²):
– Each new independent variable makes SSE smaller and SSR larger, and thereby makes R² larger.
– R² can increase simply because the number of independent variables increases, not because the added variable is a good predictor of Y.
– If the number of estimated coefficients (the k slopes plus the intercept) reaches the sample size n, the coefficient of determination is 1.0.
To balance the effect the number of independent variables has on the value of R², we use an adjusted coefficient of determination: R²(adj) = 1 - [SSE/(n - k - 1)] / [SST/(n - 1)].
– Statistical software packages report this R²(adj).
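A quick sketch of both statistics from the sums of squares; the SSE and SST values below are illustrative assumptions, not taken from the slides.

```python
# R^2 = 1 - SSE/SST;  R^2_adj = 1 - (SSE/(n-k-1)) / (SST/(n-1)).
sse, sst = 41_695.0, 212_916.0          # illustrative sums of squares
n, k = 20, 3

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
print(round(r2, 3), round(r2_adj, 3))   # the adjusted value is smaller
```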


20 3. Inferences in Multiple Linear Regression. Now we treat multiple regression as inferential statistics (cf. descriptive statistics).
– Data: a random sample taken from a population.
– Model the (unknown, stochastic) linear relationship in the population by: E(Y) = α + β1X1 + β2X2 + ... + βkXk.
– Population parameters, denoted by the Greek letters α and βi, are estimated by sample statistics a and bi, computed by the least squares method (point estimates).
– Under a certain set of assumptions, the point estimates follow the normal distribution (or t distribution), with the corresponding population parameter as the mean.
– Inferences about population parameters become possible, based on the properties of the sampling distribution.

21 Assumptions for standard multiple regression (discussed later).
1. There is a linear relationship. That is, there is a linear relationship between the dependent variable and the set of independent variables.
2. The variation in the errors is the same for both large and small values of Y. That is, the spread of the errors does not depend on whether the estimated Y is large or small.
3. The errors follow the normal probability distribution, with mean 0.
4. The independent variables should not be correlated. That is, we would like to select a set of independent variables that are not themselves correlated.
5. The errors are independent. This means that successive observations of the dependent variable are not correlated. This assumption is often violated when time is involved with the sampled observations.

22 (1) Global Test: Testing the Multiple Regression Model. We test the ability of the independent variables X1, X2, ..., Xk to explain the variation in the dependent variable Y. The 'global test' is used to investigate whether any of the independent variables have significant coefficients. The hypotheses are: H0: β1 = β2 = ... = βk = 0; H1: not all the βi are 0.

23 Global Test continued. The test statistic (from the ANOVA table) follows the F distribution
– with k (number of independent variables) and n - k - 1 degrees of freedom, where n is the sample size.
– The heating cost example follows the F distribution with 3 and 16 degrees of freedom (with n = 20).
Decision Rule: Reject H0 if F > F(α, k, n-k-1).
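The critical value can be read from an F table, as the slides do, or computed; a minimal sketch with scipy (an assumption, not the chapter's tool):

```python
# Critical F at alpha = 0.05 with k = 3 and n - k - 1 = 16 df,
# matching the heating cost example (n = 20).
from scipy.stats import f

alpha, n, k = 0.05, 20, 3
f_crit = f.ppf(1 - alpha, k, n - k - 1)
print(round(f_crit, 2))   # about 3.24; reject H0 if the computed F exceeds this
```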

24 Finding the Critical F

25 Finding the Computed F

26 Interpretation. The computed value of F is 21.90, which is in the rejection region. The null hypothesis that all the regression coefficients are zero is rejected. Interpretation: some of the independent variables (temperature, insulation, furnace age) do have the ability to explain the variation in the dependent variable (heating cost). The logical question: which ones?

27 (2) Evaluating Individual Regression Coefficients (Test whether βi = 0). This test is used to determine which independent variables have nonzero regression coefficients. Variables with zero regression coefficients are usually dropped from the analysis. The test statistic follows the t distribution with n - (k + 1) degrees of freedom. The hypothesis test is: H0: βi = 0; H1: βi ≠ 0. Reject H0 if t > t(α/2, n-k-1) or t < -t(α/2, n-k-1).
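The two-tailed critical value can be computed the same way; again scipy is a stand-in for the t table the slides use.

```python
# Critical t at alpha = 0.05 (two-tailed) with n - (k + 1) = 16 df.
from scipy.stats import t

alpha, n, k = 0.05, 20, 3
t_crit = t.ppf(1 - alpha / 2, n - k - 1)
print(round(t_crit, 3))   # about 2.120; reject H0 if |t| exceeds this
```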

28 Critical t-stat for the Slopes: ±2.120

29 Computed t-stat for the Slopes

30 Conclusion on Significance of Slopes

31 New Regression without Variable "Age" (deleting the insignificant independent variable)

32 New Regression Model without Variable "Age" – Minitab

33 Testing the New Model for Significance

34 Critical t-stat for the New Slopes: ±2.110

35 Conclusion on Significance of New Slopes

36 Procedure for adjusting the regression equation. Conduct the global test:
– Check whether the whole regression equation has some explanatory power.
Conduct tests for the individual coefficients:
– Check the significance of the individual explanatory variables.
Rerun the regression after deleting one insignificant independent variable:
– When two (or more) explanatory variables are insignificant, delete the one with the lowest absolute t value (largest p-value).
– Start a new round of tests. (A sketch of this loop follows.)
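A sketch of this backward-elimination loop, assuming a pandas DataFrame and statsmodels; the slides carry out these steps by hand in Minitab, so the function and its name are illustrative.

```python
# Backward elimination: refit after dropping the single least
# significant slope (largest p-value above alpha), one per round.
import pandas as pd
import statsmodels.api as sm

def adjust_regression(df: pd.DataFrame, response: str, alpha: float = 0.05):
    predictors = [c for c in df.columns if c != response]
    while predictors:
        X = sm.add_constant(df[predictors])
        fit = sm.OLS(df[response], X).fit()
        pvals = fit.pvalues.drop("const")   # p-values of the slopes only
        worst = pvals.idxmax()              # lowest |t| = largest p-value
        if pvals[worst] <= alpha:
            return fit                      # all remaining slopes significant
        predictors.remove(worst)            # delete one variable, then refit
    return None                             # no significant predictors left
```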

37 4. Evaluating the Assumptions of Multiple Regression. The validity of the previous tests relies on several assumptions: 1. There is a linear relationship. 2. The variation in the errors is the same for both large and small values of Y. 3. The errors follow the normal probability distribution, with mean 0. 4. The independent variables should not be correlated. 5. The errors are independent. Most of these assumptions relate to the error terms.

38 Analysis of Residuals. A residual is the difference between the actual value of Y and the predicted value of Y. Residuals are used to check the assumptions about the error term. Residual plot: a plot of the residuals against their corresponding Y' values, used to show whether there are any trends or patterns in the residuals.

39 (1) Linear Relationship: Scatter Diagram

40 Linear Relationship: Residual Plot

41 (2) Same variation in the errors for large and small values of Y. Homoscedasticity: the variation around the regression equation is the same for all values of the independent variables. Example of a violation of this assumption:
– Salary is regressed on the age of the worker.
The residual plots (in the previous slide) are used as a preliminary check on this assumption.

42 (3) Distribution of Residuals: normal distribution? Histograms (and stem-and-leaf charts) are useful for checking this assumption. Both MINITAB and Excel offer a graph that helps to evaluate the assumption of normally distributed residuals. It is called a normal probability plot and is shown to the right of the histogram.

43 (4) Multicollinearity. Multicollinearity exists when the independent variables (X's) are correlated. Correlated independent variables make it difficult to make inferences about the individual regression coefficients and their individual effects on the dependent variable Y.
– It may cause unexpected signs or leave an important independent variable with an insignificant coefficient.
– We need to select the independent variables carefully.
– In reality, it is very difficult to eliminate the multicollinearity problem fully.

44 How to check multicollinearity: Variance Inflation Factor. A general rule is that if the correlation between two independent variables is between -0.7 and 0.7, there likely is not a problem in using both. A more precise test is the variance inflation factor: VIF(j) = 1 / (1 - R²(j)). The term R²(j) is the coefficient of determination when the selected independent variable is regressed on the remaining independent variables. A VIF greater than 10 is considered unsatisfactory, indicating that the independent variable should be removed from the analysis.
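A sketch of the VIF computation using statsmodels' helper; the data rows are made-up stand-ins for Table 14-2.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up rows standing in for the three independent variables.
X = pd.DataFrame({
    "temperature": [35, 29, 36, 60, 65, 30, 10, 7, 21, 55],
    "insulation":  [3, 4, 7, 6, 5, 5, 6, 10, 9, 2],
    "age":         [6, 10, 3, 9, 6, 5, 7, 10, 11, 5],
})
Xc = sm.add_constant(X)               # VIFs are computed with an intercept present
for j, name in enumerate(Xc.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(Xc.values, j), 2))
```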

45 Multicollinearity – Example. Refer to the data in the table, which relates the heating cost to the independent variables outside temperature, amount of insulation, and age of furnace. Develop a correlation matrix for all the independent variables. Does it appear there is a problem of multicollinearity?

46 Correlation Matrix. A correlation matrix is used to show all possible simple correlation coefficients among the variables.
– The matrix is useful for locating correlated independent variables.
– It also shows how strongly each independent variable is correlated with the dependent variable.
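pandas can produce the same kind of matrix the Minitab slide shows; the data are made up, and the rule of thumb is to look for independent variables with |r| beyond 0.7.

```python
import pandas as pd

df = pd.DataFrame({      # made-up values for illustration
    "cost":        [250, 360, 165, 43, 92, 200, 355, 290, 230, 120],
    "temperature": [35, 29, 36, 60, 65, 30, 10, 7, 21, 55],
    "insulation":  [3, 4, 7, 6, 5, 5, 6, 10, 9, 2],
    "age":         [6, 10, 3, 9, 6, 5, 7, 10, 11, 5],
})
print(df.corr().round(2))   # all pairwise simple correlation coefficients
```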

47 Correlation Matrix - Minitab

48 VIF – Minitab Example. The VIF value of 1.32 is less than the upper limit of 10. This indicates that the independent variable temperature is not strongly correlated with the other independent variables.

49 (5) Independence Assumption. The fifth assumption of regression analysis is that successive residuals should be independent:
– There is no pattern to the residuals.
When successive residuals are correlated, we refer to this condition as autocorrelation.
– Autocorrelation frequently occurs when the data are collected over a period of time (time-series data).
– A test for autocorrelation, the Durbin-Watson test, is introduced in Chapter 16.

50 Residual Plot versus Fitted Values. The graph below shows the residuals plotted on the vertical axis and the fitted values on the horizontal axis. Note the run of residuals above the mean of the residuals, followed by a run below the mean. A scatter plot such as this would indicate possible autocorrelation.

51 Qualitative Independent Variables. Frequently we need to use nominal-scale variables in our analysis; these are called qualitative variables.
– Examples: gender, whether the home has a swimming pool, or whether the sports team was the home or the visiting team.
To use a qualitative variable in regression analysis, we use a scheme of dummy variables:
– The variable takes one of two possible values, either 0 or 1. (A minimal sketch follows.)
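A minimal sketch of the 0/1 coding; the column names are hypothetical.

```python
import pandas as pd

homes = pd.DataFrame({"garage": ["yes", "no", "no", "yes"]})     # made-up rows
homes["garage_dummy"] = (homes["garage"] == "yes").astype(int)   # 1/0 coding
print(homes)
```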

52 Qualitative Variable - Example. Suppose in the Salsberry Realty (heating cost) example that the independent variable "garage" is added. For homes without an attached garage, 0 is used; for homes with an attached garage, 1 is used. We add the "garage" variable to the data from Table 14-2.

53 Qualitative Variable - Minitab

54 Using the Model for Estimation. What is the effect of the garage variable? Suppose we have two houses exactly alike next to each other in Boston; one has an attached garage and the other does not. Both homes have 3 inches of insulation, and the mean January temperature in Boston is 20 degrees. For the house without an attached garage, a 0 is substituted for the garage variable in the regression equation; the estimated heating cost is $280.90. For the house with an attached garage, a 1 is substituted for the garage variable; the estimated heating cost is $358.30.

55 Testing the Model for Significance. We have shown the difference between the two types of homes to be $77.40, but is the difference significant? We conduct the following test of hypothesis for the dummy variable, as before: H0: βi = 0; H1: βi ≠ 0. Reject H0 if t > t(α/2, n-k-1) or t < -t(α/2, n-k-1).

56 Evaluating Individual Regression Coefficients (βi = 0). This test determines whether any independent variables have nonzero regression coefficients.
– In this case, consider the coefficient for the garage variable.
– The test statistic follows the t distribution with n - (k + 1) = n - k - 1 degrees of freedom.
The hypothesis test is: H0: βi = 0; H1: βi ≠ 0. Reject H0 if t > t(α/2, n-k-1) or t < -t(α/2, n-k-1).

57 Conclusion: The regression coefficient is not zero. The independent variable garage should be included in the analysis.

58 Stepwise Regression. The advantages of the stepwise method are: 1. Only independent variables with significant regression coefficients are entered into the equation. 2. The steps involved in building the regression equation are clear. 3. It is efficient in finding the regression equation with only significant regression coefficients. 4. The changes in the multiple standard error of estimate and the coefficient of determination are shown.

59 Stepwise Regression – Minitab Example. The stepwise MINITAB output for the heating cost problem follows. Temperature is selected first; this variable explains more of the variation in heating cost than any of the other three proposed independent variables. Garage is selected next, followed by Insulation.

60 Regression Models with Interaction. In Chapter 12 (ANOVA), interaction among independent variables was discussed. Suppose we are studying weight loss and assume that diet and exercise are related. The dependent variable is the amount of change in weight, and the independent variables are diet (yes or no) and exercise (none, moderate, significant). We are interested in whether there is interaction among the independent variables: if subjects maintain their diet and exercise significantly, is the total weight loss more than the sum of the loss due to the diet effect and the loss due to the exercise effect? In regression analysis, interaction can be examined as a separate independent variable. An interaction prediction variable can be developed by multiplying the data values of one independent variable by the values of another, creating a new independent variable. A two-variable model that includes an interaction term is: Y' = a + b1X1 + b2X2 + b3(X1X2).

61 Regression Models with Interaction - Example. Refer to the heating cost example. Is there an interaction between the outside temperature and the amount of insulation? If both variables are increased, is the effect on heating cost greater than the sum of the savings from a warmer temperature and the savings from increased insulation separately?

62 Creating the Interaction Variable. Using the information from the table in the previous slide, an interaction variable is created by multiplying the temperature by the insulation. For the first sampled home, the temperature is 35 degrees and the insulation is 3 inches, so the value of the interaction variable is 35 × 3 = 105. The values of the other interaction products are found in a similar fashion.
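A sketch of that multiplication; only the first home's 35 and 3 are from the slide, the other rows are made up.

```python
import pandas as pd

df = pd.DataFrame({"temperature": [35, 29, 36], "insulation": [3, 4, 7]})
df["temp_x_ins"] = df["temperature"] * df["insulation"]   # 35 * 3 = 105, etc.
print(df)
```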

63 Regression Models with Interaction - Example

64 Regression Models with Interaction - Example. The fitted regression equation, shown in the output (not transcribed here), includes the interaction term. Is the interaction variable significant at the 0.05 significance level?

65 There are other situations that can occur when studying interaction among independent variables. 1. It is possible to have a three-way interaction among the independent variables; in the heating example, we might have considered the three-way interaction between temperature, insulation, and age of the furnace. 2. It is possible to have an interaction where one of the independent variables is a qualitative variable; in our heating cost example, we could have studied the interaction between temperature and garage.

66 End of Chapter 14

