Slide 1
© 2014 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Chapter 16
Regression Analysis: Model Building
• Multiple Regression Approach to Experimental Design
• General Linear Model
• Determining When to Add or Delete Variables
• Variable Selection Procedures
• Autocorrelation and the Durbin-Watson Test

Slide 2: General Linear Model
Models in which the parameters ($\beta_0, \beta_1, \ldots, \beta_p$) all have exponents of one are called linear models.
• A general linear model involving p independent variables is
  $y = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \cdots + \beta_p z_p + \epsilon$
• Each of the independent variables $z_j$ is a function of $x_1, x_2, \ldots, x_k$ (the variables for which data have been collected).

Slide 3: General Linear Model
• The simplest case is when we have collected data for just one variable $x_1$ and want to estimate y by using a straight-line relationship. In this case $z_1 = x_1$.
  $y = \beta_0 + \beta_1 x_1 + \epsilon$
• This model is called a simple first-order model with one predictor variable.

Slide 4: Modeling Curvilinear Relationships
• To account for a curvilinear relationship, we might set $z_1 = x_1$ and $z_2 = x_1^2$.
  $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon$
• This model is called a second-order model with one predictor variable.
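
As a rough illustration of the second-order model with one predictor, the sketch below builds the columns z1 = x1 and z2 = x1 squared and fits them by least squares. The data values and the helper name `fit_least_squares` are invented for this example; the helper solves the normal equations directly, which is adequate for a tiny demonstration but is not how a statistical package would necessarily do it.

```python
def fit_least_squares(rows, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical data generated from y = 2 + 3*x1 + 0.5*x1**2 (no error term)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2 + 3 * x + 0.5 * x ** 2 for x in x1]

# Each row holds 1 (intercept), z1 = x1, and z2 = x1 squared
design = [[1.0, x, x ** 2] for x in x1]
b0, b1, b2 = fit_least_squares(design, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

The model stays linear in the parameters even though it is curved in x1, which is why an ordinary least-squares routine can fit it.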

Slide 5: Interaction
• If the original data set consists of observations for y and two independent variables $x_1$ and $x_2$, we might develop a second-order model with two predictor variables.
  $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \epsilon$
• In this model, the variable $z_5 = x_1 x_2$ is added to account for the potential effects of the two variables acting together.
• This type of effect is called interaction.
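
A small sketch of what interaction means in practice. It assumes the five terms are ordered z1 = x1, z2 = x2, z3 = x1 squared, z4 = x2 squared, z5 = x1 x2 (only z5 is stated explicitly on the slide), and the coefficient values are hypothetical, chosen so that only the interaction term changes the picture.

```python
def second_order_terms(x1, x2):
    """Return z1..z5 for the second-order model with two predictors."""
    return [x1, x2, x1 ** 2, x2 ** 2, x1 * x2]

def predict(betas, x1, x2):
    """Evaluate b0 + b1*z1 + ... + b5*z5 for the given coefficients."""
    z = second_order_terms(x1, x2)
    return betas[0] + sum(b * zj for b, zj in zip(betas[1:], z))

# Hypothetical coefficients b0..b5 (squared terms set to zero for clarity)
betas = [10.0, 2.0, 1.0, 0.0, 0.0, 0.5]

# With interaction, the effect of raising x1 by one unit depends on x2
effect_at_x2_0 = predict(betas, 1, 0) - predict(betas, 0, 0)
effect_at_x2_4 = predict(betas, 1, 4) - predict(betas, 0, 4)
print(effect_at_x2_0, effect_at_x2_4)  # prints 2.0 4.0
```

Without the z5 term, both differences would equal the x1 coefficient; the interaction coefficient is what makes the x1 effect grow with x2.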

Slide 6: Transformations Involving the Dependent Variable
• Often the problem of nonconstant variance can be corrected by transforming the dependent variable to a different scale.
• Most statistical packages provide the ability to apply logarithmic transformations using either base 10 (common log) or base e = 2.71828... (natural log).
• Another approach, called a reciprocal transformation, is to use 1/y as the dependent variable instead of y.
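
The transformations themselves are one-liners. The sketch below applies the common log, natural log, and reciprocal to a few invented y values, and shows that a value predicted on the log scale has to be converted back before it is read as y. Whether a transformation actually stabilizes the variance still has to be checked on the residuals; nothing here tests that.

```python
import math

# Hypothetical dependent-variable values (invented for illustration)
y = [12.0, 45.0, 150.0, 520.0]

log10_y = [math.log10(v) for v in y]   # common log
ln_y = [math.log(v) for v in y]        # natural log
recip_y = [1.0 / v for v in y]         # reciprocal transformation

# The two log scales differ only by the constant factor ln(10)
ratios = [a / b for a, b in zip(ln_y, log10_y)]

# A prediction made on a transformed scale must be converted back to y
pred_ln = ln_y[0]                      # stand-in for a fitted value of ln(y)
pred_y = math.exp(pred_ln)
```

Because the two log bases differ only by a constant factor, the choice between them changes the coefficients' scale but not the fit.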

Slide 7: Nonlinear Models That Are Intrinsically Linear
Models in which the parameters ($\beta_0, \beta_1, \ldots, \beta_p$) have exponents other than one are called nonlinear models.
• In some cases we can perform a transformation of variables that will enable us to use regression analysis with the general linear model.
• The exponential model involves the regression equation:
  $E(y) = \beta_0 \beta_1^{x}$
• We can transform this nonlinear model to a linear model by taking the logarithm of both sides.
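
The log step can be written out as follows, assuming the exponential model has the form $E(y) = \beta_0 \beta_1^{x}$. The primed symbols are notation introduced here for the transformed quantities.

```latex
\begin{align*}
E(y) &= \beta_0 \beta_1^{x} \\
\log E(y) &= \log \beta_0 + x \log \beta_1 \\
y' &= \beta_0' + \beta_1' x,
\qquad y' = \log E(y),\; \beta_0' = \log \beta_0,\; \beta_1' = \log \beta_1
\end{align*}
```

The last line is a simple first-order model in x, so it can be fitted with ordinary regression. Estimates on the original scale are recovered by taking antilogs, for example $\beta_0 = 10^{\beta_0'}$ when common logs are used.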

Slide 8: Variable Selection Procedures
• Stepwise Regression
• Forward Selection
• Backward Elimination
  These three are iterative; one independent variable at a time is added or deleted based on the F statistic.
• Best-Subsets Regression
  Different subsets of the independent variables are evaluated.

The first 3 procedures are heuristics and therefore offer no guarantee that the best model will be found.

Slide 9: Variable Selection: Stepwise Regression
• At each iteration, the first consideration is to see whether the least significant variable currently in the model can be removed because its p-value is greater than the user-specified or default alpha to remove.
• If no variable can be removed, the procedure checks to see whether the most significant variable not in the model can be added because its p-value is less than the user-specified or default alpha to enter.
• If no variable can be removed and no variable can be added, the procedure stops.

Slide 10: Variable Selection: Stepwise Regression
[Flowchart]
1. Start with no indep. variables in model.
2. Compute F stat. and p-value for each indep. variable in model.
3. Any p-value > alpha to remove?
   Yes: indep. variable with largest p-value is removed from model; next iteration (step 2).
   No: go to step 4.
4. Compute F stat. and p-value for each indep. variable not in model.
5. Any p-value < alpha to enter?
   Yes: indep. variable with smallest p-value is entered into model; next iteration (step 2).
   No: Stop.
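
The stepwise loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of any particular package: it compares each partial F statistic with fixed thresholds (`f_enter`, `f_remove`, both set to a hypothetical 4.0) instead of comparing p-values with alpha, because that needs no F-distribution routine. The helper names and the eight-row data set are invented, and the code assumes more observations than coefficients and a nonzero SSE.

```python
def sse(data, names, y):
    """Error sum of squares for least squares on an intercept plus the named columns."""
    rows = [[1.0] + [data[v][i] for v in names] for i in range(len(y))]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(rows, y))

def partial_f(data, y, smaller, larger):
    """F statistic for the one variable that `larger` has and `smaller` lacks."""
    sse_large = sse(data, larger, y)
    df_error = len(y) - len(larger) - 1
    return (sse(data, smaller, y) - sse_large) / (sse_large / df_error)

def stepwise(data, y, f_enter=4.0, f_remove=4.0):
    model = []
    while True:
        # First try to remove the least significant variable in the model
        if model:
            f_vals = {v: partial_f(data, y, [m for m in model if m != v], model)
                      for v in model}
            worst = min(f_vals, key=f_vals.get)
            if f_vals[worst] < f_remove:
                model.remove(worst)
                continue
        # Otherwise try to add the most significant variable not in the model
        outside = [v for v in data if v not in model]
        f_vals = {v: partial_f(data, y, model, model + [v]) for v in outside}
        if f_vals:
            best = max(f_vals, key=f_vals.get)
            if f_vals[best] > f_enter:
                model.append(best)
                continue
        return model  # nothing to remove and nothing to add: stop

# Hypothetical data: y depends strongly on x1 and x2, barely on x3
data = {
    "x1": [1, 1, 1, 1, -1, -1, -1, -1],
    "x2": [1, 1, -1, -1, 1, 1, -1, -1],
    "x3": [1, -1, 1, -1, 1, -1, 1, -1],
}
y = [19.2, 17.8, 11.2, 11.8, 8.2, 6.8, 2.2, 2.8]
selected = stepwise(data, y)
print(selected)  # prints ['x1', 'x2']
```

A large partial F corresponds to a small p-value, so "smallest F" plays the role of "largest p-value" here. A fixed F threshold is not exactly equivalent to a fixed alpha, since the critical F changes with the error degrees of freedom.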

Slide 11: Variable Selection: Forward Selection
• This forward-selection procedure starts with no independent variables.
• It adds variables one at a time as long as a significant reduction in the error sum of squares (SSE) can be achieved.
• This procedure is similar to stepwise regression, but does not permit a variable to be deleted.

Slide 12: Variable Selection: Forward Selection
[Flowchart]
1. Start with no indep. variables in model.
2. Compute F stat. and p-value for each indep. variable not in model.
3. Any p-value < alpha to enter?
   Yes: indep. variable with smallest p-value is entered into model; return to step 2.
   No: Stop.
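
A minimal sketch of forward selection, written to show the "add only, never delete" loop. As an assumption made for simplicity, a variable enters when its partial F statistic exceeds a fixed, hypothetical threshold `f_enter` of 4.0, standing in for the p-value versus alpha-to-enter comparison. The helper `sse` and the data are invented for illustration.

```python
def sse(data, names, y):
    """Error sum of squares for least squares on an intercept plus the named columns."""
    rows = [[1.0] + [data[v][i] for v in names] for i in range(len(y))]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(rows, y))

def forward_selection(data, y, f_enter=4.0):
    model = []
    while len(model) < len(data):
        current = sse(data, model, y)
        best, best_f = None, f_enter
        for v in data:
            if v in model:
                continue
            new = sse(data, model + [v], y)
            df_error = len(y) - len(model) - 2  # n minus coefficients in the larger model
            f = (current - new) / (new / df_error)  # reduction in SSE relative to MSE
            if f > best_f:
                best, best_f = v, f
        if best is None:
            break  # no candidate gives a significant reduction in SSE
        model.append(best)
    return model

# Hypothetical data: y depends strongly on x1 and x2, barely on x3
data = {
    "x1": [1, 1, 1, 1, -1, -1, -1, -1],
    "x2": [1, 1, -1, -1, 1, 1, -1, -1],
    "x3": [1, -1, 1, -1, 1, -1, 1, -1],
}
y = [19.2, 17.8, 11.2, 11.8, 8.2, 6.8, 2.2, 2.8]
entered = forward_selection(data, y)
print(entered)  # prints ['x1', 'x2']
```

Note that x2 alone would not clear the threshold at the first step; it only becomes significant once x1 has absorbed most of the variation, which is the kind of order dependence that makes these procedures heuristics.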

Slide 13: Variable Selection: Backward Elimination
• This procedure begins with a model that includes all the independent variables the modeler wants considered.
• It then attempts to delete one variable at a time by determining whether the least significant variable currently in the model can be removed because its p-value is greater than the user-specified or default alpha to remove.
• Once a variable has been removed from the model it cannot reenter at a subsequent step.

Slide 14: Variable Selection: Backward Elimination
[Flowchart]
1. Start with all indep. variables in model.
2. Compute F stat. and p-value for each indep. variable in model.
3. Any p-value > alpha to remove?
   Yes: indep. variable with largest p-value is removed from model; return to step 2.
   No: Stop.
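
Backward elimination can be sketched the same way. The assumptions are the ones that keep the code dependency-free: a variable is dropped when its partial F statistic falls below a fixed, hypothetical threshold `f_remove` of 4.0, used in place of the p-value versus alpha-to-remove comparison, and the helper `sse` and the data are invented for illustration.

```python
def sse(data, names, y):
    """Error sum of squares for least squares on an intercept plus the named columns."""
    rows = [[1.0] + [data[v][i] for v in names] for i in range(len(y))]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(rows, y))

def backward_elimination(data, y, f_remove=4.0):
    model = list(data)  # start with all independent variables
    while model:
        full = sse(data, model, y)
        mse = full / (len(y) - len(model) - 1)
        # Partial F for each variable: increase in SSE if it alone is dropped
        f_vals = {v: (sse(data, [m for m in model if m != v], y) - full) / mse
                  for v in model}
        worst = min(f_vals, key=f_vals.get)  # least significant variable
        if f_vals[worst] >= f_remove:
            break  # every remaining variable is significant: stop
        model.remove(worst)  # a removed variable never reenters
    return model

# Hypothetical data: y depends strongly on x1 and x2, barely on x3
data = {
    "x1": [1, 1, 1, 1, -1, -1, -1, -1],
    "x2": [1, 1, -1, -1, 1, 1, -1, -1],
    "x3": [1, -1, 1, -1, 1, -1, 1, -1],
}
y = [19.2, 17.8, 11.2, 11.8, 8.2, 6.8, 2.2, 2.8]
kept = backward_elimination(data, y)
print(kept)  # prints ['x1', 'x2']
```

The model is refitted after every removal, so the F statistics of the remaining variables are recomputed at each pass rather than reused.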

Slide 15: Variable Selection: Backward Elimination
• Example: Clarksville Homes
Tony Zamora, a real estate investor, has just moved to Clarksville and wants to learn about the city’s residential real estate market. Tony has randomly selected 25 house-for-sale listings from the Sunday newspaper and collected the data partially listed on the next slide.
Develop, using the backward elimination procedure, a multiple regression model to predict the selling price of a house in Clarksville.

Slide 16: Variable Selection: Backward Elimination
• Partial Data
[Table. Columns: Segment of City; Selling Price ($000); House Size (00 sq. ft.); Number of Bedrms.; Number of Bathrms.; Garage Size (cars). Segment of City for the rows shown: Northwest, South, Northeast, Northwest, West, South, West, West. Numeric values not preserved.]

Slide 17: Variable Selection: Backward Elimination
• Regression Output
[Table. Columns: Predictor; Coef; SE Coef; T; p. Rows: Intercept; House Size; Bedrooms; Bathrooms; Cars. Values not preserved.]
Callouts: Greatest p-value > .05; Variable to be removed.

Slide 18: Variable Selection: Backward Elimination
• Cars (garage size) is the independent variable with the highest p-value (.697) > .05.
• The Cars variable is removed from the model.
• Multiple regression is performed again on the remaining independent variables.

Slide 19: Variable Selection: Backward Elimination
• Regression Output
[Table. Columns: Predictor; Coef; SE Coef; T; p. Rows: Intercept; House Size; Bedrooms; Bathrooms. Values not preserved.]
Callouts: Greatest p-value > .05; Variable to be removed.

Slide 20: Variable Selection: Backward Elimination
• Bedrooms is the independent variable with the highest p-value (.281) > .05.
• The Bedrooms variable is removed from the model.
• Multiple regression is performed again on the remaining independent variables.

Slide 21: Variable Selection: Backward Elimination
• Regression Output
[Table. Columns: Predictor; Coef; SE Coef; T; p. Rows: Intercept; House Size; Bathrooms. Values not preserved.]
Callouts: Greatest p-value > .05; Variable to be removed.

Slide 22: Variable Selection: Backward Elimination
• Bathrooms is the independent variable with the highest p-value (.110) > .05.
• The Bathrooms variable is removed from the model.
• Multiple regression is performed again on the remaining independent variable.

Slide 23: Variable Selection: Backward Elimination
• Regression Output
[Table. Columns: Predictor; Coef; SE Coef; T; p. Rows: Intercept; House Size. Values not preserved, apart from the exponent E-09 of one entry.]
Callout: Greatest p-value is < .05.

Slide 24: Variable Selection: Backward Elimination
• House size is the only independent variable remaining in the model.
• The estimated regression equation is:

Slide 25: Variable Selection: Best-Subsets Regression
• The three preceding procedures are one-variable-at-a-time methods offering no guarantee that the best model for a given number of variables will be found.
• Some software packages include best-subsets regression, which enables the user to find, given a specified number of independent variables, the best regression model.
• Minitab output identifies the two best one-variable estimated regression equations, the two best two-variable equations, and so on.
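
A brute-force sketch of the best-subsets idea: fit every subset of each size and keep the top two. Ranking by R-squared is an assumption made here for illustration, since the slide does not say which criterion defines "best". The helper names and the data are invented, and the exhaustive search is only practical for a small number of candidate variables.

```python
from itertools import combinations

def sse(data, names, y):
    """Error sum of squares for least squares on an intercept plus the named columns."""
    rows = [[1.0] + [data[v][i] for v in names] for i in range(len(y))]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(rows, y))

def best_subsets(data, y, keep=2):
    """For each subset size, return the `keep` subsets with the highest R-squared."""
    sst = sse(data, [], y)  # intercept-only SSE equals the total sum of squares
    report = {}
    for size in range(1, len(data) + 1):
        scored = [(1 - sse(data, list(s), y) / sst, s)
                  for s in combinations(data, size)]
        scored.sort(reverse=True)  # highest R-squared first
        report[size] = scored[:keep]
    return report

# Hypothetical data: y depends strongly on x1 and x2, barely on x3
data = {
    "x1": [1, 1, 1, 1, -1, -1, -1, -1],
    "x2": [1, 1, -1, -1, 1, 1, -1, -1],
    "x3": [1, -1, 1, -1, 1, -1, 1, -1],
}
y = [19.2, 17.8, 11.2, 11.8, 8.2, 6.8, 2.2, 2.8]
report = best_subsets(data, y)
for size, rows in report.items():
    for r2, subset in rows:
        print(size, subset, round(r2, 3))
```

Because every subset of a given size is evaluated, the top entry for that size is the best by the chosen criterion, which the one-variable-at-a-time procedures cannot guarantee.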

Slide 26: Variable Selection: Best-Subsets Regression
• Example: PGA Tour Data
The Professional Golfers Association keeps a variety of statistics regarding performance measures. Data include the average driving distance, percentage of drives that land in the fairway, percentage of greens hit in regulation, average number of putts, percentage of sand saves, and average score.

Slide 27: Variable-Selection Procedures
• Variable Names and Definitions
Drive: average length of a drive in yards
Fair: percentage of drives that land in the fairway
Green: percentage of greens hit in regulation (a par-3 green is “hit in regulation” if the player’s first shot lands on the green)
Putt: average number of putts for greens that have been hit in regulation
Sand: percentage of sand saves (landing in a sand trap and still scoring par or better)
Score: average score for an 18-hole round

Slide 28: Variable-Selection Procedures
• Sample Data (Part 1)
[Table. Columns: Drive; Fair; Green; Putt; Sand; Score. Values not preserved.]

Slide 29: Variable-Selection Procedures
• Sample Data (Part 2)
[Table. Columns: Drive; Fair; Green; Putt; Sand; Score. Values not preserved.]

Slide 30: Variable-Selection Procedures
• Sample Data (Part 3)
[Table. Columns: Drive; Fair; Green; Putt; Sand; Score. Values not preserved.]

Slide 31: Variable-Selection Procedures
• Sample Correlation Coefficients
[Correlation matrix. Columns: Score; Drive; Fair; Green; Putt. Rows: Drive; Fair; Green; Putt; Sand. Values not preserved.]