
Slide 1: Slides prepared by John S. Loucks, St. Edward's University. © 2003 South-Western/Thomson Learning™

Slide 2: Chapter 16, Regression Analysis: Model Building
- General Linear Model
- Determining When to Add or Delete Variables
- Analysis of a Larger Problem
- Variable-Selection Procedures
- Residual Analysis
- Multiple Regression Approach to Analysis of Variance and Experimental Design

Slide 3: General Linear Model
Models in which the parameters (β0, β1, ..., βp) all have exponents of one are called linear models.
- First-Order Model with One Predictor Variable
- Second-Order Model with One Predictor Variable
- Second-Order Model with Two Predictor Variables with Interaction
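
The model equations on this slide are images in the original and do not survive the transcript. The forms below are the standard ones these titles refer to, written here as an assumption based on conventional textbook notation (x1 and x2 denote the predictor variables):

```latex
\[
\begin{aligned}
E(y) &= \beta_0 + \beta_1 x_1
  && \text{first-order, one predictor}\\
E(y) &= \beta_0 + \beta_1 x_1 + \beta_2 x_1^2
  && \text{second-order, one predictor}\\
E(y) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2
  && \text{second-order, two predictors, with interaction}
\end{aligned}
\]
```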

Slide 4: General Linear Model
Often the problem of nonconstant variance can be corrected by transforming the dependent variable to a different scale.
- Logarithmic Transformations: most statistical packages provide the ability to apply logarithmic transformations using either base 10 (common log) or base e = 2.71828... (natural log).
- Reciprocal Transformation: use 1/y as the dependent variable instead of y.
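
As a concrete illustration (not part of the slides), the sketch below applies both transformations before fitting an ordinary least squares model with statsmodels. The data are synthetic and exist only to make the example runnable.

```python
# Minimal sketch: log and reciprocal transformations of the dependent variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 40)
y = np.exp(0.3 * x + rng.normal(scale=0.1, size=x.size))  # spread of y grows with x

X = sm.add_constant(x)                    # adds the intercept column
log_fit = sm.OLS(np.log(y), X).fit()      # logarithmic transformation of y
recip_fit = sm.OLS(1.0 / y, X).fit()      # reciprocal transformation of y
print(log_fit.params, recip_fit.params)
```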

Slide 5: General Linear Model
Models in which the parameters (β0, β1, ..., βp) have exponents other than one are called nonlinear models. In some cases we can perform a transformation of variables that will enable us to use regression analysis with the general linear model.
- Exponential Model: the exponential model involves the regression equation shown below. We can transform this nonlinear model to a linear model by taking the logarithm of both sides.
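
The equation itself is an image on the original slide; the standard exponential-model form it refers to, and the result of taking logarithms of both sides, are (again assuming conventional textbook notation):

```latex
\[
E(y) = \beta_0 \beta_1^{x}
\qquad\Longrightarrow\qquad
\log E(y) = \log \beta_0 + x \log \beta_1
\]
```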

Slide 6: Variable Selection Procedures
- Stepwise Regression, Forward Selection, and Backward Elimination are iterative; one independent variable at a time is added or deleted based on the F statistic.
- Best-Subsets Regression evaluates different subsets of the independent variables.

Slide 7: Variable Selection Procedures
- F Test: used to test whether the addition of x2 to a model involving x1 (or the deletion of x2 from a model involving x1 and x2) is statistically significant.
- The p-value corresponding to the F statistic is the criterion used to determine whether a variable should be added or deleted.
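
The statistic itself is not printed on the slide. The partial F statistic conventionally used for this add/delete test compares a reduced model to a full model:

```latex
\[
F = \frac{\bigl(\mathrm{SSE}(\text{reduced}) - \mathrm{SSE}(\text{full})\bigr)/q}
         {\mathrm{SSE}(\text{full})/(n - p - 1)}
\]
```

Here q is the number of terms added or deleted, p is the number of independent variables in the full model, and n is the sample size; for the single-variable test described above, q = 1.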

Slide 8: Stepwise Regression (flowchart)
1. Start.
2. Compute the F statistic and p-value for each independent variable in the model.
3. Is any p-value greater than the alpha-to-remove level? If yes, remove the independent variable with the largest p-value and return to step 2.
4. If no, compute the F statistic and p-value for each independent variable not in the model.
5. Is any p-value less than the alpha-to-enter level? If yes, enter the independent variable with the smallest p-value and return to step 2.
6. If no, stop.
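
The sketch below is a minimal Python rendering of this flowchart using statsmodels, offered as an illustration rather than the textbook's own code. For a single coefficient the partial F test has the same p-value as the t test that OLS reports, so those p-values are used directly.

```python
# Stepwise regression: alternate removal and entry steps until neither applies.
import statsmodels.api as sm

def stepwise(df, response, candidates, alpha_enter=0.05, alpha_remove=0.05):
    selected = []
    while True:
        # Removal step: check every independent variable currently in the model.
        if selected:
            fit = sm.OLS(df[response], sm.add_constant(df[selected])).fit()
            pvals = fit.pvalues.drop("const")
            if pvals.max() > alpha_remove:
                selected.remove(pvals.idxmax())   # drop the largest p-value
                continue
        # Entry step: try each independent variable not yet in the model.
        entry_pvals = {}
        for var in candidates:
            if var in selected:
                continue
            trial = sm.OLS(df[response],
                           sm.add_constant(df[selected + [var]])).fit()
            entry_pvals[var] = trial.pvalues[var]
        if entry_pvals and min(entry_pvals.values()) < alpha_enter:
            selected.append(min(entry_pvals, key=entry_pvals.get))
            continue
        return selected
```

Stepwise implementations usually require alpha-to-enter to be no larger than alpha-to-remove; otherwise a variable that has just entered could be removed again and the procedure could cycle.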

Slide 9: Forward Selection
- This procedure is similar to stepwise regression, but does not permit a variable to be deleted.
- The forward-selection procedure starts with no independent variables.
- It adds variables one at a time as long as a significant reduction in the error sum of squares (SSE) can be achieved.

Slide 10: Forward Selection (flowchart)
1. Start with no independent variables in the model.
2. Compute the F statistic and p-value for each independent variable not in the model.
3. Is any p-value less than the alpha-to-enter level? If yes, enter the independent variable with the smallest p-value and return to step 2.
4. If no, stop.
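
A matching sketch of forward selection, offered with the same caveats as the stepwise example above: variables are only ever added, never removed.

```python
# Forward selection: keep entering the best remaining variable until none qualifies.
import statsmodels.api as sm

def forward_selection(df, response, candidates, alpha_enter=0.05):
    selected = []
    while True:
        entry_pvals = {}
        for var in candidates:
            if var in selected:
                continue
            fit = sm.OLS(df[response],
                         sm.add_constant(df[selected + [var]])).fit()
            entry_pvals[var] = fit.pvalues[var]
        if entry_pvals and min(entry_pvals.values()) < alpha_enter:
            selected.append(min(entry_pvals, key=entry_pvals.get))
        else:
            return selected
```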

Slide 11: Backward Elimination
- This procedure begins with a model that includes all the independent variables the modeler wants considered.
- It then attempts to delete one variable at a time by determining whether the least significant variable currently in the model can be removed because its p-value is greater than the user-specified or default value.
- Once a variable has been removed from the model, it cannot reenter at a subsequent step.

Slide 12: Backward Elimination (flowchart)
1. Start with all independent variables in the model.
2. Compute the F statistic and p-value for each independent variable in the model.
3. Is any p-value greater than the alpha-to-remove level? If yes, remove the independent variable with the largest p-value and return to step 2.
4. If no, stop.
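
A minimal Python sketch of this flowchart with statsmodels. The function and the column names in the commented example call are assumptions chosen to match the Clarksville Homes example that follows, not names taken from the workbook itself.

```python
# Backward elimination: repeatedly drop the least significant variable.
import statsmodels.api as sm

def backward_elimination(df, response, predictors, alpha_remove=0.05):
    remaining = list(predictors)
    while remaining:
        fit = sm.OLS(df[response], sm.add_constant(df[remaining])).fit()
        pvals = fit.pvalues.drop("const")     # same p-values as the single-variable F tests
        if pvals.max() > alpha_remove:
            remaining.remove(pvals.idxmax())  # remove the variable with the largest p-value
        else:
            break
    return remaining

# Example call, assuming a DataFrame `homes` with these (hypothetical) columns:
# backward_elimination(homes, "Price", ["HouseSize", "Bedrooms", "Bathrooms", "Cars"])
```

The Excel walk-through on the following slides carries out exactly these iterations by hand: Cars, then Bedrooms, then Bathrooms are dropped before the procedure stops.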

Slide 13: Example: Clarksville Homes
Tony Zamora, a real estate investor, has just moved to Clarksville and wants to learn about the city's residential real estate market. Tony has randomly selected 25 house-for-sale listings from the Sunday newspaper and collected the data listed on the next three slides. Develop, using the backward elimination procedure, a multiple regression model to predict the selling price of a house in Clarksville.

Slide 14: Using Excel to Perform the Backward Elimination Procedure
- Worksheet (showing partial data; rows 10-26 are not shown).

Slide 15: Using Excel to Perform the Backward Elimination Procedure
- Worksheet (showing partial data; rows 2-9 are hidden and rows 18-26 are not shown).

Slide 16: Using Excel to Perform the Backward Elimination Procedure
- Worksheet (showing partial data; rows 2-17 are hidden).

Slide 17: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 18: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 19: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 20: Using Excel to Perform the Backward Elimination Procedure
- Cars (garage size) is the independent variable with the highest p-value (.697), which is greater than .05.
- Cars is removed from the model.
- Multiple regression is performed again on the remaining independent variables.

Slide 21: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 22: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 23: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 24: Using Excel to Perform the Backward Elimination Procedure
- Bedrooms is the independent variable with the highest p-value (.281), which is greater than .05.
- Bedrooms is removed from the model.
- Multiple regression is performed again on the remaining independent variables.

Slide 25: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 26: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 27: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 28: Using Excel to Perform the Backward Elimination Procedure
- Bathrooms is the independent variable with the highest p-value (.110), which is greater than .05.
- Bathrooms is removed from the model.
- Regression is performed again on the remaining independent variable.

Slide 29: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 30: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 31: Using Excel to Perform the Backward Elimination Procedure
- Value Worksheet (partial).

Slide 32: Using Excel to Perform the Backward Elimination Procedure
- House size is the only independent variable remaining in the model.
- The estimated regression equation appears as an image on the original slide and is not reproduced in this transcript.
- The Adjusted R Square value is .760.

Slide 33: Variable-Selection Procedures
Best-Subsets Regression
- The three preceding procedures are one-variable-at-a-time methods that offer no guarantee the best model for a given number of variables will be found.
- Some statistical software packages include best-subsets regression, which enables the user to find, given a specified number of independent variables, the best regression model.
- Typical output identifies the two best one-variable estimated regression equations, the two best two-variable equations, and so on.
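
A minimal sketch (not from the slides) of the exhaustive search that best-subsets regression performs, keeping the best model of each size by adjusted R-square. Real packages also report criteria such as Mallows' Cp and s, as in the output on slide 40.

```python
# Best-subsets regression: evaluate every subset of the candidate predictors.
from itertools import combinations
import statsmodels.api as sm

def best_subsets(df, response, candidates):
    best = {}                                   # model size -> (adjusted R-sq, variables)
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            fit = sm.OLS(df[response], sm.add_constant(df[list(subset)])).fit()
            if k not in best or fit.rsquared_adj > best[k][0]:
                best[k] = (fit.rsquared_adj, subset)
    return best
```

The search is exhaustive (2^p - 1 subsets for p candidates), which is practical for a handful of predictors such as the five in the PGA Tour example below.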

Slide 34: Example: PGA Tour Data
The Professional Golfers Association keeps a variety of statistics regarding performance measures. Data include the average driving distance, percentage of drives that land in the fairway, percentage of greens hit in regulation, average number of putts, percentage of sand saves, and average score. The variable names and definitions are shown on the next slide.

Slide 35: Example: PGA Tour Data
Variable Names and Definitions
- Drive: average length of a drive in yards
- Fair: percentage of drives that land in the fairway
- Green: percentage of greens hit in regulation (a par-3 green is "hit in regulation" if the player's first shot lands on the green)
- Putt: average number of putts for greens that have been hit in regulation
- Sand: percentage of sand saves (landing in a sand trap and still scoring par or better)
- Score: average score for an 18-hole round

Slide 36: Example: PGA Tour Data
Sample Data

Drive   Fair   Green   Putt    Sand   Score
277.6   .681   .667    1.768   .550   69.10
259.6   .691   .665    1.810   .536   71.09
269.1   .657   .649    1.747   .472   70.12
267.0   .689   .673    1.763   .672   69.88
267.3   .581   .637    1.781   .521   70.71
255.6   .778   .674    1.791   .455   69.76
272.9   .615   .667    1.780   .476   70.19
265.4   .718   .699    1.790   .551   69.73

Slide 37: Example: PGA Tour Data
Sample Data (continued)

Drive   Fair   Green   Putt    Sand   Score
272.6   .660   .672    1.803   .431   69.97
263.9   .668   .669    1.774   .493   70.33
267.0   .686   .687    1.809   .492   70.32
266.0   .681   .670    1.765   .599   70.09
258.1   .695   .641    1.784   .500   70.46
255.6   .792   .672    1.752   .603   69.49
261.3   .740   .702    1.813   .529   69.88
262.2   .721   .662    1.754   .576   70.27

Slide 38: Example: PGA Tour Data
Sample Data (continued)

Drive   Fair   Green   Putt    Sand   Score
260.5   .703   .623    1.782   .567   70.72
271.3   .671   .666    1.783   .492   70.30
263.3   .714   .687    1.796   .468   69.91
276.6   .634   .643    1.776   .541   70.69
252.1   .726   .639    1.788   .493   70.59
263.0   .687   .675    1.786   .486   70.20
263.0   .639   .647    1.760   .374   70.81
253.5   .732   .693    1.797   .518   70.26
266.2   .681   .657    1.812   .472   70.96

Slide 39: Example: PGA Tour Data
Sample Correlation Coefficients

        Score   Drive   Fair    Green   Putt
Drive   -.154
Fair    -.427   -.679
Green   -.556   -.045    .421
Putt     .258   -.139    .101    .354
Sand    -.278   -.024    .265    .083   -.296

Slide 40: Example: PGA Tour Data
Best Subsets Regression of SCORE. (The original output also has indicator columns D F G P S marking which of Drive, Fair, Green, Putt, and Sand enter each model; those marks did not survive the transcript.)

Vars   R-sq   R-sq(adj)   C-p    s
 1     30.9     27.9      26.9   .39685
 1     18.2     14.6      35.7   .43183
 2     54.7     50.5      12.4   .32872
 2     54.6     50.5      12.5   .32891
 3     60.7     55.1      10.2   .31318
 3     59.1     53.3      11.4   .31957
 4     72.2     66.8       4.2   .26913
 4     60.9     53.1      12.1   .32011
 5     72.6     65.4       6.0   .27499

Slide 41: Example: PGA Tour Data
The regression equation:

Score = 74.678 - .0398(Drive) - 6.686(Fair) - 10.342(Green) + 9.858(Putt)

Predictor   Coef      Stdev     t-ratio   p
Constant    74.678    6.952     10.74     .000
Drive       -.0398    .01235    -3.22     .004
Fair        -6.686    1.939     -3.45     .003
Green       -10.342   3.561     -2.90     .009
Putt        9.858     3.180     3.10      .006

s = .2691   R-sq = 72.4%   R-sq(adj) = 66.8%
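
For readers working outside Minitab or Excel, the sketch below loads the 25 sample observations from slides 36-38 into pandas and fits the same four-variable model with statsmodels. It is an illustration only, and it assumes the sample values above were transcribed correctly.

```python
# Fit Score on Drive, Fair, Green, Putt for the PGA Tour sample data.
import pandas as pd
import statsmodels.api as sm

rows = [  # (Drive, Fair, Green, Putt, Sand, Score), copied from the sample-data slides
    (277.6, .681, .667, 1.768, .550, 69.10), (259.6, .691, .665, 1.810, .536, 71.09),
    (269.1, .657, .649, 1.747, .472, 70.12), (267.0, .689, .673, 1.763, .672, 69.88),
    (267.3, .581, .637, 1.781, .521, 70.71), (255.6, .778, .674, 1.791, .455, 69.76),
    (272.9, .615, .667, 1.780, .476, 70.19), (265.4, .718, .699, 1.790, .551, 69.73),
    (272.6, .660, .672, 1.803, .431, 69.97), (263.9, .668, .669, 1.774, .493, 70.33),
    (267.0, .686, .687, 1.809, .492, 70.32), (266.0, .681, .670, 1.765, .599, 70.09),
    (258.1, .695, .641, 1.784, .500, 70.46), (255.6, .792, .672, 1.752, .603, 69.49),
    (261.3, .740, .702, 1.813, .529, 69.88), (262.2, .721, .662, 1.754, .576, 70.27),
    (260.5, .703, .623, 1.782, .567, 70.72), (271.3, .671, .666, 1.783, .492, 70.30),
    (263.3, .714, .687, 1.796, .468, 69.91), (276.6, .634, .643, 1.776, .541, 70.69),
    (252.1, .726, .639, 1.788, .493, 70.59), (263.0, .687, .675, 1.786, .486, 70.20),
    (263.0, .639, .647, 1.760, .374, 70.81), (253.5, .732, .693, 1.797, .518, 70.26),
    (266.2, .681, .657, 1.812, .472, 70.96),
]
pga = pd.DataFrame(rows, columns=["Drive", "Fair", "Green", "Putt", "Sand", "Score"])

X = sm.add_constant(pga[["Drive", "Fair", "Green", "Putt"]])
fit = sm.OLS(pga["Score"], X).fit()
print(fit.summary())   # coefficient table, s, R-sq, R-sq(adj), and the overall F test
```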

Slide 42: Example: PGA Tour Data
Analysis of Variance

SOURCE        DF    SS        MS        F       P
Regression     4    3.79469   .94867    13.10   .000
Error         20    1.44865   .07243
Total         24    5.24334

Slide 43: Residual Analysis: Autocorrelation
Durbin-Watson Test for Autocorrelation
- The statistic ranges in value from zero to four.
- If successive values of the residuals are close together (positive autocorrelation), the statistic will be small.
- If successive values are far apart (negative autocorrelation), the statistic will be large.
- A value of two indicates no autocorrelation.
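
The slide refers to the statistic without printing it; computed from the residuals e_1, ..., e_n, the Durbin-Watson statistic is conventionally defined as:

```latex
\[
d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^{2}}{\sum_{t=1}^{n} e_t^{2}}
\]
```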

Slide 44: End of Chapter 16

