Regression Model Building LPGA Golf Performance - 2008.


1 Regression Model Building LPGA Golf Performance - 2008

2 Data Description
Response: log(Prize Winnings/Round) (log taken because the data are skewed)
Potential Predictors:
- Average Drive Distance
- Percentage of Drives Reaching Fairway
- Percentage of Greens Reached in Regulation
- Average Putts per Hole
- Average Number of Sand Traps Hit per Round (Sandshot)
- Percentage of Sand Saves
Samples:
- Training Sample: 100 Randomly Sampled Golfers
- Validation Sample: 57 Remaining Golfers, used to assess fit
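The training/validation split described on this slide can be sketched as follows. This is only an illustration: the golfer IDs 1 to 157, the helper name `split_sample`, and the seed are assumptions, not details taken from the slides.

```python
import random


def split_sample(ids, n_train, seed):
    """Randomly pick n_train ids for training; the rest form the validation set."""
    rng = random.Random(seed)           # fixed seed so the split is reproducible
    train = sorted(rng.sample(ids, n_train))
    train_set = set(train)
    valid = [i for i in ids if i not in train_set]
    return train, valid


# Hypothetical golfer IDs 1..157 (100 training + 57 validation golfers).
golfer_ids = list(range(1, 158))
train_ids, valid_ids = split_sample(golfer_ids, n_train=100, seed=2008)
print(len(train_ids), len(valid_ids))
```

The model is then fit on the training golfers only, and the held-out golfers are used to check how well it predicts.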

3 Modeling Strategies
- Select Training Sample
- Select “best” subset of predictors using Backward Elimination, Forward Selection, Stepwise Regression and/or All Possible Regressions, based on minimizing a model selection criterion (AIC in the stepwise output on the following slides)
- Identify any Influential Observations (based on Outliers, Leverage Values, DFFITS, DFBETAS, Cook’s D)
- Test Model Assumptions: Normality (Shapiro-Wilk), Constant Variance (Brown-Forsythe and Breusch-Pagan)
- Determine Validity of model by obtaining prediction errors for the validation sample

4 Top of Entire Sample (First 20 Golfers)

5 Backward Elimination (RSS = SSE)

Step 1: Start: AIC=-200.22
logprz ~ drive + fairway + green + putts + sandshot + sandsave

            Df Sum of Sq    RSS      AIC
- fairway    1     0.010 11.750 -202.132
<none>                   11.740 -200.216
- drive      1     0.397 12.138 -198.887
- sandsave   1     0.405 12.145 -198.827
- sandshot   1     1.030 12.770 -193.806
- green      1    24.960 36.700  -88.238
- putts      1    35.360 47.100  -63.289

Step 2: AIC=-202.13
logprz ~ drive + green + putts + sandshot + sandsave

            Df Sum of Sq    RSS      AIC
<none>                   11.750 -202.132
- sandsave   1     0.400 12.150 -200.784
- drive      1     0.537 12.287 -199.665
- sandshot   1     1.034 12.784 -195.698
- green      1    32.091 43.841  -72.461
- putts      1    35.688 47.438  -64.575

At Step 1, fairway is eliminated because dropping it minimizes AIC (-202.132 < -200.216).
At Step 2, no other variable is removed (no deletion gives AIC < -202.132).
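The AIC values in the backward elimination output can be reproduced from the RSS column. The sketch below assumes the form AIC = n ln(RSS/n) + 2p, where p counts the intercept plus the predictors; the function name `aic_from_rss` is illustrative, and the slides do not state the formula explicitly.

```python
import math


def aic_from_rss(rss, n, n_params):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * (number of parameters)."""
    return n * math.log(rss / n) + 2 * n_params


n = 100  # training sample size

# Full model: intercept + 6 predictors, RSS = 11.740
aic_full = aic_from_rss(11.740, n, 7)
# After dropping fairway: intercept + 5 predictors, RSS = 11.750
aic_no_fairway = aic_from_rss(11.750, n, 6)

print(round(aic_full, 2), round(aic_no_fairway, 2))
```

Because the tabulated RSS values are rounded, the results agree with the printed AIC only to about two decimal places. Dropping fairway barely changes RSS but saves one parameter, which is why its AIC is lower.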

6 Forward Selection (RSS = SSE)

Step 1: Start: AIC=-6.61
logprz ~ 1

            Df Sum of Sq    RSS      AIC
+ green      1    38.599 53.150  -59.206
+ putts      1    33.043 58.706  -49.263
+ drive      1    11.622 80.126  -18.156
+ sandshot   1     8.951 82.798  -14.876
+ sandsave   1     3.118 88.631   -8.069
<none>                   91.749   -6.611
+ fairway    1     0.409 91.340   -5.058

Step 2: AIC=-59.21
logprz ~ green

            Df Sum of Sq    RSS      AIC
+ putts      1    39.514 13.636 -193.246
+ sandsave   1     4.859 48.291  -66.793
<none>                   53.150  -59.206
+ fairway    1     0.635 52.514  -58.408
+ drive      1     0.361 52.788  -57.888
+ sandshot   1     0.004 53.146  -57.214

Step 3: AIC=-193.25
logprz ~ green + putts

            Df Sum of Sq    RSS     AIC
+ sandshot   1   0.73688 12.899 -196.80
+ sandsave   1   0.66486 12.971 -196.25
+ drive      1   0.31495 13.321 -193.58
<none>                   13.636 -193.25
+ fairway    1   0.09401 13.542 -191.94

Step 4: AIC=-196.8
logprz ~ green + putts + sandshot

            Df Sum of Sq    RSS     AIC
+ drive      1   0.74905 12.150 -200.78
+ sandsave   1   0.61234 12.287 -199.66
<none>                   12.899 -196.80
+ fairway    1   0.25056 12.649 -196.76

Step 5: AIC=-200.78
logprz ~ green + putts + sandshot + drive

            Df Sum of Sq    RSS     AIC
+ sandsave   1   0.40005 11.750 -202.13
<none>                   12.150 -200.78
+ fairway    1   0.00524 12.145 -198.83

Step 6: AIC=-202.13
logprz ~ green + putts + sandshot + drive + sandsave

            Df Sum of Sq   RSS     AIC
<none>                   11.75 -202.13
+ fairway    1 0.0099086 11.74 -200.22
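The decision rule at each forward step can be sketched from the tabulated RSS values alone: add the candidate with the lowest AIC, and stop when no candidate beats the current model. The helper names `aic_from_rss` and `forward_step` are illustrative, and the AIC form n ln(RSS/n) + 2p is an assumption that happens to match the printed values.

```python
import math


def aic_from_rss(rss, n, n_params):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * (number of parameters)."""
    return n * math.log(rss / n) + 2 * n_params


def forward_step(current_rss, current_params, candidate_rss, n):
    """Return (name, aic) of the best addition, or (None, current aic) if nothing improves AIC."""
    best_name = None
    best_aic = aic_from_rss(current_rss, n, current_params)
    for name, rss in candidate_rss.items():
        cand_aic = aic_from_rss(rss, n, current_params + 1)  # one extra coefficient
        if cand_aic < best_aic:
            best_name, best_aic = name, cand_aic
    return best_name, best_aic


# Step 5 of the output: current model has intercept + 4 predictors, RSS = 12.150
step5 = forward_step(12.150, 5, {"sandsave": 11.750, "fairway": 12.145}, n=100)
# Step 6: current model has intercept + 5 predictors, RSS = 11.750
step6 = forward_step(11.750, 6, {"fairway": 11.740}, n=100)
print(step5, step6)
```

At Step 5 the sketch selects sandsave, and at Step 6 it returns no addition, so forward selection ends at the same five-predictor model that backward elimination reached.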

7 Model: green, putts, sandshot, sandsave, drive

Call:
lm(formula = logprz ~ green + putts + sandshot + sandsave + drive, data = lpga.cv.in)

Residuals:
     Min       1Q   Median       3Q      Max
-0.72852 -0.20634  0.01067  0.22439  0.72316

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.272879   1.580975   9.028 2.14e-14 ***
green        0.210379   0.013130  16.023  < 2e-16 ***
putts       -0.625367   0.037011 -16.897  < 2e-16 ***
sandshot     0.790771   0.274937   2.876  0.00498 **
sandsave     0.008334   0.004658   1.789  0.07684 .
drive       -0.009563   0.004615  -2.072  0.04098 *
---
Residual standard error: 0.3536 on 94 degrees of freedom
Multiple R-squared: 0.8719, Adjusted R-squared: 0.8651
F-statistic: 128 on 5 and 94 DF, p-value: < 2.2e-16
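The summary statistics in this output follow from two numbers in the stepwise tables: the residual sum of squares of the selected model (11.750) and of the intercept-only model (91.749). The sketch below recomputes them with the usual textbook formulas; the variable names are illustrative, and small differences from the printed values come from rounding in the tabulated RSS.

```python
import math

n = 100        # training golfers
k = 5          # predictors in the selected model
sse = 11.750   # RSS of the selected model (from the stepwise output)
sst = 91.749   # RSS of the intercept-only model, i.e. the total sum of squares

df_resid = n - k - 1                     # 94 residual degrees of freedom
mse = sse / df_resid

r_squared = 1 - sse / sst
adj_r_squared = 1 - mse / (sst / (n - 1))
resid_std_error = math.sqrt(mse)
f_stat = ((sst - sse) / k) / mse

print(round(r_squared, 4), round(adj_r_squared, 4),
      round(resid_std_error, 4), round(f_stat, 1))
```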

8 Influence Measures (n=100, p’=6)

9 Summary of Influence Measures - I
Studentized Residuals (Exceed 3.607 in absolute value)
- Most extreme values: -2.172 and +2.112, so none exceed the cutoff
Leverage Values (Exceed 0.12)
- Golfers 111 (h=0.1543), 127 (0.1263), 113 (0.1213) (No big problem)
DFFITS (Exceed 0.49 in absolute value)
- Three Golfers between -0.61 and -0.49 (Golfers 142, 91, and 117)
- One Golfer between 0.49 and 0.59 (Golfer 59)
Cook’s D (Exceed 1, sometimes suggested to exceed 0.5)
- Max value is 0.0626. None come close to 1 (or the sometimes suggested 0.5)

10 Summary of Influence Measures - II
DFBETAS (Exceed 0.20 in absolute value)
- Intercept: Golfers 117 (-0.54), 28 (0.24), 45 (0.29), 59 (0.34), 142 (0.45)
- Greens: Golfers 132 (-0.25), 91 (0.24), 110 (0.25), 142 (0.33)
- Putts: Golfers 142 (-0.41), 25 (0.24), 117 (0.43)
- Sandshots: Golfers 132 (-0.25), 111 (0.23), 39 (0.23), 110 (0.24)
- Sandsaves: Golfers 59 (-0.43), 22 (-0.31), 91 (-0.30), 102 (-0.25), 115 (0.23), 47 (0.43)
- Drive: Golfers 142 (-0.49), 59 (-0.24), 56 (0.28), 117 (0.29), 48 (0.30)
Note that while some of these exceed the “threshold”, none seems excessively large. However, golfers 142 and 117 appear repeatedly, so they should be checked.
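The leverage, DFFITS, and DFBETAS cutoffs quoted on the influence slides are consistent with common rules of thumb evaluated at n = 100 and p' = 6. The slides do not state the formulas, so the ones below are an assumption that happens to reproduce the quoted numbers.

```python
import math

n = 100        # observations in the training sample
p_prime = 6    # regression coefficients, including the intercept

leverage_cutoff = 2 * p_prime / n            # flag h_ii above 2p'/n
dffits_cutoff = 2 * math.sqrt(p_prime / n)   # flag |DFFITS| above 2*sqrt(p'/n)
dfbetas_cutoff = 2 / math.sqrt(n)            # flag |DFBETAS| above 2/sqrt(n)

print(round(leverage_cutoff, 2), round(dffits_cutoff, 2), round(dfbetas_cutoff, 2))
```

The studentized residual cutoff of 3.607 is not reproduced here because it needs a t quantile, which the Python standard library does not provide.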

11 Residuals appear to be (reasonably) approximately normal. The Shapiro-Wilk test does not reject the hypothesis of normal errors.

> shapiro.test(residuals(lpga.mod1))

        Shapiro-Wilk normality test

data:  residuals(lpga.mod1)
W = 0.9833, p-value = 0.2390

12 No Evidence of non-constant error variance (Data had been transformed prior to fitting model)

13 Equal (Homogeneous) Variance - I No evidence to reject the null hypothesis of equal variance among errors

14 Equal (Homogeneous) Variance - II
There is no evidence of unequal variance, based on either the Brown-Forsythe or the Breusch-Pagan test.

        Breusch-Pagan test

data:  logprz ~ green + putts + sandshot + sandsave + drive
BP = 1.9306, df = 5, p-value = 0.8587
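The reported p-value is the upper-tail probability of a chi-square distribution with 5 degrees of freedom at BP = 1.9306. As a check, the sketch below evaluates that tail using a closed-form recurrence that holds for odd degrees of freedom; the function name `chi2_sf_odd_df` is illustrative, and in practice a statistics library would be used instead.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with odd df.

    Starts from the df = 1 case, erfc(sqrt(x/2)), and adds one term per
    extra pair of degrees of freedom: Q(s+1, y) = Q(s, y) + y^s e^(-y) / Gamma(s+1).
    """
    if df < 1 or df % 2 == 0:
        raise ValueError("this sketch only handles odd, positive df")
    y = x / 2.0
    q = math.erfc(math.sqrt(y))
    s = 0.5
    while s < df / 2.0:
        q += y ** s * math.exp(-y) / math.gamma(s + 1.0)
        s += 1.0
    return q


p_value = chi2_sf_odd_df(1.9306, 5)
print(round(p_value, 4))
```

A p-value this large means the squared residuals show no detectable relationship with the predictors, which is the basis for the "no evidence of unequal variance" conclusion.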

