Regression Model Building: LPGA Golf Performance
Data Description
Response: log(Prize Winnings/Round), log-transformed because the winnings data are skewed
Potential Predictors:
  - Average Drive Distance
  - Percentage of Drives Reaching Fairway
  - Percentage of Greens Reached in Regulation
  - Average Putts per Hole
  - Average Number of Sand Traps Hit per Round (Sandshot)
  - Percentage of Sand Saves
Samples:
  - Training Sample: 100 Randomly Sampled Golfers
  - Validation Sample: 57 Remaining Golfers, used to assess fit
Modeling Strategies
  - Select Training Sample
  - Select “best” subset of predictors using Backward Elimination, Forward Selection, Stepwise Regression, and/or All Possible Regressions, based on minimizing a selection criterion (the selection output that follows uses AIC)
  - Identify any Influential Observations (based on Outliers, Leverage Values, DFFITS, DFBETAS, Cook’s D)
  - Test Model Assumptions: Normality (Shapiro-Wilk), Constant Variance (Brown-Forsythe and Breusch-Pagan)
  - Determine validity of the model by obtaining prediction errors for the validation sample
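The training/validation idea in the last step can be sketched in R. This is a minimal illustration on simulated stand-in data, not the LPGA file: the data frame `golf`, its made-up values, and the two-predictor model are all assumptions chosen only to show the 100/57 split and the prediction-error calculation.

```r
# Hypothetical stand-in data: 157 golfers with made-up predictor values.
set.seed(1)
n_all <- 157
golf <- data.frame(
  green = rnorm(n_all, mean = 65, sd = 4),
  putts = rnorm(n_all, mean = 1.82, sd = 0.03)
)
golf$logprz <- 0.2 * golf$green - 20 * golf$putts + rnorm(n_all, sd = 0.6)

# Randomly pick 100 golfers for training; the remaining 57 form the validation set.
train_id <- sample(n_all, 100)
train <- golf[train_id, ]
valid <- golf[-train_id, ]

# Fit on the training sample only.
fit <- lm(logprz ~ green + putts, data = train)

# Prediction errors for golfers the model has not seen.
pred_err <- valid$logprz - predict(fit, newdata = valid)
mspr <- mean(pred_err^2)        # mean squared prediction error (validation)
mse  <- summary(fit)$sigma^2    # residual mean square (training), for comparison
c(MSPR = mspr, MSE = mse)
```

If the validation MSPR is close to the training MSE, the fitted model is predicting new golfers about as well as it fits the ones it was built on.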
Top of Entire Sample (First 20 Golfers)
Backward Elimination (RSS = SSE)
Step 1 (Start): logprz ~ drive + fairway + green + putts + sandshot + sandsave
  Terms considered for removal, in the order listed in the Df / Sum of Sq / RSS / AIC table: fairway, drive, sandsave, sandshot, green, putts
Step 2: logprz ~ drive + green + putts + sandshot + sandsave
  Terms considered for removal, in the order listed: sandsave, drive, sandshot, green, putts
At Step 1, fairway is eliminated: dropping it gives the lowest AIC, below that of the full model.
At Step 2, no other variable is removed: no deletion gives an AIC below that of the current model.
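The AIC comparison behind each step can be written out. Assuming the output comes from R's step() with its default penalty (an assumption, since the call itself is not shown), the criterion for a linear model with n observations and p' estimated coefficients, intercept included, is:

```latex
\mathrm{AIC} = n \ln\!\left(\frac{\mathrm{SSE}}{n}\right) + 2p'
```

Here n = 100 for the training sample and SSE is the RSS column of the output. Dropping a one-coefficient term lowers the penalty 2p' by 2 but raises SSE, so the term is removed only when the resulting increase in n ln(SSE/n) is smaller than 2.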
Forward Selection (RSS = SSE)
Step 1 (Start: AIC=-6.61): logprz ~ 1
  Terms considered for addition, in the order listed in the Df / Sum of Sq / RSS / AIC table: green, putts, drive, sandshot, sandsave, fairway
Step 2: logprz ~ green
  Terms considered for addition: putts, sandsave, fairway, drive, sandshot
Step 3: logprz ~ green + putts
  Terms considered for addition: sandshot, sandsave, drive, fairway
Step 4: logprz ~ green + putts + sandshot
  Terms considered for addition: drive, sandsave, fairway
Step 5: logprz ~ green + putts + sandshot + drive
  Terms considered for addition: sandsave, fairway
Step 6: logprz ~ green + putts + sandshot + drive + sandsave
  Terms considered for addition: fairway (not added)
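Both selection procedures can be reproduced in outline with R's step(). The sketch below runs on simulated stand-in data (the data frame `dat` and its coefficients are hypothetical, with only green and putts given a real effect), so the selected terms will not match the LPGA output; it only shows the calls and checks the AIC that step() uses for lm fits.

```r
# Hypothetical stand-in data with the same variable names as the slides.
set.seed(2)
n <- 100
dat <- data.frame(
  drive = rnorm(n), fairway = rnorm(n), green = rnorm(n),
  putts = rnorm(n), sandshot = rnorm(n), sandsave = rnorm(n)
)
# Made-up truth: only green and putts affect the response.
dat$logprz <- 1 + 0.8 * dat$green - 0.6 * dat$putts + rnorm(n, sd = 0.5)

full <- lm(logprz ~ drive + fairway + green + putts + sandshot + sandsave,
           data = dat)
null <- lm(logprz ~ 1, data = dat)

# Backward elimination starts from the full model and drops terms.
back <- step(full, direction = "backward", trace = 0)

# Forward selection starts from the intercept-only model and adds terms
# up to the full model given as the scope.
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)

formula(back)
formula(fwd)

# The AIC reported by step() for an lm fit: n*log(RSS/n) + 2*(number of coefficients).
rss <- sum(residuals(back)^2)
aic_by_hand <- n * log(rss / n) + 2 * length(coef(back))
all.equal(aic_by_hand, extractAIC(back)[2])
```

Setting trace = 0 suppresses the step-by-step tables; leaving it at the default prints tables in the same Df / Sum of Sq / RSS / AIC layout as the output above.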
Model: green, putts, sandshot, sandsave, drive
Call:
lm(formula = logprz ~ green + putts + sandshot + sandsave + drive, data = lpga.cv.in)
Coefficients (Pr(>|t|) and significance codes as printed):
  (Intercept)   e-14      ***
  green         < 2e-16   ***
  putts         < 2e-16   ***
  sandshot                **
  sandsave
  drive                   *
Residual standard error: on 94 degrees of freedom
F-statistic: 128 on 5 and 94 DF, p-value: < 2.2e-16
Influence Measures (n=100, p’=6)
Summary of Influence Measures - I
Studentized Residuals (flagged when they exceed the cutoff in absolute value)
  - Two extreme values (in absolute value) are noted
Leverage Values (Exceed 0.12)
  - Golfers 111 (h=0.1543), 127 (0.1263), 113 (0.1213) (No big problem)
DFFITS (Exceed 0.49 in absolute value)
  - Three golfers flagged: Golfers 142, 91, and 117
  - One golfer between 0.49 and 0.59 (Golfer 59)
Cook’s D (Exceed 1, sometimes suggested to exceed 0.5)
  - No value comes close to 1 (or to the sometimes suggested 0.5)
Summary of Influence Measures
DFBETAS (Exceed 0.20 in absolute value)
  - Intercept: Golfers 117 (-0.54), 28 (0.24), 45 (0.29), 59 (0.34), 142 (0.45)
  - Greens: Golfers 132 (-0.25), 91 (0.24), 110 (0.25), 142 (0.33)
  - Putts: Golfers 142 (-0.41), 25 (0.24), 117 (0.43)
  - Sandshots: Golfers 132 (-0.25), 111 (0.23), 39 (0.23), 110 (0.24)
  - Sandsaves: Golfers 59 (-0.43), 22 (-0.31), 91 (-0.30), 102 (-0.25), 115 (0.23), 47 (0.43)
  - Drive: Golfers 142 (-0.49), 59 (-0.24), 56 (0.28), 117 (0.29), 48 (0.30)
Note that while some of these exceed the “threshold”, none seems excessive. However, golfers 142 and 117 appear repeatedly and should be checked.
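The cutoffs quoted above (0.12, 0.49, 0.20) are consistent with the common rules of thumb 2p'/n for leverage, 2*sqrt(p'/n) for DFFITS, and 2/sqrt(n) for DFBETAS when n = 100 and p' = 6. A minimal R sketch of the calculation follows; it assumes those rules are the ones used, and it runs on simulated stand-in data (the data frame `dat` is hypothetical), so the flagged observation numbers are not the golfers listed above.

```r
# Hypothetical stand-in data: 100 observations, 5 predictors.
set.seed(3)
n <- 100
dat <- data.frame(green = rnorm(n), putts = rnorm(n), sandshot = rnorm(n),
                  sandsave = rnorm(n), drive = rnorm(n))
dat$logprz <- 1 + 0.8 * dat$green - 0.6 * dat$putts + rnorm(n, sd = 0.5)
fit <- lm(logprz ~ green + putts + sandshot + sandsave + drive, data = dat)

p <- length(coef(fit))           # p' = 6 (five slopes plus intercept)

# Rule-of-thumb cutoffs.
cut_lev     <- 2 * p / n         # 0.12
cut_dffits  <- 2 * sqrt(p / n)   # about 0.49
cut_dfbetas <- 2 / sqrt(n)       # 0.20

# Observations exceeding each cutoff.
flag_lev     <- which(hatvalues(fit) > cut_lev)
flag_dffits  <- which(abs(dffits(fit)) > cut_dffits)
flag_dfbetas <- which(abs(dfbetas(fit)) > cut_dfbetas, arr.ind = TRUE)  # row = case, col = coefficient
max_cook     <- max(cooks.distance(fit))

c(leverage = cut_lev, dffits = cut_dffits, dfbetas = cut_dfbetas)
```

All of these functions are in base R's stats package; rstudent(fit) gives the studentized (deleted) residuals for the outlier check.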
Residuals appear to be approximately normal. The Shapiro-Wilk test does not reject the hypothesis of normal errors.
> shapiro.test(residuals(lpga.mod1))
Shapiro-Wilk normality test
data: residuals(lpga.mod1)
W = , p-value =
No evidence of non-constant error variance (the data had been transformed prior to fitting the model)
Equal (Homogeneous) Variance - I
No evidence to reject the null hypothesis of equal variance among errors.
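One way to carry out a Brown-Forsythe-type check on regression residuals is sketched below in base R. The grouping rule is an assumption (the slides do not show how the groups were formed): residuals are split at the median fitted value, absolute deviations from each group's median residual are computed, and the two groups are compared with a pooled-variance t-test. The data are simulated with constant variance and are purely illustrative.

```r
# Hypothetical stand-in data with constant error variance by construction.
set.seed(4)
n <- 100
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)

e <- residuals(fit)

# Assumed grouping: low vs. high fitted values, split at the median.
grp <- factor(ifelse(fitted(fit) <= median(fitted(fit)), "low", "high"))

# Absolute deviation of each residual from its own group's median residual.
d <- abs(e - ave(e, grp, FUN = median))

# Pooled-variance two-sample t-test on the absolute deviations.
bf <- t.test(d ~ grp, var.equal = TRUE)
bf$statistic
bf$p.value
```

A large p-value here means the spread of the residuals is similar in the two groups, which is what "no evidence to reject equal variance" refers to.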
Equal (Homogeneous) Variance
There is no evidence of unequal variance, based on either the Brown-Forsythe or the Breusch-Pagan test.
Breusch-Pagan test
data: logprz ~ green + putts + sandshot + sandsave + drive
BP = , df = 5, p-value =
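The Breusch-Pagan statistic can be computed by hand in base R, which makes the df = 5 in the output easier to read: it is the number of predictors in the auxiliary regression of the squared residuals. The sketch below uses the non-studentized form on simulated stand-in data (the data frame `dat` is hypothetical); whether the slides used exactly this form is an assumption.

```r
# Hypothetical stand-in data: 100 observations, 5 predictors, constant variance.
set.seed(5)
n <- 100
dat <- data.frame(green = rnorm(n), putts = rnorm(n), sandshot = rnorm(n),
                  sandsave = rnorm(n), drive = rnorm(n))
dat$logprz <- 1 + 0.8 * dat$green - 0.6 * dat$putts + rnorm(n, sd = 0.5)
fit <- lm(logprz ~ green + putts + sandshot + sandsave + drive, data = dat)

# Auxiliary regression: squared residuals on the same predictors.
dat$e2 <- residuals(fit)^2
aux <- lm(e2 ~ green + putts + sandshot + sandsave + drive, data = dat)

sse      <- sum(dat$e2)                              # SSE of the original model
ssr_star <- sum((fitted(aux) - mean(dat$e2))^2)      # regression SS of the auxiliary fit

bp   <- (ssr_star / 2) / (sse / n)^2                 # Breusch-Pagan statistic
df   <- length(coef(aux)) - 1                        # one df per predictor: 5
pval <- pchisq(bp, df = df, lower.tail = FALSE)      # chi-square upper tail

c(BP = bp, df = df, p.value = pval)
```

The bptest() function in the lmtest package reports the same kind of test; its default is a studentized variant, so its value can differ from this hand calculation unless studentize = FALSE is set.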