Stepwise Regression (Xuhua Xia)


1 Stepwise Regression
Y may depend on many independent variables (X's). How do we find the subset of X's that best predicts Y?
There are several criteria for model selection (e.g., adjusted R², AIC, BIC, likelihood ratio test) and many algorithms for including or excluding X's in the model: forward selection, backward elimination, stepwise regression, etc. With the availability of statistical packages, stepwise regression is now the most commonly used.
[Figure: Y and the candidate predictors X1, X2, X3, X4, X5, X6]
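For reference, the criteria named on this slide can be written out for a linear model. This is a sketch under stated assumptions rather than the author's own notation: $n$ is the number of observations, $p$ the number of coefficients in the candidate model (intercept included), $\mathrm{RSS}_p$ its residual sum of squares, $\mathrm{TSS}$ the total sum of squares, and $\hat\sigma^2$ the residual mean square of the full model. AIC and BIC are given up to an additive constant, which is the form R's step() reports for lm fits.

```latex
\begin{align*}
R^2_a &= 1 - \frac{\mathrm{RSS}_p/(n-p)}{\mathrm{TSS}/(n-1)} \\
C_p   &= \frac{\mathrm{RSS}_p}{\hat\sigma^2} - n + 2p \\
\mathrm{AIC} &= n \ln\!\left(\frac{\mathrm{RSS}_p}{n}\right) + 2p \\
\mathrm{BIC} &= n \ln\!\left(\frac{\mathrm{RSS}_p}{n}\right) + p \ln n
\end{align*}
```

A larger adjusted R² is better, while smaller AIC and BIC are better; BIC penalizes each extra coefficient more heavily than AIC once $\ln n > 2$.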

2 A Data Set for Multiple Regression
Measurements on men involved in a physical fitness course at N. C. State University. Fitness is typically measured by oxygen intake rate (oxy), which is difficult to measure (or at least cumbersome when one is exercising oneself). The goal of the study is to develop an equation that predicts oxy from exercise tests rather than from oxygen consumption measurements. The data set has 31 observations. The variables in the data set are:
- age (in years)
- weight (in kg)
- oxy (oxygen intake rate, ml per kg body weight per minute)
- runtime (time to run 1.5 miles, in minutes)
- rstpulse (heart rate while resting)
- runpulse (heart rate while running, at the same time when oxygen rate was measured)
- maxpulse (maximum heart rate recorded while running)

3 The data
oxy      age  weight  runtime  rstpulse  runpulse  maxpulse
44.609    44   89.47    11.37        62       178       182
59.571    42   68.15     8.17        40       166       172
45.681    40   75.98    11.95        70       176       180
60.055    38   81.87     8.63        48       170       186
44.754    45   66.45    11.12        51       176       176
49.156    49   81.42     8.95        44       180       185
46.774    48   91.63    10.25        48       162       164
46.08     54   79.38    11.17        62       156       165
45.118    51   67.25    11.08        48       172       172
50.545    57   59.08     9.93        49       148       155
47.467    52   82.78    10.5         53       170       172
45.313    40   75.07    10.07        62       185       185
49.874    38   89.02     9.22        55       178       180
49.091    43   81.19    10.85        64       162       170
50.541    44   73.03    10.13        45       168       168
47.273    47   79.15    10.6         47       162       164
40.836    51   69.63    10.95        57       168       172
50.388    49   73.37    10.08        67       168       168
45.441    52   76.32     9.63        48       164       166
39.203    54   91.63    12.88        44       168       172
48.673    49   76.32     9.4         56       186       188
54.297    44   85.84     8.65        45       156       168
44.811    47   77.45    11.63        58       176       176
39.442    44   81.42    13.08        63       174       176
37.388    45   87.66    14.03        56       186       192
51.855    54   83.12    10.33        50       166       170
46.672    51   77.91    10           48       162       168
39.407    57   73.37    12.63        58       174       176
54.625    50   70.87     8.92        48       146       155
45.79     51   73.71    10.47        59       186       188
47.92     48   61.24    11.5         52       170       176

4 Correlation matrix
Pearson correlation coefficients; the p-value of each coefficient is printed on the line beneath it.
            age       weight    oxy       runtime   rstpulse  runpulse  maxpulse
age         1.00000  -0.23354  -0.30459   0.18875  -0.16410  -0.33787  -0.43292
                      0.2061    0.0957    0.3092    0.3777    0.0630    0.0150
weight     -0.23354   1.00000  -0.16275   0.14351   0.04397   0.18152   0.24938
            0.2061              0.3817    0.4412    0.8143    0.3284    0.1761
oxy        -0.30459  -0.16275   1.00000  -0.86219  -0.39936  -0.39797  -0.23674
            0.0957    0.3817              <.0001    0.0260    0.0266    0.1997
runtime     0.18875   0.14351  -0.86219   1.00000   0.45038   0.31365   0.22610
            0.3092    0.4412    <.0001              0.0110    0.0858    0.2213
rstpulse   -0.16410   0.04397  -0.39936   0.45038   1.00000   0.35246   0.30512
            0.3777    0.8143    0.0260    0.0110              0.0518    0.0951
runpulse   -0.33787   0.18152  -0.39797   0.31365   0.35246   1.00000   0.92975
            0.0630    0.3284    0.0266    0.0858    0.0518              <.0001
maxpulse   -0.43292   0.24938  -0.23674   0.22610   0.30512   0.92975   1.00000
            0.0150    0.1761    0.1997    0.2213    0.0951    <.0001
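The p-values in this matrix can be checked from each correlation coefficient and the sample size alone. The sketch below assumes the usual t-test for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom; the helper name cor_pvalue is made up for illustration and is not part of any package.

```r
# Two-sided p-value for a Pearson correlation r computed from n pairs.
# cor_pvalue is a hypothetical helper name, not a library function.
cor_pvalue <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 31                                        # observations in the data set
p_age_oxy      <- cor_pvalue(-0.30459, n)      # age vs. oxy
p_age_maxpulse <- cor_pvalue(-0.43292, n)      # age vs. maxpulse
p_run_max      <- cor_pvalue(0.92975, n)       # runpulse vs. maxpulse

print(round(c(p_age_oxy, p_age_maxpulse), 4))  # compare with 0.0957 and 0.0150
print(p_run_max < 0.0001)                      # reported as <.0001
```

With the raw data at hand, cor.test(x, y) in base R returns the same test for a single pair of variables.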

5 Scatterplot matrix
[Figure: scatterplot matrix]

6 rcorr in Hmisc
> print(rmat)
          oxy    age  weight  runtime  rstpulse  runpulse  maxpulse
oxy       1.00  -0.30  -0.16    -0.86     -0.40     -0.40     -0.24
age      -0.30   1.00  -0.23     0.19     -0.16     -0.34     -0.43
weight   -0.16  -0.23   1.00     0.14      0.04      0.18      0.25
runtime  -0.86   0.19   0.14     1.00      0.45      0.31      0.23
rstpulse -0.40  -0.16   0.04     0.45      1.00      0.35      0.31
runpulse -0.40  -0.34   0.18     0.31      0.35      1.00      0.93
maxpulse -0.24  -0.43   0.25     0.23      0.31      0.93      1.00

P
          oxy     age     weight  runtime  rstpulse  runpulse  maxpulse
oxy               0.0957  0.3817  0.0000   0.0260    0.0266    0.1997
age       0.0957          0.2061  0.3092   0.3777    0.0630    0.0150
weight    0.3817  0.2061          0.4412   0.8143    0.3284    0.1761
runtime   0.0000  0.3092  0.4412           0.0110    0.0858    0.2213
rstpulse  0.0260  0.3777  0.8143  0.0110             0.0518    0.0951
runpulse  0.0266  0.0630  0.3284  0.0858   0.0518              0.0000
maxpulse  0.1997  0.0150  0.1761  0.2213   0.0951    0.0000

7 Backward elimination
Start:  AIC=58.16
oxy ~ age + weight + runtime + rstpulse + runpulse + maxpulse

            Df  Sum of Sq     RSS     AIC
- rstpulse   1      0.571  129.41  56.299
<none>                     128.84  58.162
- weight     1      9.911  138.75  58.459
- maxpulse   1     26.491  155.33  61.958
- age        1     27.746  156.58  62.208
- runpulse   1     51.058  179.90  66.510
- runtime    1    250.822  379.66  89.664

Step:  AIC=56.3
oxy ~ age + weight + runtime + runpulse + maxpulse

            Df  Sum of Sq     RSS     AIC
<none>                     129.41  56.299
- weight     1       9.52  138.93  56.499
- maxpulse   1      26.83  156.23  60.139
- age        1      27.37  156.78  60.247
- runpulse   1      52.60  182.00  64.871
- runtime    1     320.36  449.77  92.917

The <none> row is the current model, i.e., the model without eliminating any variable (in the first step, without eliminating rstpulse). Eliminating rstpulse lowers AIC from 58.162 to 56.299, so rstpulse is dropped. In the second step every elimination raises AIC above that of the current model, so the procedure stops.
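The AIC column in this output can be reproduced from the RSS column. The sketch below assumes the form n * log(RSS / n) + 2 * p that step() reports for lm fits, where p counts the coefficients including the intercept; step_aic is a name made up for this illustration.

```r
# AIC on the scale printed by step() for a linear model.
# step_aic is a hypothetical helper: rss = residual sum of squares,
# n = number of observations, p = number of coefficients (with intercept).
step_aic <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 31
aic_full          <- step_aic(128.84, n, 7)  # all six IVs plus intercept
aic_drop_rstpulse <- step_aic(129.41, n, 6)  # rstpulse eliminated
aic_drop_runtime  <- step_aic(379.66, n, 6)  # runtime eliminated

# Compare with 58.162, 56.299 and 89.664 in the first table.
print(round(c(aic_full, aic_drop_rstpulse, aic_drop_runtime), 3))
```

Dropping rstpulse barely changes RSS but saves the penalty of 2 for one coefficient, which is why its AIC is lower than that of the full model.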

8 Forward addition
Start:  AIC=104.7
oxy ~ 1

            Df  Sum of Sq     RSS      AIC
+ runtime    1     632.90  218.48   64.534
+ rstpulse   1     135.78  715.60  101.313
+ runpulse   1     134.84  716.54  101.354
+ age        1      78.99  772.39  103.681
<none>                     851.38  104.699
+ maxpulse   1      47.72  803.67  104.911
+ weight     1      22.55  828.83  105.867

Step:  AIC=64.53
oxy ~ runtime

            Df  Sum of Sq     RSS     AIC
+ age        1    17.7656  200.72  63.905
+ runpulse   1    15.3621  203.12  64.274
<none>                     218.48  64.534
+ maxpulse   1     1.5674  216.91  66.311
+ weight     1     1.3236  217.16  66.346
+ rstpulse   1     0.1301  218.35  66.516

Step:  AIC=63.9
oxy ~ runtime + age

            Df  Sum of Sq     RSS     AIC
+ runpulse   1     39.885  160.83  59.037
+ maxpulse   1     14.885  185.83  63.516
<none>                     200.72  63.905
+ weight     1      5.605  195.11  65.027
+ rstpulse   1      2.641  198.07  65.494

Step:  AIC=59.04
oxy ~ runtime + age + runpulse

            Df  Sum of Sq     RSS     AIC
+ maxpulse   1    21.9007  138.93  56.499
<none>                     160.83  59.037
+ weight     1     4.5958  156.24  60.139
+ rstpulse   1     0.4901  160.34  60.943

Step:  AIC=56.5
oxy ~ runtime + age + runpulse + maxpulse

In each table, the rows above <none> are IVs whose addition will improve the fit (lower AIC), and the rows below <none> are IVs whose addition will make it worse. The output shown on the slide ends at this step; judging from the backward-elimination table, adding weight here would lower AIC further, to 56.299.

9 R Functions
    library(Hmisc)
    cor(myD, method="pearson")                      # or method="spearman"
    pairs(~age+weight+runtime+rstpulse+runpulse+maxpulse+oxy, data=myD)
    rmat <- rcorr(as.matrix(myD), type="pearson")   # or type="spearman"
    rmat
    print(rmat[1], digits=5)
    fit <- lm(oxy~age+weight+runtime+rstpulse+runpulse+maxpulse, data=myD)
    anova(fit)
    summary(fit)
    full.model <- lm(oxy~age+weight+runtime+rstpulse+runpulse+maxpulse, data=myD)
    best.model <- step(full.model, direction="backward")
    min.model <- lm(oxy~1, data=myD)
    best.model <- step(min.model, direction="forward",
                       scope=~age+weight+runtime+rstpulse+runpulse+maxpulse)
    new <- data.frame(...)                          # specify values here
    predict(fit, new, interval="confidence")
    predict(fit, new, interval="prediction")
Two R functions compute Pearson correlations: cor, which ships with base R, does not provide the associated p-values, whereas rcorr in the Hmisc package does.
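A runnable version of the two step() calls, using only base R, might look like the sketch below. The data are transcribed from the data slide. One assumption is flagged in the code: six rows of the transcript show a single value after rstpulse, and maxpulse is taken to equal runpulse there. The names back.model and fwd.model are chosen for this illustration.

```r
# Fitness data as transcribed from the data slide (oxy is the response).
# ASSUMPTION: in six rows the transcript shows only one value after rstpulse;
# maxpulse is taken to equal runpulse in those rows.
myD <- read.table(header = TRUE, text = "oxy age weight runtime rstpulse runpulse maxpulse
44.609 44 89.47 11.37 62 178 182
59.571 42 68.15 8.17 40 166 172
45.681 40 75.98 11.95 70 176 180
60.055 38 81.87 8.63 48 170 186
44.754 45 66.45 11.12 51 176 176
49.156 49 81.42 8.95 44 180 185
46.774 48 91.63 10.25 48 162 164
46.08 54 79.38 11.17 62 156 165
45.118 51 67.25 11.08 48 172 172
50.545 57 59.08 9.93 49 148 155
47.467 52 82.78 10.5 53 170 172
45.313 40 75.07 10.07 62 185 185
49.874 38 89.02 9.22 55 178 180
49.091 43 81.19 10.85 64 162 170
50.541 44 73.03 10.13 45 168 168
47.273 47 79.15 10.6 47 162 164
40.836 51 69.63 10.95 57 168 172
50.388 49 73.37 10.08 67 168 168
45.441 52 76.32 9.63 48 164 166
39.203 54 91.63 12.88 44 168 172
48.673 49 76.32 9.4 56 186 188
54.297 44 85.84 8.65 45 156 168
44.811 47 77.45 11.63 58 176 176
39.442 44 81.42 13.08 63 174 176
37.388 45 87.66 14.03 56 186 192
51.855 54 83.12 10.33 50 166 170
46.672 51 77.91 10 48 162 168
39.407 57 73.37 12.63 58 174 176
54.625 50 70.87 8.92 48 146 155
45.79 51 73.71 10.47 59 186 188
47.92 48 61.24 11.5 52 170 176")

full.model <- lm(oxy ~ age + weight + runtime + rstpulse + runpulse + maxpulse,
                 data = myD)
min.model  <- lm(oxy ~ 1, data = myD)

# Backward elimination: start from the full model, drop terms while AIC falls.
back.model <- step(full.model, direction = "backward", trace = 0)

# Forward addition: start from the intercept-only model; scope bounds the search.
fwd.model <- step(min.model, direction = "forward", trace = 0,
                  scope = list(lower = ~ 1,
                               upper = ~ age + weight + runtime + rstpulse +
                                         runpulse + maxpulse))

print(formula(back.model))
print(formula(fwd.model))
print(extractAIC(back.model))  # c(equivalent df, AIC) on the scale used by step()
```

Passing data = myD to lm() lets step() refit candidate models without attaching the data frame, and trace = 0 suppresses the step-by-step tables; drop it to see output like that on the two preceding slides.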

10 Package leaps
Package leaps includes a function leaps that offers two more criteria for model selection:
1) adjusted R²
2) Mallows' Cp (which is used less frequently now)
The input to leaps is not a data frame but a vector for the DV and a matrix for the IVs:
    x <- as.matrix(myD)
    DV <- x[,1]        # oxy is the first column of myD
    IV <- x[,2:7]
    library(leaps)
    solR2a <- leaps(IV, DV, names=names(myD)[2:7], method="adjr2")
    solCp <- leaps(IV, DV, names=names(myD)[2:7], method="Cp")

11 leaps evaluates all linear models
$which
    age    weight  runtime  rstpulse  runpulse  maxpulse
1   FALSE  FALSE   TRUE     FALSE     FALSE     FALSE
1   FALSE  FALSE   FALSE    TRUE      FALSE     FALSE
1   FALSE  FALSE   FALSE    FALSE     TRUE      FALSE
1   TRUE   FALSE   FALSE    FALSE     FALSE     FALSE
1   FALSE  FALSE   FALSE    FALSE     FALSE     TRUE
1   FALSE  TRUE    FALSE    FALSE     FALSE     FALSE
2   TRUE   FALSE   TRUE     FALSE     FALSE     FALSE
2   FALSE  FALSE   TRUE     FALSE     TRUE      FALSE
2   FALSE  FALSE   TRUE     FALSE     FALSE     TRUE
2   FALSE  TRUE    TRUE     FALSE     FALSE     FALSE
2   FALSE  FALSE   TRUE     TRUE      FALSE     FALSE
2   TRUE   FALSE   FALSE    FALSE     TRUE      FALSE
2   TRUE   FALSE   FALSE    TRUE      FALSE     FALSE
2   FALSE  FALSE   FALSE    FALSE     TRUE      TRUE
2   TRUE   FALSE   FALSE    FALSE     FALSE     TRUE
2   FALSE  FALSE   FALSE    TRUE      TRUE      FALSE
3   TRUE   FALSE   TRUE     FALSE     TRUE      FALSE
3   FALSE  FALSE   TRUE     FALSE     TRUE      TRUE
3   TRUE   FALSE   TRUE     FALSE     FALSE     TRUE
3   TRUE   TRUE    TRUE     FALSE     FALSE     FALSE
3   TRUE   FALSE   TRUE     TRUE      FALSE     FALSE
3   FALSE  FALSE   TRUE     TRUE      TRUE      FALSE
3   FALSE  TRUE    TRUE     FALSE     TRUE      FALSE
3   FALSE  TRUE    TRUE     FALSE     FALSE     TRUE
3   FALSE  FALSE   TRUE     TRUE      FALSE     TRUE
3   FALSE  TRUE    TRUE     TRUE      FALSE     FALSE
……
Each row is one candidate model; the first column is the number of IVs in it, and TRUE marks the IVs it includes. The best model is the one with the greatest adjusted R², or with a Cp closest to the total number of IVs (6 in our case). (A more common convention for Mallows' Cp compares each model's Cp with its own number of coefficients p, favoring models with a small Cp that is close to p.) The next two slides show the results of the evaluation.
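The idea of scoring every subset of IVs can be sketched in base R without the leaps package. This is a brute-force stand-in, not the branch-and-bound algorithm that leaps uses, and it runs on simulated data: y and x1 to x4 are hypothetical variables generated so that only x1 and x3 truly affect y.

```r
# Brute-force all-subsets search by adjusted R-squared (base R only).
# Simulated, hypothetical data: y depends on x1 and x3 but not on x2 or x4.
set.seed(1)
n <- 40
X <- as.data.frame(matrix(rnorm(n * 4), n, 4,
                          dimnames = list(NULL, paste0("x", 1:4))))
X$y <- 2 * X$x1 - 1.5 * X$x3 + rnorm(n)

ivs <- paste0("x", 1:4)

# Every non-empty subset of the IVs: 2^4 - 1 = 15 candidate models.
subsets <- unlist(lapply(seq_along(ivs),
                         function(k) combn(ivs, k, simplify = FALSE)),
                  recursive = FALSE)

# Fit each candidate model and record its adjusted R-squared.
adjr2 <- sapply(subsets, function(v) {
  summary(lm(reformulate(v, response = "y"), data = X))$adj.r.squared
})

best <- subsets[[which.max(adjr2)]]
print(best)
print(max(adjr2))
```

Enumeration doubles in cost with every extra IV (6 IVs give 63 models), so it is only practical for a small number of candidates.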

12 Model evaluation: adjusted R²
$size (the number of coefficients in each model)
 [1] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6
[39] 6 6 6 6 7

$adjr2
 [1]  0.734531140  0.130502041  0.129362176  0.061492963  0.023495780
 [6] -0.007080873  0.747407426  0.744382656  0.727022568  0.726715841
[11]  0.725213882  0.331423675  0.250289567  0.238663728  0.207123289
[16]  0.180390053  0.790104958  0.788876048  0.757477967  0.745367336
[21]  0.741499364  0.735442756  0.735365598  0.717949838  0.716918697
[26]  0.716790424  0.811713247  0.788260638  0.787518105  0.782696332
[31]  0.781231246  0.753335728  0.750114011  0.740422515  0.725675821
[36]  0.707129090  0.817602176  0.804437584  0.781067331  0.779299361
[41]  0.746441303  0.464879114  0.810839895

Given the set of adjusted R² values for the 43 alternative models, which one is the maximum? (With 6 IVs there are 63 possible models; leaps reports at most nbest = 10 models of each size by default, which leaves 43.)
    maxAdjR2 <- max(solR2a$adjr2)                   # find the max adjusted R²
    bestModelInd <- match(maxAdjR2, solR2a$adjr2)   # find the index of the max adjusted R²
    solR2a$which[bestModelInd,]                     # show the best model
    bestModelInd <- which.max(solR2a$adjr2)         # alternative: find the index of the max adjusted R² in one call
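The lookup can be tried without installing leaps by pasting the $size and $adjr2 values from this slide into plain vectors. The vector names size and adjr2 below stand in for solR2a$size and solR2a$adjr2.

```r
# $size and $adjr2 copied from the leaps output on this slide.
size <- rep(2:7, times = c(6, 10, 10, 10, 6, 1))
adjr2 <- c(0.734531140, 0.130502041, 0.129362176, 0.061492963, 0.023495780,
           -0.007080873, 0.747407426, 0.744382656, 0.727022568, 0.726715841,
           0.725213882, 0.331423675, 0.250289567, 0.238663728, 0.207123289,
           0.180390053, 0.790104958, 0.788876048, 0.757477967, 0.745367336,
           0.741499364, 0.735442756, 0.735365598, 0.717949838, 0.716918697,
           0.716790424, 0.811713247, 0.788260638, 0.787518105, 0.782696332,
           0.781231246, 0.753335728, 0.750114011, 0.740422515, 0.725675821,
           0.707129090, 0.817602176, 0.804437584, 0.781067331, 0.779299361,
           0.746441303, 0.464879114, 0.810839895)

bestModelInd <- which.max(adjr2)   # index of the largest adjusted R-squared
print(bestModelInd)                # 37
print(adjr2[bestModelInd])         # 0.817602176
print(size[bestModelInd])          # 6 coefficients: intercept plus 5 IVs
```

The winning model has five IVs, the same number that backward elimination by AIC retains.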

13 leaps output for Cp
$size (the number of coefficients in each model)
 [1] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6
[39] 6 6 6 6 7

$Cp
 [1]  13.698840 106.302108 106.476860 116.881841 122.707162 127.394846
 [7]  12.389449  12.837184  15.406872  15.452274  15.674598  73.964510
[13]  85.974204  87.695093  92.363796  96.320923   6.959627   7.135037
[19]  11.616680  13.345306  13.897406  14.761903  14.772916  17.258776
[25]  17.405958  17.424267   4.879958   8.103512   8.205573   8.868324
[31]   9.069700  12.903931  13.346755  14.678848  16.705777  19.255019
[37]   5.106275   6.846150   9.934837  10.168497  14.511122  51.723275
[43]   7.000000

Given the set of Cp values for the 43 alternative models, which one is closest to 6?
    solCp
    bestModelInd <- which(abs(solCp$Cp-6)==min(abs(solCp$Cp-6)))
    solCp$which[bestModelInd,]
This leads to a model that is suboptimal based on AIC or adjusted R².
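The same selection can be reproduced from the printed values alone. The sketch below applies the slide's rule (Cp closest to 6) and, for comparison, two conventional readings of Mallows' Cp that are not on the slide: the smallest Cp, and the gap between Cp and each model's own number of coefficients. The vector names size and Cp stand in for solCp$size and solCp$Cp.

```r
# $size and $Cp copied from the leaps output on this slide.
size <- rep(2:7, times = c(6, 10, 10, 10, 6, 1))
Cp <- c(13.698840, 106.302108, 106.476860, 116.881841, 122.707162, 127.394846,
        12.389449, 12.837184, 15.406872, 15.452274, 15.674598, 73.964510,
        85.974204, 87.695093, 92.363796, 96.320923, 6.959627, 7.135037,
        11.616680, 13.345306, 13.897406, 14.761903, 14.772916, 17.258776,
        17.405958, 17.424267, 4.879958, 8.103512, 8.205573, 8.868324,
        9.069700, 12.903931, 13.346755, 14.678848, 16.705777, 19.255019,
        5.106275, 6.846150, 9.934837, 10.168497, 14.511122, 51.723275,
        7.000000)

# Rule used on the slide: Cp closest to the number of IVs (6).
closestTo6 <- which.min(abs(Cp - 6))
print(c(closestTo6, Cp[closestTo6], size[closestTo6]))   # 38, 6.846150, 6

# Comparison (not from the slide): the smallest Cp overall.
smallestCp <- which.min(Cp)
print(c(smallestCp, Cp[smallestCp], size[smallestCp]))   # 27, 4.879958, 5

# Comparison (not from the slide): gap between Cp and the model's own size.
gap <- Cp - size
print(round(gap[c(27, 37, 38, 43)], 3))
```

Model 38 is not the model with the highest adjusted R² (that is model 37), which illustrates the slide's remark that the closest-to-6 rule picks a suboptimal model.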

14 Criteria used in model selection
- Adjusted R² (R²a)
- Cp
- SBC (BIC)
- AIC
- Significance level
Burnham, K. P. and D. R. Anderson. 2002. Model selection and multimodel inference: a practical information-theoretic approach. 2nd ed. Springer. (Best book on model selection)

