STATS 330: Lecture 15 (10/1/2015). Variable selection: aim of today's lecture is to describe some further techniques.


STATS 330: Lecture 15

Variable selection

Aim of today's lecture:
- To describe some further techniques for selecting the explanatory variables for a regression
- To compare the techniques and apply them to several examples

Variable selection: Stepwise methods

- In the previous lecture, we mentioned a second class of methods for variable selection: stepwise methods.
- The idea is to perform a sequence of steps to eliminate variables from the regression, or add variables to it (or perhaps both).
- Three variations: Backward Elimination (BE), Forward Selection (FS) and Stepwise Regression (a combination of BE and FS).
- Invented when computing power was weak.

Backward elimination

1. Start with the full model containing all k variables.
2. Remove variables one at a time, recording the AIC of each reduced model.
3. Retain the best (k-1)-variable model (smallest AIC).
4. Repeat steps 2 and 3 until no improvement in AIC is possible.

R code

- Use the R function step.
- We need to define an initial model (here the full model, as produced by the R function lm) and a scope (a formula defining the full model).

ffa.lm = lm(ffa ~ ., data = ffa.df)
step(ffa.lm, scope = formula(ffa.lm), direction = "backward")

> step(ffa.lm, scope=formula(ffa.lm), direction="backward")
Start: AIC=-56.6
ffa ~ age + weight + skinfold

            Df  Sum of Sq  RSS  AIC
- skinfold   …          …    …    …   <- smallest AIC
<none>                       …    …
- age        …          …    …    …
- weight     …          …    …    …

Step: AIC=…
ffa ~ age + weight

            Df  Sum of Sq  RSS  AIC
<none>                       …    …   <- smallest AIC
- age        …          …    …    …
- weight     …          …    …    …

Call:
lm(formula = ffa ~ age + weight, data = ffa.df)

Coefficients:
(Intercept)       age    weight
          …         …         …

Forward selection

1. Start with the null model.
2. Fit all one-variable models in turn; pick the model with the best AIC.
3. Fit all two-variable models that contain the variable selected in step 2; pick the one for which the added variable gives the best AIC.
4. Continue in this way until adding further variables does not improve the AIC.

Forward selection (cont)

- Use the R function step.
- As before, we need to define an initial model (the null model in this case) and a scope (a formula defining the full model).

# R code: first make the full and null models:
ffa.lm = lm(ffa ~ ., data = ffa.df)
null.lm = lm(ffa ~ 1, data = ffa.df)
# then do FS
step(null.lm, scope = formula(ffa.lm), direction = "forward")

Step: output (1)

> step(null.lm, scope=formula(ffa.lm), direction="forward")
Start: AIC=…
ffa ~ 1                      <- starts with the constant term only

            Df  Sum of Sq  RSS  AIC
+ weight     …          …    …    …   <- smallest AIC: pick weight
+ age        …          …    …    …
+ skinfold   …          …    …    …
<none>                       …    …

(Results of all possible one- and zero-variable models.)

Final model

Call:
lm(formula = ffa ~ weight + age, data = reg.obj$model)

Coefficients:
(Intercept)    weight       age
          …         …         …

Stepwise Regression

- A combination of BE and FS.
- Start with the null model.
- Repeat: one step of FS, then one step of BE.
- Stop when no improvement in AIC is possible.

Code for Stepwise Regression

# first define the null model
null.lm <- lm(ffa ~ 1, data = ffa.df)
# then do stepwise regression, using the R function "step"
step(null.lm, scope = formula(ffa.lm), direction = "both")

Note the difference from FS (use "both" instead of "forward").

Example: Evaporation data

Recall from Lecture 14: the variables are
evap: the amount of moisture evaporating from the soil in the 24 hour period (response)
maxst: maximum soil temperature over the 24 hour period
minst: minimum soil temperature over the 24 hour period
avst: average soil temperature over the 24 hour period
maxat: maximum air temperature over the 24 hour period
minat: minimum air temperature over the 24 hour period
avat: average air temperature over the 24 hour period
maxh: maximum relative humidity over the 24 hour period
minh: minimum relative humidity over the 24 hour period
avh: average relative humidity over the 24 hour period
wind: average wind speed over the 24 hour period.

Stepwise

evap.lm = lm(evap ~ ., data = evap.df)
null.model <- lm(evap ~ 1, data = evap.df)
step(null.model, scope = formula(evap.lm), direction = "both")

Final output:

Call:
lm(formula = evap ~ maxh + maxat + wind + maxst + avst, data = evap.df)

Coefficients:
(Intercept)    maxh   maxat    wind   maxst    avst
          …       …       …       …       …       …

Conclusion

- All possible regressions (APR) suggested models with variables
  maxat, maxh, wind (CV criterion)
  avst, maxst, maxat, maxh (BIC)
  avst, maxst, maxat, minh, maxh (AIC)
- Stepwise gave the model with variables maxat, avst, maxst, maxh, wind

Caveats

- There is no guarantee that the stepwise algorithm will find the best predicting model.
- The selected model usually has an inflated R², and standard errors and p-values that are too low.
- Collinearity can make the selected model quite arbitrary: collinearity is a data property, not a model property.
- For both methods of variable selection, do not trust p-values from the final fitted model, and resist the temptation to delete variables that are not significant.

A Cautionary Example

- Body fat data: see Assignment 3, 2010.
- Response: percent body fat (PercentB).
- 14 explanatory variables: Age, Weight, Height, Adi, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Bicep, Forearm, Wrist.
- Assume the true model is
  PercentB ~ Age + Adi + Neck + Chest + Abdomen + Hip + Thigh + Forearm + Wrist

Coefficients:
(Intercept)  Age  Adi  Neck  Chest  Abdomen  Hip  Thigh  Forearm  Wrist
          …    …    …     …      …        …    …      …        …      …

Sigma = …

Example (cont)

- Using R, generate 200 sets of data from this model, using the same X's but new random errors each time.
- For each set, choose a model by BE and record the coefficients. If a variable is not in the chosen model, record its coefficient as 0.
- Results are summarized on the next slide.
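The simulation above can be sketched in base R as follows. This is a minimal sketch, not the assignment's actual code: bodyfat.df (the explanatory variables), beta (the length-15 true coefficient vector, with zeros for variables outside the true model) and sigma are assumed to be set up beforehand.

```r
# Sketch of the selection-instability simulation.
# Assumed inputs: bodyfat.df (14 explanatory variables),
# beta (true coefficients, zeros for excluded variables), sigma.
set.seed(330)
n.reps <- 200
X <- model.matrix(~ ., data = bodyfat.df)          # same X's each time
coef.record <- matrix(0, n.reps, ncol(X))
colnames(coef.record) <- colnames(X)
for (i in 1:n.reps) {
  y <- drop(X %*% beta) + rnorm(nrow(X), sd = sigma)  # new random errors
  sim.df <- data.frame(PercentB = y, bodyfat.df)
  full.lm <- lm(PercentB ~ ., data = sim.df)
  be.lm <- step(full.lm, direction = "backward", trace = 0)  # BE
  coef.record[i, names(coef(be.lm))] <- coef(be.lm)  # 0 if dropped
}
# percentage of the 200 replications in which each variable was selected
round(100 * colMeans(coef.record[, -1] != 0), 1)
```

The coef.record matrix also gives the sampling distribution of each coefficient estimate (including the spike at 0), which is what the "Distribution of estimates" slide displays.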

Results (1)

Variable   true coef   % selected (out of 200)
Age             …            …
Weight          …            …
Height          …            …
Adi             …            …
Neck            …            …
Chest           …            …
Abdomen         …            …
Hip             …            …
Thigh           …            …
Knee            …            …
Ankle           …            …
Bicep           …            …
Forearm         …            …
Wrist           …            …

The true model was selected only 6 times out of 200!

Distribution of estimates

(Plots of the coefficient estimates over the 200 replications are shown on the slide.)

Bias in coefficient of Abdomen

Suppose we want to estimate the coefficient of Abdomen. Various strategies:
1. Pick a model using BE; use the coefficient of Abdomen in the chosen model.
2. Pick a model using BIC; use the coefficient of Abdomen in the chosen model.
3. Pick a model using AIC; use the coefficient of Abdomen in the chosen model.
4. Pick a model using adjusted R²; use the coefficient of Abdomen in the chosen model.
5. Use the coefficient of Abdomen in the full model.

Which is best? We can generate 200 data sets and compare.

Bias results

The table gives the MSE, i.e. the average of the squared differences (estimate − true value)², multiplied by 10⁴ and averaged over all 200 replications.

Full   BE   BIC   AIC   S2S
   …    …     …     …     …

Thus, the full model is best!

Estimating the "optimism" in R²

- We noted (Caveats, slide 16) that the R² for the selected model is usually higher than the R² for the model fitted to new data.
- How can we adjust for this?
- If we have plenty of data, we can split the data into a "training set" and a "test set", select the model using the training set, then calculate the R² for the test set.

Example: the Georgia voting data

- In the 2000 US presidential election, some voters had their ballots declared invalid for various reasons.
- In this data set, the response is the "undercount" (the difference between the votes cast and the votes declared valid).
- Each observation is a Georgia county, of which there were 159. We removed 4 outliers, leaving 155 counties.
- We will consider a model with 5 explanatory variables:
  undercount ~ perAA + rural + gore + bush + other
- The data are in the faraway package.

Calculating the optimism

- We split the data into two parts at random: a training set of 80 counties and a test set of 75.
- Using the training set, we selected a model using stepwise regression and calculated its R².
- We then took the chosen model and recalculated the R² using the test set. The difference is the "optimism".
- We repeated this for 50 random 80/75 splits, getting 50 training-set R²'s and 50 test-set R²'s.
- Boxplots of these are shown next.
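One such split can be sketched in R as below. This is an illustrative sketch, assuming gavote2 is the cleaned 155-county data frame used later on the slides; the helper name one.split is made up for this example.

```r
# One random 80/75 split: select a model by stepwise regression on the
# training set, then compare training-set and test-set R-squared.
one.split <- function() {
  train <- sample(155, 80)                       # 80 training counties
  train.lm <- lm(undercount ~ perAA + rural + gore + bush + other,
                 data = gavote2[train, ])
  chosen <- step(train.lm, direction = "both", trace = 0)
  r2.train <- summary(chosen)$r.squared
  # refit the chosen formula on the 75 test counties
  r2.test <- summary(lm(formula(chosen),
                        data = gavote2[-train, ]))$r.squared
  c(train = r2.train, test = r2.test)
}
R2s <- replicate(50, one.split())                # 50 random splits
optimism <- R2s["train", ] - R2s["test", ]       # one optimism per split
boxplot(t(R2s), names = c("training", "test"))
```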

(Boxplots of the training and test R²'s are shown on the slide.) Note that the training R²'s tend to be bigger.

Optimism

We can also calculate the optimism for the 50 splits: Opt = training R² − test R².

> stem(Opt)
The decimal point is 1 digit(s) to the left of the |
  -1 | …
  -0 | …
   0 | …
   1 | 57
   2 | 34

(Annotation on slide: the optimism tends to be positive.)

What if there is no test set?

- If the data are too few to split into training and test sets, and we choose the model using all the data and compute its R², the result will most likely be too big.
- Perhaps we can estimate the optimism and subtract it off the R² for the chosen model, thus "correcting" the R².
- We need to estimate the optimism averaged over all possible data sets.
- But we have only one data set! How to proceed?

Estimating the optimism

- The optimism is R²(SWR, data) − "true R²" (NB: SWR = stepwise regression).
- It depends on the true, unknown distribution of the data.
- We don't know this distribution, but it is approximated by the "empirical distribution", which puts probability 1/n at each data point.

Resampling

- We can draw a sample from the empirical distribution by choosing a sample of size n at random with replacement from the original data (n = the number of observations in the original data).
- Such a sample is also called a "bootstrap sample" or a "resample".

"Empirical optimism"

- The "empirical optimism" is R²(SWR, resampled data) − R²(SWR, original data).
- We can generate as many values of this estimate as we like by repeatedly drawing samples from the empirical distribution, i.e. by "resampling".

Resampling (cont)

To correct the R²:
1. Compute the empirical optimism for a resample.
2. Repeat for, say, 200 resamples and average the 200 optimisms. This is our estimated optimism.
3. Correct the original R² for the chosen model by subtracting off the estimated optimism.

The result is the "bootstrap-corrected" R².
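The three steps above can be sketched in R as follows, using the slide's definition of empirical optimism. The helper R2.SWR ("fit by stepwise regression, return R²") is invented for this sketch, and gavote2 is again assumed to be the 155-county data frame.

```r
# Bootstrap-corrected R^2, following the slide's recipe.
R2.SWR <- function(dat) {
  full.lm <- lm(undercount ~ perAA + rural + gore + bush + other,
                data = dat)
  summary(step(full.lm, direction = "both", trace = 0))$r.squared
}
n <- nrow(gavote2)
B <- 200
r2.orig <- R2.SWR(gavote2)
opt <- replicate(B, {
  resample <- gavote2[sample(n, n, replace = TRUE), ]  # bootstrap sample
  R2.SWR(resample) - r2.orig        # empirical optimism for one resample
})
corrected.R2 <- r2.orig - mean(opt) # bootstrap-corrected R^2
```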

How well does it work?

(Plot shown on slide.)

Bootstrap estimate of prediction error

- We can also use the bootstrap to estimate prediction error.
- Calculating the prediction error from the training set underestimates the error.
- We estimate the "optimism" from a resample, then repeat and average, as before.
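A hand-rolled sketch of this idea is below (the slides' err.boot appears to be a course-supplied helper; this is not its actual implementation). Here the optimism of one resample is taken as the error of a model fitted to the resample when predicting the original data, minus its error on the resample itself; pred.mse is a name invented for this sketch.

```r
# Sketch of a bootstrap estimate of prediction error (assumed data:
# gavote2, the 155-county Georgia data frame).
pred.mse <- function(fit, dat) mean((dat$undercount - predict(fit, dat))^2)

ga.lm <- lm(undercount ~ perAA + rural + gore + bush + other,
            data = gavote2)
err.train <- pred.mse(ga.lm, gavote2)   # training-set error: too low
n <- nrow(gavote2)
B <- 200
opt <- replicate(B, {
  boot.df <- gavote2[sample(n, n, replace = TRUE), ]
  boot.lm <- lm(formula(ga.lm), data = boot.df)
  pred.mse(boot.lm, gavote2) - pred.mse(boot.lm, boot.df)  # optimism
})
err.corrected <- err.train + mean(opt)  # bootstrap-corrected error
sqrt(err.corrected)                     # root mean square scale
```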

Estimating prediction errors in R

> ga.lm = lm(undercount ~ perAA + rural + gore + bush + other, data = gavote2)
> cross.val(ga.lm)
Cross-validated estimate of root mean square prediction error = …
> err.boot(ga.lm)
$err
[1] …        <- training set estimate, too low
$Err
[1] …        <- bootstrap-corrected estimate
> sqrt(…)
[1] …

Example: prediction error

Body fat data: prediction strategies:
1. Pick the model with minimum CV, estimate its prediction error.
2. Pick the model with minimum BOOT, estimate its prediction error.
3. Use the full model, estimate its prediction error.

Prediction example (cont)

Generate 200 samples. For each sample:
1. Calculate the ratio (using the CV estimate).
2. Calculate the ratio (using the BOOT estimate).
3. Average the ratios over the 200 samples.

Results

Method   CV   BOOT
Ratio     …      …

CV and BOOT are in good agreement. Both ratios are less than 1, so selecting subsets by CV or BOOT gives better predictions.