Penalized Regression, Part 2

Penalized Regression Recall that in penalized regression we rewrite our loss function to include not only the squared-error loss but also a penalty term. Our goal then becomes to minimize this penalized loss function (i.e., the penalized SS). In the regression setting we can write \( M(\theta) \) in terms of our regression parameters \( \beta \) as \( M(\beta) = \sum_{i=1}^{n}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^{p} p(\beta_j) \), where the penalty function takes the form \( p(\beta_j) = |\beta_j|^q \) (q = 2 gives the ridge penalty, q = 1 the lasso penalty).
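As a small illustration (not from the slides), this penalized sum of squares can be computed directly in R; the names beta, X, y, lambda, and q below are hypothetical inputs, with q = 2 giving the ridge penalty and q = 1 the lasso penalty discussed next:

# Minimal sketch: penalized sum of squares for a power penalty.
penalized_ss <- function(beta, X, y, lambda, q = 1) {
  sum((y - X %*% beta)^2) + lambda * sum(abs(beta)^q)   # squared-error loss + penalty
}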

Ridge Regression Last class we discussed ridge regression as an alternative to OLS when covariates are collinear. Ridge regression can reduce the variability and improve the accuracy of a regression model. However, ridge regression provides no means of variable selection. Ideally we want to be able to reduce the variability in a model but also to select which variables are most strongly associated with our outcome.

The Lasso versus Ridge Regression In ridge regression, the objective function is \( \sum_{i=1}^{n}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \). Consider instead the estimator which minimizes \( \sum_{i=1}^{n}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \). The only change is to the penalty function, and while the change is subtle, it has a big impact on our regression estimator.
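To make the contrast concrete, a small hedged sketch using the glmnet package (x and y stand for a generic predictor matrix and response vector; the alpha argument switches the penalty):

library(glmnet)
ridge_fit <- glmnet(x, y, alpha = 0)   # squared (L2) penalty: ridge regression
lasso_fit <- glmnet(x, y, alpha = 1)   # absolute-value (L1) penalty: the lasso
coef(ridge_fit, s = 0.1)               # shrunken but nonzero coefficients
coef(lasso_fit, s = 0.1)               # typically some coefficients exactly zero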

The Lasso The name lasso stands for “Least Absolute Shrinkage and Selection Operator.” As in ridge regression, penalizing the coefficients shrinks them towards zero, but because the lasso penalizes their absolute values, some coefficients are shrunk completely to zero. Solutions in which multiple coefficient estimates are identically zero are called sparse. Thus the penalty performs a kind of continuous variable selection, hence the name.

Geometry of Ridge versus Lasso In the 2-dimensional case, the solid areas represent the constraint regions and the ellipses represent the contours of the least-squares error function.
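The picture corresponds to the equivalent constrained formulations (a sketch in the notation above); because the L1 ball is a diamond with corners on the axes, the lasso solution can land exactly at zero for some coefficients:

% Lasso: least squares subject to an L1 budget t
\hat{\beta}^{\,lasso} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t
% Ridge: least squares subject to an L2 budget t
\hat{\beta}^{\,ridge} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t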

The Lasso Because the lasso penalty involves an absolute value, the objective function is not differentiable and the solution has no closed form. As a result, we must use optimization algorithms to find the minimum. Examples of these algorithms include -Quadratic programming (practical limit ~100 predictors) -Least Angle Regression/LAR (practical limit ~10,000 predictors)

Selection of λ Since the lasso is not a linear estimator, there is no hat matrix H such that \( \hat{y} = Hy \). Thus the degrees of freedom are more difficult to estimate. One approach is to estimate the degrees of freedom by the number of non-zero parameters in the model and then use AIC, BIC, or Cp to select the best λ. Alternatively (and often preferred) we can select λ via cross-validation.
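Both strategies show up in the body fat example later in these slides; as a brief hedged sketch (assuming the bodyfat data used below), lars reports Cp along the path while cv.glmnet does the cross-validation for us:

library(lars)
library(glmnet)
object <- lars(x = as.matrix(bodyfat[, 3:15]), y = as.vector(bodyfat[, 2]), type = "lasso")
object$Cp                       # Cp at each step of the path (labelled from step 0)
fit.cv <- cv.glmnet(x = as.matrix(bodyfat[, 3:15]), y = as.vector(bodyfat[, 2]), alpha = 1)
fit.cv$lambda.min               # lambda with the smallest cross-validated error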

Forward Stagewise Selection An alternative method for variable subset selection designed to handle correlated predictors. It is an iterative process that begins with all coefficients equal to zero and builds the regression function in successive small steps. The algorithm is similar to forward selection in that predictors are added to the model successively. However, it is much more cautious than forward stepwise selection -e.g. for a model with 10 possible predictors, stepwise takes at most 10 steps, while stagewise may take 5000+

Forward Stagewise Selection Stagewise algorithm (a sketch of which appears below): (1) Initialize the model with \( \hat{\beta}_1 = \dots = \hat{\beta}_p = 0 \) and residual \( r = y \) (2) Find the predictor \( X_{j} \) that is most correlated with r and add it to the model (3) Update \( \hat{\beta}_{j} \leftarrow \hat{\beta}_{j} + \delta_j \), where \( \delta_j = \eta \cdot \mathrm{sign}(\mathrm{corr}(r, X_j)) \) -Note, η is a small constant controlling the step length (4) Update \( r \leftarrow r - \delta_j X_{j} \) (5) Repeat steps 2 through 4 until no predictor has any remaining correlation with r
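A minimal R sketch of the algorithm above (an illustration only, not the lars implementation). It assumes X is a standardized predictor matrix and y a centered response; eps plays the role of the small step-length constant η:

stagewise <- function(X, y, eps = 0.01, nsteps = 5000) {
  beta <- rep(0, ncol(X))                 # (1) all coefficients start at zero
  r <- y                                  #     residual starts as the response
  for (step in seq_len(nsteps)) {
    cors <- drop(crossprod(X, r))         # correlation (up to scale) of each predictor with r
    if (max(abs(cors)) < 1e-8) break      # (5) stop when no predictor is correlated with r
    j <- which.max(abs(cors))             # (2) predictor most correlated with r
    delta <- eps * sign(cors[j])          # (3) small step in the sign of that correlation
    beta[j] <- beta[j] + delta
    r <- r - delta * X[, j]               # (4) update the residual
  }
  beta
}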

Stagewise versus Lasso Although the algorithms look entirely different, their results are very similar! They trace very similar paths as predictors are added to the model. Both are special cases of a method called least angle regression (LAR).

Least Angle Regression LAR algorithm: (1) Initialize the model with \( \hat{\beta}_1 = \dots = \hat{\beta}_p = 0 \) and \( r = y \). Also initialize an empty “active set” A (a subset of the predictor indices) (2) Find the predictor \( X_{j_1} \) that is most correlated with r; update the active set to include \( j_1 \) (3) Move \( \hat{\beta}_{j_1} \) toward its least-squares coefficient until some other covariate \( X_{j_2} \) has the same correlation with r that \( X_{j_1} \) does. Update the active set to include \( j_2 \) (4) Update r and move \( (\hat{\beta}_{j_1}, \hat{\beta}_{j_2}) \) along the joint OLS direction for the regression of r on \( (X_{j_1}, X_{j_2}) \) until a third covariate \( X_{j_3} \) is as correlated with r as the first two predictors. Update the active set to include \( j_3 \) (5) Continue until all k covariates have been added to the model

In Pictures Consider a case where we have 2 predictors… Efron et al. 2004

Relationship Between LAR and Lasso LAR is a more general method than the lasso. A modification of the LAR algorithm produces the entire lasso path as λ is varied from 0 to infinity. The modification occurs if a previously non-zero coefficient hits zero at some point in the algorithm. If this occurs, the LAR algorithm is modified so that the coefficient is removed from the active set and the joint direction is recomputed. This modification is the most frequently implemented version of LAR.

Relationship Between LAR and Stagewise LAR is also a more general method than stagewise selection, and we can reproduce the stagewise results using a modified LAR. Start with the LAR algorithm and determine the best direction at each stage. If the direction for any predictor in the active set doesn't agree in sign with corr(r, Xj), adjust it to move in the direction of corr(r, Xj). As the step sizes go to 0, we get this modified version of the LAR algorithm.

Summary of the Three Methods -LARS: uses the least-squares directions in the active set of variables -Lasso: uses the least-squares directions; if a coefficient crosses 0, the variable is removed from the active set -Forward stagewise: uses non-negative least-squares directions in the active set

Degrees of Freedom in LAR and Lasso Consider fitting a LAR model with k < p parameters. Equivalently, use a lasso bound t that constrains the full regression fit. A general definition for the effective degrees of freedom (edf) of an adaptively fit model is \( \mathrm{df}(\hat{y}) = \frac{1}{\sigma^2}\sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i) \). For LAR at the kth step, the edf of the fit vector is exactly k. For the lasso, at any stage in the fit the effective degrees of freedom is approximately the number of predictors in the model.

Software Packages What if we consider lasso, forward stagewise, or LAR as alternatives? There are two packages in R that allow us to do this -lars -glmnet The lars package has the advantage of being able to fit all three model types (plus a typical forward stepwise selection algorithm). However, the glmnet package can fit lasso models for different types of regression -linear, logistic, Cox proportional hazards, multinomial, and Poisson
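A hedged sketch of glmnet's family argument for those other outcome types (x is a predictor matrix; the responses y_binary, y_counts, time, and status are hypothetical):

library(glmnet)
library(survival)
fit_logistic <- glmnet(x, y_binary, family = "binomial", alpha = 1)       # lasso logistic regression
fit_poisson  <- glmnet(x, y_counts, family = "poisson", alpha = 1)        # lasso Poisson regression
fit_cox      <- glmnet(x, Surv(time, status), family = "cox", alpha = 1)  # lasso Cox proportional hazards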

Body Fat Example Recall our regression model

> summary(mod13)
Call:
lm(formula = PBF ~ Age + Wt + Ht + Neck + Chest + Abd + Hip + Thigh + Knee + Ankle + Bicep + Arm + Wrist, data = bodyfat, x = T)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.18849   17.34857  -1.048  0.29551
Age           0.06208    0.03235   1.919  0.05618 .
Wt           -0.08844    0.05353  -1.652  0.09978 .
Ht           -0.06959    0.09601  -0.725  0.46925
Neck         -0.47060    0.23247  -2.024  0.04405 *
Chest        -0.02386    0.09915  -0.241  0.81000
Abd           0.95477    0.08645  11.04   < 2e-16 ***
Hip          -0.20754    0.14591  -1.422  0.15622
Thigh         0.23610    0.14436   1.636  0.10326
Knee          0.01528    0.24198   0.063  0.94970
Ankle         0.17400    0.22147   0.786  0.43285
Bicep         0.18160    0.17113   1.061  0.28966
Arm           0.45202    0.19913   2.270  0.02410 *
Wrist        -1.62064    0.53495  -3.030  0.00272 **

Residual standard error: 4.305 on 238 degrees of freedom
Multiple R-squared: 0.749, Adjusted R-squared: 0.7353
F-statistic: 54.65 on 13 and 238 DF, p-value: < 2.2e-16

Body Fat Example LAR:

> library(lars)
> par(mfrow=c(2,2))
> object <- lars(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), type="lasso")
> plot(object, breaks=F)
> object2 <- lars(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), type="lar")
> plot(object2, breaks=F)
> object3 <- lars(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), type="forward.stagewise")
> plot(object3, breaks=F)
> object4 <- lars(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), type="stepwise")
> plot(object4, breaks=F)

Body Fat Example A closer look at the model:

> object <- lars(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), type="lasso")
> names(object)
 [1] "call"       "type"       "df"         "lambda"     "R2"         "RSS"        "Cp"         "actions"
 [9] "entry"      "Gamrat"     "arc.length" "Gram"       "beta"       "mu"         "normx"      "meanx"
> object$beta
          Age          Wt          Ht        Neck       Chest       Abd          Hip      Thigh       Knee     Ankle
0  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 0.0000000  0.000000000 0.00000000 0.00000000 0.0000000
1  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 0.5164924  0.000000000 0.00000000 0.00000000 0.0000000
2  0.00000000  0.00000000 -0.04395065  0.00000000  0.00000000 0.5314218  0.000000000 0.00000000 0.00000000 0.0000000
3  0.01710504  0.00000000 -0.13752803  0.00000000  0.00000000 0.5621288  0.000000000 0.00000000 0.00000000 0.0000000
4  0.04880181  0.00000000 -0.15894236  0.00000000  0.00000000 0.6550929  0.000000000 0.00000000 0.00000000 0.0000000
5  0.04994577  0.00000000 -0.15905246 -0.02624509  0.00000000 0.6626603  0.000000000 0.00000000 0.00000000 0.0000000
6  0.06499276  0.00000000 -0.15911969 -0.25799496  0.00000000 0.7079872  0.000000000 0.00000000 0.00000000 0.0000000
7  0.06467180  0.00000000 -0.15921694 -0.26404701  0.00000000 0.7118167 -0.004720494 0.00000000 0.00000000 0.0000000
8  0.06022586 -0.01117359 -0.14998300 -0.29599536  0.00000000 0.7527298 -0.022557736 0.00000000 0.00000000 0.0000000
9  0.05710956 -0.02219531 -0.14039586 -0.32675736  0.00000000 0.7842966 -0.035675017 0.00000000 0.00000000 0.0000000
10 0.05853733 -0.04577935 -0.11203059 -0.39386199  0.00000000 0.8425758 -0.101022340 0.09657784 0.00000000 0.0000000
11 0.06132775 -0.07889636 -0.07798153 -0.45141574  0.00000000 0.9142944 -0.171178163 0.20141924 0.00000000 0.1259630
12 0.06214695 -0.08452690 -0.07220347 -0.46528070 -0.01582661 0.9402896 -0.194491760 0.22553958 0.00000000 0.1586161
13 0.06207865 -0.08844468 -0.06959043 -0.47060001 -0.02386415 0.9547735 -0.207541123 0.23609984 0.01528121 0.1739954
        Bicep       Arm     Wrist
0  0.00000000 0.0000000  0.000000
1  0.00000000 0.0000000  0.000000
2  0.00000000 0.0000000  0.000000
3  0.00000000 0.0000000  0.000000
4  0.00000000 0.0000000 -1.169755
5  0.00000000 0.0000000 -1.198047
6  0.00000000 0.2175660 -1.535349
7  0.00000000 0.2236663 -1.538953
8  0.00000000 0.2834326 -1.535810
9  0.04157133 0.3117864 -1.534938
10 0.09096070 0.3635421 -1.522325
11 0.15173471 0.4229317 -1.587661
12 0.17055965 0.4425212 -1.607395
13 0.18160242 0.4520249 -1.620639

Body Fat Example A closer look at the model:

> names(object)
 [1] "call"       "type"       "df"         "lambda"     "R2"         "RSS"        "Cp"         "actions"
 [9] "entry"      "Gamrat"     "arc.length" "Gram"       "beta"       "mu"         "normx"      "meanx"
> object$df
Intercept  1  2  3  4  5  6  7  8  9 10 11 12 13 14
> object$Cp
     0     1     2     3     4     5     6     7     8     9    10
 698.4 93.62 85.47 65.41 30.12 30.51 19.39 20.91 18.68 17.41 12.76
    11    12    13
 10.47 12.06 14.00
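A hedged follow-up using the lars object above: the coefficients at the Cp-minimizing step can be extracted with coef(), keeping in mind that the Cp vector is labelled from step 0:

best.step <- as.numeric(names(which.min(object$Cp)))   # 11 in the output above
coef(object, s = best.step, mode = "step")             # lasso coefficients at that step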

Body Fat Example Glmnet:

> fit <- glmnet(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), alpha=1)
> fit.cv <- cv.glmnet(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), alpha=1)
> plot(fit.cv, sign.lambda=-1)
> fit <- glmnet(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), alpha=1, lambda=0.02123575)

Body Fat Example Glmnet:

> fit <- glmnet(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), alpha=1)
> names(fit)
 [1] "a0"        "beta"      "df"        "dim"       "lambda"    "dev.ratio" "nulldev"   "npasses"   "jerr"
[10] "offset"    "call"      "nobs"
> fit$lambda
 [1] 6.793883455 6.190333574 5.640401401 5.139323686 4.682760334 4.266756812 3.887709897 3.542336464 3.227645056
[10] 2.940909965 2.679647629 2.441595119 2.224690538 2.027055162 1.846977168 1.682896807 1.533392893 1.397170495
[19] 1.273049719 1.159955490 1.056908242 0.963015426 0.877463790 0.799512325 0.728485854 0.663769178 0.604801754
[28] 0.551072833 0.502117041 0.457510347 0.416866389 0.379833128 0.346089800 0.315344136 0.287329832 0.261804242
[37] 0.238546274 0.217354481 0.198045308 0.180451508 0.164420694 0.149814013 0.136504949 0.124378225 0.113328806
[46] 0.103260988 0.094087566 0.085729086 0.078113150 0.071173793 0.064850910 0.059089734 0.053840365 0.049057335
[55] 0.044699216 0.040728261 0.037110075 0.033813318 0.030809436 0.028072411 0.025578535 0.023306209 0.021235749
[64] 0.019349224 0.017630292 0.016064066 0.014636978 0.013336669 0.012151876 0.011072337 0.010088701 0.009192449
[73] 0.008375817 0.007631733 0.006953750 0.006335998 0.005773126 0.005260257

Body Fat Example Glmnet:

> fit.cv <- cv.glmnet(x=as.matrix(bodyfat[,3:15]), y=as.vector(bodyfat[,2]), alpha=1)
> names(fit.cv)
[1] "lambda"     "cvm"        "cvsd"       "cvup"       "cvlo"       "nzero"      "name"       "glmnet.fit"
[9] "lambda.min" "lambda.1se"
> fit.cv$lambda.min
[1] 0.02123575
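A brief follow-up sketch: the coefficients at the cross-validated lambda values can be pulled straight from the cv.glmnet fit:

coef(fit.cv, s = "lambda.min")   # coefficients at the lambda minimizing CV error
coef(fit.cv, s = "lambda.1se")   # more conservative: largest lambda within 1 SE of the minimum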

Ridge versus Lasso Coefficient Paths

Trace Plot

Coefficient paths: Lasso, LARS, Stagewise, Stepwise

Body Fat Example

Variable    OLS      Ridge    Lasso
Age         0.0635   0.0743   0.0607
Weight     -0.0824  -0.0668   0.0000
Height     -0.2391  -0.0922  -0.2639
Neck       -0.3881  -0.4667  -0.3140
Chest      -0.1321   0.0071  -0.0916
Abdomen     0.9017   0.8703   0.8472
Hip        -0.2129  -0.1750  -0.1408
Thigh       0.2026   0.2301   0.1499
Knee       -0.0082  -0.0108
Ankle      -0.0085   0.1374
Bicep       0.1902   0.1561   0.1792
Arm         0.1913   0.4329   0.0563
Wrist      -1.6053  -1.6678  -1.5348

If we remove the outliers and clean up the data before analysis…

Variable    OLS      Ridge    Lasso
Age         0.0653   0.0953   0.0473
Weight     -0.0024   0.02867  0.0000
Height     -0.2391  -0.3987  -0.2926
Neck       -0.3881  -0.2171  -0.1379
Chest      -0.1321   0.0643
Abdomen     0.9017   0.5164   0.7368
Hip        -0.2129   0.0428
Thigh       0.2026   0.1894   0.2710
Knee       -0.0082   0.0584
Ankle      -0.0085  -0.1798
Bicep       0.1902   0.1436   0.0560
Arm         0.1913  -0.0932
Wrist      -1.6053  -1.4160  -1.4049

Body Fat Example What can we do in SAS? SAS can also do cross-validation. However, it only fits linear regression. Here's the basic SAS code:

ods graphics on;
proc glmselect data=bf plots=all;
  model pbf = age wt ht neck chest abd hip thigh knee ankle bicep arm wrist / selection=lasso(stop=none choose=AIC);
run;
ods graphics off;

The GLMSELECT Procedure

LASSO Selection Summary

Step   Effect Entered   Effect Removed   Number Effects In   Number Parms In   AIC
  0    Intercept                                  1                 1          1325.7477
  1    Abd                                        2                 2          1070.4404
  2    Ht                                         3                 3          1064.8357
  3    Age                                        4                 4          1049.4793
  4    Wrist                                      5                 5          1019.1226
  5    Neck                                       6                 6          1019.6222
  6    Arm                                        7                 7          1009.0982
  7    Hip                                        8                 8          1010.6285
  8    Wt                                         9                 9          1008.4396
  9    Bicep                                     10                10          1007.1631
 10    Thigh                                     11                11          1002.3524
 11    Ankle                                     12                12           999.8569*
 12    Chest                                     13                13          1001.4229
 13    Knee                                      14                14          1003.3574

Penalized regression methods are most useful when -high collinearity exists -p >> n Keep in mind that you still need to look at the data first. We could also consider other forms of penalized regression, though in practice these alternatives are rarely used.