Logistic Regression & Elastic Net

Logistic Regression & Elastic Net
Weifeng Li and Hsinchun Chen
Credits: Hui Zou (University of Minnesota), Trevor Hastie (Stanford University), Robert Tibshirani (Stanford University)

Outline
- Logistic Regression
  - Why Logistic Regression?
  - Logistic Regression
  - Model Fitting
  - Application: Making Predictions
- Regularization
  - Motivation
  - Ridge regression
  - LASSO
  - Elastic Net
  - Elastic Net Application
  - Elastic Net vs. LASSO
  - Elastic Net Extensions: Sparse PCA & Kernel Elastic Net
- Conclusion

Why use Logistic Regression?
The dependent variables in many research problems are "limited" to a couple of categories. Examples:
- Whether a user in the hacker community is a key criminal
- Whether a piece of text in hacker social media implies a potential threat
- Whether a patient is diabetic given her symptoms
- Whether the mention of a drug along with some complications suggests an Adverse Drug Effect

Limitations of Linear Regression
Fitting a linear model to data with a binary outcome variable is problematic:
- This approach is analogous to fitting a linear model to the probability of the event, which can only take values between 0 (No) and 1 (Yes).
- When given a new x, the prediction is not restricted to be 0 or 1, which is difficult to interpret (i.e., whether Yes or No).
The logistic regression model is designed for a dependent variable that takes only the values 0 or 1: it models the probability of the event, so predictions stay between 0 and 1. For this reason, logistic regression is widely used for classification problems.
[Figure: a linear fit to binary outcome data; the prediction at a new point x_new is neither 0 (No) nor 1 (Yes), so it is unclear how to interpret it.]

Logistic Regression
For a binary classification problem, where Y ∈ {0, 1}:
- The odds ratio is defined as OR = P(Y=1|X) / P(Y=0|X) ∈ [0, ∞). If OR > 1, Y = 1 is more likely; if OR < 1, Y = 0 is more likely.
- The logit is defined as logit = ln(OR) ∈ (−∞, ∞).
- Logistic regression regresses the logit on the independent variables X:
  logit = ln[ P(Y=1|X) / P(Y=0|X) ] = β'X  ⇒  P(Y=1|X) = 1 / (1 + exp(−β'X)) = g(β'X),
  where g(·) is the logistic function (figure source: Wikipedia).
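
As a minimal illustration of these definitions, the logistic function turns the linear predictor β'x into a probability, from which the odds ratio follows directly. The coefficient values and feature vector below are made-up assumptions, not from the slides:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and a single observation, for illustration only.
beta = np.array([-1.5, 0.8, 2.0])   # intercept plus two predictors
x = np.array([1.0, 0.5, 1.2])       # x[0] = 1 encodes the intercept

logit = beta @ x                    # log-odds, beta'x
p_y1 = logistic(logit)              # P(Y = 1 | x)
odds_ratio = p_y1 / (1.0 - p_y1)    # OR = P(Y=1|x) / P(Y=0|x)

print(f"logit = {logit:.3f}, P(Y=1|x) = {p_y1:.3f}, OR = {odds_ratio:.3f}")
```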

Fitting Logistic Regression Models
Objective of model fitting: given the data (X, Y) = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, find β = (β_1, β_2, …, β_K)' that maximizes the conditional log-likelihood of the data, ℓ(β) = ln P(Y | X; β).
Formally, we want to solve:
argmax_β  ℓ(β) = Σ_{n=1}^{N} [ y_n ln g(β'x_n) + (1 − y_n) ln(1 − g(β'x_n)) ]
(cs229)

Fitting Logistic Regression Models
Objective: argmax_β  ℓ(β) = Σ_{n=1}^{N} [ y_n ln g(β'x_n) + (1 − y_n) ln(1 − g(β'x_n)) ]
Gradient ascent can be used for the optimization:
- Gradient: ∂ℓ(β)/∂β_k = Σ_{n=1}^{N} ( y_n − g(β'x_n) ) x_{nk}.
- Gradient ascent iteratively updates each coefficient until convergence: β_k ← β_k + α ∂ℓ(β)/∂β_k. Choosing the step size from the second derivative, α = 1/ℓ''(β) (i.e., scaling the step by the inverse Hessian), gives Newton's method.
[Figure: illustration of Newton's method iterates, from β_initial through successive estimates (cs229).]
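
A minimal NumPy sketch of this gradient-ascent procedure, using a fixed step size α rather than the Newton step; the simulated data and step-size settings are illustrative assumptions:

```python
import numpy as np

def fit_logistic_gradient_ascent(X, y, lr=0.5, n_iters=2000):
    """Maximize the conditional log-likelihood l(beta) by gradient ascent.

    X  : (N, K) design matrix (include a column of ones for the intercept)
    y  : (N,) vector of 0/1 labels
    lr : fixed step size alpha; the slide's Newton variant would instead
         scale the step by the inverse of the second derivative (Hessian)
    """
    N, K = X.shape
    beta = np.zeros(K)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # g(beta'x_n) for every n
        gradient = X.T @ (y - p)                # sum_n (y_n - g(beta'x_n)) * x_nk
        beta += lr * gradient / N               # averaged gradient for a stable step
    return beta

# Toy usage with simulated data (purely illustrative).
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
true_beta = np.array([0.5, 1.0, -2.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)
print(fit_logistic_gradient_ascent(X, y))       # roughly recovers [0.5, 1.0, -2.0]
```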

Application: Making Predictions
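
A hedged sketch of making predictions with a fitted logistic regression model, using scikit-learn (listed in the Resources slide) on synthetic data; the dataset and parameters are illustrative, not from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]   # P(Y = 1 | x) for each test point
labels = clf.predict(X_test)              # class labels, thresholded at P = 0.5
print("first five probabilities:", proba[:5].round(3))
print("test accuracy:", clf.score(X_test, y_test))
```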

Logistic Regression with More than Two Classes
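
The slide title points to the multiclass case; the usual extension is multinomial (softmax) logistic regression. A minimal sketch with scikit-learn on the bundled iris data (three classes), offered as an assumption of what this slide covers:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                  # three classes
# Recent scikit-learn versions fit a multinomial model by default for multiclass y.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba now returns one probability per class, and each row sums to 1.
print(clf.predict_proba(X[:2]).round(3))
print(clf.predict(X[:2]))
```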

Outline
- Logistic Regression
  - Why Logistic Regression?
  - Logistic Regression
  - Model Fitting
  - Application: Making Predictions
- Regularization
  - Motivation
  - Ridge regression
  - LASSO
  - Elastic Net
  - Elastic Net Application
  - Elastic Net vs. LASSO
  - Elastic Net Extensions: Sparse PCA & Kernel Elastic Net
- Conclusion

Regularization: Motivation
Modern datasets are usually high-dimensional:
- Documents represented as unigrams, bigrams, trigrams, or even higher-order n-grams
- High-resolution images stored pixel by pixel
- DNA microarrays containing at least 10K genes
If the dimensionality of the data (denoted p) is higher than the number of observations (denoted n), the model is under-identified: we cannot find a unique combination of p coefficients such that the model is optimal. Consequently, predictions will not be accurate.
Regularization concerns building a model that effectively reduces the dimensionality of the data (i.e., uses a subset of the predictors).

Regularization Methods
Subset Selection: identify a subset of the p predictors that we believe to be related to the response variable.
- Best Subset Selection: selects the subset with the best performance
- Forward Stepwise Selection: adds predictors one at a time
- Backward Stepwise Selection: iteratively removes the least useful predictor
Dimension Reduction: project the p predictors onto an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations of the variables. For example:
- Principal Component Analysis: finds a low-dimensional representation of a dataset that retains as much of the variation as possible
Shrinkage: fit a model involving all p predictors, but shrink the estimated coefficients toward zero relative to the least squares estimates.
- Ridge Regression, LASSO, and the Elastic Net, which we discuss in the following slides.

Ridge Regression
Ridge regression penalizes the size of the regression coefficients via their squared l2 norm:
argmin_β  Σ_i ( y_i − β'x_i )² + λ Σ_{k=1}^{K} β_k²
The tuning parameter λ controls the relative impact of the two terms on the coefficient estimates. Selecting a good value for λ is critical; cross-validation is used for this.
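
A hedged sketch of selecting λ by cross-validation with scikit-learn's RidgeCV (λ is called alpha there); the synthetic data and candidate grid below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression data, purely for illustration.
X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# Try a grid of candidate penalties and pick the best one by cross-validation.
alphas = np.logspace(-3, 3, 13)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("selected lambda:", ridge.alpha_)
print("l2 norm of the fitted coefficients:", np.linalg.norm(ridge.coef_))
```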

Least Absolute Shrinkage and Selection Operator (LASSO)
The LASSO penalizes the size of the regression coefficients via their l1 norm:
argmin_β  Σ_i ( y_i − β'x_i )² + λ Σ_{k=1}^{K} |β_k|
Limitations:
- If p > n, the LASSO selects at most n variables: the number of selected variables is bounded by the number of observations.
- The LASSO fails to do grouped selection: it tends to select one variable from a group of correlated predictors and ignore the others. (Grouped selection: automatically include a whole group of predictors in the model if one predictor among them is selected.)
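
A small sketch illustrating the sparsity of the l1 penalty in the p > n setting described above; the scikit-learn data generator and the penalty value are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# p = 100 features but only n = 50 observations (10 truly informative features).
X, y = make_regression(n_samples=50, n_features=100, n_informative=10,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=0.5).fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0), "out of", X.shape[1])
# As the slide notes, with p > n the LASSO can select at most n (= 50) variables.
```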

Comparing Ridge and LASSO
The least squares solution is marked as β̂, while the blue diamond and circle represent the lasso (left) and ridge regression (right) constraint regions. The ellipses centered around β̂ represent regions of constant RSS; as the ellipses expand away from the least squares coefficient estimates, the RSS increases. Since the ridge constraint (right) is circular with no sharp points, the intersection will not generally occur on an axis, so the ridge regression coefficient estimates will typically all be non-zero. The lasso constraint (left), however, has corners at each of the axes, so the ellipse will often intersect the constraint region at an axis; when this occurs, one of the coefficients equals zero.

Elastic Net
The Elastic Net penalizes the size of the regression coefficients via both their l1 norm and their squared l2 norm:
argmin_β  Σ_i ( y_i − β'x_i )² + λ₁ Σ_{k=1}^{K} |β_k| + λ₂ Σ_{k=1}^{K} β_k²
- The l1 penalty generates a sparse model.
- The l2 penalty removes the limitation on the number of selected variables, encourages the grouping effect, and stabilizes the l1 regularization path.
[Figure: geometric illustration of the Elastic Net, Ridge regression, and LASSO constraint regions; the elastic net ball has singularities at the vertices (necessary for sparsity) and strictly convex edges (necessary for the grouping effect).]
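
A quick sketch of the grouping effect on three nearly identical predictors, using scikit-learn's Lasso and ElasticNet; the simulated data and penalty values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 100
z = rng.normal(size=n)
# Three nearly identical (highly correlated) predictors forming one "group".
X = np.column_stack([z + 0.01 * rng.normal(size=n) for _ in range(3)])
y = 3 * z + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1, max_iter=50000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=50000).fit(X, y)

# The LASSO tends to put nearly all the weight on one member of the group,
# while the elastic net tends to spread similar coefficients across the group.
print("lasso coefficients:      ", lasso.coef_.round(2))
print("elastic net coefficients:", enet.coef_.round(2))
```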

Elastic Net Application
We show how to use the Elastic Net by conducting a simple linear regression with simulated data.
- Construct an ill-posed problem: number of features >> number of data points
- Generate X randomly
- Total: 200 coefficients; build a sparse model by setting only 10 non-zero coefficients
- Generate Y = β'X + 0.01z, where z is standard normal noise
Source code can be found here: http://scikit-learn.org/stable/_downloads/plot_lasso_and_elasticnet.py

Elastic Net Application
- Specify the Elastic Net model: λ₁ = 0.1; l1 mixing ratio λ₁ / (λ₁ + λ₂) = 0.7
- Run the model
- Make predictions
- Evaluate the predictions using R²
Output:
- Given the ill-posed problem (# features >> # samples), the Elastic Net is able to capture most of the non-zero coefficients.
- The Elastic Net generates sparse estimates of the coefficients: most of the estimates are 0.
- In general, the LASSO gives larger coefficient estimates than the Elastic Net.
[Figure: estimated coefficients plotted against the true coefficients for the Elastic Net and the LASSO.]
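
A sketch along the lines of the cited scikit-learn script (the exact code is at the URL above); the values below follow the slide's description, and details such as the random seed, split, and penalty settings are assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

# Ill-posed problem: many more coefficients (200) than training samples.
rng = np.random.default_rng(42)
n_samples, n_features = 50, 200
X = rng.normal(size=(n_samples, n_features))

# Sparse ground truth: only 10 of the 200 coefficients are non-zero.
beta = np.zeros(n_features)
beta[rng.choice(n_features, size=10, replace=False)] = rng.normal(size=10)
y = X @ beta + 0.01 * rng.normal(size=n_samples)      # Y = beta'X + 0.01 z

X_train, y_train = X[:25], y[:25]
X_test, y_test = X[25:], y[25:]

# alpha and l1_ratio play the roles of the lambda_1 / lambda_2 mix on the slide.
enet = ElasticNet(alpha=0.1, l1_ratio=0.7, max_iter=10000).fit(X_train, y_train)

print("R^2 on held-out data:", round(r2_score(y_test, enet.predict(X_test)), 3))
print("non-zero estimated coefficients:", int(np.sum(enet.coef_ != 0)))
```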

Elastic Net vs. LASSO: A Simple Illustration

Elastic Net vs. LASSO: Solution Paths
(a) LASSO and (b) elastic net (λ₂ = 0.5) solution paths. The lasso paths are unstable, and (a) does not reveal any correlation information by itself. In contrast, the elastic net has much smoother solution paths while clearly showing the 'grouped selection': x1, x2, and x3 are in one 'significant' group and x4, x5, and x6 are in the other 'trivial' group. The decorrelation yields the grouping effect and stabilizes the lasso solution.

Elastic Net Extensions: Constructing Portfolios
Problem: the market has p stocks, with price P_{i,t} at time t. How can we construct portfolios, P_t = Σ_{i=1}^{p} f_i P_{i,t}, in order to minimize risk?
Solution: we want the portfolios to be uncorrelated with each other. How do we construct uncorrelated portfolios? Principal Component Analysis (PCA).
Further problem: trading stocks costs fees. How can we lower the cost of maintaining the portfolios?
Solution: keep as few stocks in each portfolio as possible. How do we achieve this? Sparse Principal Component Analysis (Sparse PCA).

Elastic Net Extensions: Sparse PCA
Obtain principal components (PCs) with sparse loadings. That is, we want the PCs to be sparse linear combinations of the input variables.
[Figure: loadings of ordinary PCA versus sparse PCA on example data.]
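
A minimal sketch contrasting ordinary PCA loadings with sparse loadings, using scikit-learn's SparsePCA (an l1-penalized formulation of sparse PCA, used here as a stand-in for the elastic-net-based SPCA on the slide); the synthetic factor structure is an illustrative assumption:

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
# Latent structure: two underlying factors driving 20 observed variables.
latent = rng.normal(size=(100, 2))
loadings = np.zeros((2, 20))
loadings[0, :5] = 1.0      # factor 1 loads on the first five variables
loadings[1, 5:10] = 1.0    # factor 2 loads on the next five
X = latent @ loadings + 0.1 * rng.normal(size=(100, 20))

pca = PCA(n_components=2).fit(X)
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X)

# Ordinary PCA loadings are dense; sparse PCA drives many loadings to exactly 0.
print("exact-zero loadings, PCA:       ", int(np.sum(pca.components_ == 0)))
print("exact-zero loadings, SparsePCA: ", int(np.sum(spca.components_ == 0)))
```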

Elastic Net Extensions: Kernel Elastic Net
The Elastic Net can be reduced to a linear Support Vector Machine. This reduction enables the estimation of p(y|x) in the SVM: the estimate of p(y|x) comes from the loss function of the Kernel Elastic Net.
[Equations: the linear SVM loss function and the Kernel Elastic Net loss function.]
The implication of this reduction is that SVM solvers can also be used for Elastic Net problems.

Outline
- Logistic Regression
  - Why Logistic Regression?
  - Logistic Regression
  - Model Fitting
  - Application: Making Predictions
- Regularization
  - Motivation
  - Ridge regression
  - LASSO
  - Elastic Net
  - Elastic Net Application
  - Elastic Net vs. LASSO
  - Elastic Net Extensions: Sparse PCA & Kernel Elastic Net
- Conclusion

Conclusion
Logistic regression performs regression on discrete dependent variables.
- It performs better than linear regression at predicting probabilities.
- It builds on the idea of the odds ratio.
- It is fit by numerical optimization (rather than OLS).
- It can be used to predict either the probability or the category.
The Elastic Net performs Ridge regression and the LASSO simultaneously:
- It is able to perform grouped selection.
- It is appropriate for under-identified (p > n) problems.
- It often works better than Ridge regression or the LASSO alone, particularly when predictors are correlated.
- It has interesting applications in sparse PCA and support vector machines.

Resources
Logistic Regression (implemented in most statistical software):
- SAS
- R (glm; glmnet; lmer)
- MATLAB (mnrfit)
- Java (MALLET)
- Python (scikit-learn), etc.
Elastic Net:
- R packages: "elasticnet"; "glmnet: Lasso and elastic-net regularized generalized linear models"; "pensim: Simulation of high-dimensional data and parallelized repeated penalized regression"
- JMP Pro 11
- Python: scikit-learn
- MATLAB: SVEN (Support Vector Elastic Net)