Coefficient Path Algorithms Karl Sjöstrand Informatics and Mathematical Modelling, DTU.

What’s This Lecture About? The focus is on computation rather than methods. – Efficiency – Algorithms provide insight

Loss Functions We wish to model a random variable Y by a function of a set of other random variables, f(X). To measure how far our model f(X) is from Y, we define a loss function L(Y, f(X)).

Loss Function Example Let Y be a vector y of n outcome observations. Let X be an (n×p) matrix X where the p columns are predictor variables. Use squared error loss L(y, f(X)) = ||y − f(X)||². Let f(X) be a linear model with coefficients β, f(X) = Xβ. The loss function is then L(y, Xβ) = ||y − Xβ||², and its minimizer is the familiar OLS solution β = (X^T X)^{-1} X^T y.
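As a quick, self-contained illustration (synthetic data and variable names chosen for this transcript, not taken from the slides), the OLS solution can be computed directly in a few lines of Python:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))      # n = 100 observations, p = 5 predictors
    true_beta = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
    y = X @ true_beta + 0.1 * rng.normal(size=100)

    # Minimize ||y - X beta||^2; closed form: beta = (X^T X)^{-1} X^T y
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_ols)                    # close to true_beta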

Adding a Penalty Function We get different results if we consider a penalty function J(β) along with the loss function: we now minimize L(y, f(X)) + λJ(β) rather than the loss alone. The parameter λ defines the amount of penalty.

Virtues of the Penalty Function Imposes structure on the model – to avoid computational difficulties (unstable estimates, non-invertible matrices) – to reflect prior knowledge – to perform variable selection (sparse solutions are easier to interpret)

Effect of the Penalty We prefer a simple (interpretable) model with stellar performance. Are these properties contradictory? – Yes – and no. Image from Elements of Statistical Learning

Selecting a Suitable Model We must evaluate models for many different values of λ – for instance when doing cross-validation: for each training and test set, evaluate the solution β(λ) for a suitable set of values of λ. Each evaluation of β(λ) may be expensive.

Topic of this Lecture Algorithms for estimating β(λ) for all values of the parameter λ. Plotting the coefficient vector β(λ) with respect to λ yields a coefficient path.

Example Path – Ridge Regression Regression – Quadratic loss, quadratic penalty

Example Path - LASSO Regression – Quadratic loss, piecewise linear penalty
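As an aside (not part of the original slides), a path like the one shown here can be computed and plotted with scikit-learn's LARS-based routine; the data below is synthetic, and the exact scaling of λ differs between implementations:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import lars_path

    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 10))
    y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=80)

    # method="lasso" returns the exact piecewise linear LASSO path and its knots
    alphas, _, coefs = lars_path(X, y, method="lasso")

    plt.plot(alphas, coefs.T)          # one piecewise linear curve per coefficient
    plt.xlabel("lambda")
    plt.ylabel("coefficient value")
    plt.title("LASSO coefficient path")
    plt.show()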

Example Path – Support Vector Machine Classification – details on loss and penalty later

Example Path – Penalized Logistic Regression Classification – non-linear loss, piecewise linear penalty Image from Rosset, NIPS 2004

Path Properties

Piecewise Linear Paths What is required from the loss and penalty functions for piecewise linearity? One condition is that ∂β(λ)/∂λ is a piecewise constant vector in λ.

Condition for Piecewise Linearity

Tracing the Entire Path From a starting point along the path (e.g. λ=∞), we can easily create the entire path if: – the direction ∂β(λ)/∂λ is known – the knots where this direction changes can be worked out

The Piecewise Linear Condition
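The equation on this slide is not preserved in the transcript. Based on the Rosset and Zhu (2004) reference, the expression in question is presumably the path derivative obtained by differentiating the optimality condition ∇L(β(λ)) + λ∇J(β(λ)) = 0 with respect to λ:

    \frac{\partial \beta(\lambda)}{\partial \lambda}
      = -\Big( \nabla^2 L\big(\beta(\lambda)\big) + \lambda\, \nabla^2 J\big(\beta(\lambda)\big) \Big)^{-1}
        \nabla J\big(\beta(\lambda)\big)

The path is piecewise linear exactly when this vector is piecewise constant in λ.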

Sufficient and Necessary Condition A sufficient and necessary condition for linearity of β(λ) at λ_0: – the expression above is a constant vector with respect to λ in a neighborhood of λ_0.

A Stronger Sufficient Condition...but not a necessary condition: The loss is a piecewise quadratic function of β; the penalty is a piecewise linear function of β. (In the expression above, the Hessian ∇²L is then constant, the term λ∇²J disappears, and the gradient ∇J is constant.)

Implications of this Condition Loss functions may be – Quadratic (standard squared error loss) – Piecewise quadratic – Piecewise linear (a variant of piecewise quadratic) Penalty functions may be – Linear (SVM "penalty") – Piecewise linear (L1 and L∞)

Condition Applied - Examples Ridge regression – Quadratic loss – ok – Quadratic penalty – not ok LASSO – Quadratic loss – ok – Piecewise linear penalty - ok

When do Directions Change? Directions are only valid where L and J are differentiable. – LASSO: L is differentiable everywhere; J is not differentiable where a coefficient is 0. Directions change when a coefficient touches 0. – Variables either become 0, or leave 0 – Denote the set of non-zero variables A – Denote the set of zero variables I

An algorithm for the LASSO Quadratic loss, piecewise linear penalty We now know it has a piecewise linear path! Let’s see if we can work out the directions and knots

Reformulating the LASSO
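The reformulation itself is not preserved in the transcript. A standard choice, assumed here because it matches the Lagrangian/KKT treatment on the next slide, is to split each coefficient into non-negative parts, β_j = β_j^+ − β_j^-, so that the non-differentiable L1 penalty becomes linear under linear constraints:

    \min_{\beta^+ \geq 0,\; \beta^- \geq 0} \;
      \tfrac{1}{2}\,\lVert y - X(\beta^+ - \beta^-)\rVert^2
      + \lambda \sum_{j=1}^{p} \big(\beta_j^+ + \beta_j^-\big)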

Useful Conditions Lagrange primal function KKT conditions
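The equations are missing from the transcript. Under the split formulation above (a reconstruction, with multipliers λ_j^+, λ_j^- ≥ 0 introduced here), the Lagrange primal and the KKT conditions would read:

    L_P = \tfrac{1}{2}\,\lVert y - X(\beta^+ - \beta^-)\rVert^2
          + \lambda \sum_j \big(\beta_j^+ + \beta_j^-\big)
          - \sum_j \lambda_j^+ \beta_j^+
          - \sum_j \lambda_j^- \beta_j^-

    -x_j^T (y - X\beta) + \lambda - \lambda_j^+ = 0, \qquad
     x_j^T (y - X\beta) + \lambda - \lambda_j^- = 0, \qquad
     \lambda_j^+ \beta_j^+ = 0, \qquad \lambda_j^- \beta_j^- = 0

Together these imply |x_j^T(y − Xβ)| ≤ λ, with equality whenever β_j ≠ 0, which is the property stated on the next slide.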

LASSO Algorithm Properties Coefficients are nonzero only if |x_j^T(y − Xβ)| = λ; these variables form the active set A. For zero variables, the set I, |x_j^T(y − Xβ)| ≤ λ.

Working out the Knots (1) First case: a variable becomes zero ( A to I ). Assume we know the current coefficients β and the path directions.
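The step formula is not preserved in the transcript. Writing Δβ_A for the direction of the active coefficients per unit decrease of λ (worked out on the Path Directions slide below), a plausible reconstruction is that active variable j reaches zero after a step

    d_j = -\,\beta_j / \Delta\beta_j, \qquad j \in A,

and the first such event is the smallest positive d_j.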

Working out the Knots (2) Second case: a variable becomes non-zero ( I to A ). For inactive variables, the correlations with the residual change with λ; a variable becomes active when its absolute correlation reaches λ. (Figure: correlation paths, annotated with the algorithm direction and the second added variable.)

Working out the Knots (3) For some scalar d, the correlation of an inactive variable j will reach the penalty level λ. – This is where variable j becomes active! – Solve for d:
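The solve-for-d equations are not preserved in the transcript. A plausible reconstruction, writing c_j = x_j^T(y − Xβ) and a_j = x_j^T X_A Δβ_A for the rate at which c_j changes along the current direction: inactive variable j becomes active at the step d satisfying

    c_j - d\, a_j = \pm(\lambda - d)
    \;\;\Longrightarrow\;\;
    d = \frac{\lambda - c_j}{1 - a_j}
    \quad\text{or}\quad
    d = \frac{\lambda + c_j}{1 + a_j},

and the algorithm takes the smallest positive such d over all j in I.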

Path Directions Directions for non-zero variables
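The formula is not preserved in the transcript. For the criterion ½||y − Xβ||² + λ||β||₁ it follows from differentiating the equality condition on the active set, so the following is a reconstruction rather than a verbatim slide:

    \frac{\partial \beta_A}{\partial \lambda} = -\big( X_A^T X_A \big)^{-1} \operatorname{sgn}(\beta_A),
    \qquad \beta_I \equiv 0

Per unit decrease of λ, the active coefficients therefore move along Δβ_A = (X_A^T X_A)^{-1} sgn(β_A), while the inactive ones stay at zero.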

The Algorithm while I is not empty – Work out the minimal distance d where a variable is either added or dropped – Update sets A and I – Update β = β + d·Δβ – Calculate new directions end (A Python sketch of this loop follows below.)
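To make the loop concrete, here is a rough Python sketch of such a path-following algorithm for ½||y − Xβ||² + λ||β||₁. It is an illustration written for this transcript rather than the lecture's own code: the function name and tolerance handling are invented, and a production implementation (e.g. LARS-LASSO) treats degenerate cases more carefully.

    import numpy as np

    def lasso_path_sketch(X, y, tol=1e-10):
        """Follow the piecewise linear LASSO path from lambda_max down to 0.
        Returns the knot values of lambda and the coefficients at each knot."""
        n, p = X.shape
        beta = np.zeros(p)
        c = X.T @ y                       # correlations with the residual (residual = y at start)
        lam = np.abs(c).max()             # for lambda >= lam all coefficients are zero
        A = [int(np.abs(c).argmax())]     # active set: indices of non-zero coefficients
        lams, betas = [lam], [beta.copy()]

        while lam > tol and len(A) <= min(n, p):
            XA = X[:, A]
            s = np.sign(c[A])                       # signs of the active correlations
            dbeta = np.linalg.solve(XA.T @ XA, s)   # direction of beta_A per unit decrease of lambda
            a = X.T @ (XA @ dbeta)                  # rate of change of all correlations

            step, event = lam, ("end", None)        # default: go all the way to lambda = 0
            # knot type 1: an inactive variable's correlation reaches the bound (I -> A)
            for j in range(p):
                if j in A:
                    continue
                for num, den in ((lam - c[j], 1.0 - a[j]), (lam + c[j], 1.0 + a[j])):
                    if abs(den) > tol and tol < num / den < step:
                        step, event = num / den, ("add", j)
            # knot type 2: an active coefficient hits zero (A -> I)
            for idx, j in enumerate(A):
                if abs(dbeta[idx]) > tol and tol < -beta[j] / dbeta[idx] < step:
                    step, event = -beta[j] / dbeta[idx], ("drop", j)

            # move to the next knot and update the active set
            beta[A] += step * dbeta
            lam -= step
            c = X.T @ (y - X @ beta)
            kind, j = event
            if kind == "add":
                A.append(j)
            elif kind == "drop":
                A.remove(j)
                beta[j] = 0.0
            lams.append(lam)
            betas.append(beta.copy())
            if kind == "end":
                break

        return np.array(lams), np.array(betas)

At λ = 0 (and p < n) the final coefficients should match the ordinary least-squares fit, which is a convenient sanity check.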

Complexity Roughly O(n²p) – about the same complexity as for a single least-squares fit.

Variants – Huberized LASSO Use a piecewise quadratic loss which is less sensitive to outliers.
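The loss itself is not shown in the transcript; the Huberized LASSO of Rosset and Zhu uses a Huber-type loss on each residual r, which for a threshold δ (notation introduced here) is

    L_\delta(r) =
    \begin{cases}
      \tfrac{1}{2}\, r^2, & |r| \le \delta \\
      \delta\,|r| - \tfrac{1}{2}\,\delta^2, & |r| > \delta
    \end{cases}

i.e. quadratic near zero and linear in the tails, hence piecewise quadratic in β and still covered by the sufficient condition above.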

Huberized LASSO Same path algorithm applies – With a minor change due to the piecewise loss

Variants - SVM Dual SVM formulation – Quadratic "loss" – Linear "penalty"
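The formulation is not preserved in the transcript. For reference, the standard soft-margin SVM dual (the slide's exact parameterization may differ) is

    \max_{\alpha} \; \sum_i \alpha_i
      - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j
    \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0

The quadratic term plays the role of the "loss" and the linear term that of the "penalty", so the earlier condition applies and the dual variables follow a piecewise linear path in the regularization parameter (Hastie et al. 2004).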

A few Methods with Piecewise Linear Paths Least Angle Regression LASSO (+variants) Forward Stagewise Regression Elastic Net The Non-Negative Garrote Support Vector Machines (L1 and L2) Support Vector Domain Description Locally Adaptive Regression Splines

References – Rosset and Zhu (2004), Piecewise Linear Regularized Solution Paths – Efron et al. (2003), Least Angle Regression – Hastie et al. (2004), The Entire Regularization Path for the Support Vector Machine – Zhu, Rosset et al. (2003), 1-norm Support Vector Machines – Rosset (2004), Tracking Curved Regularized Solution Paths – Park and Hastie (2006), An L1-regularization Path Algorithm for Generalized Linear Models – Friedman et al., Regularization Paths for Generalized Linear Models via Coordinate Descent

Conclusion We have defined conditions which help identify problems with piecewise linear paths – ...and shown that efficient algorithms exist. Having access to solutions for all values of the regularization parameter is important when selecting a suitable model.

Questions?