"Ridge Regression: Biased Estimation for Nonorthogonal Problems" by A.E. Hoerl and R.W. Kennard
"Regression Shrinkage and Selection via the Lasso" by Robert Tibshirani
Presented by: John Paisley, Duke University, Dept. of ECE

Introduction
Consider an overdetermined system of linear equations (more equations than unknowns). We want to find "optimal" solutions according to different criteria. This motivates our discussion of the following topics:
- Least Squares Estimation
- Ridge Regression
- Lasso
- ML and MAP Interpretations of LS, RR and Lasso
- Relationship with the RVM and the Bayesian Lasso. With the RVM and the Lasso, we let the system become underdetermined when working with compressed sensing. However, as I understand it, the theory of compressed sensing is based on the idea that the true solution is actually an overdetermined system hiding within an underdetermined matrix.

Least Squares Estimation
Consider an overdetermined system of linear equations, y ≈ Ax. Least squares minimizes the magnitude of the error vector, which means solving

    x_LS = argmin_x ||y - Ax||^2.

By recognizing that the estimate, y* = A x_LS, should be orthogonal to the error y - A x_LS (i.e., A^T (y - A x_LS) = 0), we can obtain the least squares solution for x:

    x_LS = (A^T A)^{-1} A^T y.
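
A minimal numerical sketch of this (my own illustration, not from the slides; the matrix A, vector y, and variable names are arbitrary), solving the normal equations and checking them against NumPy's built-in least-squares routine:

# Sketch: solve an overdetermined system y ~ Ax by least squares and
# verify the normal-equation solution and the orthogonality property.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))          # 100 equations, 5 unknowns (overdetermined)
x_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = A @ x_true + 0.1 * rng.standard_normal(100)

# Normal equations: x_LS = (A^T A)^{-1} A^T y
x_ls = np.linalg.solve(A.T @ A, A.T @ y)

# Same answer from NumPy's least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x_ls, x_lstsq))

# The residual y - A x_LS is orthogonal to the columns of A
print(np.allclose(A.T @ (y - A @ x_ls), 0.0, atol=1e-8))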

Ridge Regression
There are issues with the LS solution. Consider the generative interpretation of the overdetermined system, y = Ax + e with e ~ N(0, σ^2 I). Then the following can be shown to be true:

    E[ ||x_LS - x||^2 ] = σ^2 trace( (A^T A)^{-1} ) = σ^2 Σ_i 1/γ_i,

where the γ_i are the eigenvalues of A^T A. When A^T A has very small eigenvalues, the variance of the least squares estimate can lead to x vectors that "blow up," which is bad when it is x that we're really interested in. Ridge regression keeps the values in x from blowing up by introducing a penalty to the least squares objective function,

    x_RR = argmin_x ||y - Ax||^2 + λ ||x||^2,

which has the solution:

    x_RR = (A^T A + λ I)^{-1} A^T y.
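
A small sketch (my own construction, not from the slides; the near-collinear columns and the penalty value lam = 1.0 are arbitrary choices) showing how the ridge penalty tames an ill-conditioned A^T A:

# Sketch: compare LS and ridge when A^T A is nearly singular,
# so the LS coefficients blow up while the ridge coefficients stay stable.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
base = rng.standard_normal(n)
# Two nearly collinear columns make A^T A ill-conditioned
A = np.column_stack([base,
                     base + 1e-4 * rng.standard_normal(n),
                     rng.standard_normal(n)])
x_true = np.array([1.0, 1.0, -2.0])
y = A @ x_true + 0.1 * rng.standard_normal(n)

x_ls = np.linalg.solve(A.T @ A, A.T @ y)                 # can be huge
lam = 1.0                                                # arbitrary penalty
x_rr = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)

print("LS:   ", x_ls)      # large, unstable values
print("Ridge:", x_rr)      # shrunk toward zero, stable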

Ridge Regression: Geometric Interpretation
The least squares objective function for any x can be written as:

    ||y - Ax||^2 = ||y - A x_LS||^2 + (x - x_LS)^T A^T A (x - x_LS).

Consider a variation of the ridge regression problem, stated as a constraint rather than a penalty:

    minimize ||y - Ax||^2  subject to  ||x||^2 ≤ t.

The constraint produces a feasible region (the gray disk in the slide's figure). The least squares objective is a constant plus a "Gaussian"-shaped quadratic with mean x_LS and precision matrix A^T A, so its contours are ellipses centered at x_LS. The solution is the point where the contours first touch the feasible region.
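
A short derivation of the quadratic decomposition above (a standard completion-of-squares argument, filled in here because the slide's equation was an image):

\[
\begin{aligned}
\|y - Ax\|^2
  &= \|(y - Ax_{\mathrm{LS}}) + A(x_{\mathrm{LS}} - x)\|^2 \\
  &= \|y - Ax_{\mathrm{LS}}\|^2
     + (x - x_{\mathrm{LS}})^\top A^\top A \,(x - x_{\mathrm{LS}})
     + 2\,(y - Ax_{\mathrm{LS}})^\top A\,(x_{\mathrm{LS}} - x),
\end{aligned}
\]
and the cross term vanishes because the normal equations give
\(A^\top (y - Ax_{\mathrm{LS}}) = 0\). The remaining quadratic has its minimum (its "mean") at \(x_{\mathrm{LS}}\) with precision matrix \(A^\top A\).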

Lasso
Take ridge regression, but change the constraint from ||x||^2 ≤ t to ||x||_1 ≤ t. The circle becomes a diamond because that is what contours of equal length look like for the L1 norm. The corners of the constraint region indicate that there can be coefficients that are exactly zero. In general, the constraint can be written as Σ_j |x_j|^q ≤ t, with q = 2 giving ridge regression and q = 1 giving the lasso. Note that when the constraint is stated in terms of a penalty, then for ridge regression and lasso the feasible region is replaced with a second convex function having the appropriate contours. The sum of these two convex functions produces another convex function with a minimum. Solving for x, however, is harder for the lasso because the solution is no longer analytic; it can be posed as a quadratic program, as in Tibshirani's paper, or solved with other convex optimization methods.
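
Since the slides do not give an algorithm, here is one illustrative way to solve the penalized form (iterative soft-thresholding / proximal gradient, not the quadratic-programming approach of the paper); A, y, lam, and the step size are my own choices:

# Sketch: lasso via iterative soft-thresholding (ISTA).
# Chosen for illustration only; Tibshirani's paper used quadratic programming.
import numpy as np

def soft_threshold(v, thresh):
    """Proximal operator of thresh * ||.||_1 (shrinks entries toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def lasso_ista(A, y, lam, n_iter=5000):
    """Minimize 0.5*||y - Ax||^2 + lam*||x||_1 by proximal gradient."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 10))
x_true = np.zeros(10)
x_true[[0, 3]] = [2.0, -3.0]                 # sparse ground truth
y = A @ x_true + 0.1 * rng.standard_normal(100)

print(lasso_ista(A, y, lam=5.0).round(3))    # most coefficients come out exactly zero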

ML and MAP Interpretations
Finding the ML solution to the Gaussian likelihood y ~ N(Ax, σ^2 I) produces the LS solution. If we place a zero-mean Gaussian prior on x, x ~ N(0, σ^2/λ I), and find the MAP solution, we see that we are maximizing the negative of the RR objective, so the MAP estimate is the ridge solution with penalty λ. Furthermore, given that both the likelihood and the prior are Gaussian, the posterior is Gaussian as well, and its mean is the RR solution, since

    E[x | y] = (A^T A + λ I)^{-1} A^T y.

If we were instead to place a double-exponential (Laplace) prior on x, we would see that the MAP solution is the lasso solution.
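
A brief derivation of the MAP claim, filling in the equations that appeared as images on the slide (the parameterization of the prior variance as σ^2/λ is my choice):

\[
-\log p(x \mid y) \;=\; \frac{1}{2\sigma^2}\,\|y - Ax\|^2
 \;+\; \frac{\lambda}{2\sigma^2}\,\|x\|^2 \;+\; \mathrm{const},
 \qquad x \sim \mathcal{N}\!\big(0,\ \tfrac{\sigma^2}{\lambda} I\big),
\]
so maximizing the posterior is equivalent to minimizing the ridge objective
\(\|y - Ax\|^2 + \lambda \|x\|^2\). With a double-exponential prior
\(p(x) \propto \exp(-\alpha \|x\|_1)\) instead, the penalty term becomes
\(\alpha \|x\|_1\), and (up to an overall factor of \(2\sigma^2\)) the MAP problem is the
lasso objective \(\|y - Ax\|^2 + \lambda \|x\|_1\) with \(\lambda = 2\sigma^2 \alpha\).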

RVM and Bayesian Lasso
With Tikhonov regularization, we can change the RR penalty to penalize each dimension separately with a diagonal penalty matrix D (distinct from the design matrix), giving the penalty x^T D x and the solution x = (A^T A + D)^{-1} A^T y. This suggests that the RVM is a "Bayesian ridge" solution, where at each iteration we update the penalty on each dimension, with a prior that favors high penalties, which enforces sparseness. The diagonal penalty matrix changes the illustration shown earlier: the constraint region is no longer a circle, but an ellipse with symmetry about the axes. As the penalty along one dimension increases, the ellipse is squeezed toward the origin along that axis. As a penalty along one dimension goes to infinity, as it can do with the RVM, we squeeze that dimension out of existence. This changes the interpretation of zero coefficients. We don't need the lasso anymore, because the penalty itself ensures that the solution lands on many exact zeros. Without the Bayesian approach, the penalty isn't allowed to do this, which is why the lasso and the optimization machinery behind it are used. I think this is why the RVM and the Bayesian lasso produce almost identical results for us.
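
A small sketch of the per-dimension penalty idea (my own construction, not from the slides, and not a full RVM implementation): a generalized ridge solution x = (A^T A + D)^{-1} A^T y in which driving one diagonal entry of D to a very large value squeezes the corresponding coefficient to zero.

# Sketch: per-dimension (Tikhonov / generalized ridge) penalties.
# Only illustrates that a huge penalty on one dimension forces that
# coefficient to (near) zero; it does not implement the RVM updates.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((80, 4))
x_true = np.array([1.5, -1.0, 2.0, 0.5])
y = A @ x_true + 0.1 * rng.standard_normal(80)

def generalized_ridge(A, y, penalties):
    """Solve min ||y - Ax||^2 + x^T D x with D = diag(penalties)."""
    D = np.diag(penalties)
    return np.linalg.solve(A.T @ A + D, A.T @ y)

print(generalized_ridge(A, y, [1.0, 1.0, 1.0, 1.0]))    # ordinary ridge
print(generalized_ridge(A, y, [1.0, 1.0, 1.0, 1e8]))    # last dimension squeezed out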