Ridge Regression: Biased Estimation for Nonorthogonal Problems, by A.E. Hoerl and R.W. Kennard
Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani
Presented by: John Paisley, Duke University, Dept. of ECE
Introduction
Consider an overdetermined system of linear equations (more equations than unknowns). We want to find “optimal” solutions according to different criteria. This motivates our discussion of the following topics:
- Least Squares Estimation
- Ridge Regression
- Lasso
- ML and MAP Interpretations of LS, RR and Lasso
- Relationship with RVM and Bayesian Lasso
With the RVM and Lasso, we let the system become underdetermined when working with compressed sensing. However, as I understand it, the theory of compressed sensing is based on the idea that the true solution is sparse, so the system restricted to the nonzero coefficients is actually an overdetermined one hiding within an underdetermined matrix.
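Least Squares Estimation
Consider an overdetermined system of linear equations, $y = \Phi x + e$, where $\Phi$ is an $N \times M$ matrix with $N > M$, $x$ is the vector of unknowns and $e$ is the error vector. Least squares attempts to minimize the magnitude of the error vector, which means solving $x_{LS} = \arg\min_x \|y - \Phi x\|_2^2$. By recognizing that the error, $y - y^*$, should be orthogonal to the estimate, $y^* = \Phi x_{LS}$ (and, more generally, to every column of $\Phi$), we obtain the normal equations $\Phi^T (y - \Phi x_{LS}) = 0$ and hence the least squares solution for x: $x_{LS} = (\Phi^T \Phi)^{-1} \Phi^T y$.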
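The least squares solution and its orthogonality property can be checked numerically. The following is a minimal sketch on synthetic data; the names `Phi`, `x_true` and the problem sizes are assumptions made for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical overdetermined system: 50 equations, 3 unknowns.
N, M = 50, 3
Phi = rng.normal(size=(N, M))
x_true = np.array([1.0, -2.0, 0.5])
y = Phi @ x_true + 0.1 * rng.normal(size=N)

# Least squares via the normal equations: (Phi^T Phi) x = Phi^T y.
x_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Same answer from NumPy's SVD-based solver, which is numerically safer.
x_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The residual is orthogonal to the estimate y* = Phi x_ls.
y_star = Phi @ x_ls
residual = y - y_star
print("x_ls:", x_ls)
print("y*^T (y - y*):", y_star @ residual)
```

The inner product printed on the last line should be zero up to floating-point error, which is the orthogonality condition stated on the slide.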
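Ridge Regression
There are issues with the LS solution. Consider the generative interpretation of the overdetermined system: $y = \Phi x + e$ with $e \sim N(0, \sigma^2 I)$. Then the following can be shown to be true: $E[x_{LS}] = x$ and $\mathrm{Var}[x_{LS}] = \sigma^2 (\Phi^T \Phi)^{-1}$. When $\Phi^T \Phi$ has very small eigenvalues, the variance of the least squares estimate can lead to x vectors that “blow up,” which is bad when it is x that we’re really interested in. Ridge regression keeps the values in x from blowing up by introducing a penalty to the least squares objective function: $x_{RR} = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \|x\|_2^2$, with $\lambda > 0$. This has the solution $x_{RR} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$.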
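To illustrate the “blow up” effect, the sketch below builds a design matrix with two nearly identical columns, so that its Gram matrix has one very small eigenvalue, and compares the size of the least squares and ridge estimates over repeated noise draws. The data, the noise level `sigma`, the penalty value 1.0 and the helper `ridge` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design with two nearly identical columns, so that
# Phi^T Phi has one very small eigenvalue.
N = 100
c = rng.normal(size=N)
Phi = np.column_stack([c, c + 1e-4 * rng.normal(size=N)])
x_true = np.array([1.0, 1.0])
sigma = 0.1

def ridge(Phi, y, lam):
    """Solve (Phi^T Phi + lam I) x = Phi^T y; lam = 0 gives least squares."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Refit on fresh noise draws and record the size of each estimate.
ls_norms, rr_norms = [], []
for _ in range(200):
    y = Phi @ x_true + sigma * rng.normal(size=N)
    ls_norms.append(np.linalg.norm(ridge(Phi, y, 0.0)))
    rr_norms.append(np.linalg.norm(ridge(Phi, y, 1.0)))

print("smallest eigenvalue:", np.linalg.eigvalsh(Phi.T @ Phi)[0])
print("mean |x_LS|:", np.mean(ls_norms))
print("mean |x_RR|:", np.mean(rr_norms))
```

In this setup the least squares estimates should be far larger in norm than the true x, while the ridge estimates stay close to it in size.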
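Ridge Regression: Geometric Interpretation
The least squares objective function for any x can be written as: $\|y - \Phi x\|_2^2 = \|y - \Phi x_{LS}\|_2^2 + (x - x_{LS})^T \Phi^T \Phi (x - x_{LS})$, where $x_{LS}$ is the least squares solution and $\Phi$ is the design matrix. Consider a variation of the ridge regression problem: minimize $\|y - \Phi x\|_2^2$ subject to $\|x\|_2^2 \le C$. The constraint produces a feasible region (the gray area in the figure). The least squares penalty is a constant, plus a “Gaussian” with mean $x_{LS}$ and precision matrix $\Phi^T \Phi$. The solution is the point where the contours first touch the feasible region.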
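The claim that the least squares objective is a constant plus a quadratic form centered at the least squares solution can be verified numerically. This is a small sketch on random data; `objective` and `quadratic_form` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(30, 2))
y = rng.normal(size=30)

x_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
const = np.sum((y - Phi @ x_ls) ** 2)   # squared error at the LS solution

def objective(x):
    """Least squares objective ||y - Phi x||^2."""
    return np.sum((y - Phi @ x) ** 2)

def quadratic_form(x):
    """Constant plus a quadratic with center x_ls and matrix Phi^T Phi."""
    d = x - x_ls
    return const + d @ (Phi.T @ Phi) @ d

x = rng.normal(size=2)
print(objective(x), quadratic_form(x))
```

The two printed values should agree for any x, because the cross term vanishes by the normal equations.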
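Lasso
Take ridge regression, but change the constraint from the $L_2$ norm to the $L_1$ norm: minimize $\|y - \Phi x\|_2^2$ subject to $\|x\|_1 = \sum_i |x_i| \le C$. The circle becomes a diamond because that’s what contours of equal length look like for the $L_1$ norm. The corners of the diamond lie on the coordinate axes, which indicates that there could be coefficients that are exactly zero. In general, the constraint can be written as $\sum_i |x_i|^p \le C$, with $p = 2$ giving ridge regression and $p = 1$ the lasso. Note that when the constraint is stated in terms of a penalty, then for ridge regression and lasso, the feasible region is replaced with a second convex function having the appropriate contours. The sum of these two convex functions is another convex function with a minimum. Solving for x, however, is harder because there is no closed-form solution anymore. With the squared-error objective the problem is a quadratic program, so quadratic programming or other convex optimization methods can be used; linear programming methods such as the simplex algorithm apply when the objective is linear as well, as in basis pursuit.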
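As a concrete illustration of the penalty form, the sketch below solves the lasso with iterative soft-thresholding (ISTA, a proximal gradient method). This is a swapped-in technique chosen for brevity, not the solver discussed on the slide, and it minimizes $\frac{1}{2}\|y - \Phi x\|_2^2 + \lambda \|x\|_1$, which differs from the slide's scaling by a factor of 2 in $\lambda$. The data, `lam` and the function names are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Phi, y, lam, n_iter=2000):
    """Minimize 0.5*||y - Phi x||^2 + lam*||x||_1 by proximal gradient."""
    L = np.linalg.norm(Phi, 2) ** 2   # Lipschitz constant of the gradient
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ x - y)
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Hypothetical sparse problem: only two of eight coefficients are nonzero.
rng = np.random.default_rng(3)
N, M = 200, 8
Phi = rng.normal(size=(N, M))
x_true = np.zeros(M)
x_true[[0, 3]] = [2.0, -1.5]
y = Phi @ x_true + 0.1 * rng.normal(size=N)

lam = 20.0
x_lasso = lasso_ista(Phi, y, lam)
x_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
print("lasso:", np.round(x_lasso, 3))
print("ridge:", np.round(x_ridge, 3))
```

The lasso estimate should contain coefficients that are exactly zero, whereas ridge regression only shrinks coefficients without zeroing them, which matches the diamond versus circle picture.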
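ML and MAP Interpretations
Finding the ML solution to $p(y \mid x) = N(y; \Phi x, \sigma^2 I)$ produces the LS solution. If we place a prior on x, $p(x) = N(x; 0, \alpha^{-1} I)$, and find the MAP solution, $x_{MAP} = \arg\max_x \ln p(y \mid x) + \ln p(x)$, we see that we are maximizing the negative of the RR penalty with $\lambda = \sigma^2 \alpha$. Furthermore, given that the posterior is Gaussian, $p(x \mid y) = N(x; \mu, \Sigma)$ with $\Sigma = (\sigma^{-2} \Phi^T \Phi + \alpha I)^{-1}$ and $\mu = \sigma^{-2} \Sigma \Phi^T y$, we see that the mean of the posterior distribution is the RR solution, since $\mu = (\Phi^T \Phi + \sigma^2 \alpha I)^{-1} \Phi^T y$. If we were to place a double-exponential (Laplace) prior on x, we would see that the MAP solution is the Lasso solution.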
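The equivalence between the Gaussian posterior mean and the ridge solution can be checked directly. The sketch below assumes a zero-mean isotropic Gaussian prior with precision `alpha` and a known noise variance `sigma2`; both values, and the random data, are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 40, 3
Phi = rng.normal(size=(N, M))
y = rng.normal(size=N)

sigma2 = 0.5   # noise variance (assumed known)
alpha = 2.0    # prior precision on x (assumed)

# Gaussian posterior p(x | y) = N(mu, Sigma).
Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + alpha * np.eye(M))
mu = Sigma @ Phi.T @ y / sigma2

# Ridge solution with lambda = sigma^2 * alpha.
lam = sigma2 * alpha
x_rr = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

print("posterior mean:", mu)
print("ridge solution:", x_rr)
```

The two vectors should match to numerical precision, since a Gaussian's mean and mode coincide.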
RVM and Bayesian Lasso
With Tikhonov regularization, we can change the RR penalty to penalize each dimension separately with a diagonal matrix A: minimize $\|y - \Phi x\|_2^2 + x^T A x$, which has the solution $x = (\Phi^T \Phi + A)^{-1} \Phi^T y$. This suggests that the RVM is a “Bayesian ridge” solution, where at each iteration we update the penalty on each dimension under the prior that the penalty should be high, which enforces sparseness. The penalty matrix, A, changes the illustration shown earlier. The constraint region is no longer a circle, but an ellipse with symmetry about the axes. As the penalty increases along one dimension, the ellipse is squeezed toward the origin along that dimension. As the penalty along one dimension goes to infinity, as it can with the RVM, we squeeze that dimension out of existence. This changes the interpretation of zero coefficients. We don’t need the Lasso anymore because the penalty itself will ensure that the ellipse touches the contours at points with many zero coefficients. Without the Bayesian approach, the penalty isn’t allowed to do this, which is why the Lasso and its convex programming solvers are used. I think this is why the RVM and Bayesian Lasso produce almost identical results for us.
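The effect of a per-dimension penalty can be seen without running the full RVM. The sketch below fixes a diagonal penalty by hand and raises it along one dimension; it does not implement the RVM's iterative updates, and the data, penalty values and the helper `tikhonov` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 50, 3
Phi = rng.normal(size=(N, M))
y = Phi @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

def tikhonov(Phi, y, a):
    """Minimize ||y - Phi x||^2 + x^T A x with A = diag(a)."""
    return np.linalg.solve(Phi.T @ Phi + np.diag(a), Phi.T @ y)

# Raise the penalty on the third dimension only.
for a3 in [0.0, 1e1, 1e3, 1e6, 1e12]:
    x = tikhonov(Phi, y, np.array([1.0, 1.0, a3]))
    print(f"a3 = {a3:8.0e}  x = {np.round(x, 4)}")

# With the third dimension squeezed out, the fit matches a model
# that simply drops the third column.
x_big = tikhonov(Phi, y, np.array([1.0, 1.0, 1e12]))
x_drop = tikhonov(Phi[:, :2], y, np.array([1.0, 1.0]))
print("remaining coefficients:", x_big[:2], "vs", x_drop)
```

As the third penalty grows, the third coefficient should shrink toward zero and the remaining coefficients should approach the solution of the reduced two-column problem, which is the “squeezed out of existence” picture described above.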