Coordinate Descent Algorithms

Coordinate Descent Algorithms. Stephen J. Wright. Presented by Sai Krishna Dubagunta.

Index
- About Coordinate Descent (CD) algorithms
- Introduction: optimization (minimization)
- Types of functions (definitions): convex, smooth
- Regularization functions
- Outline of CD algorithms
- Applications

Contd.
- Algorithms, Convergence, and Implementations
- Powell's example
- Randomized algorithms
- Conclusion

Coordinate descent (CD) algorithms solve optimization problems by successively performing approximate minimization along coordinate directions or coordinate hyperplanes. The idea, reminiscent of dynamic programming, is to break a problem into subproblems of lower dimension (even scalar) and solve each subproblem in turn.

Introduction: Optimization (minimization). Minimization of a multivariate function F(X) can be achieved by minimizing it along one direction at a time (solving a univariate problem in a loop). The problem considered in this paper is min over x ∈ ℝⁿ of f(x), where f : ℝⁿ → ℝ is continuous.

Types of functions. Convex function: a real-valued function f is said to be convex if its domain is a convex set and, for every x, y in the domain and every 𝛼 ∈ [0, 1], f(𝛼x + (1 − 𝛼)y) ≤ 𝛼 f(x) + (1 − 𝛼) f(y).

Smooth function: a function f(x) is said to be smooth if all of its partial derivatives are defined and continuous at every point in the domain of f.

Regularization functions: common choices for the regularizer include
- no regularizer,
- a weighted ℓ1 norm,
- a box constraint,
- a weighted ℓ2 norm.
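As an illustration (not from the paper), here is a minimal Python sketch of closed-form proximal ("shrink") maps for regularizers of these kinds, applied to one scalar coordinate; the function names, the step length alpha, and the weight lam are assumed placeholders, and the ℓ2 case is shown in its squared form.

```python
import numpy as np

# Illustrative sketch (not from the paper): closed-form proximal ("shrink")
# maps for regularizers of the kinds listed above, applied to one scalar
# coordinate z with step length alpha and weight lam (names are made up).

def prox_none(z, alpha, lam):
    # No regularizer: the prox map is the identity.
    return z

def prox_weighted_l1(z, alpha, lam):
    # Weighted l1 penalty lam*|z|: soft-thresholding.
    return np.sign(z) * max(abs(z) - alpha * lam, 0.0)

def prox_box(z, lo=-1.0, hi=1.0):
    # Box constraint z in [lo, hi]: projection onto the interval.
    return min(max(z, lo), hi)

def prox_weighted_l2(z, alpha, lam):
    # Weighted l2 penalty, shown here in its squared form (lam/2)*z**2.
    return z / (1.0 + alpha * lam)

print(prox_weighted_l1(0.9, alpha=0.5, lam=1.0))   # shrinks toward zero: 0.4
```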

Outline of Coordinate Descent Algorithms. As we said in the Introduction, the problem used in this paper is motivated by recent applications; it is now common to consider the formulation min over x of f(x) + 𝛌 Ω(x), where f is smooth, Ω is a regularization function that may be nonsmooth and extended-valued, and 𝛌 > 0 is a regularization parameter.

Ω is often convex and assumed to be separable or block separable. If separable, Ω(x) = Σᵢ Ωᵢ(xᵢ), where Ωᵢ : ℝ → ℝ for all i, and n is a very large value representing the number of coordinates. When block separable, the n × n identity matrix can be partitioned into column submatrices Uᵢ, i = 1, 2, …, N, such that Ω(x) = Σᵢ Ωᵢ(Uᵢᵀ x).

Basic Coordinate Descent Framework. Each step consists of evaluating a single component ik of the gradient ∇f at the current point, then adjusting the ik component of x in the direction opposite to that gradient component: xk+1 = xk − 𝛼k [∇f(xk)]ik eik, where eik is the ik-th unit vector. The step length 𝛼k can be chosen by exact minimization along the ik component or by a predefined "short-step" rule. The components can be selected in cyclic fashion, in which i0 = 1 and ik+1 = [ik mod n] + 1. The first algorithm, Algorithm 1, uses no regularizer, i.e., the problem here is unconstrained and smooth. Algorithm 1
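A minimal Python sketch of this cyclic scheme; the helper grad_i(x, i), the step length alpha, and the small least-squares example are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# A minimal sketch of the cyclic scheme above (no regularizer). The helper
# grad_i(x, i), the step length alpha, and the tiny least-squares example
# are illustrative choices, not taken from the paper.

def cyclic_coordinate_descent(grad_i, x0, alpha, num_iters):
    x = x0.copy()
    n = x.size
    for k in range(num_iters):
        i = k % n                      # cyclic choice of the component i_k
        x[i] -= alpha * grad_i(x, i)   # step against the i_k gradient component
    return x

# Example: f(x) = 0.5*||A x - b||^2, so the i-th partial is A[:, i].dot(A x - b).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad_i = lambda x, i: A[:, i].dot(A.dot(x) - b)
print(cyclic_coordinate_descent(grad_i, np.zeros(2), alpha=0.1, num_iters=200))
# converges to the least-squares solution, approximately [0.2, 0.6]
```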

The formulation from slide 10 is addressed by Algorithm 2. At iteration k, a scalar subproblem is formed by making a linear approximation to f along the ik coordinate direction at the current iterate xk, adding a quadratic damping term weighted by 1/𝛼k, and treating the regularization term Ωik exactly. Note that when the regularization term is not present, the step is identical to Algorithm 1. For some interesting choices of Ωi, it is possible to derive a closed-form solution without performing explicit searches. This operation is called the "shrink operation"; by stating the subproblem in Algorithm 2 equivalently in terms of the shrink operator, we can express the CD update compactly. Unlike Algorithm 1, where we discussed a minimization problem without a regularization function, here we also include a regularization function. Algorithm 2
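A minimal sketch of such an update for an ℓ1 regularizer, where the scalar subproblem has the closed-form shrink (soft-thresholding) solution; grad_i, alpha, and lam are the same illustrative names as in the earlier sketch.

```python
import numpy as np

# Sketch of an Algorithm-2-style update for an l1 regularizer: linearize f
# along coordinate i, add the (1/(2*alpha)) damping term, and solve the
# scalar subproblem exactly via the shrink (soft-threshold) operation.

def soft_threshold(z, tau):
    return np.sign(z) * max(abs(z) - tau, 0.0)

def proximal_coordinate_descent(grad_i, x0, alpha, lam, num_iters):
    x = x0.copy()
    n = x.size
    for k in range(num_iters):
        i = k % n
        z = x[i] - alpha * grad_i(x, i)    # forward step on the linearized f
        x[i] = soft_threshold(z, alpha * lam)  # shrink step on lam*|x_i|
    return x
```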

Application to Linear Equations. For a linear system Aw = b, assume the rows of A are normalized (‖Ai‖ = 1 for all i). The minimum-norm solution solves min (1/2)‖w‖² subject to Aw = b, and its Lagrangian dual is min over x of (1/2)‖Aᵀx‖² − bᵀx. Applying Algorithm 1 to this dual with 𝛼k ≡ 1, each step has the form xk+1 = xk + (bik − Aikᵀwk) eik. From the Lagrangian dual we recover w = Aᵀx, and after each update on xk we obtain wk+1 = wk + (bik − Aikᵀwk) Aik, which is exactly the update of the Kaczmarz algorithm. Thus the Kaczmarz method is CD applied to the Lagrangian dual of the least-norm formulation of Aw = b.
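A minimal sketch of the resulting (randomized) Kaczmarz update, assuming the rows of A are already normalized; the function name and the random row selection are illustrative choices.

```python
import numpy as np

# Minimal sketch of the Kaczmarz update for A w = b with normalized rows
# (||A[i]|| = 1): each step projects the current iterate onto the hyperplane
# of one equation, matching the dual CD step described above.

def kaczmarz(A, b, num_iters, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    w = np.zeros(n)
    for _ in range(num_iters):
        i = rng.integers(m)                  # pick an equation (row) at random
        w += (b[i] - A[i].dot(w)) * A[i]     # project onto {w : A[i].w = b[i]}
    return w
```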

Relationship to other Methods. Stochastic gradient (SG) methods minimize a smooth function f by taking a step along the negative of an estimate gk of the gradient ∇f(xk) at iteration k. It is often assumed that gk is an unbiased estimate of ∇f(xk), that is, ∇f(xk) = E(gk). Randomized CD algorithms can be viewed as a special case of SG methods, in which gk = n [∇f(xk)]ik eik, where ik is chosen uniformly at random from {1, 2, …, n}.
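A quick numerical sanity check, purely illustrative (the gradient vector below is made up), that this estimator averages to the full gradient.

```python
import numpy as np

# Sanity check that g_k = n * [grad f]_i * e_i, with i uniform on
# {1,...,n}, is an unbiased estimate of the full gradient.
rng = np.random.default_rng(0)
grad = np.array([1.0, -2.0, 0.5])        # stand-in for the true gradient at x_k
n = grad.size
total = np.zeros(n)
num_samples = 100000
for _ in range(num_samples):
    i = rng.integers(n)
    g = np.zeros(n)
    g[i] = n * grad[i]                   # the CD-as-SG gradient estimate
    total += g
print(total / num_samples)               # approximately equal to grad
```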

Gauss-Seidel Method: for n × n systems of linear equations, the method adjusts the ik variable at iteration k so as to ensure satisfaction of the ik equation. Standard Gauss-Seidel uses the cyclic choice of coordinates, whereas a random choice of ik corresponds to "randomized" versions of the method. The Gauss-Seidel method applied to the normal equations of Aw = b, that is, AᵀAw = Aᵀb, is equivalent to applying Algorithm 1 (with exact coordinate minimization) to the least-squares problem.
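A minimal sketch of the cyclic version, assuming A is square with nonzero diagonal; the small symmetric positive definite example is an illustrative choice (convergence requires conditions such as positive definiteness or diagonal dominance).

```python
import numpy as np

# Minimal sketch of cyclic Gauss-Seidel for a square system A w = b: at each
# step, variable i is reset so that equation i holds exactly given the other
# variables.

def gauss_seidel(A, b, num_sweeps):
    n = len(b)
    w = np.zeros(n)
    for _ in range(num_sweeps):
        for i in range(n):
            residual_without_i = b[i] - A[i].dot(w) + A[i, i] * w[i]
            w[i] = residual_without_i / A[i, i]
    return w

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric positive definite example
b = np.array([1.0, 2.0])
print(gauss_seidel(A, b, num_sweeps=25))  # approaches the solution of A w = b
```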

Applications. Several applications of CD algorithms are mentioned here.
- Bouman and Sauer: positron emission tomography, in which the objective has the regularized form of the equation from slide 9, where f is smooth and convex and Ω is a sum of terms of the form |xj − xl|^q.
- Liu, Palatucci, and Zhang describe a block CD approach for linear least squares plus a regularization function consisting of a sum of norms of subvectors of x.
- Chang, Hsieh, and Lin use cyclic and stochastic CD to solve a squared-loss formulation of the support vector machine (SVM) problem in machine learning, where (xi, yi) ∈ ℝⁿ × {−1, +1} are feature vector / label pairs and 𝛌 is a regularization parameter.

Algorithms, Convergence, Implementations. Powell's example: a function on ℝ³ for which cyclic CD fails to converge to a stationary point. Powell constructed a non-convex, continuously differentiable function f : ℝ³ → ℝ that has minimizers at corners of the unit cube, but coordinate descent with exact minimization, started near one of the other vertices of the cube, cycles around the neighborhoods of the six points close to the six non-optimal vertices. Thus we cannot expect a general convergence result for non-convex functions of the type available for full-gradient descent. Still, results are available for the non-convex case under certain assumptions that admit interesting applications. Powell's example shows the non-convergence of cyclic coordinate descent in three dimensions: the iterates have no stationary limit point.

Assumptions & Notation. We consider the unconstrained problem from slide 5, where the objective f is convex and Lipschitz continuously differentiable. We assume there is a modulus 𝞼 ≥ 0 such that the strong-convexity lower bound (20) holds. We define Lipschitz constants that are tied to the component directions and are key to the algorithms and their analysis. The component Lipschitz constants are positive quantities Li such that, for all x ∈ ℝⁿ and all t ∈ ℝ, the i-th component of the gradient changes by at most Li |t| when x is perturbed by t along coordinate i. We define the coordinate Lipschitz constant Lmax to be the maximum of the Li.
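For reference, these conditions can be written in the following standard form (a sketch; the paper's exact statements in (20), (21), and (22) may differ slightly):

```latex
% Strong convexity with modulus sigma >= 0 (cf. (20)):
\[
  f(y) \;\ge\; f(x) + \nabla f(x)^\top (y - x) + \tfrac{\sigma}{2}\,\|y - x\|^2
  \qquad \text{for all } x, y \in \mathbb{R}^n .
\]
% Component Lipschitz constants L_i (cf. (21)):
\[
  \bigl|\, [\nabla f(x + t e_i)]_i - [\nabla f(x)]_i \,\bigr| \;\le\; L_i\,|t|
  \qquad \text{for all } x \in \mathbb{R}^n,\ t \in \mathbb{R} .
\]
% Coordinate Lipschitz constant (cf. (22)):
\[
  L_{\max} \;:=\; \max_{i = 1, \dots, n} L_i .
\]
```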

The standard Lipschitz constant L is such that the full gradient is L-Lipschitz for all x and d of interest. By the relationship between the norm and the trace of a symmetric matrix, we can assume that 1 ≤ L / Lmax ≤ n. We also define the restricted Lipschitz constant Lres such that, for all x ∈ ℝⁿ, all t ∈ ℝ, and all i = 1, 2, …, n, the change in the full gradient under a perturbation of size t along coordinate i is bounded by Lres |t|. Clearly Lres ≤ L. The ratio Λ = Lres / Lmax is important in our analysis of asynchronous parallel algorithms in a later section.
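In the same spirit, a sketch of the standard and restricted constants and of the ratio Λ (the exact statements in the paper may differ):

```latex
% Standard Lipschitz constant L:
\[
  \|\nabla f(x + d) - \nabla f(x)\| \;\le\; L\,\|d\| ,
\]
% Restricted Lipschitz constant L_res (perturbations along single coordinates):
\[
  \|\nabla f(x + t e_i) - \nabla f(x)\| \;\le\; L_{\mathrm{res}}\,|t| ,
  \qquad i = 1, \dots, n ,
\]
% Ratio used in the analysis of asynchronous parallel algorithms:
\[
  \Lambda \;:=\; \frac{L_{\mathrm{res}}}{L_{\max}} .
\]
```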

In the case where f is convex and twice continuously differentiable, positive semidefiniteness of the Hessian ∇²f(x) at all x yields bounds on these constants, from which we can deduce bounds on Λ. We can derive stronger bounds on Λ for functions f in which the coupling between the components of x is weak. The coordinate Lipschitz constant Lmax corresponds to the maximal absolute value of the diagonal elements of the Hessian ∇²f(x), while the restricted constant Lres is related to the maximal column norm of the Hessian. So, if the Hessian is positive semidefinite and diagonally dominant, the ratio Λ is at most 2.

Assumption 1. The function f is convex and uniformly Lipschitz continuously differentiable, and it attains its minimum value f* on a set S. There is a finite R0 such that the level set for f defined by x0 is bounded, that is, ‖x − x*‖ ≤ R0 for all x with f(x) ≤ f(x0) and all x* ∈ S.
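In symbols, a likely form of this level-set condition (a sketch; R0 is the constant used in the convergence bounds below):

```latex
\[
  R_0 \;:=\; \max_{x^* \in S}\; \max_{x}\;\bigl\{\, \|x - x^*\| \;:\; f(x) \le f(x^0) \,\bigr\} \;<\; \infty .
\]
```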

Randomized Algorithms. The update component ik is chosen randomly at each iteration. In the algorithm given here (Algorithm 3), we consider the simplest variant, in which each ik is selected from {1, 2, …, n} with equal probability, independently of the selections made at previous iterations. We prove a convergence result for the randomized algorithm for the simple step length choice 𝛼k ≡ 1/Lmax. Algorithm 3
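A minimal Python sketch of Algorithm 3 as described; grad_i and L_max are assumed to be supplied by the caller, and the names are illustrative.

```python
import numpy as np

# Minimal sketch of randomized coordinate descent with the fixed step
# 1/L_max analyzed in Theorem 1.

def randomized_coordinate_descent(grad_i, x0, L_max, num_iters, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    n = x.size
    for _ in range(num_iters):
        i = rng.integers(n)              # i_k uniform on {1,...,n}, independent of the past
        x[i] -= grad_i(x, i) / L_max     # step length alpha_k = 1/L_max
    return x
```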

Theorem 1: Suppose that Assumption 1 holds and that 𝛼k ≡ 1/Lmax in Algorithm 3. Then for all k > 0 we have the sublinear bound (27); when 𝞼 > 0, we have in addition the linear-rate bound (28). Proof. By application of Taylor's theorem, (21), and (22), we have

where we substituted the choice 𝛼k ≡ 1/Lmax in the last equality. Taking the expectation of both sides of this expression over the random index ik gives a per-iteration decrease bound. We now subtract f(x*) from both sides, take expectations of both sides with respect to all random variables i0, i1, i2, …, and use compact notation for the expected optimality gap to obtain a recursion. By convexity of f we have, for any x* ∈ S, that

where the final inequality holds because f(xk) ≤ f(x0), so that xk lies in the level set in (26). By taking expectations of both sides, we obtain a bound which, when substituted into (32) and rearranged, yields a recursion; applying this formula recursively, we obtain the bound (27), as claimed. In the case that f is strongly convex with modulus 𝞼 > 0, we have, by taking the minimum of both sides with respect to y in (20) and setting x = xk, that

By using this expression to bound ‖∇f(xk)‖² in (32), we obtain a recursion whose repeated application leads to (28). Note that the same convergence expressions can be obtained for more refined choices of the step length 𝛼k by making minor adjustments to the logic. For example, the choice 𝛼k ≡ 1/Lik leads to the same bounds, and the same bounds also hold when 𝛼k is chosen by exact minimization of f along the ik coordinate direction. We can compare (27) with the corresponding result for full-gradient descent with constant step length 𝛼k ≡ 1/L, whose iteration leads to an analogous sublinear convergence expression with L in place of n Lmax; the bounds are sketched below.
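For reference, the bounds discussed above are likely of the following standard form for randomized CD and full-gradient descent (a reconstruction; the constants in the paper's (27) and (28) may differ slightly):

```latex
% Sublinear rate of randomized CD for convex f (cf. (27)):
\[
  \mathbb{E}\,[\,f(x_k)\,] - f^* \;\le\; \frac{2\, n\, L_{\max}\, R_0^2}{k} ,
\]
% Linear rate when f is strongly convex with modulus sigma > 0 (cf. (28)):
\[
  \mathbb{E}\,[\,f(x_k)\,] - f^* \;\le\; \Bigl(1 - \frac{\sigma}{n\, L_{\max}}\Bigr)^{\!k}\,\bigl(f(x_0) - f^*\bigr),
\]
% Comparable full-gradient bound with step length 1/L:
\[
  f(x_k) - f^* \;\le\; \frac{2\, L\, R_0^2}{k} .
\]
```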

Accelerated Randomized Algorithms. Proposed by Nesterov, this approach assumes that an estimate of the strong convexity modulus 𝞼 ≥ 0 in (20) is available, as well as estimates of the component-wise Lipschitz constants Li in (21). It is closely related to accelerated full-gradient methods. Algorithm 4

Theorem 2: Suppose that Assumption 1 holds. Then for all k ≥ 0 we have the bound (35). In the strongly convex case 𝞼 > 0, the geometric term eventually dominates the second term in brackets in (35), so the linear convergence rate suggested by this expression is significantly faster than the corresponding rate (28) for Algorithm 3; the leading term in (35) reduces the number of iterations required to meet a specified error tolerance.

Efficient Implementation of the Accelerated Algorithm. The higher cost of each iteration of Algorithm 4 detracts from the appeal of accelerated CD methods over standard methods. However, by using a change of variables due to Lee and Sidford, it is possible to implement the accelerated randomized CD approach efficiently for problems with certain structure, including the linear system Aw = b. We explain the Lee-Sidford technique in the context of the Kaczmarz algorithm for (8), assuming normalization of the rows of A as in (14). As explained in (16), the Kaczmarz algorithm is obtained by applying CD to the dual formulation (10) with variables x, but operating in the space of "primal" variables w via the transformation w = ATx. If we apply the transformations Ṽk = ATVk and ỹk = ATyk to the vectors in Algorithm 4, and note that Li ≡ 1 in (21) because of the row normalization, we obtain Algorithm 5.

Algorithm 5

When the matrix A is dense, there is only a small difference between the per-iteration workload of the standard Kaczmarz algorithm and that of its accelerated variant, Algorithm 5: both require O(m + n) operations per iteration. When A is sparse, the situation changes: at iteration k the standard Kaczmarz method requires only O(|Aik|) operations, where |Aik| is the number of nonzeros in row ik of A, and thanks to the change of variables the accelerated variant in Algorithm 5 likewise requires only O(|Aik|) operations.

Conclusion. We have surveyed the state of the art in convergence of coordinate descent methods, with a focus on the most elementary settings and the most fundamental algorithms. Coordinate descent methods have become an important tool in the optimization toolbox used to solve problems that arise in machine learning and data analysis, particularly in "big data" settings.