Coordinate Descent Algorithms


1 Coordinate Descent Algorithms
Stephen J. Wright
Sai Krishna Dubagunta

2 Index
About Coordinate Descent Algorithms (CD)
Introduction
Optimization as minimization
Types of functions [definitions]: convex, smooth
Regularization functions
Outline of CD algorithms
Applications

3 Contd.
Algorithms, Convergence and Implementations
Powell's Example
Randomized Algorithms
Conclusion

4 Coordinate Descent Algorithm
Coordinate descent algorithms solve optimization problems by successively performing approximate minimization along coordinate directions or coordinate hyperplanes. The underlying idea is to break a problem into subproblems of lower dimension (even scalar) and solve each subproblem in turn.

5 Introduction: Optimization as Minimization
Minimization of a multivariate function F(x) can be achieved by minimizing it along one direction at a time (solving a univariate problem in a loop). The problem considered in this paper is the minimization shown below, where f : ℝⁿ → ℝ is continuous.
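Written out, the unconstrained formulation referred to above (a reconstruction of the omitted formula, consistent with the rest of the slides):

```latex
\[
\min_{x \in \mathbb{R}^n} \; f(x), \qquad f : \mathbb{R}^n \to \mathbb{R}.
\]
```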

6 Types of Functions
Convex function: a real-valued function f defined on a convex domain is said to be convex if the inequality sketched below holds for every pair of points in its domain.
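A reconstruction of the omitted convexity inequality, in its standard form:

```latex
\[
f\bigl(\alpha x + (1-\alpha)y\bigr) \;\le\; \alpha f(x) + (1-\alpha) f(y)
\qquad \text{for all } x, y \in \operatorname{dom} f,\ \alpha \in [0,1].
\]
```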

7 Smooth function: a function f(x) is said to be smooth if all of the partial derivatives of f(x) are defined and continuous at every point in the domain of f.

8 Regularization Functions
No regularizer
Weighted ℓ1 norm
Box constraint
Weighted ℓ2 norm
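The four regularizers listed above typically take the following forms; this is a sketch based on standard usage, since the slide's formulas were omitted, and the weights λ_i and bounds l_i, u_i are illustrative assumptions:

```latex
\[
\Omega(x) = 0, \qquad
\Omega(x) = \sum_{i=1}^{n} \lambda_i |x_i|, \qquad
\Omega(x) = \begin{cases} 0 & l_i \le x_i \le u_i \ \text{for all } i,\\ +\infty & \text{otherwise}, \end{cases} \qquad
\Omega(x) = \tfrac{1}{2}\sum_{i=1}^{n} \lambda_i x_i^2 .
\]
```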

9 Outline of Coordinate Descent Algorithms
As noted in the Introduction, the formulation used in this paper is motivated by recent applications; it is now common to consider the formulation shown below, where f is smooth, Ω is a regularization function that may be non-smooth and extended-valued, and λ > 0 is a regularization parameter.
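The regularized formulation, written out (a reconstruction consistent with the description above):

```latex
\[
\min_{x \in \mathbb{R}^n} \; f(x) + \lambda\,\Omega(x),
\qquad f \text{ smooth},\ \Omega \text{ possibly non-smooth and extended-valued},\ \lambda > 0.
\]
```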

10 Ω is often convex and assumed to be separable or block separable. If separable, Ω is a sum of functions Ω_i : ℝ → ℝ of the individual coordinates; here n, the number of coordinates, is typically very large. When block separable, the n×n identity matrix can be partitioned into column submatrices U_i, i = 1, 2, …, N, and Ω is a sum of functions of the corresponding blocks of x (see the sketch below).
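A reconstruction of the omitted formulas, consistent with the description above:

```latex
\[
\text{separable: } \Omega(x) = \sum_{i=1}^{n} \Omega_i(x_i), \qquad
\text{block separable: } \Omega(x) = \sum_{i=1}^{N} \Omega_i\bigl(U_i^{\top} x\bigr),
\qquad [\,U_1 \; U_2 \; \cdots \; U_N\,] = I_{n \times n}.
\]
```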

11 Basic Coordinate Descent Framework
Each step consists of evaluating a single component i_k of the gradient ∇f at the current point and adjusting the corresponding component of x in the direction opposite to that gradient component. Here α_k is the step length, which can come from exact minimization along the i_k component or be a predefined short step. The components can be selected in cyclic fashion, in which i_0 = 1 and i_{k+1} = [i_k mod n] + 1. This first algorithm uses no regularizer, i.e., the problem here is unconstrained. Algorithm 1 (a sketch follows below).
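A minimal Python sketch of this framework under the cyclic rule; the callables and all names are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def coordinate_descent(grad_i, x0, alphas, n_iters):
    """Basic CD framework (Algorithm 1 style, no regularizer).

    grad_i(x, i) returns the i-th component of the gradient of f at x;
    alphas(k, i) returns the step length alpha_k used for coordinate i.
    """
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    for k in range(n_iters):
        i = k % n                    # cyclic choice of the coordinate index
        g = grad_i(x, i)             # single gradient component
        x[i] -= alphas(k, i) * g     # move opposite to that component
    return x
```

For example, for a convex quadratic f(x) = ½ xᵀAx − bᵀx, grad_i(x, i) is simply (A @ x − b)[i], and the constant step 1/A[i, i] corresponds to α_k = 1/L_i.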

12 The regularized formulation from slides 9-10 is addressed by Algorithm 2. At iteration k, a scalar subproblem is formed by making a linear approximation to f along the i_k coordinate direction at the current iterate x^k, adding a quadratic damping term weighted by 1/α_k, and treating the regularization term Ω_{i_k} exactly. Note that when the regularization term is not present, the step is identical to Algorithm 1. For some interesting choices of Ω_i it is possible to derive a closed-form solution without performing an explicit search. This kind of operation is called a "shrink operation"; by stating the subproblem in Algorithm 2 in this equivalent form, we can express the CD update compactly. Unlike Algorithm 1, where we discussed a minimization problem without a regularization function, here a regularization function is included. Algorithm 2 (a sketch of one common instance follows below).
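One common instance of the shrink operation is the ℓ1 (soft-threshold) case. The Python sketch below assumes the regularizer is lam·‖x‖₁ and is an illustration of the coordinate step, not the paper's Algorithm 2 verbatim:

```python
import numpy as np

def soft_threshold(z, tau):
    """Closed-form "shrink": argmin_t 0.5*(t - z)**2 + tau*abs(t)."""
    return np.sign(z) * max(abs(z) - tau, 0.0)

def prox_coordinate_step(x, i, grad_i, alpha, lam):
    """One CD step for f(x) + lam*||x||_1 along coordinate i.

    Linearize f at x along coordinate i, add the quadratic damping term
    (1/(2*alpha))*(t - x[i])**2, and solve the scalar subproblem exactly.
    """
    z = x[i] - alpha * grad_i(x, i)          # gradient step on the smooth part
    x = x.copy()
    x[i] = soft_threshold(z, alpha * lam)    # shrink handles the |x_i| term
    return x
```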

13 Application to Linear Equations
For a linear system Aw = b, assume the rows of A are normalized. The Lagrangian dual of the least-norm problem for Aw = b is a smooth convex problem in a dual variable x, related to the primal through w = A^T x. Applying Algorithm 1 to this Lagrangian dual with α_k ≡ 1, each step adjusts a single component of x; translating each update on x^k back to w through w = A^T x, we obtain exactly the update of the Kaczmarz algorithm.
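A Python sketch of the resulting Kaczmarz update in the primal variable w, assuming the rows of A are normalized as stated above (randomized row selection is used here for concreteness):

```python
import numpy as np

def kaczmarz(A, b, n_iters, rng=np.random.default_rng(0)):
    """Kaczmarz iteration for Aw = b with rows of A normalized to unit norm.

    Each step projects w onto the hyperplane {w : a_i^T w = b_i}; this is
    coordinate descent (alpha_k = 1) on the Lagrangian dual, mapped back to
    the primal variables through w = A^T x.
    """
    m, n = A.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        i = rng.integers(m)               # pick a row (equation)
        w -= (A[i] @ w - b[i]) * A[i]     # residual of row i times a_i
    return w
```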

14 Relationship to Other Methods
Stochastic gradient (SG) methods minimize a smooth function f by taking a (negative) step along an estimate g_k of the gradient ∇f(x^k) at iteration k. It is often assumed that g_k is an unbiased estimate of ∇f(x^k), that is, ∇f(x^k) = E(g_k). Randomized CD algorithms can be viewed as a special case of SG methods, in which g_k is built from a single gradient component, where i_k is chosen uniformly at random from {1, 2, …, n} (see below).
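The gradient estimate that makes randomized CD a special case of SG can be written as follows (a standard reconstruction; e_{i_k} denotes the i_k-th unit vector):

```latex
\[
g_k \;=\; n\,[\nabla f(x^k)]_{i_k}\, e_{i_k},
\qquad
\mathbb{E}_{i_k}[g_k] \;=\; \frac{1}{n}\sum_{i=1}^{n} n\,[\nabla f(x^k)]_{i}\, e_{i} \;=\; \nabla f(x^k).
\]
```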

15 Gauss-Seidel Method
The Gauss-Seidel method for n×n systems of linear equations adjusts the i_k-th variable to ensure satisfaction of the i_k-th equation at iteration k. Standard Gauss-Seidel uses the cyclic choice of coordinates, whereas a random choice of i_k corresponds to the "randomized" version of the method. The Gauss-Seidel method applied to the normal-equations system of Aw = b, that is, A^T A w = A^T b, is equivalent to applying Algorithm 1 to the least-squares problem (a minimal sketch follows below).
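A minimal Python sketch of one cyclic Gauss-Seidel sweep for a square system Aw = b; this is an illustration assuming nonzero diagonal entries, not the paper's pseudocode:

```python
import numpy as np

def gauss_seidel(A, b, n_sweeps):
    """Cyclic Gauss-Seidel for a square system Aw = b (requires A[i, i] != 0).

    At each step, variable i is adjusted so that equation i is satisfied
    exactly, using the most recently updated values of the other variables.
    """
    n = len(b)
    w = np.zeros(n)
    for _ in range(n_sweeps):
        for i in range(n):                        # cyclic coordinate choice
            r = b[i] - A[i] @ w + A[i, i] * w[i]  # b_i - sum_{j != i} a_ij w_j
            w[i] = r / A[i, i]                    # solve equation i for w_i
    return w
```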

16 Applications
Several applications of CD algorithms are mentioned here.
Bouman & Sauer: positron emission tomography, in which the objective has the form of the equation from slide 9, where f is smooth and convex and Ω is a sum of terms of the form |x_j − x_l|^q.
Liu, Palatucci, and Zhang describe a block CD approach for linear least squares plus a regularization function consisting of a sum of norms of subvectors of x.
Chang, Hsieh, and Lin use cyclic and stochastic CD to solve a squared-loss formulation of the support vector machine (SVM) problem in machine learning, where (x_i, y_i) ∈ ℝⁿ × {0, 1} are feature vector / label pairs and λ is a regularization parameter.

17 Algorithms, Convergence, Implementations
Powell's example: a function on ℝ³ for which cyclic CD fails to converge to a stationary point. It is a non-convex, continuously differentiable function f : ℝ³ → ℝ (a reconstruction is given below). It has minimizers at corners of the unit cube, but coordinate descent with exact minimization, started near one of the other vertices of the cube, cycles around the neighborhoods of six points close to the six non-optimal vertices. Thus we cannot expect a general convergence result for non-convex functions of the type that is available for full-gradient descent; results are available for the non-convex case under certain assumptions that still admit interesting applications. This is the classic example showing the non-convergence of cyclic coordinate descent: the iterates have no stationary accumulation point.
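Powell's example is commonly stated in the following form; treat the exact expression as an assumption here, since the slide's formula was omitted:

```latex
\[
f(x_1, x_2, x_3) \;=\; -(x_1 x_2 + x_2 x_3 + x_3 x_1)
\;+\; \sum_{i=1}^{3} \bigl(\,|x_i| - 1\,\bigr)_+^{2},
\qquad (t)_+ := \max(t, 0).
\]
```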

18 Assumptions & Notations
We consider the unconstrained problem from slide 5, where the objective f is convex and Lipschitz continuously differentiable. We assume there is σ ≥ 0 (σ > 0 in the strongly convex case) such that the strong-convexity inequality below holds. We define Lipschitz constants that are tied to the component directions and are key to the algorithms and their analysis. The component Lipschitz constants are positive quantities L_i such that, for all x ∈ ℝⁿ and all t ∈ ℝ, the component-wise bound below holds, and we define the coordinate Lipschitz constant L_max so that L_i ≤ L_max for all i.
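Reconstructions of the two omitted inequalities, in standard form (e_i denotes the i-th unit vector; these are consistent restatements rather than verbatim copies of the slide's formulas):

```latex
\[
f(y) \;\ge\; f(x) + \nabla f(x)^{\top}(y - x) + \frac{\sigma}{2}\,\|y - x\|^2,
\qquad
\bigl|\,[\nabla f(x + t e_i)]_i - [\nabla f(x)]_i\,\bigr| \;\le\; L_i\,|t|,
\qquad
L_i \le L_{\max},\ \ i = 1, \dots, n.
\]
```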

19 The standard Lipschitz constant L is such that the full-gradient bound below holds for all x and d of interest. By the relationships between the norm and the trace of a symmetric matrix, we can assume that 1 ≤ L/L_max ≤ n. We also define the restricted Lipschitz constant L_res such that the corresponding property below holds for all x ∈ ℝⁿ, all t ∈ ℝ, and all i = 1, 2, …, n. Clearly L_res ≤ L. The ratio Λ = L_res/L_max is important in our analysis of asynchronous parallel algorithms in a later section.
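Reconstructions of the bounds referred to above, in standard form:

```latex
\[
\|\nabla f(x + d) - \nabla f(x)\| \;\le\; L\,\|d\|,
\qquad
\|\nabla f(x + t e_i) - \nabla f(x)\| \;\le\; L_{\mathrm{res}}\,|t|,
\qquad
\Lambda := \frac{L_{\mathrm{res}}}{L_{\max}} .
\]
```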

20 In the case of f convex and twice continuously differentiable, we have, by positive semidefiniteness of ∇²f(x) at all x, that these constants can be read off the Hessian, from which bounds on the ratio Λ can be deduced. We can derive stronger bounds on Λ for functions f in which the coupling between the components of x is weak. The coordinate Lipschitz constant L_max corresponds to the maximum absolute value of the diagonal elements of the Hessian ∇²f(x), while the restricted constant L_res is related to the maximal column norm of the Hessian. So, if the Hessian is positive semidefinite and diagonally dominant, the ratio Λ is at most 2.

21 Assumption 1
The function f is convex and uniformly Lipschitz continuously differentiable, and attains its minimum value f* on a set S. There is a finite R₀ such that the level set for f defined by x⁰ is bounded.
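A reconstruction of the level-set bound in Assumption 1; the form is standard, and the exact expression should be treated as an assumption, since the slide's formula was omitted:

```latex
\[
\max_{x^* \in S}\; \max_{x}\; \bigl\{\, \|x - x^*\| \;:\; f(x) \le f(x^0) \,\bigr\} \;\le\; R_0 .
\]
```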

22 Randomized Algorithms
The update component i_k is chosen randomly at each iteration. In the given algorithm we consider the simplest variant, in which each i_k is selected from {1, 2, …, n} with equal probability, independently of the selections made at previous iterations. We prove a convergence result for the randomized algorithm, for the simple step-length choice α_k ≡ 1/L_max. Algorithm 3 (a sketch follows below).
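A minimal Python sketch of this randomized variant with α_k ≡ 1/L_max (names are illustrative):

```python
import numpy as np

def randomized_cd(grad_i, x0, L_max, n_iters, rng=np.random.default_rng(0)):
    """Randomized CD (Algorithm 3 style): i_k uniform, alpha_k = 1/L_max."""
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    for _ in range(n_iters):
        i = rng.integers(n)              # uniform, independent of the past
        x[i] -= grad_i(x, i) / L_max     # short step along coordinate i
    return x
```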

23 Theorem 1: Suppose that Assumption 1 holds and that α_k ≡ 1/L_max in Algorithm 3. Then for all k > 0 we have the sublinear bound (27); when σ > 0, we have in addition the linear-rate bound (28).
Proof: By application of Taylor's theorem and the bounds (21) and (22), we obtain an estimate of f(x^{k+1}) in terms of f(x^k) and the i_k-th gradient component,

24 where we substituted the choice α_k ≡ 1/L_max in the last equality. Taking the expectation of both sides of this expression over the random index i_k, then subtracting f(x*) from both sides, taking expectations with respect to all random variables i_0, i_1, i_2, …, and introducing notation for the expected suboptimality, we obtain a recursion. By convexity of f we have, for any x* ∈ S,

25 where the final inequality holds because f(x^k) ≤ f(x^0), so that x^k lies in the level set in (26). Taking expectations of both sides gives a corresponding bound in expectation. Substituting this bound into (32) and rearranging yields a one-step decrease, and applying this formula recursively shows that (27) holds as claimed. In the case where f is strongly convex with modulus σ > 0, taking the minimum of both sides with respect to y in (20) and setting x = x^k gives a lower bound on ‖∇f(x^k)‖² in terms of f(x^k) − f*.

26 By using this expression to bound ‖∇f(x^k)‖² in (32), we obtain a geometric decrease per iteration; recursive application of this formula leads to (28). Note that the same convergence expressions can be obtained for more refined choices of the step length α_k, by making minor adjustments to the logic. For example, the choice α_k ≡ 1/L_{i_k} leads to the same bounds, and the same bounds also hold when α_k is the exact minimizer of f along the coordinate search direction. We compare (27) with the corresponding result for full-gradient descent with constant step length α_k ≡ 1/L; that iteration leads to a convergence expression of the same sublinear type (representative forms are sketched below).
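Stated loosely (the exact constants are given in the paper and are treated as omitted here), the bounds being compared have the following representative forms:

```latex
\[
\mathbb{E}\bigl[f(x^k)\bigr] - f^* \;\le\; O\!\left(\frac{n\,L_{\max}\,R_0^2}{k}\right)
\quad \text{(randomized CD, cf. (27))},
\]
\[
\mathbb{E}\bigl[f(x^k)\bigr] - f^* \;\le\; \Bigl(1 - \tfrac{\sigma}{n\,L_{\max}}\Bigr)^{k}\bigl(f(x^0) - f^*\bigr)
\quad \text{(strongly convex case, cf. (28))},
\]
\[
f(x^k) - f^* \;\le\; O\!\left(\frac{L\,R_0^2}{k}\right)
\quad \text{(full-gradient descent with } \alpha_k \equiv 1/L\text{)},
\]
```

so the comparison hinges on n·L_max versus L, with 1 ≤ L/L_max ≤ n as noted on slide 19.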

27 Accelerated Randomized Algorithms
Proposed by Nesterov. The method assumes that an estimate of the modulus of strong convexity σ ≥ 0 from (20) is available, as well as estimates of the component-wise Lipschitz constants L_i from (21). It is closely related to accelerated full-gradient methods. Algorithm 4

28 Theorem 2: Suppose that Assumption 1 holds, and define the quantities appearing in the bound (35). Then for all k ≥ 0 we have (35). In the strongly convex case σ > 0, the geometric term eventually dominates the second term in brackets in (35), so that the linear convergence rate suggested by this expression is significantly faster than the corresponding rate (28) for Algorithm 3; the acceleration reduces the number of iterations required to meet a specified error tolerance.

29 Efficient Implementation of Accelerated Algorithm
The higher cost of each iteration of Algorithm 4 detracts from the appeal of accelerated CD methods over standard methods. However, by using a change of variables due to Lee and Sidford, it is possible to implement the accelerated randomized CD approach efficiently for problems with certain structure, including the linear system Aw = b. We explain the Lee-Sidford technique in the context of the Kaczmarz algorithm for (8), assuming normalization of the rows of A as in (14). As explained in (16), the Kaczmarz algorithm is obtained by applying CD to the dual formulation (10) in the variables x, but operating in the space of "primal" variables w through the transformation w = A^T x. If we apply the transformations ṽ_k = A^T v_k and ỹ_k = A^T y_k to the vectors in Algorithm 4 and note that L_i ≡ 1 in (21), we obtain Algorithm 5.

30 Algorithm 5

31 When the matrix A is dense, there is only a small factor of difference between the per-iteration workload of the standard Kaczmarz algorithm and its accelerated variant, Algorithm 5; both require O(m+n) operations per iteration. When A is sparse, the per-iteration cost becomes the important consideration: at iteration k, the standard Kaczmarz method requires O(|A_{i_k}|) operations, where |A_{i_k}| is the number of nonzeros in row i_k, and with the reformulation the accelerated variant in Algorithm 5 likewise requires O(|A_{i_k}|) operations per iteration.

32 Conclusion
We have surveyed the state of the art in convergence of coordinate descent methods, with a focus on the most elementary settings and the most fundamental algorithms. Coordinate descent methods have become an important tool in the optimization toolbox used to solve problems that arise in machine learning and data analysis, particularly in "big data" settings.

