ADMM and DSO.

Slides:

Advertisements

Similar presentations

Regularization David Kauchak CS 451 – Fall 2013.

Advertisements

Electrical and Computer Engineering Mississippi State University

Semi-Stochastic Gradient Descent Methods Jakub Konečný (joint work with Peter Richtárik) University of Edinburgh.

Semi-Stochastic Gradient Descent Methods Jakub Konečný University of Edinburgh BASP Frontiers Workshop January 28, 2014.

Optimization Tutorial

Chapter 9 Perceptrons and their generalizations. Rosenblatt ’ s perceptron Proofs of the theorem Method of stochastic approximation and sigmoid approximation.

Lecture 13 – Perceptrons Machine Learning March 16, 2010.

Separating Hyperplanes

Distributed Optimization with Arbitrary Local Solvers

An Accelerated Gradient Method for Multi-Agent Planning in Factored MDPs Sue Ann HongGeoff Gordon CarnegieMellonUniversity.

Classification and Prediction: Regression Via Gradient Descent Optimization Bamshad Mobasher DePaul University.

Prénom Nom Document Analysis: Linear Discrimination Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.

Linear Regression  Using a linear function to interpolate the training set  The most popular criterion: Least squares approach  Given the training set:

Support Vector Machine (SVM) Classification

Reformulated - SVR as a Constrained Minimization Problem subject to n+1+2m variables and 2m constrains minimization problem Enlarge the problem size and.

EE 685 presentation Optimization Flow Control, I: Basic Algorithm and Convergence By Steven Low and David Lapsley Asynchronous Distributed Algorithm Proof.

Unconstrained Optimization Problem

Lecture 10: Support Vector Machines

Mehdi Ghayoumi MSB rm 132 Ofc hr: Thur, a Machine Learning.

Semi-Stochastic Gradient Descent Methods Jakub Konečný (joint work with Peter Richtárik) University of Edinburgh SIAM Annual Meeting, Chicago July 7, 2014.

PETE 603 Lecture Session #29 Thursday, 7/29/ Iterative Solution Methods Older methods, such as PSOR, and LSOR require user supplied iteration.

Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013.

Collaborative Filtering Matrix Factorization Approach

CS Statistical Machine learning Lecture 18 Yuan (Alan) Qi Purdue CS Oct

Machine Learning Weak 4 Lecture 2. Hand in Data It is online Only around 6000 images!!! Deadline is one week. Next Thursday lecture will be only one hour.

EE 685 presentation Optimization Flow Control, I: Basic Algorithm and Convergence By Steven Low and David Lapsley.

Biointelligence Laboratory, Seoul National University

Survey of Kernel Methods by Jinsan Yang. (c) 2003 SNU Biointelligence Lab. Introduction Support Vector Machines Formulation of SVM Optimization Theorem.

Multi-area Nonlinear State Estimation using Distributed Semidefinite Programming Hao Zhu October 15, 2012 Acknowledgements: Prof. G.

1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.

Network Lasso: Clustering and Optimization in Large Graphs

Support Vector Machines

1  Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.

METU Informatics Institute Min720 Pattern Classification with Bio-Medical Applications Part 7: Linear and Generalized Discriminant Functions.

Asynchronous Distributed ADMM for Consensus Optimization Ruiliang Zhang James T. Kwok Department of Computer Science and Engineering, Hong Kong University.

1 Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 23, 2010 Piotr Mirowski Based on slides by Sumit.

Support vector machines

Mathieu Leconte, Ioannis Steiakogiannakis, Georgios Paschos

LINEAR CLASSIFIERS The Problem: Consider a two class task with ω1, ω2.

Large Margin classifiers

Zhu Han University of Houston Thanks for Dr. Mingyi Hong’s slides

Multiplicative updates for L1-regularized regression

Solver & Optimization Problems

Lecture 07: Soft-margin SVM

6.5 Stochastic Prog. and Benders’ decomposition

Announcements HW4 due today (11:59pm) HW5 out today (due 11/17 11:59pm)

Privacy and Fault-Tolerance in Distributed Optimization Nitin Vaidya University of Illinois at Urbana-Champaign.

Machine Learning Basics

Probabilistic Models for Linear Regression

Kernels Usman Roshan.

Multi-layer perceptron

Steepest Descent Algorithm: Step 1.

Collaborative Filtering Matrix Factorization Approach

Lecture 07: Soft-margin SVM

CSCI B609: “Foundations of Data Science”

Large Scale Support Vector Machines

Lecture 08: Soft-margin SVM

Support vector machines

Lecture 07: Soft-margin SVM

CS5321 Numerical Optimization

Usman Roshan CS 675 Machine Learning

Support vector machines

Support vector machines

CS639: Data Management for Data Science

Neural Network Training

6.5 Stochastic Prog. and Benders’ decomposition

Multiple features Linear Regression with multiple variables

Multiple features Linear Regression with multiple variables

Mathieu Leconte, Ioannis Steiakogiannakis, Georgios Paschos

Presentation transcript:

ADMM and DSO

Pre-requisites for ADMM Dual Ascent Dual decomposition Augmented Lagrangian

Dual Ascent Constraint convex optimization of the form: Lagrangian: Dual function: Issue: Slow convergence No distributedness. Q: Why dual ascent is slow?

Dual decomposition Same constraint convex optimization problem but separable. Lagrangian: Parallelization is possible. Issue: Still slow convergence. If we map to machine learning algorithm then, its clear that dual decomposition actually does feature splitting. A_i x_i = column space of A(remember). If someone argues why for “Y” update we need to gather, then ans is : actually not for “Y” we are gathering, Here in this L(x,y) we need y^T A_i x_i . So we need to supply Y to all nodes.

xi updates of t+1 iteration are send to a central hub which calculates y(t+1) and then again propagates it to different machines.

Augmented Lagrangian method Constraint convex optimization : Updated objective function So the Lagrangian would look like: Updates would look like: Issue: Due to this new term we lost decomposability but improved convergence.

ADMM (Alternating Direction Method of multipliers) Standard ADMM Scaled ADMM

Standard ADMM Constraint convex optimization : Augmented Lagrangian: AL updates would be like: Because of the updates in alternating fashion of x and z, it is called as ADMM. “f” and “g” should be closed , proper , real-valued convex functions. Saddle point should exist and strong duality should hold.

Standard ADMM Blend of dual decomposition and augmented Lagrangian method(AL). ADMM updates would be: Because of the updates in alternating fashion of x and z, it is called as ADMM. “f” and “g” should be closed , proper , real-valued convex functions. Saddle point should exist and strong duality should hold.

Scaled ADMM Scale the dual variable: u=y/ρ The standard ADMM updates would look like: The formulas are shorter in this version. This version is widely used.

Least square problem Consider the method of least-square where we minimize the sum of square of errors for regression purpose: For standard ADMM to work, we will reformulate the problem as:

DSO (Distributed Stochastic Optimization)

Regularized Risk Minimization RRM : Introducing constraints: Here at this stage we can apply ADMM too, but let’s go for SGD formulation.

Lagrangian Fenchel-Legendre conjugate : Lagrangian: Lagrangian can be rewritten as:

DSO Again rewriting the previous equation but only for non-zero features. Now applying stochastic gradient descent for w and ascent for α : Note that we can parallelize this stochastic optimization algorithm.

Working of DSO

Thank-you 