ADMM and DSO.

Pre-requisites for ADMM:
- Dual ascent
- Dual decomposition
- Augmented Lagrangian

Dual Ascent
Constrained convex optimization of the form: minimize f(x) subject to Ax = b.
Lagrangian: L(x, y) = f(x) + y^T (Ax - b).
Dual function: g(y) = inf_x L(x, y).
Updates: x^{k+1} = argmin_x L(x, y^k), then y^{k+1} = y^k + α^k (A x^{k+1} - b).
Issues: slow convergence, and no distributed structure.
Q: Why is dual ascent slow? The dual update is plain gradient ascent on g with step size α^k, and it needs fairly strong assumptions on f (e.g., strict convexity) just to be well defined, so in practice it converges slowly.
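
As a rough illustration, here is a minimal dual-ascent sketch for a toy objective f(x) = (1/2)||x - c||^2 subject to Ax = b; the problem data and the constant step size alpha are made up for the example, and the closed-form x-step holds only for this particular f.

    import numpy as np

    # Toy problem: minimize 0.5*||x - c||^2 subject to A x = b.
    # For this f, argmin_x L(x, y) has the closed form x = c - A^T y.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 5))
    b = rng.standard_normal(3)
    c = rng.standard_normal(5)

    y = np.zeros(3)      # dual variable
    alpha = 0.05         # step size (assumed constant here)
    for k in range(2000):
        x = c - A.T @ y                  # x-minimization of the Lagrangian
        y = y + alpha * (A @ x - b)      # gradient ascent on the dual
    print("primal residual:", np.linalg.norm(A @ x - b))

Note that every iteration needs a full x-minimization and the dual step is plain gradient ascent, which is why convergence is slow compared to the augmented-Lagrangian variants below.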

Dual Decomposition
Same constrained convex optimization problem, but with f separable: minimize Σ_i f_i(x_i) subject to Σ_i A_i x_i = b, where x is split into blocks x_i and the A_i are the corresponding column blocks of A.
Lagrangian: L(x, y) = Σ_i ( f_i(x_i) + y^T A_i x_i ) - y^T b, which separates across the blocks, so the x_i-minimizations can run in parallel.
Parallelization is possible. Issue: convergence is still slow.
Mapped to a machine learning algorithm, dual decomposition is effectively feature splitting: each A_i x_i lives in the column space of the block A_i (remember), i.e., each block owns a subset of the features.
If someone asks why the y update needs a gather step, the answer is: it is not only for y itself. Each node's term in L(x, y) contains y^T A_i x_i, so y must be supplied to all nodes, and the hub must collect the A_i x_i to form the residual that drives the y update.

The x_i updates of iteration t+1 are sent to a central hub, which computes y(t+1) and then broadcasts it back to the different machines.
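
To make the gather/broadcast concrete, the following is a serial simulation of dual decomposition under the assumption that each block objective is the toy quadratic f_i(x_i) = (1/2)||x_i - c_i||^2 (so the local step has a closed form); in a real system each block update would run on its own machine and only A_i x_i and y would travel over the network.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n_blocks, d = 3, 4, 2
    A_blocks = [rng.standard_normal((m, d)) for _ in range(n_blocks)]  # column blocks of A
    c_blocks = [rng.standard_normal(d) for _ in range(n_blocks)]
    b = rng.standard_normal(m)

    y = np.zeros(m)
    alpha = 0.02                         # assumed step size
    for t in range(3000):
        # Each node i minimizes f_i(x_i) + y^T A_i x_i (closed form for the toy f_i).
        x_blocks = [c - A.T @ y for A, c in zip(A_blocks, c_blocks)]
        # Hub gathers the A_i x_i, takes a dual ascent step, and broadcasts the new y.
        residual = sum(A @ x for A, x in zip(A_blocks, x_blocks)) - b
        y = y + alpha * residual
    print("constraint violation:", np.linalg.norm(residual))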

Augmented Lagrangian Method
Constrained convex optimization: minimize f(x) subject to Ax = b.
Updated objective function: add a quadratic penalty, f(x) + (ρ/2)||Ax - b||^2, which does not change the solution on the feasible set.
So the (augmented) Lagrangian looks like: L_ρ(x, y) = f(x) + y^T (Ax - b) + (ρ/2)||Ax - b||^2.
Updates look like: x^{k+1} = argmin_x L_ρ(x, y^k), then y^{k+1} = y^k + ρ (A x^{k+1} - b).
Issue: because of this new quadratic term we lose decomposability, but we gain much better convergence.
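
A matching sketch of the method of multipliers on the same toy quadratic f(x) = (1/2)||x - c||^2 (data made up); the x-step is now a small linear solve because of the (ρ/2)||Ax - b||^2 term, and the dual step uses ρ itself as the step size.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 5))
    b = rng.standard_normal(3)
    c = rng.standard_normal(5)

    rho = 1.0
    y = np.zeros(3)
    for k in range(100):
        # x-step: minimize 0.5||x-c||^2 + y^T(Ax-b) + (rho/2)||Ax-b||^2
        x = np.linalg.solve(np.eye(5) + rho * A.T @ A, c - A.T @ y + rho * A.T @ b)
        # y-step: multiplier update with step size rho
        y = y + rho * (A @ x - b)
    print("primal residual:", np.linalg.norm(A @ x - b))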

ADMM (Alternating Direction Method of Multipliers)
- Standard ADMM
- Scaled ADMM

Standard ADMM
Constrained convex optimization: minimize f(x) + g(z) subject to Ax + Bz = c.
Augmented Lagrangian: L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + (ρ/2)||Ax + Bz - c||^2.
The augmented Lagrangian (AL) updates would minimize L_ρ jointly over (x, z) before each dual step, which again destroys the split between f and g.
Assumptions: f and g should be closed, proper convex functions, a saddle point should exist, and strong duality should hold.

Standard ADMM
A blend of dual decomposition and the augmented Lagrangian method (AL). The ADMM updates are:
x^{k+1} = argmin_x L_ρ(x, z^k, y^k)
z^{k+1} = argmin_z L_ρ(x^{k+1}, z, y^k)
y^{k+1} = y^k + ρ (A x^{k+1} + B z^{k+1} - c)
Because x and z are updated in an alternating fashion, the method is called the Alternating Direction Method of Multipliers.
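
A minimal sketch of the three alternating updates on a made-up consensus instance: minimize (1/2)||x - a||^2 + (1/2)||z - d||^2 subject to x - z = 0 (so A = I, B = -I, c = 0); both subproblems then have closed forms and x, z converge to (a + d)/2.

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    d = np.array([3.0, 0.0, -1.0])
    rho = 1.0

    x = np.zeros(3); z = np.zeros(3); y = np.zeros(3)
    for k in range(200):
        x = (a - y + rho * z) / (1 + rho)    # x-step: argmin_x L_rho(x, z, y)
        z = (d + y + rho * x) / (1 + rho)    # z-step: argmin_z L_rho(x, z, y)
        y = y + rho * (x - z)                # dual update
    print(x, z)                              # both approach (a + d) / 2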

Scaled ADMM
Scale the dual variable: u = y/ρ. The standard ADMM updates then look like:
x^{k+1} = argmin_x f(x) + (ρ/2)||Ax + Bz^k - c + u^k||^2
z^{k+1} = argmin_z g(z) + (ρ/2)||Ax^{k+1} + Bz - c + u^k||^2
u^{k+1} = u^k + A x^{k+1} + B z^{k+1} - c
The formulas are shorter in this version, and it is the version most widely used.
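
Continuing the same made-up consensus example, the scaled form with u = y/ρ produces identical iterates; this is only a sketch of the bookkeeping change, not a different algorithm.

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    d = np.array([3.0, 0.0, -1.0])
    rho = 1.0

    x = np.zeros(3); z = np.zeros(3); u = np.zeros(3)   # u = y / rho
    for k in range(200):
        x = (a + rho * (z - u)) / (1 + rho)   # x-step with the scaled dual
        z = (d + rho * (x + u)) / (1 + rho)   # z-step
        u = u + (x - z)                       # scaled dual update
    print(x, z)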

Least-Squares Problem
Consider the method of least squares, where we minimize the sum of squared errors for regression: minimize (1/2)||Ax - b||^2 over x.
For standard ADMM to work, we reformulate the problem in the two-block form minimize f(x) + g(z) subject to Ax + Bz = c, for example by introducing a copy of the variable (or splitting the data across machines) and enforcing agreement through the constraint.
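
One concrete (assumed) choice of split, since the slide's exact reformulation is not shown: partition the rows of (A, b) across machines and solve minimize Σ_i (1/2)||A_i x_i - b_i||^2 subject to x_i = z, i.e. global-consensus ADMM in scaled form; the data here is synthetic.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    A_blocks = [rng.standard_normal((10, n)) for _ in range(3)]
    x_true = rng.standard_normal(n)
    b_blocks = [A @ x_true + 0.1 * rng.standard_normal(10) for A in A_blocks]

    rho = 1.0
    z = np.zeros(n)
    xs = [np.zeros(n) for _ in A_blocks]
    us = [np.zeros(n) for _ in A_blocks]               # scaled duals, one per block
    for k in range(100):
        # local x_i-steps (each could run on its own machine)
        xs = [np.linalg.solve(A.T @ A + rho * np.eye(n), A.T @ b + rho * (z - u))
              for A, b, u in zip(A_blocks, b_blocks, us)]
        # z-step: average of x_i + u_i (the gather/broadcast step)
        z = sum(x + u for x, u in zip(xs, us)) / len(xs)
        # scaled dual updates
        us = [u + x - z for u, x in zip(us, xs)]
    print("consensus least-squares solution:", z)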

DSO (Distributed Stochastic Optimization)

Regularized Risk Minimization
RRM: minimize over w the objective λ Ω(w) + (1/m) Σ_i ℓ(⟨w, x_i⟩, y_i), where Ω is the regularizer and ℓ is the loss on example (x_i, y_i).
Introducing constraints: with auxiliary variables u_i, this becomes minimize λ Ω(w) + (1/m) Σ_i ℓ(u_i, y_i) subject to u_i = ⟨w, x_i⟩.
At this stage we could apply ADMM too, but let's go for the SGD formulation instead.

Lagrangian
Fenchel-Legendre conjugate: ℓ*(α, y) = sup_u { α u - ℓ(u, y) }, the conjugate of the loss in its first argument.
Lagrangian of the constrained RRM (with multipliers α_i for u_i = ⟨w, x_i⟩): L(w, u, α) = λ Ω(w) + (1/m) Σ_i ℓ(u_i, y_i) + (1/m) Σ_i α_i (⟨w, x_i⟩ - u_i).
Eliminating u through the conjugate, the Lagrangian can be rewritten as a saddle-point problem in w and α only: min_w max_α λ Ω(w) + (1/m) Σ_i ( α_i ⟨w, x_i⟩ - ℓ*(α_i, y_i) ).
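
For reference, a worked version of that rewrite, assuming the loss is closed and convex in its first argument so the biconjugate identity ℓ = ℓ** applies (sign conventions may differ from the original slides):

    \min_{u_i}\ \ell(u_i, y_i) - \alpha_i u_i = -\ell^*(\alpha_i, y_i),
    \quad\text{where } \ell^*(\alpha, y) = \sup_{u}\{\alpha u - \ell(u, y)\},

    \text{hence}\quad
    \min_{w}\ \lambda\,\Omega(w) + \frac{1}{m}\sum_{i=1}^{m}\ell(\langle w, x_i\rangle, y_i)
    = \min_{w}\max_{\alpha}\ \lambda\,\Omega(w)
      + \frac{1}{m}\sum_{i=1}^{m}\big(\alpha_i\langle w, x_i\rangle - \ell^*(\alpha_i, y_i)\big).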

DSO
Again rewrite the previous equation, but only over the non-zero features: ⟨w, x_i⟩ = Σ_{j : x_ij ≠ 0} w_j x_ij, so each example touches only the coordinates of w it actually uses.
Now apply stochastic gradient descent for w and stochastic gradient ascent for α on this saddle-point objective.
Note that we can parallelize this stochastic optimization algorithm: an update for example i touches only α_i and the w_j with x_ij ≠ 0, so updates on disjoint blocks of examples and features do not conflict.
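
A serial sketch of those primal-descent / dual-ascent updates (not the parallel DSO scheme itself), assuming for concreteness the squared loss ℓ(u, y) = (1/2)(u - y)^2, whose conjugate is ℓ*(α, y) = (1/2)α^2 + αy, an L2 regularizer Ω(w) = (1/2)||w||^2, synthetic data, and a made-up step-size schedule.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 200, 10
    X = rng.standard_normal((m, n))
    y = X @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

    lam = 0.1
    w = np.zeros(n)
    alpha = np.zeros(m)                    # one dual variable per example
    for t in range(1, 50001):
        eta = 1.0 / (lam * t + 100.0)      # assumed decaying step size
        i = rng.integers(m)                # sample one example
        # descent in w: stochastic gradient of lam/2*||w||^2 + (1/m) sum_i alpha_i*<w, x_i>
        w -= eta * (lam * w + alpha[i] * X[i])
        # ascent in alpha_i: gradient of alpha_i*<w, x_i> - l*(alpha_i, y_i)
        alpha[i] += eta * (X[i] @ w - alpha[i] - y[i])
    print("regularized risk:", 0.5 * lam * w @ w + 0.5 * np.mean((X @ w - y) ** 2))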

Working of DSO

Thank you.