Frank-Wolfe optimization insights in machine learning. Simon Lacoste-Julien, INRIA / École Normale Supérieure, SIERRA Project Team. SMILE, November 4th, 2013.

Outline
- Frank-Wolfe optimization
- Frank-Wolfe for structured prediction
  - links with previous algorithms
  - block-coordinate extension
  - results for sequence prediction
- Herding as Frank-Wolfe optimization
  - extension: weighted Herding
  - simulations for quadrature

Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient)
Algorithm for constrained optimization min_{x in M} f(x), with f convex & continuously differentiable and M convex & compact.
FW algorithm - repeat:
  1) find a good feasible direction by minimizing the linearization of f:  s_t = argmin_{s in M} <s, ∇f(x_t)>
  2) take a convex step in that direction:  x_{t+1} = (1 - γ_t) x_t + γ_t s_t
Properties:
- O(1/T) rate
- sparse iterates
- get a duality gap for free
- affine invariant
- rate holds even if the linear subproblem is solved only approximately
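To make the two steps concrete, here is a minimal runnable sketch of the generic loop on a toy simplex-constrained quadratic; the function and variable names are mine, not from the slides.

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, num_iters=100, tol=1e-6):
    """Generic Frank-Wolfe loop: `grad` returns the gradient of f,
    `lmo(g)` solves the linear subproblem argmin_{s in M} <s, g>."""
    x = x0.copy()
    for t in range(num_iters):
        g = grad(x)
        s = lmo(g)                       # 1) linear minimization oracle
        gap = g @ (x - s)                # duality gap certificate, for free
        if gap <= tol:
            break
        gamma = 2.0 / (t + 2.0)          # default step size; line search also works
        x = (1 - gamma) * x + gamma * s  # 2) convex step keeps x feasible
    return x

# Toy example: minimize ||x - b||^2 over the probability simplex.
# The LMO over the simplex just returns the vertex with the smallest gradient entry.
b = np.array([0.1, 0.7, 0.2])
x_star = frank_wolfe(grad=lambda x: 2 * (x - b),
                     lmo=lambda g: np.eye(len(g))[np.argmin(g)],
                     x0=np.ones(3) / 3)
```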

Frank-Wolfe: properties
- convex steps => the iterate is a sparse convex combination of the vertices visited so far
- get a duality gap certificate for free (special case of the Fenchel duality gap), and it also converges as O(1/T)!
- only need to solve the linear subproblem *approximately* (additive/multiplicative bound)
- affine invariant! [see Jaggi ICML 2013]

Block-Coordinate Frank-Wolfe Optimization for Structured SVMs [ICML 2013]. Simon Lacoste-Julien, Martin Jaggi, Patrick Pletscher, Mark Schmidt.

Structured SVM optimization
- structured prediction: learn a classifier h_w(x) = argmax_{y in Y(x)} <w, φ(x, y)>   (decoding)
- binary hinge loss vs. structured hinge loss:
  H_i(w) = max_{y in Y(x_i)} L_i(y) - <w, ψ_i(y)>,  with ψ_i(y) = φ(x_i, y_i) - φ(x_i, y)   -> loss-augmented decoding
- structured SVM primal:  min_w (λ/2) ||w||^2 + (1/n) Σ_i H_i(w)
- structured SVM dual: a quadratic program with one probability simplex per example -> exponential number of variables!
- primal-dual pair: w = A α, where the columns of A are the ψ_i(y) / (λ n)
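As a concrete illustration of the primal objective, here is a minimal sketch of how one structured hinge term could be evaluated; `joint_feature`, `loss` and `loss_augmented_decode` are hypothetical user-supplied callables, not code from the paper.

```python
def structured_hinge(w, x_i, y_i, joint_feature, loss, loss_augmented_decode):
    """H_i(w) = max_y [ L(y_i, y) - <w, phi(x_i, y_i) - phi(x_i, y)> ]."""
    # The expensive combinatorial step: argmax_y  L(y_i, y) + <w, phi(x_i, y)>
    y_star = loss_augmented_decode(w, x_i, y_i)
    psi = joint_feature(x_i, y_i) - joint_feature(x_i, y_star)  # psi_i(y_star)
    return loss(y_i, y_star) - w @ psi, psi  # hinge value; note -psi is a subgradient of H_i at w
```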

Structured SVM optimization (2)
Popular approaches:
- stochastic subgradient method [Ratliff et al. 07, Shalev-Shwartz et al. 10]
  - pros: online!
  - cons: sensitive to the step-size; don't know when to stop
- cutting plane method (SVMstruct) [Tsochantaridis et al. 05, Joachims et al. 09]
  - pros: automatic step-size; duality gap
  - cons: batch! -> slow for large n
Our approach: block-coordinate Frank-Wolfe on the dual -> combines the best of both worlds:
- online!
- automatic step-size via analytic line search
- duality gap
- rates also hold for approximate oracles
- rate: O(1/K) error after K passes through the data

Frank-Wolfe algorithm [Frank, Wolfe 1956] (recap of the slide above: find a feasible direction by minimizing the linearization, take a convex step; O(1/T) rate, sparse iterates, duality gap for free, affine invariant, rate holds even with approximate linear subproblems).

Frank-Wolfe for structured SVM
Run the FW algorithm on the structured SVM dual - repeat:
  1) find a good feasible direction by minimizing the linearization of the dual objective
  2) take a convex step in that direction
Key insight: through the primal-dual link w = A α, the linear subproblem decomposes into loss-augmented decoding on each example i, so one FW iteration becomes a batch subgradient step on the primal.
Choose the step size γ by analytic line search on the quadratic dual.
Link between FW and the subgradient method: see [Bach 12].
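Since the dual is a quadratic, the line search has a closed form; the following is a sketch in the generic FW notation used above (with Hessian Q), not the paper's exact expression.

```latex
% Analytic line search for FW on a quadratic objective f(x) = (1/2) x^T Q x + b^T x:
\gamma_t \;=\; \operatorname*{argmin}_{\gamma \in [0,1]} f\big(x_t + \gamma (s_t - x_t)\big)
        \;=\; \operatorname{clip}_{[0,1]}\!\left(
              \frac{\langle \nabla f(x_t),\, x_t - s_t \rangle}
                   {(s_t - x_t)^{\top} Q \,(s_t - x_t)} \right).
% The numerator is exactly the Frank-Wolfe duality gap at x_t.
```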

FW for structured SVM: properties
- running FW on the dual <=> batch subgradient on the primal, but with an adaptive step-size from the analytic line-search and a duality gap stopping criterion
- 'fully corrective' FW on the dual <=> cutting plane algorithm (SVMstruct)
  - still an O(1/T) rate, but this gives a simpler proof of SVMstruct convergence + approximate-oracle guarantees
  - not faster than simple FW in our experiments
- BUT: still batch => slow for large n...

Block-Coordinate Frank-Wolfe (new!)
For constrained optimization over a compact product domain M = M_1 x ... x M_n:
- pick a block i at random; update only block i with a FW step
- we proved the same O(1/T) rate as batch FW
  -> but each step is n times cheaper
  -> and the constant can be the same (e.g. for the SVM)
Properties: O(1/T) rate, sparse iterates, duality gap guarantees, affine invariance, and the rate holds even if the linear subproblem is solved only approximately.

Block-Coordinate Frank-Wolfe (new!)
For constrained optimization over a compact product domain: pick i at random and update only block i with a FW step.
Specialized to the structured SVM dual, the FW step on block i is exactly loss-augmented decoding on example i.
We proved the same O(1/T) rate as batch FW -> each step is n times cheaper, and the constant can be the same (e.g. for the SVM).
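A minimal sketch of the generic block-coordinate loop, assuming per-block partial gradients and linear minimization oracles supplied by the caller (names and interfaces are mine, not the paper's).

```python
import numpy as np

def block_coordinate_frank_wolfe(grads, lmos, x_blocks, num_iters=1000, seed=0):
    """Generic BCFW sketch: x_blocks[i] is the variable of block i,
    grads[i](x_blocks) returns the partial gradient w.r.t. block i,
    lmos[i](g) solves the linear subproblem over block i's domain."""
    rng = np.random.default_rng(seed)
    n = len(x_blocks)
    for k in range(num_iters):
        i = rng.integers(n)                      # pick a block uniformly at random
        g_i = grads[i](x_blocks)                 # partial gradient for block i
        s_i = lmos[i](g_i)                       # block-wise linear minimization oracle
        gamma = 2.0 * n / (k + 2.0 * n)          # default step size; line search is also possible
        x_blocks[i] = (1 - gamma) * x_blocks[i] + gamma * s_i
    return x_blocks
```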

BCFW for structured SVM: properties
- each update requires a single oracle call (vs. n for SVMstruct), so a given error is reached after correspondingly fewer oracle calls / passes through the data
- advantages over stochastic subgradient:
  - step-sizes by line-search -> more robust
  - duality gap certificate -> know when to stop
  - guarantees hold for approximate oracles
- implementation: almost as simple as the stochastic subgradient method
- caveat: need to store one parameter vector per example (or store the dual variables)
- for the binary SVM it reduces to the DCA method [Hsieh et al. 08]; interesting link with prox-SDCA [Shalev-Shwartz et al. 12]

More info about constants...
- the batch FW rate is governed by the "curvature" constant of the objective; the BCFW rate by the "product curvature" (the sum of the per-block curvatures)
- comparing constants: for the structured SVM the constants match, so the n-times-cheaper BCFW steps give a real speed-up
- identity Hessian + cube constraint: no speed-up -> can be removed with line-search
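For reference, the curvature constant these rates refer to, in the form given by Jaggi (ICML 2013), and the product curvature used for BCFW; this is a reminder of the standard definitions, not text from the slides.

```latex
% Curvature constant of f over the domain M, and the resulting batch FW rate:
C_f \;=\; \sup_{\substack{x, s \in \mathcal{M},\; \gamma \in [0,1] \\ y = x + \gamma (s - x)}}
      \frac{2}{\gamma^{2}} \Big( f(y) - f(x) - \langle y - x,\, \nabla f(x) \rangle \Big),
\qquad
f(x_T) - f(x^{\star}) \;\le\; \frac{2 C_f}{T + 2}.

% "Product curvature" for a product domain M = M_1 \times \dots \times M_n:
C_f^{\otimes} \;=\; \sum_{i=1}^{n} C_f^{(i)},
\quad \text{where } C_f^{(i)} \text{ is the curvature restricted to moves in block } i.
```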

Sidenote: weighted averaging
- it is standard to average the iterates of the stochastic subgradient method
- uniform averaging vs. t-weighted averaging [L.-J. et al. 12], [Shamir & Zhang 13]
- weighted averaging improves the duality gap for BCFW, and also makes a big difference in test error!
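The two averaging schemes referred to above, written out explicitly (a standard formulation; the running form of the weighted average is my own rearrangement).

```latex
% Uniform vs. t-weighted averaging of the iterates w_1, ..., w_T:
\bar{w}^{\mathrm{unif}}_T = \frac{1}{T} \sum_{t=1}^{T} w_t,
\qquad
\bar{w}^{\mathrm{wtd}}_T = \frac{2}{T(T+1)} \sum_{t=1}^{T} t\, w_t
\;\;\Longleftrightarrow\;\;
\bar{w}^{\mathrm{wtd}}_t = \frac{t-1}{t+1}\, \bar{w}^{\mathrm{wtd}}_{t-1} + \frac{2}{t+1}\, w_t .
```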

Experiments: OCR dataset, CoNLL dataset.

Surprising test error though! Comparing the test error and the optimization error plots, the ordering of the methods is flipped!

Conclusions for 1st part
- applying FW on the dual of the structured SVM unified previous algorithms and provided a line-search version of batch subgradient
- new block-coordinate variant of the Frank-Wolfe algorithm: same convergence rate but with a cheaper iteration cost
- yields a robust & fast algorithm for the structured SVM
- future work: caching tricks, non-uniform sampling, regularization path, explain the weighted-averaging test error mystery

On the Equivalence between Herding and Conditional Gradient Algorithms [ICML 2012]. Simon Lacoste-Julien, Francis Bach, Guillaume Obozinski.

A motivation: quadrature
Approximating integrals:
- random sampling yields O(1/sqrt(T)) error
- herding [Welling 2009] yields O(1/T) error! [Chen et al. 2010] (like quasi-MC)
This part:
- links herding with an optimization algorithm (conditional gradient / Frank-Wolfe)
- suggests extensions, e.g. a weighted version with a faster rate
- BUT the extensions are worse for learning??? -> yields interesting insights on the properties of herding...

Outline
- Background: Herding [Conditional gradient algorithm]
- Equivalence between herding & cond. gradient
- Extensions
- New rates & theorems
- Simulations
  - approximation of integrals with cond. gradient variants
  - learned distribution vs. max entropy

Review of herding [Welling ICML 2009]
Motivation - learning in an MRF with feature map φ usually goes data -> parameter -> samples:
- learning: (approximate) ML / max. entropy <=> moment matching
- (approximate) inference: sampling
Herding instead goes directly from the data to (pseudo)samples.

Herding updates
'Tipi' function: the zero-temperature limit of the log-likelihood (thanks to Max Welling for the picture),
  F(w) = <w, μ> - max_x <w, φ(x)>,   with μ the target moment vector.
Herding updates = subgradient ascent updates on F with step size 1:
  x_{t+1} = argmax_x <w_t, φ(x)>,    w_{t+1} = w_t + μ - φ(x_{t+1})
Properties:
1) weakly chaotic -> entropy?
2) moment matching: (1/T) Σ_t φ(x_t) -> μ   -> our focus
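A minimal runnable sketch of these updates on a finite candidate set, with a brute-force argmax; the function names and the toy example are mine, not from the slides.

```python
import numpy as np

def herding(mu, feature_map, candidates, num_iters=100):
    """Herding sketch: mu is the target moment vector, feature_map(x) returns phi(x)."""
    w = mu.copy()                       # initialization of w_0 (the exact choice is not essential)
    samples = []
    for _ in range(num_iters):
        # x_{t+1} = argmax_x <w_t, phi(x)>   (brute force over the candidate set)
        x_next = max(candidates, key=lambda x: w @ feature_map(x))
        samples.append(x_next)
        # w_{t+1} = w_t + mu - phi(x_{t+1})
        w = w + mu - feature_map(x_next)
    return samples

# Toy usage: match the mean of 2-bit feature vectors.
candidates = [np.array(b, dtype=float) for b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
mu = np.array([0.3, 0.6])
xs = herding(mu, feature_map=lambda x: x, candidates=candidates, num_iters=50)
print(np.mean(xs, axis=0))              # the empirical moments approach mu
```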

Approx. integrals in RKHS
- reproducing property: f(x) = <f, Φ(x)> for f in the RKHS H
- define the mean map: μ(p) = E_{x~p}[Φ(x)]
- want to approximate integrals of the form E_{x~p}[f(x)] = <f, μ(p)>
- use a weighted sum of samples to get an approximated mean: μ̂ = Σ_t w_t Φ(x_t)
- the approximation error is then bounded by |E_p[f] - Σ_t w_t f(x_t)| = |<f, μ(p) - μ̂>| ≤ ||f||_H ||μ(p) - μ̂||_H
- controlling the moment discrepancy ||μ(p) - μ̂||_H is thus enough to control the error of all integrals of functions in the RKHS
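When the target p is an empirical distribution, the discrepancy above can be computed exactly from kernel evaluations; a sketch with an RBF kernel (the kernel choice and helper names are mine, not the paper's). Here `samples` and `data` are arrays of shape (num_points, dim) and `weights` sums to 1.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def moment_discrepancy(samples, weights, data, gamma=1.0):
    """||mu_hat - mu(p)||_H for an empirical target p = uniform over `data`,
    with mu_hat = sum_t weights[t] * Phi(samples[t]), expanded via kernel evaluations."""
    K_ss = rbf_kernel(samples, samples, gamma)
    K_sd = rbf_kernel(samples, data, gamma)
    K_dd = rbf_kernel(data, data, gamma)
    n = data.shape[0]
    sq_norm = (weights @ K_ss @ weights
               - 2.0 * weights @ K_sd.mean(axis=1)
               + K_dd.sum() / n**2)
    return np.sqrt(max(sq_norm, 0.0))
```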

Conditional gradient algorithm (aka Frank-Wolfe)
Algorithm to optimize min_{g in M} J(g), with J convex & (twice) continuously differentiable and M convex & compact. Repeat:
  1) find a good feasible direction by minimizing the linearization of J:  ḡ_t = argmin_{g in M} <g, J'(g_t)>
  2) take a convex step in that direction:  g_{t+1} = (1 - γ_t) g_t + γ_t ḡ_t
-> converges in O(1/T) in general

Herding & cond. grad. are equivalent
Trick: look at conditional gradient on the dummy objective J(g) = (1/2) ||g - μ||² over the marginal polytope M = conv{φ(x)}.
- herding updates:  x_{t+1} = argmax_x <w_t, φ(x)>,  w_{t+1} = w_t + μ - φ(x_{t+1})
- cond. grad. updates:  x_{t+1} = argmax_x <μ - g_t, φ(x)>,  g_{t+1} = (1 - γ_t) g_t + γ_t φ(x_{t+1})
- do the change of variable w_t = t (μ - g_t): the two coincide with step-size γ_t = 1/(t+1)
Subgradient ascent and conditional gradient are Fenchel duals of each other! (see also [Bach 2012])
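For completeness, a quick check of the change of variable claimed above (my own verification, not taken from the paper):

```latex
% With \gamma_t = 1/(t+1) and w_t = t(\mu - g_t):
g_{t+1} = \tfrac{t}{t+1}\, g_t + \tfrac{1}{t+1}\, \varphi(x_{t+1})
\;\Longrightarrow\;
w_{t+1} = (t+1)\,(\mu - g_{t+1})
        = t(\mu - g_t) + \mu - \varphi(x_{t+1})
        = w_t + \mu - \varphi(x_{t+1}),
% and the argmax is unchanged because w_t is a positive multiple of \mu - g_t.
```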

Extensions of herding
More general step-sizes γ_t give a weighted sum μ̂ = Σ_t w_t φ(x_t) instead of a uniform average (see the unrolled weights below).
Two extensions:
1) line search for γ_t
2) min-norm point algorithm (minimize J(g) over the convex hull of the previously visited points)
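Unrolling the conditional gradient recursion makes the induced weights explicit (a standard calculation, not text from the slides):

```latex
% Unrolling g_{t+1} = (1 - \gamma_t) g_t + \gamma_t \varphi(x_{t+1}):
g_T = \sum_{t=0}^{T-1} \Bigg[ \gamma_t \prod_{s=t+1}^{T-1} (1 - \gamma_s) \Bigg] \varphi(x_{t+1})
      \;+\; \Bigg[ \prod_{s=0}^{T-1} (1 - \gamma_s) \Bigg] g_0 .
% With \gamma_t = 1/(t+1) every weight equals 1/T (herding's uniform average);
% other step-size choices, e.g. from line search, give non-uniform weights.
```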

Rates of convergence & thms.
- with no assumption: cond. gradient yields ||μ̂_T - μ|| = O(1/sqrt(T))
- if μ is assumed to lie in the relative interior of M with radius r > 0: [Chen et al. 2010] yields O(1/T) for herding, whereas the line-search version yields a faster rate [Guélat & Marcotte 1986, Beck & Teboulle 2004]
- Propositions: in the infinite-dimensional RKHS setting this assumption fails (i.e. the [Chen et al. 2010] rate doesn't hold!)

Simulation 1: approx. integrals. Kernel herding, using an RKHS with the Bernoulli polynomial kernel (infinite-dimensional; closed form available).

Simulation 2: max entropy? Learning independent bits: plots of the error on the moments and the error on the distribution.

Conclusions for 2nd part
Equivalence of herding and conditional gradient:
- yields better algorithms for quadrature based on moments
- but highlights the max entropy / moment matching tradeoff!
Other interesting points:
- setting up fake optimization problems -> harvest properties of known algorithms
- the conditional gradient algorithm is useful to know...
- the duality of subgradient & conditional gradient is more general
Recent related work: link with Bayesian quadrature [Huszar & Duvenaud UAI 2012]; herded Gibbs sampling [Bornn et al. ICLR 2013]

Thank you!