1
Frank-Wolfe optimization insights in machine learning
Simon Lacoste-Julien, INRIA / École Normale Supérieure, SIERRA Project Team
SMILE – November 4th 2013
2
Outline
- Frank-Wolfe optimization
- Frank-Wolfe for structured prediction
  - links with previous algorithms
  - block-coordinate extension
  - results for sequence prediction
- Herding as Frank-Wolfe optimization
  - extension: weighted Herding
  - simulations for quadrature
3
Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient)
Algorithm for constrained optimization: min_{x ∈ M} f(x), with f convex & continuously differentiable and M convex & compact.
FW algorithm – repeat:
1) find a good feasible direction by minimizing the linearization of f at x_t:
   s_t = argmin_{s ∈ M} ⟨s, ∇f(x_t)⟩
2) take a convex step in that direction:
   x_{t+1} = (1 − γ_t) x_t + γ_t s_t
Properties:
- O(1/T) rate
- sparse iterates
- get duality gap for free
- affine invariant
- rate holds even if the linear subproblem is solved only approximately
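As an illustration (not from the slides), here is a minimal Python sketch of the generic Frank-Wolfe loop on a toy instance: a quadratic minimized over the unit simplex, where the linear subproblem has a closed-form vertex solution and the duality gap ⟨x_t − s_t, ∇f(x_t)⟩ serves as a stopping criterion. The objective and data are assumptions chosen for the example.

```python
import numpy as np

# Toy instance (assumed for illustration): minimize f(x) = 0.5 * ||x - b||^2
# over the unit simplex M = {x >= 0, sum(x) = 1}.
rng = np.random.default_rng(0)
b = rng.normal(size=10)

def grad(x):
    return x - b

def frank_wolfe(x0, n_iters=200, tol=1e-6):
    x = x0.copy()
    for t in range(n_iters):
        g = grad(x)
        # Linear subproblem over the simplex: the minimizer is a vertex,
        # i.e. the indicator of the coordinate with the smallest gradient entry.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        # Duality gap <x - s, grad f(x)> certifies suboptimality.
        gap = g @ (x - s)
        if gap < tol:
            break
        gamma = 2.0 / (t + 2)            # standard step size; line search also possible
        x = (1 - gamma) * x + gamma * s  # convex combination stays feasible
    return x, gap

x_star, gap = frank_wolfe(np.ones(10) / 10)
print(f"final duality gap: {gap:.2e}")
```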
4
Frank-Wolfe: properties
- convex steps => the iterate is a sparse convex combination of visited corners
- get a duality gap certificate for free (special case of the Fenchel duality gap), which also converges as O(1/T)!
- only need to solve the linear subproblem *approximately* (additive/multiplicative bound)
- affine invariant! [see Jaggi ICML 2013]
5
Block-Coordinate Frank-Wolfe Optimization for Structured SVMs [ICML 2013]
Simon Lacoste-Julien, Martin Jaggi, Patrick Pletscher, Mark Schmidt
6
Structured SVM optimization
- structured prediction: learn the classifier h_w(x) = argmax_{y ∈ Y(x)} ⟨w, φ(x, y)⟩ (decoding)
- structured SVM primal:
  min_w (λ/2)‖w‖² + (1/n) Σ_i H_i(w)
- vs. the binary hinge loss max(0, 1 − y_i ⟨w, x_i⟩), the structured hinge loss is
  H_i(w) = max_{y ∈ Y_i} L_i(y) − ⟨w, ψ_i(y)⟩, with ψ_i(y) := φ(x_i, y_i) − φ(x_i, y)
  -> evaluating it is loss-augmented decoding
- structured SVM dual: one variable α_i(y) per example i and output y -> exponential number of variables!
  min over α_i ∈ Δ(Y_i) of (λ/2)‖Aα‖² − bᵀα, where A has columns ψ_i(y)/(λn) and b_i(y) = L_i(y)/n
- primal-dual pair: linked by w = Aα
7
Structured SVM optimization (2)
- popular approaches:
  - stochastic subgradient method [Ratliff et al. 07, Shalev-Shwartz et al. 10]
    pros: online! cons: sensitive to step-size; don't know when to stop
  - cutting plane method (SVMstruct) [Tsochantaridis et al. 05, Joachims et al. 09]
    pros: automatic step-size; duality gap. cons: batch! -> slow for large n
- our approach: block-coordinate Frank-Wolfe on the dual -> combines the best of both worlds:
  - online!
  - automatic step-size via analytic line search
  - duality gap
  - rates also hold for approximate oracles
  - rate: O(1/K) error after K passes through the data
8
Recap of the Frank-Wolfe algorithm (slide 3) – repeat: 1) s_t = argmin_{s ∈ M} ⟨s, ∇f(x_t)⟩, 2) x_{t+1} = (1 − γ_t) x_t + γ_t s_t.
9
Frank-Wolfe for structured SVM
Apply the FW algorithm to the structured SVM dual:
1) find a good feasible direction by minimizing the linearization of the dual objective at α_t; key insight: the linear subproblem decomposes over the blocks, and (using the primal-dual link w = Aα) solving it for block i is exactly loss-augmented decoding on example i:
   y*_i = argmax_{y ∈ Y_i} L_i(y) − ⟨w, ψ_i(y)⟩
2) take a convex step in that direction: in the primal this becomes a batch subgradient step, but with γ chosen by analytic line search on the quadratic dual.
Link between FW and the subgradient method: see [Bach 12].
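For concreteness, here is the analytic line search on the quadratic dual, written in the notation above (a reconstruction consistent with the batch variant of the BCFW paper, and an assumption on my part rather than a quote from the slide): the FW corner s yields primal quantities w_s = As and ℓ_s = bᵀs, and the step size minimizes the dual objective along the segment.

```latex
% Dual objective along the FW segment, with w = A\alpha and \ell = b^\top \alpha:
f\big(\alpha + \gamma (s - \alpha)\big)
  = \tfrac{\lambda}{2}\,\lVert w + \gamma (w_s - w)\rVert^2 - \big(\ell + \gamma(\ell_s - \ell)\big)
% Setting the derivative in \gamma to zero and clipping to [0,1]:
\gamma^\ast \;=\; \operatorname{clip}_{[0,1]}
  \frac{\lambda\,(w - w_s)^\top w - (\ell - \ell_s)}{\lambda\,\lVert w - w_s\rVert^2}
```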
10
FW for structured SVM: properties
- running FW on the dual is equivalent to a batch subgradient method on the primal, but with an adaptive step-size from the analytic line search and a duality gap stopping criterion
- 'fully corrective' FW on the dual is equivalent to the cutting plane algorithm (SVMstruct)
  - still an O(1/T) rate, but this provides a simpler proof of SVMstruct convergence + guarantees for approximate oracles
  - not faster than simple FW in our experiments
- BUT: still batch => slow for large n...
11
Block-Coordinate Frank-Wolfe (new!)
For constrained optimization over a compact product domain M = M_1 × ... × M_n:
- pick a block i at random; update only block i with a FW step (spelled out below)
- we proved the same O(1/T) rate as batch FW
  -> each step is n times cheaper though
  -> the constant can be the same (e.g. for the structured SVM)
Properties:
- O(1/T) rate
- sparse iterates
- get duality gap guarantees
- affine invariant
- rate holds even if the linear subproblem is solved only approximately
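A sketch of the block update in formulas (the notation x_(i) for block i and the default step size 2n/(k+2n) follow the BCFW paper; line search can replace the fixed step):

```latex
% Pick i uniformly at random, then do a FW step on block i only:
s_{(i)} \;=\; \operatorname*{argmin}_{s' \in \mathcal{M}_i}\; \big\langle s',\, \nabla_{(i)} f(x^{(k)}) \big\rangle,
\qquad
x^{(k+1)}_{(i)} \;=\; (1-\gamma)\, x^{(k)}_{(i)} + \gamma\, s_{(i)},
\quad \gamma = \tfrac{2n}{k+2n}
```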
12
Block-Coordinate Frank-Wolfe (new!) applied to the structured SVM
For the structured SVM dual, the linear subproblem on block i is exactly loss-augmented decoding on example i.
We proved the same O(1/T) rate as batch FW -> each step is n times cheaper though -> the constant can be the same (e.g. for the structured SVM).
13
BCFW for structured SVM: properties
- each update requires only 1 oracle call (vs. n for SVMstruct), so we get O(1/K) error after K passes through the data
- advantages over stochastic subgradient:
  - step-sizes by line-search -> more robust
  - duality gap certificate -> know when to stop
  - guarantees hold for approximate oracles
- implementation: https://github.com/ppletscher/BCFWstruct
  - almost as simple as the stochastic subgradient method
  - caveat: need to store one parameter vector per example (or store the dual variables)
- for the binary SVM this reduces to the DCA method [Hsieh et al. 08]; interesting link with prox SDCA [Shalev-Shwartz et al. 12]
(A code sketch of the update follows.)
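A minimal Python sketch of the BCFW update for the structured SVM, assuming a user-supplied loss-augmented decoding oracle `max_oracle(w, i)` that returns the feature difference ψ_i(y*) and the loss L_i(y*); the names, data layout, and the toy multiclass oracle below are illustrative assumptions, not the BCFWstruct API.

```python
import numpy as np

def bcfw_ssvm(max_oracle, n, d, lam, n_passes=20, rng=None):
    """Block-coordinate Frank-Wolfe for the structured SVM dual
    (sketch following Lacoste-Julien et al., ICML 2013)."""
    rng = rng or np.random.default_rng(0)
    w = np.zeros(d); ell = 0.0                   # primal vector and dual linear term
    w_i = np.zeros((n, d)); ell_i = np.zeros(n)  # per-example blocks (w = sum_i w_i)
    for k in range(n_passes * n):
        i = rng.integers(n)
        psi, loss = max_oracle(w, i)             # loss-augmented decoding on example i
        w_s = psi / (lam * n)                    # corner of block i mapped to the primal
        ell_s = loss / n
        # analytic line search on the quadratic dual, clipped to [0, 1]
        num = lam * (w_i[i] - w_s) @ w - (ell_i[i] - ell_s)
        den = lam * np.dot(w_i[i] - w_s, w_i[i] - w_s)
        gamma = min(max(num / den, 0.0), 1.0) if den > 0 else 1.0
        # convex update of block i, then propagate to the global sums
        w_i_new = (1 - gamma) * w_i[i] + gamma * w_s
        ell_i_new = (1 - gamma) * ell_i[i] + gamma * ell_s
        w += w_i_new - w_i[i]; ell += ell_i_new - ell_i[i]   # ell would feed the duality gap
        w_i[i] = w_i_new; ell_i[i] = ell_i_new
    return w

# Toy multiclass instance (assumed for illustration): phi(x, y) stacks x into the
# block of class y, L_i(y) = 1[y != y_i], so the oracle is a simple argmax.
def make_toy_oracle(X, Y, n_classes):
    n_feat = X.shape[1]
    def max_oracle(w, i):
        W = w.reshape(n_classes, n_feat)
        scores = W @ X[i] + (np.arange(n_classes) != Y[i])   # loss-augmented scores
        y_star = int(np.argmax(scores))
        psi = np.zeros_like(W)
        psi[Y[i]] += X[i]; psi[y_star] -= X[i]               # phi(x_i, y_i) - phi(x_i, y*)
        return psi.ravel(), float(y_star != Y[i])
    return max_oracle

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)); Y = rng.integers(3, size=50)
w = bcfw_ssvm(make_toy_oracle(X, Y, 3), n=50, d=12, lam=0.1)
```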
14
More info about the constants...
- the BCFW rate involves the "product curvature" C_f^⊗ = Σ_i C_f^(i); the batch FW rate involves the "curvature" C_f (definitions below)
- comparing the constants: for the structured SVM, the constants are the same
- identity Hessian + cube constraint: (no speed-up) -> remove with line-search
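For reference, the curvature constants these rates refer to, using the standard definitions from Jaggi [ICML 2013] and the BCFW paper (f is the objective and M its domain; this is background I am supplying, not text from the slide):

```latex
C_f \;=\; \sup_{\substack{x,\,s \,\in\, \mathcal{M},\ \gamma \in [0,1] \\ y \,=\, x + \gamma (s - x)}}
  \tfrac{2}{\gamma^2}\Big( f(y) - f(x) - \langle y - x,\, \nabla f(x) \rangle \Big),
\qquad
C_f^{\otimes} \;=\; \textstyle\sum_{i=1}^{n} C_f^{(i)}
% where C_f^{(i)} is the same quantity restricted to x and s differing only in block i.
```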
15
Sidenote: weighted averaging
- it is standard to average the iterates of the stochastic subgradient method
- uniform averaging: w̄_T = (1/T) Σ_{t=1}^T w_t  vs.  t-weighted averaging: w̄_T ∝ Σ_{t=1}^T t·w_t  [L.-J. et al. 12], [Shamir & Zhang 13]
- weighted averaging improves the duality gap for BCFW
- it also makes a big difference in test error!
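The t-weighted average can be maintained online with a simple recursion (a small derivation I am adding, using Σ_{t=1}^k t = k(k+1)/2):

```latex
\bar{w}^{(k)} \;=\; \frac{\sum_{t=1}^{k} t\, w^{(t)}}{\sum_{t=1}^{k} t}
\;=\; \Big(1 - \tfrac{2}{k+1}\Big)\, \bar{w}^{(k-1)} \;+\; \tfrac{2}{k+1}\, w^{(k)}
```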
16
Experiments on the OCR dataset and the CoNLL dataset (plots not reproduced here).
17
Surprising test error though! The ranking of the methods on test error is flipped relative to their ranking on optimization error (plots not reproduced here).
18
Conclusions for 1st part
- applying FW on the dual of the structured SVM unified previous algorithms
  - provided a line-search version of the batch subgradient method
- new block-coordinate variant of the Frank-Wolfe algorithm
  - same convergence rate but with cheaper iteration cost
  - yields a robust & fast algorithm for the structured SVM
- future work: caching tricks, non-uniform sampling, regularization path, explain the weighted-averaging test error mystery
19
On the Equivalence between Herding and Conditional Gradient Algorithms [ICML 2012]
Simon Lacoste-Julien, Francis Bach, Guillaume Obozinski
20
A motivation: quadrature
- approximating integrals: random sampling yields O(1/√T) error; herding [Welling 2009] yields O(1/T) error! [Chen et al. 2010] (like quasi-MC)
- this part:
  - links herding with an optimization algorithm (conditional gradient / Frank-Wolfe)
  - suggests extensions, e.g. a weighted version
  - BUT the extensions are worse for learning??? -> yields interesting insights on the properties of herding...
21
Outline
- Background: herding; [conditional gradient algorithm]
- Equivalence between herding & conditional gradient
- Extensions
- New rates & theorems
- Simulations
  - approximation of integrals with conditional gradient variants
  - learned distribution vs. max entropy
22
Review of herding [Welling ICML 2009]
Motivation: learning in an MRF with feature map Φ.
(Diagram:) data -> parameter -> samples, where learning = (approximate) ML / max. entropy via moment matching, and inference = (approximate) sampling; herding goes directly from the data to (pseudo-)samples.
23
Herding updates
- zero-temperature limit of the log-likelihood gives the 'Tipi' function (thanks to Max Welling for the picture):
  F(w) = ⟨w, μ⟩ − max_{x ∈ X} ⟨w, Φ(x)⟩
- herding updates = subgradient ascent updates on F:
  x_{t+1} ∈ argmax_{x ∈ X} ⟨w_t, Φ(x)⟩,   w_{t+1} = w_t + μ − Φ(x_{t+1})
- properties:
  1) weakly chaotic -> entropy?
  2) moment matching: (1/T) Σ_{t=1}^T Φ(x_t) -> μ   -> our focus
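A minimal Python sketch of (kernel) herding over a finite candidate set, assuming the feature map is given as a matrix `Phi` of candidate features (one row per candidate x) and `mu` is the target mean; the brute-force argmax and the toy data are assumptions for illustration only.

```python
import numpy as np

def herding(Phi, mu, T):
    """Herding: x_{t+1} = argmax_x <w_t, Phi(x)>, w_{t+1} = w_t + mu - Phi(x_{t+1}).
    Phi: (n_candidates, d) array of candidate features; mu: (d,) target mean."""
    w = mu.copy()                      # a common initialization: w_0 = mu
    samples = []
    for _ in range(T):
        idx = int(np.argmax(Phi @ w))  # maximize <w_t, Phi(x)> over the candidates
        samples.append(idx)
        w += mu - Phi[idx]             # moment-matching update
    return samples

# Toy usage (assumed data): match the mean of a random point cloud.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 5))
mu = Phi.mean(axis=0)
picked = herding(Phi, mu, T=100)
print("moment discrepancy:", np.linalg.norm(Phi[picked].mean(axis=0) - mu))
```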
24
Approximating integrals in an RKHS H
- reproducing property: f(x) = ⟨f, Φ(x)⟩_H
- define the mean map: μ = E_p[Φ(x)]
- want to approximate integrals of the form E_p[f(x)] = ⟨f, μ⟩_H
- use a weighted sum to get an approximated mean: μ̂ = Σ_t w_t Φ(x_t)
- the approximation error is then bounded by:
  | Σ_t w_t f(x_t) − E_p[f(x)] | = |⟨f, μ̂ − μ⟩_H| ≤ ‖f‖_H · ‖μ̂ − μ‖_H
- controlling the moment discrepancy ‖μ̂ − μ‖_H is enough to control the error of integrals of any f in the RKHS H
25
Conditional gradient algorithm (aka Frank-Wolfe)
Algorithm to optimize min_{g ∈ M} J(g), with J convex & (twice) continuously differentiable and M convex & compact.
Repeat:
1) find a good feasible direction by minimizing the linearization of J at g_t: s_{t+1} = argmin_{s ∈ M} ⟨s, ∇J(g_t)⟩
2) take a convex step in that direction: g_{t+1} = (1 − ρ_t) g_t + ρ_t s_{t+1}
-> converges in O(1/T) in general
26
Herding & conditional gradient are equivalent
Trick: look at conditional gradient on the dummy objective
  J(g) = (1/2) ‖g − μ‖²_H   over   M = conv({Φ(x) : x ∈ X})   (the marginal polytope).
Herding updates:
  x_{t+1} ∈ argmax_x ⟨w_t, Φ(x)⟩,   w_{t+1} = w_t + μ − Φ(x_{t+1})
Conditional gradient updates:
  x_{t+1} ∈ argmin_x ⟨Φ(x), ḡ_t − μ⟩,   ḡ_{t+1} = (1 − ρ_t) ḡ_t + ρ_t Φ(x_{t+1})
Do the change of variable w_t ∝ μ − ḡ_t: the two coincide, and this is conditional gradient with step-size ρ_t = 1/(t+1), so ḡ_t is (up to the initialization) the running average of the selected features.
Subgradient ascent and conditional gradient are Fenchel duals of each other! (see also [Bach 2012])
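To make the correspondence concrete, here is a small numerical check in Python on a toy finite domain (the setup, the initialization w_0 = μ, and the mapping g_0 = μ − w_0 are my assumptions for the demonstration, not code from the paper): both procedures should select the same sequence of points, up to ties.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(200, 4))   # candidate features Phi(x), one row per candidate x
mu = Phi.mean(axis=0)             # target mean
w0 = mu.copy()                    # herding initialization
T = 50

# Herding: x_k = argmax <w_{k-1}, Phi(x)>,  w_k = w_{k-1} + mu - Phi(x_k)
w, herd = w0.copy(), []
for _ in range(T):
    i = int(np.argmax(Phi @ w)); herd.append(i); w += mu - Phi[i]

# Conditional gradient on J(g) = 0.5*||g - mu||^2 with g_0 = mu - w_0 and step
# rho_k = 1/(k+1): the linear subproblem argmin <Phi(x), g - mu> picks the same
# point as herding, since w_{k-1} stays proportional to mu - g_{k-1}.
g, cg = mu - w0, []
for k in range(1, T + 1):
    i = int(np.argmin(Phi @ (g - mu))); cg.append(i)
    g = (1 - 1 / (k + 1)) * g + (1 / (k + 1)) * Phi[i]

print("same selected points (up to ties):", herd == cg)
```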
27
Extensions of herding
More general step-sizes ρ_t give a weighted sum: μ̂ = Σ_t w_t Φ(x_t).
Two extensions:
1) line search for ρ_t
2) min-norm point algorithm (minimize J(g) over the convex hull of the previously visited points)
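For the quadratic objective J(g) = ½‖g − μ‖², the line-search step size has a closed form (a small derivation I am adding; clipping to [0,1] keeps the iterate inside the convex hull):

```latex
\rho_t^{\ast}
\;=\; \operatorname*{argmin}_{\rho \in [0,1]} \tfrac{1}{2}\big\lVert (1-\rho)\, g_t + \rho\, \Phi(x_{t+1}) - \mu \big\rVert^{2}
\;=\; \operatorname{clip}_{[0,1]}
\frac{\big\langle g_t - \mu,\; g_t - \Phi(x_{t+1}) \big\rangle}{\big\lVert g_t - \Phi(x_{t+1}) \big\rVert^{2}}
```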
28
Rates of convergence & theorems
- with no assumption: conditional gradient yields ‖ḡ_T − μ‖ = O(1/√T)
- if we assume μ lies in the relative interior of M with radius r > 0:
  - [Chen et al. 2010] yields O(1/T) for herding
  - whereas the line-search version yields a faster rate [Guélat & Marcotte 1986, Beck & Teboulle 2004]
- Propositions 1) and 2): the assumption fails in the infinite-dimensional RKHS setting (i.e. the O(1/T) guarantee of [Chen et al. 2010] doesn't hold!)
29
Simulation 1: approximating integrals. Kernel herding using an RKHS with the Bernoulli polynomial kernel (infinite dimensional, closed form). (plots not reproduced here)
30
Simulation 2: max entropy? Learning independent bits: plots of the error on the moments and the error on the distribution (not reproduced here).
31
Conclusions for 2nd part
- equivalence of herding and conditional gradient:
  -> yields better algorithms for quadrature based on moments
  -> but highlights a max entropy / moment matching trade-off!
- other interesting points:
  - setting up fake optimization problems -> harvest properties of known algorithms
  - the conditional gradient algorithm is useful to know...
  - the duality between subgradient and conditional gradient is more general
- recent related work:
  - link with Bayesian quadrature [Huszar & Duvenaud UAI 2012]
  - herded Gibbs sampling [Bornn et al. ICLR 2013]
32
Thank you!