Parallel and Distributed Block Coordinate Frank Wolfe

Presentation transcript:

Parallel and Distributed Block Coordinate Frank-Wolfe. Yu-Xiang Wang. Joint work with Veeru Sadhanala, Willie Neiswanger, Wei Dai, Suvrit Sra (MIT LIDS) and Eric Xing. CMU. ICML 2016.

Frank-Wolfe and why Frank-Wolfe? Frank and Wolfe (1956): can we use LP to iteratively solve QP? Problem: $\min_x f(x)$ s.t. $x \in \mathcal{D}$. Linear oracle: $\min_{s \in \mathcal{D}} \langle s, \nabla f(\cdot) \rangle$. Recent renaissance in big ML: projection-free; affine invariant; induces atomic structure (sparse / low-rank); duality gap for free. See, e.g., recent work from Jaggi, Lacoste-Julien, Schmidt, Takáč, Grigas, Freund…

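Standard Frank-Wolfe Algorithm
Let $x^{(0)} \in \mathcal{D}$
for $k = 0 \dots K$ do
    Compute $s := \arg\min_{s \in \mathcal{D}} \langle s, \nabla f(x^{(k)}) \rangle$
    Update $x^{(k+1)} := (1-\gamma)\, x^{(k)} + \gamma s$, for $\gamma := \frac{2}{k+2}$
end for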
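The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the domain is assumed to be the probability simplex (so the linear oracle just picks a vertex), and the names `frank_wolfe_simplex`, `fw_gap`, and the test objective $f(x) = \frac{1}{2}\|x - b\|^2$ are hypothetical choices made for the example.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters):
    """Standard Frank-Wolfe; the simplex domain is an assumption of this sketch."""
    x = x0.astype(float).copy()
    for k in range(num_iters):
        g = grad(x)
        # Linear oracle: a linear function over the simplex is minimized at a vertex.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        gamma = 2.0 / (k + 2.0)
        x = (1.0 - gamma) * x + gamma * s
    return x

def fw_gap(grad, x):
    """Duality gap <x - s, grad f(x)>, which upper-bounds f(x) - f* for convex f."""
    g = grad(x)
    return float(g @ x - g.min())

# Illustrative problem: minimize 0.5 * ||x - b||^2 with b inside the simplex.
b = np.array([0.2, 0.5, 0.3])
grad = lambda x: x - b
x_hat = frank_wolfe_simplex(grad, np.array([1.0, 0.0, 0.0]), 2000)
print(x_hat, fw_gap(grad, x_hat))
```

The gap costs nothing extra here because it reuses the gradient and the oracle answer that the iteration already computes.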

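Block Coordinate Frank-Wolfe (BCFW), Lacoste-Julien et al. (2013). Problem: $\min_x f(x)$ s.t. $x = [x_1, \dots, x_n] \in \mathcal{D}_1 \times \cdots \times \mathcal{D}_n =: \mathcal{D}$. Because $\mathcal{D}$ is a product, the linear oracle $\min_{s \in \mathcal{D}} \langle s, \nabla f(\cdot) \rangle$ splits into $n$ block subproblems $\min_{s_i \in \mathcal{D}_i} \langle s_i, \nabla_i f(x) \rangle$, $i = 1, \dots, n$. Algorithm: randomly pick $i$ in $\{1, 2, \dots, n\}$; solve for $s_i$; do a block coordinate update.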
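A minimal sketch of the three steps, under stated assumptions: every block domain is taken to be a probability simplex, `grad_block` is a hypothetical callback returning the gradient with respect to block `i`, and the step size $2n/(k+2n)$ is the one used by Lacoste-Julien et al. (2013) rather than something stated on the slide.

```python
import numpy as np

def bcfw(grad_block, x0, num_iters, seed=0):
    """Block coordinate Frank-Wolfe over a product of simplices (illustrative)."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    n, m = x.shape
    for k in range(num_iters):
        i = rng.integers(n)                  # randomly pick one block
        g_i = grad_block(x, i)               # gradient w.r.t. that block only
        s_i = np.zeros(m)
        s_i[np.argmin(g_i)] = 1.0            # block linear oracle: a simplex vertex
        gamma = 2.0 * n / (k + 2.0 * n)      # assumed BCFW step size
        x[i] = (1.0 - gamma) * x[i] + gamma * s_i
    return x

# Illustrative objective f(x) = 0.5 * ||x - B||^2, each row of B inside the simplex.
B = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3],
              [0.1, 0.1, 0.8],
              [1 / 3, 1 / 3, 1 / 3]])
f = lambda x: 0.5 * np.sum((x - B) ** 2)
x0 = np.zeros((4, 3))
x0[:, 0] = 1.0
x_bc = bcfw(lambda x, i: x[i] - B[i], x0, num_iters=5000)
print(f(x0), f(x_bc))
```

Only one block is touched per iteration, so each step needs one block gradient and one block oracle call instead of $n$ of them.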

Can we parallelize it? A mini-batch version of BCFW: randomly pick a subset $S \subset [n]$ with $|S| =: \tau$; solve the subroutine for each $i \in S$; update the parameter vector. What if we can solve this in parallel? Can we speed things up further?
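The mini-batch variant can be sketched as follows. Assumptions made for this illustration: simplex block domains, a hypothetical `grad` callback returning the full gradient as an `(n, m)` array, and the step size $\min(1, 2n\tau/(\tau^2 k + 2n))$, where the clipping at 1 is added here so that each update stays a convex combination.

```python
import numpy as np

def minibatch_bcfw(grad, x0, tau, num_iters, seed=0):
    """Mini-batch BCFW over a product of simplices (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    n, m = x.shape
    for k in range(num_iters):
        S = rng.choice(n, size=tau, replace=False)   # random subset with |S| = tau
        g = grad(x)                                  # all blocks in S see the same iterate
        gamma = min(1.0, 2.0 * n * tau / (tau ** 2 * k + 2.0 * n))  # assumed step size
        for i in S:  # independent subproblems: this loop could run in parallel
            s_i = np.zeros(m)
            s_i[np.argmin(g[i])] = 1.0
            x[i] = (1.0 - gamma) * x[i] + gamma * s_i
    return x

# Illustrative objective f(x) = 0.5 * ||x - B||^2 with rows of B inside the simplex.
B = np.tile(np.array([0.2, 0.5, 0.3]), (6, 1))
f = lambda x: 0.5 * np.sum((x - B) ** 2)
x0 = np.zeros((6, 3))
x0[:, 0] = 1.0
x_mb = minibatch_bcfw(lambda x: x - B, x0, tau=3, num_iters=3000)
print(f(x0), f(x_mb))
```

The inner loop is written sequentially for clarity; the subproblems share no state within an iteration, which is what makes a parallel or distributed implementation possible.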

Questions of interest. Does it converge, and at what rate? Yes. Does it converge faster than BCFW? Sometimes; it depends on the problem at hand. Is it robust to delayed updates? Yes, very much so.

“Cloud” Oracle model. Different types of randomization. Various system schemes. Parallel and distributed.

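“Cloud” Oracle model. A1. Updates received are i.i.d. uniform over $[n]$. A2. Each update is an approximate solution to (2) in expectation: $\mathbb{E}\left[ \langle s_S, \nabla_S f^{(k)} \rangle - \min_{s' \in \mathcal{D}_S} \langle s', \nabla_S f^{(k)} \rangle \right] \le \frac{\delta \gamma_k C_f^\tau}{2}$. Much weaker than what was required previously!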

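Set Curvatures. Curvature, set curvature: $f(y) \le f(x) + \gamma \langle s_S - x_S, \nabla_S f(x) \rangle + \frac{\gamma^2}{2} C_f^{(S)}$, $\forall \gamma \in [0,1]$, $\forall x, s \in \mathcal{D}$, $y = x + \gamma (s_S - x_S)$. Expected set curvature: $C_f^\tau := \mathbb{E}_{S : |S| = \tau}\, C_f^{(S)}$.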
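For a small quadratic the definition can be evaluated by brute force, which makes the dependence on $\tau$ concrete. The sketch below assumes $f(x) = \frac{1}{2} x^\top A x$ with $A$ positive semidefinite and scalar blocks $\mathcal{D}_i = [0, 1]$; in that case the smallest valid constant is $C_f^{(S)} = \max_{d \in [-1,1]^{|S|}} d^\top A_{SS}\, d$. The function names and the two example matrices are hypothetical.

```python
import itertools
import numpy as np

def set_curvature(A, S):
    """C_f^(S) for f(x) = 0.5 * x^T A x with scalar blocks D_i = [0, 1]."""
    sub = A[np.ix_(S, S)]
    # d = s_S - x_S ranges over [-1, 1]^|S|. A convex quadratic form attains its
    # maximum over a box at a vertex, so enumerating sign patterns is enough.
    return max(float(np.array(d) @ sub @ np.array(d))
               for d in itertools.product((-1.0, 1.0), repeat=len(S)))

def expected_set_curvature(A, tau):
    """Average of C_f^(S) over all subsets S of size tau."""
    n = A.shape[0]
    subsets = itertools.combinations(range(n), tau)
    return float(np.mean([set_curvature(A, list(S)) for S in subsets]))

n = 6
identity = np.eye(n)          # blocks do not interact
all_ones = np.ones((n, n))    # every block interacts with every other block
for tau in (1, 2, 3):
    print(tau, expected_set_curvature(identity, tau),
          expected_set_curvature(all_ones, tau))
```

With the identity matrix the value grows like $\tau$, and with the all-ones matrix it grows like $\tau^2$. The example is not normalized in $n$, so only the scaling in $\tau$ should be read off.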

Question 1: Does it converge? For appropriately chosen stepsizes $\gamma_k = \frac{2 n \tau}{\tau^2 k + 2n}$: $\mathbb{E}[\mathrm{DualityGap}] = O\left( \frac{n^2 C_f^\tau (1+\delta)}{\tau^2 k} \right)$.

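Question 2: Does it converge faster? $\mathbb{E}[\mathrm{DualityGap}] = O\left( \frac{n^2 C_f^\tau (1+\delta)}{\tau^2 k} \right)$. When $\tau = 1$, this is BCFW. When $\tau = n$, this reduces to batch FW. It boils down to the curvature constant: $\Omega\left( \frac{\tau}{n^2} \right) \le C_f^\tau \le O\left( \frac{\tau^2}{n^2} \right)$. Substituting the upper bound into the rate gives $O\left( \frac{1}{k} \right)$ (no gain from a larger minibatch); substituting the lower bound gives $O\left( \frac{1}{\tau k} \right)$ (a $\tau$-fold speed-up). Both hide a possible problem-specific constant that does not depend on $\tau$.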

A coupling condition: $f(y) \le f(x) + \langle y - x, \nabla f(x) \rangle + (y - x)^\top H (y - x)$. [Figure: example matrices $H$ for the three cases below.] (a) No coupling: $C_f^\tau = O\left( \frac{\tau}{n^2} \right)$. (b) High coupling: $C_f^\tau = O\left( \frac{\tau^2}{n^2} \right)$. (c) Typical coupling: in between; $O\left( \frac{\tau}{n^2} \right)$ if $H$ is SDD (symmetric diagonally dominant). Lower coupling implies faster convergence with a larger minibatch.

Concrete examples. Group fused lasso over an arbitrary graph: $C_f^\tau = O(\tau \lambda^2)$. Multiclass SVM (a special model): $C_f^\tau = O_p\left( \frac{\tau}{n^2} \right)$ for $\tau \le$ # of classes.

Speed-up in simulation. [Figures: speed-up over BCFW on the OCR dataset and on group fused lasso, measured in number of iterations.]

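Question 3: delayed updates? Idea: bound $\delta$ in $\mathbb{E}[\mathrm{DualityGap}] = O\left( \frac{n^2 C_f^\tau (1+\delta)}{\tau^2 k} \right)$. Treat the delay as a random variable $\varkappa$. Expected delay: $\kappa := \mathbb{E} \varkappa$. Max delay: $\mathbb{P}(\varkappa < \kappa_{\max}) = 1$. Theorem 6. Let $L_\tau$, $D_\tau$ be the coordinate-Lipschitz constant and the diameter w.r.t. subsets of $\tau$ blocks. Then $\delta \le \frac{4 \kappa \tau L_1 D_1 D_\tau}{C_f^\tau}$. Or, if $\kappa_{\max} \tau = O(n \log n)$: $\delta = O\left( \frac{\tau L_1 D_1\, \mathbb{E} D_{\varkappa \tau}}{C_f^\tau} \right)$, with $\mathbb{E} D_{\varkappa \tau} \approx \sqrt{\kappa}\, D_\tau$.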
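Delayed updates can be imitated on a single machine by letting the oracle see an old iterate. The sketch below is an illustration under assumptions that are not from the slides: simplex block domains, a delay drawn uniformly from $\{0, \dots, \texttt{max\_delay}\}$, the clipped step size $\min(1, 2n\tau/(\tau^2 k + 2n))$, and a small coupled quadratic as the test objective. All function and variable names are hypothetical.

```python
from collections import deque
import numpy as np

def delayed_minibatch_bcfw(grad, x0, tau, num_iters, max_delay, seed=0):
    """Mini-batch BCFW whose oracle answers are computed at a stale iterate."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    n, m = x.shape
    history = deque([x.copy()], maxlen=max_delay + 1)   # newest iterate is last
    for k in range(num_iters):
        # Draw a delay no larger than the number of past iterates available.
        delay = rng.integers(0, min(max_delay, len(history) - 1) + 1)
        g = grad(history[-1 - delay])                   # gradient at the stale iterate
        S = rng.choice(n, size=tau, replace=False)
        gamma = min(1.0, 2.0 * n * tau / (tau ** 2 * k + 2.0 * n))  # assumed step size
        for i in S:
            s_i = np.zeros(m)
            s_i[np.argmin(g[i])] = 1.0
            x[i] = (1.0 - gamma) * x[i] + gamma * s_i   # applied to the current iterate
        history.append(x.copy())
    return x

# Illustrative coupled objective:
# f(x) = 0.5 * ||x - B||^2 + 0.5 * c * ||sum_i x_i - sum_i B_i||^2, minimized at x = B.
B = np.tile(np.array([0.2, 0.5, 0.3]), (6, 1))
c = 0.5
def f(x):
    r = x.sum(axis=0) - B.sum(axis=0)
    return 0.5 * np.sum((x - B) ** 2) + 0.5 * c * np.sum(r ** 2)
def grad(x):
    return (x - B) + c * (x.sum(axis=0) - B.sum(axis=0))

x0 = np.zeros((6, 3))
x0[:, 0] = 1.0
x_del = delayed_minibatch_bcfw(grad, x0, tau=3, num_iters=3000, max_delay=5)
print(f(x0), f(x_del))
```

The stale gradient makes each oracle answer only approximately optimal for the current iterate, which is the situation that assumption A2 and the constant $\delta$ are meant to capture.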

Compare to Async SGD and Async BCD. Dependence on delay:
Unbounded delay: AP-BCFW (this work) $O(\kappa)$.
Bounded delay: AP-BCFW (this work) often $O(\sqrt{\kappa})$; AP-BCD (Liu et al., 2013) $O(\exp(\kappa_{\max}))$; Hogwild! (Niu et al., 2011) $O(\kappa_{\max}^2)$.
Open problem: can we get a similar bound for AP-BCD? Improved to $\kappa$ recently for SGD, but this requires a second moment bound: Suvrit Sra, Adams Wei Yu, Mu Li, AdaDelay (AISTATS'16).

Proof idea of getting sublinear rate. A delay of $\kappa_{\max}$ does not mean a block got updated $\kappa_{\max}$ times. Load balancing: in the past $\kappa_{\max}$ iterations, throw $\kappa_{\max} \tau$ random balls into $n$ bins; expected max load $= O(\log n)$ if $\kappa_{\max} \tau \le n \log n$. Mitzenmacher, Michael. “The power of two choices in randomized load balancing.” IEEE Transactions on Parallel and Distributed Systems 12.10 (2001): 1094-1104.
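The balls-into-bins claim is easy to check empirically. The simulation below is a small sketch with hypothetical names and parameters ($n = 1000$ bins, $n \log n$ balls, 200 trials); it estimates the expected maximum load so it can be compared with $\log n$.

```python
import math
import numpy as np

def mean_max_load(n_bins, n_balls, trials=200, seed=0):
    """Monte Carlo estimate of the expected maximum bin load."""
    rng = np.random.default_rng(seed)
    max_loads = []
    for _ in range(trials):
        bins = rng.integers(n_bins, size=n_balls)        # one uniform bin per ball
        max_loads.append(np.bincount(bins, minlength=n_bins).max())
    return float(np.mean(max_loads))

n = 1000
balls = int(n * math.log(n))      # the regime kappa_max * tau = n log n
avg = mean_max_load(n, balls)
print(avg, math.log(n))
```

In this regime the fullest bin holds only a small multiple of $\log n$ balls, far fewer than the number of iterations in the window, which is why a block is rarely hit by many stale updates at once.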

Effect of delay and straggler. [Figures: convergence with heavy-tailed delay, measured by number of iterations; effect of a straggler worker.]

System implications and caveats. Heterogeneous workers? No problem: average performance. Heterogeneous blocks? This may break A1 (uniform over blocks); additional algorithmic tricks are needed to enforce A1. Is it lock-free? Almost: atomic operations over blocks (rather than over a “double” as in Hogwild).

Speed-up in real clock time. [Figures: real-data experiments on OCR; results for a more complex subroutine solve.]

Summary. Minibatch BCFW converges. It converges provably faster than BCFW for problems with low coupling over blocks. It converges under delayed updates, and the rate depends only on the expected delay, sometimes sublinearly.

Open problems. Solve problems with heterogeneous blocks without “padding”. Can AP-BCD be improved to handle “delay” better? Frank-Wolfe properties: projection-free; affine invariant; induces atomic structure (sparse / low-rank); duality gap for free; robust to delay (?).