1
Parallel and Distributed Block Coordinate Frank Wolfe
Yu-Xiang Wang
Joint work with Veeru Sadhanala, Willie Neiswanger, Wei Dai, Suvrit Sra (MIT LIDS), and Eric Xing
CMU
ICML 2016
2
Frank-Wolfe and why Frank-Wolfe?
Frank and Wolfe (1956): Can we use LP to iteratively solve QP?
$\min_x f(x)$ s.t. $x \in \mathcal{M}$
Linear oracle: $\min_{s \in \mathcal{M}} \langle s, \nabla f(\cdot) \rangle$
Recent renaissance in Big ML.
Projection free.
Affine invariant.
Induces atomic structure (sparse / low-rank).
Duality gap for free.
See, e.g., recent work from Jaggi, Lacoste-Julien, Schmidt, Takac, Grigas, Freund…
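The "duality gap for free" property follows from convexity alone. A short derivation, assuming $f$ is convex and differentiable and writing $x^\star$ for a minimizer over $\mathcal{M}$ (notation introduced here, not on the slide):

```latex
\begin{align*}
f(x^\star) &\ge f(x) + \langle x^\star - x, \nabla f(x) \rangle
            \ge f(x) + \min_{s \in \mathcal{M}} \langle s - x, \nabla f(x) \rangle, \\
g(x) &:= \max_{s \in \mathcal{M}} \langle x - s, \nabla f(x) \rangle \;\ge\; f(x) - f(x^\star).
\end{align*}
```

The maximizer in $g(x)$ is the point the linear oracle already returns, so the gap is a by-product of each iteration.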
3
Standard Frank-Wolfe Algorithm
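Let $x^0 \in \mathcal{M}$
for $k = 0, \dots, K$ do
    Compute $s \leftarrow \arg\min_{s \in \mathcal{M}} \langle s, \nabla f(x^k) \rangle$
    Update $x^{k+1} \leftarrow (1-\gamma) x^k + \gamma s$, for $\gamma \leftarrow \frac{2}{k+2}$
end for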
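The loop above can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not from the talk: the feasible set is the probability simplex (so the linear oracle is just "pick the vertex with the smallest gradient entry"), the objective is $f(x) = \frac{1}{2}\|x - b\|^2$, and the function name `frank_wolfe_simplex` is hypothetical.

```python
def frank_wolfe_simplex(b, num_iters=2000):
    """Minimize 0.5 * ||x - b||^2 over the probability simplex (toy problem, an assumption)."""
    d = len(b)
    x = [1.0 / d] * d                       # feasible starting point x^0
    gap = float("inf")
    for k in range(num_iters):
        grad = [xi - bi for xi, bi in zip(x, b)]
        # Linear oracle on the simplex: the vertex e_j minimizing <s, grad>.
        j = min(range(d), key=lambda t: grad[t])
        # Duality gap <x - s, grad>, evaluated at the iterate before this update.
        gap = sum(xi * gi for xi, gi in zip(x, grad)) - grad[j]
        gamma = 2.0 / (k + 2)
        # Convex combination (1 - gamma) * x + gamma * e_j: no projection needed.
        x = [(1.0 - gamma) * xi for xi in x]
        x[j] += gamma
    return x, gap
```

Every iterate is a convex combination of simplex vertices, which is the sense in which the method is projection free and produces sparse (atomic) iterates early on.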
4
Block Coordinate Frank-Wolfe (BCFW)
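Lacoste-Julien et al. (2013)
$\min_x f(x)$ s.t. $x = [x_1, \dots, x_n] \in \mathcal{M}_1 \times \cdots \times \mathcal{M}_n =: \mathcal{M}$
Linear oracle: $\min_{s \in \mathcal{M}} \langle s, \nabla f(\cdot) \rangle$
Per-block oracles: $\min_{s_1 \in \mathcal{M}_1} \langle s_1, \nabla_1 f(x) \rangle$, …, $\min_{s_i \in \mathcal{M}_i} \langle s_i, \nabla_i f(x) \rangle$, …, $\min_{s_n \in \mathcal{M}_n} \langle s_n, \nabla_n f(x) \rangle$
Algorithm:
Randomly pick $i$ in $\{1, 2, \dots, n\}$.
Solve for $s_i$.
Do the block coordinate update.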
5
Can we parallelize it? A mini-batch version of BCFW
Randomly pick a subset $S \subseteq [n]$ with $|S| =: \tau$.
Solve the subroutine for each $i \in S$.
Update the parameter vector.
What if we can solve this in parallel? Can we speed things up further?
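A minimal sequential sketch of the mini-batch scheme follows. Several things here are assumptions for illustration: every block is a probability simplex, the names `simplex_oracle`, `minibatch_bcfw` and `grad_block` are hypothetical, the step size is the $\gamma = 2n\tau/(\tau^2 k + 2n)$ used in the deck's convergence result, and the clipping of $\gamma$ at 1 is a safeguard added here so that early iterates stay feasible. Setting `tau=1` gives plain BCFW.

```python
import random

def simplex_oracle(grad_i):
    """Linear oracle on the probability simplex: the vertex minimizing <s, grad_i>."""
    j = min(range(len(grad_i)), key=lambda t: grad_i[t])
    return [1.0 if t == j else 0.0 for t in range(len(grad_i))]

def minibatch_bcfw(grad_block, n, d, tau, num_iters, seed=0):
    """Mini-batch BCFW over a product of n simplices of dimension d.

    grad_block(x, i) must return the gradient of f with respect to block i.
    """
    rng = random.Random(seed)
    x = [[1.0 / d] * d for _ in range(n)]            # every block starts at the centre
    for k in range(num_iters):
        gamma = min(1.0, 2.0 * n * tau / (tau * tau * k + 2.0 * n))
        batch = rng.sample(range(n), tau)            # random subset S with |S| = tau
        # All oracle calls read the same iterate, so they are independent of each
        # other and could be dispatched to parallel workers.
        answers = {i: simplex_oracle(grad_block(x, i)) for i in batch}
        for i, s_i in answers.items():
            x[i] = [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x[i], s_i)]
    return x
```

For example, with targets `b[i]` inside the simplex and `grad_block = lambda x, i: [xi - bi for xi, bi in zip(x[i], b[i])]`, the routine minimizes the separable objective $\frac{1}{2}\sum_i \|x_i - b_i\|^2$.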
6
Questions of interest
Does it converge, and at what rate? Yes.
Does it converge faster than BCFW? Sometimes. It depends on the problem at hand.
Is it robust to delayed updates? Yes, very much so.
7
“Cloud” oracle model
Different types of randomization.
Various system schemes.
Parallel and distributed.
8
“Cloud” oracle model
A1. Updates received are i.i.d. uniform over $[n]$.
A2. Each update $s_i$ is an approximate solution to (2) in expectation:
$\mathbb{E}\left[ \langle s_i, \nabla_i f(x) \rangle - \min_{s' \in \mathcal{M}_i} \langle s', \nabla_i f(x) \rangle \right] \le \frac{\delta \gamma_k C_f^\tau}{2}$
Much weaker than what was required previously!
9
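Set Curvatures
Curvature, set curvature:
$f(y) \le f(x) + \gamma \langle s_S - x_S, \nabla_S f(x) \rangle + \frac{\gamma^2}{2} C_f^{(S)}$, $\forall \gamma \in [0,1]$, $\forall x, s \in \mathcal{M}$, $y = x + \gamma (s_S - x_S)$
Expected set curvature: $C_f^\tau := \mathbb{E}_{S : |S| = \tau}\, C_f^{(S)}$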
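A worked case makes the set curvature concrete. Assume (for illustration only) a quadratic $f(x) = \frac{1}{2} x^\top A x + c^\top x$ with $A$ positive semidefinite, and use the standard definition $C_f^{(S)} = \sup \frac{2}{\gamma^2}\left( f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \right)$ over the same $x$, $s$, $\gamma$, $y$ as above. Here $A_{SS}$ denotes the sub-matrix of $A$ on the blocks in $S$:

```latex
\begin{align*}
f(y) - f(x) - \langle y - x, \nabla f(x) \rangle
  &= \tfrac{1}{2} (y - x)^\top A (y - x)
   = \tfrac{\gamma^2}{2} (s_S - x_S)^\top A_{SS} (s_S - x_S), \\
C_f^{(S)} &= \sup_{x, s \in \mathcal{M}} (s_S - x_S)^\top A_{SS} (s_S - x_S).
\end{align*}
```

So for a quadratic the set curvature depends only on the sub-matrix $A_{SS}$ and on how far the selected blocks can move.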
10
Question 1: Does it converge?
For appropriately chosen step sizes $\gamma_k = \frac{2 n \tau}{\tau^2 k + 2n}$:
$\mathbb{E}[\text{DualityGap}] = O\left( \frac{n^2 C_f^\tau (1 + \delta)}{\tau^2 k} \right)$
11
Question 2: Does it converge faster?
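$\mathbb{E}[\text{DualityGap}] = O\left( \frac{n^2 C_f^\tau (1 + \delta)}{\tau^2 k} \right)$
When $\tau = 1$, this is BCFW. When $\tau = n$, this reduces to batch FW.
It boils down to the curvature constant: $\Omega(\tau) \le C_f^\tau \le O(\tau^2)$, each up to a squared problem-specific factor.
The upper extreme gives a rate of $O(1/k)$; the lower extreme gives $O(1/(\tau k))$.
These bounds hide possible problem-specific constants that do not depend on $\tau$.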
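To see how the two curvature extremes translate into speed-up, one can plug numbers into the bound with all hidden constants dropped. The helper names `gap_bound` and `speedup_over_bcfw` and the specific values of `n`, `tau` and `k` are illustrative assumptions, and "speed-up" here means the ratio of the two bounds at the same iteration count.

```python
def gap_bound(n, curvature, delta, tau, k):
    """n^2 * C_f^tau * (1 + delta) / (tau^2 * k), with hidden constants dropped."""
    return n ** 2 * curvature * (1.0 + delta) / (tau ** 2 * k)

def speedup_over_bcfw(n, tau, curvature_of, delta=0.0, k=1000):
    """Ratio of the tau = 1 bound to the mini-batch bound at the same k."""
    bcfw = gap_bound(n, curvature_of(1), delta, 1, k)
    minibatch = gap_bound(n, curvature_of(tau), delta, tau, k)
    return bcfw / minibatch

# Curvature growing linearly in tau: speed-up equal to tau.
low_coupling = speedup_over_bcfw(100, 10, lambda t: float(t))
# Curvature growing quadratically in tau: no speed-up.
high_coupling = speedup_over_bcfw(100, 10, lambda t: float(t * t))
```

With these inputs `low_coupling` evaluates to 10 and `high_coupling` to 1, matching the $O(1/(\tau k))$ and $O(1/k)$ extremes.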
12
A coupling condition
$f(y) \le f(x) + \langle y - x, \nabla f(x) \rangle + (y - x)^\top H (y - x)$
[Figure: example matrices $H$ for panels (a) to (c).]
(a) No coupling: $C_f^\tau = O(\tau)$, up to a squared problem-specific factor.
(b) High coupling: $C_f^\tau = O(\tau^2)$, up to the same factor.
(c) Typical coupling: in between (a) and (b); an explicit bound is available if $H$ is SDD.
Lower coupling implies faster convergence with a larger minibatch.
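The two extremes can be checked numerically by evaluating the quadratic term $(y - x)^\top H (y - x)$ when $\tau$ coordinates move at once. This sketch assumes scalar blocks and unit displacements, both chosen only for illustration; `quadratic_form` and `coupling_growth` are hypothetical helper names.

```python
def quadratic_form(H, v):
    """v^T H v for a dense matrix H given as a list of rows."""
    size = len(v)
    return sum(v[i] * H[i][j] * v[j] for i in range(size) for j in range(size))

def coupling_growth(H, tau):
    """(y - x)^T H (y - x) when the first tau scalar blocks each move by 1."""
    v = [1.0 if i < tau else 0.0 for i in range(len(H))]
    return quadratic_form(H, v)

n = 8
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]   # no coupling
all_ones = [[1.0] * n for _ in range(n)]                                     # high coupling
```

With the diagonal matrix the term grows like $\tau$ (`coupling_growth(identity, 4)` is 4), while with the fully coupled matrix it grows like $\tau^2$ (`coupling_growth(all_ones, 4)` is 16).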
13
Concrete examples
Group fused lasso over an arbitrary graph: $C_f^\tau = O(\tau)$, up to a squared problem-specific factor.
Multiclass SVM (a special model): $C_f^\tau = O(\tau)$, up to problem-specific factors, for $\tau \le$ # of classes.
14
Speed-up in simulation
[Figures: speed-up over BCFW, measured in number of iterations. Panels: speedup on the OCR dataset; speedup on group fused lasso.]
15
Question 3: delayed updates?
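Idea: $\mathbb{E}[\text{DualityGap}] = O\left( \frac{n^2 C_f^\tau (1 + \delta)}{\tau^2 k} \right)$
Treat the delay as a random variable $\kappa$.
Expected delay: $\bar{\kappa} := \mathbb{E}\kappa$. Max delay: $\mathbb{P}(\kappa < \kappa_{\max}) = 1$.
Theorem 6. Let $L_i$, $D_i$ be the coordinate-Lipschitz constant and the diameter w.r.t. subsets of blocks. Then
$\delta \le \frac{4 \bar{\kappa} \tau L_1 D_1 D_\tau}{C_f^\tau}$.
Or, if $\kappa_{\max} \tau = O(n \log n)$:
$\delta = O\left( \frac{\tau L_1 D_1 \, \mathbb{E} D_{\kappa\tau}}{C_f^\tau} \right)$, with $\mathbb{E} D_{\kappa\tau}$ in place of the $\bar{\kappa} D_\tau$ of the first bound.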
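The delayed-update setting can be simulated sequentially: the oracle answer for each step is computed from a stale snapshot of the iterate but applied to the current one. The following is a sketch under stated assumptions, not the system from the talk: delays are drawn uniformly from a bounded range, each block is a probability simplex, the coupled toy objective is $f(x) = \frac{1}{2}\|\frac{1}{n}\sum_i x_i - b\|^2$, and the step size is the $\tau = 1$ case $2n/(k + 2n)$. All function names are hypothetical.

```python
import random
from collections import deque

def block_mean(x):
    n, d = len(x), len(x[0])
    return [sum(block[t] for block in x) / n for t in range(d)]

def objective(x, b):
    return 0.5 * sum((m - bt) ** 2 for m, bt in zip(block_mean(x), b))

def block_gradient(x, b):
    """Gradient w.r.t. any single block; identical for all blocks of this toy objective."""
    n = len(x)
    return [(m - bt) / n for m, bt in zip(block_mean(x), b)]

def delayed_bcfw(b, n, num_iters, max_delay, seed=0):
    rng = random.Random(seed)
    d = len(b)
    x = [[1.0 / d] * d for _ in range(n)]
    # Recent snapshots of the iterate, newest last; history[-1] is the current one.
    history = deque([[blk[:] for blk in x]], maxlen=max_delay + 1)
    for k in range(num_iters):
        delay = rng.randint(0, min(max_delay, len(history) - 1))
        stale_x = history[-1 - delay]                # what the worker actually saw
        i = rng.randrange(n)                         # block chosen uniformly (A1)
        grad = block_gradient(stale_x, b)
        j = min(range(d), key=lambda t: grad[t])     # simplex linear oracle
        gamma = 2.0 * n / (k + 2.0 * n)
        x[i] = [(1.0 - gamma) * v for v in x[i]]     # update applied to the current x
        x[i][j] += gamma
        history.append([blk[:] for blk in x])
    return x
```

Comparing `objective(delayed_bcfw(...), b)` for different values of `max_delay` gives a rough feel for how much staleness the updates tolerate on a given problem.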
16
Compare to Async SGD and Async BCD
Delay | AP-BCFW (this work) | AP-BCD (Liu et al., 2013) | Hogwild! (Niu et al., 2011)
Unbounded | $O(\kappa)$ | - | -
Bounded | often sublinear in $\kappa$ | $O(\exp(\kappa_{\max}))$ | $O(\kappa_{\max}^2)$
Open problem: Can we get a similar bound for AP-BCD?
Improved to $\kappa$ recently for SGD, but this requires a second-moment bound: Suvrit Sra, Adams Wei Yu, Mu Li, AdaDelay (AISTATS'16).
17
Proof idea of getting sublinear rate
A delay of $\kappa_{\max}$ does not mean a block got updated $\kappa_{\max}$ times.
Load balancing: in the past $\kappa_{\max}$ iterations, $\kappa_{\max} \tau$ random balls are thrown into $n$ bins.
Expected max load $= O(\log n)$ if $\kappa_{\max} \tau \le n \log n$.
Mitzenmacher, Michael. “The power of two choices in randomized load balancing.” IEEE Transactions on Parallel and Distributed Systems (2001).
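The balls-into-bins claim is easy to probe empirically. The sketch below throws $\lfloor n \ln n \rfloor$ balls uniformly at random into $n$ bins and averages the fullest bin over several trials; the function names, the trial count and the use of the natural logarithm are assumptions made for this illustration.

```python
import math
import random

def max_load(num_balls, num_bins, rng):
    """Throw num_balls uniformly at random into num_bins; return the fullest bin's load."""
    loads = [0] * num_bins
    for _ in range(num_balls):
        loads[rng.randrange(num_bins)] += 1
    return max(loads)

def average_max_load(n, trials=20, seed=0):
    """Average maximum load when floor(n * ln n) balls land in n bins."""
    rng = random.Random(seed)
    num_balls = int(n * math.log(n))
    return sum(max_load(num_balls, n, rng) for _ in range(trials)) / trials
```

Dividing `average_max_load(n)` by `math.log(n)` for growing `n` should give a ratio that stays bounded, which is the sense in which no single block absorbs more than $O(\log n)$ of the delayed updates.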
18
Effect of delay and straggler
[Figure: convergence with heavy-tailed delay (measured by number of iterations).]
[Figure: effect of a straggler worker.]
19
System implications and caveats
Heterogeneous workers? No problem. Average performance.
Heterogeneous blocks? This may break A1 (uniform over blocks). Need additional algorithmic tricks to enforce A1.
Is it lock-free? Almost. Atomic operation over blocks (rather than over a “double” as in Hogwild).
20
Speed-up in real clock time
Real data experiments in OCR. For more complex subroutine solve
21
Summary Minibatch BCFW converges.
It converges provably faster than BCFW for problems with low coupling over blocks. It converges under delayed updates. Depends only on expected delay, sometimes sublinearly.
22
Open problems
Solve problems with heterogeneous blocks without “padding”.
Can AP-BCD be improved to handle “delay” better?
Projection free.
Affine invariant.
Induces atomic structure (sparse / low-rank).
Duality gap for free.
Robust to delay (?)