Parallel Perceptrons and Iterative Parameter Mixing


Presentation on theme: "Parallel Perceptrons and Iterative Parameter Mixing"— Presentation transcript:

1 Parallel Perceptrons and Iterative Parameter Mixing

2 Recap: perceptrons

3 The perceptron
A sends B an instance xi; B computes the prediction ŷi = sign(vk . xi); A then reveals the true label yi.
If B made a mistake (ŷi ≠ yi): vk+1 = vk + yi xi.
Mistake bound: if a unit vector u separates the data with margin γ and every ‖xi‖ ≤ R, the number of mistakes is at most R²/γ².
(Figure: positive and negative examples separated by u and -u, with the margin marked.)
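A minimal sketch of this loop in Python (NumPy arrays and labels in {-1, +1} are illustrative assumptions, not something specified on the slide):

```python
import numpy as np

def perceptron(examples, epochs=1):
    """Mistake-driven perceptron. examples is a list of (x, y) pairs,
    x a NumPy feature vector, y in {-1, +1}."""
    v = np.zeros(len(examples[0][0]))
    for _ in range(epochs):
        for x, y in examples:
            y_hat = 1 if v.dot(x) >= 0 else -1   # B's prediction: sign(v . x)
            if y_hat != y:                       # A reveals the true label
                v = v + y * x                    # mistake-driven update
    return v
```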

4 Averaged / voted perceptron
Instead of the last weight vector, predict using sign(v̄ . x), where v̄ averages (or votes over) the intermediate perceptrons vk, each weighted by its survival count mk (e.g., m1 = 3, m2 = 4 in the figure).
(Figure: last perceptron vs. averaging/voting.)
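A sketch of the averaged variant under the same assumptions; summing v after every example is equivalent to weighting each intermediate perceptron by its survival count:

```python
import numpy as np

def averaged_perceptron(examples, epochs=1):
    """Like the plain perceptron, but return the average of all
    intermediate weight vectors rather than the last one."""
    v = np.zeros(len(examples[0][0]))
    v_sum = np.zeros_like(v)        # running sum of the intermediate v's
    seen = 0
    for _ in range(epochs):
        for x, y in examples:
            if (1 if v.dot(x) >= 0 else -1) != y:
                v = v + y * x
            v_sum += v              # each v_k is counted m_k times (its survival count)
            seen += 1
    return v_sum / seen             # predict with sign(v_avg . x)
```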

5 The voted perceptron for ranking
A sends B a set of instances x1, x2, x3, x4, …; B computes a score vk . xi for each and returns the index b* of the "best" (highest-scoring) one; A reveals the index b of the one that was actually best.
If mistake (b* ≠ b): vk+1 = vk + xb - xb*.
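A sketch of one round of this ranking interaction (the candidate vectors and the revealed best index are the assumed inputs):

```python
import numpy as np

def ranking_perceptron_step(v, candidates, b):
    """candidates is the list x_1..x_n that A sends; b is the index A later
    reveals as actually best. Returns the (possibly updated) weight vector."""
    b_star = int(np.argmax([v.dot(x) for x in candidates]))  # B's guess
    if b_star != b:                                          # mistake
        v = v + candidates[b] - candidates[b_star]           # promote x_b, demote x_b*
    return v
```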

6 The voted perceptron for ranking
(Figure: geometric picture of the update, showing v1, the correction +x2, the new guess v2, and the separators u and -u, for a normal perceptron vs. a ranking perceptron.)

7 Ranking perceptrons → structured perceptrons
Ranking API:
A sends B a (maybe huge) set of items to rank.
B finds the single best one according to the current weight vector.
A tells B which one was actually best.
Structured classification API:
Input: a list of words x = (w1, …, wn).
Output: a list of labels y = (y1, …, yn).
If there are K classes, there are K^n possible labelings for x.

8 Ranking perceptrons → structured perceptrons
Structured classification on a sequence:
Input: a list of words x = (w1, …, wn).
Output: a list of labels y = (y1, …, yn).
If there are K classes, there are K^n possible labelings for x.
Structured → ranking API:
A sends B the word sequence x.
B finds the single best y according to the current weight vector (using dynamic programming).
A tells B which y was actually best.
This is equivalent to ranking the pairs g = (x, y'), but it implements structured classification's API. A sketch follows.

9 Ranking perceptrons → structured perceptrons
(Same structured → ranking picture as the previous slide, with two additional points.)
If it takes time to find the best y, each example is more expensive to process, so parallel learning has a bigger payoff.
You can use a voted perceptron for learning whenever you can find the best y (sequence labeling, parse trees, …).

10 Parallel Perceptrons and Iterative Parameter Mixing

11 NAACL 2010

12 Parallel Structured Perceptrons
Simplest idea:
Split the data into S "shards".
Train a perceptron on each shard independently; the resulting weight vectors are w(1), w(2), …
Produce a weighted average of the w(i)'s as the final result (aka a "parameter mixture"). A sketch follows.
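A sketch of this "take 1" parameter mixture. The `train_perceptron` argument is a stand-in for any single-shard trainer (e.g., the perceptron sketched earlier); the loop over shards would run in parallel in practice.

```python
def parameter_mixture(shards, train_perceptron, mix_weights=None):
    """Train one perceptron per shard independently, then return a
    weighted average of the resulting weight vectors w(1), w(2), ..."""
    ws = [train_perceptron(shard) for shard in shards]   # embarrassingly parallel
    if mix_weights is None:
        mix_weights = [1.0 / len(ws)] * len(ws)          # uniform mixing, mu = 1/S
    return sum(mu * w for mu, w in zip(mix_weights, ws))
```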

13 Parallelizing perceptrons
(Diagram: the instances/labels are split into example subsets 1, 2, 3; a vk is computed on each subset (vk-1, vk-2, vk-3); these are combined by some sort of weighted averaging into a single vk.)

14 Parallel Perceptrons
Simplest idea:
Split the data into S "shards".
Train a perceptron on each shard independently; the resulting weight vectors are w(1), w(2), …
Produce some weighted average of the w(i)'s as the final result.
Theorem: this can be arbitrarily bad.
Proof: by constructing an example where training converges on every shard, yet the averaged vector fails to separate the full training set, no matter how the components are averaged.

15 Parallel Perceptrons – take 2
Idea: do the simplest possible thing iteratively.
Split the data into shards.
Let w = 0.
For n = 1, …:
Train a perceptron on each shard with one pass, starting from w.
Average the weight vectors (somehow) and let w be that average.
Extra communication cost: redistributing the weight vectors, done less frequently than if fully synchronized and more frequently than if fully parallelized (an All-Reduce). A sketch follows.
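A sketch of this iterative scheme (IPM). `one_pass(w, shard)` is an assumed helper that runs a single perceptron pass over one shard starting from w; the per-shard calls would be distributed in practice.

```python
import numpy as np

def iterative_parameter_mixing(shards, dim, epochs, one_pass, mix_weights=None):
    """Repeat: run one perceptron pass per shard starting from the shared w,
    then average the shard vectors and redistribute the result as the new w."""
    S = len(shards)
    if mix_weights is None:
        mix_weights = [1.0 / S] * S                              # uniform mixing
    w = np.zeros(dim)
    for _ in range(epochs):
        local = [one_pass(w.copy(), shard) for shard in shards]  # parallel step
        w = sum(mu * wi for mu, wi in zip(mix_weights, local))   # mix / all-reduce
    return w
```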

16 Parallelizing perceptrons – take 2
(Diagram: starting from the previous w, the instances/labels are split into example subsets 1, 2, 3; local vk's (w-1, w-2, w-3) are computed on each subset; they are combined by some sort of weighted averaging into the new w.)

17 A theorem
(Slide shows the mistake bound for IPM; uniform mixing means μ = 1/S.)
Corollary: if we weight the vectors uniformly, the number of mistakes is still bounded, i.e., this is "enough communication" to guarantee convergence.
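The transcript does not reproduce the bound itself; as a hedged restatement of its general shape (assuming the usual notation: a unit vector u separating the data with margin γ, ‖x‖ ≤ R, and k_{n,i} mistakes made on shard i during epoch n):

```latex
% Hedged sketch of the shape of the IPM mistake bound shown on the slide:
\sum_{n}\sum_{i=1}^{S} \mu_{n,i}\,k_{n,i} \;\le\; \frac{R^{2}}{\gamma^{2}},
\qquad \text{with uniform mixing } \mu_{n,i} = \tfrac{1}{S}.
```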

18 What we know and don’t know
With uniform mixing (μ = 1/S): could we lose our speedup from parallelizing to slower convergence? In the worst case, a speedup by a factor of S is cancelled by convergence that is slower by a factor of S.

19 Results on NER (sequence labeling)
(Plots: NER results for the regular perceptron and for the averaged perceptron.)

20 Results on parsing
(Plots: parsing results for the regular perceptron and for the averaged perceptron.)

21 Theory - recap

22 The voted perceptron for ranking
A sends B a set of instances x1, x2, x3, x4, …; B computes a score vk . xi for each and returns the index b* of the "best" (highest-scoring) one; A reveals the index b of the one that was actually best.
If mistake (b* ≠ b): vk+1 = vk + xb - xb*.

23 Correcting v by adding xb – xb*
(Figure: the update shown relative to the separators u and -u and the instances x.)

24 Correcting v by adding xb – xb* (part 2)
(Figure: the resulting vk+1.)

25–28 (3a) The guess v2 after the two positive examples: v2 = v1 + x2
(Figures, over four animation frames: starting from v1, the example x2 is added to give the new guess v2, shown relative to the separators u and -u.)

29 Back to the IPM theorem…
With uniform mixing: μ = 1/S.

30 Back to the IPM theorem…
Notation: i indexes the shard, n the epoch. This part of the bound is not new.

31 The theorem… This is not new….

32 Reminder: Distribute
Follows from: u · w(i,1) ≥ k1,i γ.
This is new: we've never considered averaging operations before.

33 IH1, inductive case
(Slide shows the algebra: distribute the average, apply IH1 and assumption A1, and use that the μ's form a distribution to recover the margin term γ.)

34 (3a) The guess v2 after the two positive examples: v2 = v1 + x2
(3b) The guess v2 after one positive and one negative example: v2 = v1 - x2
(Figure: the two cases side by side, showing v1, +x1, +x2, -x2, v2, and the separators u and -u; the mistake condition yi xi · vk < 0 is marked.)

35 IH2: the proof is similar
IH1 and IH2 together imply the bound (as in the usual perceptron case).

36 What we know and don’t know
With uniform mixing: could we lose our speedup from parallelizing to slower convergence?
Formally – yes: the algorithm converges, but it could be S times slower.
Experimentally – no: the parallel version (IPM) is faster.
How robust are those experiments?

37 What we know and don’t know
Is there a tipping point where the cost of parallelizing outweighs the benefits?

38 What we know and don’t know

39 What we know and don’t know

40 Followup points
On beyond perceptron-based analysis: delayed SGD (the theory).
All-Reduce in Map-Reduce.
How bad is "take 1" of this paper?

41 All-reduce

42 Introduction
Common pattern: do some learning in parallel;
aggregate the local changes from each processor into the shared parameters;
distribute the new shared parameters back to each processor; and repeat.
AllReduce is implemented in MPI, and also in the VW code (John Langford) via a Hadoop-compatible scheme.
(Figure: MAP followed by ALLREDUCE.) A sketch of the pattern follows.
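A minimal sketch of the pattern using MPI's allreduce via mpi4py (an assumed tool choice; the synthetic shard and the perceptron-style local pass are purely illustrative):

```python
# Run with e.g.: mpirun -np 4 python allreduce_sketch.py
import numpy as np
from mpi4py import MPI

def local_pass(w, X, y):
    """Stand-in local learner: one perceptron-style pass over this worker's
    shard, returning the change it would like to make to w."""
    delta = np.zeros_like(w)
    for xi, yi in zip(X, y):
        if yi * (w + delta).dot(xi) <= 0:
            delta += yi * xi
    return delta

comm = MPI.COMM_WORLD
rng = np.random.default_rng(comm.Get_rank())
dim = 20
X = rng.normal(size=(1000, dim))               # this worker's (synthetic) shard
y = np.where(X[:, 0] >= 0, 1.0, -1.0)

w = np.zeros(dim)
for epoch in range(10):
    delta = local_pass(w, X, y)                # learn locally
    total = np.zeros(dim)
    comm.Allreduce(delta, total, op=MPI.SUM)   # aggregate local changes everywhere
    w += total / comm.Get_size()               # every worker now holds the same new w
```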

43

44

45

46

47

48

49 Gory details of VW Hadoop-AllReduce
Spanning-tree server: a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server.
Worker nodes ("fake" mappers): input for each worker is locally cached; workers all connect to the spanning-tree server; workers all execute the same code, which might contain AllReduce calls; workers synchronize whenever they reach an all-reduce.

50

51 Followup points
On beyond perceptron-based analysis: delayed SGD (the theory).
All-Reduce in Map-Reduce.
How bad is "take 1" of this paper?

52 Regret analysis for on-line optimization

53 2009

54 f is the loss function, x the parameter vector
Take a gradient step: x' = xt – ηt gt.
If you've restricted the parameters to a set X (e.g., they must be positive, …), project back: xt+1 = argmin over x in X of dist(x, x').
But… you might be using a "stale" gradient g, computed τ steps ago. A sketch follows.
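A sketch of this delayed, projected step (the names `grad`, `project`, and the queue of in-flight gradients are illustrative assumptions):

```python
import numpy as np
from collections import deque

def delayed_projected_sgd(grad, project, x0, steps, lr, tau):
    """Projected SGD where each applied gradient was computed tau steps
    earlier, modeling the 'stale g' described above."""
    x = x0.copy()
    in_flight = deque()                    # gradients computed but not yet applied
    for t in range(steps):
        in_flight.append(grad(x))          # gradient taken at the current iterate ...
        if len(in_flight) > tau:
            g = in_flight.popleft()        # ... applied only tau steps later
            x = project(x - lr * g)        # gradient step, then project back into X
    return x

# Example: constrain parameters to be non-negative.
# x = delayed_projected_sgd(grad, project=lambda z: np.maximum(z, 0.0),
#                           x0=np.zeros(10), steps=1000, lr=0.01, tau=3)
```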

55 Parallelizing perceptrons – take 2
(Diagram, repeated from earlier: starting from the previous w, the instances/labels are split into example subsets 1, 2, 3; local vk's (w-1, w-2, w-3) are computed on each subset; they are combined by some sort of weighted averaging into the new w.)

56 Parallelizing perceptrons – take 2
(Diagram: the same picture, but now each worker applies an SGD update against the shared w; the instances/labels are split into example subsets and the local updates are computed on each.)

57

58 Regret: how much loss was incurred during learning,
over and above the loss incurred with an optimal choice of x.
Special case: ft is 1 if a mistake was made and 0 otherwise, and ft(x*) = 0 for the optimal x*; then the regret is the number of mistakes made while learning. The definition is spelled out below.
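Written out (with x_t the parameters chosen at round t and X the feasible set):

```latex
R[T] \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in X} \sum_{t=1}^{T} f_t(x).
% If f_t is the 0/1 mistake indicator and some x* in X attains f_t(x*) = 0 for all t,
% then R[T] is exactly the number of mistakes made while learning.
```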

59 Theorem: you can find a learning rate so that the regret of delayed SGD is bounded by
(The bound itself appears on the slide, not in the transcript.) Here T = number of timesteps and τ = staleness > 0.
Assumptions: the examples are "near each other" (generalizing "all are close to the origin"), and the gradient is bounded, with no sudden sharp changes.
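As a hedged sketch of the order of that bound, consistent with the lemma on the next slide (constants depending on the gradient and closeness assumptions are omitted):

```latex
R[T] \;=\; O\!\left(\sqrt{\tau\,T}\right)
\qquad\text{sublinear in } T\text{, but degrading with the staleness } \tau.
```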

60 R[m] = regret after m instances
Lemma: if you must use information that is at least τ time steps old, then R[m] ≥ τ R[m/τ] in the worst case.
(Plot: R[m] vs. m, comparing R[m] with the delayed curve built from R[m/τ].)

61 Experiments?

62 Experiments
But: this speedup required using a quadratic kernel, which took 1 ms per example.

63 Followup points
On beyond perceptron-based analysis: delayed SGD (the theory).
All-Reduce in Map-Reduce.
How bad is "take 1" of this paper?

64

65 The data is not sharded into disjoint sets: all machines get all* the data
*Not actually all, but as much as each machine will have time to look at.
Think of this as an ensemble with limited training time.

66
(Plots: training loss (L2) and test loss (L2).)
There is also a formal bound on regret relative to the best classifier.

