Parallel Perceptrons and Iterative Parameter Mixing

Presentation transcript:

Parallel Perceptrons and Iterative Parameter Mixing

Recap: perceptrons

The perceptron. A sends an instance x_i to B; B computes ŷ_i = sign(v_k · x_i) and predicts ŷ_i (x is a vector, y is −1 or +1); if the prediction is a mistake, B updates v_{k+1} = v_k + y_i x_i. Mistake bound: if some unit vector u separates the positive and negative examples with margin γ and every ||x_i|| ≤ R, the perceptron makes at most R²/γ² mistakes.
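A minimal sketch of that update rule in NumPy (an illustration, not the course code; labels are assumed to be ±1 and X is a matrix whose rows are the instances):

    import numpy as np

    def perceptron(X, y, epochs=1):
        """Mistake-driven perceptron: X is an (m, d) array, y holds labels in {-1, +1}."""
        v = np.zeros(X.shape[1])              # the current weight vector v_k
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                y_hat = np.sign(v @ x_i)      # predict y_hat = sign(v_k . x_i)
                if y_hat != y_i:              # on a mistake ...
                    v = v + y_i * x_i         # ... update v_{k+1} = v_k + y_i * x_i
        return v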

Averaged / voted perceptron: instead of predicting with the last perceptron v_k, predict using sign(v* · x), where v* averages the intermediate weight vectors, each weighted by the number of examples it survived (in the figure, m1 = 3 and m2 = 4).
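A sketch of the averaging variant, under the same assumptions as the sketch above: each intermediate weight vector is added into a running sum once per example it survives, and prediction uses the average v*:

    import numpy as np

    def averaged_perceptron(X, y, epochs=1):
        """Averaged perceptron: train as usual, but predict with the average weight vector."""
        v = np.zeros(X.shape[1])
        v_sum = np.zeros(X.shape[1])          # v is added once per example processed,
        for _ in range(epochs):               # so a vector that survives m_k examples
            for x_i, y_i in zip(X, y):        # contributes with weight m_k
                if np.sign(v @ x_i) != y_i:
                    v = v + y_i * x_i
                v_sum += v
        v_star = v_sum / (epochs * len(X))
        return v_star                         # predict with sign(v_star . x)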

The voted perceptron for ranking. A sends B a set of instances x1, x2, x3, x4, …; B computes a score ŷ_i = v_k · x_i for each and returns the index b* of the "best" (highest-scoring) x_i; A replies with the index b of the instance that was actually best; if b ≠ b*, B updates v_{k+1} = v_k + x_b − x_{b*}.
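A sketch of one round of this exchange (hypothetical names; instances is an array whose rows are the candidates x_i and b is the index A says was actually best):

    import numpy as np

    def ranking_perceptron_step(v, instances, b):
        """Score the candidates, return B's guess, and update v if the guess was wrong."""
        scores = instances @ v                           # y_hat_i = v_k . x_i for each candidate
        b_star = int(np.argmax(scores))                  # index of B's "best" candidate
        if b_star != b:                                  # mistake: A's best was a different index
            v = v + instances[b] - instances[b_star]     # v_{k+1} = v_k + x_b - x_{b*}
        return v, b_star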

The voted perceptron for ranking (figure: the same margin picture as before, comparing the normal perceptron update with the ranking perceptron update; separator u with margin 2γ, guess v1 updated by +x2 to give v2, progress > γ).

Ranking perceptrons → structured perceptrons. Ranking API: A sends B a (maybe huge) set of items to rank; B finds the single best one according to the current weight vector; A tells B which one was actually best. Structured classification API: input is a list of words x = (w1, …, wn); output is a list of labels y = (y1, …, yn). If there are K classes, there are K^n possible labelings for x.

Ranking perceptrons → structured perceptrons. Structured → ranking API: A sends B the word sequence x; B finds the single best y according to the current weight vector (using dynamic programming); A tells B which y was actually best. This is equivalent to ranking the pairs g = (x, y'), but it implements structured classification's API. (Structured classification on a sequence: input is a list of words x = (w1, …, wn), output is a list of labels y = (y1, …, yn); with K classes there are K^n possible labelings for x.)

Ranking perceptrons → structured perceptrons. If it takes time to find the best y, each example becomes more expensive to process, so parallel learning has a bigger payoff. The structured → ranking API is as before: A sends B the word sequence x; B finds the single best y under the current weight vector (using dynamic programming); A tells B which y was actually best, which is equivalent to ranking the pairs g = (x, y'). You can use a voted perceptron for learning whenever you can find the best y (sequence labeling, parse trees, …); a sketch follows.
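A sketch of the structured case seen through the ranking lens; phi and find_best_y are hypothetical placeholders for the feature map and the dynamic-programming (e.g. Viterbi) decoder the slide assumes:

    import numpy as np

    def structured_perceptron_step(w, x, y_true, phi, find_best_y):
        """One update of a structured (ranking-style) perceptron.

        phi(x, y)         -> feature vector for the pair (x, y)        [hypothetical]
        find_best_y(w, x) -> argmax_y of w . phi(x, y), e.g. Viterbi   [hypothetical]
        """
        y_hat = find_best_y(w, x)                      # B's single best y under the current w
        if y_hat != y_true:                            # A reveals which y was actually best
            w = w + phi(x, y_true) - phi(x, y_hat)     # same update as ranking: + x_b - x_{b*}
        return w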

Parallel Perceptrons and Iterative Parameter Mixing

NAACL 2010: McDonald, Hall and Mann, "Distributed Training Strategies for the Structured Perceptron."

Parallel Structured Perceptrons. Simplest idea: split the data into S "shards"; train a perceptron on each shard independently, producing weight vectors w(1), w(2), …; output a weighted average of the w(i)'s as the final result (aka a parameter mixture); a sketch follows.
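A sketch of this take-1 scheme, assuming a train_perceptron(X_s, y_s) helper like the earlier sketch and shards given as (X_s, y_s) pairs:

    import numpy as np

    def parameter_mixture(shards, train_perceptron, mix_weights=None):
        """Train a perceptron independently on each shard, then mix the weight vectors."""
        ws = [train_perceptron(X_s, y_s) for X_s, y_s in shards]   # w(1), w(2), ...
        if mix_weights is None:
            mix_weights = np.full(len(ws), 1.0 / len(ws))          # uniform mixing, mu = 1/S
        return sum(mu * w for mu, w in zip(mix_weights, ws))       # the parameter mixture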

Parallelizing perceptrons (diagram): split the instances/labels into example subsets 1, 2, 3; compute a local v_k on each subset (v_k-1, v_k-2, v_k-3); combine them by some sort of weighted averaging into a single v_k.

Parallel Perceptrons. Simplest idea (take 1): split the data into S "shards"; train a perceptron on each shard independently, producing weight vectors w(1), w(2), …; output some weighted average of the w(i)'s as the final result. Theorem: this can be arbitrarily bad. Proof: by constructing an example where the perceptron converges on every shard, and yet the averaged vector fails to separate the full training set, no matter how you weight the components.

Parallel Perceptrons, take 2. Idea: do the simplest possible thing iteratively. Split the data into shards; let w = 0; for n = 1, …: train a perceptron on each shard for one pass, starting from w, then average the resulting weight vectors (somehow) and let w be that average. Extra communication cost: redistributing the weight vectors, done less frequently than if fully synchronized but more frequently than if fully parallelized (an All-Reduce). A sketch of the loop follows.
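A sketch of the loop, with the per-shard passes written serially where a real implementation would run them in parallel (uniform mixing, same assumptions as before):

    import numpy as np

    def iterative_parameter_mixing(shards, n_epochs, dim):
        """IPM sketch: each epoch, one perceptron pass per shard starting from the shared w,
        then a uniform average of the per-shard results (the all-reduce step)."""
        w = np.zeros(dim)
        for _ in range(n_epochs):
            local_ws = []
            for X_s, y_s in shards:                  # in practice, one shard per worker
                v = w.copy()
                for x_i, y_i in zip(X_s, y_s):       # one pass over this shard
                    if np.sign(v @ x_i) != y_i:
                        v = v + y_i * x_i
                local_ws.append(v)
            w = sum(local_ws) / len(local_ws)        # mix and redistribute
        return w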

Parallelizing perceptrons, take 2 (diagram): start from the previous w; split the instances/labels into example subsets 1, 2, 3; compute local w's on each subset (w-1, w-2, w-3); combine them by some sort of weighted averaging into the new w, and repeat.

A theorem (uniform mixing, μ = 1/S; the mistake bound itself appears as an equation on the slide). Corollary: if we weight the vectors uniformly, then the number of mistakes is still bounded. I.e., this is "enough communication" to guarantee convergence.

What we know and don't know (uniform mixing, μ = 1/S): could we lose our speedup from parallelizing to slower convergence? In the worst case, a speedup by a factor of S is cancelled by convergence that is slower by a factor of S.

Results on NER (sequence labeling): results figures for the perceptron and for the averaged perceptron.

Results on parsing: results figures for the perceptron and for the averaged perceptron.

Theory - recap

The voted perceptron for ranking (recap). A sends B a set of instances x1, x2, x3, x4, …; B computes a score ŷ_i = v_k · x_i for each and returns the index b* of the "best" (highest-scoring) x_i; A replies with the index b of the instance that was actually best; if b ≠ b*, B updates v_{k+1} = v_k + x_b − x_{b*}.

Correcting v by adding x_b − x_{b*} (figure: the update moves v toward the separator u, among the candidate instances x).

Correcting v by adding x_b − x_{b*}, part 2 (figure: the updated vector v_{k+1}).

(3a) The guess v2 after the two positive examples: v2 = v1 + x2 (figure, built up over several animation steps: the separator u with margin 2γ, the guess v1, and the update +x2 giving v2; each mistake makes progress of more than γ along u).

Back to the IPM theorem… (uniform mixing, μ = 1/S).

Back to the IPM theorem… (notation: i indexes shards, n indexes epochs). This part is not new….

The theorem… (shown as an equation on the slide). This part is not new….

Reminder: distribute. Follows from u · w(i,1) ≥ k_{1,i} γ. This part is new…: we've never considered averaging operations before.

IH1, inductive case (equation steps on the slide): distribute, apply IH1, use assumption A1, distribute again, and use the fact that the μ's form a distribution (they sum to 1); the progress per mistake is at least γ.

(3a) The guess v2 after the two positive examples: v2 = v1 + x2. (3b) The guess v2 after one positive and one negative example: v2 = v1 − x2 (figure: both cases drawn against u, −u, and the 2γ margin). On a mistake, y_i v_k · x_i < 0, which is what bounds the growth of ||v_k||².

The proof of IH2 is similar. IH1 and IH2 together imply the mistake bound (as in the usual perceptron case).

What we know and don't know (uniform mixing): could we lose our speedup from parallelizing to slower convergence? Formally, yes: the algorithm converges but could be S times slower. Experimentally, no: the parallel version (IPM) is faster. How robust are those experiments?

What we know and don't know: is there a tipping point where the cost of parallelizing outweighs the benefits?

What we know and don't know (further experimental results figures).

Followup points. On beyond perceptron-based analysis: delayed SGD and its theory; All-Reduce in Map-Reduce; how bad is "take 1" of this paper?

All-reduce

Introduction. A common pattern: do some learning in parallel; aggregate the local changes from each processor into shared parameters; distribute the new shared parameters back to each processor; and repeat…. AllReduce is implemented in MPI, and also in the VW code (John Langford) in a Hadoop-compatible scheme (a MAP step plus ALLREDUCE). A sketch of the pattern follows.
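A minimal sketch of that pattern using mpi4py (my assumption for illustration; the slides' AllReduce lives in MPI proper or in VW's Hadoop scheme). Each worker runs some local learning step, the gradients are summed across workers, and every worker ends up with the same new parameters:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    def one_round(w, local_grad_fn, lr=0.1):
        """One learn / aggregate / redistribute round.

        local_grad_fn(w) stands in for 'do some learning in parallel' on this worker
        and is assumed to return a NumPy array the same shape as w.  [hypothetical]
        """
        local_grad = local_grad_fn(w)                        # local changes on this worker
        total_grad = np.zeros_like(local_grad)
        comm.Allreduce(local_grad, total_grad, op=MPI.SUM)   # aggregate across all workers
        return w - lr * total_grad / comm.Get_size()         # every worker gets the same new w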

Gory details of VW Hadoop-AllReduce. Spanning-tree server: a separate process constructs a spanning tree over the compute nodes in the cluster and then acts as a server. Worker nodes ("fake" mappers): each worker's input is locally cached; all workers connect to the spanning-tree server; all workers execute the same code, which may contain AllReduce calls; workers synchronize whenever they reach an all-reduce.

Followup points. On beyond perceptron-based analysis: delayed SGD and its theory; All-Reduce in Map-Reduce; how bad is "take 1" of this paper?

Regret analysis for on-line optimization

2009

f is the loss function, x is the parameter vector. Take a gradient step: x' = x_t − η_t g_t. If you've restricted the parameters to a subset X (e.g., they must be positive, …), project back to the closest point in X: x_{t+1} = argmin_{x ∈ X} dist(x, x'). But… you might be using a "stale" gradient g (from τ steps ago). A sketch follows.
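A sketch of that update with an explicit staleness buffer (grad_fn is a hypothetical stochastic-gradient oracle; projecting onto the positive orthant is just one example of a constraint set X, and the constant step size stands in for the decaying η_t of the analysis):

    import numpy as np
    from collections import deque

    def delayed_projected_sgd(grad_fn, x0, steps, tau=3, eta=0.01):
        """SGD where the gradient applied at step t was computed tau steps earlier,
        followed by projection back onto the constraint set X."""
        x = np.asarray(x0, dtype=float)
        in_flight = deque([np.zeros_like(x)] * tau)   # gradients not yet applied
        for t in range(steps):
            in_flight.append(grad_fn(x))     # computed now, applied tau steps from now
            g = in_flight.popleft()          # the "stale" gradient from tau steps ago
            x_prime = x - eta * g            # gradient step: x' = x_t - eta_t * g_t
            x = np.maximum(x_prime, 0.0)     # project: closest point in X (here x >= 0)
        return x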

Parallelizing perceptrons, take 2 (recap of the diagram): start from the previous w; split the instances/labels into example subsets; compute local w's on each subset; combine them by some sort of weighted averaging into the new w, and repeat.

Parallelizing perceptrons, take 2, SGD version (diagram): split the instances/labels into example subsets; each SGD worker computes a local update to the shared w.

Regret: how much loss was incurred during learning, over and above the loss incurred with an optimal choice of x. Special case: f_t is 1 if a mistake was made and 0 otherwise, so f_t(x*) = 0 for the optimal x*, and the regret is the number of mistakes made during learning.
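In symbols (standard online-learning notation, which the slide leaves implicit):

    Regret_T = \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in X} \sum_{t=1}^{T} f_t(x)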

Theorem: you can find a learning rate so that the regret of delayed SGD is bounded in terms of T (the number of timesteps) and τ (the staleness, > 0), under two assumptions: the examples are "near each other" (generalizing "all are close to the origin"), and the gradient is bounded (no sudden sharp changes). The bound itself appears as an equation on the slide.

R[m] = regret after m instances. Lemma: if you have to use information from at least τ time steps ago, then R[m] ≥ τ · R[m/τ] in the worst case (figure: the regret curve R evaluated at m/τ and at m).

Experiments?

Experiments. But: this speedup required using a quadratic kernel, which took about 1 ms/example.

Followup points. On beyond perceptron-based analysis: delayed SGD and its theory; All-Reduce in Map-Reduce; how bad is "take 1" of this paper?

The data is not sharded into disjoint sets: all machines get all* the data (* not actually all, but as much as each machine will have time to look at). Think of this as an ensemble with limited training time…..

Plots of training loss (L2) and test loss (L2). There is also a formal bound on regret relative to the best classifier.