RECAP
Parallel NB Training
Diagram: the documents/labels are split into subsets (1, 2, 3); partial counts are computed on each subset, then sorted and added into the full event counts and DFs.
Key points: the "full" event counts are the sum of the "local" counts, so independently computed local counts are easy to combine.
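To make the "sum of local counts" point concrete, here is a minimal sketch in Python; the shard format (lists of (words, label) pairs) and all helper names are assumptions for illustration, not code from the slides.

```python
# Hedged sketch: combining independently computed per-shard Naive Bayes counts.
from collections import Counter

def local_counts(shard):
    """Count (label, word) events and document frequencies on one shard."""
    event_counts, dfs = Counter(), Counter()
    for words, label in shard:
        event_counts[("label", label)] += 1
        for w in set(words):
            dfs[w] += 1
        for w in words:
            event_counts[(label, w)] += 1
    return event_counts, dfs

def merge(partials):
    """The 'full' counts are just the sum of the 'local' counts."""
    total_events, total_dfs = Counter(), Counter()
    for events, dfs in partials:
        total_events.update(events)
        total_dfs.update(dfs)
    return total_events, total_dfs

# shards = [shard_1, shard_2, shard_3]   # each a list of (words, label) pairs
# counts, dfs = merge(local_counts(s) for s in shards)
```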
Parallel Rocchio
Diagram: the documents/labels are split into subsets (1, 2, 3); partial v(y)'s are computed on each subset, then sorted and added into the full v(y)'s, using the shared DFs.
Key points: we need shared read access to the DFs, but not write access; the "full classifier" is a weighted average of the "local" classifiers, which is still easy to compute.
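A similar sketch for Rocchio, where the full classifier is a weighted average of per-shard class centroids; weighting each shard by its document count is one plausible choice, and all names here are illustrative.

```python
# Hedged sketch: the full Rocchio classifier as a weighted average of
# per-shard class centroids v(y).
from collections import defaultdict

def average_classifiers(shard_vectors, shard_sizes):
    """shard_vectors[i][y] is the centroid v(y) computed on shard i;
    each shard's contribution is weighted by the number of documents it holds."""
    total = sum(shard_sizes)
    full = defaultdict(lambda: defaultdict(float))
    for vecs, size in zip(shard_vectors, shard_sizes):
        weight = size / total
        for y, v in vecs.items():
            for feature, value in v.items():
                full[y][feature] += weight * value
    return full
```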
Parallel Perceptron Learning?
Diagram: the documents/labels are split into subsets (1, 2, 3); each produces a local v, feeding a shared classifier. Like the DFs or event counts, the classifier's size is O(|V|).
Key points: the "full classifier" would be a weighted average of the "local" classifiers, but the obvious solution requires read/write access to a shared classifier.
Parallel Streaming Learning
Diagram: the documents/labels are split into subsets (1, 2, 3); each worker updates a classifier that, like the DFs or event counts, has size O(|V|).
Key point: we need shared write access to the classifier, not just read access, so it is not enough to copy the information to each worker; we have to synchronize it.
Question: how much extra communication is there?
Answer: it depends on how the learner behaves:
– how many weights get updated with each example (in Naïve Bayes and Rocchio, only weights for features with non-zero value in x are updated when scanning x);
– how often it needs to update weights (i.e., how many mistakes it makes).
The perceptron game
A sends B an instance x_i; B computes the prediction ŷ_i = sign(v_k · x_i) and sends it back; A reveals the true label y_i. If B made a mistake, it updates v_{k+1} = v_k + y_i x_i.
Here x is a vector and y is -1 or +1.
The perceptron game: why it converges
(Same protocol as above; the figure shows the target vector u, its margin γ, and the guesses v_1, v_2 after updates.)
The mistake bound depends on how easy the learning problem is, not on the dimension of the vectors x. The argument is fairly intuitive: the "similarity" of v to u looks like (v·u)/||v||, and after each mistake (v·u) grows by at least γ while (v·v) grows by at most R².
The Voted Perceptron for Ranking and Structured Classification
The voted perceptron for ranking
A sends B a set of instances x_1, x_2, x_3, x_4, …; B computes a score y_i = v_k · x_i for each and returns the index b* of the "best" (highest-scoring) x_i; A reveals the index b of the right answer. If B made a mistake (b* ≠ b), it updates v_{k+1} = v_k + x_b - x_{b*}.
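A hedged sketch of the ranking version; the round format (a list of candidate vectors plus the index of the true best one) is an assumption for illustration.

```python
# Hedged sketch of the ranking perceptron update: A presents candidates,
# B picks the one it scores highest, A reveals the index of the true best.

def dot(v, x):
    return sum(v.get(f, 0.0) * value for f, value in x.items())

def ranking_perceptron(rounds):
    """rounds yields (candidates, b) where candidates is a list of sparse
    feature dicts and b is the index of the correct (best) candidate."""
    v = {}
    for candidates, b in rounds:
        scores = [dot(v, x) for x in candidates]        # y_i = v_k . x_i
        b_star = max(range(len(candidates)), key=scores.__getitem__)
        if b_star != b:                                 # mistake
            for f, value in candidates[b].items():      # v_{k+1} = v_k + x_b ...
                v[f] = v.get(f, 0.0) + value
            for f, value in candidates[b_star].items(): # ... - x_{b*}
                v[f] = v.get(f, 0.0) - value
    return v
```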
Figure: ranking some x's with the target vector u; γ is the margin.
Figure: ranking some x's with some guess vector v (part 1).
Figure: ranking some x's with some guess vector v (part 2). The purple-circled x is x_{b*}, the one the learner has chosen to rank highest; the green-circled x is x_b, the right answer.
Figure: correcting v by adding x_b - x_{b*}.
Figure: correcting v by adding x_b - x_{b*} (part 2), showing v_k and v_{k+1}.
Figure: (3a) the guess v_2 after the two positive examples, v_2 = v_1 + x_2; its projection onto u grows by more than γ with each mistake.
Notice this doesn't depend at all on the number of x's being ranked, and neither proof depends on the dimension of the x's.
Ranking perceptrons → structured perceptrons
The API:
– A sends B a (maybe huge) set of items to rank
– B finds the single best one according to the current weight vector
– A tells B which one was actually best
Structured classification on a sequence:
– Input: a list of words x = (w_1, …, w_n)
– Output: a list of labels y = (y_1, …, y_n)
– If there are K classes, there are K^n possible labelings of x
Ranking perceptrons → structured perceptrons
Suppose we can:
1. Given x and y, form a feature vector F(x,y); then we can score (x,y) with a weight vector, w·F(x,y).
2. Given x, find the top-scoring y with respect to w·F(x,y).
Then we can learn.
Structured classification on a sequence:
– Input: a list of words x = (w_1, …, w_n)
– Output: a list of labels y = (y_1, …, y_n)
– If there are K classes, there are K^n possible labelings of x
Example structured classification problem: segmentation or NER
– Example: addresses, bibliographic records
– Problem: some DBs may split records up differently (e.g., no "mail stop" field, combined address and apartment number, …) or not at all
– Solution: learn to segment the textual form of records
Example record, segmented into Author / Year / Title / Journal / Volume / Page fields: "P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237."
Running NER example: "When will prof Cohen post the notes …"
Converting segmentation to a feature set: label each token Begin, In, or Out (B, I, O), e.g., for "When will prof Cohen post the notes …".
Idea 1: features are properties of two adjacent tokens and the pair of labels assigned to them, for example:
– (y(i)==B or y(i)==I) and (token(i) is capitalized)
– (y(i)==I and y(i-1)==B) and (token(i) is hyphenated)
– (y(i)==B and y(i-1)==B), e.g., the two adjacent names in "tell Rahul William is on the way"
Idea 2: construct a graph where each path is a possible sequence labeling.
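A minimal sketch of Idea 1 as a feature function F(x, y) over adjacent label pairs and token properties; the specific feature names, the "<start>" boundary label, and the Counter packaging are illustrative assumptions, not the exact features from the slides.

```python
# Hedged sketch of Idea 1: features over adjacent (label, label) pairs and
# token properties, packaged as a sparse feature vector F(x, y).
from collections import Counter

def token_properties(token):
    props = []
    if token[:1].isupper():
        props.append("capitalized")
    if "-" in token:
        props.append("hyphenated")
    return props

def F(x, y):
    """x = list of tokens, y = list of B/I/O labels of the same length."""
    feats = Counter()
    prev = "<start>"                             # assumed boundary label (matches the decoder sketch below)
    for token, label in zip(x, y):
        feats[("label-pair", prev, label)] += 1  # e.g. (B, B): two adjacent entities
        for prop in token_properties(token):
            feats[("label+prop", label, prop)] += 1
        prev = label
    return feats

# F("tell Rahul William is on the way".split(), ["O","B","B","O","O","O","O"])
```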
Find the top-scoring y
Figure: a trellis with one column of {B, I, O} nodes per token of "When will prof Cohen post the notes …"; each path through the trellis is a possible sequence labeling.
Inference: find the highest-weight path. This can be done efficiently using dynamic programming (Viterbi).
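A compact Viterbi sketch, under the assumption that the path score decomposes into per-position scores over (previous label, label, token), which matches the adjacent-pair features above; all names here are illustrative.

```python
# Hedged Viterbi sketch: find the highest-scoring label path through the
# B/I/O trellis, assuming the path score decomposes over adjacent positions.

LABELS = ["B", "I", "O"]

def viterbi_decode(x, local_score):
    """local_score(i, prev_label, label) -> weight contributed at position i.
    Returns the highest-scoring label sequence for tokens x."""
    n = len(x)
    if n == 0:
        return []
    best = [{} for _ in range(n)]   # best[i][label] = (score, back-pointer)
    for label in LABELS:
        best[0][label] = (local_score(0, "<start>", label), None)
    for i in range(1, n):
        for label in LABELS:
            score, back = max(
                (best[i - 1][prev][0] + local_score(i, prev, label), prev)
                for prev in LABELS
            )
            best[i][label] = (score, back)
    # Recover the best path by following back-pointers from the best final label.
    label = max(LABELS, key=lambda lab: best[n - 1][lab][0])
    path = [label]
    for i in range(n - 1, 0, -1):
        label = best[i][label][1]
        path.append(label)
    return list(reversed(path))
```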
Ranking perceptrons → structured perceptrons
New API:
– A sends B the word sequence x
– B finds the single best y according to the current weight vector, using Viterbi
– A tells B which y was actually best
– This is equivalent to ranking the pairs g = (x, y') based on w·F(x, y')
Structured classification on a sequence:
– Input: a list of words x = (w_1, …, w_n)
– Output: a list of labels y = (y_1, …, y_n)
– If there are K classes, there are K^n possible labelings of x
The voted perceptron for NER
(Same ranking protocol as before, with instances g_1, g_2, g_3, g_4, …: B computes v_k · g_i, returns the index b* of the "best" g_i, and on a mistake sets v_{k+1} = v_k + g_b - g_{b*}.)
1. A sends B the feature functions and instructions for creating the instances g: A sends a word sequence x_i. B could explicitly create the instances g_1 = F(x_i, y_1), g_2 = F(x_i, y_2), …, but instead B just returns the y* that gives the best score for the dot product v_k · F(x_i, y*), found using Viterbi.
2. A sends B the correct label sequence y_i.
3. On errors, B sets v_{k+1} = v_k + g_b - g_{b*} = v_k + F(x_i, y_i) - F(x_i, y*).
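A hedged end-to-end sketch of this exchange, reusing F, token_properties, and viterbi_decode from the sketches above (so it inherits their assumptions); find_best_y wires the current weights into Viterbi's local score.

```python
# Hedged sketch of the structured perceptron loop for NER,
# reusing F(x, y), token_properties, and viterbi_decode from earlier sketches.

def find_best_y(w, x):
    """B's step: return the y* maximizing v_k . F(x, y*) via Viterbi."""
    def local_score(i, prev, label):
        s = w.get(("label-pair", prev, label), 0.0)
        for prop in token_properties(x[i]):
            s += w.get(("label+prop", label, prop), 0.0)
        return s
    return viterbi_decode(x, local_score)

def structured_perceptron(examples, epochs=5):
    """examples: list of (tokens, gold_labels) pairs."""
    w = {}
    for _ in range(epochs):
        for x, y in examples:                     # A sends x, then the gold y
            y_star = find_best_y(w, x)            # B's Viterbi-decoded guess
            if y_star != y:                       # on a mistake:
                for f, v in F(x, y).items():      #   w += F(x, y)
                    w[f] = w.get(f, 0.0) + v
                for f, v in F(x, y_star).items(): #   w -= F(x, y*)
                    w[f] = w.get(f, 0.0) - v
    return w
```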
EMNLP 2002 (screenshot of the paper).
Some background
Collins' parser: a generative model.
"New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron," Collins and Duffy, ACL 2002.
"Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron," Collins, ACL 2002:
– Propose entities using a MaxEnt tagger (as in MXPOST)
– Use beam search to get multiple taggings for each document (20)
– Learn to rerank the candidates to push correct ones to the top, using some new candidate-specific features:
  – the value of the "whole entity" (e.g., "Professor_Cohen")
  – capitalization features for the whole entity (e.g., "Xx+_Xx+")
  – the last word in the entity, and its capitalization features
  – bigrams/trigrams of words and capitalization features before and after the entity
Collins' experiments
– POS tagging
– NP chunking (words and POS tags from Brill's tagger as features, BIO output tags)
– Compared MaxEnt tagging / MEMMs (trained with iterative scaling) against "voted-perceptron-trained HMMs":
  – with and without averaging
  – with and without feature selection (count > 5)
Collins' results (table of results from the paper).
STRUCTURED PERCEPTRONS…
NAACL 2010 (screenshot of the paper discussed in the following slides).
Aside: this paper is on structured perceptrons, but everything it says formally applies to the standard perceptron as well. Briefly: a structured perceptron uses a weight vector to rank possible structured predictions y' using features f(x, y'); instead of incrementing the weight vector by y·x, it is incremented by f(x, y) - f(x, y').
Parallel structured perceptrons
Simplest idea:
– Split the data into S "shards"
– Train a perceptron on each shard independently; the resulting weight vectors are w^(1), w^(2), …
– Produce some weighted average of the w^(i)'s as the final result
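A minimal sketch of this one-shot parameter mixing; train_perceptron stands in for any per-shard trainer (e.g., the earlier perceptron sketch), and uniform mixing weights are a default assumption.

```python
# Hedged sketch of one-shot parameter mixing: train independently per shard,
# then take a weighted average of the weight vectors.

def mix(weight_vectors, mus=None):
    """Weighted average of sparse weight vectors; mus defaults to uniform 1/S."""
    S = len(weight_vectors)
    mus = mus or [1.0 / S] * S
    mixed = {}
    for w, mu in zip(weight_vectors, mus):
        for f, value in w.items():
            mixed[f] = mixed.get(f, 0.0) + mu * value
    return mixed

def parameter_mixing(shards, train_perceptron):
    local = [train_perceptron(shard) for shard in shards]  # embarrassingly parallel
    return mix(local)
```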
Parallelizing perceptrons
Diagram: the instances/labels are split into example subsets (1, 2, 3); a local v_k is computed on each subset; the local v_k's are combined by some sort of weighted averaging into a single v_k.
Parallel perceptrons
Simplest idea (as above):
– Split the data into S "shards"
– Train a perceptron on each shard independently; the resulting weight vectors are w^(1), w^(2), …
– Produce some weighted average of the w^(i)'s as the final result
Theorem: this doesn't always work. Proof: by constructing an example where the perceptron converges on every shard, and yet the averaged vector fails to separate the full training set, no matter how you average the components.
Parallel perceptrons – take 2
Idea: do the simplest possible thing, iteratively.
– Split the data into shards
– Let w = 0
– For n = 1, …:
  – train a perceptron on each shard with one pass, starting from w
  – average the weight vectors (somehow) and let w be that average
Extra communication cost: redistributing the weight vectors (an all-reduce), done less frequently than with full synchronization but more frequently than with fully independent training.
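A hedged sketch of this iterative scheme (iterative parameter mixing), reusing mix() from the previous sketch; the one-pass trainer and the binary-label setting are simplifying assumptions.

```python
# Hedged sketch of iterative parameter mixing: one pass per shard per epoch,
# each pass warm-started from the current mixed weights, then an all-reduce
# style average. Reuses mix() from the previous sketch.

def one_pass(shard, w):
    """One perceptron pass over a shard of (x, y) pairs, x sparse, y in {+1,-1},
    starting from (a copy of) the shared weights w."""
    w = dict(w)
    for x, y in shard:
        score = sum(w.get(f, 0.0) * v for f, v in x.items())
        if y * score <= 0:                      # mistake
            for f, v in x.items():
                w[f] = w.get(f, 0.0) + y * v
    return w

def iterative_parameter_mixing(shards, epochs=10):
    w = {}
    for _ in range(epochs):
        locals_ = [one_pass(shard, w) for shard in shards]  # run in parallel in practice
        w = mix(locals_)                                    # the "all-reduce" step
    return w
```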
Parallelizing perceptrons – take 2
Diagram: starting from the previous w, the instances/labels are split into example subsets (1, 2, 3); a local w is computed on each subset; the local w's are combined by some sort of weighted averaging into the new w, which seeds the next iteration.
A theorem
Corollary: if we weight the vectors uniformly, then the number of mistakes is still bounded; i.e., this is "enough communication" to guarantee convergence.
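One plausible reading of the bound being referenced, reconstructed from memory of the NAACL 2010 result and from the quantities that appear on the proof slides (the per-shard, per-iteration mistake counts k_{n,i}, the mixing weights μ_{n,i}, and the usual R and γ); treat it as approximate rather than authoritative:

```latex
% Hedged reconstruction of the statement:
\sum_{n}\sum_{i=1}^{S} \mu_{n,i}\, k_{n,i} \;\le\; \frac{R^2}{\gamma^2},
\qquad\text{so with uniform mixing } \mu_{n,i} = \tfrac{1}{S}:\quad
\sum_{n}\sum_{i=1}^{S} k_{n,i} \;\le\; \frac{S\,R^2}{\gamma^2}.
```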
What we know and don't know
With uniform mixing (μ = 1/S), could we lose our speedup from parallelizing to slower convergence? In the worst case allowed by the bound, a speedup by a factor of S is cancelled by slower convergence by a factor of S.
Results on NER (figure: perceptron vs. averaged perceptron).
Results on parsing (figure: perceptron vs. averaged perceptron).
The theorem… (stated over shards i and iterations n). This part of the argument is not new; it is the usual perceptron argument:
Figure: (3a) the guess v_2 after two positive examples, v_2 = v_1 + x_2; (3b) the guess v_2 after one positive and one negative example, v_2 = v_1 - x_2. In both cases the mistake update is v_{k+1} = v_k + y_i x_i, and the projection onto u grows by more than γ.
Figure (same setup as above): on a mistake, y_i (x_i · v_k) < 0, which is what bounds the growth of ||v_k||².
This part is new: we've never considered averaging operations before. It follows from: u · w^(i,1) ≥ k_{1,i} γ.
IH1, inductive case (proof slide): the step uses the margin γ, IH1, assumption A1, distributing the sum, and the fact that the μ's form a distribution.
The proof of IH2 is similar. IH1 and IH2 together imply the bound, as in the usual perceptron case.
Review/outline
Streaming learning algorithms … and beyond:
– Naïve Bayes
– Rocchio's algorithm
Similarities and differences:
– probabilistic vs. vector-space models
– computationally similar
– parallelizing Naïve Bayes and Rocchio
Alternative:
– adding up contributions for every example vs. conservatively updating a linear classifier
– the on-line learning model: mistake bounds, some theory, a mistake bound for the perceptron
– parallelizing the perceptron
Where we are…
Summary of the course so far:
– Math tools: complexity, probability, on-line learning
– Algorithms: Naïve Bayes, Rocchio, perceptron, phrase-finding as BLRT/pointwise-KL comparisons, …
– Design patterns: stream-and-sort, messages; how to write scanning algorithms that scale linearly on large data (memory does not depend on input size)
– Beyond scanning: parallel algorithms for ML, and the formal issues involved in parallelizing them: Naïve Bayes, Rocchio, … (easy?); conservative on-line methods such as the perceptron (hard?)
Next: practical issues in parallelizing; details on Hadoop.