RECAP

Parallel NB Training

[Diagram: the documents/labels are split into subsets 1–3; partial counts are computed on each subset, then sorted and added to form the full counts and DFs.]

Key Points:
– The "full" event counts are a sum of the "local" counts.
– It is easy to combine independently computed local counts.
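
A minimal sketch of this counting-and-combining step (the function names and the use of Counter are my own illustrative choices; the slides don't specify an implementation):

```python
from collections import Counter

def count_shard(docs_with_labels):
    """Compute local Naive Bayes event counts for one shard of (words, label) pairs."""
    counts = Counter()
    for words, label in docs_with_labels:
        counts[("Y", label)] += 1          # class prior counts
        for w in words:
            counts[("X", w, label)] += 1   # word-given-class counts
    return counts

def combine(shard_counts):
    """The 'full' event counts are just the sum of the 'local' counts."""
    total = Counter()
    for c in shard_counts:
        total.update(c)
    return total

# Usage: counts = combine(count_shard(shard) for shard in shards)
```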

Parallel Rocchio

[Diagram: the documents/labels are split into subsets 1–3; partial v(y)'s are computed on each subset, then sorted and added to form the full v(y)'s, using shared DFs.]

Key Points:
– We need shared read access to the DFs, but not write access.
– The "full classifier" is a weighted average of the "local" classifiers – still easy!
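
A sketch of the averaging step, assuming each shard returns its local prototypes v(y) as sparse dictionaries along with the number of documents it saw. Weighting by shard size is one natural choice; the slides only say "weighted average":

```python
from collections import defaultdict

def average_local_classifiers(local_results):
    """Combine local Rocchio prototypes into the full classifier.

    local_results: list of (v_y, n) pairs, where v_y maps class -> {feature: weight}
    and n is the number of documents on that shard.  The full prototype is the
    average of the local ones, weighted by shard size.
    """
    total_n = sum(n for _, n in local_results)
    full = defaultdict(lambda: defaultdict(float))
    for v_y, n in local_results:
        for y, vec in v_y.items():
            for f, weight in vec.items():
                full[y][f] += (n / total_n) * weight
    return full
```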

Parallel Perceptron Learning?

[Diagram: the documents/labels are split into subsets 1–3; local classifiers v-1, v-2, v-3 feed into a shared classifier, the v(y)'s. Like the DFs or event counts, its size is O(|V|).]

Key Points:
– The "full classifier" is a weighted average of the "local" classifiers.
– The obvious solution requires read/write access to a shared classifier.

Parallel Streaming Learning

[Diagram: the documents/labels are split into subsets 1–3; local classifiers v-1, v-2, v-3 update a shared classifier, the v(y)'s. Like the DFs or event counts, its size is O(|V|).]

Key Point: We need shared write access to the classifier, not just read access – so it is not enough to copy the classifier to each worker; the copies have to be kept synchronized.

Question: How much extra communication is there?
Answer: It depends on how the learner behaves…
– …how many weights get updated with each example (in Naïve Bayes and Rocchio, only weights for features with non-zero weight in x are updated when scanning x)
– …how often it needs to update weights (i.e., how many mistakes it makes)

The perceptron game

A sends B an instance x_i; B computes the prediction ŷ_i = sign(v_k · x_i); A reveals the true label y_i. If B made a mistake: v_{k+1} = v_k + y_i x_i. Here x is a vector and y is -1 or +1.
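
A minimal sketch of this mistake-driven update on dense vectors (NumPy and the function name are my own choices for illustration; the slides don't specify an implementation):

```python
import numpy as np

def perceptron(examples, dim, epochs=1):
    """Train a perceptron.  examples is a list of (x, y) with x a vector and y in {-1, +1}."""
    v = np.zeros(dim)
    for _ in range(epochs):
        for x, y in examples:
            y_hat = 1 if v.dot(x) >= 0 else -1   # B's prediction: sign(v . x)
            if y_hat != y:                        # mistake: A's label disagrees
                v = v + y * x                     # update: v_{k+1} = v_k + y_i x_i
    return v
```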

The perceptron game (continued)

[Figure: the guess v_2 after two positive examples, v_2 = v_1 + x_2, drawn relative to the target u with margin γ.]

The mistake bound depends on how easy the learning problem is, not on the dimension of the vectors x. Fairly intuitive: the "similarity" of v to u looks like (v·u)/|v·v|; (v·u) grows by at least γ after each mistake, while (v·v) grows by at most R².
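
The standard argument behind this intuition, written out as a sketch (assuming, as in the usual statement, that ||u|| = 1, every example has ||x|| ≤ R, and u separates the data with margin γ):

```latex
% After k mistakes, starting from v_1 = 0:
% each mistake on (x_i, y_i) satisfies y_i(\mathbf{u}\cdot\mathbf{x}_i) \ge \gamma, so
\mathbf{v}_{k+1}\cdot\mathbf{u} \;\ge\; \mathbf{v}_{k}\cdot\mathbf{u} + \gamma
\;\;\Rightarrow\;\; \mathbf{v}_{k+1}\cdot\mathbf{u} \;\ge\; k\gamma .
% A mistake also means y_i(\mathbf{v}_k\cdot\mathbf{x}_i) \le 0, so
\|\mathbf{v}_{k+1}\|^2 = \|\mathbf{v}_k\|^2 + 2y_i(\mathbf{v}_k\cdot\mathbf{x}_i) + \|\mathbf{x}_i\|^2
\;\le\; \|\mathbf{v}_k\|^2 + R^2
\;\;\Rightarrow\;\; \|\mathbf{v}_{k+1}\|^2 \le kR^2 .
% Combining:  k\gamma \le \mathbf{v}_{k+1}\cdot\mathbf{u} \le \|\mathbf{v}_{k+1}\| \le R\sqrt{k},
% so the number of mistakes is bounded:  k \le R^2/\gamma^2 .
```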

The Voted Perceptron for Ranking and Structured Classification

The voted perceptron for ranking

A sends B a set of instances x_1, x_2, x_3, x_4, …; B computes ŷ_i = v_k · x_i for each and returns the index b* of the "best" (highest-scoring) x_i; A reveals the index b of the instance that was actually best. If B made a mistake: v_{k+1} = v_k + x_b - x_b*.
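
A sketch of this ranking update (NumPy again as an illustrative choice; not from the slides):

```python
import numpy as np

def ranking_perceptron_step(v, candidates, b):
    """One round of the ranking perceptron.

    candidates: array of shape (n, dim), one row per instance x_i.
    b: index of the instance A says is actually best.
    Returns the (possibly updated) weight vector.
    """
    scores = candidates.dot(v)          # y_i = v_k . x_i
    b_star = int(np.argmax(scores))     # B's guess: the top-scoring index
    if b_star != b:                     # mistake
        v = v + candidates[b] - candidates[b_star]   # v_{k+1} = v_k + x_b - x_{b*}
    return v
```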

[Figure: ranking some x's with the target vector u; γ marks the margin.]

[Figure: ranking some x's with some guess vector v – part 1.]

[Figure: ranking some x's with some guess vector v – part 2. The purple-circled x is x_b*, the one the learner has chosen to rank highest; the green-circled x is x_b, the right answer.]

[Figure: correcting v by adding x_b – x_b*.]

[Figure: correcting v by adding x_b – x_b* (part 2), showing v_k and v_{k+1}.]

[Figure, repeated over three slides: (3a) the guess v_2 after the two positive examples, v_2 = v_1 + x_2, drawn relative to u with margin γ.]

Notice this doesn't depend at all on the number of x's being ranked. Neither proof depends on the dimension of the x's.

Ranking perceptrons → structured perceptrons

The API:
– A sends B a (maybe huge) set of items to rank.
– B finds the single best one according to the current weight vector.
– A tells B which one was actually best.

Structured classification on a sequence:
– Input: list of words x = (w_1, …, w_n)
– Output: list of labels y = (y_1, …, y_n)
– If there are K classes, there are K^n possible labels for x.

Ranking perceptrons → structured perceptrons

Suppose we can:
1. Given x and y, form a feature vector F(x,y). Then we can score x,y using a weight vector: w·F(x,y).
2. Given x, find the top-scoring y with respect to w·F(x,y).
Then we can learn….

Structured classification on a sequence:
– Input: list of words x = (w_1, …, w_n)
– Output: list of labels y = (y_1, …, y_n)
– If there are K classes, there are K^n possible labels for x.

Example structured classification problem: segmentation or NER
– Example: addresses, bib records.
– Problem: some DBs may split records up differently (e.g., no "mail stop" field, combined address and apt #, …) or not at all.
– Solution: learn to segment the textual form of records.

[Example record, segmented into Author / Year / Title / Journal / Volume / Page: P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, …]

[NER example: "When will prof Cohen post the notes …"]

Converting segmentation to a feature set (Begin, In, Out)

Example sequence: "When will prof Cohen post the notes …"

Idea 1: features are properties of two adjacent tokens and the pair of labels assigned to them, e.g.:
– (y(i)==B or y(i)==I) and (token(i) is capitalized)
– (y(i)==I and y(i-1)==B) and (token(i) is hyphenated)
– (y(i)==B and y(i-1)==B), e.g. "tell Rahul William is on the way"

Idea 2: construct a graph where each path is a possible sequence labeling.
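
A sketch of Idea 1 as a feature function over adjacent positions; the particular feature names, and the "START" marker for the first position, are my own illustrative choices:

```python
from collections import Counter

def local_features(tokens, i, y_prev, y_curr):
    """Features of position i, defined on the token and the adjacent pair of labels."""
    tok = tokens[i]
    feats = []
    if y_curr in ("B", "I") and tok[:1].isupper():
        feats.append("capitalized_and_label_B_or_I")
    if y_curr == "I" and y_prev == "B" and "-" in tok:
        feats.append("hyphenated_and_labels_B_I")
    if y_curr == "B" and y_prev == "B":
        feats.append("labels_B_B")          # e.g. "tell Rahul William is on the way"
    feats.append("labels=%s_%s" % (y_prev, y_curr))
    feats.append("word=%s_label=%s" % (tok.lower(), y_curr))
    return feats

def F(tokens, labels):
    """Global feature vector F(x, y): counts of local features over the whole sequence."""
    counts = Counter()
    for i, y in enumerate(labels):
        y_prev = labels[i - 1] if i > 0 else "START"
        counts.update(local_features(tokens, i, y_prev, y))
    return counts
```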

Find the top-scoring y

[Figure: a lattice with one column of {B, I, O} nodes per token of "When will prof Cohen post the notes …".]

Inference: find the highest-weight path. This can be done efficiently using dynamic programming (Viterbi).
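
A minimal Viterbi sketch over such a lattice, assuming a scoring function score(i, y_prev, y_curr) = w · f(x, i, y_prev, y_curr); the function name, signature, and "START" symbol are mine, for illustration:

```python
def viterbi(n, labels, score):
    """Find the highest-scoring label sequence of length n.

    labels: the label set, e.g. ("B", "I", "O").
    score(i, y_prev, y_curr): weight of choosing y_curr at position i after y_prev
    (y_prev is "START" at position 0).
    """
    best = {y: score(0, "START", y) for y in labels}   # best score of a path ending in y at position 0
    back = [{} for _ in range(n)]                      # back-pointers
    for i in range(1, n):
        new_best = {}
        for y in labels:
            prev, s = max(((p, best[p] + score(i, p, y)) for p in labels), key=lambda t: t[1])
            new_best[y] = s
            back[i][y] = prev
        best = new_best
    # follow back-pointers from the best final label
    y = max(best, key=best.get)
    path = [y]
    for i in range(n - 1, 0, -1):
        y = back[i][y]
        path.append(y)
    return list(reversed(path))
```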

Ranking perceptrons → structured perceptrons

New API:
– A sends B the word sequence x.
– B finds the single best y according to the current weight vector, using Viterbi.
– A tells B which y was actually best.
– This is equivalent to ranking pairs g = (x, y') based on w·F(x,y').

Structured classification on a sequence:
– Input: list of words x = (w_1, …, w_n)
– Output: list of labels y = (y_1, …, y_n)
– If there are K classes, there are K^n possible labels for x.

The voted perceptron for NER

As in the ranking setting, B scores instances g_1, g_2, g_3, g_4, … with ŷ_i = v_k · g_i, returns the index b* of the "best" g_i, and on a mistake sets v_{k+1} = v_k + g_b - g_b*. Concretely:
1. A sends B feature functions, and instructions for creating the instances g: A sends a word sequence x_i. B could create the instances g_1 = F(x_i, y_1), g_2 = F(x_i, y_2), …, but instead B just returns the y* that gives the best score for the dot product v_k · F(x_i, y*), by using Viterbi.
2. A sends B the correct label sequence y_i.
3. On errors, B sets v_{k+1} = v_k + g_b - g_b* = v_k + F(x_i, y) - F(x_i, y*).
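
Putting the pieces together, a sketch of the structured perceptron training loop. It reuses the hypothetical F and Viterbi-style decoder sketched above, and keeps the weights in a Counter keyed by feature name (all of this is illustrative, not the paper's code):

```python
from collections import Counter

def train_structured_perceptron(data, F, decode, epochs=5):
    """Structured perceptron training.

    data: list of (tokens, gold_label_sequence).
    F(tokens, labels): global feature vector (a Counter), as sketched earlier.
    decode(tokens, w): B's highest-scoring label sequence under weights w (e.g. via Viterbi).
    """
    w = Counter()
    for _ in range(epochs):
        for tokens, y in data:
            y_star = decode(tokens, w)          # B's best guess
            if y_star != y:                     # mistake
                w.update(F(tokens, y))          # v_{k+1} = v_k + F(x, y) ...
                w.subtract(F(tokens, y_star))   # ... - F(x, y*)
    return w
```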

[Slide: screenshot of an EMNLP paper.]

Some background…

Collins' parser: generative model…
– New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron, Collins and Duffy, ACL 2002.
– Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron, Collins, ACL 2002:
  – Propose entities using a MaxEnt tagger (as in MXPOST).
  – Use beam search to get multiple taggings for each document (20).
  – Learn to rerank the candidates to push correct ones to the top, using some new candidate-specific features:
    – value of the "whole entity" (e.g., "Professor_Cohen")
    – capitalization features for the whole entity (e.g., "Xx+_Xx+")
    – last word in the entity, and capitalization features of the last word
    – bigrams/trigrams of words and capitalization features before and after the entity
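
A sketch of the capitalization-pattern idea behind those candidate-specific features (my own illustrative implementation, not Collins' code):

```python
import re

def cap_pattern(word):
    """Map a word to a coarse capitalization shape, e.g. 'Cohen' -> 'Xx+', 'IBM' -> 'X+'."""
    shape = "".join("X" if c.isupper() else "x" if c.islower() else "9" if c.isdigit() else c
                    for c in word)
    return re.sub(r"(.)\1+", r"\1+", shape)   # collapse repeated characters into 'c+'

def entity_features(entity_words):
    """Candidate-specific features for a proposed entity, e.g. ['Professor', 'Cohen']."""
    return {
        "whole_entity=" + "_".join(entity_words),                                # Professor_Cohen
        "whole_entity_caps=" + "_".join(cap_pattern(w) for w in entity_words),   # Xx+_Xx+
        "last_word=" + entity_words[-1],
        "last_word_caps=" + cap_pattern(entity_words[-1]),
    }
```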

Some background… [figure continuing the previous slide]

Collins' Experiments

– POS tagging.
– NP chunking (words and POS tags from Brill's tagger as features, BIO output tags).
– Compared MaxEnt tagging/MEMMs (with iterative scaling) against "voted-perceptron-trained HMMs":
  – with and without averaging
  – with and without feature selection (count > 5)

[Table: Collins' results.]

STRUCTURED PERCEPTRONS…

[Slide: screenshot of a NAACL paper.]

Aside: this paper is on structured perceptrons…

…but everything they say formally applies to the standard perceptron as well.

Briefly: a structured perceptron uses a weight vector to rank possible structured predictions y' using features f(x, y'). Instead of incrementing the weight vector by y·x, the weight vector is incremented by f(x,y) - f(x,y').

Parallel Structured Perceptrons

Simplest idea:
– Split the data into S "shards".
– Train a perceptron on each shard independently; the weight vectors are w(1), w(2), ….
– Produce some weighted average of the w(i)'s as the final result.
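
A sketch of this simplest parameter-mixing scheme (illustrative names; train_shard stands in for any single-shard perceptron trainer, such as the ones sketched earlier):

```python
import numpy as np

def parameter_mixing(shards, train_shard, dim, mu=None):
    """Train independently on each shard and return a weighted average of the weights.

    shards: list of datasets; train_shard(shard, dim) -> weight vector w(i).
    mu: mixing weights (a distribution over shards); uniform if not given.
    """
    S = len(shards)
    mu = np.full(S, 1.0 / S) if mu is None else np.asarray(mu)
    ws = [train_shard(shard, dim) for shard in shards]   # embarrassingly parallel in principle
    return sum(m * w for m, w in zip(mu, ws))            # final result: sum_i mu_i w(i)
```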

Parallelizing perceptrons

[Diagram: the instances/labels are split into example subsets 1–3; vk's are computed on each subset (vk-1, vk-2, vk-3) and combined by some sort of weighted averaging into vk.]

Parallel Perceptrons

Simplest idea:
– Split the data into S "shards".
– Train a perceptron on each shard independently; the weight vectors are w(1), w(2), ….
– Produce some weighted average of the w(i)'s as the final result.

Theorem: this doesn't always work. Proof: by constructing an example where you can converge on every shard, and still have the averaged vector not separate the full training set – no matter how you average the components.

Parallel Perceptrons – take 2

Idea: do the simplest possible thing iteratively.
– Split the data into shards.
– Let w = 0.
– For n = 1, …:
  – Train a perceptron on each shard with one pass, starting from w.
  – Average the weight vectors (somehow) and let w be that average (an All-Reduce).

Extra communication cost: redistributing the weight vectors – done less frequently than if fully synchronized, more frequently than if fully parallelized.
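
A sketch of this iterative scheme, reusing the illustrative names from the previous sketch; one_pass(shard, w, dim) is assumed to run a single perceptron pass over a shard starting from w:

```python
import numpy as np

def iterative_parameter_mixing(shards, one_pass, dim, iterations=10, mu=None):
    """Repeatedly: train one pass per shard from the current w, then average (all-reduce)."""
    S = len(shards)
    mu = np.full(S, 1.0 / S) if mu is None else np.asarray(mu)
    w = np.zeros(dim)
    for _ in range(iterations):
        local = [one_pass(shard, w.copy(), dim) for shard in shards]  # in parallel, in principle
        w = sum(m * wi for m, wi in zip(mu, local))                   # mix: w = sum_i mu_i w(i)
    return w
```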

Parallelizing perceptrons – take 2

[Diagram: starting from the previous w, the instances/labels are split into example subsets 1–3; local w's (w-1, w-2, w-3) are computed on each subset and combined by some sort of weighted averaging into the new w, which is fed back into the next iteration.]

A theorem

[Theorem statement: an equation image in the original slide.]

Corollary: if we weight the vectors uniformly, then the number of mistakes is still bounded. I.e., this is "enough communication" to guarantee convergence.

What we know and don't know

– Uniform mixing: μ = 1/S.
– Could we lose our speedup-from-parallelizing to slower convergence? In the worst case, a speedup by a factor of S is cancelled by slower convergence by a factor of S.

[Figure: results on NER – perceptron vs. averaged perceptron.]

[Figure: results on parsing – perceptron vs. averaged perceptron.]

The theorem…

[Equation image from the paper, with annotations for shard i and iteration n.] This is not new….

[Figure: (3a) the guess v_2 after the two positive examples, v_2 = v_1 + x_2; (3b) the guess v_2 after the one positive and one negative example, v_2 = v_1 - x_2. If mistake: v_{k+1} = v_k + y_i x_i.]

The theorem…

[Equation image from the paper.] This is not new….

[Figure, repeated: (3a) the guess v_2 after the two positive examples, v_2 = v_1 + x_2; (3b) the guess v_2 after the one positive and one negative example, v_2 = v_1 - x_2. A mistake means y_i (x_i · v_k) < 0.]

This is new….

We've never considered averaging operations before. Follows from: u·w^(i,1) ≥ k_{1,i} γ.

IH1, inductive case: [equation image; the annotated steps are: apply IH1, use A1, distribute, and use the fact that the μ's form a distribution.]

The proof of IH2 is similar. IH1 and IH2 together imply the bound (as in the usual perceptron case).
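
Written out, a sketch of how two such invariants combine (the exact constants in IH1 and IH2 are as in the paper; here k is the total number of mistakes, ||u|| = 1, margin γ, and ||x|| ≤ R):

```latex
% IH1-style invariant: the mixed weight vector keeps its progress along u
\mathbf{u}\cdot\mathbf{w} \;\ge\; k\,\gamma
% IH2-style invariant: its squared norm grows at most linearly in the number of mistakes
\|\mathbf{w}\|^2 \;\le\; k\,R^2
% Combining the two, exactly as in the serial perceptron proof:
k\gamma \;\le\; \mathbf{u}\cdot\mathbf{w} \;\le\; \|\mathbf{w}\| \;\le\; R\sqrt{k}
\quad\Longrightarrow\quad k \;\le\; \frac{R^2}{\gamma^2}.
```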

Review/outline

Streaming learning algorithms … and beyond
– Naïve Bayes
– Rocchio's algorithm
Similarities & differences
– Probabilistic vs vector-space models
– Computationally similar
– Parallelizing Naïve Bayes and Rocchio
Alternative:
– Adding up contributions for every example vs conservatively updating a linear classifier
– On-line learning model: mistake bounds
  – some theory
  – a mistake bound for the perceptron
– Parallelizing the perceptron

What we know and don't know

– Uniform mixing…
– Could we lose our speedup-from-parallelizing to slower convergence?

[Three further "What we know and don't know" slides, consisting of figures.]

Review/outline

Streaming learning algorithms … and beyond
– Naïve Bayes
– Rocchio's algorithm
Similarities & differences
– Probabilistic vs vector-space models
– Computationally similar
– Parallelizing Naïve Bayes and Rocchio
Alternative:
– Adding up contributions for every example vs conservatively updating a linear classifier
– On-line learning model: mistake bounds
  – some theory
  – a mistake bound for the perceptron
– Parallelizing the perceptron

Where we are…

Summary of the course so far:
– Math tools: complexity, probability, on-line learning.
– Algorithms: Naïve Bayes, Rocchio, perceptron, phrase-finding as BLRT/pointwise-KL comparisons, ….
– Design patterns: stream-and-sort, messages.
– How to write scanning algorithms that scale linearly on large data (memory does not depend on input size).
– Beyond scanning: parallel algorithms for ML.
– Formal issues involved in parallelizing:
  – Naïve Bayes, Rocchio, … easy?
  – Conservative on-line methods (e.g., perceptron) … hard?
Next: practical issues in parallelizing – details on Hadoop.