Perceptrons – the story continues: on-line learning and regret analysis


Perceptrons – the story continues

On-line learning / regret analysis

Optimization is a great model of what you want to do, but a less good model of what you have time to do. Examples:
– How much do we lose when we replace gradient descent with SGD?
– What if we can only approximate the local gradient?
– What if the distribution changes over time?
– …

One powerful analytic approach: online learning, aka regret analysis (roughly, aka on-line optimization).

Theory: the prediction game

Player A:
– picks a "target concept" c, for now from a finite set of possibilities C (e.g., all decision trees of size m)
– for t = 1, 2, …:
  – A picks x = (x_1, …, x_n) and sends it to B (for now, from a finite set of possibilities, e.g., all binary vectors of length n)
  – B predicts a label ŷ and sends it to A
  – A sends B the true label y = c(x); we record whether B made a mistake

We care about the worst-case number of mistakes B will make over all possible concepts and training sequences of any length. The "mistake bound" for B, M_B(C), is this bound.
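A minimal sketch of this protocol as code, assuming hypothetical `learner` (playing B, with predict/update methods) and `concept` (A's target c) objects; the mistake bound M_B(C) is the worst case of the returned count over all concepts in C and all instance sequences.

```python
def prediction_game(learner, concept, instances):
    """Run the A/B protocol and count B's mistakes."""
    mistakes = 0
    for x in instances:              # A picks x and sends it to B
        y_hat = learner.predict(x)   # B predicts a label and sends it to A
        y = concept(x)               # A reveals the true label y = c(x)
        if y_hat != y:
            mistakes += 1            # record whether B made a mistake
        learner.update(x, y)         # B may use the true label to update its state
    return mistakes
```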

The voted perceptron: a simple algorithm with an easy mistake bound

A sends B an instance x_i. B computes ŷ_i = sign(v_k · x_i) and sends it back; A reveals the true label y_i. If B made a mistake: v_{k+1} = v_k + y_i x_i.
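A minimal sketch of this mistake-driven update, assuming dense numpy feature vectors and labels in {-1, +1}:

```python
import numpy as np

def perceptron(examples, n_features):
    """Plain perceptron: predict with sign(v . x), update only on mistakes."""
    v = np.zeros(n_features)
    for x, y in examples:                    # y in {-1, +1}
        y_hat = 1 if v.dot(x) >= 0 else -1   # compute sign(v_k . x_i)
        if y_hat != y:                       # mistake
            v = v + y * x                    # v_{k+1} = v_k + y_i x_i
    return v
```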

The perceptron game

Same protocol as above: x is a vector, y is -1 or +1; B predicts ŷ_i = sign(v_k · x_i) and on a mistake sets v_{k+1} = v_k + y_i x_i. [Figure: a unit separator u with margin 2γ, and guesses v_1, v_2 after updates.]

The mistake bound depends on how easy the learning problem is, not on the dimension of the vectors x. The argument is fairly intuitive: the "similarity" of v to u looks like (v · u)/||v||; (v · u) grows by at least γ after each mistake, while (v · v) grows by at most R².
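Written out, those two facts give the familiar bound (a sketch of the standard argument, with R = max_i ||x_i|| and u a unit-norm separator with margin γ): after k mistakes,

```latex
v_{k+1}\cdot u \;\ge\; k\gamma,
\qquad
\|v_{k+1}\|^2 \;\le\; kR^2
\;\;\Longrightarrow\;\;
k\gamma \;\le\; v_{k+1}\cdot u \;\le\; \|v_{k+1}\|\,\|u\| \;\le\; \sqrt{k}\,R
\;\;\Longrightarrow\;\;
k \;\le\; \frac{R^2}{\gamma^2}.
```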

[Figure. (3a) The guess v_2 after two positive examples: v_2 = v_1 + x_2. (3b) The guess v_2 after one positive and one negative example: v_2 = v_1 - x_2.]


Summary

We have shown that:
– If there exists a u with unit norm that has margin γ on the examples in the sequence (x_1, y_1), (x_2, y_2), … (i.e., the data is easily separable),
– Then the perceptron algorithm makes at most R²/γ² mistakes, where R = max_i ||x_i||,
– independent of the dimension of the data or the classifier (!)

We can improve this (it's not optimal, just bounded). There are many variants that rely on similar analysis (ROMMA, Passive-Aggressive, MIRA, …). We can extend the analysis to deal with noise, extend the method to deal with other losses, and make the method more robust with averaging or voting.

Theory: regret analysis

In general, we care about the worst-case "regret" of B: its loss on the on-line sequence, compared to the best predictions it could have made if it had seen the entire sequence before making any predictions.

The mistake bound is the realizable special case:
– the loss is 0/1 (mistake or not)
– the best case is 0 mistakes
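The general notion, written as a formula (a standard definition, not spelled out on the slide), with comparison class H and loss ℓ:

```latex
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell(\hat{y}_t, y_t)
\;-\; \min_{h \in H} \sum_{t=1}^{T} \ell\!\left(h(x_t), y_t\right).
```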

Averaging and voting

Theory says we get low error on the training set with a randomized algorithm:
– pick a timepoint t at random
– pick the perceptron v_k active at time t
– predict on x with v_k

We can approximate this by:
– voting the predictions of all the classifiers, weighted by their probability of being picked (i.e., by how long each survived)
– averaging the classifiers themselves, with the same weights
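A minimal sketch of the two approximations, assuming we kept each mistake-delimited weight vector v_k (as a numpy array) together with its survival count m_k:

```python
import numpy as np

def voted_predict(history, x):
    """history: list of (v_k, m_k) pairs. Vote each perceptron's prediction, weighted by m_k."""
    total = sum(m * (1 if v.dot(x) >= 0 else -1) for v, m in history)
    return 1 if total >= 0 else -1

def averaged_predict(history, x):
    """Average the weight vectors themselves with the same weights, then predict once."""
    v_avg = sum(m * v for v, m in history) / sum(m for _, m in history)
    return 1 if v_avg.dot(x) >= 0 else -1
```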

Perceptrons on the NIST digits data [results figure]

Summary

We have shown that:
– If there exists a u with unit norm that has margin γ on the examples in the sequence (x_1, y_1), (x_2, y_2), … (the data is easily separable),
– Then the perceptron algorithm makes at most R²/γ² mistakes, where R = max_i ||x_i||,
– independent of the dimension of the data or the classifier (!)

This is very similar to SVMs:
– in an SVM you search for a u that maximizes the margin γ subject to a bound on ||u||
– or else minimize ||u|| subject to a constraint on the margin
– and, like SVMs, you can kernelize perceptrons…

SPARSIFYING THE AVERAGED PERCEPTRON UPDATE

Complexity of perceptron learning

Algorithm (with sparse updates via a hashtable):
– v = 0  (init hashtable: O(n))
– for each example x, y:
  – if sign(v · x) != y:
    – for each x_i != 0: v_i += y·x_i  (O(|x|) = O(|d|), the number of non-zero features)
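A minimal sketch of the hashtable version, assuming examples are sparse dicts mapping feature ids to values and labels are in {-1, +1}:

```python
def sparse_perceptron(examples):
    """Perceptron over a hashtable weight vector: each update touches only the non-zero features of x."""
    v = {}                                          # init hashtable
    for x, y in examples:                           # x: {feature: value}, y in {-1, +1}
        score = sum(v.get(f, 0.0) * xi for f, xi in x.items())
        if (1 if score >= 0 else -1) != y:          # mistake
            for f, xi in x.items():                 # O(|x|) work, not O(n)
                v[f] = v.get(f, 0.0) + y * xi
    return v
```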



Complexity of averaged perceptron

Algorithm:
– vk = 0; va = 0  (init hashtables: O(n))
– for each example x, y:
  – if sign(vk · x) != y:
    – va = va + vk * mk  (for each vk_i != 0: va_i += mk · vk_i; O(|V|) per mistake, O(n|V|) overall in the worst case)
    – vk = vk + y·x  (for each x_i != 0: vk_i += y·x_i; O(|x|) = O(|d|))
    – mk = 1
  – else: mk++
– return va / k, where k = #mistakes

There are lots of mistakes… so paying O(|V|) to fold vk into va at every mistake gets expensive.

Alternative averaged perceptron

Algorithm:
– vk = 0; va = 0
– for each example x, y:
  – va = va + vk
  – t = t + 1
  – if sign(vk · x) != y: vk = vk + y·x
– return va / t

Let S_k be the set of examples up to and including the k-th mistake. Observe that va/t is just the average of the weight vectors in force at each step, so each mistake's update y·x contributes to va once on every subsequent iteration.

Alternative averaged perceptron (continued)

So when there is a mistake at time t on example (x, y), the update y·x gets added to va on every subsequent iteration. Suppose you know T, the total number of examples in the stream…

Alternative averaged perceptron (final version)

Algorithm (T = the total number of examples in the stream):
– vk = 0; va = 0
– for each example x, y:
  – t = t + 1
  – if sign(vk · x) != y:
    – vk = vk + y·x
    – va = va + (T - t)·y·x  (this accounts, in one sparse step, for all the subsequent additions of y·x to va)
– return va / T

Unpublished? I figured this out recently; Leon Bottou knows it too.
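A minimal sketch of this variant, assuming dense numpy vectors, labels in {-1, +1}, and a stream whose length T is known up front:

```python
import numpy as np

def averaged_perceptron_lazy(examples, n_features):
    """Averaged perceptron that only touches va on mistakes, via the (T - t) correction."""
    T = len(examples)
    vk = np.zeros(n_features)
    va = np.zeros(n_features)
    for t, (x, y) in enumerate(examples, start=1):   # y in {-1, +1}
        if (1 if vk.dot(x) >= 0 else -1) != y:       # mistake at time t
            vk = vk + y * x
            va = va + (T - t) * y * x                # all subsequent additions of y*x to va, at once
    return va / T
```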

PERCEPTRONS IN PARALLEL

Parallelizing perceptrons

[Figure: split the instances/labels into subsets 1, 2, 3; compute a vk/va pair on each subset; then combine the per-subset vk's somehow into a single vk.]

Parallelizing perceptrons (continued)

[Same figure, but the combined result is itself a vk/va pair.] The trade-off: synchronization cost vs. inference (classification) cost.
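One concrete way to "combine somehow" is parameter mixing: train independently on each shard and average the resulting weight vectors. A minimal sketch under that assumption (the slides leave the combination step open):

```python
import numpy as np

def train_shard(shard, n_features):
    """One pass of the plain perceptron over a single shard of (x, y) examples."""
    v = np.zeros(n_features)
    for x, y in shard:                                # y in {-1, +1}
        if (1 if v.dot(x) >= 0 else -1) != y:
            v = v + y * x
    return v

def parameter_mixing(shards, n_features):
    """Train each shard independently (these calls could run in parallel), then average."""
    vs = [train_shard(shard, n_features) for shard in shards]
    return np.mean(vs, axis=0)
```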

A hidden agenda

Part of machine learning is a good grasp of theory; part of ML is a good grasp of what hacks tend to work. These are not always the same, especially in big-data situations.

Catalog of useful tricks so far:
– Brute-force estimation of a joint distribution
– Naive Bayes
– Stream-and-sort, request-and-answer patterns
– BLRT and KL-divergence (and when to use them)
– TF-IDF weighting, especially IDF; it's often useful even when we don't understand why
– The perceptron / mistake-bound model, which often leads to fast, competitive, easy-to-implement methods; parallel versions are non-trivial to implement/understand

The Voted Perceptron for Ranking and Structured Classification

The voted perceptron for ranking

A sends B a set of instances x_1, x_2, x_3, x_4, … B computes a score y_i = v_k · x_i for each and returns the index b* of the "best" (highest-scoring) x_i; A reveals the index b of the one that was actually best. If B made a mistake: v_{k+1} = v_k + x_b - x_{b*}.
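A minimal sketch of the ranking update, assuming numpy candidate vectors and that A reveals the index of the correct item:

```python
import numpy as np

def rank_and_update(v, candidates, correct_index):
    """Score all candidates, return the argmax; on a mistake, update toward the correct item."""
    scores = [v.dot(x) for x in candidates]
    b_star = int(np.argmax(scores))                  # the index B ranks highest
    if b_star != correct_index:                      # mistake
        v = v + candidates[correct_index] - candidates[b_star]   # v_{k+1} = v_k + x_b - x_{b*}
    return v, b_star
```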

[Figure: ranking some x's with the target vector u; margin γ.]

[Figure: ranking some x's with some guess vector v, part 1.]

[Figure: ranking some x's with some guess vector v, part 2. The purple-circled x is x_{b*}, the one the learner has chosen to rank highest; the green-circled x is x_b, the right answer.]

[Figure: correcting v by adding x_b - x_{b*}.]

[Figure: correcting v by adding x_b - x_{b*}, part 2: v_k moves to v_{k+1}.]

[Figure (3a): the guess v_2 after the two positive examples: v_2 = v_1 + x_2.]



[Same figure.] Notice this doesn't depend at all on the number of x's being ranked; neither proof depends on the dimension of the x's.

Ranking perceptrons → structured perceptrons

The API:
– A sends B a (maybe huge) set of items to rank
– B finds the single best one according to the current weight vector
– A tells B which one was actually best

Structured classification on a sequence:
– Input: a list of words x = (w_1, …, w_n)
– Output: a list of labels y = (y_1, …, y_n)
– If there are K classes, there are K^n possible labelings for x

Ranking perceptrons → structured perceptrons

Suppose we can:
1. Given x and y, form a feature vector F(x, y); then we can score (x, y) with a weight vector: w · F(x, y)
2. Given x, find the top-scoring y with respect to w · F(x, y)

Then we can learn, using the same protocol as above.

Example

A structured classification problem: segmentation, or NER.
– Examples: addresses, bibliography records
– Problem: some DBs may split records up differently (e.g., no "mail stop" field, combined address and apt #, …) or not at all
– Solution: learn to segment the textual form of records

Example record, with segment labels Author / Year / Title / Journal / Volume / Page:
  P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, …

Running example sentence: "When will prof Cohen post the notes …"

Converting segmentation to a feature set

Labels: (Begin, In, Out). Running example: "When will prof Cohen post the notes …"

Idea 1: features are properties of two adjacent tokens and the pair of labels assigned to them, e.g.:
– (y(i) == B or y(i) == I) and (token(i) is capitalized)
– (y(i) == I and y(i-1) == B) and (token(i) is hyphenated)
– (y(i) == B and y(i-1) == B), e.g. "tell Rahul William is on the way"

Idea 2: construct a graph where each path is a possible sequence labeling. (A code sketch of Idea 1 follows below.)
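A minimal sketch of Idea 1, extracting indicator features from adjacent label pairs and simple token properties; the specific feature templates here are illustrative, not the exact ones used in the lecture:

```python
def extract_features(tokens, labels):
    """Indicator counts over (label(i-1), label(i)) pairs and properties of token(i)."""
    feats = {}
    for i, (tok, y) in enumerate(zip(tokens, labels)):   # labels in {"B", "I", "O"}
        y_prev = labels[i - 1] if i > 0 else "START"
        if y in ("B", "I") and tok[0].isupper():
            feats["cap_and_B_or_I"] = feats.get("cap_and_B_or_I", 0) + 1
        if y == "I" and y_prev == "B" and "-" in tok:
            feats["hyphen_and_B_after_B"] = feats.get("hyphen_and_B_after_B", 0) + 1
        if y == "B" and y_prev == "B":
            feats["B_follows_B"] = feats.get("B_follows_B", 0) + 1
    return feats

# e.g. extract_features("When will prof Cohen post the notes".split(),
#                       ["O", "O", "O", "B", "O", "O", "O"])
```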

Find the top-scoring y

[Figure: a lattice with one column per token of "When will prof Cohen post the notes …" and one node per label (B, I, O) in each column; each path through the lattice is a possible labeling.]

Inference: find the highest-weight path. This can be done efficiently with dynamic programming (Viterbi).
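A minimal Viterbi sketch over such a lattice. It assumes a per-position edge score `score(i, y_prev, y)`, which is a simplification: in the perceptron setting that score would be w · F restricted to the features that fire at position i for that label pair.

```python
def viterbi(n, labels, score):
    """Highest-weight path through an n-column lattice; score(i, y_prev, y) is the edge weight."""
    best = {y: score(0, "START", y) for y in labels}   # best path score ending in label y at position 0
    back = []                                          # back-pointers for positions 1..n-1
    for i in range(1, n):
        new_best, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[yp] + score(i, yp, y))
            new_best[y] = best[prev] + score(i, prev, y)
            ptr[y] = prev
        best = new_best
        back.append(ptr)
    # recover the path by following back-pointers from the best final label
    y = max(labels, key=lambda yl: best[yl])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))
```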

Ranking perceptrons → structured perceptrons

New API:
– A sends B the word sequence x
– B finds the single best y according to the current weight vector, using Viterbi
– A tells B which y was actually best
– This is equivalent to ranking the pairs g = (x, y') based on w · F(x, y')

The voted perceptron for NER

The ranking picture again, with instances g_1, g_2, g_3, g_4, …: B scores each g_i with v_k · g_i and returns the index b* of the "best" one; on a mistake, v_{k+1} = v_k + g_b - g_{b*}.

1. A sends B feature functions and instructions for creating the instances g: A sends a word sequence x_i. B could explicitly create the instances g_1 = F(x_i, y_1), g_2 = F(x_i, y_2), …, but instead B just returns the y* that maximizes the dot product v_k · F(x_i, y*), using Viterbi.
2. A sends B the correct label sequence y_i.
3. On errors, B sets v_{k+1} = v_k + g_b - g_{b*} = v_k + F(x_i, y) - F(x_i, y*).
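A minimal sketch of this update as code, assuming a feature function F(x, y) that returns a sparse dict and a hypothetical viterbi_decode(w, x) that returns the argmax labeling under w · F(x, ·):

```python
def structured_perceptron_step(w, x, y_true, F, viterbi_decode):
    """One structured-perceptron update: w += F(x, y_true) - F(x, y_pred) on a mistake."""
    y_pred = viterbi_decode(w, x)             # B's best guess under the current weights
    if y_pred != y_true:                      # mistake
        gold, guess = F(x, y_true), F(x, y_pred)
        for f, val in gold.items():
            w[f] = w.get(f, 0.0) + val
        for f, val in guess.items():
            w[f] = w.get(f, 0.0) - val
    return w
```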

[Figure: screenshot of a paper from EMNLP.]

Some background…

Collins' parser: a generative model…

– "New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron," Collins and Duffy, ACL
– "Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron," Collins, ACL
  – Propose entities using a MaxEnt tagger (as in MXPOST)
  – Use beam search to get multiple taggings for each document (20)
  – Learn to rerank the candidates to push correct ones to the top, using some new candidate-specific features:
    – value of the "whole entity" (e.g., "Professor_Cohen")
    – capitalization features for the whole entity (e.g., "Xx+_Xx+")
    – last word in the entity, and capitalization features of the last word
    – bigrams/trigrams of words and capitalization features before and after the entity

Some background… [figure]

Collins' Experiments

– POS tagging
– NP chunking (words and POS tags from Brill's tagger as features, BIO output tags)
– Compared MaxEnt tagging / MEMMs (with iterative scaling) and "voted-perceptron-trained HMMs":
  – with and without averaging
  – with and without feature selection (count > 5)

Collins' results [results table]