The Voted Perceptron for Ranking and Structured Classification

Presentation transcript:

The Voted Perceptron for Ranking and Structured Classification William Cohen

The voted perceptron. A sends B an instance xi. B computes the score vk . xi, predicts ŷi (the sign of the score), and sends ŷi back to A. A then sends B the true label yi. If B made a mistake (ŷi ≠ yi), B updates vk+1 = vk + yi xi.
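A minimal sketch of this mistake-driven loop in Python (variable names and the training-loop structure are my own, not from the slides):

```python
import numpy as np

def perceptron_train(examples, n_features, n_epochs=10):
    """Mistake-driven perceptron: update v only when the prediction is wrong."""
    v = np.zeros(n_features)
    for _ in range(n_epochs):
        for x, y in examples:                     # y is +1 or -1
            y_hat = 1 if v.dot(x) >= 0 else -1    # B's prediction: the sign of the score
            if y_hat != y:                        # mistake: v_{k+1} = v_k + y_i * x_i
                v = v + y * x
    return v
```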

(1) A target u. (2) The guess v1 after one positive example +x1, so v1 = x1. (Figure: u and -u, with a margin band of width 2γ.)

(3a) The guess v2 after the two positive examples: v2 = v1 + x2. (3b) The guess v2 after the one positive and one negative example: v2 = v1 - x2. (Figure, repeated over three animation frames: u and -u with the 2γ margin band; one frame marks that the update gains more than γ in the direction of u.)

On-line to batch learning. Pick a vk at random according to mk/m, the fraction of examples it survived (was used to predict on). Predict using the vk you just picked. (In practice, use a deterministic approximation to this randomized rule, such as letting all the vk's vote.)
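A sketch of the deterministic variants, assuming training kept each intermediate vk together with its survival count mk (the data layout here is my own):

```python
import numpy as np

def voted_predict(weight_history, x):
    """Voted perceptron: each v_k casts m_k votes on the sign of its score.
    weight_history is a list of (v_k, m_k) pairs collected during training."""
    total = sum(m_k * np.sign(v_k.dot(x)) for v_k, m_k in weight_history)
    return 1 if total >= 0 else -1

def averaged_predict(weight_history, x):
    """Averaged perceptron: predict with the m_k-weighted average of the v_k's."""
    v_avg = sum(m_k * v_k for v_k, m_k in weight_history)
    return 1 if v_avg.dot(x) >= 0 else -1
```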

The kernel trick. You can think of a perceptron as a weighted nearest-neighbor classifier, where K(v,x) = dot product of v and x (a similarity function).

The kernel trick, continued. Here’s another similarity function: K’(v,x) = dot product of H’(v) and H’(x), where H’ is some feature mapping [shown on the slide]. And here’s yet another similarity function, K(v,x) [formula shown on the slide].

The kernel trick, continued. Claim: K(v,x) = dot product of H(v) and H(x), for this H: [mapping shown on the slide].
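A minimal sketch of the dual ("kernelized") perceptron this points at, with a polynomial kernel standing in for the K on the slides (my choice of kernel, not necessarily the one shown):

```python
import numpy as np

def poly_kernel(a, b, degree=2):
    """A stand-in similarity function K(a, b); the slides' K may differ."""
    return (1.0 + np.dot(a, b)) ** degree

def kernel_perceptron_train(examples, kernel, n_epochs=10):
    """Dual perceptron: store the mistakes; the weight vector is never built explicitly."""
    mistakes = []                                  # list of (x_j, y_j) pairs
    for _ in range(n_epochs):
        for x, y in examples:
            score = sum(y_j * kernel(x_j, x) for x_j, y_j in mistakes)
            y_hat = 1 if score >= 0 else -1
            if y_hat != y:
                mistakes.append((x, y))            # v = v + y*x, kept in dual form
    return mistakes

def kernel_perceptron_predict(mistakes, kernel, x):
    score = sum(y_j * kernel(x_j, x) for x_j, y_j in mistakes)
    return 1 if score >= 0 else -1
```

This is the weighted nearest-neighbor view from the previous slide: the score of x is a weighted sum of its similarities to the stored mistakes.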

The voted perceptron for ranking. A sends B a set of instances x1, x2, x3, x4, …. B computes the scores ŷi = vk . xi and returns b*, the index of the “best” (highest-scoring) xi. A sends back b, the index of the correct answer. If b* ≠ b (a mistake), B updates vk+1 = vk + xb - xb*.
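A minimal sketch of this ranking update, assuming each training example is a list of candidate vectors plus the index of the correct one (the data layout is my own):

```python
import numpy as np

def ranking_perceptron_train(examples, n_features, n_epochs=10):
    """examples: list of (candidates, b), where candidates is a list of vectors
    and b is the index of the correct candidate."""
    v = np.zeros(n_features)
    for _ in range(n_epochs):
        for candidates, b in examples:
            scores = [v.dot(x) for x in candidates]
            b_star = int(np.argmax(scores))        # index of the "best" candidate
            if b_star != b:                        # mistake: v_{k+1} = v_k + x_b - x_{b*}
                v = v + candidates[b] - candidates[b_star]
    return v
```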

Ranking some x’s with the target vector u. (Figure: the x’s projected onto the direction u; γ marks the gap between the top-ranked x and the rest.)

Ranking some x’s with some guess vector v – part 1. (Figure: the same x’s, now scored by their projection onto v.)

Ranking some x’s with some guess vector v – part 2. The purple-circled x is xb*, the one the learner has chosen to rank highest. The green-circled x is xb, the right answer.

Correcting v by adding xb - xb*. (Figure: the x’s, the guess v, and the targets u and -u.)

Correcting v by adding xb - xb* (part 2). (Figure: the updated vector vk+1.)

(3a) The guess v2 after the two positive examples: v2 = v1 + x2. (Figure, repeated over three animation frames: the update again gains more than γ in the direction of u, inside the 2γ band.)

Notice this doesn’t depend at all on the number of x’s being ranked. Neither proof depends on the dimension of the x’s.
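For reference, this is the margin argument the figures sketch; I am assuming, as is standard, that u is a unit vector, that γ is the margin, and that R bounds the norm of the update vectors (the xi’s, or xb - xb* in the ranking case):

```latex
\begin{align*}
\text{After } k \text{ mistakes:}\quad
  & v_k \cdot u \ge k\gamma     && \text{(each update gains at least $\gamma$ along $u$)}\\
  & \|v_k\|^2 \le k R^2         && \text{(each update adds at most $R^2$ to the squared norm)}\\
\text{so}\quad
  & k\gamma \le v_k \cdot u \le \|v_k\| \le \sqrt{k}\,R
    \quad\Longrightarrow\quad k \le \frac{R^2}{\gamma^2}.
\end{align*}
```

Neither inequality mentions the number of candidates being ranked or the dimension of the x’s, which is the point of the slide above.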

The voted perceptron for ranking, again. A sends B instances x1, x2, x3, x4, …; B computes ŷi = vk . xi and returns b*, the index of the “best” xi; A sends back the correct index b; on a mistake, vk+1 = vk + xb - xb*. Change number one: replace x with z.

The voted perceptron for NER. The protocol is the same, but over instances z1, z2, z3, z4, …: B computes ŷi = vk . zi, returns the index b* of the “best” zi, and on a mistake sets vk+1 = vk + zb - zb*. Concretely, A sends B the Sha & Pereira paper and instructions for creating the instances: A sends a sequence of words xi. B could create the instances F(xi,y) for every candidate label sequence y, but instead B just returns the y* that gives the best score for the dot product vk . F(xi,y*), found using Viterbi. A then sends B the correct label sequence yi. On errors, B sets vk+1 = vk + zb - zb* = vk + F(xi,yi) - F(xi,y*).

The voted perceptron for NER, continued. A sends a sequence of words xi; B just returns the y* that gives the best score for vk . F(xi,y*); A sends B the correct label sequence yi; on errors, B sets vk+1 = vk + zb - zb* = vk + F(xi,yi) - F(xi,y*). So, this algorithm can also be viewed as an approximation to the CRF learning algorithm, where we’re using a Viterbi approximation to the expectations and stochastic gradient descent to optimize the likelihood.
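A minimal sketch of this structured update, assuming a feature function F(x, y) and a viterbi_decode helper that returns argmax_y vk . F(x, y); both are hypothetical placeholders here, since their details depend on the feature set:

```python
import numpy as np

def structured_perceptron_train(examples, feature_fn, viterbi_decode,
                                n_features, n_epochs=10):
    """examples: list of (x, y) with x a word sequence and y its label sequence.
    feature_fn(x, y) returns the vector F(x, y);
    viterbi_decode(v, x) returns the label sequence maximizing v . F(x, y)."""
    v = np.zeros(n_features)
    for _ in range(n_epochs):
        for x, y in examples:
            y_star = viterbi_decode(v, x)          # B's best-scoring label sequence
            if y_star != y:                        # mistake
                v = v + feature_fn(x, y) - feature_fn(x, y_star)
    return v
```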

EMNLP 2002, Best paper: Collins, “Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms.”

Some background… Collins’ parser: generative model…
New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron, Collins and Duffy, ACL 2002.
Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron, Collins, ACL 2002:
- Propose entities using a MaxEnt tagger (as in MXPOST).
- Use beam search to get multiple taggings for each document (20).
- Learn to rerank the candidates to push correct ones to the top, using some new candidate-specific features (a sketch of this kind of feature follows the list):
  - Value of the “whole entity” (e.g., “Professor_Cohen”)
  - Capitalization features for the whole entity (e.g., “Xx+_Xx+”)
  - Last word in the entity, and capitalization features of the last word
  - Bigrams/trigrams of words and capitalization features before and after the entity
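A small illustration of the kind of candidate-specific features listed above, e.g. the whole-entity value and its capitalization pattern; the function name and the exact feature-string format are my own, not Collins’:

```python
import re

def entity_features(tokens):
    """Candidate-specific features for a proposed entity, e.g. ["Professor", "Cohen"]."""
    def cap_pattern(word):
        # Coarse capitalization shape, e.g. "Cohen" -> "Xx+"
        shape = re.sub(r"[A-Z]", "X", word)
        shape = re.sub(r"[a-z]", "x", shape)
        return re.sub(r"(.)\1+", r"\1+", shape)
    feats = {
        "whole_entity=" + "_".join(tokens): 1.0,                          # e.g. "Professor_Cohen"
        "entity_caps=" + "_".join(cap_pattern(t) for t in tokens): 1.0,   # e.g. "Xx+_Xx+"
        "last_word=" + tokens[-1]: 1.0,
        "last_word_caps=" + cap_pattern(tokens[-1]): 1.0,
    }
    return feats
```

For ["Professor", "Cohen"] this yields features such as whole_entity=Professor_Cohen and entity_caps=Xx+_Xx+, matching the examples on the slide.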

Some background…

And back to the paper… (EMNLP 2002, Best paper).

Collins’ Experiments
- POS tagging (with MXPOST features)
- NP chunking (words and POS tags from Brill’s tagger as features), with BIO output tags
Compared MaxEnt tagging / MEMMs (trained with iterative scaling) against “voted-perceptron-trained HMMs”:
- with and without averaging
- with and without feature selection (count > 5)

Collins’ results

Announcements/Discussion
- Deadlines?
- Wiki:
  - Proposed additional types: Software Systems (OpenNLP?), Metrics?
  - How-to’s: Method, Dataset, etc. pages
  - Special pages as a way to check up on yourself