The Voted Perceptron for Ranking and Structured Classification


1 The Voted Perceptron for Ranking and Structured Classification
William Cohen

2 The voted perceptron
A sends an instance xi to B. B computes the guess ŷi = sign(vk . xi) and returns it; A sends back the true label yi. If mistake: vk+1 = vk + yi xi
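The mistake-driven update on this slide can be sketched in a few lines of Python. This is a minimal illustration on made-up toy data, not the experimental setup from the talk:

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Mistake-driven perceptron: on each error, v <- v + y_i * x_i."""
    v = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if v @ xi >= 0 else -1   # B's guess: sign(v . x)
            if y_hat != yi:                    # mistake: update v
                v = v + yi * xi
    return v

# Toy linearly separable data: label is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
v = perceptron_train(X, y)
preds = [1 if v @ xi >= 0 else -1 for xi in X]
```

On separable data like this, the mistake bound guarantees the loop stops updating after finitely many errors.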

3 (1) A target u (with -u shown).
(2) The guess v1 after one positive example: v1 = +x1.

4 (3a) The guess v2 after the two positive examples: v2 = v1 + x2.
(3b) The guess v2 after the one positive and one negative example: v2 = v1 - x2.

7

8 On-line to batch learning
Pick a vk at random according to mk/m, the fraction of examples it was used for. Predict using the vk you just picked. (Actually, use some sort of deterministic approximation to this).
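The deterministic stand-in mentioned on the slide is usually averaging: instead of sampling vk with probability mk/m, predict with the weighted average Σk (mk/m) vk. A minimal sketch, with toy data of my own:

```python
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Train a perceptron while accumulating every intermediate v.
    Predicting with the running average sum_k (m_k/m) v_k is the usual
    deterministic approximation to picking v_k with probability m_k/m."""
    v = np.zeros(X.shape[1])
    v_sum = np.zeros(X.shape[1])   # sum of v over every example seen
    m = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (v @ xi) <= 0:         # mistake (or on the boundary)
                v = v + yi * xi
            v_sum += v                     # each v_k contributes m_k times
            m += 1
    return v_sum / m

X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
v_avg = averaged_perceptron(X, y)
preds = [1 if v_avg @ xi > 0 else -1 for xi in X]
```

Because every vk contributes in proportion to how long it survived, the averaged vector matches the expected prediction of the randomized scheme.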

9 The kernel trick
You can think of a perceptron as a weighted nearest-neighbor classifier, where K(v,x) = dot product of v and x (a similarity function).
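The weighted nearest-neighbor view is the dual form of the perceptron: v is a sum of mistake examples, so v . x unrolls into a K-weighted vote over training points. A sketch with a plain dot-product kernel and toy data (any kernel K could be dropped in):

```python
import numpy as np

def kernel_perceptron_train(X, y, K, epochs=10):
    """Dual perceptron: v = sum_j alpha_j y_j x_j, so the score
    v . x = sum_j alpha_j y_j K(x_j, x) -- a weighted vote of the
    training examples, with K as the similarity function."""
    alpha = np.zeros(len(X))   # number of mistakes made on each example
    for _ in range(epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            score = sum(alpha[j] * y[j] * K(X[j], xi) for j in range(len(X)))
            if yi * score <= 0:
                alpha[i] += 1
    return alpha

def dot(a, b):
    return float(np.dot(a, b))   # linear kernel: the plain dot product

X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
alpha = kernel_perceptron_train(X, y, dot)

def predict(x):
    s = sum(alpha[j] * y[j] * dot(X[j], x) for j in range(len(X)))
    return 1 if s > 0 else -1
```

Note the algorithm only ever touches the data through K, which is what makes the substitution on the next slides possible.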

10 The kernel trick
Here’s another similarity function: K’(v,x) = dot product of H’(v), H’(x), for some feature mapping H’.

11 The kernel trick
Claim: K(v,x) = dot product of H(x), H(v) for this H:
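The specific H on this slide is in the (missing) figure, but the flavor of the claim is easy to check numerically. As an illustrative example of my own choosing: the quadratic kernel K(v,x) = (v . x)² equals the dot product of H(v) and H(x), where H maps a vector to all pairwise products of its coordinates:

```python
import numpy as np

def K(v, x):
    """Quadratic kernel: K(v, x) = (v . x)^2."""
    return float(np.dot(v, x)) ** 2

def H(x):
    """Explicit feature map: all pairwise products x_i * x_j.
    Then H(v) . H(x) = sum_ij v_i v_j x_i x_j = (v . x)^2."""
    return np.array([xi * xj for xi in x for xj in x])

v = np.array([1.0, 2.0, -1.0])
x = np.array([0.5, -1.0, 3.0])
lhs = K(v, x)                      # kernel, computed in 3 dimensions
rhs = float(np.dot(H(v), H(x)))   # dot product, computed in 9 dimensions
```

Computing K directly costs O(d); computing the dot product in H-space costs O(d²), which is the whole point of the trick.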

12

13

14 The voted perceptron for ranking
A sends instances x1 x2 x3 x4… to B. B computes ŷi = vk . xi for each and returns the index b* of the “best” xi. A returns the correct index b. If mistake: vk+1 = vk + xb - xb*
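One round of this protocol, sketched in Python on a toy instance set (the update v + xb - xb* is exactly the slide's rule):

```python
import numpy as np

def rank_perceptron_step(v, xs, b):
    """One round of the ranking perceptron: B ranks the instances by
    v . x and returns the argmax index b*; if b* != b (the correct
    index), update v <- v + x_b - x_b*."""
    scores = [float(v @ x) for x in xs]
    b_star = int(np.argmax(scores))
    if b_star != b:
        v = v + xs[b] - xs[b_star]
    return v, b_star

xs = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]
v = np.zeros(2)
v, b_star = rank_perceptron_step(v, xs, b=1)  # zero v ties; mistake, so v updates
v, b_star = rank_perceptron_step(v, xs, b=1)  # now the correct x ranks highest
```

After the single correction the right instance wins the ranking, and the update never looked at how many candidates were in the list, only at xb and xb*.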

15 Ranking some x’s with the target vector u

16 Ranking some x’s with some guess vector v – part 1

17 Ranking some x’s with some guess vector v – part 2.
The purple-circled x is xb*, the one the learner has chosen to rank highest. The green-circled x is xb, the right answer.

18 Correcting v by adding xb – xb*

19 Correcting v by adding xb – xb* (part 2): vk+1

22 (3a) The guess v2 after the two positive examples: v2 = v1 + x2

23 Notice this doesn’t depend at all on the number of x’s being ranked
Neither proof depends on the dimension of the x’s.

24 The voted perceptron for ranking
A sends instances x1 x2 x3 x4… to B. B computes ŷi = vk . xi for each and returns the index b* of the “best” xi. A returns the correct index b. If mistake: vk+1 = vk + xb - xb*. Change number one: replace x with z.

25 The voted perceptron for NER
A sends B the Sha & Pereira paper and instructions for creating the instances: A sends a word vector xi. Then B could create the instances zi = F(xi,y)… but instead B just returns the y* that gives the best score for the dot product vk . F(xi,y*), found using Viterbi. A sends B the correct label sequence yi. On errors, B sets vk+1 = vk + zb - zb* = vk + F(xi,yi) - F(xi,y*).

26 The voted perceptron for NER
A sends a word vector xi. B just returns the y* that gives the best score for vk . F(xi,y*). A sends B the correct label sequence yi. On errors, B sets vk+1 = vk + zb - zb* = vk + F(xi,yi) - F(xi,y*). So this algorithm can also be viewed as an approximation to the CRF learning algorithm, where we’re using a Viterbi approximation to the expectations, and stochastic gradient descent to optimize the likelihood.
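A minimal structured-perceptron sketch of the update vk+1 = vk + F(xi,yi) - F(xi,y*). The feature map, tag set, and data below are toy inventions for illustration, and brute-force enumeration of tag sequences stands in for the Viterbi search B would really use:

```python
import numpy as np
from itertools import product

TAGS = [0, 1]   # toy tag set, e.g. O vs. ENTITY
WORDS = 2       # toy vocabulary size

def F(x, y):
    """Toy F(x, y): word-tag emission counts plus tag-tag transition
    counts, flattened into one global feature vector."""
    f = np.zeros(WORDS * 2 + 4)
    for w, t in zip(x, y):
        f[w * 2 + t] += 1                  # emission feature (w, t)
    for t0, t1 in zip(y, y[1:]):
        f[WORDS * 2 + t0 * 2 + t1] += 1    # transition feature (t0, t1)
    return f

def best_y(v, x):
    """argmax_y v . F(x, y): brute force here; Viterbi in the real thing,
    since F decomposes over adjacent tag pairs."""
    return max(product(TAGS, repeat=len(x)),
               key=lambda y: float(v @ F(x, y)))

def structured_perceptron(data, epochs=5):
    """On errors: v <- v + F(x, y) - F(x, y*)."""
    v = np.zeros(WORDS * 2 + 4)
    for _ in range(epochs):
        for x, y in data:
            y_star = list(best_y(v, x))
            if y_star != y:
                v = v + F(x, y) - F(x, y_star)
    return v

# Toy data: the correct tag happens to equal the word identity.
data = [([0, 1, 0], [0, 1, 0]), ([1, 1, 0], [1, 1, 0])]
v = structured_perceptron(data)
```

Swapping `best_y` for Viterbi changes the cost from exponential to linear in the sequence length without changing the learned updates, which is what makes the algorithm practical for NER.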

27 EMNLP 2002, Best paper

28 Some background… Collins’ parser: generative model…
…New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron, Collins and Duffy, ACL 2002. …Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron, Collins, ACL 2002. Propose entities using a MaxEnt tagger (as in MXPOST) Use beam search to get multiple taggings for each document (20) Learn to rerank the candidates to push correct ones to the top, using some new candidate-specific features: Value of the “whole entity” (e.g., “Professor_Cohen”) Capitalization features for the whole entity (e.g., “Xx+_Xx+”) Last word in entity, and capitalization features of last word Bigrams/Trigrams of words and capitalization features before and after the entity

29 Some background…

30 And back to the paper….. EMNLP 2002, Best paper

31 Collins’ experiments: POS tagging (with MXPOST features)
NP chunking (words and POS tags from Brill’s tagger as features, BIO output tags). Compared MaxEnt tagging/MEMMs (with iterative scaling) and “voted-perceptron-trained HMMs”, with and without averaging, with and without feature selection (count > 5).

32 Collins’ results

33 Announcements/Discussion
Deadlines? Wiki: proposed additional types: Software Systems (OpenNLP?), Metrics? How-tos: Method, Dataset, etc. pages. Special pages as a way to check up on yourself.

