STRUCTURED PERCEPTRON Alice Lai and Shi Zhi
Presentation Outline Introduction to Structured Perceptron ILP-CRF Model Averaged Perceptron Latent Variable Perceptron
Motivation An algorithm to learn weights for structured prediction. An alternative to MEMM and CRF approaches to POS tagging (Collins 2002). Convergence guarantees under certain conditions, even for inseparable data. Generalizes to new examples and to other sequence labeling problems.
POS Tagging Example (Diagram: a tag lattice over the sentence "the man saw the dog", where each word may take any of the tags D, N, V, A.)
MEMM Approach Conditional model: the probability of the current state given the previous state and the current observation. For the tagging problem, define local features for each tag in context; features are often indicator functions. Learn the parameter vector α with Generalized Iterative Scaling or gradient descent.
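As a sketch of the local model (notation assumed rather than taken from the slides), the probability of the current tag given the previous tag and the current observation can be written as
\[
P(t_i \mid t_{i-1}, x) = \frac{\exp\big(\alpha \cdot \phi(t_{i-1}, t_i, x, i)\big)}{\sum_{t'} \exp\big(\alpha \cdot \phi(t_{i-1}, t', x, i)\big)},
\]
where each component of \(\phi\) is typically an indicator, e.g. 1 if the current word is "saw" and \(t_i = \text{V}\), else 0.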
Global Features Local features are defined only for a single label. Global features are defined over an observed sequence and a possible label sequence. Simple version: global features are local features summed over an observation-label sequence pair. Compared to the original perceptron algorithm, we predict a vector of labels instead of a single label. Which of the possible incorrect label vectors do we use as the negative example in training?
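In that simple version (same assumed notation), the global feature vector and the prediction rule are
\[
\Phi(x, y) = \sum_{i=1}^{n} \phi(y_{i-1}, y_i, x, i), \qquad
\hat{y} = \arg\max_{y} \; \alpha \cdot \Phi(x, y),
\]
and the negative example used in training is exactly this highest-scoring incorrect sequence \(\hat{y}\), found by Viterbi decoding under the current weights (Collins 2002).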
Structured Perceptron Algorithm
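A minimal sketch of the training loop in the spirit of Collins (2002), assuming a viterbi_decode(x, weights, tagset) helper (not defined here) that returns the highest-scoring tag sequence under the current weights; the feature templates are illustrative only:

from collections import defaultdict

def global_features(x, y):
    # Sum simple local indicator features over the observation-label pair.
    feats = defaultdict(float)
    prev = "<s>"
    for word, tag in zip(x, y):
        feats[("emit", word, tag)] += 1.0
        feats[("trans", prev, tag)] += 1.0
        prev = tag
    return feats

def train_structured_perceptron(data, tagset, epochs=5):
    # data: list of (word sequence, gold tag sequence) pairs
    weights = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = viterbi_decode(x, weights, tagset)  # assumed decoder
            if y_pred != y_gold:
                # promote the gold sequence's features, demote the predicted ones
                for f, v in global_features(x, y_gold).items():
                    weights[f] += v
                for f, v in global_features(x, y_pred).items():
                    weights[f] -= v
    return weights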
Properties
Global vs. Local Learning Global learning (IBT, Inference Based Training): constraints are used during training. Local learning (L+I, Learning plus Inference): classifiers are trained without constraints, and constraints are applied later to produce the global output. Example: the ILP-CRF model [Roth and Yih 2005]
Perceptron IBT
Perceptron L+I
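A rough contrast of the two regimes (a sketch with assumed helper names: perceptron_update is the promote/demote step from the earlier sketch, viterbi_decode is an unconstrained decoder, and constrained_decode is a decoder that enforces the constraints, e.g. via ILP):

def train_ibt(data, weights, constraints):
    # IBT: constraints are enforced inside the training loop,
    # so the negative example already respects them.
    for x, y_gold in data:
        y_pred = constrained_decode(x, weights, constraints)
        if y_pred != y_gold:
            perceptron_update(weights, x, y_gold, y_pred)

def train_l_plus_i(data, weights, tagset):
    # L+I: train against unconstrained predictions only.
    for x, y_gold in data:
        y_pred = viterbi_decode(x, weights, tagset)
        if y_pred != y_gold:
            perceptron_update(weights, x, y_gold, y_pred)

def predict_l_plus_i(x, weights, constraints):
    # Constraints are applied only at prediction time
    # to produce the final global output.
    return constrained_decode(x, weights, constraints)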
ILP-CRF Introduction [Roth and Yih 2005] An ILP-CRF model for Semantic Role Labeling, treated as a sequence labeling problem. Viterbi inference for CRFs can include some constraints, but it cannot handle long-range or general constraints. Viterbi is a shortest-path problem, so it can also be solved with ILP; expressing inference as an integer linear program allows expressive general constraints, including long-range constraints between distant tokens that Viterbi cannot handle. (Diagram: the label lattice from source node s, through labels A, B, C at each position, to sink node t.)
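As a hedged illustration of how general constraints enter (a simplified formulation with assumed notation that keeps only per-position variables and omits the transition terms): let \(x_{i,y} \in \{0,1\}\) indicate that token \(i\) receives label \(y\), with score \(\lambda_i(y)\). Inference becomes
\[
\max \; \sum_{i}\sum_{y} \lambda_i(y)\, x_{i,y}
\quad \text{s.t.} \quad \sum_{y} x_{i,y} = 1 \;\; \forall i, \qquad x_{i,y} \in \{0,1\},
\]
and any constraint expressible as a linear inequality can be added directly; for example, "label A occurs at most once in the sequence" becomes \(\sum_i x_{i,A} \le 1\), no matter how far apart the affected tokens are.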
ILP-CRF Models Global training (IBT): a CRF trained with max log-likelihood, and a CRF trained with the voted perceptron. Local training (L+I): perceptron, winnow, voted perceptron, and voted winnow.
ILP-CRF Results (Chart: comparison of the sequential models, the local classifiers alone, L+I, and IBT.)
ILP-CRF Conclusions Local learning models perform poorly on their own, but improve dramatically when constraints are added at evaluation time, reaching performance comparable to the IBT methods. The best models under global and local training show comparable results. L+I vs. IBT: L+I requires fewer training examples, is more efficient, and outperforms IBT in most situations (unless the local problems are difficult to solve) [Punyakanok et al., IJCAI 2005]
Variations: Voted Perceptron For each iteration t = 1,…,T and each example i = 1,…,n, training stores an intermediate parameter vector. Given each stored parameter vector, Viterbi decoding produces a label sequence for the example, so each (t, i) defines a tagging of the sequence. The voted perceptron takes the most frequently occurring output in this set.
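In symbols (a reconstruction of the voting rule just described, with assumed notation): each intermediate parameter vector \(\alpha^{(t,i)}\) decodes the input, and the most common output wins:
\[
y^{(t,i)} = \arg\max_{y} \; \alpha^{(t,i)} \cdot \Phi(x, y), \qquad
\hat{y} = \text{the most frequent output among } \{\, y^{(t,i)} : t = 1,\dots,T,\; i = 1,\dots,n \,\}.
\]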
Variations: Voted Perceptron The averaged algorithm (Collins '02) is an approximation of the voted method: it uses the averaged parameter vector instead of the final one. Performance: higher F-measure, lower error rate, and greater stability (less variance in its scores). Variation: a modified averaging method for the latent perceptron.
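The averaged parameter vector is simply the mean of all intermediate parameter vectors,
\[
\gamma = \frac{1}{nT} \sum_{t=1}^{T} \sum_{i=1}^{n} \alpha^{(t,i)},
\]
and test sequences are decoded once with \(\gamma\), which approximates the vote above at much lower cost.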
Variations: Latent Structure Perceptron Model definition: w is the parameter vector of the perceptron, and Φ(x, y, h) is the feature encoding function mapping an input, a label sequence, and a latent sequence to a feature vector. In the NER task, x is the word sequence, y is the named-entity type sequence, and h is the hidden latent variable sequence. Features: unigrams and bigrams over words, POS, and orthography (prefix, upper/lower case). Why latent variables? To capture latent dependencies (i.e., hidden sub-structure).
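With this notation, decoding searches jointly over label and latent sequences (a sketch implied by the definitions above):
\[
(\hat{y}, \hat{h}) = \arg\max_{y,\, h} \; w \cdot \Phi(x, y, h).
\]
Only \(\hat{y}\) is evaluated against the gold labels; h is never observed in the training data.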
Variations: Latent Structure Perceptron Purely latent structure perceptron (Connor's). Training uses the structured perceptron with a margin, where C is the margin and α is the learning rate; a reconstruction of the update is sketched below. Variation: the modified parameter-averaging method (Sun's) re-initializes the parameters with the averaged parameters every k iterations. Advantage: this reduces overfitting of the latent perceptron.
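One plausible form of the margin-based training step described above (a reconstruction with assumed notation, following the model definition on the previous slide; \(y^{g}\) denotes the gold label sequence):
\[
h^{g} = \arg\max_{h} \; w \cdot \Phi(x, y^{g}, h), \qquad
(y^{*}, h^{*}) = \arg\max_{y,\, h} \; \big( w \cdot \Phi(x, y, h) + C \cdot \mathbf{1}[y \neq y^{g}] \big),
\]
\[
\text{if } y^{*} \neq y^{g}: \quad w \leftarrow w + \alpha \big( \Phi(x, y^{g}, h^{g}) - \Phi(x, y^{*}, h^{*}) \big).
\]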
Variations: Latent Structure Perceptron Disadvantage of the purely latent perceptron: h* is found and then forgotten for each x. Solution: the Online Latent Classifier (Connor's), which uses two classifiers: a latent classifier with parameters u and a label classifier with parameters w.
Variations: Latent Structure Perceptron Online Latent Classifier training (Connor's)
Variations: Latent Structure Perceptron Experiments: Bio-NER with the purely latent perceptron. (Table: training time and F-measure for different feature cut-offs (cc) and dependency orders (Odr), including a high-order setting.)
Variations: Latent Structure Perceptron Experiments: Semantic Role Labeling with argument/predicate structure as the latent structure. X: "She likes yellow flowers" (sentence). Y: agent predicate patient (roles). H: exactly one predicate and at least one argument (latent structure). Optimization for (h*, y*): search over all possible argument/predicate structures; for more complex data, other methods are needed. (Table of test-set results.)
Summary Structured perceptron definition and motivation. IBT vs. L+I. Variations of the structured perceptron. References: M. Collins, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, EMNLP 2002. X. Sun, T. Matsuzaki, D. Okanohara, and J. Tsujii, Latent Variable Perceptron Algorithm for Structured Classification, IJCAI 2009. D. Roth and W. Yih, Integer Linear Programming Inference for Conditional Random Fields, ICML 2005. M. Connor, C. Fisher, and D. Roth, Online Latent Structure Training for Language Acquisition, IJCAI 2011.