Learning with lookahead: Can history-based models rival globally optimized models?
Yoshimasa Tsuruoka, Japan Advanced Institute of Science and Technology (JAIST)
Yusuke Miyao, National Institute of Informatics (NII)
Junichi Kazama, National Institute of Information and Communications Technology (NICT)
History-based models
– Structured prediction problems in NLP: POS tagging, named entity recognition, parsing, ...
– History-based models decompose the structured prediction problem into a series of classification problems
– Widely used in many NLP tasks
  – MEMMs (Ratnaparkhi, 1996; McCallum et al., 2000)
  – Transition-based parsers (Yamada & Matsumoto, 2003; Nivre et al., 2006)
– But becoming less popular
Part-of-speech (POS) tagging
– Perform multi-class classification at each word
– Features are defined on observations (i.e., words) and the POS tags to the left
Example: I saw a dog with eyebrows
[figure: candidate tags N/V/D/P at each of the six words]
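As a concrete sketch, a history-based tagger reduces the sequence problem to one multi-class decision per word, conditioning on the tag just chosen to its left. The feature templates, tag set, and weights below are illustrative assumptions, not the authors' implementation:

```python
# Greedy history-based POS tagging: one multi-class decision per word,
# conditioned on the tag chosen for the previous word.
# Feature templates and weights are illustrative placeholders.

TAGS = ["N", "V", "D", "P"]

def features(words, i, prev_tag):
    """Features on the observation (words) and the tag chosen to the left."""
    return [f"word={words[i]}", f"prev_tag={prev_tag}"]

def score(weights, feats, tag):
    """Linear score of a candidate tag given the extracted features."""
    return sum(weights.get((f, tag), 0.0) for f in feats)

def greedy_tag(words, weights):
    """Tag left to right, committing to the best tag at each word."""
    tags = []
    prev = "<s>"  # sentence-start pseudo-tag
    for i in range(len(words)):
        prev = max(TAGS, key=lambda t: score(weights, features(words, i, prev), t))
        tags.append(prev)
    return tags
```

Each decision is made greedily, so an early mistake (e.g., tagging "saw" as a noun) cannot be revised later; that is the weakness the lookahead of this talk addresses.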
Dependency parsing
Shift-reduce parsing of "I saw a dog with eyebrows" with three operations: Shift, ReduceL, ReduceR.

OPERATION  STACK                  QUEUE
(start)                           I saw a dog with eyebrows
Shift      I                      saw a dog with eyebrows
Shift      I saw                  a dog with eyebrows
ReduceL    saw                    a dog with eyebrows
Shift      saw a                  dog with eyebrows
Shift      saw a dog              with eyebrows
ReduceL    saw dog                with eyebrows
Shift      saw dog with           eyebrows
Shift      saw dog with eyebrows
ReduceR    saw dog with
ReduceR    saw dog
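The derivation above can be replayed with a minimal transition system. This implementation of the slides' Shift/ReduceL/ReduceR operations is an assumed sketch, not the authors' parser; one final ReduceR is appended to attach "dog" to "saw" and complete the parse:

```python
# Sketch of the shift-reduce transition system from the slides (Shift,
# ReduceL, ReduceR), replayed with a hand-written action sequence.

def parse(words, actions):
    """Apply transitions; return the head index of each word (-1 = root)."""
    heads = [-1] * len(words)
    stack, queue = [], list(range(len(words)))
    for act in actions:
        if act == "Shift":        # move the next queue word onto the stack
            stack.append(queue.pop(0))
        elif act == "ReduceL":    # second-from-top becomes a left dependent of the top
            dep = stack.pop(-2)
            heads[dep] = stack[-1]
        elif act == "ReduceR":    # top becomes a right dependent of second-from-top
            dep = stack.pop()
            heads[dep] = stack[-1]
    return heads

words = "I saw a dog with eyebrows".split()
# The sequence from the slides, plus a final ReduceR attaching "dog" to "saw".
actions = ["Shift", "Shift", "ReduceL", "Shift", "Shift", "ReduceL",
           "Shift", "Shift", "ReduceR", "ReduceR", "ReduceR"]
```

Running `parse(words, actions)` yields heads `[1, -1, 3, 1, 3, 4]`: "saw" is the root, "I" and "dog" attach to it, "with" to "dog", and "eyebrows" to "with".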
Lookahead
Analogy: playing chess. "If I move this pawn, then the knight will be captured by that bishop, but then I can ..."
POS tagging with lookahead
Consider all possible sequences of future tagging actions to a certain depth.
Example: I saw a dog with eyebrows
[figure: the decision at "a" is expanded over its candidate tags (N/V/D) together with the candidate tags (N/V/D/P) of the following two words]
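The idea can be sketched generically: score each immediate action by the best-scoring sequence of future actions reachable within the search depth. The state, action, and scoring callbacks below are assumed placeholders; in tagging they would enumerate candidate tags and score them with the linear model:

```python
# Sketch of action selection with depth-limited lookahead.

def lookahead_score(state, depth, actions_fn, apply_fn, score_fn):
    """Best total score over action sequences of length <= depth from state."""
    acts = actions_fn(state)
    if depth == 0 or not acts:
        return 0.0
    return max(score_fn(state, a)
               + lookahead_score(apply_fn(state, a), depth - 1,
                                 actions_fn, apply_fn, score_fn)
               for a in acts)

def choose_action(state, depth, actions_fn, apply_fn, score_fn):
    """Pick the immediate action whose lookahead-expanded score is highest."""
    return max(actions_fn(state),
               key=lambda a: score_fn(state, a)
               + lookahead_score(apply_fn(state, a), depth - 1,
                                 actions_fn, apply_fn, score_fn))

# Toy example: action "a" looks best immediately, but "b" opens up a
# high-scoring continuation that only deeper search can see.
actions_fn = lambda s: {"s": ["a", "b"], "t": ["c"]}.get(s, [])
apply_fn = lambda s, a: {"a": "end", "b": "t", "c": "end"}[a]
score_fn = lambda s, a: {"a": 2.0, "b": 1.0, "c": 5.0}[a]
```

With depth 1 the greedy choice `"a"` wins; with depth 2 the search sees the continuation under `"b"` and prefers it, which is exactly the chess-style lookahead of the previous slide.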
Dependency parsing with lookahead
From a given state, expand each candidate operation (Shift, ReduceL, ReduceR) and evaluate the resulting states before committing.
[figure: from stack "saw dog" and queue "with eyebrows", Shift leads to stack "saw dog with" / queue "eyebrows", while ReduceR leads to stack "saw" / queue "with eyebrows"]
Choosing the best action by search
[figure: actions a_1 ... a_m lead from the current state to states S_1 ... S_m; each S_i is expanded to the search depth and scored by its best descendant S_i*]
Search
Decoding cost
– Time complexity: O(n m^(D+1))
  – n: number of actions to complete the structure
  – m: average number of possible actions at each state
  – D: search depth
– Time complexity of k-th order CRFs: O(n m^(k+1))
– History-based models with depth-k lookahead are comparable to k-th order CRFs in terms of training/testing time
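A quick arithmetic check of the claim (illustrative counting only, assuming a constant cost per scored state): expanding every action at each of the n decision points, with m^D continuations below each, scores n * m^(D+1) states, which matches the Viterbi cost of a k-th order CRF whenever D = k:

```python
# Counting the states scored by D-depth lookahead vs. Viterbi for a
# k-th order CRF, to show the two match in order when D = k.

def lookahead_cost(n, m, D):
    """States scored by depth-D lookahead over n decisions: n * m^(D+1)."""
    return n * m ** (D + 1)

def crf_viterbi_cost(n, m, k):
    """Transitions scored by Viterbi for a k-th order CRF: n * m^(k+1)."""
    return n * m ** (k + 1)
```

For example, a 10-word sentence with 4 tags costs 10 * 4^2 = 160 scored states at depth 1, the same as a first-order CRF; each extra level of depth multiplies the cost by m, just as each extra Markov order does.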
Perceptron learning with lookahead
[figure: without lookahead, actions a_1 ... a_m are scored directly at states S_1 ... S_m; with lookahead, each S_i is first expanded by search to its best descendant S_i*]
– Linear scoring model
– The update is driven by the correct action
– Guaranteed to converge
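A sketch of the update rule: when the lookahead search prefers a wrong action, promote the features collected along the best sequence under the correct action and demote those along the predicted one. The sparse-vector representation below is an assumption; convergence follows from the standard perceptron argument for linearly separable data:

```python
# Additive perceptron update over sparse feature counts. In learning with
# lookahead, feats_gold would come from the best search sequence under the
# correct action and feats_predicted from the one under the model's choice.

def perceptron_update(weights, feats_gold, feats_predicted, lr=1.0):
    """Promote gold-sequence features, demote predicted-sequence features."""
    for f, c in feats_gold.items():
        weights[f] = weights.get(f, 0.0) + lr * c
    for f, c in feats_predicted.items():
        weights[f] = weights.get(f, 0.0) - lr * c
    return weights
```

Features shared by both sequences cancel, so the update only touches the features on which the two search outcomes actually differ.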
Experiments
– Sequence prediction tasks: POS tagging, text chunking (a.k.a. shallow parsing), named entity recognition
– Syntactic parsing: dependency parsing
– Compared to first-order CRFs in terms of speed and accuracy
POS tagging
[chart: accuracy on the WSJ corpus]
Training time
[chart: training time in seconds on the WSJ corpus]
POS tagging (+ tag trigram features)
[chart: accuracy on the WSJ corpus]
Chunking (shallow parsing)
[chart: F-score on the CoNLL 2000 data set]
Named entity recognition
[chart: F-score on the BioNLP/NLPBA 2004 data set]
Dependency parsing
[chart: F-score on the WSJ corpus, compared with Zhang and Clark (2008)]
Related work
– MEMMs + Viterbi: suffer from the label bias problem (Lafferty et al., 2001)
– Learning as search optimization (LaSO) (Daume III and Marcu, 2005): no lookahead
– Structured perceptron with beam search (Zhang and Clark, 2008)
Conclusion
Can history-based models rival globally optimized models?
– Yes: they can be more accurate than CRFs
– The computational cost is the same as that of CRFs
Future work
– Feature engineering
– Flexible search extension/reduction
– Easy-first tagging/parsing (Goldberg & Elhadad, 2010)
– Max-margin learning
THANK YOU