Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. Kristina Toutanova, Dan Klein, Christopher Manning, Yoram Singer. Stanford University.

Presentation transcript:

Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network
Kristina Toutanova, Dan Klein, Christopher Manning, Yoram Singer
Stanford University / The Hebrew University of Jerusalem

Highlights
- Just using P(t|w) works even better than you thought, given a better unknown word model
- You can tag really well with no sequence model at all
- Conditioning on BOTH left AND right tags yields the best published tagging performance
- If you are using a maxent model: use proper smoothing, consider more lexicalization, and use conjunctions of features

Sequential Classifiers
- Learn classifiers for local decisions: predict the tag of a word based on features we like (neighboring words, tags, etc.)
- Combine the decisions of the classifiers using their output probabilities or scores and choose the best global tag sequence
[Diagram: t_0 predicted from w_-1, w_0, w_+1, t_-1, t_+1]
- When the dependencies are not cyclic and the classifier is probabilistic, this corresponds to a Bayesian network (CMM)
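A minimal sketch of this recipe in its simplest, greedy left-to-right form follows. The local_prob interface and the greedy combination are illustrative assumptions, not the models in these slides, which search for the best global sequence instead of committing one tag at a time:

def tag_greedy_left_to_right(words, tagset, local_prob):
    """Greedy sequential classification: at each position, pick the tag
    with the highest local probability given the words and the tags
    predicted so far.

    local_prob(words, i, prev_tags, tag) -> P(tag | context); the name
    and signature are placeholders for whatever local model is used."""
    tags = []
    for i in range(len(words)):
        best = max(tagset, key=lambda t: local_prob(words, i, tags, t))
        tags.append(best)
    return tags

Replacing the greedy choice with a search over whole sequences (for example, the Viterbi-style dynamic program sketched later for dependency networks) is what "choose the best global tag sequence" means here.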

Experiments for Part-of-Speech Tagging
- Data: WSJ sections 0-18 for training, 19-21 for development, 22-24 for test
- Log-linear models for local distributions
- All features are binary and formed by instantiating templates, e.g. f_1(h, t) = 1 iff w_0 = "to" and t = TO (0 otherwise)
- Separate feature templates targeted at unknown words: prefixes, suffixes, etc.
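For instance, the template behind f_1 conjoins the current word with the proposed tag. A sketch of how such binary features can be instantiated as strings (the template inventory below is illustrative, not the paper's full set):

def instantiate_features(words, i, tag, prev_tags):
    """Instantiate binary feature templates for the history (words, i, prev_tags)
    and a proposed tag. Each returned string is a feature whose value is 1;
    every other feature is implicitly 0."""
    feats = [
        f"w0={words[i]}&t={tag}",                                    # e.g. w0=to & t=TO
        f"t-1={prev_tags[-1] if prev_tags else '<s>'}&t={tag}",
        f"w-1={words[i-1] if i > 0 else '<s>'}&t={tag}",
        f"w+1={words[i+1] if i + 1 < len(words) else '</s>'}&t={tag}",
    ]
    return feats

# instantiate_features(["will", "to", "fight"], 1, "TO", ["MD"])
# -> ["w0=to&t=TO", "t-1=MD&t=TO", "w-1=will&t=TO", "w+1=fight&t=TO"]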

Tagging Without Sequence Information
[Diagrams: Baseline model predicts t_0 from w_0 alone; Three Words model predicts t_0 from w_-1, w_0, w_+1]

Model      Features   Token   Unknown   Sentence
Baseline   56,…       …%      82.61%    26.74%
3Words     239,…      …%      86.78%    48.27%

- Using words only works significantly better than using the previous two or three tags!

CMM Tagging Models - I
Independence assumptions of the left-to-right CMM:
- t_i is independent of t_1 … t_{i-2} and w_1 … w_{i-1} given t_{i-1}
- t_i is independent of all following observations
Similar assumptions in the right-to-left CMM:
- t_i is independent of all preceding observations
[Diagrams: left-to-right and right-to-left CMMs over t_1, t_2, t_3 and w_1, w_2, w_3]
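For reference, these assumptions correspond to the following factorizations (a sketch in the notation of these slides; conditioning of each local model on additional neighboring words is omitted):

\[
P_{\mathrm{L2R}}(t_1,\dots,t_n \mid w_1,\dots,w_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, w_i),
\qquad
P_{\mathrm{R2L}}(t_1,\dots,t_n \mid w_1,\dots,w_n) = \prod_{i=1}^{n} P(t_i \mid t_{i+1}, w_i).
\]

Each left-to-right factor sees only the previous tag and the current word, which is what the next two slides exploit to construct the failure cases.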

CMM Tagging Models - II
- The bad independence assumptions lead to label bias (Bottou 91, Lafferty 01) and observation bias (Klein & Manning 02)
- Example: "will to fight", with possible tags will {MD, NN}, to {TO}, fight {NN, VB, VBP}
- "will" will be mis-tagged as MD, because MD is its most common tagging:
  P(t_1 = MD, t_2 = TO | will, to) = P(MD | will, sos) * P(TO | to, MD) = P(MD | will, sos) * 1
[Diagram: left-to-right CMM over "will to fight", with t_2 = TO]

CMM Tagging Models - III
- Same example: will {MD, NN}, to {TO}, fight {NN, VB, VBP}
- In the right-to-left CMM, "fight" will most likely be mis-tagged as NN:
  P(t_2 = TO, t_3 = NN | to, fight) = P(NN | fight, X) * P(TO | to, NN) = P(NN | fight, X) * 1
[Diagram: right-to-left CMM over "will to fight", with t_2 = TO]
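To make the left-to-right failure concrete, here is a toy calculation with hypothetical lexical probabilities (chosen only for illustration; they are not from the slides): because "to" is unambiguous, the factor for t_2 is 1 no matter what t_1 is, so nothing after "will" can ever overturn its majority tag.

# Toy illustration of the bias in a left-to-right CMM.
# The probabilities below are hypothetical, for illustration only.

p_t1_given_will = {"MD": 0.8, "NN": 0.2}   # local model for t_1 given "will"

def p_t2_given_to(t1):
    # "to" can only be TO, so P(TO | to, t_1) = 1 for every t_1
    return {"TO": 1.0}

def score(t1, t2):
    # Left-to-right CMM score of (t_1, t_2)
    return p_t1_given_will[t1] * p_t2_given_to(t1)[t2]

for t1 in ("MD", "NN"):
    print(t1, "TO", score(t1, "TO"))
# Both sequences score exactly P(t_1 | will): the unambiguous word "to"
# contributes a factor of 1 either way, so the (incorrect) majority tag
# MD always wins, regardless of what follows.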

Dependency Networks
- Conditioning on both left and right tags fixes the problem
[Diagram: "will to fight" with t_1 and t_3 each conditioned on their word and on the neighboring tag t_2 = TO]

Dependency Networks
- We do not attempt to construct a joint distribution; we classify to the highest-scoring sequence
- An efficient dynamic programming algorithm similar to Viterbi exists for finding the most likely sequence
[Diagram: two-node dependency network over t_1, t_2 with words w_1, w_2]

Inference for Linear Dependency Networks
[Diagram: a linear dependency network over positions i-1, i, i+1, i+2; each tag t_i is connected to its word w_i and to both neighboring tags]
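The "dynamic programming algorithm similar to Viterbi" can be organized over pairs of adjacent tags, because each local factor P(t_i | t_{i-1}, t_{i+1}, w) only looks at a window of three tags. Below is a minimal sketch; local_log_prob is a placeholder for the trained local maxent model, and the boundary symbols are illustrative assumptions:

def decode(words, tagset, local_log_prob, START="<s>", STOP="</s>"):
    """Find the tag sequence maximizing
        sum_i local_log_prob(i, t_{i-1}, t_i, t_{i+1}, words)
    i.e. the product of the bidirectional local conditionals.
    Positions i are 1-based; states are pairs (t_i, t_{i+1});
    complexity is O(n * |tagset|^3), as for a trigram tagger.
    Assumes len(words) >= 1."""
    n = len(words)
    # best[(t_i, t_{i+1})] = best log-score of a prefix through position i
    best = {(START, t1): 0.0 for t1 in tagset}
    back = {}
    for i in range(1, n + 1):
        next_tags = tagset if i < n else [STOP]
        new_best, new_back = {}, {}
        for (t_prev, t_cur), logp in best.items():
            for t_next in next_tags:
                s = logp + local_log_prob(i, t_prev, t_cur, t_next, words)
                if (t_cur, t_next) not in new_best or s > new_best[(t_cur, t_next)]:
                    new_best[(t_cur, t_next)] = s
                    new_back[(t_cur, t_next)] = (t_prev, t_cur)
        best, back[i] = new_best, new_back
    # Recover the best sequence by walking back-pointers from the state (t_n, STOP)
    state = max(best, key=best.get)
    tags = []
    for i in range(n, 0, -1):
        tags.append(state[0])
        state = back[i][state]
    return list(reversed(tags))

The pair-state trick is the same one used for trigram HMM taggers, which is why the slide can claim the same time/space complexity as a CMM with two tags of history.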

Using Tags: Left Context is Better
[Diagrams: Baseline (t_0 from w_0), Model L (t_0 from t_-1 and w_0), Model R (t_0 from t_+1 and w_0)]

Model      Features   Token   Unknown   Sentence
Baseline   56,…       …%      82.61%    26.74%
L          27,…       …%      85.49%    41.89%
R          27,…       …%      85.65%    36.31%

- Model L gives a 13.4% error reduction relative to Model R

Centered Context is Better
[Diagrams: Model L+L2 (t_0 from t_-1, t_-2, w_0), Model R+R2 (t_0 from t_+1, t_+2, w_0), Model L+R (t_0 from t_-1, t_+1, w_0)]

Model   Features   Token   Unknown   Sentence
L+L2    32,…       …%      85.92%    44.04%
R+R2    33,…       …%      84.49%    37.20%
L+R     32,…       …%      87.15%    49.50%

- Model L+R gives a 13.2% error reduction relative to Model L+L2

Centered Context is Better in the End
[Diagrams: Model L+LL+LLL (t_0 from t_-1, t_-2, t_-3, w_0), Model L+LL+LR+R+RR (t_0 from t_-2, t_-1, t_+1, t_+2, w_0)]

Model           Features   Token   Unknown   Sentence
L+LL+LLL        118,…      …%      86.52%    45.14%
L+LL+LR+R+RR    81,…       …%      87.91%    53.23%

- 15% error reduction due to including tags of words to the right

Lexicalization and More Unknown Word Features
[Diagram: Model L+LL+LR+R+RR+3W; t_0 conditioned on t_-2, t_-1, t_+1, t_+2 and w_-1, w_0, w_+1]

Model                                   Features   Token    Unknown   Sentence
L+LL+LR+R+RR (TAGS)                     81,…       …%       87.91%    53.23%
TAGS+3W                                 263,…      …%       88.05%    53.83%
TAGS+3W+LW0+RW0+W-1W0+W0W1 (BEST)       460,…      …%       88.61%    55.83%
BEST (test set)                         460,…      97.24%   89.04%    56.34%

Final Test Results

Model            Features   Token    Unknown   Sentence
BEST (test set)  460,…      97.24%   89.04%    56.34%

- Comparison to best published result (Collins 2002): 4.4% error reduction
- Statistically significant

Unknown Word Features
- Because we use a conditional model, it is easy to define complex features of the words
- A crude company name detector: the feature is on if the word is capitalized and followed by a company-name suffix like Co. or Inc. within 3 words
- Conjunctions of character-level features: capitalized, contains digit, contains dash, all capitalized, etc. (e.g. CFC-12, F/A-18)
- Prefixes and suffixes up to length 10
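A sketch of how such character-level templates might be implemented (illustrative only; the crude company-name detector is omitted because it needs sentence context, and the exact inventory below is not the paper's):

def unknown_word_features(word, max_affix_len=10):
    """Character-level feature strings for rare/unknown words."""
    feats = []
    # prefixes and suffixes up to length 10
    for k in range(1, min(len(word), max_affix_len) + 1):
        feats.append("prefix=" + word[:k])
        feats.append("suffix=" + word[-k:])
    # character-class features
    has_digit = any(c.isdigit() for c in word)
    has_dash = "-" in word
    if word[:1].isupper():
        feats.append("capitalized")
    if word.isupper():
        feats.append("all-caps")
    if has_digit:
        feats.append("contains-digit")
    if has_dash:
        feats.append("contains-dash")
    # a conjunction of character-level features (e.g. CFC-12, F/A-18)
    if word[:1].isupper() and has_digit and has_dash:
        feats.append("caps+digit+dash")
    return feats

# unknown_word_features("CFC-12") includes "capitalized", "all-caps",
# "contains-digit", "contains-dash", "caps+digit+dash", "suffix=12", ...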

Regularization Helps a Lot
- Higher accuracy, faster convergence, and more features can be added before overfitting

Regularization Helps a Lot
[Figures: accuracy with and without Gaussian smoothing; effect of reducing feature support cutoffs in smoothed and un-smoothed models]
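Gaussian smoothing here means a Gaussian prior on the weights of each local log-linear model, i.e. an L2 penalty added to the negative log-likelihood. A minimal sketch of the penalized objective (the data layout and sigma are illustrative assumptions, not the paper's settings):

import numpy as np

def penalized_neg_log_likelihood(w, X, y, sigma):
    """Negative conditional log-likelihood of a multiclass maxent model
    plus a Gaussian-prior (L2) penalty with variance sigma^2.
    X: (n_examples, n_features) binary feature matrix
    y: (n_examples,) gold class indices
    w: (n_classes, n_features) weight matrix"""
    scores = X @ w.T                                   # (n, n_classes)
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    log_z = np.log(np.exp(scores).sum(axis=1))
    log_probs = scores[np.arange(len(y)), y] - log_z
    return -log_probs.sum() + (w ** 2).sum() / (2 * sigma ** 2)

Minimizing this objective with a gradient-based optimizer keeps the weights of rare features small, which is why, as the slide notes, support cutoffs can be lowered and more templates added before overfitting sets in.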

Semantics of Dependency Networks
- Let X = (X_1, …, X_n). A dependency network for X is a pair (G, P) where G is a cyclic dependency graph and P is a set of probability distributions.
- Each node in G corresponds to a variable X_i, and the parents of X_i are all nodes Pa(X_i) such that P(X_i | X_1, …, X_{i-1}, X_{i+1}, …, X_n) = P(X_i | Pa(X_i))
- The distributions in P are the local probability distributions p(X_i | Pa(X_i))
- If there exists a joint distribution P(X) such that the conditional distributions in P are derivable from it, the dependency network is called consistent
- For positive distributions P, we can obtain the joint distribution P(X) by Gibbs sampling (Hofmann and Tresp 1997; Heckerman 2000)
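For intuition, recovering the joint from the local conditionals by Gibbs sampling looks like this (a sketch; local_conditional is an assumed interface returning P(X_i = value | the other variables), which must be strictly positive):

import random

def gibbs_sample(local_conditional, init, n_sweeps=1000, seed=0):
    """Draw an (approximate) sample from the joint distribution implied
    by a dependency network's local conditionals.

    local_conditional(i, x) -> dict mapping each value of X_i to
        P(X_i = value | the other coordinates of x)
    init: list giving an initial value for each variable."""
    rng = random.Random(seed)
    x = list(init)
    for _ in range(n_sweeps):
        for i in range(len(x)):
            dist = local_conditional(i, x)
            values, probs = zip(*dist.items())
            x[i] = rng.choices(values, weights=probs, k=1)[0]
    return x

For tagging, however, the authors do not sample at all: they simply search for the sequence maximizing the product of local conditionals, as in the inference slide above.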

Dependency Networks - Problems
- The dependency network probabilities learned from data may be inconsistent: there may not be a joint distribution having these conditionals
- Even if they define a consistent network, the scoring criterion is susceptible to mutually re-enforcing but unlikely sequences
- Suppose we observe the following pairs of values for two variables a and b [table of observations lost in the transcript]. The most likely state is 11, but Score(11) = 2/3 * 1 = 2/3 and Score(33) = 1
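The table of observations did not survive the transcript, but a hypothetical data set that reproduces the quoted scores is (1,1), (1,1), (1,2), (3,3); this is an illustrative assumption, not necessarily the slide's exact data:

# Hypothetical observations of (a, b), chosen so that the local
# conditionals reproduce the scores quoted on the slide.
data = [(1, 1), (1, 1), (1, 2), (3, 3)]

def p_a_given_b(a, b):
    rows = [pair for pair in data if pair[1] == b]
    return sum(1 for pair in rows if pair[0] == a) / len(rows)

def p_b_given_a(b, a):
    rows = [pair for pair in data if pair[0] == a]
    return sum(1 for pair in rows if pair[1] == b) / len(rows)

def score(a, b):
    # Dependency-network score: product of the two local conditionals
    return p_a_given_b(a, b) * p_b_given_a(b, a)

print(score(1, 1))   # 1 * 2/3 = 0.666...  (the most frequent state)
print(score(3, 3))   # 1 * 1   = 1.0       (seen once, yet scores highest)

The state (3,3) occurs only once, but because each of its components fully predicts the other, the product of conditionals is maximal: mutually reinforcing but unlikely.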

Conclusions
- The use of dependency networks was very helpful for tagging:
  - both left and right words and tags are used for prediction, avoiding bad independence assumptions
  - in training and test, the time/space complexity is the same as for CMMs
- Promising for other NLP sequence tasks
- More predictive features for tagging:
  - rich lexicalization further improved accuracy
  - conjunctions of feature templates
- Smoothing is critical