Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network
Kristina Toutanova, Dan Klein, Christopher Manning, Yoram Singer
Stanford University / The Hebrew University of Jerusalem
Highlights
- Just using P(t|w) works even better than you thought, if you use a better unknown word model
- You can tag really well with no sequence model at all
- Conditioning on BOTH left AND right tags yields the best published tagging performance
- If you are using a maxent model:
  - Use proper smoothing
  - Consider more lexicalization
  - Use conjunctions of features
Sequential Classifiers
- Learn classifiers for local decisions: predict the tag of a word based on features we like (neighboring words, tags, etc.)
- Combine the decisions of the classifiers using their output probabilities or scores, and choose the best global tag sequence
- When the dependencies are not cyclic and the classifiers are probabilistic, this corresponds to a Bayesian network (a conditional Markov model, CMM)
[Diagram: tag t0 predicted from words w-1, w0, w1 and neighboring tags t-1, t1]
Experiments for Part-of-Speech Tagging
- Data: WSJ sections 0-18 for training, 19-21 for development, 22-24 for test
- Log-linear models for the local distributions
- All features are binary and formed by instantiating templates, e.g. f1(h, t) = 1 iff w0 = "to" and t = TO (0 otherwise)
- Separate feature templates targeted at unknown words: prefixes, suffixes, etc.
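As a concrete illustration of template instantiation, here is a minimal sketch in Python; the feature names, the extra suffix and capitalization templates, and the local_features interface are illustrative assumptions, not the system's actual code:

```python
def local_features(words, i, prev_tag, tag):
    """Return the binary features (by name) active when proposing `tag` at
    position i; "instantiating a template" means generating such names."""
    w = words[i]
    prev_w = words[i - 1] if i > 0 else "<s>"
    feats = [
        f"w0={w}|t={tag}",            # e.g. w0=to|t=TO (the slide's example template)
        f"t-1={prev_tag}|t={tag}",    # previous tag conjoined with the proposed tag
        f"w-1={prev_w}|t={tag}",
    ]
    # Unknown-word style templates: suffixes of the current word
    feats += [f"suf={w[-k:]}|t={tag}" for k in range(1, min(4, len(w)) + 1)]
    if w[0].isupper():
        feats.append(f"cap|t={tag}")
    return feats

# The template "f1(h, t) = 1 iff w0 = 'to' and t = TO" fires for this call:
print(local_features(["going", "to", "fight"], 1, "VBG", "TO"))
```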
Tagging Without Sequence Information
Baseline: predict t0 from w0 alone.  Three Words: predict t0 from w-1, w0, w1.

Model      Features   Token acc.   Unknown acc.   Sentence acc.
Baseline   56,…       …            82.61%         26.74%
3Words     239,…      …            86.78%         48.27%

Using words only works significantly better than using the previous two or three tags!
CMM Tagging Models - I
Independence assumptions of the left-to-right CMM:
- t_i is independent of t_1 ... t_{i-2} and w_1 ... w_{i-1} given t_{i-1}
- t_i is independent of all following observations
Similar assumptions hold in the right-to-left CMM:
- t_i is independent of all preceding observations
[Diagrams: left-to-right and right-to-left CMMs over tags t1, t2, t3 and words w1, w2, w3]
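For reference, these assumptions correspond to the following directional factorizations in their simplest form (the actual local models condition on richer word and tag contexts):

```latex
% Left-to-right CMM: each tag depends only on the previous tag and local words
P(t_1,\ldots,t_n \mid w_1,\ldots,w_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, w_i)

% Right-to-left CMM: each tag depends only on the following tag and local words
P(t_1,\ldots,t_n \mid w_1,\ldots,w_n) = \prod_{i=1}^{n} P(t_i \mid t_{i+1}, w_i)
```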
CMM Tagging Models - II
- The bad independence assumptions lead to label bias (Bottou 91, Lafferty 01) and observation bias (Klein & Manning 02)
- Example: "will to fight", with possible tags will {MD, NN}, to {TO}, fight {NN, VB, VBP}
- In the left-to-right CMM, "will" will be mis-tagged as MD, because MD is its most common tag:
  P(t1=MD, t2=TO | will, to) = P(MD | will, sos) * P(TO | to, MD) = P(MD | will, sos) * 1
  (sos = start of sentence; since "to" can only be TO, the second factor is 1 and contributes no disambiguating evidence)
[Diagram: left-to-right CMM over "will to fight"]
CMM Tagging Models - III
- Same example: will {MD, NN}, to {TO}, fight {NN, VB, VBP}
- In the right-to-left CMM, "fight" will most likely be mis-tagged as NN:
  P(t2=TO, t3=NN | to, fight) = P(NN | fight, X) * P(TO | to, NN) = P(NN | fight, X) * 1
  (X stands for the following context; again P(TO | to, ·) = 1, so the choice of t3 gets no evidence from "to")
[Diagram: right-to-left CMM over "will to fight"]
Dependency Networks
- Conditioning on both left and right tags fixes the problem
[Diagram: bidirectional dependency network over "will to fight"; t1 and t3 each condition on the TO tag of "to"]
Dependency Networks
- We do not attempt to construct a joint distribution
- We classify to the highest-scoring sequence, where the score is the product of the local conditionals: Score(t) = prod_i P(t_i | t_{i-1}, t_{i+1}, w)
- An efficient dynamic programming algorithm similar to Viterbi exists for finding the highest-scoring sequence
[Diagram: bidirectional dependency network over tags t1, t2 and words w1, w2]
Inference for Linear Dependency Networks
[Diagram: linear dependency network over positions i-1, i, i+1, i+2, with each tag t_i linked to its word w_i and to the tags on both sides]
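A minimal sketch of such a Viterbi-style dynamic program, which keeps the best score for each pair of adjacent tags; the local_log_prob interface and the <s>/</s> boundary markers are assumed stand-ins for the trained local log-linear models:

```python
from itertools import product

def decode(words, tags, local_log_prob):
    """Viterbi-style decoding for a tag-level dependency network.

    local_log_prob(words, i, left, t, right) is assumed to return
    log P(t_i = t | t_{i-1} = left, t_{i+1} = right, words) for 0-based i.
    The sequence score is the sum of these local log-scores, so the dynamic
    program tracks the best score for each pair of adjacent tags.
    """
    n = len(words)
    BOS, EOS = "<s>", "</s>"
    # delta[(t_i, t_{i+1})] = best log-score over t_1..t_{i-1} for this pair
    delta = {(BOS, t1): 0.0 for t1 in tags}
    back = {}                                  # back[(i, t_i, t_{i+1})] = best t_{i-1}
    for i in range(1, n + 1):                  # 1-based tag positions
        right_options = tags if i < n else [EOS]
        new_delta = {}
        for (left, cur), right in product(delta, right_options):
            s = delta[(left, cur)] + local_log_prob(words, i - 1, left, cur, right)
            if (cur, right) not in new_delta or s > new_delta[(cur, right)]:
                new_delta[(cur, right)] = s
                back[(i, cur, right)] = left
        delta = new_delta
    # Follow back-pointers from the best final pair (t_n, EOS)
    (last, _end), _score = max(delta.items(), key=lambda kv: kv[1])
    seq, right = [last], EOS
    for i in range(n, 1, -1):
        left = back[(i, seq[0], right)]
        right = seq[0]
        seq.insert(0, left)
    return seq
```

For a tagset of size T this takes O(n T^3) time, the same order as a CMM whose local models condition on two neighboring tags, consistent with the complexity claim in the conclusions.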
Using Tags: Left Context is Better
Baseline: t0 from w0.  Model L: t0 from w0 and t-1.  Model R: t0 from w0 and t+1.

Model      Features   Token acc.   Unknown acc.   Sentence acc.
Baseline   56,…       …            82.61%         26.74%
L          27,…       …            85.49%         41.89%
R          27,…       …            85.65%         36.31%

Model L has a 13.4% error reduction from Model R
Centered Context is Better
L+L2: t0 from w0, t-1, t-2.  R+R2: t0 from w0, t+1, t+2.  L+R: t0 from w0, t-1, t+1.

Model   Features   Token acc.   Unknown acc.   Sentence acc.
L+L2    32,…       …            85.92%         44.04%
R+R2    33,…       …            84.49%         37.20%
L+R     32,…       …            87.15%         49.50%

Model L+R has a 13.2% error reduction from Model L+L2
Centered Context is Better in the End
L+LL+LLL: tag templates over t-1, t-2, t-3.  L+LL+LR+R+RR: tag templates over both left tags (t-1, t-2) and right tags (t+1, t+2).

Model           Features   Token acc.   Unknown acc.   Sentence acc.
L+LL+LLL        118,…      …            86.52%         45.14%
L+LL+LR+R+RR    81,…       …            87.91%         53.23%

15% error reduction from including the tags of words to the right
Lexicalization and More Unknown Word Features
L+LL+LR+R+RR+3W: adds the words w-1, w0, w+1 to the tag context.

Model                                  Features   Token acc.   Unknown acc.   Sentence acc.
L+LL+LR+R+RR (TAGS)                    81,…       …            87.91%         53.23%
TAGS+3W                                263,…      …            88.05%         53.83%
TAGS+3W+LW0+RW0+W-1W0+W0W1 (BEST)      460,…      …            88.61%         55.83%
BEST, test set                         460,…      97.24%       89.04%         56.34%
Final Test Results

Model            Features   Token acc.   Unknown acc.   Sentence acc.
BEST, test set   460,…      97.24%       89.04%         56.34%

Comparison to the best previously published result (Collins 2002): a 4.4% relative error reduction, statistically significant.
Unknown Word Features
- Because we use a conditional model, it is easy to define complex features of the words (see the sketch below)
- A crude company name detector: the feature is on if the word is capitalized and followed by a company suffix like Co. or Inc. within 3 words
- Conjunctions of character-level features: capitalized, contains digit, contains dash, all capitals, etc. (e.g. CFC-12, F/A-18)
- Prefixes and suffixes up to length 10
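A sketch of what such unknown-word features could look like; the company-suffix list, feature names, and window sizes here are illustrative assumptions rather than the system's exact definitions:

```python
# Illustrative unknown-word feature extraction: a crude company-name detector,
# conjunctions of character-level cues, and prefix/suffix features.

COMPANY_SUFFIXES = {"Co.", "Corp.", "Inc.", "Ltd."}

def unknown_word_features(words, i, max_affix=10):
    w = words[i]
    feats = []
    # Crude company-name detector: capitalized word followed by a company
    # suffix within the next 3 words.
    if w[0].isupper() and any(x in COMPANY_SUFFIXES for x in words[i + 1:i + 4]):
        feats.append("company")
    # Character-level cues and their conjunction (e.g. CFC-12, F/A-18)
    cues = []
    if w[0].isupper():
        cues.append("cap")
    if any(c.isdigit() for c in w):
        cues.append("digit")
    if "-" in w:
        cues.append("dash")
    if w.isupper():
        cues.append("allcaps")
    feats += cues
    if len(cues) > 1:
        feats.append("+".join(cues))          # conjunction feature
    # Prefixes and suffixes up to length 10
    for k in range(1, min(max_affix, len(w)) + 1):
        feats.append(f"pre={w[:k]}")
        feats.append(f"suf={w[-k:]}")
    return feats

print(unknown_word_features(["Acme", "Holding", "Co.", "rose"], 0))
print(unknown_word_features(["CFC-12", "levels"], 0))
```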
Regularization Helps a Lot
- Higher accuracy, faster convergence, and more features can be added before overfitting
Regularization Helps a Lot
[Chart: accuracy with and without Gaussian smoothing]
[Chart: effect of reducing feature support cutoffs in smoothed and unsmoothed models]
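The Gaussian smoothing referred to here amounts, in its simplest shared-variance form, to a quadratic penalty on the feature weights lambda_j of the local log-linear models; a minimal sketch of the regularized training objective, where h_i is the history at position i and sigma^2 is a tuned constant:

```latex
\ell(\lambda) \;=\; \sum_{i} \log P_{\lambda}(t_i \mid h_i) \;-\; \sum_{j} \frac{\lambda_j^{2}}{2\sigma^{2}}
```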
Semantics of Dependency Networks (Hofmann and Tresp 1997; Heckerman 2000)
- Let X = (X1, ..., Xn). A dependency network for X is a pair (G, P) where G is a (possibly cyclic) directed dependency graph and P is a set of probability distributions
- Each node in G corresponds to a variable Xi, and the parents of Xi are all nodes Pa(Xi) such that P(Xi | X1, ..., Xi-1, Xi+1, ..., Xn) = P(Xi | Pa(Xi))
- The distributions in P are the local probability distributions P(Xi | Pa(Xi))
- If there exists a joint distribution P(X) such that the conditional distributions in P are derivable from it, the dependency network is called consistent
- For positive distributions P, we can obtain the joint distribution P(X) by Gibbs sampling
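A minimal sketch of the Gibbs sampler mentioned above; sample_conditional is an assumed interface that draws X_i from its local conditional P(X_i | Pa(X_i)) under the current assignment:

```python
import random

def gibbs_sample(variables, domains, sample_conditional, n_sweeps=1000, burn_in=100):
    """Draw samples from the joint implied by a dependency network's local
    conditionals.  sample_conditional(v, assignment) is assumed to sample
    X_v given the current values of the other variables."""
    # Start from an arbitrary assignment
    assignment = {v: random.choice(domains[v]) for v in variables}
    samples = []
    for sweep in range(n_sweeps):
        for v in variables:                    # resample each variable in turn
            assignment[v] = sample_conditional(v, assignment)
        if sweep >= burn_in:                   # discard early, unconverged sweeps
            samples.append(dict(assignment))
    return samples
```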
Dependency Networks - Problems
- The dependency network probabilities learned from data may be inconsistent: there may be no joint distribution having these conditionals
- Even if they define a consistent network, the scoring criterion is susceptible to mutually re-enforcing but unlikely sequences
- Example: on a training set whose most likely state sequence is 11, we can still have Score(11) = 2/3 * 1 = 2/3 but Score(33) = 1, so the rarer sequence 33 is preferred
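A toy reconstruction of how such scores can arise; the training counts below are hypothetical, chosen only to reproduce the numbers on the slide, not taken from the talk:

```python
from collections import Counter

# Hypothetical counts over two-position state sequences.
data = Counter({(1, 1): 2, (2, 1): 1, (3, 3): 1})
total = sum(data.values())

def cond(value, pos, given, given_pos):
    """Empirical P(X_pos = value | X_given_pos = given)."""
    num = sum(c for seq, c in data.items() if seq[pos] == value and seq[given_pos] == given)
    den = sum(c for seq, c in data.items() if seq[given_pos] == given)
    return num / den

def score(seq):
    # Dependency-network score: product of the two local conditionals
    return cond(seq[0], 0, seq[1], 1) * cond(seq[1], 1, seq[0], 0)

print("P(11) =", data[(1, 1)] / total, " Score(11) =", score((1, 1)))   # 0.5, 2/3
print("P(33) =", data[(3, 3)] / total, " Score(33) =", score((3, 3)))   # 0.25, 1.0
```

The product-of-conditionals score prefers 33 even though 11 is twice as frequent, because the two conditionals for 33 reinforce each other perfectly.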
Conclusions
- The use of dependency networks was very helpful for tagging:
  - both left and right words and tags are used for prediction
  - avoids bad independence assumptions
  - in training and test, the time/space complexity is the same as for CMMs
  - promising for other NLP sequence tasks
- More predictive features for tagging:
  - rich lexicalization further improved accuracy
  - conjunctions of feature templates
  - smoothing is critical