Generative and Discriminative Models in NLP: A Survey
Kristina Toutanova, Computer Science Department, Stanford University

Motivation
Many problems in natural language processing are disambiguation problems:
- word senses: "jaguar" – a big cat, a car, the name of a Java package; "line" – phone, queue, in mathematics, air line, etc.
- part-of-speech tags (noun, verb, proper noun, etc.): in "Joy makes progress every day.", the words "Joy", "makes", and "progress" are each ambiguous among tags such as NN, NNP, VB, VBZ, and NNS.

Motivation
Parsing – choosing preferred phrase structure trees for sentences, corresponding to their likely semantics (e.g. the PP-attachment ambiguity in "I saw Mary with the telescope").
Possible approaches to disambiguation:
- Encode knowledge about the problem: define rules, hand-engineer grammars and patterns (requires much effort, and categorical answers are not always possible).
- Treat the problem as a classification task and learn classifiers from labeled training data.

Overview
- General ML perspective
- Examples: the case of part-of-speech tagging; the case of syntactic parsing
- Conclusions

The Classification Problem
Given a training set of i.i.d. samples T = {(X_1, Y_1), …, (X_n, Y_n)} of input and class variables drawn from an unknown distribution D(X, Y), estimate a function h(X) that predicts the class from the input variables.
The goal is to come up with a hypothesis of minimum expected loss (usually 0-1 loss).
Under 0-1 loss, the hypothesis with minimum expected loss is the Bayes optimal classifier h(x) = argmax_y D(y|x).

Approaches to Solving Classification Problems - I
1. Generative. Try to estimate the probability distribution of the data D(X, Y):
- specify a parametric model family P_θ(X, Y)
- choose parameters by maximum likelihood on the training data
- estimate conditional probabilities by Bayes' rule: P(Y|X) is proportional to P(Y) P(X|Y)
- classify new instances to the most probable class Y under the estimated P(Y|X)

Approaches to Solving Classification Problems - II
2. Discriminative. Try to estimate the conditional distribution D(Y|X) from data:
- specify a parametric model family P_λ(Y|X)
- estimate parameters by maximum conditional likelihood of the training data
- classify new instances to the most probable class Y under the estimated P_λ(Y|X)
3. Discriminative, distribution-free. Try to estimate a classifier h(X) directly from data so that its expected loss is minimized (see the sketch below).
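
For reference, the decision rules behind approaches 1-3 can be written out as follows; these are standard formulations, and the specific notation is an addition to the original slides (assumes amsmath/amssymb):

```latex
\[
\text{Generative: }\quad \hat{y}(x) = \arg\max_y P_{\theta}(y)\,P_{\theta}(x \mid y),
\qquad \hat{\theta} = \arg\max_{\theta} \prod_i P_{\theta}(x_i, y_i)
\]
\[
\text{Discriminative (probabilistic): }\quad \hat{y}(x) = \arg\max_y P_{\lambda}(y \mid x),
\qquad \hat{\lambda} = \arg\max_{\lambda} \prod_i P_{\lambda}(y_i \mid x_i)
\]
\[
\text{Distribution-free: }\quad h^{*} = \arg\min_{h \in \mathcal{H}} \;
\mathbb{E}_{(x,y)\sim D}\big[\ell(h(x), y)\big]
\]
```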

Axes for Comparison of Different Approaches
- Asymptotic accuracy
- Accuracy for limited training data
- Speed of convergence to the best hypothesis
- Complexity of training
- Modeling ease

Generative-Discriminative Pairs
Definition: if a generative and a discriminative parametric model family can represent the same set of conditional probability distributions, they are a generative-discriminative pair.
Example: Naïve Bayes and logistic regression (the Naïve Bayes graphical model has the class Y as the parent of the features X_1 and X_2).

Comparison of Naïve Bayes and Logistic Regression
The NB assumption that features are independent given the class is not made by logistic regression.
The logistic regression model is more general because it allows a larger class of probability distributions for the features given the classes.
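
As an illustration of this pair in practice, here is a minimal sketch (not from the original slides) that trains Bernoulli Naïve Bayes and logistic regression on the same synthetic binary-feature data; the data generator, feature count, and training-set sizes are illustrative assumptions.

```python
# Minimal sketch: Naive Bayes vs. logistic regression on the same data.
# The synthetic data generator and the sizes are illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, n_features=20):
    # Class-conditional Bernoulli features with a dependence that NB ignores.
    y = rng.integers(0, 2, size=n)
    base = np.where(y[:, None] == 1, 0.7, 0.3)
    x = (rng.random((n, n_features)) < base).astype(int)
    x[:, 1] = x[:, 0]            # duplicated feature violates NB independence
    return x, y

x_test, y_test = make_data(5000)
for n_train in (50, 500, 5000):
    x_tr, y_tr = make_data(n_train)
    nb = BernoulliNB().fit(x_tr, y_tr)
    lr = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    print(n_train,
          "NB acc %.3f" % nb.score(x_test, y_test),
          "LR acc %.3f" % lr.score(x_test, y_test))
```

With small training sets Naïve Bayes tends to hold its own, and logistic regression typically catches up or overtakes it as the training set grows, which is roughly the pattern reported by Ng & Jordan (2002) cited below.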

Example: Traffic Lights
Reality: lights working – P(NS=g, EW=r, w) = 3/7 and P(NS=r, EW=g, w) = 3/7; lights broken – P(NS=r, EW=r, b) = 1/7.
NB model: the class is Working? (w or b), and the two observed light colors (North-South and East-West) are assumed independent given the class.
The model assumptions are false (a working light shows green in one direction and red in the other), so the joint-likelihood (JL) and conditional-likelihood (CL) estimates differ:
JL: P(w) = 6/7, P(r|w) = 1/2, P(r|b) = 1.
CL: the estimates of P(r|w) = 1/2 and P(r|b) = 1 are the same, but the estimate of P(w) is not tied to the empirical 6/7 (next two slides).

Joint Traffic Lights
Under the joint (JL/MLE) parameters, the model's joint table gives each of the four light configurations under "lights working" probability 3/14, gives the (r,r) configuration under "lights broken" probability 2/14, and gives the remaining broken cells 0.
For (g,r) and (r,g) the conditional likelihood of working is 1.
For (r,r) the conditional likelihood of working is 3/14 / (3/14 + 2/14) > 1/2, so the broken case is incorrectly assigned to working.
Accuracy: 6/7.

Conditional Traffic Lights
Maximizing conditional likelihood instead drives the estimate of P(w) toward 0: each working configuration gets joint weight ε/4, and the broken (r,r) cell gets 1-ε.
The conditional likelihood of working given (g,r) or (r,g) is still 1, and (r,r) is now correctly assigned to broken.
Accuracy: 7/7. The conditional likelihood becomes perfect (goes to 1) while the joint likelihood becomes low (goes to 0).
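
A small numeric sketch of this example (a reconstruction, not code from the talk): it uses the joint-MLE Naïve Bayes parameters, then sweeps the class prior to maximize conditional likelihood, and reports the two accuracies from the slides.

```python
# Traffic-lights example, reconstructed numerically.
# Data: 7 observations -- 3x (g,r) working, 3x (r,g) working, 1x (r,r) broken.
import numpy as np

data = [("g", "r", "w")] * 3 + [("r", "g", "w")] * 3 + [("r", "r", "b")]
p_r_given = {"w": 0.5, "b": 1.0}          # P(red | class), same under JL and CL

def nb_joint(ns, ew, cls, p_w):
    # Naive Bayes joint probability: class prior times independent light colors.
    p_cls = p_w if cls == "w" else 1.0 - p_w
    def p_light(color):
        return p_r_given[cls] if color == "r" else 1.0 - p_r_given[cls]
    return p_cls * p_light(ns) * p_light(ew)

def accuracy(p_w):
    correct = 0
    for ns, ew, cls in data:
        scores = {c: nb_joint(ns, ew, c, p_w) for c in "wb"}
        correct += max(scores, key=scores.get) == cls
    return correct / len(data)

# Joint MLE: P(w) = 6/7.
print("JL accuracy:", accuracy(6 / 7))                  # -> 6/7

def cond_loglik(p_w):
    # Conditional log-likelihood of the class labels given the observations.
    total = 0.0
    for ns, ew, cls in data:
        num = nb_joint(ns, ew, cls, p_w)
        den = sum(nb_joint(ns, ew, c, p_w) for c in "wb")
        total += np.log(num / den)
    return total

grid = np.linspace(0.001, 0.999, 999)
best = grid[np.argmax([cond_loglik(p) for p in grid])]
print("CL-optimal P(w):", round(best, 3))               # pushed toward 0
print("CL accuracy:", accuracy(best))                   # -> 7/7
```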

Comparison of Naïve Bayes and Logistic Regression
Accuracy: advantage logistic regression.
Convergence speed: advantage Naïve Bayes.
Training speed: advantage Naïve Bayes.
Model assumptions: NB – independence of features given the class; LR – linear log-odds.
Advantages: NB – faster convergence, faster training, uses the information in P(X); LR – more robust and accurate because it makes fewer assumptions.
Disadvantages: NB – large bias if the independence assumptions are very wrong; LR – harder parameter estimation problem, ignores the information in P(X).

Some Experimental Comparisons
(Plots of test error vs. training-data size.)
Ng & Jordan 2002: 15 datasets from the UCI ML repository.
Klein & Manning 2002: word sense disambiguation on the "line" and "hard" data.

Part-of-Speech Tagging
POS tagging is determining the part of speech of every word in a sentence (e.g. "Joy makes progress every day.", where "Joy", "makes", and "progress" are ambiguous).
It is a sequence classification problem with 45 classes (the Penn Treebank tagset). Accuracies are already high (around 97%), and some argue they can't go much higher.
Existing approaches: rule-based (hand-crafted, transformation-based learning); generative (HMM); discriminative (maxent, memory-based, decision trees, neural networks, linear models such as boosting and the perceptron).

Part-of-Speech Tagging: Useful Features
The complete solution of the problem requires full syntactic and semantic understanding of sentences, but in most cases information about surrounding words and tags is a strong disambiguator: "The long fenestration was tiring."
Useful features:
- tags of previous/following words (e.g. P(NN|JJ) = .45, while P(VBP|JJ) is far smaller)
- identity of the word being tagged and of the surrounding words
- suffix/prefix for unknown words, hyphenation, capitalization
- longer-distance features
- others we haven't figured out yet
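
A minimal sketch of the kind of per-token feature extractor these features suggest; the feature names and the one-word window are illustrative choices, not the features of any particular tagger.

```python
# Sketch of a per-token feature extractor for tagging (feature names are illustrative).
def token_features(words, i, prev_tag=None):
    w = words[i]
    feats = {
        "word=" + w.lower(): 1,
        "suffix3=" + w[-3:].lower(): 1,
        "prefix2=" + w[:2].lower(): 1,
        "is_capitalized": int(w[0].isupper()),
        "has_hyphen": int("-" in w),
        "has_digit": int(any(ch.isdigit() for ch in w)),
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<s>"): 1,
        "next_word=" + (words[i + 1].lower() if i + 1 < len(words) else "</s>"): 1,
    }
    if prev_tag is not None:
        feats["prev_tag=" + prev_tag] = 1
    return feats

# Example: features for "fenestration" in the slide's sentence.
sent = "The long fenestration was tiring .".split()
print(token_features(sent, 2, prev_tag="JJ"))
```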

HMM Tagging Models - I
(Graphical model: a tag chain t_1, t_2, t_3 with each word w_1, w_2, w_3 emitted from its tag.)
States can be single tags, pairs of successive tags, or variable-length sequences of last tags.
Independence assumptions: t_i is independent of t_1 … t_{i-2} and w_1 … w_{i-1} given t_{i-1}; words are independent given their tags.
Unknown words: the tag also generates spelling features such as capitalization, suffix, and hyphenation (Weischedel et al. 93).

HMM Tagging Models - Brants 2000
Highly competitive with other state-of-the-art models: a trigram HMM with smoothed transition probabilities.
The capitalization feature becomes part of the state – each tag state is split in two (e.g. NN with and without capitalization).
Suffix features for unknown words: the tag generates the word's final letters suffix_n, suffix_{n-1}, …, suffix_1.
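
For concreteness, a minimal bigram-HMM Viterbi decoder; this is a simplification of the trigram model above, with smoothing and unknown-word handling omitted, and the toy probabilities are made up for illustration.

```python
# Minimal bigram-HMM Viterbi decoder (illustrative; no smoothing, no unknown-word model).
import math

def viterbi(words, tags, log_trans, log_emit, log_init):
    # log_trans[(t_prev, t)], log_emit[(t, w)], log_init[t] are log-probabilities.
    best = [{t: log_init[t] + log_emit[(t, words[0])] for t in tags}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev_t, s = max(((tp, best[-1][tp] + log_trans[(tp, t)]) for tp in tags),
                            key=lambda x: x[1])
            scores[t] = s + log_emit[(t, w)]
            pointers[t] = prev_t
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    tag = max(best[-1], key=best[-1].get)
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return list(reversed(path))

# Toy usage with made-up probabilities:
tags = ["DT", "NN"]
log_init = {"DT": math.log(0.9), "NN": math.log(0.1)}
log_trans = {("DT", "NN"): math.log(0.9), ("DT", "DT"): math.log(0.1),
             ("NN", "NN"): math.log(0.4), ("NN", "DT"): math.log(0.6)}
log_emit = {("DT", "the"): math.log(0.9), ("NN", "the"): math.log(0.01),
            ("DT", "dog"): math.log(0.01), ("NN", "dog"): math.log(0.5)}
print(viterbi(["the", "dog"], tags, log_trans, log_emit, log_init))  # ['DT', 'NN']
```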

CMM Tagging Models
(Graphical model: each tag t_i is conditioned on the previous tag t_{i-1} and the observation w_i.)
Independence assumptions: t_i is independent of t_1 … t_{i-2} and w_1 … w_{i-1} given t_{i-1}; t_i is independent of all following observations; there are no independence assumptions on the observation sequence.
Dependence of the current tag on previous and future observations can be added, and overlapping features of the observation can be taken as predictors.

MEMM Tagging Models - II
Ratnaparkhi (1996): the local distributions are estimated using maximum entropy models.
Features used: the previous two tags, the current word, the previous two words, the next two words, and suffix, prefix, hyphenation, and capitalization features for unknown words.
(Accuracy comparison, overall and on unknown words: HMM (Brants 2000), MEMM (Ratnaparkhi 1996), MEMM (Toutanova & Manning 2000).)
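
A minimal sketch of the MEMM idea, using scikit-learn's logistic regression as the local maxent model and greedy left-to-right decoding rather than beam search; the training data and feature templates are toy illustrations, not Ratnaparkhi's.

```python
# MEMM sketch: a local maxent (logistic regression) model P(tag | history),
# trained on per-token feature dicts and decoded greedily left to right.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(words, i, prev_tag):
    w = words[i]
    return {"word": w.lower(), "suffix3": w[-3:].lower(),
            "cap": w[0].isupper(), "prev_tag": prev_tag,
            "prev_word": words[i - 1].lower() if i > 0 else "<s>",
            "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>"}

train_data = [(["The", "dog", "barks"], ["DT", "NN", "VBZ"]),
              (["A", "cat", "sleeps"], ["DT", "NN", "VBZ"])]

X, y = [], []
for words, tags in train_data:
    prev = "<s>"
    for i, tag in enumerate(tags):
        X.append(features(words, i, prev))
        y.append(tag)
        prev = tag

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

def greedy_tag(words):
    tags, prev = [], "<s>"
    for i in range(len(words)):
        pred = clf.predict(vec.transform([features(words, i, prev)]))[0]
        tags.append(pred)
        prev = pred
    return tags

print(greedy_tag(["The", "dog", "sleeps"]))
```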

HMM vs CMM – I
(Figure: three local model structures over the variables t_j, w_j, t_{j+1}, w_{j+1}.)
Johnson (2001) reports accuracies of 95.5%, 94.4%, and 95.3% for the three variants.

HMM vs CMM - II
The per-state conditioning of the CMM has been observed to exhibit label bias (Bottou; Lafferty et al.) and observation bias (Klein & Manning).
In Klein & Manning's (2002) comparison of HMM and CMM structures, unobserving words with unambiguous tags improved CMM performance significantly.

Conditional Random Fields (Lafferty et al. 2001)
Models that are globally conditioned on the observation sequence: they define a distribution P(Y|X) of the tag sequence given the word sequence.
No independence assumptions about the observations, so there is no need to model their distribution; the labels can depend on past and future observations.
This avoids the CMM assumption that labels are independent of future observations, and thus the label and observation bias problems.
The parameter estimation problem is much harder.

CRF - II
The HMM and this linear-chain CRF form a generative-discriminative pair.
Independence assumptions: a tag is independent of all other tags in the sequence given its neighbors and the word sequence.
(Graphical model: an undirected chain over t_1, t_2, t_3, each connected to the word sequence w_1, w_2, w_3.)
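
To make the global conditioning concrete, here is a small from-scratch sketch (not from the talk) that computes P(Y|X) for a linear-chain CRF with transition and emission weights: the score of a tag sequence is exponentiated and normalized by a partition function computed with the forward algorithm. The weights are arbitrary illustrative numbers, not learned.

```python
# Linear-chain CRF sketch: score a tag sequence and normalize globally.
import numpy as np

tags = ["DT", "NN", "VBZ"]

# trans[i, j]: weight for tag i followed by tag j; emit[(tag, word)]: emission weight.
trans = np.array([[0.0, 2.0, -1.0],
                  [-1.0, 0.5, 2.0],
                  [1.0, 0.0, -2.0]])
emit = {("DT", "the"): 3.0, ("NN", "dog"): 2.5, ("VBZ", "barks"): 2.5}

def emissions(words):
    # One row per position, one column per tag.
    return np.array([[emit.get((t, w), 0.0) for t in tags] for w in words])

def sequence_score(words, tag_seq):
    e = emissions(words)
    idx = [tags.index(t) for t in tag_seq]
    score = e[0, idx[0]]
    for k in range(1, len(words)):
        score += trans[idx[k - 1], idx[k]] + e[k, idx[k]]
    return score

def log_partition(words):
    # Forward algorithm in log space over all possible tag sequences.
    e = emissions(words)
    alpha = e[0]
    for k in range(1, len(words)):
        alpha = e[k] + np.logaddexp.reduce(alpha[:, None] + trans, axis=0)
    return np.logaddexp.reduce(alpha)

words = ["the", "dog", "barks"]
y = ["DT", "NN", "VBZ"]
log_p = sequence_score(words, y) - log_partition(words)
print("P(y|x) =", np.exp(log_p))
```

Training maximizes the sum of log P(y|x) over the data; the gradient requires expected feature counts under the model, which is why parameter estimation is much more expensive than the relative-frequency counts used for an HMM.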

CRF - Experimental Results
Model / Accuracy / Unknown-word accuracy:
HMM: 94.31% / 54.01%
CMM (MEMM): 93.63% / 45.39%
CRF: 94.45% / 51.95%
CMM+ (MEMM+): 95.19% / 73.01%
CRF+: 95.73% / 76.24%

Discriminative Tagging Model – Voted Perceptron
Collins 2002; best reported tagging results on the WSJ. Uses all features used by Ratnaparkhi (96).
Learns a linear scoring function over (sentence, tag sequence) feature vectors and classifies by choosing the highest-scoring tag sequence.
Accuracy: MEMM (Ratnaparkhi 96) 96.72%; voted perceptron 97.11%.
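
A compact sketch of the structured-perceptron update behind this model (an illustrative simplification: Collins' version adds vote averaging, Viterbi decoding, and far richer features; the feature templates and exhaustive decoder here are toy choices).

```python
# Structured perceptron sketch for tagging: predict the best tag sequence with the
# current weights, then add the gold features and subtract the predicted ones.
from collections import defaultdict
from itertools import product

def seq_features(words, tags):
    feats = defaultdict(int)
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[("word", w.lower(), t)] += 1
        feats[("prev_tag", prev, t)] += 1
        prev = t
    return feats

def score(weights, feats):
    return sum(weights[f] * c for f, c in feats.items())

def predict(weights, words, tagset):
    # Exhaustive search over tag sequences (fine only for tiny examples).
    return max(product(tagset, repeat=len(words)),
               key=lambda tags: score(weights, seq_features(words, tags)))

def train(data, tagset, epochs=5):
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            pred = list(predict(weights, words, tagset))
            if pred != list(gold):
                for f, c in seq_features(words, gold).items():
                    weights[f] += c
                for f, c in seq_features(words, pred).items():
                    weights[f] -= c
    return weights

data = [(["the", "dog", "barks"], ["DT", "NN", "VBZ"]),
        (["a", "cat", "sleeps"], ["DT", "NN", "VBZ"])]
w = train(data, ["DT", "NN", "VBZ"])
print(predict(w, ["the", "cat", "barks"], ["DT", "NN", "VBZ"]))
```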

Summary of Tagging Review
- For tagging, the change from a generative to a discriminative model does not by itself result in great improvement (e.g. HMM vs. CRF).
- One profits from discriminative models when specifying dependence on overlapping features of the observation, such as spelling, suffix analysis, etc.
- The CMM model allows integration of rich features of the observations, but suffers strongly from assuming independence from following observations; this effect can be relieved by adding dependence on following words.
- The additional power of the CMM, CRF, and perceptron models has been shown to yield improvements in accuracy, though not dramatic ones (up to 11% error reduction).
- The higher accuracy of discriminative models comes at the price of much slower training.
- More research is needed on specifying useful features (or tagging the WSJ Penn Treebank is a noisy task and the limit has been reached).

Parsing Models
Syntactic parsing is the task of assigning a parse tree to a sentence, corresponding to its most likely interpretation (e.g. a tree for "I saw Mary with the telescope").
Existing approaches: hand-crafted rule-based heuristic methods; probabilistic generative models; conditional probabilistic discriminative models; discriminative ranking models.

Generative Parsing Models
Generative models based on PCFG grammars learned from corpora are still among the best performing (Collins 97, Charniak 97, 00): 88%-89% labeled precision/recall.
The generative models learn a distribution P(X, Y) on (sentence, tree) pairs and select the single most likely parse for a sentence X as argmax_Y P(Y|X) = argmax_Y P(X, Y).
They are easy to train using relative frequency estimation (RFE) for maximum likelihood.
These models have the advantage of being usable as language models (Chelba & Jelinek 00, Charniak 00).
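
A toy sketch of what "train by RFE and score a tree" means for a plain PCFG; the two-tree treebank and the nested-tuple tree encoding are illustrative assumptions, not the Collins or Charniak models.

```python
# Toy PCFG: estimate rule probabilities by relative frequency and score a tree.
# Trees are nested tuples (LABEL, child, ...); leaves are words.
from collections import Counter
import math

def rules(tree):
    if isinstance(tree, str):
        return
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        yield from rules(c)

def train_pcfg(treebank):
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank:
        for lhs, rhs in rules(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    # Relative frequency estimate: count(A -> beta) / count(A).
    return {r: c / lhs_counts[r[0]] for r, c in rule_counts.items()}

def log_prob(tree, probs):
    return sum(math.log(probs[r]) for r in rules(tree))

treebank = [
    ("S", ("NP", "I"), ("VP", ("V", "saw"), ("NP", "Mary"))),
    ("S", ("NP", "Mary"), ("VP", ("V", "slept"))),
]
probs = train_pcfg(treebank)
print(probs[("VP", ("V", "NP"))])          # relative frequency of VP -> V NP
print(log_prob(treebank[0], probs))        # log P(tree) under the PCFG
```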

Generative History-Based Model – Collins 97
(Example lexicalized tree for "Last week Marks bought Brooks": TOP dominates S(bought), with daughters NP(week), NP-C(Marks), and VP(bought); every constituent is annotated with its head word.)
Accuracy on sentences of at most 100 words: 88.1% LP, 87.5% LR.

Discriminative Models – Shift-Reduce Parser
Ratnaparkhi (98) learns a distribution P(T|S) of parse trees given sentences using the sequence of actions of a shift-reduce parser.
A maximum entropy model learns the conditional distribution of each parse action given the history.
Like a CMM, it suffers from the independence assumption that actions are independent of future observations, and it has a higher parameter estimation cost for learning the local maximum entropy models.
Accuracy is lower but still good: 86%-87% labeled precision/recall.

Discriminative Models – Distribution-Free Re-ranking
Represent sentence-parse tree pairs by a feature vector F(X, Y) and learn a linear ranking model (a vector of feature weights) using the boosting loss.
Model / LP / LR:
Collins 99 (generative): 88.3% / 88.1%
Collins 00 (boosting loss): 89.9% / 89.6%
That is a 13% error reduction, though still very close in accuracy to the generative model of Charniak 00.
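
A minimal sketch of linear re-ranking over the candidate parses produced by a base generative parser; the candidate encoding, feature function, and weights are placeholders, not Collins' actual features or training procedure.

```python
# Re-ranking sketch: pick the candidate parse with the highest linear score w . F(x, y).
# Candidate parses, features, and weights are illustrative placeholders.

def feature_vector(sentence, parse):
    # F(X, Y): in practice this counts subtrees, rules, nonterminal bigrams, etc.
    feats = {"base_log_prob": parse["log_prob"]}   # score from the generative parser
    for r in parse["rules"]:                       # indicator features per rule used
        feats["rule=" + r] = feats.get("rule=" + r, 0.0) + 1.0
    return feats

def rerank(sentence, candidates, weights):
    def score(parse):
        f = feature_vector(sentence, parse)
        return sum(weights.get(name, 0.0) * val for name, val in f.items())
    return max(candidates, key=score)

# Toy usage: two candidate parses for a PP-attachment ambiguity.
candidates = [
    {"id": "vp-attach", "log_prob": -10.2, "rules": ["VP->V NP PP"]},
    {"id": "np-attach", "log_prob": -10.5, "rules": ["NP->NP PP"]},
]
weights = {"base_log_prob": 1.0, "rule=NP->NP PP": 0.8}   # learned in practice (e.g. boosting)
print(rerank("I saw Mary with the telescope", candidates, weights)["id"])
```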

Comparison of Generative-Discriminative Pairs
Johnson (2001) compared a simple PCFG trained to maximize the joint likelihood L(T, S) with one trained to maximize the conditional likelihood L(T|S).
A simple PCFG has one parameter per rule: the probability of the rule given its left-hand-side nonterminal.
Results: labeled precision and recall for MLE vs. MCLE were very close, with only a small advantage for the conditionally trained model (see the parsing summary below).

Weighted CFGs for Unification-Based Grammars - I
Unification-based grammars (UBG) are often defined using a context-free base and a set of path equations:
S[number X] -> NP[number X] VP[number X]
NP[number X] -> N[number X]
VP[number X] -> V[number X]
N[number sg] -> dog ; N[number pl] -> dogs
V[number sg] -> barks ; V[number pl] -> bark
A PCFG grammar can be defined using the context-free backbone CFG_UBG (e.g. S -> NP VP).
The UBG generates "dogs bark" and "dog barks"; the CFG_UBG generates "dogs bark", "dog barks", "dog bark", and "dogs barks".

Weighted CFGs for Unification-Based Grammars - II
A simple PCFG for CFG_UBG has one parameter per backbone rule; it defines a joint distribution P(T, S) and a conditional distribution of trees given sentences.
A conditional weighted CFG defines only a conditional probability, and the conditional probability of any tree T outside the UBG is 0.

Weighted CFGs for Unification-Based Grammars - III
The conditional weighted CFGs perform consistently better than their generative counterparts.
Negative information is extremely helpful here: knowing that the conditional probability of trees outside the UBG is zero, together with conditional training, amounts to a 38% error reduction for the simple PCFG model.

Summary of Parsing Results
- The single small study comparing a parsing generative-discriminative pair for PCFG parsing showed a small (insignificant) advantage for the discriminative model; the added computational cost is probably not worth it.
- The best performing statistical parsers are still generative (Charniak 00, Collins 99) or use a generative model as a preprocessing stage (Collins 00, Collins 2002), part of which has to do with computational complexity.
- Discriminative models allow more complex representations, such as the all-subtrees representation (Collins 2002) or other overlapping features (Collins 00), and this has led to up to 13% improvement over a generative model.
- Discriminative training seems promising for parse selection tasks for UBGs, where the number of possible analyses is not enormous.

Conclusions
- For the current sizes of training data available for NLP tasks such as tagging and parsing, discriminative training has not by itself yielded large gains in accuracy.
- The flexibility of including non-independent features of the observations in discriminative models has resulted in improved part-of-speech tagging models (for some tasks it might not justify the added computational complexity).
- For parsing, discriminative training has shown improvements when used for re-ranking or when using negative information (UBG).
- If you come up with a feature that is very hard to incorporate in a generative model and seems extremely useful, see whether a discriminative approach will be computationally feasible!