IE With Undirected Models

IE With Undirected Models William W. Cohen CALD

Announcements
Upcoming assignments:
– Mon 2/23: Klein & Manning, Toutanova et al.
– Wed 2/25: no writeup due
– Mon 3/1: no writeup due
– Wed 3/3: project proposal due: personnel + 1-2 page
Spring break week, no class

Motivation for CMMs
Features of an observation might include: identity of the word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, ...
[Diagram: states S_{t-1}, S_t, S_{t+1} with corresponding observations O_{t-1}, O_t, O_{t+1}; e.g. the observation "Wisniewski" is part of a noun phrase and ends in "-ski"]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.
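To make the idea concrete, here is a minimal sketch of a CMM/MEMM-style tagger (my own illustration, not code from the lecture): a maxent classifier estimates P(state | previous state, observation features), and decoding proceeds left to right. The feature names, the toy training data, and the greedy decoder are all assumptions for illustration; a real MEMM would decode with Viterbi.

```python
# Minimal CMM/MEMM sketch: a maxent model P(state | prev_state, features(obs)),
# applied left to right at tagging time. Data and features are made up.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(word, prev_state):
    # Observation features can be arbitrary and overlapping -- that is the point.
    return {
        "word=" + word.lower(): 1.0,
        "ends-ski": 1.0 if word.endswith("ski") else 0.0,
        "capitalized": 1.0 if word[:1].isupper() else 0.0,
        "prev=" + prev_state: 1.0,   # dependence on the previous state
    }

# Toy training triples: (word, previous state, state).
train = [
    ("Wisniewski", "O", "name"), ("said", "name", "O"),
    ("Kowalski", "O", "name"), ("arrived", "name", "O"),
    ("the", "O", "O"), ("city", "O", "O"),
]
vec = DictVectorizer()
X = vec.fit_transform([features(w, p) for w, p, _ in train])
y = [s for _, _, s in train]
maxent = LogisticRegression(max_iter=1000).fit(X, y)

def tag(words):
    # Greedy left-to-right decoding (a real MEMM runs Viterbi over P(s_t | s_{t-1}, o_t)).
    prev, out = "O", []
    for w in words:
        probs = maxent.predict_proba(vec.transform([features(w, prev)]))[0]
        prev = maxent.classes_[probs.argmax()]
        out.append(prev)
    return out

print(tag(["Yesterday", "Wisniewski", "spoke"]))
```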

Implications of the model
Does this do what we want?
Q: does Y[i-1] depend on X[i+1]?
"A node is conditionally independent of its non-descendants given its parents."
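The answer the label-bias discussion below builds on is "no": in this directed model, summing out the future labels shows that P(Y[i-1] | X) depends only on X[1..i-1], so later observations cannot influence earlier label decisions. A small brute-force check (toy numbers of my own, not from the lecture):

```python
# In an MEMM/CMM, P(y_1 | x_1..x_T) does not depend on later observations.
import itertools
import numpy as np

np.random.seed(1)
n_y, n_x, T = 2, 2, 3
start = np.random.rand(n_x, n_y)                 # P(y_1 | x_1)
start /= start.sum(axis=1, keepdims=True)
local = np.random.rand(n_y, n_x, n_y)            # P(y_t | y_{t-1}, x_t)
local /= local.sum(axis=2, keepdims=True)

def p_y1_given_x(xs):
    # Brute-force marginal of y_1 given the whole observation sequence xs.
    probs = np.zeros(n_y)
    for ys in itertools.product(range(n_y), repeat=T):
        p = start[xs[0], ys[0]]
        for t in range(1, T):
            p *= local[ys[t - 1], xs[t], ys[t]]
        probs[ys[0]] += p
    return probs / probs.sum()

print(p_y1_given_x((0, 0, 0)))
print(p_y1_given_x((0, 0, 1)))   # changing x_3 leaves P(y_1 | x) unchanged
```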

Another view of label bias [Sha & Pereira] So what’s the alternative?

CRF model
[Diagram: linear-chain CRF with labels y1, y2, y3, y4 connected in a chain, each also connected to the observation sequence x]
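For reference, the linear-chain CRF used in the following slides has the standard Lafferty et al. / Sha & Pereira form (the equations on the original slides did not survive extraction; the notation here is the usual one):

```latex
p_\Lambda(y \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) \;=\; \sum_{y'} \exp\!\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y'_{i-1}, y'_i, x, i) \Big)
```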

CRF learning – from Sha & Pereira

CRF learning – from Sha & Pereira

CRF learning – from Sha & Pereira
Something like forward-backward.
Idea: define a matrix of y,y' "affinities" at stage i:
– M_i[y,y'] = "unnormalized probability" of a transition from y to y' at stage i
– M_i * M_{i+1} = "unnormalized probability" of any path through stages i and i+1
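A small numpy sketch of this matrix trick (toy numbers of my own): build one |Y| x |Y| matrix of unnormalized transition scores per position, multiply them together, and the normalizer Z(x) falls out; a brute-force sum over all label paths confirms the result.

```python
# Sha & Pereira's matrix computation of Z(x) for a linear-chain CRF (toy numbers).
import itertools
import numpy as np

np.random.seed(0)
n_labels, n_steps = 3, 4
# M[i][y, y2] = exp(weighted feature score of moving from label y to y2 at position i).
M = [np.exp(np.random.randn(n_labels, n_labels)) for _ in range(n_steps)]
alpha0 = np.ones(n_labels)          # uniform start scores, for simplicity

# Z(x) via matrix products: alpha0^T * M_1 * M_2 * ... * M_n, summed over final labels.
v = alpha0 @ M[0]
for Mi in M[1:]:
    v = v @ Mi
Z = v.sum()

# Brute force: sum the unnormalized score of every label path.
Z_check = 0.0
for path in itertools.product(range(n_labels), repeat=n_steps + 1):
    score = alpha0[path[0]]
    for i in range(n_steps):
        score *= M[i][path[i], path[i + 1]]
    Z_check += score

print(Z, Z_check)   # the two should agree up to floating-point error
```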

Forward backward ideas
[Diagram: a two-state trellis (name / nonName) over three positions, with edge weights labeled b, c, d, f, g, h]

CRF learning – from Sha & Pereira

CRF results (from Sha & Pereira, Lafferty et al.)
Sha & Pereira even use some statistical tests! They show that the CRF beats the MEMM (McNemar's test), but not the voted perceptron.

CRFs: the good, the bad, and the cumbersome…
Good points:
– Global optimization of the weight vector that guides decision making
– Trades off decisions made at different points in the sequence
Worries:
– Cost (of training)
– Complexity (do we need all this math?)
– Amount of context: the matrix for the normalizer is |Y| x |Y|, so higher-order models with many classes get expensive fast
– Strong commitment to maxent-style learning: loglinear models are nice, but nothing is always best

Dependency Nets

Proposed solution: the parents of a node are its Markov blanket
– like an undirected Markov net: captures all "correlational associations"
– one conditional probability for each node X, namely P(X | parents of X)
– like a directed Bayes net: no messy clique potentials

Dependency nets
The bad and the ugly:
– Inference is less efficient: MCMC sampling
– Can't reconstruct the probability via the chain rule
– Networks might be inconsistent, i.e. the local P(x|pa(x))'s don't define a pdf
– Exactly equal, representationally, to normal undirected Markov nets

Dependency nets The good: Learning is simple and elegant (if you know each node’s Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X. (You might not learn a consistent model, but you’ll probably learn a reasonably good one.) Inference can be speeded up substantially over naïve Gibbs sampling.

Dependency nets
Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X.
[Diagram: chain y1 - y2 - y3 - y4 over the observation x, with local models Pr(y1|x,y2), Pr(y2|x,y1,y3), Pr(y3|x,y2,y4), Pr(y4|x,y3)]
Learning is local, but inference is not, and need not be unidirectional.
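To make the inference side concrete, here is a rough sketch of ordered Gibbs sampling in a chain-shaped dependency net (entirely a toy of my own: the local conditionals are hard-coded stand-ins for the classifiers P(X|pa(X)) you would actually learn):

```python
# Toy Gibbs sampler for a chain-structured dependency net y1 - y2 - y3 - y4.
import random

random.seed(0)

def local_conditional(i, y):
    # Stand-in for a learned classifier P(y_i = 1 | neighbors of y_i):
    # a made-up rule that prefers agreeing with the neighboring labels.
    neighbors = [y[j] for j in (i - 1, i + 1) if 0 <= j < len(y)]
    agree = sum(neighbors) / len(neighbors)
    return 0.2 + 0.6 * agree          # stays inside (0.2, 0.8)

def gibbs(n_vars=4, n_sweeps=2000, burn_in=200):
    y = [random.randint(0, 1) for _ in range(n_vars)]
    counts = [0] * n_vars
    for sweep in range(n_sweeps):
        for i in range(n_vars):       # resample each variable from its local conditional
            y[i] = 1 if random.random() < local_conditional(i, y) else 0
        if sweep >= burn_in:
            for i in range(n_vars):
                counts[i] += y[i]
    return [c / (n_sweeps - burn_in) for c in counts]

print(gibbs())    # estimated marginals P(y_i = 1)
```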

Toutanova, Klein, Manning, Singer
Dependency nets for POS tagging vs CMMs. Maxent is used for the local conditional models.
Goals: an easy-to-train bidirectional model, and a really good POS tagger.

Toutanova et al
Don't use Gibbs sampling for inference: instead use a Viterbi variant (which is not guaranteed to produce the ML sequence).
Example: D = {11, 11, 11, 12, 21, 33}; the ML state is {11}, but P(a=1|b=1)P(b=1|a=1) < 1 while P(a=3|b=3)P(b=3|a=3) = 1, so a search that maximizes the product of local conditionals prefers {33} over the ML state.
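A quick check of this example (illustrative code reproducing the counts above): the empirical ML state is (1,1), but the product of the dependency net's local conditionals is maximized at (3,3), which is why a Viterbi-style search over local conditionals need not return the ML sequence.

```python
# Verify the D = {11, 11, 11, 12, 21, 33} example from the slide.
from collections import Counter

D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]
joint = Counter(D)

def p_a_given_b(a, b):
    return joint[(a, b)] / sum(c for (x, y), c in joint.items() if y == b)

def p_b_given_a(b, a):
    return joint[(a, b)] / sum(c for (x, y), c in joint.items() if x == a)

print(joint.most_common(1))                    # ML state (1, 1), count 3
print(p_a_given_b(1, 1) * p_b_given_a(1, 1))   # 0.75 * 0.75 = 0.5625 < 1
print(p_a_given_b(3, 3) * p_b_given_a(3, 3))   # 1.0  * 1.0  = 1.0
```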

Results with model

Results with model

Results with model “Best” model includes some special unknown-word features, including “a crude company-name detector”

Results with model
Final test-set results: MXPost: 47.6, 96.4, 86.2; CRF+: 95.7, 76.4

Other comments Smoothing (quadratic regularization, aka Gaussian prior) is important—it avoids overfitting effects reported elsewhere
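For concreteness, this is the usual form of the smoothed objective in the CRF/maxent literature (standard notation, not copied from the slides): the conditional log-likelihood minus a quadratic penalty, where sigma is the standard deviation of the Gaussian prior on the weights.

```latex
\ell(\Lambda) \;=\; \sum_{j} \log p_\Lambda\big(y^{(j)} \mid x^{(j)}\big) \;-\; \sum_{k} \frac{\lambda_k^2}{2\sigma^2}
```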

More on smoothing...

Klein & Manning: Conditional Structure vs Estimation

Task 1: WSD (Word Sense Disambiguation) Bush’s election-year ad campaign will begin this summer, with... (sense1) Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails? (sense2) Class is sense1/sense2, features are context words.

Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model. Use the conditional (Bayes) rule to predict sense s from context-word observations o. Standard NB training maximizes the "joint likelihood" under the independence assumption.

Task 1: WSD (Word Sense Disambiguation)
Model 2: keep the same functional form, but maximize the conditional likelihood (sound familiar?), or maybe the SenseEval score, or maybe even accuracy itself.
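Stated as formulas (standard notation; the equations on the original slides did not extract): both models share the Naive Bayes functional form, and the difference is only in the objective that is maximized.

```latex
P(s, \vec{o}) = P(s)\prod_{j} P(o_j \mid s),
\qquad
P(s \mid \vec{o}) = \frac{P(s)\prod_{j} P(o_j \mid s)}{\sum_{s'} P(s')\prod_{j} P(o_j \mid s')}

JL(\theta) = \sum_{d} \log P_\theta\big(s_d, \vec{o}_d\big),
\qquad
CL(\theta) = \sum_{d} \log P_\theta\big(s_d \mid \vec{o}_d\big)
```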

Task 1: WSD (Word Sense Disambiguation)
– Optimize JL with standard NB learning
– Optimize SCL, CL with conjugate gradient
– Also over "non-deficient models" (?), using Lagrange penalties to enforce a "soft" version of the deficiency constraint (I think this makes sure the non-conditional version is a valid probability)
– "Punt" on optimizing accuracy
– Penalty for extreme predictions in SCL

Conclusion: maxent beats NB? All generalizations are wrong?

Task 2: POS Tagging
– Sequential problem: replace NB with an HMM model
– Standard algorithms maximize the joint likelihood
– Claim: keeping the same model but maximizing the conditional likelihood leads to a CRF. Is this true?
– The alternative is conditional structure (CMM)

Using conditional structure vs maximizing conditional likelihood
A CMM factors Pr(s,o) into Pr(s|o)Pr(o). For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e. the JL estimate = the CL estimate for Pr(s|o).
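A one-line way to see this (my notation): the parameters of the two factors are disjoint, so the joint log-likelihood decomposes into a conditional term plus an observation term, and maximizing the joint over the Pr(s|o) parameters is exactly maximizing the conditional likelihood; extra structure over the observations only changes the second term.

```latex
\sum_{d} \log \Pr_{\theta,\phi}(s_d, o_d)
\;=\;
\underbrace{\sum_{d} \log \Pr_{\theta}(s_d \mid o_d)}_{\text{CL objective}}
\;+\;
\sum_{d} \log \Pr_{\phi}(o_d)
```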

Task 2: POS Tagging
Experiments with a simple feature set:
– For a fixed model, CL is preferred to JL (CRF beats HMM)
– For a fixed objective, the HMM structure is preferred to the MEMM/CMM

Error analysis for POS tagging
Label bias is not the issue: state-state dependencies are weak compared to observation-state dependencies.
There is too much emphasis on the observation, not enough on previous states ("observation bias").
Put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy...

Error analysis for POS tagging