MEMMs/CMMs and CRFs William W. Cohen Sep 22, 2010.

Announcements…

Wiki Pages - HowTo
Key points: http://malt.ml.cmu.edu/mw/index.php/Social_Media_Analysis_10-802_in_Spring_2010#Other_Resources
Example: http://malt.ml.cmu.edu/mw/index.php/Turney,_ACL_2002
Naming the pages – examples: [[Cohen ICML 1995]], [[Lin and Cohen ICML 2010]], [[Minkov et al IJCAI 2005]]
Structured links: [[AddressesProblem::named entity recognition]], [[UsesMethod::absolute discounting]], [[RelatedPaper::Pang et al ACL 2002]], [[UsesDataset::Citeseer]], [[Category::Paper]], [[Category::Problem]], [[Category::Method]], [[Category::Dataset]]
Rule of 2: don't create a page unless you expect 2 inlinks; a method from a paper that's not used anywhere else should be described in-line.
No inverse links – but you can emulate these with queries.

Wiki Pages – HowTo, cont'd
To turn in: add the pages to the wiki, add links to them on your user page, and send me an email with links to each page you want to get graded on. [I may send back bug reports until people get the hang of this…]
When to: three pages by 9/30 at midnight (actually 10/1 at dawn is fine).
Suggestion: think of your project and build pages for the dataset, the problem, and the (baseline) method you plan to use.

Projects
Some sample projects:
Apply an existing method to a new problem (e.g., http://staff.science.uva.nl/~otsur/papers/sarcasmAmazonICWSM10.pdf).
Apply a new method to an existing dataset.
Build something that might help you in your research, e.g., extract names of people (pundits, politicians, …) from political blogs, or classify folksonomy tags as person names, place names, …
On Wed 9/29 - "turn in" one page, covering some subset of: what you plan to do with what data; why you think it's interesting; any relevant superpowers you might have; how you plan to evaluate; what techniques you plan to use; what question you want to answer; who you might work with. These will be posted on the class web site.
On Friday 10/8: a similar abstract from each team. A team is (preferably) 2-3 people, but I'm flexible. Main new information: who's on what team.

Conditional Markov Models

What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: identity of word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, …
[Figure: an HMM with states S_{t-1}, S_t, S_{t+1} emitting observations O_{t-1}, O_t, O_{t+1}; example features of the observation "Wisniewski": is "Wisniewski", is part of a noun phrase, ends in "-ski".]
Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

Stupid HMM tricks
Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x)
[Figure: an HMM with a start state branching to a "green" state and a "red" state, each with a self-loop: Pr(green|green) = 1, Pr(red|red) = 1.]
argmax_y Pr(y|x) = argmax_y Pr(x|y) * Pr(y) = argmax_y Pr(y) * Pr(x1|y) * Pr(x2|y) * ... * Pr(xm|y)
Pr("I voted for Ralph Nader" | ggggg) = Pr(g) * Pr(I|g) * Pr(voted|g) * Pr(for|g) * Pr(Ralph|g) * Pr(Nader|g)
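
A minimal sketch of the "stupid HMM trick" above, treating Naive Bayes as an HMM with one state per class (g = green, r = red), so argmax_y Pr(y|x) = argmax_y Pr(y) * Pr(x1|y) * ... * Pr(xm|y). All probabilities here are made-up toy values.

    import math
    from collections import defaultdict

    # Toy class priors and per-class word probabilities; a real model would
    # estimate these from counts, with smoothing.
    prior = {"g": 0.5, "r": 0.5}
    word_prob = {
        "g": defaultdict(lambda: 1e-4, {"I": 0.05, "voted": 0.02, "for": 0.06}),
        "r": defaultdict(lambda: 1e-4, {"Ralph": 0.03, "Nader": 0.03, "for": 0.04}),
    }

    def nb_log_score(words, y):
        # log Pr(y) + sum_i log Pr(x_i | y): the one-state-per-class "HMM" view of NB
        return math.log(prior[y]) + sum(math.log(word_prob[y][w]) for w in words)

    words = "I voted for Ralph Nader".split()
    print({y: round(nb_log_score(words, y), 2) for y in prior})
    print("argmax:", max(prior, key=lambda y: nb_log_score(words, y)))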

From NB to Maxent

From NB to Maxent

What is a symbol? Identity of word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, …
[Figure: as before, states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}, with overlapping features of the observation "Wisniewski".]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history.
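
A hedged sketch of that idea: keep the state sequence, but model Pr(y_t | y_{t-1}, x_t) with a maxent (multinomial logistic) classifier over arbitrary, overlapping features. The feature names and weights below are illustrative, not taken from the lecture.

    import math

    def features(prev_tag, word):
        # Many arbitrary, overlapping, non-independent features are fine for maxent.
        return {
            "word=" + word.lower(): 1.0,
            "suffix=" + word[-3:]: 1.0,
            "isCap": 1.0 if word[0].isupper() else 0.0,
            "prevTag=" + prev_tag: 1.0,
        }

    def maxent_prob(weights, prev_tag, word, tags):
        # Pr(y_t | y_{t-1}, x_t) = exp(score(y_t)) / sum_{y'} exp(score(y'))
        def score(y):
            return sum(weights.get((y, f), 0.0) * v
                       for f, v in features(prev_tag, word).items())
        expscores = {y: math.exp(score(y)) for y in tags}
        z = sum(expscores.values())
        return {y: expscores[y] / z for y in tags}

    # Hand-set illustrative weights; in practice they are learned from labeled data.
    w = {("B", "isCap"): 1.5, ("B", "suffix=ski"): 2.0, ("I", "prevTag=B"): 1.0}
    print(maxent_prob(w, "O", "Wisniewski", ["B", "I", "O"]))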

Ratnaparkhi's MXPOST. Sequential learning problem: predict the POS tags of words. Uses the MaxEnt model described above with a rich feature set. To smooth, discard features occurring fewer than 10 times.
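
To make "rich feature set" concrete, here is a sketch of MXPOST-style feature templates; the exact templates and rare-word handling in Ratnaparkhi's tagger differ, and the count cutoff of 10 mentioned above would be applied after counting these over the training set.

    def tagging_features(words, i, prev_tag, prev2_tag):
        w = words[i]
        feats = {
            "w=" + w, "lower=" + w.lower(),
            "suf3=" + w[-3:], "suf4=" + w[-4:], "pre3=" + w[:3],
            "prevtag=" + prev_tag,
            "prev2tags=" + prev2_tag + "_" + prev_tag,
            "prevword=" + (words[i - 1] if i > 0 else "<S>"),
            "nextword=" + (words[i + 1] if i + 1 < len(words) else "</S>"),
        }
        if any(c.isdigit() for c in w):
            feats.add("hasDigit")
        if w[0].isupper():
            feats.add("isCap")
        return feats

    print(sorted(tagging_features("When will prof Cohen post".split(), 3, "NN", "MD")))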

MXPOST

Inference for MENE
[Figure: the sentence "When will prof Cohen post the notes …" with a single candidate tag (B) shown under each word.]

Inference for MXPOST
[Figure: a trellis over "When will prof Cohen post the notes …" with candidate states B, I, O at each position.]
(Approximate view): find the best path; weights are now on arcs from state to state.

Inference for MXPOST
[Figure: the same B/I/O trellis.]
More accurately: find the total flow to each node; weights are now on arcs from state to state.

Inference for MXPOST
[Figure: the same trellis, now drawn with hyperedges joining pairs of earlier states to each next state.]
Find the best path? The best tree? Weights are on hyperedges.

Inference for MxPOST
[Figure: beam search over "When will prof Cohen post the notes …": the one-tag hypotheses I and O expand to two-tag hypotheses iI, oI, iO, oO.]
Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.

Inference for MxPOST
[Figure: the next beam step: two-tag hypotheses expand to three-tag hypotheses such as oiI, oiO, ioI, ioO, ooI, ooO.]
Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.

Inference for MxPOST
[Figure: a further beam step, with surviving hypotheses such as oiiI, oiiO, iooI, iooO, oooI, oooO.]
Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.
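
A minimal beam-search decoder matching the picture built up over the last few slides: at each stage, extend every hypothesis with every tag, score the children with the local conditional model, and keep only the top n. The toy local model is invented just to make the sketch runnable; in practice it would be the maxent classifier sketched earlier.

    import math

    def beam_search(words, tags, local_prob, beam_size=3):
        beam = [(0.0, [])]                       # hypotheses: (log score, tag sequence)
        for i in range(len(words)):
            children = []
            for log_score, seq in beam:
                prev = seq[-1] if seq else "<START>"
                probs = local_prob(prev, words, i)   # dict: tag -> Pr(tag | prev, x, i)
                for tag in tags:
                    children.append((log_score + math.log(probs[tag]), seq + [tag]))
            # Score all children and discard all but the top n.
            beam = sorted(children, key=lambda c: c[0], reverse=True)[:beam_size]
        return beam[0][1]

    def toy_local_prob(prev, words, i):
        # Uniform toy model that slightly prefers B at the start of the sentence.
        base = {"B": 3.0 if prev == "<START>" else 1.0, "I": 1.0, "O": 1.0}
        z = sum(base.values())
        return {t: v / z for t, v in base.items()}

    print(beam_search("When will prof Cohen post the notes".split(),
                      ["B", "I", "O"], toy_local_prob, beam_size=2))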

MXPost results. State-of-the-art accuracy (for 1996). The same approach was used successfully for several other sequential classification steps of a stochastic parser (also state-of-the-art). The same (or similar) approaches have been used for NER by Borthwick, Malouf, Manning, and others.

Freitag, McCallum, Pereira

MEMMs
Basic difference from ME tagging:
ME tagging: the previous state is a feature of the MaxEnt classifier.
MEMM: build a separate MaxEnt classifier for each state. You can build any HMM architecture you want, e.g., parallel nested HMMs. But the data is fragmented: examples where the previous tag is "proper noun" give no information about learning tags when the previous tag is "noun".
Mostly a difference in viewpoint. The MEMM does allow the possibility of "hidden" states and Baum-Welch-like training. Viterbi is the most natural inference scheme.
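
A tiny sketch of the fragmentation point: in the MEMM view the training data is split by previous state, and each fragment trains its own classifier. The examples are toy ones.

    from collections import defaultdict

    # (previous tag, feature set, current tag) triples from labeled sequences.
    training = [
        ("NNP", {"w=Cohen", "isCap"}, "NNP"),
        ("NNP", {"w=post"}, "VB"),
        ("NN",  {"w=notes"}, "NNS"),
    ]

    # Group by previous state; each group would train its own MaxEnt classifier,
    # so examples with previous tag NNP never influence the classifier for NN.
    per_state_data = defaultdict(list)
    for prev_tag, feats, tag in training:
        per_state_data[prev_tag].append((feats, tag))

    for prev_tag, data in per_state_data.items():
        print("classifier for previous state", prev_tag, "sees", len(data), "examples")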

MEMM task: FAQ parsing

MEMM features

MEMMs

Conditional Random Fields

Implications of the MEMM model
Does this do what we want?
Q: does Y[i-1] depend on X[i+1]? ("A node is conditionally independent of its non-descendants given its parents.")
Q: what is Y[0] for the sentence "Qbbzzt of America Inc announced layoffs today in …"?

Inference for MXPOST
[Figure: the B/I/O trellis over "When will prof Cohen post the notes …" again.]
(Approximate view): find the best path; weights are now on arcs from state to state.

Inference for MXPOST
[Figure: the same trellis.]
More accurately: find the total flow to each node; weights are now on arcs from state to state. Flow out of a node is always fixed: the outgoing probabilities are locally normalized to sum to one.

Label Bias Problem
Consider this MEMM, and enough training data to perfectly model it: Pr(0123|rib) = 1, Pr(0453|rob) = 1.
[Figure: an MEMM whose two state paths spell out two words: 0-1-2-3 on "r i b" and 0-4-5-3 on "r o b".]
But the locally normalized model gives:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1
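
A numeric check of the example (the transition probabilities are what a locally normalized MEMM fit to this data would assign; states 1, 2, 4, and 5 each have a single outgoing arc, so they must pass all their probability mass along it no matter which symbol is observed).

    # Locally normalized transition probabilities Pr(next_state | state, symbol).
    P = {
        (0, "r"): {1: 0.5, 4: 0.5},
        (1, "i"): {2: 1.0}, (1, "o"): {2: 1.0},
        (2, "b"): {3: 1.0},
        (4, "i"): {5: 1.0}, (4, "o"): {5: 1.0},
        (5, "b"): {3: 1.0},
    }

    def path_prob(path, word):
        p = 1.0
        for (s, t), sym in zip(zip(path, path[1:]), word):
            p *= P[(s, sym)].get(t, 0.0)
        return p

    print("Pr(0123|rib) =", path_prob([0, 1, 2, 3], "rib"))  # 0.5
    print("Pr(0123|rob) =", path_prob([0, 1, 2, 3], "rob"))  # also 0.5: the wrong path
    print("Pr(0453|rob) =", path_prob([0, 4, 5, 3], "rob"))  # 0.5
    print("Pr(0453|rib) =", path_prob([0, 4, 5, 3], "rib"))  # also 0.5: label bias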

How important is label bias? It could be avoided in this case by changing the model structure. Our models are always wrong – is this "wrongness" a problem? See Klein & Manning's paper for more on this…

Another view of label bias [Sha & Pereira] So what’s the alternative?

Inference for MXPOST
[Figure: the B/I/O trellis again.]
More accurately: find the total flow to each node; weights are now on arcs from state to state. Flow out of a node is always fixed.

Another max-flow scheme
[Figure: the same trellis.]
More accurately: find the total flow to each node; weights are now on arcs from state to state. Flow out of a node is always fixed.

Another max-flow scheme: MRFs
[Figure: the same trellis over "When will prof Cohen post the notes …".]
Goal is to learn how to weight edges in the graph:
weight(y_i, y_{i+1}) = 2*[(y_i = B or I) and isCap(x_i)] + 1*[y_i = B and isFirstName(x_i)] - 5*[y_{i+1} ≠ B and isLower(x_i) and isUpper(x_{i+1})]
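
The same edge-weight formula as code; the three features and the weights (2, 1, -5) follow the slide, while the toy gazetteer and tag names are illustrative only. In an MRF/CRF the weights would be learned rather than hand-set.

    FIRST_NAMES = {"william", "cohen"}   # toy gazetteer, purely illustrative

    def edge_weight(y_i, y_next, x_i, x_next):
        # weight(y_i, y_{i+1}) as a weighted sum of indicator features on the edge.
        w = 0.0
        if y_i in ("B", "I") and x_i[0].isupper():
            w += 2.0     # 2 * [(y_i = B or I) and isCap(x_i)]
        if y_i == "B" and x_i.lower() in FIRST_NAMES:
            w += 1.0     # 1 * [y_i = B and isFirstName(x_i)]
        if y_next != "B" and x_i.islower() and x_next[0].isupper():
            w -= 5.0     # -5 * [y_{i+1} != B and isLower(x_i) and isUpper(x_{i+1})]
        return w

    print(edge_weight("B", "I", "Cohen", "post"))   # 3.0
    print(edge_weight("O", "O", "the", "Notes"))    # -5.0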

Another max-flow scheme: MRFs
[Figure: the same trellis.]
Find the total flow to each node; weights are now on edges from state to state. Goal is to learn how to weight edges in the graph, given features from the examples.

CRFs vs MEMMs
MEMMs: sequence classification f: x → y is reduced to many cases of ordinary classification, f: x_i → y_i, combined with Viterbi or beam search. [Figure: a chain of local classifiers such as Pr(Y|x2, y1), Pr(Y|x2, y1'), Pr(Y|x4, y3), Pr(Y|x5, y5) over x1 … x6 and y1 … y6.]
CRFs: sequence classification f: x → y is done by converting x, Y to an MRF with potentials φ(Y1, Y2), φ(Y2, Y3), …, and then using "flow" computations on the MRF to compute the best y|x. [Figure: the corresponding undirected chain over y1 … y6, conditioned on x1 … x6.]

The math: Review of maxent

Review of maxent/MEMM/CMMs We know how to compute this.
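
The formula on this slide is shown as an image; "this" presumably refers to the locally normalized maxent/CMM conditional, which in standard notation (with weights λ_j and feature functions f_j) is

    \Pr(y_t \mid y_{t-1}, x_t)
      = \frac{\exp\big(\sum_j \lambda_j f_j(x_t, y_{t-1}, y_t)\big)}
             {\sum_{y'} \exp\big(\sum_j \lambda_j f_j(x_t, y_{t-1}, y')\big)}

and the denominator is computed by summing over the (small) set of possible next states.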

Details on CMMs

From CMMs to CRFs
Recall why we're unhappy: we don't want local normalization.
New model (written out below). How to compute this?
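
A standard way to write the globally normalized "new model" is the linear-chain CRF of Lafferty et al. (reconstructed here, since the slide shows it as an image):

    \Pr(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \sum_k \lambda_k f_k(y_{i-1}, y_i, x, i)\Big),
    \qquad
    Z(x) = \sum_{y'} \exp\Big(\sum_i \sum_k \lambda_k f_k(y'_{i-1}, y'_i, x, i)\Big)

The single partition function Z(x) sums over all label sequences, which is exactly what makes "how to compute this?" nontrivial; the Sha & Pereira slides below answer it with forward-backward style matrix computations.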

What's the new model look like? What's independent? If f_i is HMM-like and depends only on (x_j, y_j) or (y_j, y_{j-1}):
[Figure: a chain y1 - y2 - y3, with each y_j connected only to its own observation x_j.]

What's the new model look like? What's independent now??
[Figure: the chain y1 - y2 - y3, with every label connected to the entire observation sequence x.]

CRF learning – from Sha & Pereira

CRF learning – from Sha & Pereira

CRF learning – from Sha & Pereira
Something like forward-backward.
Idea: define a matrix of y, y' "affinities" at stage i:
M_i[y, y'] = "unnormalized probability" of a transition from y to y' at stage i
M_i * M_{i+1} = "unnormalized probability" of any path through stages i and i+1
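
A small sketch of that matrix idea (start/stop states and per-node observation features are simplified away): build each M_i from the edge scores, and chained matrix products accumulate unnormalized path mass, from which Z(x) falls out.

    import numpy as np

    def transition_matrices(n_positions, labels, log_potential):
        # M_i[y, y'] = exp(score of the y -> y' edge at stage i),
        # i.e. the "unnormalized probability" of that transition.
        return [np.array([[np.exp(log_potential(i, y, y2)) for y2 in labels]
                          for y in labels])
                for i in range(n_positions)]

    def partition_function(mats):
        # M_1 * M_2 * ... sums unnormalized probability over all label paths;
        # summing the resulting vector gives Z(x).
        alpha = np.ones(mats[0].shape[0])
        for M in mats:
            alpha = alpha @ M
        return float(alpha.sum())

    labels = ["B", "I", "O"]
    toy_potential = lambda i, y, y2: 0.5 if (y, y2) == ("B", "I") else 0.0  # illustrative
    print("Z(x) =", partition_function(transition_matrices(4, labels, toy_potential)))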

[Figure: the two graphical models over y1, y2, y3 and x, shown side by side.]

Forward-backward ideas
[Figure: a two-state (name / nonName) trellis over three positions, with letters labeling the edge weights used in the forward-backward computation.]

CRF learning – from Sha & Pereira

Sha & Pereira results CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

Sha & Pereira results (reported in minutes, 375k examples).

Klein & Manning: Conditional Structure vs Estimation

Task 1: WSD (Word Sense Disambiguation) Bush’s election-year ad campaign will begin this summer, with... (sense1) Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails? (sense2) Class is sense1/sense2, features are context words.

Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model. Use the conditional rule to predict sense s from context-word observations o. Standard NB training maximizes the "joint likelihood" under an independence assumption.
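
Written out (a standard reconstruction of the formulas shown as images on the slide), the multinomial NB joint model and the conditional prediction rule are

    \Pr(s, o) = \Pr(s) \prod_j \Pr(o_j \mid s), \qquad
    \hat{s} = \arg\max_s \Pr(s \mid o) = \arg\max_s \Pr(s) \prod_j \Pr(o_j \mid s)

and joint-likelihood training chooses parameters to maximize \sum_{(s,o)} \log \Pr(s, o) over the training data.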

Task 1: WSD (Word Sense Disambiguation) Model 2: Keep same functional form, but maximize conditional likelihood (sound familiar?) or maybe SenseEval score: or maybe even:

In other words…
[Figure: the MaxEnt and Naïve Bayes objectives shown side by side.]
Different "optimization goals" … or, dropping a constraint about the f's and λ's.

Task 1: WSD (Word Sense Disambiguation)
Optimize JL with standard NB learning; optimize SCL and CL with conjugate gradient.
Also over "non-deficient models" (?), using Lagrange penalties to enforce a "soft" version of the deficiency constraint (I think this makes sure the non-conditional version is a valid probability).
"Punt" on optimizing accuracy.
Penalty for extreme predictions in SCL.

Conclusion: maxent beats NB? All generalizations are wrong?

Task 2: POS Tagging
Sequential problem. Replace NB with an HMM model. Standard algorithms maximize joint likelihood.
Claim: keeping the same model but maximizing conditional likelihood leads to a CRF. Is this true?
The alternative is conditional structure (a CMM).

HMM vs CRF
[Figure: the HMM and CRF models shown side by side.]

Using conditional structure vs maximizing conditional likelihood
The CMM factors Pr(s,o) into Pr(s|o)Pr(o). For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e., the JL estimate = the CL estimate for Pr(s|o).
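
Writing the argument out (a reconstruction consistent with the slide's notation): since the parameters of Pr(s|o) and Pr(o) are separate,

    \log \Pr(s, o) = \log \Pr(s \mid o) + \log \Pr(o)

so maximizing the joint likelihood decouples into two separate maximizations, and the Pr(s|o) part is estimated exactly as it would be under conditional likelihood; extra dependencies among observations only change the Pr(o) factor.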

Task 2: POS Tagging
Experiments with a simple feature set:
For a fixed model, CL is preferred to JL (CRF beats HMM).
For a fixed objective, HMM is preferred to MEMM/CMM.

Error analysis for POS tagging
Label bias is not the issue: state-state dependencies are weak compared to observation-state dependencies.
There is too much emphasis on the observation, not enough on previous states ("observation bias").
Put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy…

Error analysis for POS tagging