Review: Hidden Markov Models
Efficient dynamic programming algorithms exist for
–Finding Pr(S)
–The highest-probability path P that maximizes Pr(S,P) (Viterbi)
Training the model
–State sequence known: MLE + smoothing
–Otherwise: Baum-Welch algorithm
[Figure: a small HMM with states S1–S4, transition probabilities (e.g. 0.5), and emissions over the alphabet {A, C}]
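For concreteness, a minimal sketch of the dynamic program for Pr(S) (the forward algorithm), assuming the model is given as NumPy arrays; all names here are illustrative, not from the slides:

```python
import numpy as np

def forward_prob(obs, start, trans, emit):
    """Pr(S): total probability of an observation sequence under an HMM.

    obs   : list of observation indices
    start : start[i]    = Pr(first state is i)
    trans : trans[i][j] = Pr(next state is j | current state is i)
    emit  : emit[i][o]  = Pr(observation o | state i)
    """
    alpha = start * emit[:, obs[0]]            # alpha[i] = Pr(o_1, state_1 = i)
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]   # sum over all previous states
    return alpha.sum()
```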

HMM for Segmentation
Simplest model: one state per entity type

HMM Learning
Manually pick the HMM's graph (e.g. the simple model, or fully connected)
Learn transition probabilities: Pr(s_i | s_j)
Learn emission probabilities: Pr(w | s_i)

Learning model parameters
When the training data defines a unique path through the HMM:
–Transition probabilities: probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions from state i)
–Emission probabilities: probability of emitting symbol k from state i = (number of times k is generated from i) / (total number of symbols generated from state i)
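A sketch of these counting estimates in code, assuming the training data is given as sequences of (word, state) pairs with the state path known; the names are mine:

```python
from collections import Counter

def mle_hmm(tagged_seqs):
    """Estimate transition and emission probabilities from labeled state paths.

    tagged_seqs: list of [(word, state), ...] sequences.
    """
    trans, emit = Counter(), Counter()
    out_of, in_state = Counter(), Counter()
    for seq in tagged_seqs:
        for (_, s), (_, s_next) in zip(seq, seq[1:]):
            trans[(s, s_next)] += 1          # count(i -> j)
            out_of[s] += 1                   # total transitions out of i
        for w, s in seq:
            emit[(s, w)] += 1                # count(i emits w)
            in_state[s] += 1                 # total symbols generated from i
    p_trans = {k: c / out_of[k[0]] for k, c in trans.items()}
    p_emit = {k: c / in_state[k[0]] for k, c in emit.items()}
    return p_trans, p_emit
```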

What is a “symbol”???
Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?
4601 => “4601”, “9999”, “9+”, “number”, … ?
Datamold: choose the best abstraction level using a holdout set
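A hypothetical sketch of the kind of abstraction ladder the slide is pointing at (purely illustrative, not Datamold's actual procedure):

```python
import re

def abstractions(token):
    """Return increasingly coarse views of a token, from most to least specific."""
    shape = re.sub(r"[A-Z]", "X",
            re.sub(r"[a-z]", "x",
            re.sub(r"[0-9]", "9", token)))
    short_shape = re.sub(r"(.)\1+", r"\1+", shape)   # collapse repeats: Xxxxx -> Xx+
    coarse = "number" if token.isdigit() else "word"
    return [token, token.lower(), shape, short_shape, coarse]

# abstractions("Cohen") -> ['Cohen', 'cohen', 'Xxxxx', 'Xx+', 'word']
# abstractions("4601")  -> ['4601', '4601', '9999', '9+', 'number']
```

A holdout set can then be used to decide which of these levels generalizes best for each emission distribution.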

What is a symbol?
Ideally we would like to use many, arbitrary, overlapping features of words.
[Figure: HMM states S_{t-1}, S_t, S_{t+1} with observations O_{t-1}, O_t, O_{t+1}; example features of a word such as "Wisniewski": identity of word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, …]
We can extend the HMM model so that each state generates multiple "features" – but they should be independent.

Borthwick et al. solution
We could use YFCL: an SVM, logistic regression, a decision tree, … We'll be talking about logistic regression.
Instead of an HMM, classify each token. Don't learn transition probabilities; instead, constrain them at test time.
[Figure: the same state/observation diagram and word-feature list as on the previous slide]

Stupid HMM tricks
[Figure: a "start" state that transitions to a red state with probability Pr(red) and to a green state with probability Pr(green); each state loops to itself: Pr(green|green) = 1, Pr(red|red) = 1]

Stupid HMM tricks
[Figure: the same "start" state with Pr(red), Pr(green), and self-loops Pr(green|green) = 1, Pr(red|red) = 1]
Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x)
argmax_y Pr(y|x) = argmax_y Pr(x|y) * Pr(y)
                 = argmax_y Pr(y) * Pr(x_1|y) * Pr(x_2|y) * … * Pr(x_m|y)
Pr("I voted for Ralph Nader" | ggggg) = Pr(g) * Pr(I|g) * Pr(voted|g) * Pr(for|g) * Pr(Ralph|g) * Pr(Nader|g)

HMMs = sequential NB

From NB to Maxent

Or:
Idea: keep the same functional form as naïve Bayes, but pick the parameters to optimize performance on the training data. One possible definition of performance is the conditional log-likelihood of the data:
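A standard way to write that objective out (my notation: α_j are the feature weights, as in the later slides, and f_j are the features):

```latex
\ell(\alpha) \;=\; \sum_{i=1}^{n} \log \Pr_{\alpha}(y_i \mid x_i)
\;=\; \sum_{i=1}^{n} \Big( \sum_{j} \alpha_j f_j(x_i, y_i)
\;-\; \log \sum_{y'} \exp \sum_{j} \alpha_j f_j(x_i, y') \Big)
```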

MaxEnt Comments
–Implementation:
All methods are iterative.
For NLP-like problems with many features, modern gradient-like or Newton-like methods work well.
Thursday I'll derive the gradient for CRFs.
–Smoothing:
Typically maxent will overfit the data if there are many infrequent features.
Old-school solutions: discard low-count features; early stopping with a holdout set; …
Modern solutions: penalize large parameter values with a prior centered on zero to limit the size of the alphas (i.e., optimize log-likelihood minus a penalty summed over the alphas); other regularization techniques.
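The "modern solution" above, written out: maximize the conditional log-likelihood minus a zero-centered Gaussian prior (an L2 penalty) on the weights. The variance σ² is a regularization knob of my choosing, not something from the slide:

```latex
\hat{\alpha} \;=\; \arg\max_{\alpha} \;\;
\sum_{i=1}^{n} \log \Pr_{\alpha}(y_i \mid x_i)
\;-\; \frac{1}{2\sigma^{2}} \sum_{j} \alpha_j^{2}
```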

What is a symbol?
Ideally we would like to use many, arbitrary, overlapping features of words.
[Figure: the same state/observation diagram and word-feature list as before]

Borthwick et al. idea
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations.
[Figure: the same state/observation diagram and word-feature list as before]

Another idea…
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.
[Figure: the same state/observation diagram and word-feature list as before]

MaxEnt taggers and MEMMs
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history.
Learning does not change – you've just added a few additional features, namely the previous labels.
Classification is trickier – we don't know the previous-label features at test time – so we will need to search for the best sequence of labels (as for an HMM).
[Figure: the same state/observation diagram and word-feature list as before]
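A sketch of the "few additional features" point, assuming a dictionary-of-features representation; all names here are illustrative:

```python
def memm_features(tokens, i, prev_label):
    """Features for tagging tokens[i]: ordinary word features plus the previous label."""
    w = tokens[i]
    return {
        "word=" + w.lower(): 1,
        "is_capitalized": int(w[:1].isupper()),
        "ends_in_-ski": int(w.endswith("ski")),
        "prev_label=" + prev_label: 1,   # the only new feature vs. a per-token classifier
    }

# At training time prev_label is the gold previous tag; at test time it has to come
# from the search over label sequences described above.
```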

Partial history of the idea
Sliding-window classifiers
–Sejnowski's NETtalk, mid 1980s
Recurrent neural networks and other "recurrent" sliding-window classifiers
–Late 1980s and 1990s
Ratnaparkhi's thesis
–Mid-to-late 1990s
Freitag, McCallum & Pereira, ICML 2000
–Formalize the notion of MEMM
OpenNLP
–Based largely on MaxEnt taggers; Apache open source

Ratnaparkhi’s MXPOST Sequential learning problem: predict POS tags of words. Uses MaxEnt model described above. Rich feature set. To smooth, discard features occurring < 10 times.

MXPOST

MXPOST: learning & inference GIS Feature selection

Using the HMM to segment
Find the highest-probability path through the HMM.
Viterbi: quadratic dynamic-programming algorithm.
[Figure: a lattice over observations o_t ("Butler", "Highway", "Greenville", …) with states House, Road, City, Pin at each position]
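A minimal Viterbi sketch matching the description above (quadratic in the number of states per position); parameter names follow the earlier forward-algorithm sketch and are mine, not the slides':

```python
import numpy as np

def viterbi(obs, start, trans, emit):
    """Most probable state path for an observation sequence.

    start[i], trans[i][j], emit[i][o] are probabilities as in forward_prob above.
    Assumes strictly positive probabilities (we work in log space).
    Runs in O(len(obs) * n_states^2).
    """
    delta = np.log(start) + np.log(emit[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(trans)   # scores[i, j]: best path ending with i -> j
        back.append(scores.argmax(axis=0))        # best predecessor for each state j
        delta = scores.max(axis=0) + np.log(emit[:, o])
    path = [int(delta.argmax())]
    for bp in reversed(back):                     # follow back-pointers to recover the path
        path.append(int(bp[path[-1]]))
    return path[::-1]
```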

Alternative inference schemes

MXPost inference

Inference for MENE (Borthwick et al. system)
[Figure: a lattice with states B, I, O at each token of "When will prof Cohen post the notes …"]
Goal: the best legal path through the lattice (i.e., the path that runs through the most black ink; like Viterbi, but the costs of possible transitions are ignored).

Inference for MXPOST
[Figure: the same B/I/O lattice over "When will prof Cohen post the notes …"]
(Approximate view): find the best path; weights are now on arcs from state to state.
Window of k tags; here k = 1.

Inference for MXPOST
[Figure: the same B/I/O lattice as above]
More accurately: find the total flow to each node; weights are now on arcs from state to state.

Inference for MXPOST
[Figure: the same B/I/O lattice as above]
Find the best path? tree? Weights are on hyperedges.

Inference for MXPOST
[Figure: a lattice of tag histories (I, O, iI, iO, oI, oO, …) over "When will prof Cohen post the notes …"]
Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.
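A minimal beam-search sketch of this idea; `score_fn` stands in for whatever local scorer is used (e.g. the MaxEnt classifier's log-probability of a tag given the word and tag history) and is an assumption, not part of MXPOST itself:

```python
def beam_search(tokens, tags, score_fn, beam_width=5):
    """Keep only the top-n partial tag sequences at each position.

    score_fn(tokens, i, history, tag) -> log-probability of `tag` at position i
    given the tag history so far.
    """
    beam = [([], 0.0)]                       # (partial tag sequence, cumulative log-prob)
    for i in range(len(tokens)):
        candidates = [
            (hist + [t], score + score_fn(tokens, i, hist, t))
            for hist, score in beam
            for t in tags
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]       # discard all but the top n
    return beam[0][0]                        # best surviving complete tag sequence
```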

MXPOST results
State-of-the-art accuracy (for 1996).
The same approach was used successfully for several other sequential classification steps of a stochastic parser (also state of the art).
The same (or similar) approaches were used for NER by Borthwick, Malouf, Manning, and others.

MEMMs
Basic difference from ME tagging:
–ME tagging: the previous state is a feature of the MaxEnt classifier
–MEMM: build a separate MaxEnt classifier for each state. Can build any HMM architecture you want, e.g. parallel nested HMMs, etc. Data is fragmented: examples where the previous tag is "proper noun" give no information about learning tags when the previous tag is "noun"
–Mostly a difference in viewpoint
–MEMM does allow the possibility of "hidden" states and Baum-Welch-like training
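A sketch of the "separate classifier per state" view and the data fragmentation it causes; `extract_features` and `train_classifier` are hypothetical stand-ins for any feature extractor and MaxEnt trainer:

```python
from collections import defaultdict

def train_memm(tagged_seqs, extract_features, train_classifier):
    """One MaxEnt classifier per previous state, rather than one shared
    classifier that merely has a previous-label feature."""
    examples = defaultdict(list)                 # previous tag -> list of (features, tag)
    for seq in tagged_seqs:                      # seq = [(word, tag), ...]
        words = [w for w, _ in seq]
        prev = "START"
        for i, (_, tag) in enumerate(seq):
            # Each example only reaches the classifier for its own previous tag,
            # so the training data is fragmented across the per-state classifiers.
            examples[prev].append((extract_features(words, i), tag))
            prev = tag
    return {prev: train_classifier(data) for prev, data in examples.items()}
```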

MEMM task: FAQ parsing

MEMM features

MEMMs

Looking forward
HMMs
–Easy-to-train generative model
–Features for a state must be independent (−)
MaxEnt tagger / MEMM
–Multiple cascaded classifiers
–Features can be arbitrary (+)
–Have we given anything up?

HMM inference
[Figure: the segmentation lattice again, with states House, Road, City, Pin at each position o_t]
The total probability of transitions out of a state must sum to 1.
But… they can all lead to "unlikely" states.
So… a state can be a (probable) "dead end" in the lattice.

Inference for MXPOST
[Figure: the same B/I/O lattice as above]
More accurately: find the total flow to each node; weights are now on arcs from state to state.
Flow out of each node is always fixed: the locally normalized probabilities out of a node sum to 1.

Label Bias Problem (Lafferty, McCallum & Pereira, ICML 2001)
Consider this MEMM, and enough training data to perfectly model it:
[Figure: a small finite-state MEMM with two branches, 0→1→2→3 and 0→4→5→3, for the words "rib" and "rob"]
In the training data: Pr(0123|rib) = 1 and Pr(0453|rob) = 1.
But the MEMM computes:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1
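A tiny numeric illustration of why the observation gets ignored: each state normalizes over its own successors, so a state with a single outgoing transition passes all of its probability mass along no matter how strongly the observation argues otherwise. The raw scores below are made up for the demonstration:

```python
import math

def local_probs(scores):
    """Per-state normalization, as in an MEMM: a softmax over the successors only."""
    z = sum(math.exp(s) for s in scores.values())
    return {nxt: math.exp(s) / z for nxt, s in scores.items()}

# State 1 strongly prefers observation 'i' over 'o', but it has only one successor
# (state 2), so after local normalization the observation makes no difference at all:
print(local_probs({"2": 10.0}))    # {'2': 1.0}  given 'i'
print(local_probs({"2": -10.0}))   # {'2': 1.0}  given 'o'
```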

Another max-flow scheme
[Figure: the same B/I/O lattice as above]
More accurately: find the total flow to each node; weights are now on arcs from state to state.
Flow out of a node is always fixed.

Another max-flow scheme: MRFs
[Figure: the same B/I/O lattice over "When will prof Cohen post the notes …"]
Goal is to learn how to weight edges in the graph:
weight(y_i, y_{i+1}) = 2·[(y_i = B or I) and isCap(x_i)]
                     + 1·[(y_i = B) and isFirstName(x_i)]
                     − 5·[(y_{i+1} ≠ B) and isLower(x_i) and isUpper(x_{i+1})]
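The slide's example edge weight, transcribed as code: the weights (2, 1, −5) come from the slide, while `is_first_name` and the capitalization/lowercase tests are assumed helper predicates:

```python
def edge_score(y_cur, y_next, x_cur, x_next, is_first_name):
    """Weight of the edge between the labels of two adjacent tokens."""
    score = 0.0
    if y_cur in ("B", "I") and x_cur[:1].isupper():      # 2*[(y_i = B or I) and isCap(x_i)]
        score += 2.0
    if y_cur == "B" and is_first_name(x_cur):            # 1*[(y_i = B) and isFirstName(x_i)]
        score += 1.0
    if y_next != "B" and x_cur.islower() and x_next[:1].isupper():
        score -= 5.0                                     # -5*[(y_{i+1} != B) and isLower(x_i) and isUpper(x_{i+1})]
    return score
```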

Another max-flow scheme: MRFs
[Figure: the same B/I/O lattice as above]
Find the total flow to each node; weights are now on edges from state to state.
Goal is to learn how to weight edges in the graph, given features from the examples.

Another view of label bias [Sha & Pereira] So what’s the alternative?