Conditional Markov Models: MaxEnt Tagging and MEMMs
William W. Cohen, Feb 8 IE Lecture

Top ten answers to “Is it cold enough for you?”
– No, but don’t change it just for me; the others might like it this way.
– No, in fact I need it to drop another 40 degrees to improve the statistical significance of the results in my upcoming grant proposal to ExxonMobil refuting the theory of global warming.
– No, but …? …?

Projects and Critiques
I believe I’ve responded to all submitted project proposals (even if briefly).
– If you haven’t heard from me, check in after class.
I believe two people have not submitted project proposals.
– If you haven’t done that, definitely check in with me.
Everyone: please look over these proposals
– even if you’re pretty sure what you’re going to do.
Likewise if you’re behind on critiques.
– Note: ZMM does not have a happy ending.
Reminder: next week, form teams.
– Singleton teams are not encouraged.

Sample Critiques

D. Freitag and N. Kushmerick. Boosted wrapper induction. In Proc. of the 17th National Conference on Artificial Intelligence (AAAI-2000), pages 577–583.

… There were several things that made this a strong paper. First, they were clear about where they were starting from (wrapper induction) and what their contribution was. Also, they described their algorithm with sufficient generality and clarity for a reader to implement it or adapt it to a different problem. They did not do much “feature engineering” or plug many outside resources into their system. I liked this approach, since they showed that it can be done easily in their framework, but also that they don’t need to do massive feature engineering to gain a performance improvement over other rule-based systems. I always like when a paper abstracts its novel pieces away from the problem at hand and presents them more theoretically than would be necessary to simply communicate a problem solution. The paper clearly has a very different feel than, say, Borthwick et al. (1998), which focused more on a systems view and delved deeply into the choice of features in the system.

Borthwick, A., Sterling, J., Agichtein, E., & Grishman, R. (1998). Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the Sixth Workshop on Very Large Corpora, New Brunswick, New Jersey. Association for Computational Linguistics.

… They describe the use of many knowledge sources in addition to standard features for named-entity recognition, but it’s unclear how much each of these helps performance. As a result, if researchers were to build a similar system based on this paper, they would probably have to discover these results themselves. It’s interesting how they didn’t do as well as the other systems on the surprise domain of the test data. Assuming they would be evaluated on data of the same domain as the training data, they did not hesitate to include domain-specific features in their model. It’s in contrast to the Jansche and Abney paper, in which the latter focused on features that would avoid overfitting on feature instances that are common in the training data.

… I was a bit disappointed that the authors stopped at L=4 in the table, when the first column was still increasing in accuracy – I wanted to see exactly how far it would go. (It also made me wonder if future work was possible to develop heuristics that would allow a larger window, with a sub-exponential increase in training time, perhaps only still using L words, but also adding a distance parameter, such that you could have something like ….)

… They also mention that they might use hundreds of these high-precision, low-recall patterns. While this might work well empirically, it just doesn’t seem very elegant. There is a certain appeal to looking for simple rules to identify fields, especially in highly structured text like certain CGI-generated web pages. But there is also an appeal to regularization: it does not seem that memorizing hundreds of one- or two-off rules is the best way to learn what’s really going on in a document.

Sample Critiques

======================================
Information Extraction from Voicemail Transcripts - Jansche & Abney
======================================

… One thing that was not clear to me, and hence I did not like, was the explanation of the exact feature representation for each task. For example, the authors repeatedly mention "a small set of lexical features" and a "handful of common first names" but do not explain where they came from, who selected them, or by what method …

… I was wondering why the authors decided to use classification for predicting the length of the caller phrases/names as opposed to regression. I realize that they have argued that these lengths are discrete-valued and therefore they chose classification, but the length attribute has the significance of order in its values …

… There were two things that really bothered me about the evaluation, however. One thing was that in their numbers they included the empty calls (hangups). … Secondly, and perhaps a larger issue for me, is that I'm not clear if the hand-crafted rules were made before or after they looked at the corpora they were using. It seems to me that if you hand-craft rules after looking at your data, what you are in essence doing is training on your test data. This makes it seem a bit unfair to compare these results against strictly machine-learning based approaches.

I particularly like one of the concluding points of the article: the authors clearly demonstrated that generic NE extractors cannot be used on every task. In the phone number extraction task, longer-range patterns are necessary, but the bi- or tri-gram features cannot reflect that. Their model, with features specifically designed for the task, is a clear winner.

Review: Hidden Markov Models
Efficient dynamic programming algorithms exist for:
– Finding Pr(S)
– Finding the highest-probability path P that maximizes Pr(S, P) (Viterbi)
– Training the model (Baum-Welch algorithm)
[Figure: a small example HMM with states S1–S4, transition probabilities, and emission distributions over the symbols A and C]
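As a refresher, here is a minimal sketch of the Viterbi recurrence for a discrete HMM (my illustration, not from the slides), assuming dictionaries start[s], trans[s1][s2], and emit[s][w] of probabilities are given:

import math

def viterbi(words, states, start, trans, emit):
    """Return the highest-probability state path for `words` (illustrative sketch)."""
    # delta[t][s] = best log-probability of any path that ends in state s at position t
    delta = [{s: math.log(start[s]) + math.log(emit[s].get(words[0], 1e-12)) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: delta[t - 1][p] + math.log(trans[p][s]))
            delta[t][s] = (delta[t - 1][prev] + math.log(trans[prev][s])
                           + math.log(emit[s].get(words[t], 1e-12)))  # crude floor for unseen words
            back[t][s] = prev
    # follow back-pointers from the best final state
    path = [max(states, key=lambda s: delta[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))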

HMM for Segmentation Simplest Model: One state per entity type

HMM Learning
Manually pick the HMM's graph (e.g., the simple model, or fully connected).
Learn transition probabilities: Pr(s_i | s_j)
Learn emission probabilities: Pr(w | s_i)

Learning model parameters
When the training data defines a unique path through the HMM:
– Transition probabilities: the probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions from state i)
– Emission probabilities: the probability of emitting symbol k from state i = (number of times k is generated from i) / (number of transitions from state i)
When the training data defines multiple paths:
– Use a more general EM-like algorithm (Baum-Welch)
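A minimal sketch of this counting estimate (my illustration, not from the slides), assuming the training data is a list of sequences of (word, state) pairs:

from collections import defaultdict

def mle_hmm(tagged_sequences):
    """Estimate Pr(s_j | s_i) and Pr(w | s_i) by relative-frequency counting."""
    trans_count = defaultdict(lambda: defaultdict(int))
    emit_count = defaultdict(lambda: defaultdict(int))
    for seq in tagged_sequences:                      # seq = [(word, state), ...]
        prev = None
        for word, state in seq:
            emit_count[state][word] += 1
            if prev is not None:
                trans_count[prev][state] += 1
            prev = state
    trans = {i: {j: c / sum(counts.values()) for j, c in counts.items()}
             for i, counts in trans_count.items()}
    emit = {s: {w: c / sum(counts.values()) for w, c in counts.items()}
            for s, counts in emit_count.items()}
    return trans, emit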

What is a “symbol”?
Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?
4601 => “4601”, “9999”, “9+”, “number”, … ?
DATAMOLD: choose the best abstraction level using a holdout set
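A minimal sketch of this kind of word-to-symbol abstraction (my illustration; the exact abstraction levels here are an assumption, not DATAMOLD's):

import re

def abstractions(token):
    """Return the token at several abstraction levels, coarsest last (illustrative)."""
    shape = re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", re.sub(r"[0-9]", "9", token)))
    short_shape = re.sub(r"(.)\1+", r"\1", shape)     # collapse runs: "Xxxxx" -> "Xx"
    category = "number" if token.isdigit() else "word"
    return [token, token.lower(), shape, short_shape, category]

# abstractions("Cohen") -> ["Cohen", "cohen", "Xxxxx", "Xx", "word"]
# abstractions("4601")  -> ["4601", "4601", "9999", "9", "number"]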

What is a symbol? Bikel et al. mix symbols from two abstraction levels.

What is a symbol?
Ideally we would like to use many, arbitrary, overlapping features of words:
identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …
[Figure: graphical model with states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}; the example word “Wisniewski” is part of a noun phrase and ends in “-ski”]
Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

Stupid HMM tricks
[Figure: a two-state “HMM”: from the start state, move to the “red” state with Pr(red) or the “green” state with Pr(green), then stay there, with Pr(green|green) = 1 and Pr(red|red) = 1]

Stupid HMM tricks
[Same two-state figure as before: choose “red” or “green” at the start, then stay in that state]
Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x)
argmax_y Pr(y|x) = argmax_y Pr(x|y) * Pr(y)
                 = argmax_y Pr(y) * Pr(x1|y) * Pr(x2|y) * … * Pr(xm|y)
Pr(“I voted for Ralph Nader” | ggggg) = Pr(g) * Pr(I|g) * Pr(voted|g) * Pr(for|g) * Pr(Ralph|g) * Pr(Nader|g)

HMM’s = sequential NB

From NB to Maxent

Learning: set the alpha parameters to maximize the (conditional) likelihood of the data, given that we’re using the same functional form as NB. It turns out this is the same as maximizing the entropy of p(y|x) over all distributions consistent with the feature constraints.
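For concreteness, a standard way to write the model and objective the slide refers to (my reconstruction in LaTeX; the slide's own formula is not preserved in this transcript):

p_\alpha(y \mid x) = \frac{\exp\big(\sum_j \alpha_j f_j(x,y)\big)}{\sum_{y'} \exp\big(\sum_j \alpha_j f_j(x,y')\big)},
\qquad
L(\alpha) = \sum_i \log p_\alpha(y_i \mid x_i).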

MaxEnt Comments
– Implementation: All methods are iterative. Numerical issues (underflow, rounding) are important. For NLP-like problems with many features, modern gradient-like or Newton-like methods work well – sometimes better(?) and faster than GIS and IIS.
– Smoothing: Typically maxent will overfit the data if there are many infrequent features. Common solutions: discard low-count features; early stopping with a holdout set; a Gaussian prior centered on zero to limit the size of the alphas (i.e., optimize log likelihood - sum_j alpha_j^2 / (2 sigma^2)).
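A small numpy sketch of the Gaussian-prior (L2-penalized) objective and its gradient for one training example (my illustration; `feats` holds the feature vector f(x, y') for every candidate label y'):

import numpy as np

def penalized_loglik_and_grad(alpha, feats, y, sigma2=1.0):
    """feats: array (n_labels, n_features); y: index of the true label.
    Returns (penalized log-likelihood, gradient) for one example.
    (In practice the penalty is applied once per dataset, not once per example.)"""
    scores = feats @ alpha
    scores = scores - scores.max()              # guard against overflow
    p = np.exp(scores)
    p = p / p.sum()                             # p(y' | x) under the current alphas
    loglik = np.log(p[y]) - alpha @ alpha / (2 * sigma2)
    grad = feats[y] - p @ feats - alpha / sigma2
    return loglik, grad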

MaxEnt Comments
– Performance: Good MaxEnt methods are competitive with linear SVMs and other state-of-the-art classifiers in accuracy. They can't as easily be extended to higher-order interactions (e.g., kernel SVMs, AdaBoost) – but see [Lafferty, Zhu, Liu, ICML 2004]. Training is relatively expensive.
– Embedding in a larger system: MaxEnt optimizes Pr(y|x), not error rate.

MaxEnt Comments
– MaxEnt competitors: Model Pr(y|x) with Pr(y|score(x)), using the score from SVMs, NB, …; Regularized Winnow, BPETs, …; ranking-based methods that estimate whether Pr(y1|x) > Pr(y2|x).
– Things I don't understand: Why don't we call it logistic regression? Why is it always used to estimate the density of (y,x) pairs rather than a separate density for each class y? When are its confidence estimates reliable?
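As one illustration of the “Pr(y|score(x))” idea, a Platt-scaling-style sketch of mine (not something specified on the slide) that fits a one-dimensional logistic model to a classifier's raw scores:

import numpy as np

def fit_score_calibrator(scores, labels, lr=0.1, iters=2000):
    """Fit Pr(y=1 | score) = sigmoid(a*score + b) by gradient ascent on the log-likelihood."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, float)
    a, b, n = 0.0, 0.0, len(scores)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        a += lr * np.sum((labels - p) * scores) / n
        b += lr * np.sum(labels - p) / n
    return a, b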

What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words. S t-1 S t O t S t+1 O t +1 O t - 1 identity of word ends in “-ski” is capitalized is part of a noun phrase is in a list of city names is under node X in WordNet is in bold font is indented is in hyperlink anchor … … … part of noun phrase is “Wisniewski” ends in “-ski”

What is a symbol?
[Same figure as before]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations.

What is a symbol?
[Same figure as before]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.

What is a symbol?
[Same figure as before]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history.

Ratnaparkhi’s MXPOST
Sequential learning problem: predict the POS tags of words.
Uses the MaxEnt model described above, with a rich feature set.
To smooth, discard features occurring < 10 times.
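A sketch of the kind of contextual and rare-word features such a tagger extracts for position i (my approximation of MXPOST-style feature templates, not Ratnaparkhi's exact list):

def tag_features(words, i, prev_tag, prev2_tag):
    """MXPOST-style features for tagging word i (illustrative)."""
    w = words[i]
    feats = [
        "w=" + w,
        "prev_tag=" + prev_tag,
        "prev2_tags=" + prev2_tag + "," + prev_tag,
        "prev_w=" + (words[i - 1] if i > 0 else "<BOS>"),
        "next_w=" + (words[i + 1] if i + 1 < len(words) else "<EOS>"),
    ]
    # rare-word features: prefixes/suffixes, capitalization, digits, hyphens
    feats += ["prefix=" + w[:k] for k in range(1, 5)]
    feats += ["suffix=" + w[-k:] for k in range(1, 5)]
    feats += ["has_cap=" + str(w[0].isupper()),
              "has_digit=" + str(any(c.isdigit() for c in w)),
              "has_hyphen=" + str("-" in w)]
    return feats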

MXPOST

MXPOST: learning & inference GIS Feature selection

Alternative inference schemes

MXPost inference

Inference for MENE
[Figure: a lattice with candidate tags B, I, O at each position of the sentence “When will prof Cohen post the notes …”]

Inference for MXPOST
[Same B/I/O lattice over “When will prof Cohen post the notes …”]
(Approximate view): find the best path; the weights are now on arcs from state to state.

Inference for MXPOST
[Same B/I/O lattice]
More accurately: find the total flow to each node; the weights are now on arcs from state to state.

Inference for MXPOST
[Same B/I/O lattice]
Find the best path? The best tree? The weights are on hyperedges.

Inference for MXPOST
[Figure: the lattice expanded so each node is a (previous tag, current tag) pair, e.g. iI, iO, oI, oO, over “When will prof Cohen post the notes …”]
Beam search is an alternative to Viterbi: at each stage, find all children of the states on the beam, score them, and discard all but the top n states.
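A minimal beam-search sketch (my illustration), assuming a local scoring function score(prev_tag, tag, words, i) that returns something like log Pr(tag | prev_tag, features of the words at position i):

def beam_search(words, tags, score, beam_size=5):
    """Keep only the top-n highest-scoring partial tag sequences at each position."""
    beam = [([], 0.0)]                        # (partial tag sequence, total log-score)
    for i in range(len(words)):
        candidates = []
        for seq, total in beam:
            prev = seq[-1] if seq else "<START>"
            for t in tags:
                candidates.append((seq + [t], total + score(prev, t, words, i)))
        # discard all but the top `beam_size` partial sequences
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0][0]                         # best complete tag sequence found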

MXPOST results
State-of-the-art accuracy (for 1996).
The same approach was used successfully for several other sequential classification steps of a stochastic parser (also state-of-the-art).
The same (or similar) approaches were used for NER by Borthwick, Malouf, Manning, and others.

MEMMs
Basic difference from ME tagging:
– ME tagging: the previous state is a feature of the MaxEnt classifier.
– MEMM: build a separate MaxEnt classifier for each state. You can build any HMM architecture you want, e.g., parallel nested HMMs, etc. The data is fragmented: examples where the previous tag is “proper noun” give no information about learning tags when the previous tag is “noun”.
– Mostly a difference in viewpoint.
– MEMM does allow the possibility of “hidden” states and Baum-Welch-like training.
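A minimal sketch of the MEMM viewpoint described above: one maxent (logistic) classifier per previous state, each trained only on the positions that follow that state. This is my illustration, assuming scikit-learn's LogisticRegression and DictVectorizer and a hypothetical feature extractor obs_features(words, i):

from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_memm(tagged_sequences, obs_features):
    """Train one classifier per previous state; tagged_sequences: list of [(word, tag), ...]."""
    per_state = defaultdict(lambda: ([], []))          # prev_tag -> (feature dicts, next tags)
    for seq in tagged_sequences:
        words = [w for w, _ in seq]
        prev = "<START>"
        for i, (_, tag) in enumerate(seq):
            X, y = per_state[prev]
            X.append(obs_features(words, i))           # observation features only; prev is implicit
            y.append(tag)
            prev = tag
    models = {}
    for prev, (X, y) in per_state.items():
        # (assumes each previous state is followed by at least two distinct tags in the data)
        vec = DictVectorizer()
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vec.fit_transform(X), y)
        models[prev] = (vec, clf)                      # the data is fragmented by previous state
    return models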

MEMM task: FAQ parsing

MEMM features

MEMMs

Some interesting points to ponder
“Easier to think of observations as part of the arcs, rather than the states.”
FeatureHMM works surprisingly(?) well.
Both approaches allow Pr(y_i | x, y_{i-1}, …) to be determined by arbitrary features of the history.
– “Factored” MEMM