Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture
Top ten answers to “Is it cold enough for you?” No, but don’t change it just for me, the others might like it this way. No, in fact I need it to drop another 40 degrees to improve the statistical significance of the results in my upcoming grant proposal to Exxon-Mobil refuting the theory of global warming. No, but …? …?
Projects and Critiques I believe I’ve responded to all submitted project proposals (even if briefly). –If you haven’t heard from me, check in after class. I believe two people have not submitted project proposals. –If you haven’t done that, definitely check in with me. Everyone: please look over these proposals –even if you’re pretty sure what you’re going to do. Likewise if you’re behind on critiques. –Note: ZMM does not have a happy ending. Reminder: form teams next week. –Singleton teams are not encouraged.
Sample Critiques

D. Freitag and N. Kushmerick. Boosted wrapper induction. In Proc. of the 17th National Conference on Artificial Intelligence (AAAI-2000), pages 577–583, …

There were several things that made this a strong paper. First, they were clear about where they were starting from (wrapper induction) and what their contribution was. Also, they described their algorithm with sufficient generality and clarity for a reader to implement it or adapt it to a different problem. They did not do much “feature engineering” or plug many outside resources into their system. I liked this approach, since they showed that it can be done easily in their framework, but also that they don’t need to do massive feature engineering to gain a performance improvement over other rule-based systems. I always like it when a paper abstracts its novel pieces away from the problem at hand and presents them more theoretically than would be necessary to simply communicate a problem solution. The paper clearly has a very different feel than, say, Borthwick et al. (1998), which focused more on a systems view and delved deeply into the choice of features in the system.

Borthwick, A., Sterling, J., Agichtein, E., & Grishman, R. (1998). Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the Sixth Workshop on Very Large Corpora, New Brunswick, New Jersey. Association for Computational Linguistics.

… They describe the use of many knowledge sources in addition to standard features for named-entity recognition, but it’s unclear how much each of these helps performance. As a result, if researchers were to build a similar system based on this paper, they would probably have to discover these results themselves. It’s interesting how they didn’t do as well as the other systems on the surprise domain of the test data. Assuming they would be evaluated on data of the same domain as the training data, they did not hesitate to include domain-specific features in their model. This is in contrast to the Jansche and Abney paper, in which the latter focused on features that would avoid overfitting on feature instances that are common in the training data.

… I was a bit disappointed that the authors stopped at L=4 in the table, when the first column was still increasing in accuracy – I wanted to see exactly how far it would go. (It also made me wonder whether future work could develop heuristics that would allow a larger window with a sub-exponential increase in training time, perhaps still using only L words but also adding a distance parameter, such that you could have something like ….)

… They also mention that they might use hundreds of these high-precision, low-recall patterns. While this might work well empirically, it just doesn't seem very elegant. There is a certain appeal to looking for simple rules to identify fields, especially in highly structured text like certain CGI-generated web pages. But there is also an appeal to regularization: it does not seem that memorizing hundreds of one- or two-off rules is the best way to learn what's really going on in a document.
Sample Critiques

====================================== Information Extraction from Voicemail Transcripts – Jansche & Abney ======================================

… One thing that was not clear to me, and hence I did not like, was the explanation of the exact feature representation for each task. For example, the authors repeatedly mention "a small set of lexical features" and a "handful of common first names" but do not explain where they came from, who selected them, or by what method …

I was wondering why the authors decided to use classification for predicting the length of the caller phrases/names as opposed to regression. I realize that they have argued that these lengths are discrete-valued and therefore they chose classification, but the length attribute has the significance of order in its values …

… There were two things that really bothered me about the evaluation, however. One thing was that in their numbers they included the empty calls (hangups). … Secondly, and perhaps a larger issue for me, is that I'm not clear if the hand-crafted rules were made before or after they looked at the corpora they were using. It seems to me that if you hand-craft rules after looking at your data, what you are in essence doing is training on your test data. This makes it seem a bit unfair to compare these results against strictly machine-learning-based approaches.

I particularly like one of the concluding points of the article: the authors clearly demonstrated that generic NE extractors cannot be used on every task. In the phone-number extraction task, longer-range patterns are necessary, but the bi- or tri-gram features cannot reflect that. Their model, with features specifically designed for the task, is a clear winner.
Review: Hidden Markov Models Efficient dynamic programming algorithms exist for –Finding Pr(S) –Finding the highest-probability path P, i.e., maximizing Pr(S,P) (Viterbi) –Training the model (Baum-Welch algorithm). [Figure: a small example HMM with states S1–S4, each emitting symbols A and C, with transition probabilities such as 0.5.]
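For concreteness, here is a minimal Viterbi sketch in Python. It assumes the model is given as plain dictionaries (start, trans, emit are illustrative names, not any particular toolkit's API); unseen emissions get a tiny floor probability.

```python
def viterbi(words, states, start, trans, emit):
    """Find the highest-probability state path for an observed word sequence.
    start[s] = Pr(s at t=0), trans[p][s] = Pr(s|p), emit[s][w] = Pr(w|s)."""
    # delta[t][s] = probability of the best path ending in state s at position t
    delta = [{s: start[s] * emit[s].get(words[0], 1e-12) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: delta[t - 1][p] * trans[p][s])
            delta[t][s] = (delta[t - 1][best_prev] * trans[best_prev][s]
                           * emit[s].get(words[t], 1e-12))
            back[t][s] = best_prev
    # follow back-pointers from the best final state
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```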
HMM for Segmentation Simplest Model: One state per entity type
HMM Learning Manually pick the HMM’s graph (e.g., the simple model, or fully connected). Learn transition probabilities: Pr(si|sj). Learn emission probabilities: Pr(w|si).
Learning model parameters When the training data defines a unique path through the HMM: –Transition probabilities Probability of transitioning from state i to state j = number of transitions from i to j / total transitions out of state i –Emission probabilities Probability of emitting symbol k from state i = number of times k is generated from i / total emissions from state i. When the training data defines multiple paths: –A more general EM-like algorithm (Baum-Welch)
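When the training data does define a unique path (i.e., each word is labeled with its state), the estimates are just normalized counts. A minimal sketch, assuming the data is a list of [(word, state), …] sequences:

```python
from collections import defaultdict, Counter

def estimate_hmm(tagged_sequences):
    """MLE for transition and emission probabilities from labeled sequences."""
    trans_counts = defaultdict(Counter)   # counts of state i -> state j transitions
    emit_counts = defaultdict(Counter)    # counts of state i emitting symbol k
    for seq in tagged_sequences:
        for (w, s) in seq:
            emit_counts[s][w] += 1
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            trans_counts[s1][s2] += 1
    trans = {i: {j: c / sum(cs.values()) for j, c in cs.items()}
             for i, cs in trans_counts.items()}
    emit = {i: {k: c / sum(cs.values()) for k, c in cs.items()}
            for i, cs in emit_counts.items()}
    return trans, emit
```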
What is a “symbol” ??? Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ? 4601 => “4601”, “9999”, “9+”, “number”, … ? Datamold: choose best abstraction level using holdout set
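One way to make the “abstraction levels” idea concrete is a function that maps a token to several representations, from the literal string down to a coarse type. The levels below are illustrative only, not Datamold's actual hierarchy:

```python
import re

def abstractions(token):
    """Return the token at several abstraction levels, coarsest last."""
    shape = re.sub(r"[A-Z]", "X",
            re.sub(r"[a-z]", "x",
            re.sub(r"[0-9]", "9", token)))
    collapsed = re.sub(r"(.)\1+", r"\1+", shape)   # "Xxxxx" -> "Xx+", "9999" -> "9+"
    kind = "number" if token.isdigit() else "word" if token.isalpha() else "other"
    return [token, shape, collapsed, kind]

# abstractions("Cohen") -> ['Cohen', 'Xxxxx', 'Xx+', 'word']
# abstractions("4601")  -> ['4601', '9999', '9+', 'number']
```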
What is a symbol? Bikel et al mix symbols from two abstraction levels
What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: identity of the word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in a hyperlink anchor; … [Figure: graphical model with states S_t-1, S_t, S_t+1 and observations O_t-1, O_t, O_t+1; the example word “Wisniewski” is annotated with features such as ends in “-ski” and part of a noun phrase.] Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …
Stupid HMM tricks [Figure: a start state with arcs Pr(red) and Pr(green) to a “red” state and a “green” state; each state loops back to itself, so Pr(green|green) = 1 and Pr(red|red) = 1.]
Stupid HMM tricks [Same two-state model as above: Pr(green|green) = 1, Pr(red|red) = 1.] Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x), so argmax{y} Pr(y|x) = argmax{y} Pr(x|y) * Pr(y) = argmax{y} Pr(y) * Pr(x1|y)*Pr(x2|y)*...*Pr(xm|y). For example: Pr(“I voted for Ralph Nader”|ggggg) = Pr(g)*Pr(I|g)*Pr(voted|g)*Pr(for|g)*Pr(Ralph|g)*Pr(Nader|g)
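Written as code, this “degenerate HMM” is just Naive Bayes over the whole sequence. A minimal sketch, where prior and cond are assumed dictionaries of class priors and per-class word probabilities:

```python
import math

def nb_predict(tokens, classes, prior, cond):
    """Naive Bayes as a one-state-per-class HMM with Pr(y|y) = 1.
    prior[y] = Pr(y); cond[y][w] = Pr(w|y); log space avoids underflow."""
    def log_score(y):
        return math.log(prior[y]) + sum(math.log(cond[y].get(w, 1e-12))
                                        for w in tokens)
    return max(classes, key=log_score)
```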
HMMs = sequential NB
From NB to Maxent
Learning: set the alpha parameters to maximize the conditional likelihood of the data, given that we’re using the same functional form as NB. Turns out this is the same as maximizing the entropy of p(y|x) over all distributions that match the observed feature expectations.
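As a concrete sketch of that functional form (not any particular system's implementation): the conditional model is Pr(y|x) proportional to exp(sum_i alpha_i * f_i(x,y)), normalized per input x. Here feats(x, y) is an assumed function returning a dict of feature values:

```python
import math

def p_y_given_x(alphas, feats, x, classes):
    """Conditional maxent: Pr(y|x) = exp(sum_i alpha_i f_i(x,y)) / Z(x)."""
    scores = {y: sum(alphas.get(f, 0.0) * v for f, v in feats(x, y).items())
              for y in classes}
    m = max(scores.values())                     # subtract max for numerical stability
    exp_scores = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exp_scores.values())                 # the per-x normalizer Z(x)
    return {y: e / z for y, e in exp_scores.items()}
```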
MaxEnt Comments –Implementation: All methods are iterative. Numerical issues (underflow, rounding) are important. For NLP-like problems with many features, modern gradient-based or Newton-like methods work well – sometimes better(?) and faster than GIS and IIS. –Smoothing: Typically maxent will overfit the data if there are many infrequent features. Common solutions: discard low-count features; early stopping with a holdout set; a Gaussian prior centered on zero to limit the size of the alphas (i.e., optimize log likelihood minus a penalty on the squared alphas)
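A minimal sketch of one gradient-ascent step on that penalized conditional log-likelihood, assuming p_y_given_x from the sketch above, data as a list of (x, y) pairs, and illustrative hyperparameters lr and sigma2 (not values from any paper):

```python
def gradient_step(alphas, data, feats, classes, lr=0.1, sigma2=1.0):
    """One gradient-ascent step: (empirical - expected) feature counts,
    minus the Gaussian-prior penalty alpha / sigma^2."""
    grad = {}
    for x, y in data:
        probs = p_y_given_x(alphas, feats, x, classes)
        for f, v in feats(x, y).items():              # empirical feature counts
            grad[f] = grad.get(f, 0.0) + v
        for y2 in classes:                            # minus expected counts under the model
            for f, v in feats(x, y2).items():
                grad[f] = grad.get(f, 0.0) - probs[y2] * v
    for f in set(list(grad) + list(alphas)):
        penalty = alphas.get(f, 0.0) / sigma2         # Gaussian prior centered on zero
        alphas[f] = alphas.get(f, 0.0) + lr * (grad.get(f, 0.0) - penalty)
    return alphas
```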
MaxEnt Comments –Performance: Good MaxEnt methods are competitive with linear SVMs and other state-of-the-art classifiers in accuracy. Can’t as easily extend to higher-order interactions (e.g. kernel SVMs, AdaBoost) – but see [Lafferty, Zhu, Liu ICML2004]. Training is relatively expensive. –Embedding in a larger system: MaxEnt optimizes Pr(y|x), not error rate.
MaxEnt Comments –MaxEnt competitors: Model Pr(y|x) with Pr(y|score(x)), using scores from SVMs, NB, … Regularized Winnow, BPETs, … Ranking-based methods that estimate whether Pr(y1|x) > Pr(y2|x). –Things I don’t understand: Why don’t we call it logistic regression? Why is it always used to estimate the density of (y,x) pairs rather than a separate density for each class y? When are its confidence estimates reliable?
What is a symbol? [A sequence of build slides repeating the figure and feature list above: identity of the word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in a hyperlink anchor, …] Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations … then on the observations and the previous state … then on the observations and the previous state history.
Ratnaparkhi’s MXPOST Sequential learning problem: predict POS tags of words. Uses MaxEnt model described above. Rich feature set. To smooth, discard features occurring < 10 times.
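The “discard features occurring < 10 times” smoothing step could be implemented along these lines (a sketch only; feature_events and the data format are assumptions, not MXPOST's actual code):

```python
from collections import Counter

def prune_features(feature_events, min_count=10):
    """Keep only features that fire at least min_count times in training.
    feature_events is a list of feature-name lists, one per training event."""
    counts = Counter(f for event in feature_events for f in event)
    keep = {f for f, c in counts.items() if c >= min_count}
    return [[f for f in event if f in keep] for event in feature_events]
```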
MXPOST
MXPOST: learning & inference GIS Feature selection
Alternative inference schemes
MXPost inference
Inference for MENE [Figure: a lattice with states B, I, O above each word of “When will prof Cohen post the notes …”.]
Inference for MXPOST [Same lattice.] (Approx view): find the best path; weights are now on arcs from state to state.
Inference for MXPOST [Same lattice.] More accurately: find the total flow to each node; weights are now on arcs from state to state.
Inference for MXPOST [Same lattice.] Find best path? tree? Weights are on hyperedges.
Inference for MxPOST [Figure: the same lattice, but each node is now a (previous tag, current tag) pair – iI, iO, oI, oO, … – for each word of “When will prof Cohen post the notes …”.] Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.
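A minimal beam-search decoder sketch. Here local_score is an assumed callable, e.g. the maxent model's Pr(current tag | word position, tag history); it is not MXPOST's actual interface:

```python
def beam_search(words, states, local_score, beam=5):
    """At each position, extend every hypothesis with every state, score it,
    and keep only the top-`beam` hypotheses; return the best complete tag sequence."""
    hyps = [([], 1.0)]                        # (tag sequence so far, score)
    for t in range(len(words)):
        extended = [(tags + [s], score * local_score(tags, words, t, s))
                    for tags, score in hyps for s in states]
        hyps = sorted(extended, key=lambda h: h[1], reverse=True)[:beam]
    return hyps[0][0]
```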
MXPost results State-of-the-art accuracy (for 1996). The same approach was used successfully for several other sequential classification steps of a stochastic parser (also state of the art). The same (or similar) approaches were used for NER by Borthwick, Malouf, Manning, and others.
MEMMs Basic difference from ME tagging: –ME tagging: the previous state is a feature of a single MaxEnt classifier. –MEMM: build a separate MaxEnt classifier for each state (see the sketch below). Can build any HMM architecture you want, e.g., parallel nested HMMs, etc. Data is fragmented: examples where the previous tag is “proper noun” give no information about learning tags when the previous tag is “noun”. –Mostly a difference in viewpoint. –MEMM does allow the possibility of “hidden” states and Baum-Welch-like training.
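A sketch of the “one classifier per previous state” idea, which also makes the data fragmentation visible: each previous state sees only its own slice of the training data. feats, train_maxent, and the data format are assumptions (train_maxent could be any conditional-maxent trainer, e.g. the gradient sketch above):

```python
from collections import defaultdict

def train_memm(tagged_sequences, feats, train_maxent):
    """Split training data by previous state and fit a separate MaxEnt
    classifier Pr(s_t | o_t, s_{t-1} = p) for each previous state p."""
    by_prev = defaultdict(list)
    for seq in tagged_sequences:              # seq = [(observation, state), ...]
        prev = "START"
        for obs, state in seq:
            by_prev[prev].append((feats(obs), state))
            prev = state
    # Note how each classifier trains on a fragment of the full data.
    return {prev: train_maxent(examples) for prev, examples in by_prev.items()}
```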
MEMM task: FAQ parsing
MEMM features
MEMMs
Some interesting points to ponder “Easier to think of observations as part of the arcs, rather than the states.” FeatureHMM works surprisingly(?) well. Both approaches allow Pr(y_i | x, y_i-1, …) to be determined by arbitrary features of the history. –“Factored” MEMM