Conditional Markov Models
What is a symbol?
Ideally we would like to use many arbitrary, overlapping features of words:
identity of word
ends in “-ski”
is capitalized
is part of a noun phrase
is in a list of city names
…
[Figure: states S(t-1), S(t), S(t+1) above observations O(t-1), O(t), O(t+1); the observation at time t is “Wisniewski”, is part of a noun phrase, ends in “-ski”, …]
Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …
Stupid HMM tricks
[Figure: a start state with arcs Pr(red) and Pr(green) into a red state and a green state; Pr(red|red) = 1 and Pr(green|green) = 1.]
Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x)
argmax{y} Pr(y|x) = argmax{y} Pr(x|y) * Pr(y)
 = argmax{y} Pr(y) * Pr(x1|y)*Pr(x2|y)*...*Pr(xm|y)
Pr(“I voted for Ralph Nader”|ggggg) = Pr(g)*Pr(I|g)*Pr(voted|g)*Pr(for|g)*Pr(Ralph|g)*Pr(Nader|g)
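As a minimal sketch of the argmax on this slide, the Python below scores each label y by log Pr(y) + Σ log Pr(xi|y) and picks the best one. The label names "g" and "r", the probability tables, and the `floor` value used for unseen words are illustrative assumptions, not values from the lecture.

```python
import math

def nb_log_score(words, label, prior, likelihood, floor=1e-6):
    """log Pr(y) + sum_i log Pr(x_i | y).

    Words missing from the table get a small floor probability
    (an assumption made here so that log() is always defined).
    """
    score = math.log(prior[label])
    for w in words:
        score += math.log(likelihood[label].get(w, floor))
    return score

def nb_classify(words, prior, likelihood):
    """Return the label that maximizes Pr(y) * prod_i Pr(x_i | y)."""
    return max(prior, key=lambda y: nb_log_score(words, y, prior, likelihood))

# Hypothetical parameters for two "colors" of document: g(reen) and r(ed).
PRIOR = {"g": 0.5, "r": 0.5}
LIKELIHOOD = {
    "g": {"voted": 0.02, "Nader": 0.01},
    "r": {"voted": 0.02, "Nader": 0.0001},
}

sentence = "I voted for Ralph Nader".split()
best = nb_classify(sentence, PRIOR, LIKELIHOOD)
```

Working in log space avoids underflow when the product runs over many words; the ranking of labels is unchanged because log is monotone.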
From NB to Maxent
What is a symbol?
identity of word
ends in “-ski”
is capitalized
is part of a noun phrase
is in a list of city names
is under node X in WordNet
is in bold font
is indented
is in hyperlink anchor
…
[Figure: states S(t-1), S(t), S(t+1) above observations O(t-1), O(t), O(t+1); the observation at time t is “Wisniewski”, is part of a noun phrase, ends in “-ski”, …]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history.
Ratnaparkhi’s MXPOST
Sequential learning problem: predict POS tags of words.
Uses the MaxEnt model described above.
Rich feature set.
To smooth, discard features occurring < 10 times.
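The count cutoff mentioned for MXPOST can be sketched as a small preprocessing step. This is only an illustration of the idea, not Ratnaparkhi's code: the function name, the `(features, tag)` example layout, and the feature strings are all hypothetical.

```python
from collections import Counter

def prune_rare_features(examples, min_count=10):
    """Drop every feature that occurs fewer than min_count times in the data.

    examples: list of (feature_list, tag) pairs (a hypothetical layout).
    """
    counts = Counter(f for feats, _tag in examples for f in feats)
    keep = {f for f, c in counts.items() if c >= min_count}
    return [([f for f in feats if f in keep], tag) for feats, tag in examples]

# Tiny made-up data set; a cutoff of 2 is used so the effect is visible.
toy = [
    (["word=the", "prev_tag=VB"], "DT"),
    (["word=the", "suffix=he"], "DT"),
    (["word=dog", "prev_tag=DT"], "NN"),
]
pruned = prune_rare_features(toy, min_count=2)
```

Only `word=the` survives in this toy run, because it is the only feature seen at least twice.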
Inference for MENE
[Figure: the sentence “When will prof Cohen post the notes …” with a candidate tag B under each word.]
Inference for MXPOST
[Figure: lattice over “When will prof Cohen post the notes …” with candidate tags B, I, O under each word.]
(Approx view): find best path, weights are now on arcs from state to state.
Inference for MXPOST
[Figure: lattice over “When will prof Cohen post the notes …” with candidate tags B, I, O under each word.]
Find best path? tree? Weights are on hyperedges.
Inference for MXPOST
Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.
[Figure: beam search over “When will prof Cohen post the notes …”, shown in three successive builds.
Build 1 node labels: I, iI, oI, O, iO, oO.
Build 2 node labels: oII, I, iI, oiO, oI, ioI, O, iO, ioO, ooI, oO, ooO.
Build 3 node labels: oiI, oiiI, I, iI, oiO, oiiO, oI, ioI, iooI, O, iO, ioO, iooO, ooI, oO, oooI, ooO, oooO.]
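The beam search procedure described here (expand all children, score them, keep the top n) can be sketched in a few lines of Python. The scoring function below is a made-up stand-in for the log-probability a MaxEnt tagger would assign given the word and tag history; its numbers are assumptions chosen only to make the example run.

```python
def beam_search(words, tags, score, beam_width):
    """Keep only the beam_width best partial tag sequences at each position."""
    beam = [((), 0.0)]  # (tag history so far, cumulative log-score)
    for i in range(len(words)):
        # Expand every surviving history with every possible next tag.
        children = [
            (hist + (t,), total + score(words, i, hist, t))
            for hist, total in beam
            for t in tags
        ]
        # Score-sorted, best first; discard all but the top beam_width.
        children.sort(key=lambda c: c[1], reverse=True)
        beam = children[:beam_width]
    return beam[0]

def toy_score(words, i, hist, tag):
    """Hypothetical local log-score: favors B on capitalized non-initial words."""
    prev = hist[-1] if hist else "O"
    cap = i > 0 and words[i][:1].isupper()
    if tag == "B":
        return 0.0 if cap else -3.0
    if tag == "I":
        return -0.5 if (cap and prev in ("B", "I")) else -4.0
    return -3.0 if cap else 0.0  # tag == "O"

words = "When will prof Cohen post the notes".split()
best_tags, best_score = beam_search(words, ["B", "I", "O"], toy_score, beam_width=2)
```

Unlike Viterbi, the score function may look at the whole history `hist`, not just the previous tag; the price is that the search is approximate, since a pruned history can never be recovered.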
MXPOST results
State-of-the-art accuracy (for 1996).
Same approach used successfully for several other sequential classification steps of a stochastic parser (also state of the art).
Same (or similar) approaches used for NER by Borthwick, Malouf, Manning, and others.
Freitag, McCallum, Pereira
MEMMs
Basic difference from ME tagging:
  ME tagging: previous state is a feature of the MaxEnt classifier
  MEMM: build a separate MaxEnt classifier for each state.
    Can build any HMM architecture you want, e.g. parallel nested HMMs, etc.
    Data is fragmented: examples where the previous tag is “proper noun” give no information about learning tags when the previous tag is “noun”
Mostly a difference in viewpoint
MEMM does allow the possibility of “hidden” states and Baum-Welch-like training
Viterbi is the most natural inference scheme
MEMM task: FAQ parsing
MEMM features
Conditional Random Fields
Implications of the MEMM model
Does this do what we want?
Q: does Y[i-1] depend on X[i+1]?
“a node is conditionally independent of its non-descendants given its parents”
Q: what is Y[0] for the sentence “Qbbzzt of America Inc announced layoffs today in …”
Label Bias Problem
Consider this MEMM, and enough training data to perfectly model it:
Pr(0123|rib) = 1
Pr(0453|rob) = 1
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1’ * Pr(5|4,i)/Z2’ * Pr(3|5,b)/Z3’ = 0.5 * 1 * 1
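The arithmetic on this slide can be reproduced with a small locally normalized model. The sketch below assumes the transition structure implied by the path names (0→1→2→3 for “rib”, 0→4→5→3 for “rob”); the score values, including the low score given to unseen events, are hypothetical. The point it illustrates is that a state with a single successor passes on all of its probability mass no matter what the observation is.

```python
import math

# Allowed transitions, reconstructed from the path names 0123 and 0453 (an assumption).
SUCCESSORS = {0: [1, 4], 1: [2], 2: [3], 4: [5], 5: [3]}

# Events seen in training get score 0; anything unseen gets a low score (hypothetical values).
SEEN = {(0, "r", 1), (0, "r", 4), (1, "i", 2), (4, "o", 5), (2, "b", 3), (5, "b", 3)}

def local_score(state, obs, nxt):
    return 0.0 if (state, obs, nxt) in SEEN else -10.0

def local_prob(state, obs, nxt):
    """Pr(nxt | state, obs), normalized over the successors of this state only."""
    z = sum(math.exp(local_score(state, obs, s)) for s in SUCCESSORS[state])
    return math.exp(local_score(state, obs, nxt)) / z

def path_prob(path, word):
    """Product of locally normalized transition probabilities along the path."""
    p = 1.0
    for state, nxt, obs in zip(path, path[1:], word):
        p *= local_prob(state, obs, nxt)
    return p

p_wrong = path_prob([0, 1, 2, 3], "rob")   # path for "rib", scored on "rob"
p_right = path_prob([0, 4, 5, 3], "rob")   # intended path for "rob"
```

Both paths come out at 0.5: state 1 has never seen “o”, but local normalization over its one successor still gives that transition probability 1, so the observation cannot lower the score of the wrong path.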
How important is label bias?
Could be avoided in this case by changing structure:
Our models are always wrong – is this “wrongness” a problem?
See Klein & Manning’s paper for more on this….
Another view of label bias [Sha & Pereira]
So what’s the alternative?
Another max-flow scheme
[Figure: lattice over “When will prof Cohen post the notes …” with candidate tags B, I, O under each word.]
More accurately: find total flow to each node, weights are now on arcs from state to state.
Flow out of a node is always fixed:
Another max-flow scheme: MRFs
[Figure: lattice over “When will prof Cohen post the notes …” with candidate tags B, I, O under each word.]
Goal is to learn how to weight edges in the graph:
weight(yi, yi+1) = 2*[(yi=B or yi=I) and isCap(xi)]
  + 1*[yi=B and isFirstName(xi)]
  - 5*[yi+1≠B and isLower(xi) and isUpper(xi+1)]
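The edge-weight expression on this slide can be written as an ordinary function of two adjacent labels and two adjacent words. In the sketch below the predicates are guesses at what the slide intends: `is_cap` and `is_upper` are both taken to mean “starts with an upper-case letter”, and the first-name list is a hypothetical one-entry gazetteer. The weights 2, 1 and -5 are the ones shown on the slide; in a trained model they would be learned.

```python
FIRST_NAMES = {"William"}  # hypothetical gazetteer, for illustration only

def is_cap(word):
    return word[:1].isupper()

def is_upper(word):
    # Assumption: interpreted as "begins with an upper-case letter".
    return word[:1].isupper()

def is_lower(word):
    return word.islower()

def is_first_name(word):
    return word in FIRST_NAMES

def edge_weight(y_i, y_next, x_i, x_next):
    """Weight of the edge from label y_i at position i to label y_next at i+1."""
    w = 0.0
    if y_i in ("B", "I") and is_cap(x_i):
        w += 2.0
    if y_i == "B" and is_first_name(x_i):
        w += 1.0
    if y_next != "B" and is_lower(x_i) and is_upper(x_next):
        w -= 5.0
    return w

# A lower-case word followed by a capitalized one: not starting an entity is penalized.
penalized = edge_weight("O", "I", "prof", "Cohen")
```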
Another max-flow scheme: MRFs
[Figure: lattice over “When will prof Cohen post the notes …” with candidate tags B, I, O under each word.]
Find total flow to each node, weights are now on edges from state to state.
Goal is to learn how to weight edges in the graph, given features from the examples.
CRFs vs MEMMs
MEMMs: Sequence classification f: x → y is reduced to many cases of ordinary classification, f: xi → yi, combined with Viterbi or beam search.
[Figure: chain y1 … y6 over x1 … x6, annotated with local classifiers Pr(Y|x2,y1), Pr(Y|x2,y1’), Pr(Y|x4,y3), Pr(Y|x5,y5), …]
CRFs: Sequence classification f: x → y is done by:
  Converting x, Y to an MRF
  Using “flow” computations on the MRF to compute the best y|x
[Figure: MRF over y1 … y6 and x1 … x6, with potentials φ(Y1,Y2), φ(Y2,Y3), …]
The math: Review of maxent
Review of maxent/MEMM/CMMs
We know how to compute this.
Details on CMMs
From CMMs to CRFs
Recall why we’re unhappy: we don’t want local normalization
New model
How to compute this?
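The equations for this slide were not recovered, so the block below is a hedged reconstruction in the standard notation rather than the slide's own formulas. It assumes feature functions f_i with weights λ_i that look at the input x and an adjacent label pair. The first line is the locally normalized CMM/MEMM form; the second is the globally normalized CRF form, where a single Z(x) sums over whole label sequences.

```latex
% CMM / MEMM: one normalizer per position, conditioned on the previous label
\Pr(y \mid x) \;=\; \prod_{j}
  \frac{\exp\Big(\sum_i \lambda_i f_i(x, y_j, y_{j-1})\Big)}
       {Z(x, y_{j-1})},
\qquad
Z(x, y_{j-1}) \;=\; \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y', y_{j-1})\Big)

% CRF: one normalizer for the whole sequence
\Pr(y \mid x) \;=\;
  \frac{\exp\Big(\sum_j \sum_i \lambda_i f_i(x, y_j, y_{j-1})\Big)}
       {Z(x)},
\qquad
Z(x) \;=\; \sum_{y'} \exp\Big(\sum_j \sum_i \lambda_i f_i(x, y'_j, y'_{j-1})\Big)
```

The numerators are the same product of local factors; only the normalization differs, which is why computing Z(x) efficiently becomes the central question.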
What’s the new model look like?
What’s independent?
If fi is HMM-like and depends on only xj,yj or yj,yj-1
[Figure: chain y1, y2, y3 with each yj linked to its own observation xj.]
What’s the new model look like?
What’s independent now??
[Figure: chain y1, y2, y3 with every yj linked to a single observation node x.]
CRF learning – from Sha & Pereira
CRF learning – from Sha & Pereira
Something like forward-backward
Idea:
  Define matrix of y,y’ “affinities” at stage i
  Mi[y,y’] = “unnormalized probability” of transition from y to y’ at stage i
  Mi * Mi+1 = “unnormalized probability” of any path through stages i and i+1
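A minimal sketch of the matrix idea: if Mi[y][y'] holds the unnormalized weight of moving from y to y' at stage i, then the matrix product M1 * M2 * … sums the weights of all paths between each pair of end labels. The code below checks this against brute-force enumeration. The 2×2 matrices are made-up numbers, and plain Python lists are used instead of a matrix library to keep the example self-contained.

```python
import itertools

def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def chain_product(mats):
    """M1 * M2 * ... * Mn."""
    out = mats[0]
    for m in mats[1:]:
        out = matmul(out, m)
    return out

def brute_force(mats, start, end):
    """Sum of path weights over every label sequence from start to end."""
    n_labels = len(mats[0])
    total = 0.0
    for middle in itertools.product(range(n_labels), repeat=len(mats) - 1):
        path = (start,) + middle + (end,)
        weight = 1.0
        for i, m in enumerate(mats):
            weight *= m[path[i]][path[i + 1]]
        total += weight
    return total

# Hypothetical affinities for labels 0 = name, 1 = nonName at two stages.
M1 = [[2.0, 1.0], [0.5, 3.0]]
M2 = [[1.0, 4.0], [2.0, 1.0]]
P = chain_product([M1, M2])
```

The product costs time linear in the sequence length, while the enumeration grows exponentially; this is the same saving that forward-backward gives an HMM.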
[Figure: label chain y1, y2, y3 with a single observation node x.]
Forward backward ideas
[Figure: three-stage lattice with states name and nonName at each stage; edges labeled b, c, d, f, g, h.]
CRF learning – from Sha & Pereira
Sha & Pereira results
CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron
Sha & Pereira results
(in minutes, 375k examples)
Klein & Manning: Conditional Structure vs Estimation
Task 1: WSD (Word Sense Disambiguation)
Bush’s election-year ad campaign will begin this summer, with... (sense1)
Bush whacking is tiring but rewarding - who wants to spend all their time on marked trails? (sense2)
Class is sense1/sense2, features are context words.
Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model:
Use conditional rule to predict sense s from context-word observations o.
Standard NB training maximizes “joint likelihood” under independence assumption
Task 1: WSD (Word Sense Disambiguation)
Model 2: Keep same functional form, but maximize conditional likelihood (sound familiar?)
or maybe SenseEval score:
or maybe even:
In other words…
MaxEnt
Naïve Bayes
Different “optimization goals”…
… or, dropping a constraint about f’s and λ’s
Task 1: WSD (Word Sense Disambiguation)
Optimize JL (joint likelihood) with standard NB learning
Optimize SCL (sum of conditional likelihoods) and CL (conditional likelihood) with conjugate gradient
  Also over “non-deficient models” (?) using Lagrange penalties to enforce a “soft” version of the deficiency constraint
    I think this makes sure the non-conditional version is a valid probability
“Punt” on optimizing accuracy
Penalty for extreme predictions in SCL
Conclusion: maxent beats NB?
All generalizations are wrong?
Task 2: POS Tagging
Sequential problem
Replace NB with HMM model.
Standard algorithms maximize joint likelihood
Claim: keeping the same model but maximizing conditional likelihood leads to a CRF
  Is this true?
Alternative is conditional structure (CMM)
Using conditional structure vs maximizing conditional likelihood
CMM factors Pr(s,o) into Pr(s|o)Pr(o).
For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e. JL estimate = CL estimate for Pr(s|o)
Task 2: POS Tagging
Experiments with a simple feature set:
For fixed model, CL is preferred to JL (CRF beats HMM)
For fixed objective, HMM is preferred to MEMM/CMM
Error analysis for POS tagging
Label bias is not the issue:
  state-state dependencies are weak compared to observation-state dependencies
  too much emphasis on observation, not enough on previous states (“observation bias”)
  put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy...
