IE With Undirected Models William W. Cohen CALD
Announcements
Upcoming assignments:
Mon 2/23: Klein & Manning, Toutanova et al
Wed 2/25: no writeup due
Mon 3/1: no writeup due
Wed 3/3: project proposal due: personnel + 1-2 pages
Spring break week, no class
Motivation for CMMs
Example features: identity of word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, ...
[Figure: chain of states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}, with example features attached to the current position (is "Wisniewski", part of noun phrase, ends in "-ski").]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.
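Written out, that maxent local model takes the familiar form below; a sketch of the standard parameterization, where w and f are a generic weight vector and feature function (not notation taken from the slides):

    % MEMM/CMM local model: a maxent distribution over the next state,
    % conditioned on the previous state and the current observation.
    P(s_t \mid s_{t-1}, o_t) \;=\;
      \frac{\exp\big(\mathbf{w}\cdot\mathbf{f}(s_t, s_{t-1}, o_t)\big)}
           {\sum_{s'}\exp\big(\mathbf{w}\cdot\mathbf{f}(s', s_{t-1}, o_t)\big)}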
Implications of the model
Does this do what we want? Q: does Y[i-1] depend on X[i+1]?
"A node is conditionally independent of its non-descendants given its parents."
Another view of label bias [Sha & Pereira] So what’s the alternative?
CRF model
[Figure: linear-chain CRF: labels y1, y2, y3, y4 connected in a chain, each also connected to the observation x.]
CRF learning – from Sha & Pereira
Something like forward-backward. Idea: define a matrix of (y, y') "affinities" at stage i:
M_i[y, y'] = "unnormalized probability" of a transition from y to y' at stage i
M_i * M_{i+1} = "unnormalized probability" of any path through stages i and i+1
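A minimal sketch of the matrix idea for a toy two-label chain; the numbers in the M_i matrices are made up for illustration (a real CRF would compute them from features and weights):

    import numpy as np

    # Toy linear-chain CRF with labels {name, nonName}.
    # M[i][y, y2] holds the unnormalized score for moving from label y
    # at one position to label y2 at the next position.
    labels = ["name", "nonName"]
    M = [np.array([[2.0, 0.5],
                   [1.0, 1.5]]),      # transitions into position 1
         np.array([[1.5, 0.7],
                   [0.3, 2.0]])]      # transitions into position 2

    # Multiplying the matrices sums unnormalized scores over all paths
    # through the intermediate positions; summing the final vector gives
    # the normalizer Z(x) over every labeling.
    alpha = np.ones(len(labels))       # uniform "start" vector
    for Mi in M:
        alpha = alpha @ Mi             # forward recursion: alpha <- alpha * M_i
    Z = alpha.sum()
    print("Z(x) =", Z)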
Forward backward ideas
[Figure: trellis over three positions with states name/nonName at each stage; edges carry unnormalized transition scores, labeled c, g, b, f, d, h in the figure.]
CRF results (from S&P, L et al)
Sha & Pereira even use some statistical tests, and show that the CRF beats the MEMM (McNemar's test) - but not the voted perceptron.
CRFs: the good, the bad, and the cumbersome…
Good points:
- Global optimization of the weight vector that guides decision making
- Trade off decisions made at different points in the sequence
Worries:
- Cost (of training)
- Complexity (do we need all this math?)
- Amount of context: the matrix for the normalizer is |Y| x |Y|, so high-order models for many classes get expensive fast.
- Strong commitment to maxent-style learning: loglinear models are nice, but nothing is always best.
Dependency Nets
Proposed solution: the parents of a node are its Markov blanket
- like an undirected Markov net: capture all "correlational associations"
- one conditional probability for each node X, namely P(X | parents of X)
- like a directed Bayes net: no messy clique potentials
Dependency nets
The bad and the ugly:
- Inference is less efficient: MCMC sampling
- Can't reconstruct the probability via the chain rule
- Networks might be inconsistent, i.e. the local P(x|pa(x))'s don't define a pdf
- Exactly equal, representationally, to normal undirected Markov nets
Dependency nets
The good:
- Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X. (You might not learn a consistent model, but you'll probably learn a reasonably good one.)
- Inference can be sped up substantially over naïve Gibbs sampling.
Dependency nets
Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X.
[Figure: dependency net over a label chain y1-y2-y3-y4, each label also connected to x, with local models Pr(y1|x,y2), Pr(y2|x,y1,y3), Pr(y3|x,y2,y4), Pr(y4|x,y3).]
Learning is local, but inference is not, and need not be unidirectional.
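To illustrate the "learning is local, inference is not" point, here is a minimal sketch of Gibbs sampling driven by per-node local conditionals; the local_prob function is a hypothetical stand-in for the learned per-node classifiers, not anything from the slides:

    import random

    # Gibbs sampling over a label chain y1..y4 conditioned on x.
    LABELS = [0, 1]

    def local_prob(i, y, x):
        """Toy local conditional P(y[i]=1 | x, neighbors of i).
        A real dependency net would call the classifier trained for node i."""
        neighbors = [y[j] for j in (i - 1, i + 1) if 0 <= j < len(y)]
        score = 0.5 + 0.2 * (sum(neighbors) - len(neighbors) / 2) + 0.1 * x[i]
        return min(max(score, 0.05), 0.95)

    def gibbs(x, n_sweeps=100):
        y = [random.choice(LABELS) for _ in x]
        for _ in range(n_sweeps):
            for i in range(len(y)):            # resample each node in turn
                y[i] = 1 if random.random() < local_prob(i, y, x) else 0
        return y

    print(gibbs(x=[1, 0, 1, 1]))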
Toutanova, Klein, Manning, Singer
Dependency nets for POS tagging vs. CMMs. Maxent is used for the local conditional models.
Goals: an easy-to-train bidirectional model; a really good POS tagger.
Toutanova et al
Don't use Gibbs sampling for inference: instead use a Viterbi variant (which is not guaranteed to produce the ML sequence).
Example: D = {11, 11, 11, 12, 21, 33}. The ML joint state is {1,1} (3 of the 6 examples), but the search scores a state by the product of its local conditionals:
P(a=1|b=1) P(b=1|a=1) = (3/4)(3/4) < 1
P(a=3|b=3) P(b=3|a=3) = 1 * 1 = 1
so it can prefer {3,3} over the ML state {1,1}.
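A quick check of those numbers on the toy dataset, just by counting (plain Python, for illustration only):

    from collections import Counter

    # The toy dataset from the slide: each item is a pair (a, b).
    D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]
    pairs = Counter(D)
    a_counts = Counter(a for a, b in D)
    b_counts = Counter(b for a, b in D)

    def p_a_given_b(a, b):
        return pairs[(a, b)] / b_counts[b]

    def p_b_given_a(b, a):
        return pairs[(a, b)] / a_counts[a]

    # Score used by the Viterbi-style search: product of local conditionals.
    print(p_a_given_b(1, 1) * p_b_given_a(1, 1))   # 0.75 * 0.75 = 0.5625
    print(p_a_given_b(3, 3) * p_b_given_a(3, 3))   # 1.0  * 1.0  = 1.0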
Results with model “Best” model includes some special unknown-word features, including “a crude company-name detector”
Results with model
Final test-set results: MXPost: 47.6, 96.4, 86.2; CRF+: 95.7, 76.4
Other comments
Smoothing (quadratic regularization, aka a Gaussian prior) is important: it avoids overfitting effects reported elsewhere.
More on smoothing...
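As a sketch of what "quadratic regularization / Gaussian prior" means for the training objective; the function names and toy likelihood below are illustrative, not the authors' code:

    import numpy as np

    # Penalized objective: conditional log-likelihood minus a quadratic
    # penalty on the weights (equivalently, a zero-mean Gaussian prior
    # with variance sigma2); the gradient gets a matching -w/sigma2 term.
    def penalized_objective(w, log_likelihood, grad_log_likelihood, sigma2=1.0):
        obj = log_likelihood(w) - np.dot(w, w) / (2 * sigma2)
        grad = grad_log_likelihood(w) - w / sigma2
        return obj, grad

    # Toy usage with a quadratic stand-in for the likelihood:
    ll = lambda w: -np.sum((w - 1.0) ** 2)
    gll = lambda w: -2.0 * (w - 1.0)
    print(penalized_objective(np.zeros(3), ll, gll, sigma2=10.0))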
Klein & Manning: Conditional Structure vs Estimation
Task 1: WSD (Word Sense Disambiguation) Bush’s election-year ad campaign will begin this summer, with... (sense1) Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails? (sense2) Class is sense1/sense2, features are context words.
Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model: use the conditional rule to predict sense s from context-word observations o. Standard NB training maximizes the "joint likelihood" under the independence assumption.
Task 1: WSD (Word Sense Disambiguation)
Model 2: keep the same functional form, but maximize conditional likelihood (sound familiar?), or maybe the SenseEval score, or maybe even other objectives.
Task 1: WSD (Word Sense Disambiguation)
- Optimize JL with standard NB learning
- Optimize SCL, CL with conjugate gradient
- Also over "non-deficient models" (?) using Lagrange penalties to enforce a "soft" version of the deficiency constraint (I think this makes sure the non-conditional version is a valid probability distribution)
- "Punt" on optimizing accuracy
- Penalty for extreme predictions in SCL
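A minimal sketch of the JL-vs-CL distinction for the Naive-Bayes functional form; parameter names and toy data below are made up for illustration, not the authors' implementation:

    import numpy as np

    # Same NB parameterization scored two ways: JL scores P(s, o),
    # CL scores P(s | o) by normalizing the same score over senses.
    def joint_log_lik(theta_s, theta_ws, docs, senses):
        return sum(np.log(theta_s[s]) + sum(np.log(theta_ws[s][w]) for w in doc)
                   for doc, s in zip(docs, senses))

    def cond_log_lik(theta_s, theta_ws, docs, senses):
        total = 0.0
        for doc, s in zip(docs, senses):
            scores = {k: np.log(theta_s[k]) + sum(np.log(theta_ws[k][w]) for w in doc)
                      for k in theta_s}
            logZ = np.logaddexp.reduce(list(scores.values()))
            total += scores[s] - logZ
        return total

    theta_s = {"s1": 0.5, "s2": 0.5}
    theta_ws = {"s1": {"election": 0.6, "trail": 0.4},
                "s2": {"election": 0.2, "trail": 0.8}}
    docs, senses = [["election"], ["trail", "trail"]], ["s1", "s2"]
    print(joint_log_lik(theta_s, theta_ws, docs, senses),
          cond_log_lik(theta_s, theta_ws, docs, senses))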
Conclusion: maxent beats NB? All generalizations are wrong?
Task 2: POS Tagging
Sequential problem: replace NB with an HMM model. Standard algorithms maximize joint likelihood.
Claim: keeping the same model but maximizing conditional likelihood leads to a CRF. Is this true?
Alternative is conditional structure (CMM).
Using conditional structure vs maximizing conditional likelihood
The CMM factors Pr(s,o) into Pr(s|o) Pr(o). For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e. the JL estimate equals the CL estimate for Pr(s|o).
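A short derivation of that claim, in the slide's notation (s = state sequence, o = observation sequence):

    % CMM factorization:
    \Pr(s, o) \;=\; \Pr(o)\,\Pr(s \mid o)
              \;=\; \Pr(o)\,\prod_{i} \Pr(s_i \mid s_{i-1}, o)
    % So log Pr(s,o) = log Pr(o) + sum_i log Pr(s_i | s_{i-1}, o), and the two
    % terms share no parameters: maximizing the joint likelihood yields the
    % same estimate of Pr(s|o) as maximizing the conditional likelihood.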
Task 2: POS Tagging
Experiments with a simple feature set:
- For a fixed model, CL is preferred to JL (CRF beats HMM)
- For a fixed objective, the HMM is preferred to the MEMM/CMM
Error analysis for POS tagging
Label bias is not the issue: state-state dependencies are weak compared to observation-state dependencies.
There is too much emphasis on the observation, not enough on previous states ("observation bias").
Put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy...