1
Klein and Manning on CRFs vs CMMs
2
Announcements
- Projects are all in and approved.
- People working on weird datasets should send me some examples so I understand the task.
- Anything else?
3
CRF wrap-up
MEMMs: Sequence classification f: x→y is reduced to many cases of ordinary classification, f: xi→yi, combined with Viterbi or beam search (a decoding sketch follows below).
CRFs: Sequence classification f: x→y is done by:
- Converting x, Y to an MRF
- Using "flow" computations on the MRF to compute the best y|x
[Figure: the MEMM as a chain of local distributions Pr(Y|xi, yi-1) over nodes x1…x6, y1…y6; the CRF as an MRF over the same nodes with potentials φ(Y1,Y2), φ(Y2,Y3), ….]
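To make the MEMM half concrete, here is a minimal decoding sketch (not code from the lecture): it assumes each position's local conditional distribution Pr(y_i | x_i, y_{i-1}) has already been evaluated by an ordinary classifier, and combines those local log-scores with Viterbi. The function name and array layout are illustrative choices.

```python
import numpy as np

def memm_viterbi(local_log_probs):
    """Decode the best tag sequence for an MEMM.

    local_log_probs[0]       : (K,) array, log Pr(y_0 | x_0)
    local_log_probs[i], i>=1 : (K, K) array, log Pr(y_i = cur | x_i, y_{i-1} = prev)
    Each table is one ordinary classification, already evaluated on x_i.
    """
    first, rest = local_log_probs[0], local_log_probs[1:]
    score = first.copy()                 # best log-score ending in each tag
    back = []                            # backpointers, one array per position
    for table in rest:
        cand = score[:, None] + table    # extend every previous tag
        back.append(cand.argmax(axis=0))
        score = cand.max(axis=0)
    # follow backpointers to recover the argmax sequence
    y = [int(score.argmax())]
    for bp in reversed(back):
        y.append(int(bp[y[-1]]))
    return list(reversed(y))

# Toy usage: 3 tags over 4 positions, random local log-probabilities.
rng = np.random.default_rng(0)
tables = [rng.standard_normal(3)] + [rng.standard_normal((3, 3)) for _ in range(3)]
print(memm_viterbi(tables))
```

Beam search would replace the full maximization over previous tags with a maximization over a pruned set of hypotheses.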
4
CRF wrap-up
CRFs: Sequence classification f: x→y is done by:
- Converting x, Y to an MRF
- Using "flow" computations on the MRF to compute the best y|x
[Figure: linear-chain MRF over y1…y6 with potentials φ(Y1,Y2), φ(Y2,Y3), …, conditioned on x1…x6.]
5
CRF wrap-up
CRFs: Sequence classification f: x→y is done by:
- Converting x, Y to an MRF
- Using "flow" computations on the MRF to compute the best y|x (sketched below)
[Figure: the same linear-chain MRF, with the queried best label sequence marked "?".]
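For the CRF half, the "flow" computation can be sketched the same way (again, not the lecture's code): given per-position log node potentials derived from features of x, and a shared log edge potential φ, max-product message passing along the chain recovers the best y given x. The function name and array shapes are assumptions made here for illustration.

```python
import numpy as np

def chain_map_inference(node_log_pot, edge_log_pot):
    """Max-product ("flow") computation on a linear-chain MRF.

    node_log_pot : (n, K) array, log potential of each label at each position
                   (in a CRF these come from features of the whole input x).
    edge_log_pot : (K, K) array, log φ(y_i, y_{i+1}) shared across the chain.
    Returns the MAP label sequence.
    """
    n, K = node_log_pot.shape
    score = node_log_pot[0].copy()
    back = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + edge_log_pot + node_log_pot[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    y = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return list(reversed(y))
```

Replacing the max with a log-sum-exp turns the same pass into the forward algorithm, which is what conditional-likelihood training needs to compute the partition function.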
6
CRF wrap-up
7
CRF wrap-up
8
Klein & Manning: Conditional Structure vs Estimation
9
Task 1: WSD (Word Sense Disambiguation)
Examples:
- "Bush’s election-year ad campaign will begin this summer, with..." (sense 1)
- "Bush whacking is tiring but rewarding; who wants to spend all their time on marked trails?" (sense 2)
The class is sense 1 vs. sense 2; the features are context words.
10
Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes (multinomial model). Use the conditional rule to predict the sense s from context-word observations o. Standard NB training maximizes the "joint likelihood" under the independence assumption.
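For reference, these are the standard multinomial NB forms being referred to; the slide's own equations were images and are not reproduced exactly, so this is just the usual textbook statement.

```latex
% Joint model of a sense s and its context words o = (o_1, ..., o_m):
P(s, \mathbf{o}) = P(s) \prod_{j=1}^{m} P(o_j \mid s)
% Prediction uses the conditional (Bayes) rule:
\hat{s} = \arg\max_{s} P(s \mid \mathbf{o})
        = \arg\max_{s} P(s) \prod_{j=1}^{m} P(o_j \mid s)
% Standard NB training maximizes the joint likelihood over documents d:
\mathrm{JL}(\theta) = \prod_{d} P_{\theta}(s_d, \mathbf{o}_d)
```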
11
Task 1: WSD (Word Sense Disambiguation)
Model 2: Keep the same functional form, but maximize the conditional likelihood (sound familiar?)… or maybe the SenseEval score, or maybe even other objectives.
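The conditional-likelihood objective, in its standard form for the same parameterization, is sketched below; the SenseEval-score objective and the other alternatives on the slide are not reconstructed here.

```latex
% Conditional likelihood with the same NB parameterization (standard form):
\mathrm{CL}(\theta) = \prod_{d} P_{\theta}(s_d \mid \mathbf{o}_d)
  = \prod_{d} \frac{P_{\theta}(s_d) \prod_{j} P_{\theta}(o_{d,j} \mid s_d)}
                   {\sum_{s'} P_{\theta}(s') \prod_{j} P_{\theta}(o_{d,j} \mid s')}
```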
12
In other words…
MaxEnt vs. Naïve Bayes: the same functional form with different "optimization goals"…
… or, equivalently, MaxEnt drops a constraint relating the f's and λ's.
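One way to state the point, as a sketch in standard notation: both classifiers can be written in the same log-linear form; Naive Bayes corresponds to constraining the weights to be log relative-frequency probabilities, while MaxEnt drops that constraint and fits the weights by conditional likelihood.

```latex
% Shared log-linear functional form:
P(s \mid \mathbf{o}) =
  \frac{\exp\bigl(\sum_{i} \lambda_i f_i(s, \mathbf{o})\bigr)}
       {\sum_{s'} \exp\bigl(\sum_{i} \lambda_i f_i(s', \mathbf{o})\bigr)}
% Naive Bayes: the weights are constrained to be log probabilities
% estimated from counts, e.g.
\lambda_{(s)} = \log P(s), \qquad \lambda_{(w, s)} = \log P(w \mid s)
% MaxEnt: drop that constraint and fit the lambdas directly by
% maximizing conditional likelihood.
```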
13
Task 1: WSD (Word Sense Disambiguation)
- Optimize JL with standard NB learning.
- Optimize SCL and CL with conjugate gradient (see the sketch below).
- Also over "non-deficient models" (?), using Lagrange penalties to enforce a "soft" version of the deficiency constraint. I think this makes sure the non-conditional version is a valid probability distribution.
- "Punt" on optimizing accuracy.
- Penalty for extreme predictions in SCL.
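As an illustration of the conjugate-gradient bullet, here is a toy sketch, not the paper's setup: the data, the log-linear parameterization, and the absence of the deficiency penalty are all simplifications. It maximizes conditional log-likelihood with SciPy's CG optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Toy WSD-style data: X[d, w] = count of context word w in document d,
# y[d] = sense label in {0, 1}.  Randomly generated for illustration.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(40, 30)).astype(float)
y = rng.integers(0, 2, size=40)
K, V = 2, X.shape[1]

def neg_cond_log_lik(theta):
    """Negative conditional log-likelihood of a log-linear model
    with one weight per (class, word) plus a per-class bias."""
    W = theta[: K * V].reshape(K, V)
    b = theta[K * V:]
    scores = X @ W.T + b                       # (docs, classes)
    log_Z = np.logaddexp.reduce(scores, axis=1)
    return -(scores[np.arange(len(y)), y] - log_Z).sum()

theta0 = np.zeros(K * V + K)
result = minimize(neg_cond_log_lik, theta0, method="CG")  # conjugate gradient
print("optimized conditional log-likelihood:", -result.fun)
```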
15
Conclusion: maxent beats NB?
All generalizations are wrong?
16
Task 2: POS Tagging
A sequential problem: replace NB with an HMM model.
- Standard algorithms maximize joint likelihood.
- Claim: keeping the same model but maximizing conditional likelihood leads to a CRF. Is this true? (See the sketch below.)
- The alternative is conditional structure (a CMM).
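A hedged sketch of the claim in standard notation: conditioning the HMM joint on the observations yields a distribution with linear-chain CRF form whose features are restricted to the HMM's transition and emission structure.

```latex
% HMM joint likelihood over a tag sequence s and observations o:
P(\mathbf{s}, \mathbf{o}) = \prod_{i} P(s_i \mid s_{i-1}) \, P(o_i \mid s_i)
% Conditioning on o gives a linear-chain CRF form, where in the HMM case
% lambda = log transition and mu = log emission:
P(\mathbf{s} \mid \mathbf{o}) =
  \frac{\prod_{i} \exp\bigl(\lambda_{s_{i-1}, s_i} + \mu_{s_i, o_i}\bigr)}
       {\sum_{\mathbf{s}'} \prod_{i} \exp\bigl(\lambda_{s'_{i-1}, s'_i} + \mu_{s'_i, o_i}\bigr)}
% Maximizing conditional likelihood fits lambda and mu directly,
% i.e. trains a CRF with HMM-shaped features.
```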
17
HMM vs. CRF
18
Using conditional structure vs maximizing conditional likelihood
A CMM factors Pr(s,o) into Pr(s|o)Pr(o). For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e., the JL estimate equals the CL estimate for Pr(s|o).
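The one-line reason, as a sketch: the joint log-likelihood of a CMM splits into two terms with disjoint parameters, so the observation model cannot affect the fitted conditional.

```latex
% The CMM joint log-likelihood splits into terms with disjoint parameters:
\log P_{\theta, \phi}(\mathbf{s}, \mathbf{o})
  = \log P_{\theta}(\mathbf{s} \mid \mathbf{o}) + \log P_{\phi}(\mathbf{o})
% Maximizing the joint over theta therefore maximizes the conditional
% over theta; the observation model phi (however many dependencies
% between observations it encodes) cannot change the fitted P(s | o).
```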
19
Task 2: POS Tagging
Experiments with a simple feature set:
- For a fixed model, CL is preferred to JL (the CRF beats the HMM).
- For a fixed objective, the HMM is preferred to the MEMM/CMM.
20
Error analysis for POS tagging
Label bias is not the issue: state-state dependencies are weak compared to observation-state dependencies.
- Too much emphasis on the observation, not enough on the previous states ("observation bias").
- Put another way: label bias predicts over-prediction of states with few outgoing transitions, or, more generally, low entropy...
21
Error analysis for POS tagging
22
Next: Cohen & Carvalho (from IJCAI-2005, Edinburgh)