Conditional Random Fields. Advanced Statistical Methods in NLP, Ling 572, February 9, 2012
Roadmap: Graphical models; modeling independence; models revisited; generative & discriminative models; conditional random fields; linear-chain models; skip-chain models
Preview. Conditional random fields: an undirected graphical model due to Lafferty, McCallum, and Pereira, 2001. A discriminative model that supports integration of rich feature sets and allows a range of dependency structures (linear-chain, skip-chain, general), so it can encode long-distance dependencies. Used in diverse NLP sequence labeling tasks: named entity recognition, coreference resolution, etc.
Graphical Models
Graphical Models. A graphical model is a simple, graphical notation for conditional independence: a probabilistic model where the graph structure denotes conditional independence between random variables. Nodes: random variables. Edges: dependency relations between random variables. Model types: Bayesian Networks, Markov Random Fields.
Modeling (In)dependence. Bayesian network: a directed acyclic graph (DAG). Nodes = random variables; an arc ~ "directly influences", a conditional dependency. Arcs = child depends on parent(s); no incoming arcs = independent (0 incoming: only a priori). Parents of X = the nodes with arcs into X; for each X we need P(X | Parents(X)).
Example I (figure from Russell & Norvig, AIMA; not reproduced)
Simple Bayesian Network MCBN1 (nodes A, B, C, D, E): A has no parents (only a priori); B depends on A; C depends on A; D depends on B, C; E depends on C. Need: P(A) (truth table of size 2), P(B|A) (2*2), P(C|A) (2*2), P(D|B,C) (2*2*2), P(E|C) (2*2).
Holmes Example (Pearl). Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes' neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes' concern, might (30%) call even if the alarm is silent. Holmes' other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.
Holmes Example: Model. There are four binary random variables: B: whether Holmes' house has been burgled; A: whether his alarm sounded; W: whether Watson called; G: whether Gibbons called. (Network structure: B -> A, A -> W, A -> G.)
Holmes Example: Tables.
P(B): B=#t 0.0001, B=#f 0.9999
P(A|B): B=#t: A=#t 0.95, A=#f 0.05; B=#f: A=#t 0.01, A=#f 0.99
P(W|A): A=#t: W=#t 0.90, W=#f 0.10; A=#f: W=#t 0.30, W=#f 0.70
P(G|A): A=#t: G=#t 0.40, G=#f 0.60; A=#f: G=#t 0.10, G=#f 0.90
Bayes' Nets: Markov Property. Bayes' nets satisfy the local Markov property: variables are conditionally independent of their non-descendants given their parents.
Simple Bayesian Network MCBN1 (A = only a priori; B depends on A; C depends on A; D depends on B, C; E depends on C): P(A,B,C,D,E) = P(A)P(B|A)P(C|A)P(D|B,C)P(E|C). There exist algorithms for training and inference on BNs.
Naïve Bayes Model. Bayes' net: conditional independence of the features given the class Y (Y is the parent of feature nodes f1, f2, f3, ..., fk).
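The factorization itself appears only in the slide images; the standard Naïve Bayes form that this structure implies is:

```latex
P(y, f_1, \dots, f_k) \;=\; P(y)\prod_{i=1}^{k} P(f_i \mid y),
\qquad
\hat{y} \;=\; \arg\max_{y}\; P(y)\prod_{i=1}^{k} P(f_i \mid y)
```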
Hidden Markov Model. A Bayesian network where y_t depends on y_{t-1} and x_t depends on y_t: a chain of hidden states y_1, y_2, y_3, ..., y_k, each emitting an observation x_1, x_2, x_3, ..., x_k.
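Again, the formula is only in the slide images; the factorization implied by these dependencies is the standard HMM joint (with P(y_1 | y_0) read as the initial-state distribution):

```latex
P(x, y) \;=\; \prod_{t=1}^{T} P(y_t \mid y_{t-1})\, P(x_t \mid y_t)
```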
Generative Models. Both Naïve Bayes and HMMs are generative models: "We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y." (Sutton & McCallum, 2006) A state y generates an observation (instance) x. Maximum Entropy and linear-chain Conditional Random Fields (CRFs) are, respectively, their discriminative model counterparts.
Markov Random Fields (aka Markov networks). A graphical representation of a probabilistic model over an undirected graph; can represent cyclic dependencies (vs. the DAG in a Bayesian Network, which can represent induced dependencies). MRFs also satisfy a local Markov property: P(X | all other variables) = P(X | ne(X)), where ne(X) are the neighbors of X.
Factorizing MRFs. Many MRFs can be analyzed in terms of cliques. Clique: in an undirected graph G(V,E), a clique is a subset of vertices of V such that for every pair of vertices v_i, v_j in the subset, the edge (v_i, v_j) exists in E. A maximal clique cannot be extended; a maximum clique is the largest clique in G. (Worked example over nodes A, B, C, D, E due to F. Xia; figure not reproduced.)
MRFs. Given an undirected graph G(V,E) with random variables X and the cliques over G, cl(G), the joint distribution factorizes over those cliques. (Example due to F. Xia; figure not reproduced.)
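The factorization formula itself is in the slide image; the standard clique factorization it refers to is:

```latex
P(X) \;=\; \frac{1}{Z}\prod_{c \in cl(G)} \phi_c(X_c),
\qquad
Z \;=\; \sum_{x}\prod_{c \in cl(G)} \phi_c(x_c)
```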
Conditional Random Fields. Definition due to Lafferty et al., 2001: Let G = (V,E) be a graph such that Y = (Y_v), v in V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w != v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G. A CRF is a Markov Random Field globally conditioned on the observation X, and has the form shown below.
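The form itself is only in the slide image; the standard globally-conditioned clique factorization (following Sutton & McCallum) is:

```latex
p(y \mid x) \;=\; \frac{1}{Z(x)}\prod_{c \in cl(G)} \phi_c(y_c, x),
\qquad
Z(x) \;=\; \sum_{y'}\prod_{c \in cl(G)} \phi_c(y'_c, x)
```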
Linear-Chain CRF. CRFs can have arbitrary graphical structure, but the most common form is the linear chain, which supports sequence modeling for many sequence labeling NLP problems: Named Entity Recognition (NER), coreference. It is similar to combining HMM sequence structure with a MaxEnt model: it supports sequence structure like an HMM (but HMMs can't handle rich feature structure), and it supports rich, overlapping features like MaxEnt (but MaxEnt doesn't directly support sequence labeling).
Discriminative & Generative. Model perspectives (figure from Sutton & McCallum; not reproduced).
Linear-Chain CRFs. Feature functions. In MaxEnt: f: X × Y -> {0,1}, e.g. f_j(x,y) = 1 if x = "rifle" and y = talk.politics.guns, 0 otherwise. In CRFs: f: Y × Y × X × T -> R, e.g. f_k(y_t, y_{t-1}, x, t) = 1 if y_t = V and y_{t-1} = N and x_t = "flies", 0 otherwise; frequently an indicator function, for efficiency.
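A minimal sketch (hypothetical names, not from the slides) of the two kinds of feature functions described above:

```python
# MaxEnt-style feature: f(x, y) -> {0, 1}
def f_rifle_guns(x, y):
    return 1 if x == "rifle" and y == "talk.politics.guns" else 0

# Linear-chain CRF feature: f(y_t, y_prev, x, t) -> R (here an indicator)
def f_verb_after_noun_flies(y_t, y_prev, x, t):
    return 1 if y_t == "V" and y_prev == "N" and x[t] == "flies" else 0

# Example: a word sequence x with candidate tags at position t = 1
x = ["time", "flies", "like", "an", "arrow"]
print(f_verb_after_noun_flies("V", "N", x, 1))  # 1, since x[1] == "flies"
```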
Linear-Chain CRFs (model equations shown on the slides; not reproduced).
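The standard linear-chain form those slides present (following Sutton & McCallum) is:

```latex
p(y \mid x) \;=\; \frac{1}{Z(x)}
  \exp\Bigl(\sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, x, t)\Bigr),
\qquad
Z(x) \;=\; \sum_{y'} \exp\Bigl(\sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y'_t, y'_{t-1}, x, t)\Bigr)
```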
Linear-chain CRFs: Training & Decoding. Training: learn the λ_j; approach similar to MaxEnt, e.g. L-BFGS. Decoding: compute the label sequence that maximizes P(y|x); can use approaches like those for HMMs, e.g. Viterbi.
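Below is a minimal Viterbi decoder sketch for a linear-chain CRF (illustrative only; the feature functions, weights, and label set are hypothetical). Because Z(x) does not depend on y, it cancels in the argmax, so decoding only needs the unnormalized score.

```python
def viterbi(x, labels, features, weights):
    """Return the label sequence maximizing sum_t sum_k w_k * f_k(y_t, y_prev, x, t)."""
    def local_score(y_t, y_prev, t):
        return sum(w * f(y_t, y_prev, x, t) for f, w in zip(features, weights))

    T = len(x)
    delta = [{y: local_score(y, None, 0) for y in labels}]   # best score ending in y at t
    backptr = [{}]
    for t in range(1, T):
        delta.append({})
        backptr.append({})
        for y in labels:
            best_prev = max(labels, key=lambda yp: delta[t - 1][yp] + local_score(y, yp, t))
            delta[t][y] = delta[t - 1][best_prev] + local_score(y, best_prev, t)
            backptr[t][y] = best_prev

    # Follow back-pointers from the best final label.
    y_best = max(labels, key=lambda y: delta[T - 1][y])
    path = [y_best]
    for t in range(T - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path))

# Toy usage with one transition feature and one observation feature.
labels = ["N", "V"]
features = [
    lambda y, yp, x, t: 1 if y == "V" and yp == "N" else 0,
    lambda y, yp, x, t: 1 if y == "N" and x[t].istitle() else 0,
]
weights = [1.0, 2.0]
print(viterbi(["Time", "flies"], labels, features, weights))  # -> ['N', 'V']
```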
Skip-chain CRFs
Motivation. Long-distance dependencies: linear-chain CRFs, HMMs, beam search, etc. all make very local Markov assumptions (preceding label; current data given current label), which is good for some tasks. However, longer context can be useful, e.g. in NER: repeated capitalized words should get the same tag.
Skip-Chain CRFs. Basic approach: augment the linear-chain CRF model with long-distance 'skip edges' and add evidence from both endpoints. Which edges? Identical words, words with the same stem? How many edges? Not too many, since more edges increase inference cost.
Skip Chain CRF Model. Two clique templates: the standard linear-chain template and the skip-edge template (template equations shown on the slides; not reproduced).
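For reference, the two-template form from Sutton & McCallum that these slides follow can be written as below, where each potential Ψ is an exponentiated weighted feature sum and I is the set of skip edges (e.g. pairs of positions holding identical capitalized words):

```latex
p(y \mid x) \;=\; \frac{1}{Z(x)}
  \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x)
  \prod_{(u,v) \in \mathcal{I}} \Psi_{uv}(y_u, y_v, x)
```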
Skip Chain NER. Named Entity Recognition task: start time, end time, speaker, location, in a corpus of seminar announcement emails. All approaches: orthographic, gazetteer, and POS features within a preceding/following 4-word window. Skip-chain CRFs: skip edges between identical capitalized words.
NER Features (feature table shown on the slide; not reproduced).
Skip Chain NER Results. Skip chain improves substantially on 'speaker' recognition; slight reduction in accuracy for times.
Summary. Conditional random fields (CRFs): an undirected graphical model; compare with Bayesian Networks and Markov Random Fields. Linear-chain models: HMM sequence structure + MaxEnt feature models. Skip-chain models: augment with longer-distance dependencies. Pros: good performance. Cons: compute intensive.
HW #5
HW #5: Beam Search. Apply beam search to MaxEnt sequence decoding. Task: POS tagging. Given files: test data (usual format), boundary file (sentence lengths), model file. Comparisons: different topN, topK, beam_width.
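A hedged sketch (not the assignment's reference solution) of beam search for MaxEnt sequence decoding. Assumptions: score(word_index, prev_tag, prev_two_tag) is a caller-supplied function returning a MaxEnt distribution {tag: P(tag | features)}; topN prunes candidate tags per position, topK caps the number of surviving paths, and beam_width prunes paths whose log probability falls too far below the best path at that position. The exact semantics of these parameters are defined by the assignment spec.

```python
import math

def beam_search(sent, score, top_n, top_k, beam_width):
    beam = [([], 0.0)]                        # (tag sequence so far, log probability)
    for i in range(len(sent)):
        candidates = []
        for tags, lp in beam:
            prev = tags[-1] if tags else "BOS"
            prev2 = tags[-2] if len(tags) > 1 else "BOS"
            probs = score(i, prev, prev2)     # MaxEnt distribution over tags (all > 0)
            best_tags = sorted(probs, key=probs.get, reverse=True)[:top_n]
            for t in best_tags:
                candidates.append((tags + [t], lp + math.log(probs[t])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        best_lp = candidates[0][1]
        # Keep paths within beam_width of the best, then cap at top_k paths.
        beam = [c for c in candidates if c[1] + beam_width >= best_lp][:top_k]
    return beam[0][0]
```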
Tag Context. Following Ratnaparkhi '96, the model uses the previous tag (prevT=tag) and the previous tag bigram (prevTwoTags=tag_{i-2}+tag_{i-1}). These are NOT in the data file; you compute them on the fly. Notes: due to sparseness, it is possible a bigram may not appear in the model file; skip it. These are feature functions: if you have a different candidate tag for the same word, the weights will differ.
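A small sketch (hypothetical helpers, not part of the provided HW files) of building these context features on the fly and looking up their weights, skipping bigrams absent from the model:

```python
def context_features(prev_tag, prev_prev_tag):
    """Ratnaparkhi-style tag-context features for the current position."""
    return ["prevT=" + prev_tag,
            "prevTwoTags=" + prev_prev_tag + "+" + prev_tag]

def add_context_weights(feats, weights, candidate_tag):
    """Sum model weights for (candidate_tag, feature) pairs; unseen pairs contribute 0."""
    return sum(weights.get((candidate_tag, f), 0.0) for f in feats)
```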
Uncertainty. Real-world tasks are partially observable, stochastic, and extremely complex. Probabilities capture "ignorance & laziness": we lack relevant facts and conditions, and we fail to enumerate all conditions and exceptions.
Motivation. Uncertainty in medical diagnosis: diseases produce symptoms; in diagnosis, observed symptoms => disease ID. Uncertainties: symptoms may not occur; symptoms may not be reported; diagnostic tests are not perfect (false positives, false negatives). How do we estimate confidence?