1
State-of-the-art NLP Approaches to Coreference Resolution: Theory and Practical Recipes
Simone Paolo Ponzetto University of Heidelberg Massimo Poesio University of Trento
2
Road map
The “simple” model from Soon et al. (2001) has two major drawbacks:
Decision locality
Knowledge bottleneck
3
Global constraints for coreference
Decision locality: coreference decisions are only locally optimized
No dependencies between different local coreference decisions are modeled
We would like to enforce transitivity
4
Overcoming the knowledge bottleneck
Numerous knowledge sources play a role in coreference, e.g. world and common-sense knowledge
… but the model relies on a small set of shallow, surface-level features
5
Twin-candidate model for anaphora resolution (Yang et al., 2008)
Learn the preference relationship between competing candidates
The antecedent is then the best, i.e. most preferred, candidate among a set of competing candidates
6
Twin-candidate model for anaphora resolution (Yang et al., 2008)
The probability that a candidate i is preferred over all the other competing candidates c1, …, cm:
P(ante = i | anaphor, C) = P(i ≻ c1, …, i ≻ cm)
Assuming that the preferences between candidate pairs are independent of each other:
P(ante = i | anaphor, C) = ∏ over c ∈ C, c ≠ i of P(i ≻ c)
7
Twin-candidate model for anaphora resolution (Yang et al., 2008)
The probability that a candidate is selected as the antecedent can be calculated from the preference classification results between the candidate and its opponents
The actual antecedent for an anaphor is the one maximizing this probability
9
Single-candidate vs. twin-candidate model
Single-candidate instance: <ANAPHOR (j), ANTECEDENT (i)>
Twin-candidate instance: <ANAPHOR (j), COMPETITOR_1 (i), COMPETITOR_2 (k)>
10
Single-candidate vs. twin-candidate model
Single-candidate class label: COREF, NOT COREF
Twin-candidate class label: COMPETITOR_1, COMPETITOR_2
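To make the two encodings concrete, here is a minimal Python sketch of the instance types and their labels (the class names and the "10"/"01" shorthand used on the following slides are illustrative, not code from Yang et al.):

from dataclasses import dataclass

# Single-candidate instance: one (anaphor, candidate) pair,
# labeled COREF or NOT_COREF.
@dataclass
class SingleCandidateInstance:
    anaphor: str
    candidate: str
    label: str  # "COREF" or "NOT_COREF"

# Twin-candidate instance: the anaphor plus two competing candidates,
# labeled by which competitor is preferred
# ("10" = COMPETITOR_1, "01" = COMPETITOR_2).
@dataclass
class TwinCandidateInstance:
    anaphor: str
    competitor_1: str
    competitor_2: str
    label: str  # "10" or "01"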
11
Yang et al. (2008): generating training instances
12
Yang et al. (2008): generating training instances
<Its, Friday, Israel> → 01 (the second candidate, Israel, is the antecedent)
13
Yang et al. (2008): generating training instances
<Its, defense minister, Israel> → 01
14
Yang et al. (2008): generating training instances
<Its, non-conventional weapons, Israel> → 01
15
Yang et al. (2008): classifier generation
In the twin-candidate model, replace each feature “Candi_X” with “Candi1_X” and “Candi2_X”
Classifiers include C5 and MaxEnt
16
Yang et al. (2008): antecedent identification as tournament elimination
Candidates are compared linearly from the beginning of the document to the end. Each candidate in turn is paired with the next candidate and passed to the classifier to determine the preference. The “losing” candidate, judged less preferred by the classifier, is eliminated and never considered again. The “winner” is compared with the next candidate.
17
Yang et al. (2008): antecedent identification as tournament elimination
The process continues until all the preceding candidates have been compared
The candidate that wins the last comparison is selected as the antecedent
Computational complexity of O(N) for N candidates
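As a concrete illustration, here is a minimal Python sketch of the tournament procedure; the preference classifier is stubbed out as a lookup over the matches shown on the next slides (names and the stub are illustrative, not Yang et al.'s actual code):

def tournament(anaphor, candidates, prefer):
    # Candidates are processed linearly, in document order.
    # prefer(anaphor, c1, c2) returns the candidate judged more likely
    # to be the antecedent; the loser is eliminated for good.
    if not candidates:
        return None
    winner = candidates[0]
    for challenger in candidates[1:]:
        winner = prefer(anaphor, winner, challenger)
    return winner  # N - 1 matches in total: O(N)

# Replaying the matches from the slides below:
outcomes = {
    ("Israel", "the United States"): "Israel",
    ("Israel", "a military strike"): "Israel",
    ("Israel", "Iraq"): "Iraq",
    ("Iraq", "the Jewish state"): "the Jewish state",
}
prefer = lambda ana, c1, c2: outcomes[(c1, c2)]
candidates = ["Israel", "the United States", "a military strike",
              "Iraq", "the Jewish state"]
print(tournament("Its", candidates, prefer))  # -> the Jewish state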
18
Yang et al. (2008): antecedent identification as tournament elimination
<Its, Israel, the United States> => Israel
19
Yang et al. (2008): antecedent identification as tournament elimination
<Its, Israel, a military strike> => Israel
20
Yang et al. (2008): antecedent identification as tournament elimination
<Its, Israel, Iraq> => Iraq
21
Yang et al. (2008): antecedent identification as tournament elimination
<Its, Iraq, the Jewish state> => the Jewish state
22
Yang et al. (2008): antecedent identification as round robin
Compare all antecedent candidates with each other
Select the antecedent with the best record of wins
Computational complexity of O(N²) for N candidates
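A matching sketch of the round-robin strategy (same illustrative stub convention as the tournament sketch above; not the authors' code):

from collections import defaultdict
from itertools import combinations

def round_robin(anaphor, candidates, prefer):
    # Every candidate is compared with every other candidate once:
    # N * (N - 1) / 2 matches, hence O(N^2).
    score = defaultdict(int)
    for c1, c2 in combinations(candidates, 2):
        winner = prefer(anaphor, c1, c2)
        loser = c2 if winner == c1 else c1
        score[winner] += 1  # win: +1
        score[loser] -= 1   # loss: -1
    # The antecedent is the candidate with the best record of wins.
    return max(candidates, key=lambda c: score[c])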
23
Yang et al. (2008): antecedent identification as round robin
[Step-by-step round-robin comparison example, shown as an animation over several slides in the original deck]
Antecedent identification as simple round robin
Each pairwise match gives the winner +1 and the loser −1; the highest-scoring candidate is selected:
NP1 Israel: +1 (vs NP2), +1 (vs NP3) → Score +2
NP2 United States: −1 (vs NP1), +1 (vs NP3) → Score 0
NP3 a military strike: −1 (vs NP1), −1 (vs NP2) → Score −2
30
Antecedent identification as weighted round robin
The winner of each match gains the classifier's confidence p; the loser loses 1 − p:
NP1 Israel: +0.55 (vs NP2), +0.9 (vs NP3) → Score 1.45
NP2 United States: −0.45 (vs NP1), +0.8 (vs NP3) → Score 0.35
NP3 a military strike: −0.1 (vs NP1), −0.2 (vs NP2) → Score −0.3
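The weighted scores above can be reproduced by letting each match distribute the classifier's confidence p: the preferred candidate gains p and the other loses 1 − p (one plausible reading of the table; the values below are taken from it):

from collections import defaultdict

# Classifier confidence that the first candidate is preferred,
# for each of the three matches (values from the table above).
confidence = {
    ("Israel", "United States"): 0.55,
    ("Israel", "a military strike"): 0.9,
    ("United States", "a military strike"): 0.8,
}

score = defaultdict(float)
for (preferred, other), p in confidence.items():
    score[preferred] += p      # winner gains p
    score[other] -= (1.0 - p)  # loser loses 1 - p

print(dict(score))
# {'Israel': 1.45, 'United States': 0.35..., 'a military strike': -0.3...}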
31
Yang et al. (2008): Results
32
Twin-candidate model for coreference resolution (Yang et al., 2008)
The model we have seen so far works for pronominal anaphora resolution: for each NP, it will always look for a best antecedent
However, for coreference some NPs are not anaphoric
Extend the classification model to include a special class for non-anaphors
33
Single-candidate vs. twin-candidate model (coreference)
Single-candidate instance: <ANAPHOR (j), ANTECEDENT (i)>
Twin-candidate instance: <ANAPHOR (j), COMPETITOR_1 (i), COMPETITOR_2 (k)>
34
Single-candidate vs. twin-candidate model (coreference)
Single-candidate class label: COREF, NOT COREF
Twin-candidate class label: COMPETITOR_1, COMPETITOR_2, NONE
35
Yang et al. (2008): generating training instances (coreference)
<Israel, the Jewish state, Iraqi attack> → 10 (the first candidate, the Jewish state, is coreferent with Israel)
36
Yang et al. (2008): generating training instances (coreference)
<Israel, the Jewish state, non-conventional weapons> → 10
37
Yang et al. (2008): generating training instances (coreference)
<Israel, the United States, the Jewish state> → 01
38
Yang et al. (2008): generating training instances (coreference)
39
Yang et al. (2008): generating training instances (coreference)
<Lipkin-Shahak, the United States, Iraq> → NONE
40
Yang et al. (2008): generating training instances (coreference)
<Lipkin-Shahak, the United States, Friday> → NONE
41
Yang et al. (2008): classifier generation for coreference
In the twin-candidate model, replace each feature “Candi_X” with “Candi1_X” and “Candi2_X”
Classifiers include C5 and MaxEnt
42
Yang et al. (2008): antecedent identification as tournament elimination (coreference)
Same as for pronominal anaphors
Modification for non-anaphoric mentions:
If an instance is classified as NONE, both competing candidates are discarded
If the candidates in the last match are judged to be in a NONE relation, the mention is left unresolved
43
Yang et al. (2008): antecedent identification as round robin (coreference)
Same as for pronominal anaphors
Modification for non-anaphoric mentions:
If an instance is classified as NONE, both competing candidates receive a penalty of −1
A mention is considered non-anaphoric and left unresolved if no candidate has a positive final score
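A sketch of this round-robin variant with the NONE class (classify is a stub returning "1", "2", or "NONE"; illustrative only, not the authors' code):

from collections import defaultdict
from itertools import combinations

def round_robin_coref(mention, candidates, classify):
    score = defaultdict(float)
    for c1, c2 in combinations(candidates, 2):
        outcome = classify(mention, c1, c2)
        if outcome == "NONE":
            # Both competing candidates receive a penalty of -1.
            score[c1] -= 1
            score[c2] -= 1
        else:
            winner, loser = (c1, c2) if outcome == "1" else (c2, c1)
            score[winner] += 1
            score[loser] -= 1
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score[c])
    # Non-anaphoric: no candidate ends up with a positive final score.
    return best if score[best] > 0 else None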
44
Yang et al. (2008): Results for coreference
45
Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions
Structures the search space as a Bell tree, grown mention by mention: from [1], linking or starting gives [12] and [1][2]; one level down, [123], [12][3], [13][2], [1][23], [1][2][3]
Leaves contain all the possible partitions of all of the mentions
It is computationally infeasible to expand all nodes in the Bell tree, so only the most promising nodes are expanded
How to determine which nodes are promising?
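To see why exhaustive expansion is infeasible: the leaves are exactly the set partitions of the mentions, and their number (the Bell number) explodes quickly. A small sketch mirroring the tree's two operations:

def partitions(mentions):
    # Grow partitions mention by mention: each new mention either joins
    # an existing entity or starts a new one -- the two Bell-tree steps.
    parts = [[]]
    for m in mentions:
        grown = []
        for p in parts:
            for i in range(len(p)):          # link m to entity i
                grown.append(p[:i] + [p[i] + [m]] + p[i + 1:])
            grown.append(p + [[m]])          # start a new entity with m
        parts = grown
    return parts

print(len(partitions([1, 2, 3])))        # 5, the leaves listed above
print(len(partitions(list(range(10)))))  # 115975 for just 10 mentions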
55
Bell-Tree Clustering (Luo et al., 2004)
The partial entity a mention considers linking with is the in-focus entity
The current mention to be linked is the active mention
In the tree diagrams, in-focus entities are highlighted on the edges and active mentions are marked with *
57
Bell-Tree Clustering (Luo et al., 2004)
The model we are after must estimate P(L=1 | E_k, m_k, A_k), where:
E_k is the set of partially-established entities
m_k is the current mention, to be linked or not
A_k tells us which entity is in focus
58
Bell-Tree Clustering (Luo et al., 2004)
Each edge of the Bell tree carries a probability, e.g.:
Linking mention 2 to [1]: P(L=1|E2={[1]},”2”,A2=[1])
Starting a new entity with 2: P(L=0|E2={[1]},”2”)
Linking mention 3: P(L=1|E3={[1,2]},”3”,A3=[1,2]), P(L=1|E3={[1],[2]},”3”,A3=[1]), P(L=1|E3={[1],[2]},”3”,A3=[2])
How to compute the probability of an entity-starting mention? Derive it from the linking probabilities:
P(L=0|E3={[1,2]},”3”) = ?  P(L=0|E3={[1],[2]},”3”) = ?
62
Entity starting probability
The probability of starting a new entity is the complement of the best linking probability:
P(L=0 | E_k, m_k) = 1 − max over e ∈ E_k of P(L=1 | E_k, m_k, A_k = e)
63
Entity-mention model
What about P(L=1 | E_k, m_k, A_k = e)?
ASSUMPTION: entities other than the one in focus have no influence on the linking decision, i.e. P(L=1 | E_k, m_k, A_k = e) ≈ P(L=1 | e, m_k)
64
Mention-pair model
What about P(L=1 | e, m_k)?
ASSUMPTION: the entity-mention score can be obtained from the maximum mention-pair score, i.e. P(L=1 | e, m_k) ≈ max over m ∈ e of Pc(m, m_k)
65
Classifier training
Probabilities for both models are estimated from the training data using a maximum entropy model
66
Mention-pair model: features
Lexical features
67
Mention-pair model: features
Distance features
Syntax features
68
Mention-pair model: features
Count feature: how many times a mention occurred in the document
Pronoun features
69
Entity-mention model: features
Remove pair-specific features (e.g. PoS pairs)
Lexical features test the active mention against all mentions in the in-focus entity
Distance features take the minimum distance between the mentions in the in-focus entity and the active mention
70
Bell-Tree Clustering: example
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier
Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7
Root: score([1]) = 1
Link 2 to [1]: score([12]) = 1 * Pc(1,2) = 1 * 0.6 = 0.6
Start a new entity with 2: score([1][2]) = 1 * (1 − Pc(1,2)) = 1 * (1 − 0.6) = 0.4
Link 3 to [12]: score([123]) = 0.6 * max(Pc(1,3), Pc(2,3)) = 0.6 * max(0.2, 0.7) = 0.42
Start a new entity with 3: score([12][3]) = 0.6 * (1 − max(Pc(1,3), Pc(2,3))) = 0.6 * (1 − 0.7) = 0.18
Link 3 to [1]: score([13][2]) = 0.4 * Pc(1,3) = 0.4 * 0.2 = 0.08
Link 3 to [2]: score([1][23]) = 0.4 * Pc(2,3) = 0.4 * 0.7 = 0.28
Start a new entity with 3: score([1][2][3]) = 0.4 * (1 − max(Pc(1,3), Pc(2,3))) = 0.4 * (1 − 0.7) = 0.12
To keep the search feasible, expand only the N most probable nodes at each level
84
Bell Tree: search algorithm
[Algorithm pseudocode figure: at each step the active mention is either linked to an in-focus entity (mention-linking) or starts a new entity (entity-starting); low-probability nodes are pruned]
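Putting the pieces together, a compact sketch of the search: mention-pair max linking, entity starting as 1 − max, and beam pruning. This is an illustrative reimplementation under those assumptions, not Luo et al.'s code; run on the Pc values of the worked example above, it recovers the node scores shown there:

def link_prob(entity, mention, pc):
    # Mention-pair approximation: the entity-mention linking score is
    # the maximum pairwise coreference probability.
    return max(pc[frozenset((m, mention))] for m in entity)

def bell_tree_search(mentions, pc, beam=2):
    # Each node is (score, partition); a partition is a tuple of
    # frozensets (the partially-established entities).
    nodes = [(1.0, ())]
    for m in mentions:
        children = []
        for score, part in nodes:
            probs = [link_prob(e, m, pc) for e in part]
            # Mention-linking: put m into each in-focus entity in turn.
            for e, p in zip(part, probs):
                grown = tuple(x | {m} if x is e else x for x in part)
                children.append((score * p, grown))
            # Entity-starting: probability 1 - max linking probability.
            children.append((score * (1.0 - max(probs, default=0.0)),
                             part + (frozenset({m}),)))
        # Pruning: keep only the `beam` most probable nodes per level.
        nodes = sorted(children, key=lambda n: n[0], reverse=True)[:beam]
    return nodes[0]

pc = {frozenset((1, 2)): 0.6, frozenset((1, 3)): 0.2, frozenset((2, 3)): 0.7}
score, part = bell_tree_search([1, 2, 3], pc, beam=5)
print(round(score, 2), [sorted(e) for e in part])  # 0.42 [[1, 2, 3]]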
87
Bell Tree: results
No statistically significant difference between MP and EM (at p-value 0.05)
MP requires 20 times more features than EM
Features for EM need more engineering…