1
State-of-the-art NLP Approaches to Coreference Resolution: Theory and Practical Recipes
Simone Paolo Ponzetto University of Heidelberg Massimo Poesio University of Trento
2
Road map
The “simple” model from Soon et al. (2001) has two major drawbacks:
Decision locality
Knowledge bottleneck
3
Global constraints for coreference
Decision locality: coreference decisions are only locally optimized
No dependencies between different local coreference decisions are modeled
We would like to enforce transitivity
4
Overcoming the knowledge bottleneck
Numerous knowledge sources play a role in coreference, e.g. world and common-sense knowledge
… but the model relies on a small set of shallow, surface-level features
5
Twin-candidate model for anaphora resolution (Yang et al., 2008)
Learn the preference relationship between competing candidates
The antecedent is then the best, i.e. most preferred, candidate among a set of competing candidates
6
Twin-candidate model for anaphora resolution (Yang et al., 2008)
The probability that a candidate i is preferred over all the other competing candidates c1, …, cm:
P(ante = i | anaphor, C) = P(i ≻ c1, …, i ≻ cm)
Assuming that the preferences between candidate pairs are independent of each other:
P(ante = i | anaphor, C) = ∏ over c ∈ C, c ≠ i of P(i ≻ c)
7
Twin-candidate model for anaphora resolution (Yang et al., 2008)
The probability that a candidate is selected as the antecedent can be calculated from the preference classification results between the candidate and its opponents
The actual antecedent for an anaphor is the one maximizing this probability
9
Single-candidate vs. twin-candidate model
Single-candidate instance: <ANAPHOR (j), ANTECEDENT (i)>
Twin-candidate instance: <ANAPHOR (j), COMPETITOR_1 (i), COMPETITOR_2 (k)>
10
Single-candidate vs. twin-candidate model
Single-candidate class label: COREF, NOT COREF
Twin-candidate class label: COMPETITOR_1, COMPETITOR_2
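To make the two encodings concrete, here is a minimal Python sketch of the instance types and their labels (the class names and the "10"/"01" shorthand used on the following slides are illustrative, not code from Yang et al.):

from dataclasses import dataclass

# Single-candidate instance: one (anaphor, candidate) pair,
# labeled COREF or NOT_COREF.
@dataclass
class SingleCandidateInstance:
    anaphor: str
    candidate: str
    label: str  # "COREF" or "NOT_COREF"

# Twin-candidate instance: the anaphor plus two competing candidates,
# labeled by which competitor is preferred
# ("10" = COMPETITOR_1, "01" = COMPETITOR_2).
@dataclass
class TwinCandidateInstance:
    anaphor: str
    competitor_1: str
    competitor_2: str
    label: str  # "10" or "01"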
11
Yang et al. (2008): generating training instances
12
Yang et al. (2008): generating training instances
<Its, Friday, Israel> → 01 (the second candidate, Israel, is the antecedent)
13
Yang et al. (2008): generating training instances
<Its, defense minister, Israel> → 01
14
Yang et al. (2008): generating training instances
<Its, non-conventional weapons, Israel> → 01
15
Yang et al. (2008): classifier generation
In the twin-candidate model, replace each feature “Candi_X” with “Candi1_X” and “Candi2_X”
Classifiers include C5 and MaxEnt
16
Yang et al. (2008): antecedent identification as tournament elimination
Candidates are compared linearly from the beginning of the document to the end. Each candidate in turn is paired with the next candidate and passed to the classifier to determine the preference. The “losing” candidate, judged less preferred by the classifier, is eliminated and never considered again. The “winner” is compared with the next candidate.
17
Yang et al. (2008): antecedent identification as tournament elimination
The process continues until all the preceding candidates have been compared
The candidate that wins the last comparison is selected as the antecedent
Computational complexity of O(N) for N candidates
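As a concrete illustration, here is a minimal Python sketch of the tournament procedure; the preference classifier is stubbed out as a lookup over the matches shown on the next slides (names and the stub are illustrative, not Yang et al.'s actual code):

def tournament(anaphor, candidates, prefer):
    # Candidates are processed linearly, in document order.
    # prefer(anaphor, c1, c2) returns the candidate judged more likely
    # to be the antecedent; the loser is eliminated for good.
    if not candidates:
        return None
    winner = candidates[0]
    for challenger in candidates[1:]:
        winner = prefer(anaphor, winner, challenger)
    return winner  # N - 1 matches in total: O(N)

# Replaying the matches from the slides below:
outcomes = {
    ("Israel", "the United States"): "Israel",
    ("Israel", "a military strike"): "Israel",
    ("Israel", "Iraq"): "Iraq",
    ("Iraq", "the Jewish state"): "the Jewish state",
}
prefer = lambda ana, c1, c2: outcomes[(c1, c2)]
candidates = ["Israel", "the United States", "a military strike",
              "Iraq", "the Jewish state"]
print(tournament("Its", candidates, prefer))  # -> the Jewish state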
18
Yang et al. (2008): antecedent identification as tournament elimination
<Its, Israel, the United States> => Israel
19
Yang et al. (2008): antecedent identification as tournament elimination
<Its, Israel, a military strike> => Israel
20
Yang et al. (2008): antecedent identification as tournament elimination
<Its, Israel, Iraq> => Iraq
21
Yang et al. (2008): antecedent identification as tournament elimination
<Its, Iraq, the Jewish state> => the Jewish state
22
Yang et al. (2008): antecedent identification as round robin
Compare all antecedent candidates with each other
Select the antecedent with the best record of wins
Computational complexity of O(N²) for N candidates
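A matching sketch of the round-robin strategy (same illustrative stub convention as the tournament sketch above; not the authors' code):

from collections import defaultdict
from itertools import combinations

def round_robin(anaphor, candidates, prefer):
    # Every candidate is compared with every other candidate once:
    # N * (N - 1) / 2 matches, hence O(N^2).
    score = defaultdict(int)
    for c1, c2 in combinations(candidates, 2):
        winner = prefer(anaphor, c1, c2)
        loser = c2 if winner == c1 else c1
        score[winner] += 1  # win: +1
        score[loser] -= 1   # loss: -1
    # The antecedent is the candidate with the best record of wins.
    return max(candidates, key=lambda c: score[c])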
23
Yang et al. (2008): antecedent identification as round robin
[Step-by-step round-robin comparison example, shown as an animation over several slides in the original deck]
Antecedent identification as simple round robin
Each pairwise match gives the winner +1 and the loser −1; the highest-scoring candidate is selected:
NP1 Israel: +1 (vs NP2), +1 (vs NP3) → Score +2
NP2 United States: −1 (vs NP1), +1 (vs NP3) → Score 0
NP3 a military strike: −1 (vs NP1), −1 (vs NP2) → Score −2
30
Antecedent identification as weighted round robin
The winner of each match gains the classifier's confidence p; the loser loses 1 − p:
NP1 Israel: +0.55 (vs NP2), +0.9 (vs NP3) → Score 1.45
NP2 United States: −0.45 (vs NP1), +0.8 (vs NP3) → Score 0.35
NP3 a military strike: −0.1 (vs NP1), −0.2 (vs NP2) → Score −0.3
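The weighted scores above can be reproduced by letting each match distribute the classifier's confidence p: the preferred candidate gains p and the other loses 1 − p (one plausible reading of the table; the values below are taken from it):

from collections import defaultdict

# Classifier confidence that the first candidate is preferred,
# for each of the three matches (values from the table above).
confidence = {
    ("Israel", "United States"): 0.55,
    ("Israel", "a military strike"): 0.9,
    ("United States", "a military strike"): 0.8,
}

score = defaultdict(float)
for (preferred, other), p in confidence.items():
    score[preferred] += p      # winner gains p
    score[other] -= (1.0 - p)  # loser loses 1 - p

print(dict(score))
# {'Israel': 1.45, 'United States': 0.35..., 'a military strike': -0.3...}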
31
Yang et al. (2008): Results
32
Twin-candidate model for coreference resolution (Yang et al., 2008)
The model we have seen so far works for pronominal anaphora resolution: for each NP, it will always look for a best antecedent
However, for coreference some NPs are not anaphoric
Extend the classification model to include a special class for non-anaphors
33
Single-candidate vs. twin-candidate model (coreference)
Single-candidate instance: <ANAPHOR (j), ANTECEDENT (i)>
Twin-candidate instance: <ANAPHOR (j), COMPETITOR_1 (i), COMPETITOR_2 (k)>
34
Single-candidate vs. twin-candidate model (coreference)
Single-candidate class label: COREF, NOT COREF
Twin-candidate class label: COMPETITOR_1, COMPETITOR_2, NONE
35
Yang et al. (2008): generating training instances (coreference)
<Israel, the Jewish state, Iraqi attack> → 10 (the first candidate, the Jewish state, is coreferent with Israel)
36
Yang et al. (2008): generating training instances (coreference)
<Israel, the Jewish state, non-conventional weapons> → 10
37
Yang et al. (2008): generating training instances (coreference)
<Israel, the United States, the Jewish state> → 01
38
Yang et al. (2008): generating training instances (coreference)
39
Yang et al. (2008): generating training instances (coreference)
<Lipkin-Shahak, the United States, Iraq> → NONE
40
Yang et al. (2008): generating training instances (coreference)
<Lipkin-Shahak, the United States, Friday> → NONE
41
Yang et al. (2008): classifier generation for coreference
In the twin-candidate model, replace each feature “Candi_X” with “Candi1_X” and “Candi2_X”
Classifiers include C5 and MaxEnt
42
Yang et al. (2008): antecedent identification as tournament elimination (coreference)
Same as for pronominal anaphors
Modification for non-anaphoric mentions:
If an instance is classified as NONE, both competing candidates are discarded
If the candidates in the last match are judged to be in a NONE relation, the mention is left unresolved
43
Yang et al. (2008): antecedent identification as round robin (coreference)
Same as for pronominal anaphors
Modification for non-anaphoric mentions:
If an instance is classified as NONE, both competing candidates receive a penalty of −1
A mention is considered non-anaphoric and left unresolved if no candidate has a positive final score
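A sketch of this round-robin variant with the NONE class (classify is a stub returning "1", "2", or "NONE"; illustrative only, not the authors' code):

from collections import defaultdict
from itertools import combinations

def round_robin_coref(mention, candidates, classify):
    score = defaultdict(float)
    for c1, c2 in combinations(candidates, 2):
        outcome = classify(mention, c1, c2)
        if outcome == "NONE":
            # Both competing candidates receive a penalty of -1.
            score[c1] -= 1
            score[c2] -= 1
        else:
            winner, loser = (c1, c2) if outcome == "1" else (c2, c1)
            score[winner] += 1
            score[loser] -= 1
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score[c])
    # Non-anaphoric: no candidate ends up with a positive final score.
    return best if score[best] > 0 else None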
44
Yang et al. (2008): Results for coreference
45
Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions
Structures the search space as a Bell tree, grown mention by mention: from [1], linking or starting gives [12] and [1][2]; one level down, [123], [12][3], [13][2], [1][23], [1][2][3]
Leaves contain all the possible partitions of all of the mentions
It is computationally infeasible to expand all nodes in the Bell tree, so only the most promising nodes are expanded
How to determine which nodes are promising?
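To see why exhaustive expansion is infeasible: the leaves are exactly the set partitions of the mentions, and their number (the Bell number) explodes quickly. A small sketch mirroring the tree's two operations:

def partitions(mentions):
    # Grow partitions mention by mention: each new mention either joins
    # an existing entity or starts a new one -- the two Bell-tree steps.
    parts = [[]]
    for m in mentions:
        grown = []
        for p in parts:
            for i in range(len(p)):          # link m to entity i
                grown.append(p[:i] + [p[i] + [m]] + p[i + 1:])
            grown.append(p + [[m]])          # start a new entity with m
        parts = grown
    return parts

print(len(partitions([1, 2, 3])))        # 5, the leaves listed above
print(len(partitions(list(range(10)))))  # 115975 for just 10 mentions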
55
Bell-Tree Clustering (Luo et al., 2004)
The partial entity a mention considers linking with is the in-focus entity
The current mention to be linked is the active mention
In the tree diagrams, in-focus entities are highlighted on the edges and active mentions are marked with *
57
Bell-Tree Clustering (Luo et al., 2004)
The model we are after must estimate P(L=1 | E_k, m_k, A_k), where:
E_k is the set of partially-established entities
m_k is the current mention, to be linked or not
A_k tells us which entity is in focus
58
Bell-Tree Clustering (Luo et al., 2004)
Each edge of the Bell tree carries a probability, e.g.:
Linking mention 2 to [1]: P(L=1|E2={[1]},”2”,A2=[1])
Starting a new entity with 2: P(L=0|E2={[1]},”2”)
Linking mention 3: P(L=1|E3={[1,2]},”3”,A3=[1,2]), P(L=1|E3={[1],[2]},”3”,A3=[1]), P(L=1|E3={[1],[2]},”3”,A3=[2])
How to compute the probability of an entity-starting mention? Derive it from the linking probabilities:
P(L=0|E3={[1,2]},”3”) = ?  P(L=0|E3={[1],[2]},”3”) = ?
62
Entity starting probability
The probability of starting a new entity is the complement of the best linking probability:
P(L=0 | E_k, m_k) = 1 − max over e ∈ E_k of P(L=1 | E_k, m_k, A_k = e)
63
Entity-mention model
What about P(L=1 | E_k, m_k, A_k = e)?
ASSUMPTION: entities other than the one in focus have no influence on the linking decision, i.e. P(L=1 | E_k, m_k, A_k = e) ≈ P(L=1 | e, m_k)
64
Mention-pair model
What about P(L=1 | e, m_k)?
ASSUMPTION: the entity-mention score can be obtained from the maximum mention-pair score, i.e. P(L=1 | e, m_k) ≈ max over m ∈ e of Pc(m, m_k)
65
Classifier training
Probabilities for both models are estimated from the training data using a maximum entropy model
66
Mention-pair model: features
Lexical features
67
Mention-pair model: features
Distance features
Syntax features
68
Mention-pair model: features
Count feature: how many times a mention occurred in the document
Pronoun features
69
Entity-mention model: features
Remove pair-specific features (e.g. PoS pairs)
Lexical features test the active mention against all mentions in the in-focus entity
Distance features take the minimum distance between the mentions in the in-focus entity and the active mention
70
Bell-Tree Clustering: example
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier
Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7
Root: score([1]) = 1
Link 2 to [1]: score([12]) = 1 * Pc(1,2) = 1 * 0.6 = 0.6
Start a new entity with 2: score([1][2]) = 1 * (1 − Pc(1,2)) = 1 * (1 − 0.6) = 0.4
Link 3 to [12]: score([123]) = 0.6 * max(Pc(1,3), Pc(2,3)) = 0.6 * max(0.2, 0.7) = 0.42
Start a new entity with 3: score([12][3]) = 0.6 * (1 − max(Pc(1,3), Pc(2,3))) = 0.6 * (1 − 0.7) = 0.18
Link 3 to [1]: score([13][2]) = 0.4 * Pc(1,3) = 0.4 * 0.2 = 0.08
Link 3 to [2]: score([1][23]) = 0.4 * Pc(2,3) = 0.4 * 0.7 = 0.28
Start a new entity with 3: score([1][2][3]) = 0.4 * (1 − max(Pc(1,3), Pc(2,3))) = 0.4 * (1 − 0.7) = 0.12
To keep the search feasible, expand only the N most probable nodes at each level
84
Bell Tree: search algorithm
[Algorithm pseudocode figure: at each step the active mention is either linked to an in-focus entity (mention-linking) or starts a new entity (entity-starting); low-probability nodes are pruned]
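Putting the pieces together, a compact sketch of the search: mention-pair max linking, entity starting as 1 − max, and beam pruning. This is an illustrative reimplementation under those assumptions, not Luo et al.'s code; run on the Pc values of the worked example above, it recovers the node scores shown there:

def link_prob(entity, mention, pc):
    # Mention-pair approximation: the entity-mention linking score is
    # the maximum pairwise coreference probability.
    return max(pc[frozenset((m, mention))] for m in entity)

def bell_tree_search(mentions, pc, beam=2):
    # Each node is (score, partition); a partition is a tuple of
    # frozensets (the partially-established entities).
    nodes = [(1.0, ())]
    for m in mentions:
        children = []
        for score, part in nodes:
            probs = [link_prob(e, m, pc) for e in part]
            # Mention-linking: put m into each in-focus entity in turn.
            for e, p in zip(part, probs):
                grown = tuple(x | {m} if x is e else x for x in part)
                children.append((score * p, grown))
            # Entity-starting: probability 1 - max linking probability.
            children.append((score * (1.0 - max(probs, default=0.0)),
                             part + (frozenset({m}),)))
        # Pruning: keep only the `beam` most probable nodes per level.
        nodes = sorted(children, key=lambda n: n[0], reverse=True)[:beam]
    return nodes[0]

pc = {frozenset((1, 2)): 0.6, frozenset((1, 3)): 0.2, frozenset((2, 3)): 0.7}
score, part = bell_tree_search([1, 2, 3], pc, beam=5)
print(round(score, 2), [sorted(e) for e in part])  # 0.42 [[1, 2, 3]]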
87
Bell Tree: results
No statistically significant difference between MP and EM (at p-value 0.05)
MP requires 20 times more features than EM
Features for EM need more engineering…