Coreference
NYU CSCI-GA.2591
Ralph Grishman
Basically a clustering task: clustering mentions into entities
Types of Referring Expressions
- names
- nominals
- pronouns
Types of ‘Coreference’
- Identity
- Predication (ACE considers this part of coreference)
- Bridging anaphora (not included in ACE)
Strategies
- mention-pair model: for every pair of mentions, a model determines coreferentiality; then (to enforce transitivity) cluster mentions guided by these decisions (see the sketch below)
- entity-mention model: single pass through the document, building entities; the model chooses which entity to add each mention to (if any)
- entity-entity model: agglomerative clustering
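To make the mention-pair strategy concrete, here is a minimal Python sketch, with a toy string-match function standing in for the trained pairwise model: every pair of mentions gets a coreference decision, and union-find merges enforce transitivity over the positive decisions.

```python
from itertools import combinations

def coreferent(m1, m2):
    # Hypothetical stand-in for a trained mention-pair model;
    # here, a toy case-insensitive string match.
    return m1.lower() == m2.lower()

def cluster_mentions(mentions):
    parent = list(range(len(mentions)))      # union-find forest

    def find(i):
        while parent[i] != i:                # path compression
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if coreferent(mentions[i], mentions[j]):
            parent[find(i)] = find(j)        # merge the two clusters

    clusters = {}
    for i in range(len(mentions)):
        clusters.setdefault(find(i), []).append(mentions[i])
    return list(clusters.values())

print(cluster_mentions(["Obama", "he", "the president", "obama"]))
# [['Obama', 'obama'], ['he'], ['the president']]
```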
Ordering
Clustering can be done:
- in one pass
- in several passes (a sieve)
- in dynamically-determined order
Raghunathan et al. report gains of 1-4% in F1 score from multi-pass over single-pass, but multi-pass makes incremental processing more difficult.
Hand-coded rules
The bulk of the cases follow well-understood patterns: predicate complement, apposition, role modifier, relative pronouns, ... So many systems use hand-coded rules, or hybrid systems combining corpus-trained and hand-coded rules [Jet].
Hand-coded rules
Sieve (passes applied in order; a simplified sketch follows):
1. exact extent match
2. appositive | predicate nominative | role appositive | relative pronoun | acronym | demonym
3. cluster head match with word inclusion and compatible modifiers
4. cluster head match with word inclusion
5. cluster head match with compatible modifiers
6. relaxed cluster head match with word inclusion
7. pronoun match
[Raghunathan et al. 2010]
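A hedged sketch of the sieve architecture: each pass is a high-precision matching rule applied, in order, to the clusters built by earlier passes. The two rules below (exact extent match and a crude head match) are simplified stand-ins for illustration, not the actual rules of Raghunathan et al.

```python
def exact_match(c1, c2):
    # exact-extent string match between any mentions of the two clusters
    return any(m in c2 for m in c1)

def head_match(c1, c2):
    # crude head = last token; real systems use parsed heads
    return bool({m.split()[-1] for m in c1} & {m.split()[-1] for m in c2})

SIEVE = [exact_match, head_match]            # ordered, most precise first

def run_sieve(mentions):
    clusters = [[m] for m in mentions]       # start from singletons
    for match in SIEVE:                      # one pass per rule
        merged = []
        for c in clusters:
            for target in merged:
                if match(c, target):         # attach to an earlier cluster
                    target.extend(c)
                    break
            else:
                merged.append(c)
        clusters = merged
    return clusters

print(run_sieve(["Barack Obama", "Obama", "the president", "Barack Obama"]))
# [['Barack Obama', 'Barack Obama', 'Obama'], ['the president']]
```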
Hand-coded rules
Cumulative precision / recall / F1 after each sieve pass [Raghunathan et al. 2010]:

  Sieve pass                                          P    R    F
  exact extent match                                  96   32   48
  appositive | predicate nominative |
    role appositive | relative pronoun |
    acronym | demonym                                 95   44   60
  cluster head match with word inclusion
    and compatible modifiers                          92   51   66
  cluster head match with word inclusion              92   52   66
  cluster head match with compatible modifiers        91   53   67
  relaxed cluster head match with word inclusion      90   54   67
  pronoun match                                       84   74   79

Nominal coref helps little.
Anaphoricity
Should we have a separate model for anaphoricity, i.e., for deciding whether a mention refers back to a previously mentioned entity at all?
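One way to answer yes, as a sketch: run a separate binary anaphoricity classifier first, and search for an antecedent only for mentions it judges anaphoric. The rule-based classifier below is a hypothetical stand-in for a trained model.

```python
def is_anaphoric(mention):
    # Hypothetical stand-in for a trained anaphoricity classifier:
    # treat pronouns and definite NPs as anaphoric.
    m = mention.lower()
    return m in {"he", "she", "it", "they"} or m.startswith("the ")

def resolve(mentions, find_antecedent):
    links = {}
    for i, m in enumerate(mentions):
        if is_anaphoric(m):                  # non-anaphoric mentions skipped
            links[i] = find_antecedent(mentions[:i], m)
    return links
```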
Role of deep learning
Systems do worst at resolving nominal anaphors.
Systems typically extract features of the anaphor and candidate antecedent and then use a log-linear model to capture compatibility, for example using WordNet lexical relations.
Deep learning systems try to do this more directly: they build a large distributed representation of the mentions and the entities (based on the word embeddings of the words in, and in the immediate context of, the anaphor and the candidate antecedent) and then learn a ranking among entity pairs [Clark and Manning 2016].
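A toy numpy sketch of that idea, not Clark and Manning's actual architecture: represent each mention by averaging word embeddings over its words (a real system would also embed the surrounding context), score each (anaphor, candidate) pair with a small feedforward net, and take the top-ranked candidate. The embedding table and weights here are random placeholders for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = {}                                     # placeholder embedding table

def embed(words):
    # mention representation = mean of its word embeddings
    return np.mean([EMB.setdefault(w, rng.normal(size=50)) for w in words],
                   axis=0)

W1 = rng.normal(size=(100, 64)) * 0.1        # untrained demo weights
w2 = rng.normal(size=64) * 0.1

def score(anaphor, candidate):
    pair = np.concatenate([embed(anaphor), embed(candidate)])
    hidden = np.maximum(0.0, pair @ W1)      # one ReLU layer
    return hidden @ w2                       # scalar compatibility score

def best_antecedent(anaphor, candidates):
    # rank candidate antecedents, return the highest-scoring one
    return max(candidates, key=lambda c: score(anaphor, c))

print(best_antecedent(["he"], [["Barack", "Obama"], ["the", "ball"]]))
```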
Benchmarks
- Most common evaluation is the CoNLL-2011 shared task
- Based on the OntoNotes corpus
- Did not mark singletons
- Included event references
Evaluation Metric
There is no consensus on an evaluation metric for coreference.
The CoNLL shared task used an average of 3 scores:
- MUC score
- B-cubed
- CEAF
(not to mention the official ACE scorer)
MUC Scoring
The first coreference scorer was developed for MUC-6; it is link-based.
The key S and the response R each define a set of equivalence classes Si and Ri.
To assess the recall of the response with respect to class Si, we ask how many links would have to be added to R to link all the mentions in Si. If p(Si) is the partition of Si induced by the response, that number is |p(Si)| - 1, out of the |Si| - 1 links needed, so
  Recall_i = (|Si| - |p(Si)|) / (|Si| - 1)
Overall recall pools the link counts over all classes:
  Recall = Σi (|Si| - |p(Si)|) / Σi (|Si| - 1)
Precision is computed by swapping S and R.
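The link-based formula translates directly into code. A sketch, assuming key and response are given as lists of sets of mention ids (and at least one key class is not a singleton); the test data mirrors the example on the next slide, writing A..C as 10..12.

```python
def muc_recall(key, response):
    # key, response: lists of sets of mention ids
    num = den = 0
    for S in key:
        # partition of S induced by the response;
        # mentions of S missing from the response count as singletons
        parts = {frozenset(S & R) for R in response if S & R}
        covered = set().union(*parts) if parts else set()
        p = len(parts) + len(S - covered)    # |p(Si)|
        num += len(S) - p                    # links recovered in Si
        den += len(S) - 1                    # links needed for Si
    return num / den

def muc_precision(key, response):
    return muc_recall(response, key)         # swap key and response

key  = [{1, 2, 3, 4, 5}, {6, 7}, {8, 9, 10, 11, 12}]
resp = [{1, 2, 3, 4, 5}]
print(muc_recall(key, resp))                 # (4+0+0)/(4+1+4) = 4/9
print(muc_precision(key, resp))              # 4/4 = 1.0
```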
MUC Scoring
Example:
  Truth:    1---2---3---4---5    6---7    8---9---A---B---C
  Response: 1---2---3---4---5
MUC Scoring
One shortcoming of MUC scoring is that you don’t get credit for correct singletons.
Also, the metric rates as equal some responses that are clearly worse than others.
MUC Scoring
Example:
  Truth:    1---2---3---4---5    6---7    8---9---A---B---C
  Response: 1---2---3---4---5
With the MUC scorer, this gets the same score.
B-CUBED Scoring
B-CUBED is a mention-based scorer designed to avoid the problems of the MUC scorer [Bagga and Baldwin 1998].
  Precision_i = (# of correct mentions in the response equivalence class containing mention_i) / (# of mentions in the response equivalence class containing mention_i)
  Recall_i = (# of correct mentions in the response equivalence class containing mention_i) / (# of mentions in the key equivalence class containing mention_i)
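A short sketch of the per-mention computation, assuming key and response are lists of sets covering the same mentions; note that a correct singleton now earns credit.

```python
def b_cubed(key, response):
    def class_of(classes, m):
        return next(c for c in classes if m in c)

    mentions = set().union(*key)
    p = r = 0.0
    for m in mentions:
        K, R = class_of(key, m), class_of(response, m)
        overlap = len(K & R)                 # correct mentions in R's class
        p += overlap / len(R)                # precision for this mention
        r += overlap / len(K)                # recall for this mention
    n = len(mentions)
    return p / n, r / n

key  = [{1, 2, 3, 4, 5}, {6, 7}, {8, 9}]
resp = [{1, 2, 3, 4, 5}, {6}, {7}, {8, 9}]
print(b_cubed(key, resp))                    # (1.0, 8/9)
```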
CEAF
Constrained Entity Alignment F-measure:
- Based on a similarity metric between clusters (entities)
- Computes an optimal alignment between key and response using this metric (sketched below)
- Leftover clusters are not scored
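A sketch of the entity-based variant (CEAF-e), assuming Luo's phi-4 similarity phi(K, R) = 2|K ∩ R| / (|K| + |R|) between clusters; scipy's Hungarian solver finds the optimal one-to-one alignment, and leftover (unaligned) clusters simply add nothing to the total.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ceaf_e(key, response):
    # similarity matrix between every key / response cluster pair
    sim = np.zeros((len(key), len(response)))
    for i, K in enumerate(key):
        for j, R in enumerate(response):
            sim[i, j] = 2 * len(K & R) / (len(K) + len(R))

    rows, cols = linear_sum_assignment(-sim) # maximize total similarity
    total = sim[rows, cols].sum()

    recall = total / len(key)                # phi(K, K) = 1 for each cluster
    precision = total / len(response)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

key  = [{1, 2, 3, 4, 5}, {6, 7}]
resp = [{1, 2, 3, 4, 5}, {6}, {7}]
print(ceaf_e(key, resp))                     # (0.556, 0.833, 0.667) approx.
```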