Evaluation issues in anaphora resolution and beyond
Ruslan Mitkov, University of Wolverhampton
Faro, 27 June 2002



Evaluation
• Evaluation is a driving force for every NLP task, approach and application.
• It indicates the performance of a specific approach or application and, no less importantly, shows where it stands relative to other approaches and applications.
• Research in evaluation is growing, inspired by the availability of annotated corpora.

Major impediments to fulfilling evaluation’s mission
• Different approaches are evaluated on different data.
• Different approaches are evaluated in different modes.
• Results are not independently confirmed.
As a result, no comparison or objective evaluation is possible.

Anaphora resolution vs. coreference resolution
• Anaphora resolution has to do with tracking down an antecedent of an anaphor.
• Coreference resolution seeks to identify all coreference classes (chains).

Anaphora resolution
For nominal anaphora which involves coreference, it is logical to regard each of the preceding noun phrases that are coreferential with the anaphor(s) as a legitimate antecedent.
Example: Computational Linguists from many different countries attended PorTAL. The participants enjoyed the presentations; they also took an active part in the discussions.
Here either "The participants" or "Computational Linguists from many different countries" counts as a correct antecedent of "they".

Evaluation in anaphora resolution
Two perspectives:
• Evaluation of anaphora resolution algorithms
• Evaluation of anaphora resolution systems

Recall and Precision
MUC introduced the measures recall and precision for coreference resolution. These measures, as defined, are not satisfactory in terms of clarity and coverage (Mitkov 2001).
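The slide does not spell out the MUC definitions. As a rough orientation only, the link-based scoring usually associated with the MUC coreference task (Vilain et al. 1995) can be sketched as follows. The function names and the representation of chains as Python sets of mention identifiers are assumptions made for this illustration, not part of the talk.

```python
def link_recall(gold_chains, system_chains):
    """Fraction of gold coreference links recovered by the system chains."""
    found = 0
    total = 0
    for chain in gold_chains:
        # Count the pieces the system output splits this gold chain into.
        touched = set()   # indices of system chains that overlap the gold chain
        unplaced = 0      # mentions the system put in no chain (singleton pieces)
        for mention in chain:
            index = next(
                (i for i, s in enumerate(system_chains) if mention in s), None
            )
            if index is None:
                unplaced += 1
            else:
                touched.add(index)
        pieces = len(touched) + unplaced
        found += len(chain) - pieces
        total += len(chain) - 1
    return found / total if total else 0.0


def muc_score(key_chains, response_chains):
    """Return (recall, precision); precision is recall with the roles swapped."""
    recall = link_recall(key_chains, response_chains)
    precision = link_recall(response_chains, key_chains)
    return recall, precision


key = [{"A", "B", "C", "D"}]
response = [{"A", "B"}, {"C", "D"}]
recall, precision = muc_score(key, response)
print(round(recall, 3), round(precision, 3))
```

In this toy case the response misses one of the three links needed to hold the key chain together, so recall is 2/3, while both links it proposes are correct, so precision is 1. A score of this kind is computed over whole chains, not over individual anaphor-antecedent decisions.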

Evaluation package for anaphora resolution algorithms (Mitkov 1998; 2000)
The package consists of (i) performance measures, (ii) comparative evaluation tasks and (iii) component measures.

Performance measures
• Success rate
• Critical success rate
Critical success rate applies only to those ‘tough’ anaphors which still have more than one candidate for antecedent after the gender and number filter.

Example
• Evaluation data: 100 anaphors
• Number of anaphors correctly resolved: 80
• Number of anaphors resolved by gender and number constraints alone: 30
• Success rate: 80/100 = 80%
• Critical success rate: (80 - 30)/(100 - 30) = 50/70 = 71.4%
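The arithmetic of this example can be reproduced with a short sketch. It reads the figure of 30 as the anaphors already settled by the gender and number filter alone, which is the interpretation under which the slide's 50/70 comes out. The function names are illustrative assumptions.

```python
def success_rate(correct, total):
    """Correctly resolved anaphors divided by all anaphors."""
    return correct / total


def critical_success_rate(correct, total, resolved_by_filter):
    """Success rate restricted to the 'tough' anaphors.

    Assumption: anaphors left with a single candidate after the gender and
    number filter count as correctly resolved, so they are removed from both
    the numerator and the denominator.
    """
    return (correct - resolved_by_filter) / (total - resolved_by_filter)


print(f"Success rate: {success_rate(80, 100):.1%}")
print(f"Critical success rate: {critical_success_rate(80, 100, 30):.1%}")
```

With the slide's numbers this prints 80.0% and 71.4%. The gap between the two figures shows how much of the headline success rate comes from anaphors that the filter resolves on its own.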

Comparative evaluation tasks
• Evaluation against baseline models
• Comparison to similar approaches
• Comparison with well-established approaches
Approaches frequently used for comparison: Hobbs (1978), Brennan et al. (1987), Lappin and Leass (1994), Kennedy and Boguraev (1996), Baldwin (1997), Mitkov (1996; 1998)

Component measures (Mitkov 2001)
• Relative importance
• Decision power

Evaluation measures for anaphora resolution systems
• Success rate
• Critical success rate
• Resolution etiquette (Mitkov et al. 2002)

Reliability of evaluation results
Evaluation results can be regarded as reliable if the evaluation covers or employs
(i) all naturally occurring texts
(ii) sampling procedures

Relative vs. absolute results
• Results may be relative with regard to a specific evaluation set or another approach.
• More “absolute” figures could be obtained if there existed a measure which quantified the complexity of the anaphors to be resolved.

Measures quantifying complexity in anaphora resolution
Measures for complexity (Mitkov 2001):
• Knowledge required for resolution
• Distance between anaphor and antecedent (in NPs, clauses, sentences)
• Number of competing candidates
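Two of these measures, distance and number of competing candidates, lend themselves to simple counting over an annotated evaluation set. The sketch below assumes a hypothetical record per anaphor holding sentence indices and a candidate count. The class, field and function names are invented for illustration, distance is measured in sentences only, and the knowledge required for resolution is left out because the slide gives no way of quantifying it.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AnaphorCase:
    anaphor_sentence: int      # index of the sentence containing the anaphor
    antecedent_sentence: int   # index of the sentence containing the antecedent
    candidates: int            # NPs competing for antecedent, correct one included


def complexity_profile(cases):
    """Average distance (in sentences) and average number of candidates."""
    distances = [c.anaphor_sentence - c.antecedent_sentence for c in cases]
    return {
        "mean_sentence_distance": mean(distances),
        "mean_competing_candidates": mean(c.candidates for c in cases),
    }


sample = [
    AnaphorCase(anaphor_sentence=3, antecedent_sentence=3, candidates=2),
    AnaphorCase(anaphor_sentence=5, antecedent_sentence=4, candidates=4),
    AnaphorCase(anaphor_sentence=9, antecedent_sentence=7, candidates=6),
]
print(complexity_profile(sample))
```

Reporting such a profile next to a success rate would make it easier to tell whether two evaluation sets are comparably hard.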

Fair evaluation
Algorithms should be evaluated on the basis of the same
• evaluation data
• pre-processing tools

Evaluation workbench
Evaluation workbench for anaphora resolution (Mitkov 2000; Barbu and Mitkov 2001)
• Allows the comparison of approaches sharing common principles or similar pre-processing
• Enables the ‘plugging in’ and testing of different anaphora resolution algorithms
• All algorithms implemented operate in a fully automatic mode

The need for annotated corpora
• Annotated corpora are vital for training and evaluation.
• Annotation should cover anaphoric or coreferential chains, not only anaphor-antecedent pairs.

Scarce commodity
• Lancaster Anaphoric Treebank ( words)
• MUC coreference task annotated data (65 000)
• Part of the Penn Treebank (90 000)

Additional issues
• Annotation scheme
• Annotating tools
• Annotation strategy
Inter-annotator (dis)agreement is a major issue!
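The slide does not say how agreement was measured. One widely used option for two annotators is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The sketch below assumes each annotator assigns one label per markable. The label names and the function name are illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["coref", "coref", "none", "none"]
annotator_2 = ["coref", "none", "none", "none"]
print(cohen_kappa(annotator_1, annotator_2))
```

Here the annotators agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore 0.5. Raw percentage agreement alone would overstate how consistent the annotation is.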

The Wolverhampton coreference annotation project
• A word corpus annotated for anaphoric and coreferential links (identity-of-sense direct nominal anaphora)
• Less ambitious in terms of coverage, but much more consistent

Watch out for the traps!
• Are all annotated data reliable?
• Are all original documents reliable?
• Are all results reported “honest”?

Morale and motivation important!
If I may offer you my advice...
• Do not despair if your first evaluation results are not as high as you wanted them to be
• Be prepared to provide considerable input in exchange for minor performance improvement
• Work hard
• Be transparent
... and you'll get there!

Anaphora resolution projects Ruslan Mitkov’s home page Research Group in Computational Linguistics