Clustering Algorithms for Noun Phrase Coreference Resolution


Clustering Algorithms for Noun Phrase Coreference Resolution Roxana Angheluta, Patrick Jeuniaux, Rudradeb Mitra, Marie-Francine Moens JADT 2004: 7es Journées internationales d'Analyse statistique des Données Textuelles 2018/11/20

OUTLINE Introduction Methods: Feature Selection, Distance Metric, The Clustering Methods Corpora and Evaluation

Introduction

Introduction Most natural language processing applications that deal with the meaning of discourse imply the completion of some reference resolution activity. Noun phrase coreference resolution: relating each noun phrase in a text to its referent in the real world.

Introduction Our coreference resolution focuses on detecting "identity" relationships (i.e., not on is-a or whole/part links, for example). It is natural to view coreferencing as a partitioning or clustering of the set of entities. The clustering is accomplished in two steps: (1) detection of the entities and extraction of a specific set of their features; (2) clustering of the entities.

Introduction We implemented four novel algorithms: a hard clustering algorithm, a fuzzy clustering algorithm, a progressive fuzzy clustering algorithm, and its hard variant. Our goal is to test the quality of the coreference resolution achieved by these four algorithms.

Methods

Methods -- Feature Selection

Methods -- Distance Metric We use the following metric for computing the distance between two entities NP_i and NP_j: dist(NP_i, NP_j) = Σ_{f ∈ F} w_f × incompatibility_f(NP_i, NP_j), where F is the entity feature set and w_f is the weight of a feature. A weight of ∞ has priority over −∞: if two entities mismatch on a feature which has a weight of ∞, then their distance equals ∞.
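A minimal sketch of such a weighted-incompatibility distance. The feature set, weights, and the 0/1 incompatibility functions below are illustrative assumptions, not the paper's actual configuration; only the priority rule for infinite weights follows the slide.

```python
import math

# Illustrative feature weights (assumed, not the paper's actual values):
# +inf marks a hard incompatibility, -inf marks evidence forcing coreference.
WEIGHTS = {"gender": math.inf, "number": math.inf,
           "semantic_class": 1.0, "appositive": -math.inf}

def incompatibility(f, np_i, np_j):
    """Degree of incompatibility of NP_i and NP_j on feature f
    (0 or 1 here; real systems may use graded values)."""
    if f == "appositive":
        # 1.0 when NP_j is an apposition of NP_i; with weight -inf this
        # drives the distance to -inf, i.e. a forced merge.
        return 1.0 if np_j.get("appositive_of") == np_i.get("id") else 0.0
    return 0.0 if np_i.get(f) == np_j.get(f) else 1.0

def distance(np_i, np_j, weights=WEIGHTS):
    """dist(NP_i, NP_j) = sum_f w_f * incompatibility_f(NP_i, NP_j).
    A mismatch on a +inf-weight feature dominates everything else
    (+inf has priority over -inf, as the slide notes)."""
    total, forced_merge = 0.0, False
    for f, w in weights.items():
        inc = incompatibility(f, np_i, np_j)
        if inc == 0.0:
            continue                     # compatible on f: no contribution
        if w == math.inf:
            return math.inf              # incompatible, can never corefer
        if w == -math.inf:
            forced_merge = True
        else:
            total += w * inc
    return -math.inf if forced_merge else total
```

For example, two noun phrases that mismatch on gender get distance ∞ regardless of any appositive evidence, while an appositive pair with compatible agreement features gets distance −∞.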

The Clustering Methods

The Clustering Methods Hard Clustering Cardie et al. (HC-C) Fuzzy Clustering Bergler et al. (FC-B) Progressive Fuzzy Clustering (FC-P) The Hard Variant (HC-V)

Clustering Method (1): Hard Clustering (Cardie et al.)

Clustering Method (1): Hard Clustering (Cardie et al.) The algorithm is very simple and fast; however, it also has some weak points. The highly greedy character of this algorithm (it takes the first match, not the best match) introduces errors that are propagated further as the algorithm advances. Example: "Robert Smith lives with his wife ... Smith loves her."

Clustering Method (1): Hard Clustering (Cardie et al.) It is dependent on the threshold distance value. The single-pass algorithm is very dependent on the order in which clusters are compared: often there are different possibilities for merging clusters. For a high threshold, the algorithm tends to group all entities with semantic class 0 or semantic class 2 into one cluster.
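The single-pass greedy procedure criticized above can be sketched as follows. This is a simplified reconstruction, not the paper's exact algorithm; `distance` and `threshold` are assumed inputs.

```python
def hard_cluster(entities, distance, threshold):
    """Single-pass greedy hard clustering: each entity, in document order,
    joins the FIRST existing cluster all of whose members lie within
    `threshold` -- the first match, not the best match, which is the
    source of the propagated errors discussed on the slides."""
    clusters = []
    for e in entities:
        placed = False
        for cluster in clusters:
            if all(distance(e, m) <= threshold for m in cluster):
                cluster.append(e)            # greedy: stop at first match
                placed = True
                break
        if not placed:
            clusters.append([e])             # start a new singleton cluster
    return clusters
```

Because membership depends on the processing order and on one global threshold, permuting the input or raising the threshold can change the partition, which is exactly the order- and threshold-sensitivity noted above.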

Clustering Method (2): Fuzzy Clustering (Bergler et al.) Another promising approach considers noun phrase coreference resolution as a fuzzy clustering task, because of the ambiguity typically found in natural language and the difficulty of resolving coreferents with absolute certainty.

Clustering Method (2): Fuzzy Clustering (Bergler et al.) In hard clustering, data is divided into distinct clusters, where each data element belongs to exactly one cluster. In fuzzy clustering, data elements can belong to more than one cluster; associated with each element is a set of membership levels.

Clustering Method (2): Fuzzy Clustering (Bergler et al.) Initially, each entity forms its own cluster (whose medoid it is). Each other entity is assigned to all of the initial clusters by computing the distance between it and the medoid of each cluster. As a result, each entity has a fuzzy membership in each cluster, forming a fuzzy coreference chain (a fuzzy set).

Clustering Method (2): Fuzzy Clustering (Bergler et al.) The medoid entity that originally formed the singleton cluster has complete membership in it, i.e., a distance of zero to itself. The chains are then iteratively merged when the distance between their fuzzy sets is no larger than an a priori defined threshold.

Clustering Method (2): Fuzzy Clustering (Bergler et al.) Besides the fuzzy representation, there are two main differences from hard clustering: the chaining effect is larger, because two clusters can be merged without checking any pairwise incompatibilities of cluster objects; and the algorithm is independent of the order in which the clusters are merged.
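The membership and merging steps can be sketched as follows. The 1/(1+d) mapping from distance to membership and the fuzzy-intersection height as the merge criterion are assumptions for illustration; the slides do not give the exact formulas.

```python
def memberships(entities, distance):
    """Every entity starts as the medoid of its own singleton cluster; each
    entity then gets a fuzzy membership in every cluster, derived here from
    its distance to the cluster's medoid (the 1/(1+d) mapping is assumed).
    A medoid has membership 1.0 in its own cluster (distance zero)."""
    mu = {}
    for medoid in entities:
        for e in entities:
            mu[(e, medoid)] = 1.0 / (1.0 + distance(e, medoid))
    return mu

def chain_similarity(mu, entities, m1, m2):
    """Height of the fuzzy intersection of the chains with medoids m1, m2:
    max over all entities of min(membership in chain 1, membership in
    chain 2). Chains would be merged when this is close enough, without
    any pairwise compatibility check of the cluster members."""
    return max(min(mu[(e, m1)], mu[(e, m2)]) for e in entities)
```

Because `chain_similarity` is symmetric in its two chains and is recomputed from the full membership table, the outcome does not depend on the order in which merges are considered, unlike the single-pass hard algorithm.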

Clustering Method (3): Progressive Fuzzy Clustering

Clustering Method (3): Progressive Fuzzy Clustering To improve performance, two special cases are included in the algorithm. Appositive merging: appositives have a much higher preference than the other features. Restriction on pronoun coreferencing: this restriction prohibits cataphoric references, but these appear quite rarely in texts.
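The pronoun restriction can be expressed as a simple filter; the field names and the exact scope of the restriction are assumptions for illustration.

```python
def allowed_link(antecedent, anaphor):
    """A pronoun may only corefer with a mention that PRECEDES it in the
    text (positions are linear order). This blocks cataphora, e.g.
    "Before she spoke, Mary ...", which the slides note are rare anyway.
    Non-pronoun anaphors are left unrestricted in this sketch."""
    if anaphor["is_pronoun"]:
        return antecedent["position"] < anaphor["position"]
    return True
```

A clustering step would consult such a filter before merging a pronoun's chain into a candidate antecedent's chain.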

Clustering Method (3): Progressive Fuzzy Clustering The main resemblances to and differences from the foregoing algorithms are: Progressive nature: the algorithm progressively updates the fuzzy memberships after each merging of clusters. However, it updates them differently, i.e., not by taking the minimum fuzziness of an entity over the merged clusters, but by recomputing the fuzzy membership of the entity in the new cluster. Merging of clusters: merging is restricted to chains that have a non-pronoun phrase as medoid, and it considers the similarity of the current fuzzy sets of the clusters. Search for the best match. Corpus independence. It does not merge clusters when members of the new cluster would be incompatible.

Clustering Method (4): The Hard Variant

Corpora and Evaluation

Corpora and Evaluation Corpora: Document Understanding Conference (DUC) 2002 and Message Understanding Conference 6 (MUC-6). From DUC we selected the category "biographies": we randomly chose ten documents from this set, parsed them in order to extract the entities, and annotated them manually for coreference.

Corpora and Evaluation MUC-6: the documents are all annotated with coreference information; there is a training set of 30 documents and a test set of 30 documents. The features are extracted slightly differently for the two corpora, because of the different nature of the entities. The MUC-6 corpus contains few pronouns, so the DUC subcorpus is useful for the evaluation, especially for pronoun resolution.

Corpora and Evaluation We automatically computed precision and recall and combined them into the F-measure. Two algorithms were initially implemented to perform the evaluation: that of Vilain et al. and the B-CUBED algorithm. In Vilain's algorithm, the recall is computed as R = Σ_i (|S_i| − |p(S_i)|) / Σ_i (|S_i| − 1), where S_i is a true coreference chain and p(S_i) is its partition induced by the system's response.
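A sketch of the Vilain et al. (MUC) recall, with chains represented as sets of entity ids; the handling of entities missing from every response chain is a simplifying assumption.

```python
def muc_recall(key_chains, response_chains):
    """Vilain et al. recall: R = sum(|S| - |p(S)|) / sum(|S| - 1) over the
    true chains S, where p(S) is the partition of S induced by the response
    chains. |S| - |p(S)| counts the correctly recovered links in S."""
    num = den = 0
    for s in key_chains:
        parts = [s & r for r in response_chains if s & r]
        covered = set().union(*parts) if parts else set()
        n_parts = len(parts) + len(s - covered)   # uncovered ids: singletons
        num += len(s) - n_parts
        den += len(s) - 1
    return num / den if den else 0.0
```

For example, splitting a true three-entity chain into two response chains loses one of its two links, giving recall 1/2.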

Corpora and Evaluation In the B-CUBED algorithm, the recall is computed per entity: the recall for entity i is the number of correctly identified entities in the response chain containing i, divided by the number of entities in its true chain; the overall recall averages these values over all entities. The F-measure combines precision and recall equally: F = 2PR / (P + R).
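These definitions can be sketched as follows, with `key` and `response` mapping each entity to the chain (a set of entities, including itself) that contains it; equal weighting of entities is assumed.

```python
def b_cubed(key, response):
    """B-CUBED: for each entity e, compare the response chain containing e
    with its true (key) chain. Per-entity precision divides the overlap by
    the response chain size, per-entity recall by the key chain size; both
    are averaged with equal weight over all entities."""
    n = len(key)
    overlap = {e: len(key[e] & response[e]) for e in key}
    precision = sum(overlap[e] / len(response[e]) for e in key) / n
    recall = sum(overlap[e] / len(key[e]) for e in key) / n
    return precision, recall

def f_measure(p, r):
    """F-measure combining precision and recall equally: F = 2PR / (P + R)."""
    return 2 * p * r / (p + r) if p + r else 0.0
```

Unlike the MUC score, B-CUBED rewards correctly resolved singleton chains, which matters on corpora with many isolated entities.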

Corpora and Evaluation We evaluated pronoun coreference separately, by selecting as entities only the pronouns and their immediate antecedents in the manually annotated files.

Results and Discussion

Results and Discussion Two baselines: every entity in a singleton cluster (BL1), and all entities in one cluster (BL2). For hard clustering we used four different threshold values, determined experimentally: 8, 11.5, 16 and 20. For fuzzy clustering, we used threshold values of 0.2 and 0.5.

Results and Discussion -- All entities

Results and Discussion -- Experiment

Results and Discussion -- Pronouns

Conclusion and Future Work

Conclusion and Future Work In this paper we compared four clustering methods for coreference resolution. We evaluated them on two kinds of corpora: a standard one used in the coreference resolution task, and another containing more pronominal entities. The algorithms do not rely on a threshold distance value for cluster membership. In the future we plan to perform more experiments with different types of texts, to enlarge the feature set based on current linguistic theories, and to integrate the noun phrase coreference tool into our text summarization system.

COMMENT Feature selection variants. The difference between pronouns and noun phrases.