ENHANCING CLUSTER LABELING USING WIKIPEDIA David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab SIGIR’09.

Document Clustering
- A method of aggregating a set of documents such that:
  - Documents within a cluster are as similar as possible.
  - Documents from different clusters are as dissimilar as possible.
[Figure: documents grouped into Cluster 1, Cluster 2, and Cluster 3]

Cluster Labeling
- Assign each cluster a human-readable label that best represents it.
- The traditional method picks the label from the important terms within the cluster.
  - Statistically significant terms may not make a good label.
  - A good label may not occur directly in the text.
[Figure: Cluster 1, Cluster 2, and Cluster 3 with the labels Bowling, Ice Hockey, and Electronics]

Approach
- Use an external resource to help with cluster labeling.
- Besides the important terms extracted from the cluster, Wikipedia metadata such as titles and categories serve as candidate labels.

A General Framework
[Figure: framework diagram]

Step 1: Indexing
- Documents are parsed and tokenized.
- Term weights are determined by tf-idf.
- Lucene is used to generate a search index so that the tf and idf values of a term t can be accessed quickly.
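The tf-idf weighting named in the indexing step can be illustrated with a minimal sketch. It assumes whitespace tokenization and the textbook weight tf * log(N / df); Lucene's own scoring formula differs in its details, and the function names and sample documents here are illustrative only.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Made-up sample documents.
docs = ["ice hockey puck", "ice hockey stick", "bowling ball pins"]
vectors = tf_idf_vectors(docs)
```

Terms that occur in fewer documents (such as "puck") receive a higher weight than terms shared across documents (such as "ice").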

Step 2: Clustering
- Given the document collection D, return a set of document clusters C = {C_1, C_2, …, C_n}.
- A cluster is represented by the centroid of its documents.
- The term weights of the cluster's centroid are slightly modified: [Formula: centroid term weight]
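As a rough illustration of representing a cluster by its centroid, the sketch below averages sparse term-weight vectors. It uses a plain arithmetic mean, which is an assumption: the modified centroid weighting mentioned on the slide is not reproduced here, and the vectors are made up.

```python
from collections import defaultdict

def centroid(vectors):
    """Average a list of sparse {term: weight} vectors into one centroid."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    # Terms missing from a document contribute 0 to the average.
    return {term: w / len(vectors) for term, w in total.items()}

# Made-up term-weight vectors for the documents of one cluster.
cluster = [
    {"ice": 1.0, "hockey": 2.0},
    {"ice": 3.0, "puck": 1.0},
]
c = centroid(cluster)
```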

Step 3: Important Terms Extraction
- Given a cluster, find a list of important terms ordered by their estimated importance.
- This can be achieved by either:
  - Selecting the top-weighted terms from the cluster centroid.
  - Using the Jensen-Shannon Divergence (JSD) to measure the distance between the cluster and the collection.
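One plausible reading of the JSD option is to rank each term by its contribution to the divergence between the cluster's term distribution and the collection's. The sketch below takes that reading as an assumption; the unsmoothed unigram estimates, the function names, and the toy documents are all illustrative.

```python
import math
from collections import Counter

def distribution(token_lists):
    """Maximum-likelihood unigram distribution over a list of token lists."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def jsd_contributions(p, q):
    """Per-term contribution to JSD(p || q), using log base 2."""
    contrib = {}
    for t in set(p) | set(q):
        pt, qt = p.get(t, 0.0), q.get(t, 0.0)
        m = 0.5 * (pt + qt)  # mixture distribution at term t
        score = 0.0
        if pt > 0:
            score += 0.5 * pt * math.log2(pt / m)
        if qt > 0:
            score += 0.5 * qt * math.log2(qt / m)
        contrib[t] = score
    return contrib

# Made-up data: one cluster and the rest of the collection.
cluster_docs = [["ice", "hockey", "puck"], ["ice", "hockey", "goal"]]
other_docs = [["bowling", "ball", "pins"], ["bowling", "lane", "goal"]]
p = distribution(cluster_docs)               # cluster distribution
q = distribution(cluster_docs + other_docs)  # collection distribution
scores = jsd_contributions(p, q)
# Rank the cluster's own terms, breaking ties alphabetically.
important = sorted(p, key=lambda t: (-scores[t], t))[:2]
```

The per-term contributions sum to the full JSD, so a term that is equally likely in the cluster and in the collection (here "goal") contributes nothing.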

Step 4: Label Extraction
- One way is to use the top k important terms directly as labels.
- The other way is to use the top k important terms to query Wikipedia. The titles and categories of the returned Wikipedia documents serve as candidate labels.
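The Wikipedia-based option can be sketched with a toy in-memory search. Everything here is an assumption for illustration: the PAGES entries are hand-written stand-ins rather than real indexed Wikipedia data, and the term-overlap ranking replaces whatever retrieval scoring the actual system used.

```python
# Hypothetical, hand-written stand-ins for indexed Wikipedia pages.
PAGES = [
    {"title": "Ice hockey", "categories": ["Team sports", "Winter sports"],
     "text": "ice hockey is a team sport played on ice with a puck"},
    {"title": "Bowling", "categories": ["Ball games"],
     "text": "bowling is a sport in which a ball is rolled at pins"},
    {"title": "Puck", "categories": ["Sports equipment"],
     "text": "a puck is a disk used in ice hockey"},
]

def search(terms, pages, n_results):
    """Rank pages by how many query terms occur in their text (toy scoring)."""
    scored = []
    for page in pages:
        words = set(page["text"].split())
        hits = sum(1 for t in terms if t in words)
        if hits:
            scored.append((hits, page))
    # Stable sort: pages with equal hits keep their original order.
    scored.sort(key=lambda pair: -pair[0])
    return [page for _, page in scored[:n_results]]

def candidate_labels(terms, pages, n_results=2):
    """Collect titles and categories of the top results in first-seen order."""
    labels = []
    for page in search(terms, pages, n_results):
        for label in [page["title"]] + page["categories"]:
            if label not in labels:
                labels.append(label)
    return labels

labels = candidate_labels(["ice", "hockey", "puck"], PAGES)
```

Note that a category such as "Team sports" can become a candidate even though it never appears among the cluster's important terms.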

Step 5: Output the Recommended Labels from the Candidate Labels
- MI (Mutual Information) Judge
  - Scores each candidate label by its pointwise mutual information with the cluster's important terms.
- SP (Score Propagation) Judge
  - Propagates document scores to the candidate labels. The document score can be the original score from the IR system or the reciprocal rank, rank(d)^-1.
- Score Aggregation
  - Uses a linear combination to combine the two judges above.
- The recommended labels are the top-ranked labels.
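A simplified sketch of score propagation and aggregation follows. It assumes each retrieved document passes its reciprocal-rank score directly to the labels extracted from it, which may be coarser than the original SP judge, and it takes the MI judge's scores as given, made-up inputs. The parameter weight_a and both function names are hypothetical.

```python
from collections import defaultdict

def sp_scores(ranked_results):
    """Propagate each result's reciprocal-rank score to the labels it yields.

    ranked_results: list of label lists, best-ranked result first.
    """
    scores = defaultdict(float)
    for rank, labels in enumerate(ranked_results, start=1):
        for label in labels:
            scores[label] += 1.0 / rank
    return dict(scores)

def aggregate(judge_a, judge_b, weight_a=0.5):
    """Linear combination of two judges' scores; missing labels count as 0."""
    labels = set(judge_a) | set(judge_b)
    return {
        l: weight_a * judge_a.get(l, 0.0) + (1 - weight_a) * judge_b.get(l, 0.0)
        for l in labels
    }

results = [
    ["Ice hockey", "Team sports"],  # labels from the rank-1 result
    ["Puck", "Team sports"],        # labels from the rank-2 result
]
sp = sp_scores(results)
# Made-up MI judge scores, used only to show the combination step.
mi = {"Ice hockey": 0.9, "Team sports": 0.2, "Puck": 0.4}
combined = aggregate(mi, sp)
recommended = sorted(combined, key=lambda l: (-combined[l], l))[:2]
```

A label supported by several retrieved documents ("Team sports") accumulates score from each of them, which is the point of propagation.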

Data Collection
- 20 Newsgroups
  - 20 clusters × 1000 documents per cluster
- Open Directory Project (ODP)
  - 100 clusters × 100 documents per cluster
- The Ground Truth (labels accepted as correct)
  - The correct label itself.
  - An inflection of the correct label.
  - A WordNet synonym of the correct label.

Evaluation Metrics
- Match@K
  - Ex: clusters c1 and c2, each with ranked labels label1 to label4 [Figure: correct labels marked]: 1/2 = 0.5
- Mean Reciprocal Rank (MRR)
  - Ex: clusters c1 and c2, each with ranked labels label1 to label4 [Figure: correct labels marked]: ((1/2) + (1/3)) / 2 = 0.416…
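The two metrics can be sketched as follows, assuming the first is the fraction of clusters with at least one correct label among the top K, and the second is the mean reciprocal rank of the first correct label. The positions of the correct labels below are an assumption, chosen so that the results agree with the numbers in the slide's examples.

```python
def match_at_k(ranked_labels, correct, k):
    """Fraction of clusters with at least one correct label in the top k."""
    hits = sum(
        1 for cid, labels in ranked_labels.items()
        if any(l in correct[cid] for l in labels[:k])
    )
    return hits / len(ranked_labels)

def mrr_at_k(ranked_labels, correct, k):
    """Mean over clusters of 1/rank of the first correct label in the top k."""
    total = 0.0
    for cid, labels in ranked_labels.items():
        for rank, label in enumerate(labels[:k], start=1):
            if label in correct[cid]:
                total += 1.0 / rank
                break  # only the first correct label counts
    return total / len(ranked_labels)

ranked = {
    "c1": ["label1", "label2", "label3", "label4"],
    "c2": ["label1", "label2", "label3", "label4"],
}
# Assumed ground truth: correct at rank 2 for c1 and at rank 3 for c2.
correct = {"c1": {"label2"}, "c2": {"label3"}}

match_2 = match_at_k(ranked, correct, k=2)  # only c1 has a hit in its top 2
mrr_4 = mrr_at_k(ranked, correct, k=4)      # ((1/2) + (1/3)) / 2
```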

Parameters
1. The important term selection method (JSD, ctf-cdf-idf, MI, chi-square).
2. The number of important terms used for querying Wikipedia.
3. The number of Wikipedia results used for label extraction.
4. The judges used for candidate evaluation.

Evaluation 1
- The effectiveness of using Wikipedia to enhance cluster labeling.

Evaluation 2
- Candidate label extraction

Evaluation 3
- Judge effectiveness

Evaluation 4.1
- The Effect of Clusters' Coherency on Label Quality
- Testing on "noisy clusters":
  - For a noise level p in [0, 1], each document in a cluster has probability p of being swapped with a document from another cluster.
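The noise procedure described above can be sketched as follows. How the swap partner is chosen is an assumption (uniformly at random from a different, randomly chosen cluster), and add_noise and its seed parameter are hypothetical names.

```python
import random

def add_noise(clusters, p, seed=0):
    """Return a copy of clusters where each document, with probability p,
    is swapped with a random document from a different cluster."""
    rng = random.Random(seed)
    noisy = [list(c) for c in clusters]
    for i, cluster in enumerate(noisy):
        # Swaps preserve cluster sizes, so this list stays valid throughout.
        others = [j for j in range(len(noisy)) if j != i and noisy[j]]
        if not others:
            continue
        for pos in range(len(cluster)):
            if rng.random() < p:
                j = rng.choice(others)
                k = rng.randrange(len(noisy[j]))
                cluster[pos], noisy[j][k] = noisy[j][k], cluster[pos]
    return noisy

# Made-up document ids in three clusters.
clusters = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]
noisy = add_noise(clusters, p=1.0)
```

Because documents are swapped rather than moved, cluster sizes stay fixed while coherency degrades as p grows.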

Evaluation 4.2
- The Effect of Clusters' Coherency on Label Quality

Conclusion
- Proposed a general framework for solving the cluster labeling problem.
- Wikipedia metadata can boost the performance of cluster labeling.
- The proposed method shows good resilience to noisy clusters.