Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin Wei

Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.

Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Psychological Advertising: Exploring User Psychology for Click Prediction in Sponsored Search Date: 2014/03/25 Author: Taifeng Wang, Jiang Bian, Shusen.
Term Level Search Result Diversification DATE : 2013/09/11 SOURCE : SIGIR’13 AUTHORS : VAN DANG, W. BRUCE CROFT ADVISOR : DR.JIA-LING, KOH SPEAKER : SHUN-CHEN,
Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
Diversity Maximization Under Matroid Constraints Date : 2013/11/06 Source : KDD’13 Authors : Zeinab Abbassi, Vahab S. Mirrokni, Mayur Thakur Advisor :
Date : 2013/09/17 Source : SIGIR’13 Authors : Zhu, Xingwei
Learning to Cluster Web Search Results SIGIR 04. ABSTRACT Organizing Web search results into clusters facilitates users quick browsing through search.
Sequence Clustering and Labeling for Unsupervised Query Intent Discovery Speaker: Po-Hsien Shih Advisor: Jia-Ling Koh Source: WSDM’12 Date: 1 November,
DOMAIN DEPENDENT QUERY REFORMULATION FOR WEB SEARCH Date : 2013/06/17 Author : Van Dang, Giridhar Kumaran, Adam Troy Source : CIKM’12 Advisor : Dr. Jia-Ling.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Searchable Web sites Recommendation Date : 2012/2/20 Source : WSDM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh Jia-ling 1.
1 Entity Ranking Using Wikipedia as a Pivot (CIKM 10’) Rianne Kaptein, Pavel Serdyukov, Arjen de Vries, Jaap Kamps 2010/12/14 Yu-wen,Hsu.
Explorations in Tag Suggestion and Query Expansion Jian Wang and Brian D. Davison Lehigh University, USA SSM 2008 (Workshop on Search in Social Media)
Mobile Web Search Personalization Kapil Goenka. Outline Introduction & Background Methodology Evaluation Future Work Conclusion.
1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.
Presented by Zeehasham Rasheed
Topic-Sensitive PageRank Taher H. Haveliwala. PageRank Importance is propagated A global ranking vector is pre-computed.
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
Multi-Style Language Model for Web Scale Information Retrieval Kuansan Wang, Xiaolong Li and Jianfeng Gao SIGIR 2010 Min-Hsuan Lai Department of Computer.
Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification on Reviews Peter D. Turney Institute for Information Technology National.
Tag Clouds Revisited Date : 2011/12/12 Source : CIKM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh. Jia-ling 1.
1 Context-Aware Search Personalization with Concept Preference CIKM’11 Advisor : Jia Ling, Koh Speaker : SHENG HONG, CHUNG.
Exploiting Wikipedia as External Knowledge for Document Clustering Sakyasingha Dasgupta, Pradeep Ghosh Data Mining and Exploration-Presentation School.
1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
1 A Unified Relevance Model for Opinion Retrieval (CIKM 09’) Xuanjing Huang, W. Bruce Croft Date: 2010/02/08 Speaker: Yu-Wen, Hsu.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.
Estimating Topical Context by Diverging from External Resources SIGIR’13, July 28–August 1, 2013, Dublin, Ireland. Presenter: SHIH, KAI WUN Romain Deveaud.
A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering Jeongwoo Ko, Luo Si, Eric Nyberg (SIGIR ’ 07) Speaker: Cho, Chin Wei Advisor:
Retrieval Models for Question and Answer Archives Xiaobing Xue, Jiwoon Jeon, W. Bruce Croft Computer Science Department University of Massachusetts, Google,
Giorgos Giannopoulos (IMIS/”Athena” R.C and NTU Athens, Greece) Theodore Dalamagas (IMIS/”Athena” R.C., Greece) Timos Sellis (IMIS/”Athena” R.C and NTU.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Extracting meaningful labels for WEBSOM text archives Advisor.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Web Image Retrieval Re-Ranking with Relevance Model Wei-Hao Lin, Rong Jin, Alexander Hauptmann Language Technologies Institute School of Computer Science.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
Date : 2012/10/25 Author : Yosi Mass, Yehoshua Sagiv Source : WSDM’12 Speaker : Er-Gang Liu Advisor : Dr. Jia-ling Koh 1.
Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval Ben Cartrette and Praveen Chandar Dept. of Computer and Information Science.
Improving Web Search Results Using Affinity Graph Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma Microsoft Research.
Query Suggestion Naama Kraus Slides are based on the papers: Baeza-Yates, Hurtado, Mendoza, Improving search engines by query clustering Boldi, Bonchi,
BioSnowball: Automated Population of Wikis (KDD ‘10) Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/11/30 1.
Wikipedia as Sense Inventory to Improve Diversity in Web Search Results Celina Santamaria, Julio Gonzalo, Javier Artiles nlp.uned.es UNED, c/Juan del Rosal,
LOGO Summarizing Conversations with Clue Words Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou (WWW ’07) Advisor : Dr. Koh Jia-Ling Speaker : Tu.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
21/11/20151Gianluca Demartini Ranking Clusters for Web Search Gianluca Demartini Paul–Alexandru Chirita Ingo Brunkhorst Wolfgang Nejdl L3S Info Lunch Hannover,
Chapter 23: Probabilistic Language Models April 13, 2004.
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.
A Word Clustering Approach for Language Model-based Sentence Retrieval in Question Answering Systems Saeedeh Momtazi, Dietrich Klakow University of Saarland,Germany.
A Knowledge-Based Search Engine Powered by Wikipedia David Milne, Ian H. Witten, David M. Nichols (CIKM 2007)
A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,
Improved Video Categorization from Text Metadata and User Comments ACM SIGIR 2011:Research and development in Information Retrieval - Katja Filippova -
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
1 13/05/07 1/20 LIST – DTSI – Interfaces, Cognitics and Virtual Reality Unit The INFILE project: a crosslingual filtering systems evaluation campaign Romaric.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
1 What Makes a Query Difficult? David Carmel, Elad YomTov, Adam Darlow, Dan Pelleg IBM Haifa Research Labs SIGIR 2006.
Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.
CONTEXTUAL SEARCH AND NAME DISAMBIGUATION IN EMAIL USING GRAPHS EINAT MINKOV, WILLIAM W. COHEN, ANDREW Y. NG SIGIR’06 Date: 2008/7/17 Advisor: Dr. Koh,
PERSONALIZED DIVERSIFICATION OF SEARCH RESULTS Date: 2013/04/15 Author: David Vallet, Pablo Castells Source: SIGIR’12 Advisor: Dr.Jia-ling, Koh Speaker:
ENHANCING CLUSTER LABELING USING WIKIPEDIA David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab SIGIR’09.
Finding similar items by leveraging social tag clouds Speaker: Po-Hsien Shih Advisor: Jia-Ling Koh Source: SAC 2012’ Date: October 4, 2012.
Short Text Similarity with Word Embedding Date: 2016/03/28 Author: Tom Kenter, Maarten de Rijke Source: CIKM’15 Advisor: Jia-Ling Koh Speaker: Chih-Hsuan.
LEARNING IN A PAIRWISE TERM-TERM PROXIMITY FRAMEWORK FOR INFORMATION RETRIEVAL Ronan Cummins, Colm O’Riordan (SIGIR’09) Speaker : Yi-Ling Tai Date : 2010/03/15.
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Personalized, Interactive Question Answering on the Web
Enriching Taxonomies With Functional Domain Knowledge
Relevance and Reinforcement in Interactive Browsing
Connecting the Dots Between News Article
Presentation transcript:

Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin Wei Advisor: Dr. Koh, Jia-Ling

Outline
Introduction
General Framework
▫Indexing
▫Clustering
▫Important Terms Extraction
▫Label Extraction
▫Candidate Label Evaluation
Experiment
Conclusion

Introduction Massive amounts of textual data have brought about the need for efficient techniques that can organize the data in manageable forms. One of the popular approaches for organizing textual data is the use of clustering algorithms. In many applications of clustering, particularly user-interface-based ones, human users interact directly with the created clusters, so the clusters must be labeled in a way that lets users understand what each cluster is about.

Introduction A popular approach for cluster labeling is to apply statistical techniques for feature selection. This is done by identifying “important” terms in the text that best represent the cluster topic.

Introduction However, a list of significant keywords, or even phrases, will often fail to provide a meaningful, readable label for a set of documents. The suggested terms, even when related to each other, tend to represent different aspects of the topic underlying the cluster, and a good label may not occur directly in the text at all.

Introduction The observation that a good label may not occur directly in the text implies that human labels are not necessarily significant from a statistical perspective. This further motivated us to seek a cluster labeling method that uses external resources.

General Framework

Indexing Documents are first parsed and tokenized, and then represented as term vectors in the vector space over the system’s vocabulary. We use the Lucene open source search system.
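As a rough illustration of this step, here is a minimal Python sketch that tokenizes documents and builds term vectors over the vocabulary; the paper uses Lucene, so the vectorizer below is only an illustrative stand-in.

```python
# Minimal indexing sketch: tokenize documents and represent them as term
# vectors in the vector space over the system's vocabulary. The paper uses
# Lucene; scikit-learn's TfidfVectorizer is a stand-in, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "wikipedia is a free online encyclopedia",
    "clustering groups similar documents together",
    "search engines index documents for retrieval",
]
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)  # sparse term-vector matrix
vocabulary = vectorizer.get_feature_names_out()    # the system's vocabulary
```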

Clustering The clustering algorithm's goal is to create coherent clusters in which the documents within a cluster share the same topics; the labels provided by the system are expected to express the mutual topic of the documents within each cluster.
Input
▫the collection of documents D
Return
▫a set of document clusters C = {C1, C2, ..., Cn} that covers the whole collection
▫each cluster Ci is a subset of the documents of D
▫a document may belong to more than one cluster
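The framework is agnostic to the clustering algorithm itself; as a sketch, k-means over the term vectors from the indexing example above (note that k-means yields hard clusters, while the framework also allows a document to belong to more than one cluster).

```python
# Continues the indexing sketch above (doc_vectors); k-means is illustrative.
from sklearn.cluster import KMeans

n_clusters = 2
assignments = KMeans(n_clusters=n_clusters, n_init=10,
                     random_state=0).fit_predict(doc_vectors)
# Cluster id -> list of document ids; with k-means each document falls in
# exactly one cluster, though the framework permits overlapping clusters.
clusters = {i: [d for d, c in enumerate(assignments) if c == i]
            for i in range(n_clusters)}
```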

Important Terms Extraction We look for the set of terms that maximizes the Jensen-Shannon divergence (JSD) between the cluster C and the entire collection. Each term is scored according to its contribution to the JSD between the cluster and the collection, and the top-scored terms are selected as the cluster's important terms.
Input
▫a cluster C
Return
▫a list of terms T(C) = (t1, t2, ..., tk), ordered by their estimated importance, that represents the content of the cluster's documents
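A minimal sketch of the per-term JSD contribution, assuming plain term-frequency vectors for the cluster and the collection (any smoothing the paper may apply is omitted here).

```python
import numpy as np

def jsd_term_scores(cluster_tf, collection_tf):
    """Score each vocabulary term by its contribution to the Jensen-Shannon
    divergence between the cluster's term distribution p and the collection's
    distribution q; JSD(p, q) is the sum of these per-term contributions."""
    p = cluster_tf / cluster_tf.sum()        # cluster language model
    q = collection_tf / collection_tf.sum()  # collection language model
    m = 0.5 * (p + q)
    with np.errstate(divide="ignore", invalid="ignore"):
        contrib = (0.5 * np.where(p > 0, p * np.log(p / m), 0.0)
                   + 0.5 * np.where(q > 0, q * np.log(q / m), 0.0))
    return contrib

# T(C) = the top-k terms ordered by this score, e.g.:
# scores = jsd_term_scores(cluster_tf, collection_tf)
# top_terms = vocabulary[np.argsort(-scores)[:20]]
```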

Label Extraction
Input
▫the list of important terms T(C)
Return
▫candidate labels for cluster C, denoted L(C), extracted from two types of sources:
  Document content: the important terms themselves are considered as potential labels for the cluster.
  Wikipedia: a query q built from the important terms is executed against Wikipedia, yielding a list of documents D(q) sorted by their similarity score to q. For each document d in D(q), both the document's title and the set of categories associated with the document are considered as candidate cluster labels.
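A sketch of this extraction step; `search_wikipedia(query, k)` is a hypothetical helper (e.g. backed by a local index of Wikipedia) returning result objects with `title` and `categories` fields.

```python
def extract_candidate_labels(important_terms, search_wikipedia,
                             n_query_terms=5, n_results=20):
    """Build the candidate label set L(C): the important terms themselves,
    plus the titles and categories of the top Wikipedia results for a
    query q built from the important terms."""
    candidates = set(important_terms)             # labels from document content
    q = " ".join(important_terms[:n_query_terms])
    for doc in search_wikipedia(q, k=n_results):  # D(q), sorted by similarity to q
        candidates.add(doc.title)
        candidates.update(doc.categories)
    return candidates
```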

Candidate Label Evaluation Candidate labels are evaluated by several judges.
Input
▫the set of candidate labels L(C)
▫the set of the cluster's important terms T(C)
Return
▫Each judge evaluates the candidates according to its evaluation policy. The scores of all judges are then aggregated, and the labels with the highest aggregated scores are returned.

MI Judge The Mutual Information (MI) judge scores each candidate by the average pointwise mutual information (PMI) of the label with the set of the cluster's important terms, measured with respect to a given external textual corpus. This score reflects the "semantic distance" of the label from the cluster content; labels that are "closer" to the cluster content are preferred.

MI Judge
Input
▫the set of candidate labels L(C)
▫the set of the cluster's important terms T(C)
▫a corpus on which the PMI is measured:
  the Wikipedia collection
  the Google n-grams collection
Given a candidate label l in L(C), the following score is assigned to l:

$score_{MI}(l \mid T(C)) = \sum_{t \in T(C)} w(t) \cdot PMI(l, t)$

where w(t) denotes the relative importance of important term t in T(C) (the weights are normalized to sum to one, so the score is a weighted average of the PMI values).

MI Judge
The PMI between two terms is measured by:

$PMI(t_1, t_2) = \log \frac{p(t_1, t_2)}{p(t_1)\, p(t_2)}$

The probability of a term is approximated by its maximum-likelihood estimate over the corpus, $p(t) = \frac{count(t)}{N}$, and the joint probability $p(t_1, t_2)$ is estimated analogously from co-occurrence counts.
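A sketch of the MI judge under these definitions; `prob(t)` and `joint_prob(a, b)` stand for hypothetical maximum-likelihood estimates computed over the external corpus (Wikipedia or the Google n-grams).

```python
import math

def pmi(p_joint, p1, p2):
    """PMI(t1, t2) = log(p(t1, t2) / (p(t1) * p(t2)))."""
    return math.log(p_joint / (p1 * p2)) if p_joint > 0 else float("-inf")

def mi_judge(label, terms, w, prob, joint_prob):
    """Weighted-average PMI of `label` with the cluster's important terms
    T(C); w maps each term to its relative importance (normalized here, so
    the result is a weighted average rather than a bare sum)."""
    total_w = sum(w[t] for t in terms)
    return sum(w[t] * pmi(joint_prob(label, t), prob(label), prob(t))
               for t in terms) / total_w
```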

SP Judge The Score Propagation (SP) judge scores each candidate label with respect to the scores of the documents in the result set associated with that label. This judge propagates documents' scores to candidates that are not directly associated with those documents but share common keywords with other related labels.

SP Judge
Given a candidate label l in L(C), we obtain an aggregated weight for l by summing the scores of all documents in D(q) associated with l; this weight represents the score propagation from D(q) to l:

$W(l) = \sum_{d \in D(q):\, l \in labels(d)} score(d)$

where score(d) is either the Lucene search-engine score of d, or the inverse of d's ranking position, score(d) = 1 / rank(d).

SP Judge
After scoring the labels, we score the label keywords by summing the weights of the labels that contain each keyword k:

$W(k) = \sum_{l' \in L(C):\, k \in l'} W(l')$

Finally, the score assigned to each candidate label l is set to the average score propagated back from its keywords:

$score_{SP}(l) = \frac{1}{n(l)} \sum_{k \in l} W(k)$

where n(l) denotes the number of l's unique keywords.
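A sketch of the full propagation chain under the formulas above; `labels_of_doc` and `keywords_of` are hypothetical hooks mapping a retrieved document to the labels extracted from it, and a label to its keywords.

```python
from collections import defaultdict

def sp_judge(candidates, doc_scores, labels_of_doc, keywords_of):
    """Score propagation: document scores -> label weights -> keyword
    weights -> final label scores. doc_scores maps each document in D(q)
    to score(d) (Lucene score or 1/rank); labels are assumed non-empty."""
    w_label = defaultdict(float)
    for d, s in doc_scores.items():   # W(l): propagate D(q) scores to labels
        for l in labels_of_doc(d):
            w_label[l] += s
    w_kw = defaultdict(float)         # W(k): propagate label weights to keywords
    for l, w in w_label.items():
        for k in set(keywords_of(l)):
            w_kw[k] += w
    return {l: sum(w_kw[k] for k in set(keywords_of(l))) / len(set(keywords_of(l)))
            for l in candidates}      # average propagated back from keywords
```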

Score Aggregation
The final stage in candidate evaluation is to aggregate the scores from the different judges for each label. Given a list of judges (J1, ..., Jm), each candidate label is scored using a linear combination of the judge scores:

$score(l) = \sum_{i=1}^{m} \alpha_i \cdot score_{J_i}(l)$

where $\alpha_i$ is the combination weight assigned to judge Ji.
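A short aggregation sketch: the judges are callables such as the MI and SP judges above, and the α weights are assumed given (the scheme for choosing them is not reproduced here).

```python
def rank_labels(candidates, judges, alphas, top_k=5):
    """Score each candidate label by a linear combination of judge scores
    and return the top-k labels."""
    score = {l: sum(a * judge(l) for a, judge in zip(alphas, judges))
             for l in candidates}
    return sorted(score, key=score.get, reverse=True)[:top_k]
```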

Experiment
Data Collections
▫20 News Groups (20NG): newsgroup documents that were manually classified into 20 different categories; each category includes 1,000 documents
▫Open Directory Project (ODP): we randomly selected 100 different categories from the ODP hierarchy and, for each category, up to 100 documents, for a collection size of about 10,000 documents

Experiment We evaluated the system's performance using two measures:
▫Match@k: the relative number of clusters for which at least one of the top-k labels is correct
▫Mean Reciprocal Rank (MRR): given an ordered list of k proposed labels for a cluster, the reciprocal rank is the inverse of the rank of the first correct label, or zero if no label in the list is correct; MRR is the average of the reciprocal ranks over all clusters

Query   Results                Correct response   Rank   Reciprocal rank
cat     catten, cati, cats     cats               3      1/3
torus   torii, tori, toruses   tori               2      1/2
virus   viruses, virii, viri   viruses            1      1
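A direct implementation of the reciprocal-rank definition, checked against the examples in the table above.

```python
def reciprocal_rank(proposed, correct):
    """Inverse of the rank of the first correct label; 0 if none is correct."""
    for rank, label in enumerate(proposed, start=1):
        if label in correct:
            return 1.0 / rank
    return 0.0

def mrr(proposed_per_cluster, correct_per_cluster):
    """Average of the reciprocal ranks over all clusters."""
    rrs = [reciprocal_rank(p, c)
           for p, c in zip(proposed_per_cluster, correct_per_cluster)]
    return sum(rrs) / len(rrs)

print(reciprocal_rank(["catten", "cati", "cats"], {"cats"}))       # 1/3
print(reciprocal_rank(["torii", "tori", "toruses"], {"tori"}))     # 1/2
print(reciprocal_rank(["viruses", "virii", "viri"], {"viruses"}))  # 1.0
```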

The Effectiveness of Using Wikipedia to Enhance Cluster Labeling
(Results figures for the 20NG and ODP collections.)

Candidate Labels Extraction We identify two significant parameters that can affect the quality of Wikipedia’s labels: ▫the number of important terms that are used to query Wikipedia ▫the number of top scored results from which candidate labels are extracted

Candidate Labels Extraction
(Figures: the effect of these two parameters on the quality of the extracted labels.)

Evaluating Judge Effectiveness
(Results figures for the 20NG and ODP collections.)

Conclusion Cluster labeling can be enhanced by utilizing the Wikipedia knowledge base. We described a general framework for cluster labeling that extracts candidate labels from the cluster's text and from Wikipedia, and then returns the top-scored candidates according to the evaluation of several independent judges. Since no single judge configuration is best for every collection, developing such collection-specific decision making as part of the labeling framework is left for future research.