Learning to Cluster Web Search Results (SIGIR '04)

ABSTRACT Organizing Web search results into clusters facilitates users' quick browsing through search results. Traditional clustering techniques:  They don't generate clusters with highly readable names.  Classification methods, by contrast, need pre-defined categories. Based on a regression model learned from human-labeled training data, this work converts the unsupervised clustering problem into a supervised learning problem.

INTRODUCTION A user submits the query "jaguar" to Google.  For results related to "big cat", the user has to go to the 10th, 11th, 32nd and 71st results. A possible solution to this problem is to cluster search results online into different groups, ranking salient phrases as cluster names.  This re-formalizes the clustering problem as a salient phrase ranking problem.

INTRODUCTION Salient phrases are extracted from result titles and snippets. *Real demonstration of this technique

INTRODUCTION Related work:  Leouski A. V. and Croft W. B. An Evaluation of Techniques for Clustering Search Results. Technical Report IR-76, Department of Computer Science.  Zamir O. and Etzioni O. Web Document Clustering: A Feasibility Demonstration. (SIGIR'98), 1998.  Zamir O. and Etzioni O. Grouper: A Dynamic Clustering Interface to Web Search Results. (WWW8), 1999.  Leuski A. and Allan J. Improving Interactive Retrieval by Combining Ranked List and Clustering. Proceedings of RIAO.  Liu B., Chin C. W., and Ng H. T. Mining Topic-Specific Concepts and Definitions on the Web. (WWW'03), 2003.

Problem Formalization And Algorithm Problem formalization:  Ranked list of search results for the current query q: (d_1, d_2, ...), where d_i is a document and r(q, d_i) is some (unknown) function calculating the probability that d_i is relevant to q.  Traditional clustering: find a set of topic-coherent clusters for query q.  This work: find a ranked list of clusters C', with each cluster associated with a cluster name as well as a new ranked list of documents. Algorithm: four steps  1. Search result fetching  2. Document parsing and phrase property calculation  3. Salient phrase ranking  4. Post-processing
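The four steps above can be sketched as a simple pipeline; all function and variable names here are illustrative stand-ins, not from the paper:

```python
# Hypothetical sketch of the four-step pipeline; the callables are
# placeholders for the real fetch/parse/rank/post-process components.

def cluster_search_results(query, fetch, parse, rank, postprocess):
    """1) fetch results, 2) parse and compute phrase properties,
    3) rank salient phrases, 4) post-process into named clusters."""
    results = fetch(query)                 # 1. search result fetching
    phrases = parse(results)               # 2. parsing + property calculation
    ranked = rank(phrases)                 # 3. salient phrase ranking
    return postprocess(ranked, results)    # 4. post-processing

# Toy stand-ins showing the data flow:
out = cluster_search_results(
    "jaguar",
    fetch=lambda q: ["doc1", "doc2"],
    parse=lambda docs: [("big cat", 0.9), ("car", 0.7)],
    rank=lambda ps: sorted(ps, key=lambda p: -p[1]),
    postprocess=lambda ranked, docs: [name for name, _ in ranked],
)
```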

Salient Phrases Extraction 1/3 Five properties:  1. Phrase Frequency / Inverted Document Frequency (TFIDF): TFIDF(w) = TF(w) * log(N / |D(w)|), where w is the current phrase, N is the number of documents, and D(w) is the set of documents that contain w.  2. Intra-Cluster Similarity (ICS): documents are converted into the vector space model, d_i = (x_i1, x_i2, ...), with each component weighted by TFIDF. For each candidate cluster, the centroid is computed as o = (1/|D(w)|) * sum of d_i over d_i in D(w), and ICS is calculated as the average cosine similarity between each document and the centroid: ICS = (1/|D(w)|) * sum of cos(d_i, o) over d_i in D(w).
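A minimal sketch of the TFIDF and ICS properties (a standard TFIDF formulation; the paper's exact weighting may differ in detail):

```python
import math

def tfidf(phrase_count, doc_total_terms, n_docs, n_docs_with_phrase):
    """Phrase frequency weighted by inverted document frequency."""
    tf = phrase_count / doc_total_terms
    idf = math.log(n_docs / n_docs_with_phrase)
    return tf * idf

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def intra_cluster_similarity(doc_vectors):
    """ICS: average cosine similarity between each document vector in
    D(w) and the candidate cluster's centroid."""
    dim = len(doc_vectors[0])
    centroid = [sum(d[i] for d in doc_vectors) / len(doc_vectors)
                for i in range(dim)]
    return sum(cosine(d, centroid) for d in doc_vectors) / len(doc_vectors)
```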

Salient Phrases Extraction 2/3  3. Phrase Length (Len): the number of words in the phrase. Example: Len(big) = 1, Len(big cats) = 2.  4. Cluster Entropy (CE): for a given phrase w, the corresponding document set D(w) might overlap with other D(w_i) where w_i != w. One extreme: D(w) overlaps with almost every other D(w_i), so w is too general a phrase to be a good salient phrase. The other extreme: D(w) seldom overlaps with other D(w_i), so w may have some distinct meaning. Example: for the query "jaguar", "big cat" seldom co-occurs with other salient keywords such as "car", "mac os", etc.
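The overlap behavior described above can be scored with an entropy-style measure. This is one plausible formulation, CE(w) = -sum over t of p*log(p) with p = |D(w) ∩ D(t)| / |D(w)|; the paper's exact definition may differ:

```python
import math

def cluster_entropy(dw, other_doc_sets):
    """CE sketch (hedged formulation): entropy-style score over the
    overlaps between D(w) and the document sets of other phrases.
    No overlap at all contributes nothing, as does total overlap."""
    h = 0.0
    for dt in other_doc_sets:
        p = len(dw & dt) / len(dw)
        if p > 0:
            h -= p * math.log(p)
    return h
```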

Salient Phrases Extraction 3/3  5. Phrase Independence** (IND): a phrase is independent when the entropy of its context is high. ** Chien L. F. PAT-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval. (SIGIR'97), 1997.
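In the spirit of the IND property, context entropy can be computed over the words observed adjacent to the phrase; this is an illustrative sketch, not the paper's exact PAT-tree procedure:

```python
import math
from collections import Counter

def context_entropy(contexts):
    """Entropy of the words seen next to a phrase. Highly varied
    contexts (high entropy) suggest the phrase is a complete,
    independent unit rather than a fragment of a longer phrase."""
    counts = Counter(contexts)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```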

Learning to Rank Salient Phrases 1/3 Regression is a classic statistical problem that tries to determine the relationship between random variables x = (x_1, x_2, ..., x_p) and y.  Here x = (TFIDF, LEN, ICS, CE, IND).  y can be any real-valued score. Linear regression: y = b_0 + b_1*x_1 + ... + b_p*x_p + e, where the residual e is a random variable. The coefficients are determined by the condition that the sum of the squared residuals is as small as possible.
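The least-squares criterion can be shown with a minimal one-feature fit (the paper uses the five features above; one is used here purely for brevity):

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = b0 + b1*x: the closed-form solution
    that minimizes the sum of squared residuals."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1
```

For points lying exactly on y = 2x + 1, the fit recovers the intercept 1 and slope 2.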

Learning to Rank Salient Phrases 2/3 Logistic regression:  When the dependent variable y is a dichotomy, logistic regression is more suitable, because what we want to predict is not a precise numerical value of the dependent variable, but rather the probability q that it equals 1.  Whereas q can only range from 0 to 1, logit(q) = log(q / (1 - q)) ranges from negative infinity to positive infinity.
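The logit transform and its inverse (the logistic function) can be written directly:

```python
import math

def logit(q):
    """Log-odds: maps a probability q in (0, 1) to (-inf, +inf)."""
    return math.log(q / (1 - q))

def inv_logit(z):
    """Logistic (sigmoid) function: maps any real z back to (0, 1)."""
    return 1 / (1 + math.exp(-z))
```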

Learning to Rank Salient Phrases 3/3 Support vector regression***:  The input x is first mapped onto a high-dimensional feature space using some nonlinear mapping.  With an epsilon-insensitive loss function, SV regression tries to minimize ||w||^2. *** Joachims T. Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, Schölkopf B., Burges C. and Smola A. (eds.), MIT Press, 1999.
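The epsilon-insensitive loss at the heart of SV regression is simple to state: deviations smaller than epsilon cost nothing, larger ones cost linearly. A minimal sketch:

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Epsilon-insensitive loss used in SV regression: zero inside the
    epsilon tube around the target, linear penalty outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)
```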

Experiments The number of results fetched from the search engine is set to 200 by default. Evaluation measure:  Traditional clustering algorithms are difficult to evaluate.  In this approach, evaluation is relatively easy because the problem is defined as a ranking problem.  The classical evaluation method from Information Retrieval is used: precision at top N results, P@N = |R ∩ C| / N, where R is the set of top N salient keywords and C is the set of manually tagged correct salient keywords.
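The P@N measure above can be computed directly:

```python
def precision_at_n(ranked_phrases, correct_set, n):
    """P@N: fraction of the top-N ranked salient phrases that appear
    in the manually tagged correct set C."""
    top = ranked_phrases[:n]
    return sum(1 for p in top if p in correct_set) / n
```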

Experiments Training data collection:  3 human evaluators labeled ground-truth data for 30 queries.  The queries were selected from one day's query log from MSN.

Experiments - Training Data Collection:  For each query, all n-grams (n <= 3) are extracted from the search results as candidate phrases.  Each of the 3 evaluators selected from the candidates: 10 "good phrases" (assigned score 100) and 10 "medium phrases" (assigned score 50); all other phrases score zero.  Finally, the three scores are added together; y = 1 is assigned to phrases whose total score is greater than 100, and y = 0 to all others.
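The labeling scheme above (sum the three evaluators' 100/50/0 scores, then binarize at 100) can be sketched as follows; the function and phrase names are illustrative:

```python
def make_labels(scores_by_phrase):
    """Aggregate the three evaluators' scores per candidate phrase and
    binarize: y = 1 if the summed score exceeds 100, else y = 0."""
    return {phrase: 1 if sum(scores) > 100 else 0
            for phrase, scores in scores_by_phrase.items()}

labels = make_labels({"big cat": [100, 50, 0], "zoo": [50, 0, 0]})
```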

Experimental Results Property Comparison:

Experimental Results Learning methods comparison:  Three-fold cross-validation is used to evaluate the 3 regression methods.

Experimental Results

CONCLUSION AND FUTURE WORK Several properties, as well as several regression models, are proposed to calculate a salience score for each salient phrase. Clusters with short names are hopefully more readable and could improve users' browsing efficiency through search results. Future work:  Extract syntactic features for keywords and phrases to assist the salient phrase ranking.  A hierarchical structure of search results is necessary for more efficient browsing.  External taxonomies such as Web directories contain much knowledge, so a combination of classification and clustering might be helpful in this application.