SIGIR’09 Boston 1 Entropy-biased Models for Query Representation on the Click Graph Hongbo Deng, Irwin King and Michael R. Lyu Department of Computer Science.

Slides:



Advertisements
Similar presentations
A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.
Advertisements

Evaluating Novelty and Diversity Charles Clarke School of Computer Science University of Waterloo two talks in one!
DQR : A Probabilistic Approach to Diversified Query recommendation Date: 2013/05/20 Author: Ruirui Li, Ben Kao, Bin Bi, Reynold Cheng, Eric Lo Source:
Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
Chapter 5: Introduction to Information Retrieval
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
WSCD INTRODUCTION  Query suggestion has often been described as the process of making a user query resemble more closely the documents it is expected.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Ranking models in IR Key idea: We wish to return in order the documents most likely to be useful to the searcher To do this, we want to know which documents.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.
22 May 2006 Wu, Goel and Davison Models of Trust for the Web (MTW) WWW2006 Workshop L EHIGH U NIVERSITY.
Chen Cheng1, Haiqin Yang1, Irwin King1,2 and Michael R. Lyu1
Ph.D. Thesis Defense 1 Web Mining Techniques for Query Log Analysis and Expertise Retrieval Hongbo Deng Department of Computer Science and Engineering.
Time-dependent Similarity Measure of Queries Using Historical Click- through Data Qiankun Zhao*, Steven C. H. Hoi*, Tie-Yan Liu, et al. Presented by: Tie-Yan.
Presented by Li-Tal Mashiach Learning to Rank: A Machine Learning Approach to Static Ranking Algorithms for Large Data Sets Student Symposium.
Context-Aware Query Classification Huanhuan Cao 1, Derek Hao Hu 2, Dou Shen 3, Daxin Jiang 4, Jian-Tao Sun 4, Enhong Chen 1 and Qiang Yang 2 1 University.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 12: Language Models for IR.
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
Ch 4: Information Retrieval and Text Mining
Relevance Feedback based on Parameter Estimation of Target Distribution K. C. Sia and Irwin King Department of Computer Science & Engineering The Chinese.
1 Integrating User Feedback Log into Relevance Feedback by Coupled SVM for Content-Based Image Retrieval 9-April, 2005 Steven C. H. Hoi *, Michael R. Lyu.
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
1 PageSim: A Link-based Similarity Measure for the World Wide Web Zhenjiang Lin, Irwin King, and Michael, R., Lyu Computer Science & Engineering, The Chinese.
Information Retrieval
Personalized Ontologies for Web Search and Caching Susan Gauch Information and Telecommunications Technology Center Electrical Engineering and Computer.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Multi-Style Language Model for Web Scale Information Retrieval Kuansan Wang, Xiaolong Li and Jianfeng Gao SIGIR 2010 Min-Hsuan Lai Department of Computer.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 An Efficient Concept-Based Mining Model for Enhancing.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 8: Information Retrieval II Aidan Hogan
Gradual Adaption Model for Estimation of User Information Access Behavior J. Chen, R.Y. Shtykh and Q. Jin Graduate School of Human Sciences, Waseda University,
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
1 CS 430: Information Discovery Lecture 9 Term Weighting and Ranking.
Query Suggestions Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Query Suggestion Naama Kraus Slides are based on the papers: Baeza-Yates, Hurtado, Mendoza, Improving search engines by query clustering Boldi, Bonchi,
Personalization with user’s local data Personalizing Search via Automated Analysis of Interests and Activities 1 Sungjick Lee Department of Electrical.
Algorithmic Detection of Semantic Similarity WWW 2005.
Jiafeng Guo(ICT) Xueqi Cheng(ICT) Hua-Wei Shen(ICT) Gu Xu (MSRA) Speaker: Rui-Rui Li Supervisor: Prof. Ben Kao.
A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture IX: 2014/05/05.
CIS 530 Lecture 2 From frequency to meaning: vector space models of semantics.
Hongbo Deng, Michael R. Lyu and Irwin King
Post-Ranking query suggestion by diversifying search Chao Wang.
Information Retrieval Techniques MS(CS) Lecture 7 AIR UNIVERSITY MULTAN CAMPUS Most of the slides adapted from IIR book.
CpSc 881: Information Retrieval. 2 Using language models (LMs) for IR ❶ LM = language model ❷ We view the document as a generative model that generates.
Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.
Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval Ronan Cummins, Colm O’Riordan Digital Enterprise Research Institute SIGIR.
11 A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 1, Michael R. Lyu 1, Irwin King 1,2 1 The Chinese.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
1 CS 430: Information Discovery Lecture 5 Ranking.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Analyzing and Predicting Question Quality in Community Question Answering Services Baichuan Li, Tan Jin, Michael R. Lyu, Irwin King, and Barley Mak CQA2012,
1 Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
MMM2005The Chinese University of Hong Kong MMM2005 The Chinese University of Hong Kong 1 Video Summarization Using Mutual Reinforcement Principle and Shot.
To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent Presented by Jaime Teevan, Susan T. Dumais, Daniel J. Liebling Microsoft.
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 14: Language Models for IR.
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Compact Query Term Selection Using Topically Related Text
Representation of documents and queries
From frequency to meaning: vector space models of semantics
Information Retrieval and Web Design
Presentation transcript:

SIGIR’09 Boston 1 Entropy-biased Models for Query Representation on the Click Graph Hongbo Deng, Irwin King and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong July 21st, 2009

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 2 Introduction Query suggestion Query classification Targeted advertising Ranking  Query log analysis – improve search engine’s capabilities

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 3 Introduction  Click graph – an important technique A bipartite graph between queries and URLs Edges connect a query with the URLs Capture some semantic relations, e.g., “map” and “travel” How to utilize and model the click graph to represent queries? Robustness: Some queries with skewed click count may exclusively influence the click graph Spam: Raw CF can be easily manipulated Traditional model based on the raw click frequency (CF) Propose an entropy-biased framework

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 4 Motivation  Basic idea Various query-URL pairs should be treated differently  Intuition Common clicks on less frequent but more specific URLs are of greater value than common clicks on frequent and general URLs Is a single click on different URLs equally important? General URL Specific URL

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 5 Outline  Introduction  Related Work  Methodology Preliminaries Click Frequency Model Entropy-biased Model  Experiments  Conclusion

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 6 Modeling queries and URLs Click entropy & result entropy Related Work  Using click graph Query clustering (Befferman and Berger, KDD’00, Wen et al., WWW’ 01) Random walks for relevance rank in image search (Craswell and Szummer, SIGIR’05) Query suggestion by computing the hitting time on a click graph (Mei et al., CIKM’08) Query classification from regularized click graph (Li et al., SIGIR’08) Using click graph These methods are proposed based on the click graph, while our objective is to investigate a better model to utilize and represent the click graph. Using click graph

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 7 Modeling queries and URLs Click entropy & result entropy Related Work  Using click graph Query clustering (Befferman and Berger, KDD’00, Wen et al., WWW’ 01) Random walks for relevance rank in image search (Craswell and Szummer, SIGIR’05) Query suggestion by computing the hitting time on a click graph (Mei et al., CIKM’08) Query classification from regularized click graph (Li et al., SIGIR’08) Using click graph These existing methods do not distinguish the variation on different query-URL pairs Using click graph  Modeling the representation Use the content of clicked Web pages to define a term-weight vector model for a query (Baeza-Yates et al., 2004) Represent query as a vector of documents (URLs) without considering the content information (Baeza-Yates and Tiberi, KDD’07) Propose the query-set document model to represent documents by mining frequent query patterns rather than the content information of the documents (Poblete et al., WWW’08) Modeling queries and URLs

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 8  Using click graph Query clustering (Befferman and Berger, KDD’00, Wen et al., WWW’ 01) Random walks for relevance rank in image search (Craswell and Szummer, SIGIR’05) Query suggestion by computing the hitting time on a click graph (Mei et al., CIKM’08) Query classification from regularized click graph (Li et al., SIGIR’08) Modeling queries and URLs Click entropy & result entropy Related Work Using click graph These methods are focused on personalization for different queries, while our entropy- biased models are focused on the weighting scheme of various query-URL pairs Using click graph  Modeling the representation Use the content of clicked Web pages to define a term-weight vector model for a query (Baeza-Yates et al., 2004) Represent query as a vector of documents (URLs) without considering the content information (Baeza-Yates and Tiberi, KDD’07) Propose the query-set document model to represent documents by mining frequency query patterns rather than the content information of the documents (Qin et al., WWW’08) Modeling queries and URLs  For personalization Explore click entropy to measure the variability in click results (Dou et al., WWW’ 07) Propose result entropy to capture how often results change (Teevan et al., SIGIR’08) Click entropy & result entropy

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 9 Outline  Introduction  Related Work  Methodology Preliminaries Click Frequency Model Entropy-biased Model  Experiments  Conclusion

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 10 Preliminaries Query: URL: User: Query instance:

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 11 Traditional Click Frequency Model  Edges of click graph: Weighted by the raw click frequency (CF)  Transition probability Normalize CF From query to URL:From URL to query: Based on the transition probabilities, the query and document can be represented by the vector of transition probabilities respectively.

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 12 Traditional Click Frequency Model  Measure the similarity between queries The most similar query  q2 (“map”)  q1 (“Yahoo”) More reasonable  q2 (“map”)  q3 (“travel”) Cosine similarity: The CF model only considers the raw click frequency, and treats different query-URL pairs equally, even if some URLs are heavily clicked.

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 13 Methodology M Traditional click frequency model Entropy-biased models

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 14 Entropy-biased Model  The more general and highly ranked URL Connect with more queries Increase the ambiguity and uncertainty  The entropy of a URL: Suppose Tend to be proportional to the n(d j ) It would be more reasonable to weight these two edges differently because of the variation of the connected URLs.

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 15 Entropy  Discriminative Ability  Entropy increase, discriminative ability decrease Be inversely proportional to each other A URL with a high query frequency is less discriminative overall  Inverse query frequency Measure the discriminative ability of the URL Benefits  Constrain the influence of some heavily-clicked URLs  Balance the inherent bias of clicks for those highly ranked  Incorporate with other factors to tune the model

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 16 CF-IQF Model  Incorporate the IQF with the click frequency  A high click frequency  A low query frequency  “A” is weighted higher than “B”

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 17 CF-IQF Model  Transition probability The most similar query q2 (“map”)  q3 (“travel”) The most similar query q2 (“map”)  q1 (“Yahoo”)

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 18 UF Model and UF-IQF Model  Drawback of CF model Prone to spam by some malicious clicks (if a single user clicks on a certain URL thousands of times)  UF model Weight by user frequency instead of click frequency Improve the resistance against malicious click  UF-IQF model

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 19 Connection with TF-IDF  TF-IDF has been extensively and successfully used in the vector space model for text retrieval  Several researchers have tried to interpret IDF based on binary independence retrieval (BIR), Possion, information entropy and LM  TF-IDF has never been explored to bipartite graphs, and the IQF is new. The CF-IQF is a simplified version of the entropy-biased model  The entropy-biased model is employed to identify the edge weighting of the click graph, which can be applied to other bipartite graphs

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 20 Mining Query Log on Click Graph Query clustering Query-to-query similarity Query suggestion Models Query-to-query similarity Query suggestion

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 21 Similarity Measurement  Cosine similarity  Jaccard coefficient  The similarity results are reported and analyzed

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 22 Graph-based Random Walk  Query-to-query graph The transition probability from q i to q j  The personalized PageRank

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 23 Outline  Introduction  Related Work  Methodology Preliminaries Click Frequency Model Entropy-biased Model  Experiments  Conclusion

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 24 Experimental Evaluation  Data collection AOL query log data  Cleaning the data Removing the queries that appear less than 2 times Combining the near-duplicated queries 883,913 queries and 967,174 URLs 4,900,387 edges

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 25 Distributions

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 26 Evaluation: ODP Similarity  A simple measure of similarity among queries using ODP categories (query  category) Definition: Example:  Q1: “United States”  “Regional > North America > United States”  Q2: “National Parks”  “Regional > North America > United States > Travel and Tourism > National Parks and Monuments”  Precision at rank n  300 distinct queries 3/5

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 27 Experimental Results  Query similarity analysis 1. CF-IQF is better than CF UF-IQF > UF The results support our intuition of the entropy-biased framework about treating various query-URL pairs differently 2. UF is better than CF UF-IQF > CF-IQF The results indicates the user frequency associated with the query-URL pair is more robust than the click frequency for modeling the click graph. Results:

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 28 Experimental Results  Query similarity analysis 3. TF-IDF is better than TF The improvements of CF-IQF over CF and UF-IQF over UF models are consistent with the improvement of TF-IDF over TF model. The reason: they share the same key point to identify and tune the importance of a term or a query- URL edge.

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 29 Experimental Results  Query similarity analysis 4. Jaccard coefficient The improvements are consistent with the Cosine similarity

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 30 Experimental Results  Query similarity analysis 5. UF-IQF achieves best performance in most cases. It is very essential and promising to consider the entropy-biased models for the click graph. 6. CF and UF models > TF CF-IQF, UF-IQF > TF-IDF The click graph catches more semantic relations between queries than the query terms

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 31 Experimental Results  Random Walk Evaluation Results: 1. With the increase of n, both models improve their performance. 2. CF-IQF model always performs better than the CF mode.

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 32 Experimental Results  Random Walk Evaluation In general, the results generated by the CF and the CF-IQF models are similar, and mostly semantically relative to the original query, such as “American airline”. Another important observation is that the CF- IQF model can boost more relevant queries as suggestion and reduce some irrelevant queries.

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 33 Conclusions  Introduce the inverse query frequency (IQF) to measure the discriminative ability of a URL  Identify a new source, user frequency, for diminishing the manipulation of the malicious clicks  Propose the entropy-biased models to combine the IQF with the CF as well as UF for click graphs  Experimental results show that the improvements of our proposed models are consistent and promising

Hongbo Deng, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong KongSIGIR’09 Boston 34 Q&A Thanks!