Mining User-logs for Query and Term Clustering: An Exploration on the Encarta Encyclopedia. Jian-Yun Nie (RALI, DIRO, Univ. Montreal); Ji-Rong Wen, Hong-Jiang Zhang (MSRCN)
Context of this work. Summer 2000 at MSR in China (Beijing). Needs of the Encarta group at Microsoft: –Human editors improve the encyclopedia by adding both new articles and new index entries –They wanted to identify the “hot topics” –Only manual analysis of user queries was being used. Goal: build a clustering tool that automatically groups similar queries.
Context Influence of AskJeeves –Provide more accurate answers (answer questions) –Hypothesis: many users are interested in a limited number of topics (hot topics). –If these queries (questions) can be answered correctly, most users can be satisfied.
FAQ-based Question-Answering A set of FAQs that have previously been answered, or whose answers have been manually checked. A user’s query is compared with the FAQs. Similar FAQs are proposed to the user. The user selects a FAQ » Answers of higher quality
How to cluster queries? Query similarity based on the queries themselves: –keywords –question form (templates, question words, etc.) Observations: –Queries are short –Forms and words vary –It is difficult to cluster truly similar queries
Using document links for ranking/similarity
Citation links were already explored in IR in the 1970s, but links were limited.
PageRank in Google: R(p) = ε/n + (1-ε) Σ_{q→p} R(q)/outdegree(q)
HITS (Kleinberg): Authority / Hub scores
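To make the PageRank formula above concrete, here is a minimal sketch of the iteration on a toy link graph; the graph, the eps value, the iteration count, and the function name are illustrative assumptions, not part of the talk.

```python
# Minimal PageRank iteration for R(p) = eps/n + (1 - eps) * sum_{q -> p} R(q) / outdegree(q).
# The toy graph, eps value, and iteration count are illustrative assumptions.
def pagerank(links, eps=0.15, iterations=50):
    """links: dict mapping each page to the list of pages it points to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: eps / n for p in pages}  # random-jump part
        for q, targets in links.items():        # dangling pages are ignored in this sketch
            for p in targets:                   # distribute q's rank over its out-links
                new_rank[p] += (1 - eps) * rank[q] / len(targets)
        rank = new_rank
    return rank

# Example: pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```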
[Diagram: the query space and the document space, bridged by analysis of user interactions; query analysis on one side, document analysis on the other]
Related work on cross-connections
Relevance feedback in IR
–Relevance judgement by the user
–Query reformulation, e.g. Rocchio: Q’ = α·Q + β·R - γ·NR
–Document clustering: all the documents relevant to a query are put into the same cluster
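As a quick illustration of the Rocchio reformulation above, a minimal sketch over sparse bag-of-words term vectors; the default α/β/γ values and the vector representation are illustrative assumptions.

```python
# Rocchio reformulation Q' = alpha*Q + beta*R - gamma*NR over sparse term vectors.
# The default alpha/beta/gamma values are illustrative assumptions.
from collections import defaultdict

def rocchio(query_vec, relevant_vecs, nonrelevant_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """Each vector is a dict term -> weight; R and NR contributions are averaged."""
    new_query = defaultdict(float)
    for term, w in query_vec.items():
        new_query[term] += alpha * w
    for vec in relevant_vecs:
        for term, w in vec.items():
            new_query[term] += beta * w / len(relevant_vecs)
    for vec in nonrelevant_vecs:
        for term, w in vec.items():
            new_query[term] -= gamma * w / len(nonrelevant_vecs)
    return {t: w for t, w in new_query.items() if w > 0}  # drop non-positive weights
```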
Related work (cont’d) Citeseer, AskJeeves, Amazon, Lycos, etc. –Users who read this page (paper) also read other pages (papers). Su, Yang, et al. (2000) –URL co-occurrences within time windows in user-logs Beeferman & Berger (2000) –Exploring user-logs for document and query clustering
Our hypotheses Hypothesis 1: Queries of similar form or words are similar. Hypothesis 2: Queries leading to common document clicks are similar. Combining the two criteria
Expected advantages Benefit from the user’s own click-through decisions –implicit relevance judgements Bridge the gap between the document space and the query space –Different wording –Construct a “live thesaurus” linking the user’s words with the document’s words
Possible problems A clicked document is not necessarily relevant to the query. Even in a relevant document, many words do not have semantic relations with the query. May end up with noisy clusters.
Why may this idea work? Pseudo-relevance feedback in IR –The IR system finds a set of documents –The top-ranked documents are assumed to be “relevant” (many of them are not!) –These documents are used to reformulate the query –Improvements in retrieval effectiveness of 10-15% Non-strict feedback can help. User click-throughs are better data: top documents + some user judgment. A large amount of data is available.
Encarta encyclopedia Document hierarchy Example: Country (sub-category) –people (or population) –culture –geography –etc. 9 categories, 93 sub-categories, several tens of thousands of articles
Overview of query clustering: Sessions (queries + clicks) → Preprocessing (stemming, stop-word removal, phrase recognition, synonym labeling) → Internal representation → Clustering algorithm (similarity functions, similarity threshold 1-Eps, minimal density MinPts) → Clusters
Query session. Encarta interface: –The user submits a query (question) –The Encarta search engine proposes a list of articles –The user selects (clicks on) some of the documents. User-logs: ... address, time, query, clicks, ... Extracted sessions: query -> {... document-click, ...}. About ... sessions/day from MSN.
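A minimal sketch of turning log records into sessions of the form query -> {clicked documents}; the tab-separated field layout assumed here is illustrative, not the actual MSN/Encarta log format.

```python
# Group user-log records into sessions: query -> set of clicked document IDs.
# The tab-separated layout (address, time, query, clicked_doc) is an illustrative
# assumption, not the actual MSN/Encarta log format.
from collections import defaultdict

def extract_sessions(log_lines):
    sessions = defaultdict(set)
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue                      # skip malformed records
        _address, _time, query, clicked_doc = fields[:4]
        if query and clicked_doc:
            sessions[query.strip().lower()].add(clicked_doc)
    return sessions
```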
Choice of clustering algorithm. A number of algorithms are available: –Hierarchical agglomerative clustering –k-means, etc. Our criteria: –complexity (large amount of data) –incremental (daily changes) –not too many manual settings (e.g. the number of clusters) –no limit on maximum cluster size
DBSCAN –complexity O(n·log(n)) –incremental (incremental DBSCAN) –clusters of different shapes and sizes –requires: Eps = distance threshold; MinPts = minimal number of elements in a cluster. Principle: –scan elements only once –if an element is a core element (#Eps-neighbors ≥ MinPts), expand the cluster by adding its neighbors.
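A compact, generic DBSCAN sketch over queries, using distance = 1 - similarity so that the similarity threshold in the slides corresponds to 1-Eps. This naive version is quadratic and non-incremental; the actual tool relies on an incremental variant, so treat this only as an illustration of the core/neighbor expansion principle.

```python
# Generic DBSCAN sketch: distance = 1 - similarity, so the similarity threshold
# used in the slides corresponds to 1 - Eps. Naive O(n^2) neighborhood search,
# for illustration only (the tool itself uses an incremental variant).
def dbscan(items, dist, eps, min_pts):
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(items)
    cluster_id = -1

    def neighbors(i):
        # Eps-neighborhood of item i (including i itself).
        return [j for j in range(len(items)) if dist(items[i], items[j]) <= eps]

    for i in range(len(items)):
        if labels[i] is not UNVISITED:
            continue
        neigh = neighbors(i)
        if len(neigh) < min_pts:
            labels[i] = NOISE            # may later be absorbed as a border element
            continue
        cluster_id += 1                  # i is a core element: start a new cluster
        labels[i] = cluster_id
        seeds, k = list(neigh), 0
        while k < len(seeds):            # expand the cluster from the core element
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border element joins the cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neigh = neighbors(j)
            if len(j_neigh) >= min_pts:  # j is also a core element
                seeds.extend(j_neigh)
    return labels                        # -1 marks noise, other ints are cluster ids
```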
Similarity based on queries
Selection of keywords: stop-word removal, stemming
Sim(q1, q2) = #common_kw(q1, q2) / Max(#kw(q1), #kw(q2))
Weighted version: Sim(q1, q2) = [ Σ_{k_i ∈ q1∩q2} (w_{q1}(k_i) + w_{q2}(k_i)) / 2 ] / Max( Σ_{k_i ∈ q1} w_{q1}(k_i), Σ_{k_i ∈ q2} w_{q2}(k_i) )
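A sketch of the keyword-based similarity above; the stop-word list, the tokenization, and the weight function are simplified illustrative assumptions.

```python
# Keyword-based query similarity, following the two formulas on this slide.
# The stop-word list and tokenization are simplified for illustration.
STOP_WORDS = {"the", "a", "of", "in", "on", "for", "to", "and", "is", "does", "where"}

def keywords(query):
    return {w.lower() for w in query.split() if w.lower() not in STOP_WORDS}

def sim_kw(q1, q2):
    k1, k2 = keywords(q1), keywords(q2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / max(len(k1), len(k2))

def sim_kw_weighted(q1, q2, weight):
    """weight(query, term) -> float, e.g. a tf-idf style weight (assumed given)."""
    k1, k2 = keywords(q1), keywords(q2)
    common = sum((weight(q1, k) + weight(q2, k)) / 2 for k in k1 & k2)
    denom = max(sum(weight(q1, k) for k in k1), sum(weight(q2, k) for k in k2))
    return common / denom if denom else 0.0
```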
Using query form
Edit distance(q1, q2); Sim(q1, q2) = 1 - edit_dist(q1, q2) (edit distance normalized to [0,1])
Examples: –Where does silk come from? –Where does lead come from? –Where does dew come from? => template: Where does X come from?
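A sketch of the form-based similarity; the word-level edit distance and the normalization by the longer query length are assumptions, since the slide only writes Sim = 1 - edit_dist.

```python
# Query-form similarity via normalized edit distance: Sim = 1 - dist / max_len.
# Word-level distance and normalization by the longer query are assumptions.
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance over word sequences.
    a, b = a.lower().split(), b.lower().split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def sim_form(q1, q2):
    max_len = max(len(q1.split()), len(q2.split()))
    return 1.0 - edit_distance(q1, q2) / max_len if max_len else 0.0
```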
Using document clicks
D_C(q_i) = { d_c_{i1}, d_c_{i2}, …, d_c_{ip} } ⊆ D(q_i); D_C(q_j) = { d_c_{j1}, d_c_{j2}, …, d_c_{jq} } ⊆ D(q_j)
Sim(q1, q2) = #common_clicks(q1, q2) / Max(#clicks(q1), #clicks(q2))
Example: the queries “atomic bomb”, “Hiroshima Nagasaki”, “nuclear fission”, “nuclear bombs”, “Japan surrender”, “Manhattan Project”, … all lead to clicks on the same article (ID: …, Title: Atomic bomb).
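A sketch of the click-based similarity, reusing the session map (query -> set of clicked document IDs) built from the logs in the earlier sketch.

```python
# Click-through similarity between two queries, following the formula on this slide.
# sessions: query -> set of document IDs clicked for that query.
def sim_click(q1, q2, sessions):
    clicks1, clicks2 = sessions.get(q1, set()), sessions.get(q2, set())
    if not clicks1 or not clicks2:
        return 0.0
    return len(clicks1 & clicks2) / max(len(clicks1), len(clicks2))
```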
Using the Encarta hierarchy
Hypothesis: two documents within the same category (sub-category) are more similar.
s(d1, d2) = [Level(common_parent(d1, d2)) - 1] / 3
Sim(q1, q2) = 1/2 · ( Σ_i Max_j s(d_i, d_j) / #doc_click(q1) + Σ_j Max_i s(d_i, d_j) / #doc_click(q2) )
where d_i ranges over the documents clicked for q1, d_j over those clicked for q2, and #doc_click(q) is the number of document clicks for q.
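A sketch of the hierarchy-based similarity; representing each document by its path in the Encarta hierarchy (root, category, sub-category, article) is an illustrative encoding, not the actual data structure.

```python
# Hierarchy-based similarity: s(d1, d2) = (level of deepest common ancestor - 1) / 3,
# then symmetric averaging over best matches, as on this slide. Documents are
# represented by hierarchy paths such as ("root", "category", "sub_category", "article"),
# which is an illustrative encoding.
def s_doc(path1, path2):
    common = 0
    for a, b in zip(path1, path2):
        if a != b:
            break
        common += 1                      # depth of the deepest common ancestor
    return max(common - 1, 0) / 3.0

def sim_hierarchy(q1, q2, doc_paths):
    """doc_paths: query -> list of hierarchy paths of its clicked documents."""
    docs1, docs2 = doc_paths.get(q1, []), doc_paths.get(q2, [])
    if not docs1 or not docs2:
        return 0.0
    part1 = sum(max(s_doc(d1, d2) for d2 in docs2) for d1 in docs1) / len(docs1)
    part2 = sum(max(s_doc(d1, d2) for d1 in docs1) for d2 in docs2) / len(docs2)
    return 0.5 * (part1 + part2)
```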
Example
Query 1: “image processing” → clicked article (ID: …, Title: Computer Graphics)
Query 2: “image rendering” → clicked article (ID: …, Title: Computer Animation)
Common parent: “computer science”; common keyword: “image” => similar
Combinations of criteria
Linear combination: Sim(q1, q2) = α·Sim_kw(q1, q2) + β·Sim_click(q1, q2)
How to set α and β?
–using training data (not available in our case)
–set by the user (Encarta editors)
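A sketch of the linear combination, reusing the sim_kw and sim_click sketches above; the default α and β are arbitrary placeholders, to be set by the editors. The trailing comment shows how the combined measure could feed the DBSCAN sketch (distance = 1 - similarity).

```python
# Linear combination of the keyword and click similarities, with alpha and beta
# set by hand (e.g., by the Encarta editors) since no training data is available.
# Reuses sim_kw and sim_click from the earlier sketches.
def sim_combined(q1, q2, sessions, alpha=0.5, beta=0.5):
    return alpha * sim_kw(q1, q2) + beta * sim_click(q1, q2, sessions)

# Plugging it into the DBSCAN sketch: distance = 1 - similarity, so a similarity
# threshold of, say, 0.6 corresponds to Eps = 0.4 (values are illustrative).
# labels = dbscan(queries, lambda a, b: 1 - sim_combined(a, b, sessions), eps=0.4, min_pts=3)
```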
Preliminary experiments. Impact of the similarity (1-Eps) threshold (MinPts=3). Observations: –at Sim=1.0, 42% (keyword) and 64% (click) of the queries are grouped into 1291 and 1756 clusters respectively –flat similarity curve for doc_click (due to few clicks per query) –seems to support FAQ and FAD
Top queries and document clicks
Among queries with Sim_kw ≥ 0.5: world war (4 113); new X (249); X system (160); G. Washington (156); tree X (104); nuclear, plant (73); Egypt (66); Charles X (51); battle (50)
Among queries with Sim_click ≥ 0.5: African American history (625); World war (424); Canada (321); Agriculture (162); Birds (132); Civil war, American (90); California (88); Anatomy (81); Mexico (80); Arizona, pollution (73); American literature, Poetry (69); England (57); Argentina (53); Atomic bomb (53); Fish (51); Shakespeare (50)
Experiments (cont’d). Impact of α and β: different behaviors between 0.5 and 0.9.
Ongoing experiments: clustering quality –How good is a cluster? (precision) –What is the coverage of the clusters? (recall) –requires a set of desired results (a gold standard). Rough evaluation: –combining keywords and clicks seems to be better than keywords only.
Example
Keywords alone: –Cluster 1: thermodynamics –Cluster 2: law of thermodynamics, polygamy law, family law, ohm’s law, watt’s law
Combining keywords and clicks: –Cluster 1: thermodynamics, law of thermodynamics –Cluster 2: polygamy law, family law –Cluster 3: ohm’s law, watt’s law
The status of the tool. Encarta editors are using it to discover FAQs. Potential FAQs have been located among several million queries. Very positive feedback from the editors.
Extensions
Construct a “live thesaurus”: query ↔ query terms ↔ document terms ↔ document
Similar to a translation problem (statistical model?): authors and users speak different “languages”
Using the “live thesaurus” for query expansion
Finding answers according to previous user interactions
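A very rough sketch of how a “live thesaurus” could be bootstrapped from the same sessions: plain co-occurrence counts between query terms and clicked-document terms stand in for the statistical translation model alluded to above (an assumption, not the talk’s method).

```python
# Sketch of a "live thesaurus": associate query terms with terms of the clicked
# documents. Plain co-occurrence counts are used as a stand-in for the statistical
# translation model the slide alludes to (an illustrative assumption).
from collections import defaultdict

def live_thesaurus(sessions, doc_terms):
    """sessions: query -> set of clicked doc IDs; doc_terms: doc ID -> set of terms."""
    assoc = defaultdict(lambda: defaultdict(int))
    for query, clicked in sessions.items():
        for q_term in query.lower().split():
            for doc in clicked:
                for d_term in doc_terms.get(doc, set()):
                    assoc[q_term][d_term] += 1   # q_term and d_term co-occur via a click
    return assoc
```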
Extensions (cont’d). More generally, connections: –among documents (inter-document connections) –among queries (inter-query connections) –between documents and queries (cross document-query connections). Document (query) clustering using query (document) connections through the document-query connections. Derive new query-document connections (find new answers): usage-based retrieval.
Conclusions. User-logs (user interactions) provide useful information that is complementary to keyword- and link-based similarities; a large amount of such data is available to be further exploited. Ongoing projects: –evaluation of clustering quality –live thesaurus from a larger data set
Thanks