SIGMOD 2006: Context-sensitive ranking. Rakesh Agrawal (Microsoft Search Labs), Ralf Rantzau (IBM Silicon Valley Lab), Evimaria Terzi (University of Helsinki & Microsoft Search Labs).

Presentation transcript:

SIGMOD 2006
Context-sensitive ranking
Rakesh Agrawal, Microsoft Search Labs
Ralf Rantzau, IBM Silicon Valley Lab
Evimaria Terzi, University of Helsinki & Microsoft Search Labs
Work done largely while the authors were at IBM Almaden

The curse of abundance: too much data and too many answers
- Query shopping.com for a digital camera
- Query Froogle for a tennis racquet

Ranking query results
Algorithms for ranking web pages have been quite successful ([BP’98, Kleinberg98])
- Key idea: exploit the graph of hyperlinks between web pages
Can we take a similar approach for ranking database query results?
- Need a graph structure that accurately describes the relationships between tuples in the database
- Past attempts: schema and key constraints or queries [BHP’04, BHNCS’02, GMT’04]
But are these graphs natural, or do they reflect design optimization decisions?

Using preferences to induce a graph of tuples

      Genre (G)  Actor (A)  Title (T)             Language
t1    Drama      Kidman     Birth                 English
t2    Drama      Cruz       Vanilla Sky           English
t3    Sci-Fi     Reeves     Matrix                English
t4    Comedy     Cruz       Sin noticias de Dios  Spanish
t5    Comedy     Aniston    Rumor has it…         English

Drama > Sci-Fi: t1 > t3 and t2 > t3
Kidman > Reeves: t1 > t3
Matrix > Birth: t3 > t1
These tuple-level preferences induce a graph on t1, t2, t3.
[ Preferences are predicates of the form “X=x1 > X=x2” ]

Augment preferences with context

      Genre (G)  Actor (A)  Title (T)             Language (L)
t1    Drama      Kidman     Birth                 English
t2    Drama      Cruz       Vanilla Sky           English
t3    Sci-Fi     Reeves     Matrix                English
t4    Comedy     Cruz       Sin noticias de Dios  Spanish
t5    Comedy     Aniston    Rumor has it          English

- in general (*): English > Spanish | *
- but in the context of Comedies: Spanish > English | Comedies
[ Contexts are predicates of the form “Y=a” ]

Preferences in the past
Preferences expressed via a numeric score [AW’00, KI’04, KI’05]
- Nicole Kidman: 0.9
- Penelope Cruz: 0.4
- Dramas: 0.8
- Comedies: 0.3
Pairwise preferences in the ML literature [CSS’97]
Preferences as partial orders [Kießling’02]
Preferences as first-order formulas [Chomicki’03]

Contextual preferences

      Genre (G)  Actor (A)  Title (T)             Language (L)
t1    Drama      Kidman     Birth                 English
t2    Drama      Cruz       Vanilla Sky           English
t3    Sci-Fi     Reeves     Matrix                English
t4    Comedy     Cruz       Sin noticias de Dios  Spanish
t5    Comedy     Aniston    Rumor has it…         English

P1 = {G=Drama > G=Sci-Fi | L=English}: t1 > t3 | En and t2 > t3 | En
P2 = {A=Kidman > A=Reeves | L=English}: t1 > t3 | En
P3 = {T=Matrix > T=Birth | L=English}: t3 > t1 | En
Induced preference graph on t1, t2, t3, with edge weights 2/3, 1/3, 1 and 1/2.
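The step from contextual preferences to a weighted graph of tuples can be sketched in Python as follows. This is only an illustration: the normalization (each direction of a pair gets the fraction of matching preferences that vote for it) is an assumption chosen because it reproduces the weights 2/3, 1/3 and 1 of the example, and the names `TUPLES`, `PREFS` and `induce_graph` are hypothetical, not from the slides.

```python
from collections import defaultdict
from fractions import Fraction

# English-language tuples from the example table (attribute -> value).
TUPLES = {
    "t1": {"G": "Drama", "A": "Kidman", "T": "Birth"},
    "t2": {"G": "Drama", "A": "Cruz", "T": "Vanilla Sky"},
    "t3": {"G": "Sci-Fi", "A": "Reeves", "T": "Matrix"},
}

# P1..P3 as (attribute, preferred value, less preferred value), context L=English.
PREFS = [("G", "Drama", "Sci-Fi"), ("A", "Kidman", "Reeves"), ("T", "Matrix", "Birth")]


def induce_graph(tuples, prefs):
    """Return {(u, v): weight}; an edge u -> v means "u is preferred to v"."""
    votes = defaultdict(int)
    for attr, better, worse in prefs:
        for u, row_u in tuples.items():
            for v, row_v in tuples.items():
                if row_u[attr] == better and row_v[attr] == worse:
                    votes[(u, v)] += 1
    # Assumed normalization: share of the votes on a pair going in each direction.
    return {
        (u, v): Fraction(n, n + votes.get((v, u), 0))
        for (u, v), n in votes.items()
    }


graph = induce_graph(TUPLES, PREFS)
for (u, v), weight in sorted(graph.items()):
    print(f"{u} -> {v}: {weight}")
```

The 1/2 edge of the slide's graph does not follow from P1 to P3 alone, so it is not produced by this sketch.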

Obtaining preferences
- Users provide preferences voluntarily, in the same way users rate products and services
- Preferences can be automatically collected via browser plug-ins or taskbars (with user permission)
- Preferences can be learned from past data
- Preferences can also be learned from the data (e.g., using association-rule mining)
Preferences are obtained from various sources and can contain cycles and contradictions, which are resolved democratically

Overview
Question: How to incorporate users' preferences when ranking query results?
Approach:
- Accumulate contextual preferences of the form i_1 > i_2 | X
- Order the answer tuples such that the preferences are maximally respected, giving higher weight to those preferences whose contexts match the query more closely

Issues
How to define similarity between a query and a context?
- See the paper for the distance function.
Can we create orders in an offline step and use their information at query time?
Should we save all orders?
How to combine the saved orders while answering queries?

Problem decomposition
[Problem 1] (Ordering): For every context X, build an order τ_X
[Problem 2] (ClusterOrders): Given a set of orders T_m = {τ_1, …, τ_m}, find ℓ representative orders T_ℓ
- Assign each of the input orders to one of the representatives (the closest)
- Associate with each representative σ a set of contexts Y_σ
[Problem 3] (Querying): Provide top-k results for the query Q
- respecting the representative orders and
- weighting that respect according to the similarity between the query and the contexts

Problem 1: The Ordering problem
For a given context X and a set of preferences P_X over the tuples D = {t_1, …, t_n}, find an ordering τ of D such that the total weight of the preferences that τ agrees with (Agree) is maximized.
Example: preference graph on t1, t2, t3 with edge weights 1/2, 2/3, 1/3 and 1. For the ordering t2, t1, t3:
Agree = 1 + 1/2 + 2/3 = 13/6
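The agreement score in this example can be checked with a short Python sketch. The edge directions in `EDGES` are an assumed reading of the figure (they are consistent with Agree = 13/6 for t2, t1, t3), and `agree` and `best_order` are illustrative names. The exhaustive search is only practical for a handful of tuples, which is in line with the problem being hard in general.

```python
from fractions import Fraction
from itertools import permutations

# Assumed reading of the example graph: (u, v) means "u is preferred to v".
EDGES = {
    ("t2", "t1"): Fraction(1, 2),
    ("t1", "t3"): Fraction(2, 3),
    ("t3", "t1"): Fraction(1, 3),
    ("t2", "t3"): Fraction(1),
}


def agree(order, edges):
    """Sum the weights of the edges whose preferred tuple comes first in `order`."""
    pos = {t: i for i, t in enumerate(order)}
    return sum(w for (u, v), w in edges.items() if pos[u] < pos[v])


def best_order(items, edges):
    """Try every permutation and keep the one with the highest agreement."""
    return max(permutations(items), key=lambda order: agree(order, edges))


print(agree(("t2", "t1", "t3"), EDGES))  # 13/6
print(best_order(["t1", "t2", "t3"], EDGES))  # ('t2', 't1', 't3')
```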

Problem 2: The ClusterOrders problem
Given m orders T_m = {τ_1, …, τ_m}, each corresponding to a single context X_i, find ℓ representative orders T_ℓ such that cost(T_ℓ) is minimized
[formulas defining cost(T_ℓ) not preserved in the transcript]
We use the standard Spearman footrule and Kendall tau distances for comparing orderings
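For reference, the two distances named above can be written in a few lines of Python. This is a plain sketch of the textbook definitions for two orderings of the same items; the function names are illustrative, and the quadratic Kendall tau loop is chosen for clarity rather than speed.

```python
from itertools import combinations


def spearman_footrule(order_a, order_b):
    """Sum over all items of the absolute difference of their positions."""
    pos_b = {x: i for i, x in enumerate(order_b)}
    return sum(abs(i - pos_b[x]) for i, x in enumerate(order_a))


def kendall_tau(order_a, order_b):
    """Number of item pairs that the two orderings rank in opposite order."""
    pos_b = {x: i for i, x in enumerate(order_b)}
    return sum(1 for x, y in combinations(order_a, 2) if pos_b[x] > pos_b[y])


print(spearman_footrule("abcdef", "fedcba"))  # 18
print(kendall_tau("abcdef", "fedcba"))  # 15
```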

The ClusterOrders problem: Example
a b c d e f
a b c d e f
a b c d e f
f e d c b a
f e d c b a
a b c d e f
f e d c b a
Cost(τ_1) = Cost(τ_2) = 1   Cost(τ_1, τ_2) = 2+1 = 3

Problem 3: The Querying problem
Provide top-k results for query Q, respecting the representative orders and weighting each of them by the similarity between Q and its corresponding set of contexts

Constructing orders from preferences [Problem 1]
The problem is NP-hard, so heuristics are needed
PickPerm algorithm: pick a random permutation, reverse it, and keep the better of the two
Example (preference graph on t1, t2, t3 with edge weights 1/2, 2/3, 1/3 and 1):
- random permutation t2, t3, t1: A = 11/6
- reversed permutation t1, t3, t2: A = 5/6
- output: t2, t3, t1
[ Inspired by the 2-approximation algorithm for finding the maximum acyclic subgraph of a given graph ]
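A minimal Python sketch of PickPerm, under the same assumed reading of the example graph (the edge directions in `EDGES` and the helper names are illustrative, so individual agreement values need not match the figure). Every edge agrees with exactly one of a permutation and its reverse, so the better of the two always collects at least half of the total edge weight.

```python
import random
from fractions import Fraction

# Assumed reading of the example graph: (u, v) means "u is preferred to v".
EDGES = {
    ("t2", "t1"): Fraction(1, 2),
    ("t1", "t3"): Fraction(2, 3),
    ("t3", "t1"): Fraction(1, 3),
    ("t2", "t3"): Fraction(1),
}


def agree(order, edges):
    """Sum the weights of the edges whose preferred tuple comes first in `order`."""
    pos = {t: i for i, t in enumerate(order)}
    return sum(w for (u, v), w in edges.items() if pos[u] < pos[v])


def pick_perm(items, edges, rng=random):
    """Draw a random permutation and return it or its reverse, whichever agrees more."""
    perm = list(items)
    rng.shuffle(perm)
    reverse = perm[::-1]
    return max(perm, reverse, key=lambda order: agree(order, edges))


order = pick_perm(["t1", "t2", "t3"], EDGES, random.Random(0))
print(order, agree(order, EDGES))
```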

Greedy algorithm [CSS’97]
At the i-th iteration, pick the i-th element of the output permutation
At each iteration, pick the tuple t with the highest s_val(t) = OutDegree(t) - InDegree(t) in the remaining preference graph
Example (preference graph on t1, t2, t3 with edge weights 1/2, 2/3, 1/3 and 1; s_val annotations shown in the figure: 1, -4/3, 1/3, -1/3): the output permutation is built up as t2; then t2, t1; then t2, t1, t3
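The greedy rule can be sketched in Python as below. The degrees are taken to be weighted (sums of edge weights), and the edge directions in `EDGES` are again an assumed reading of the figure; both are assumptions, as are the function names.

```python
from fractions import Fraction

# Assumed reading of the example graph: (u, v) means "u is preferred to v".
EDGES = {
    ("t2", "t1"): Fraction(1, 2),
    ("t1", "t3"): Fraction(2, 3),
    ("t3", "t1"): Fraction(1, 3),
    ("t2", "t3"): Fraction(1),
}


def s_val(t, alive, edges):
    """Weighted out-degree minus weighted in-degree of t among the tuples in `alive`."""
    out_w = sum(w for (u, v), w in edges.items() if u == t and v in alive)
    in_w = sum(w for (u, v), w in edges.items() if v == t and u in alive)
    return out_w - in_w


def greedy_order(items, edges):
    """Repeatedly move the remaining tuple with the highest s_val to the output."""
    remaining = list(items)
    output = []
    while remaining:
        alive = set(remaining)
        best = max(remaining, key=lambda t: s_val(t, alive, edges))
        output.append(best)
        remaining.remove(best)
    return output


print(greedy_order(["t1", "t2", "t3"], EDGES))  # ['t2', 't1', 't3']
```

With these assumed edges, once t2 has been removed the remaining values are s_val(t1) = 1/3 and s_val(t3) = -1/3.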

MC algorithm
- Reverse the directions of the edges on the preference graph
- Run a random walk (with random restarts) on the reversed graph
- Rank according to the stationary distribution
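The three steps above can be sketched with a simple power iteration in Python. Several details are assumptions not stated on the slide: the restart probability of 0.15, uniform restarts, transition probabilities proportional to edge weights, and uniform jumps from tuples without outgoing edges. The edge directions in `EDGES` are an assumed reading of the example figure.

```python
# Assumed reading of the example graph: (u, v) means "u is preferred to v".
EDGES = {
    ("t2", "t1"): 1 / 2,
    ("t1", "t3"): 2 / 3,
    ("t3", "t1"): 1 / 3,
    ("t2", "t3"): 1.0,
}


def mc_order(items, edges, restart=0.15, iterations=200):
    """Rank tuples by the stationary distribution of a walk on the reversed graph."""
    items = list(items)
    n = len(items)
    # Reverse every edge: the walk moves from a less preferred tuple to a preferred one.
    out = {t: {} for t in items}
    for (u, v), w in edges.items():
        out[v][u] = out[v].get(u, 0.0) + w

    prob = {t: 1.0 / n for t in items}
    for _ in range(iterations):
        nxt = {t: 0.0 for t in items}
        for t in items:
            total = sum(out[t].values())
            for s in items:
                if total == 0:
                    step = 1.0 / n  # no outgoing edge: jump uniformly
                else:
                    follow = out[t].get(s, 0.0) / total
                    step = restart / n + (1 - restart) * follow
                nxt[s] += prob[t] * step
        prob = nxt
    ranking = sorted(items, key=lambda t: -prob[t])
    return ranking, prob


ranking, prob = mc_order(["t1", "t2", "t3"], EDGES)
print(ranking[0], round(prob[ranking[0]], 3))
```

Tuples that many others point to after the reversal, that is, tuples preferred over many others, accumulate the most probability mass and are ranked first.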

Performance
Data generation
- Fix an order on the tuples
- Generate preferences that respect this order
- p_c: the probability that a preference is generated between a pair of tuples
Observations
- For small p_c values more orders are compatible, so all algorithms are good
- For large p_c values MC and Greedy find the optimal order

Reducing the number of orders [Problem 2]
Finding ℓ representative orders is NP-hard
Restricting the ℓ representatives to the input orders gives a good approximation, but is still hard
Heuristics are needed
Greedy algorithm
- Always pick the order (from the input) that introduces the minimum cost
Furthest algorithm
- Start by picking a random order τ and add it to the output set of orders T_ℓ
- For ℓ-1 iterations, pick the order that is furthest away from the orders already in T_ℓ
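A minimal Python sketch of the Furthest heuristic. It assumes that the distance of an order to the set T_ℓ is the distance to its closest member, uses Kendall tau as the distance, and takes a fixed `start` index instead of a random first pick so that the run is reproducible; all names are illustrative.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Number of item pairs that the two orderings rank in opposite order."""
    pos_b = {x: i for i, x in enumerate(order_b)}
    return sum(1 for x, y in combinations(order_a, 2) if pos_b[x] > pos_b[y])


def furthest(orders, ell, dist, start=0):
    """Pick `ell` representatives: a first order, then repeatedly the furthest one."""
    chosen = [orders[start]]
    while len(chosen) < ell:
        # Assumed: distance to the chosen set = distance to its closest member.
        nxt = max(orders, key=lambda o: min(dist(o, c) for c in chosen))
        chosen.append(nxt)
    return chosen


orders = ["abcdef", "abcdef", "abcdef", "fedcba", "fedcba"]
print(furthest(orders, 2, kendall_tau))  # ['abcdef', 'fedcba']
```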

Refine the representative orders
Given the set of representative orders T_ℓ, assign each input order τ ∈ T_m to its closest representative in T_ℓ (this partitions T_m into ℓ partitions)*
- Discrete refinement: for each partition, pick the best representative from among the orders in the partition
- Continuous refinement ([DKNS’01]): for each partition, find the best representative, which need not be one of the input orders
*Notice the resemblance between this problem and the Catalog Segmentation problem of [KPR’04]
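The discrete refinement step might look as follows in Python. The sketch assumes that the "best" representative of a partition is the member with the smallest sum of Kendall tau distances to the other members; the slide does not spell out this criterion, and the function names are illustrative.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Number of item pairs that the two orderings rank in opposite order."""
    pos_b = {x: i for i, x in enumerate(order_b)}
    return sum(1 for x, y in combinations(order_a, 2) if pos_b[x] > pos_b[y])


def assign(orders, reps, dist):
    """Partition the input orders by their closest representative."""
    parts = {i: [] for i in range(len(reps))}
    for o in orders:
        closest = min(range(len(reps)), key=lambda i: dist(o, reps[i]))
        parts[closest].append(o)
    return parts


def discrete_refine(orders, reps, dist):
    """Replace each representative by the best member of its partition."""
    refined = []
    for i, members in assign(orders, reps, dist).items():
        if not members:
            refined.append(reps[i])  # keep a representative nobody was assigned to
            continue
        # Assumed criterion: smallest total distance to the partition's members.
        refined.append(min(members, key=lambda c: sum(dist(c, o) for o in members)))
    return refined


orders = ["abcdef", "abcdef", "abcdfe", "fedcba", "fedcba"]
print(discrete_refine(orders, ["abcdfe", "fedcba"], kendall_tau))  # ['abcdef', 'fedcba']
```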

Performance
Data generation
- Fix ℓ underlying orders T
- Generate other orders from T by picking an order in T and adding noise (swaps)
- Compute the cost of the solution with respect to the ground truth
Observations
- Without refinements: Greedy performs steadily better than Furthest
- With refinements: both algorithms are equally good
- The groupings are equivalent

Problem 3: The Querying problem
Use a variation of the TA algorithms [FLN’02, FKS’03]
- Assume k = 2 and a query Q such that sim(Q, Y_1) = 0.5, sim(Q, Y_2) = 0.3, sim(Q, Y_3) = 0.1

Y_1, T_1    Y_2, T_2    Y_3, T_3
t1  5       t2  5       t4  5
t2  4       t3  4       t3  4
t3  3       t1  3       t1  3
t4  2       t4  2       t5  2
t5  1       t5  1       t2  1

1. At each sequential access:
   a. Set the threshold TH to be the aggregate of the scores seen in this access (first access: TH = 0.5*5 + 0.3*5 + 0.1*5 = 4.5)
   b. Do random accesses and compute the score of the objects seen (t1: 3.7, t2: 3.6, t4: …)
   c. Maintain a list of the top-k objects seen so far (t1: 3.7, t2: 3.6)
   d. When the scores of the top-k are greater than or equal to the threshold, stop (second access: TH = 0.5*4 + 0.3*4 + 0.1*4 = 3.6)
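The walkthrough can be replayed with a small Python sketch of a threshold-style top-k loop over the three example lists. It is a sketch in the spirit of TA, not necessarily the exact variation used by the authors: it assumes that all lists have the same length and rank the same tuples, and it uses exact fractions for the similarities so that the stopping test at 3.6 is not affected by floating-point rounding. The name `threshold_top_k` is illustrative.

```python
from fractions import Fraction

# One list per representative order: (tuple id, score), sorted by descending score.
LISTS = [
    [("t1", 5), ("t2", 4), ("t3", 3), ("t4", 2), ("t5", 1)],
    [("t2", 5), ("t3", 4), ("t1", 3), ("t4", 2), ("t5", 1)],
    [("t4", 5), ("t3", 4), ("t1", 3), ("t5", 2), ("t2", 1)],
]
# sim(Q, Y_1), sim(Q, Y_2), sim(Q, Y_3) from the example.
WEIGHTS = [Fraction(5, 10), Fraction(3, 10), Fraction(1, 10)]


def threshold_top_k(lists, weights, k):
    """Return (top-k list, number of sequential accesses per list that were needed)."""
    lookup = [dict(lst) for lst in lists]  # random access: tuple id -> score
    scores = {}
    top = []
    depth = 0
    for depth in range(len(lists[0])):
        row = [lst[depth] for lst in lists]
        # (a) threshold: aggregate of the scores seen in this access
        threshold = sum(w * s for w, (_, s) in zip(weights, row))
        # (b) random accesses for every tuple seen for the first time
        for tid, _ in row:
            if tid not in scores:
                scores[tid] = sum(w * table[tid] for w, table in zip(weights, lookup))
        # (c) top-k seen so far
        top = sorted(scores.items(), key=lambda kv: -kv[1])[:k]
        # (d) stop once the k-th best score reaches the threshold
        if len(top) == k and top[-1][1] >= threshold:
            break
    return top, depth + 1


top, accesses = threshold_top_k(LISTS, WEIGHTS, 2)
print([(tid, float(score)) for tid, score in top], accesses)  # [('t1', 3.7), ('t2', 3.6)] 2
```

The loop stops after the second sequential access, when the second-best score (3.6) equals the threshold, without reading the remaining three rows of each list.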

Accuracy of top-k results
IMDB dataset
- Automatically generate preferences via association-rule mining: ‘A1=a’ > ‘A1=b’ | X if conf(X → a) > conf(X → b)
- Sol_k: top-k results obtained after clustering
- G_k: top-k results without clustering
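The confidence-based rule for generating preferences can be illustrated in Python on the small movie table from the earlier slides rather than on IMDB. Here conf(X → a) is computed as the fraction of rows matching the context X that also carry the value a, which is the usual definition of rule confidence; the names `ROWS`, `confidence` and `prefers` are hypothetical.

```python
from fractions import Fraction

# Genre and language columns of the example movie table.
ROWS = [
    {"G": "Drama", "L": "English"},
    {"G": "Drama", "L": "English"},
    {"G": "Sci-Fi", "L": "English"},
    {"G": "Comedy", "L": "Spanish"},
    {"G": "Comedy", "L": "English"},
]


def confidence(rows, context, attr, value):
    """conf(context -> attr=value): share of context-matching rows with attr=value."""
    matching = [r for r in rows if all(r[a] == v for a, v in context.items())]
    if not matching:
        return Fraction(0)
    return Fraction(sum(1 for r in matching if r[attr] == value), len(matching))


def prefers(rows, context, attr, a, b):
    """True if 'attr=a' > 'attr=b' | context holds under the confidence rule."""
    return confidence(rows, context, attr, a) > confidence(rows, context, attr, b)


print(confidence(ROWS, {"L": "English"}, "G", "Drama"))  # 1/2
print(prefers(ROWS, {"L": "English"}, "G", "Drama", "Sci-Fi"))  # True
```

On this toy table the rule yields G=Drama > G=Sci-Fi | L=English, while Spanish and English tie in the context of comedies, so no preference is generated there.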

Recap
- Notion of contextual preferences
- Use of contextual preferences to order database results
- Use of association rules to obtain contextual preferences
- Experimental validation of the effectiveness of the proposed techniques using both synthetic and real data

Conclusions and future work
- The framework of contextual preferences is both intuitive and practical
- The framework is easily extended to accommodate top-k lists and bucket orders
- Scalability of the algorithms needs further investigation

Questions?