Top-k Query Evaluation with Probabilistic Guarantees. Martin Theobald, Gerhard Weikum, Ralf Schenkel. Presenter: Avinandan Sengupta.

Presentation transcript:

Top-k Query Evaluation with Probabilistic Guarantees
Martin Theobald, Gerhard Weikum, Ralf Schenkel
Presenter: Avinandan Sengupta

Presentation Outline
– Introduction to Top-k query processing
– The threshold algorithm and its variants
– Are we solving the right problem?
– A probabilistic algorithm
– Implementation Details
– Results
– Conclusion

Data and a Query
Hypothetical DB of NASDAQ-traded stocks; data collated from Google Finance.
Attributes (columns): Scrip ID, Earnings Per Share, P/E ratio, β, ..., Average Market Cap (B$)
Objects (rows): SNPS, IBM, ..., INFY, MSFT, GOOG
Query: Top 10 midcap stocks with low β

Hypothetical Graded Lists (made fit for consumption by Top-k processors)

P/E Ratio (norm)    β^-1 (norm)    Average Market Cap (B$)
INFY: 1             SNPS: 1        SNPS: 1
GOOG: 0.99          IBM: 0.96      INFY:
SNPS: 0.90          MSFT: 0.67     GOOG: 0.05
IBM:                GOOG:          IBM: 0.07
MSFT: 0.47          INFY: 0.59     MSFT: 0.08

Normalization:
– P/E Ratio: PE_j / highest PE
– β^-1: β^-1_j / max(β^-1_j)
– Average Market Cap: grades based on how close the market cap is to the midcap median (midcap median ≅ 4.5B); normalized
Aggregate function (with weights): f = 0.5*P/E + 1.0*β^-1 + ...*MCap

Top-k Processor
Input: the three graded lists, P/E Ratio (norm), β^-1 (norm) and Average Market Cap (B$)
Output (Top-k results), the Top-k List:
SNPS, X
INFY, Y
...
GOOG, Z

Fagin’s Threshold Algorithm (TA)
1. Access the n lists in parallel (sorted access).
2. As an object o_i is seen, perform random accesses to the other lists to find the complete score of o_i. Do the same for all objects in the current row.
3. Compute the threshold τ by applying the aggregate function to the scores in the current row (their sum, when the aggregate function is summation).
4. Stop as soon as k objects have been found with a score of at least τ.
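The TA steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the presenter's code: the function name `threshold_algorithm`, the toy lists `l1` and `l2`, and the assumption that every object appears in every list are all made up for the example, and summation is assumed as the aggregate function.

```python
import heapq

def threshold_algorithm(lists, k, agg=sum):
    """Sketch of TA. `lists` holds one list per attribute, each a list of
    (object_id, score) pairs sorted by score in descending order.
    Assumption: every object appears in every list."""
    # Dictionaries stand in for random access: lookup[i][obj] is obj's score in list i.
    lookup = [dict(lst) for lst in lists]
    scores = {}  # complete score of every object seen so far
    depth = min(len(lst) for lst in lists)
    for row in range(depth):
        current_row = []
        for lst in lists:
            obj, s = lst[row]          # sorted access
            current_row.append(s)
            if obj not in scores:      # random accesses to the other lists
                scores[obj] = agg(lk.get(obj, 0.0) for lk in lookup)
        tau = agg(current_row)         # threshold from the current row
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= tau:
            return top, row + 1        # k objects score at least tau: stop
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]), depth

# Hypothetical toy data: two graded lists over four objects.
l1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
l2 = [("b", 0.9), ("a", 0.7), ("d", 0.4), ("c", 0.2)]
top, rows_read = threshold_algorithm([l1, l2], k=2)
print([obj for obj, _ in top], rows_read)
```

On this toy input the scan stops after the second row, because by then the two best complete scores are no lower than the threshold computed from that row.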

TA with No Random Access (TA-NRA)
Access the n lists in parallel. For an item a, compute its (B)est score
B_a = f( f{score_j | j ∈ seen-attributes(a)}, f{high_k | k ∉ seen-attributes(a)} )
where high_k is the last seen score for the k-th attribute, and its (W)orst score
W_a = f( f{score_j | j ∈ seen-attributes(a)}, f{0 | k ∉ seen-attributes(a)} )
Halt when k distinct objects have been seen and there is no object m outside the Top-k list with B_m > W_k, where W_k is the worst score of the k-th object in the list. This means that we also maintain a table of all seen objects with their W/B scores.
Running Top-k list (contains the k objects with the largest W values; ties broken by B values):
SNPS, W_1, B_1
INFY, W_2, B_2
...
GOOG, W_k, B_k
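The best-score/worst-score bookkeeping of TA-NRA can be sketched as follows. This is a hedged illustration rather than the original implementation: the name `nra`, the toy lists, summation as the aggregate function, non-negative scores, and equally long lists are assumptions. The sketch also checks the best score of a still unseen object (the sum of all `high` values), which the slide does not spell out.

```python
def nra(lists, k):
    """Sketch of TA-NRA with sorted access only.
    Assumptions: summation as aggregate, non-negative scores, equally long lists."""
    n = len(lists)
    seen = {}                             # obj -> {list index: score}
    high = [lst[0][1] for lst in lists]   # last seen score per list
    depth = min(len(lst) for lst in lists)
    ranked, worst, best = [], {}, {}
    for row in range(depth):
        for i, lst in enumerate(lists):
            obj, s = lst[row]
            seen.setdefault(obj, {})[i] = s
            high[i] = s
        # Worst score: unseen attributes count as 0.
        worst = {o: sum(p.values()) for o, p in seen.items()}
        # Best score: unseen attributes count as the last seen score of their list.
        best = {o: worst[o] + sum(high[i] for i in range(n) if i not in p)
                for o, p in seen.items()}
        # Running list: largest W first, ties broken by B.
        ranked = sorted(seen, key=lambda o: (worst[o], best[o]), reverse=True)
        if len(ranked) >= k:
            w_k = worst[ranked[k - 1]]
            no_seen_rival = all(best[o] <= w_k for o in ranked[k:])
            no_unseen_rival = sum(high) <= w_k
            if no_seen_rival and no_unseen_rival:
                return [(o, worst[o], best[o]) for o in ranked[:k]], row + 1
    return [(o, worst[o], best[o]) for o in ranked[:k]], depth

# Hypothetical toy data: two graded lists over four objects.
l1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
l2 = [("b", 0.9), ("a", 0.7), ("d", 0.4), ("c", 0.2)]
result, rows_read = nra([l1, l2], k=2)
print([o for o, _, _ in result], rows_read)
```

Note that the table of W/B values is recomputed for every seen object in each round, which is the bookkeeping overhead the slide mentions.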

Issues with TA and TA-NRA
– High space-time costs
– Overly conservative

Are we solving the right problem?
Is random access possible in most common scenarios?
– Web content
– XML data, hierarchical data sets
Does the user need an exact top-k query result?
– Or is she satisfied with an approximation?

How about an approximate solution?
Can we remove candidates (objects that we think can make it to the top-k list) from consideration early on in the process?
– This would let us reach a solution quickly

Pictorially... (figure; source: author’s webpage)

Probabilistic TA-NRA - 1
Predict the total score of an item for which a partial score is known.
Avoid the overly conservative best-score/worst-score bounds of the original TA-NRA.
– Instead, calculate the probability that the total score of the item exceeds a threshold (making the item interesting for the top-k result)

Probabilistic TA-NRA - 2
If this probability is sufficiently low (below a threshold), drop the item from the candidate list.
The probabilistic prediction involves computing the convolution of the score distributions of the different index lists.
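A minimal sketch of this pruning test, assuming the score distribution of each index list is available as a discretized histogram with a shared bucket width. All names (`convolve`, `prob_exceeds`, `should_prune`, `min_k`, `epsilon`) are hypothetical, the bucket-to-score mapping is a simplification, and the sketch ignores the fact that the remaining scores in a list cannot exceed its last seen score.

```python
def convolve(p, q):
    """Distribution of the sum of two independent discretized scores.
    p[i] is the probability mass at score i * width (same width for p and q)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def prob_exceeds(histograms, delta, width):
    """Estimate P[sum of the unseen scores > delta] by convolving the
    per-list histograms of the lists in which the item is still unseen."""
    dist = [1.0]                      # distribution of an empty sum
    for h in histograms:
        dist = convolve(dist, h)
    return sum(mass for i, mass in enumerate(dist) if i * width > delta)

def should_prune(worst, unseen_histograms, min_k, epsilon, width):
    """Drop the candidate if it is unlikely to beat the current k-th score.
    worst: partial (worst) score of the candidate
    min_k: worst score of the k-th object in the running top-k list
    epsilon: probability threshold below which the candidate is dropped"""
    delta = min_k - worst             # what the unseen lists must still contribute
    return prob_exceeds(unseen_histograms, delta, width) < epsilon

# Hypothetical example: two unseen lists, each with score 0 or 0.5 equally likely.
hists = [[0.5, 0.5], [0.5, 0.5]]
print(convolve(hists[0], hists[1]))
print(should_prune(worst=0.2, unseen_histograms=hists, min_k=0.95, epsilon=0.3, width=0.5))
```

In the example the candidate needs 0.75 more from the two unseen lists, which only happens when both contribute 0.5, so the estimated probability is 0.25 and the candidate is dropped for a threshold of 0.3.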

Score Distribution of Lists - How?
Example list, β^-1 (norm): SNPS: 1, IBM: 0.96, MSFT: 0.67, GOOG: (value missing), INFY: 0.59; median score 0.65.
(Figure: the scores of a list are turned into a pdf by parameter fitting or curve fitting.)
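One simple way to approximate the score distribution of a list is an equi-width histogram. The sketch below swaps that in for the parameter or curve fitting shown on the slide; the function name `score_histogram`, the bucket count, and the sample scores are made up for illustration.

```python
def score_histogram(scores, buckets=10, max_score=1.0):
    """Equi-width histogram over [0, max_score], returned as probability masses.
    This is a stand-in for fitting a parametric pdf to the list's scores."""
    if not scores:
        raise ValueError("need at least one score")
    counts = [0] * buckets
    for s in scores:
        # Scores equal to max_score fall into the last bucket.
        i = min(int(s / max_score * buckets), buckets - 1)
        counts[i] += 1
    return [c / len(scores) for c in counts]

# Hypothetical scores of one index list.
print(score_histogram([0.0, 0.25, 0.5, 1.0], buckets=4))
```

Such a histogram would be computed offline, once per list, and then reused at query time.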

What it is and What it is not
Probabilistic guarantees are not about query run-times but about query result quality.
Probabilistic guarantees refer to the approximation of the top-k ranks in a completely scored and exactly ranked result set.

The Math (figure; source: author’s webpage; annotated label: set of seen attributes for an object)

More Math... (figure; source: author’s webpage)

What distributions to consider?
Uniform distribution
– Simplest assumptions
– Convolutions based on moment-generating functions with generalized Chernoff-Hoeffding bounds
Poisson estimations
– Efficiently evaluated; provide a reasonable fit for tf*idf-based score distributions for Web corpora
Histograms
– When the above methods fail
– Involve non-trivial computation (done offline per list)
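For the uniform case, a tail bound can stand in for the explicit convolution. The sketch below uses the standard Hoeffding inequality, which is a simpler relative of the generalized Chernoff-Hoeffding bounds named above and not necessarily the exact bound used in the paper. The function name `hoeffding_tail` and the assumption that each unseen score is uniform on [0, high_i] and independent of the others are illustrative.

```python
import math

def hoeffding_tail(highs, delta):
    """Upper bound on P[S_1 + ... + S_m >= delta] for independent S_i,
    each assumed uniform on [0, highs[i]].
    Hoeffding: P[S - E[S] >= t] <= exp(-2 t^2 / sum(high_i^2))."""
    mean = sum(h / 2.0 for h in highs)   # E[S] for uniform scores
    t = delta - mean
    if t <= 0:
        return 1.0                       # the bound says nothing at or below the mean
    return math.exp(-2.0 * t * t / sum(h * h for h in highs))

# Hypothetical example: two unseen lists whose last seen scores are both 1.0.
print(hoeffding_tail([1.0, 1.0], delta=2.0))
```

Because this is an upper bound, pruning on it errs on the side of keeping candidates.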

Solving Convolutions? Difficult
When the PDF is a uniform distribution, solving the convolution becomes difficult.
– Use techniques other than convolution
– Off-load computation to available probabilistic engines (OpenMaple, etc.)

Queue Management (figure; source: author’s webpage)

Results (figure; source: author’s webpage)

Performance as a function of ε (figure; source: paper)

Precision of probabilistic predictors for tf*idf-, Uniform-, and Zipf-distributed scores (figure; source: paper)

Conclusion
New algorithms were developed based on probabilistic score predictions.
– They trade off a small amount of top-k result quality for a drastic reduction of sorted accesses
Intelligent management of priority queues for an efficient implementation was presented.
The aggregation function was assumed to be summation.
Future work is to be based on ranked retrieval of XML data and integration into the XXL search engine.

Thanks!