Minimal Probing: Supporting Expensive Predicates for Top-k Queries
Kevin C. Chang, Seung-won Hwang
Univ. of Illinois at Urbana-Champaign

Presentation transcript:


Context: Top-k Queries
Ranked queries return top-k results, unlike Boolean queries.
Crucial for retrieving data by “soft” conditions:
– relevance: e.g., text search engines
– similarity: e.g., multimedia databases
– preference: e.g., e-commerce product search
Example scenario: preference query for finding a house:
– select h.id from house h where new(age), cheap(price, size), large(size) order by min(new,cheap,large) stop after 5
– callouts on the query: predicate; scoring function; k: retrieval size
Observation: Crucial to support expensive predicates

Problem: Expensive Predicates
Expensive predicates
– no pre-computed indexes for zero-time sorted-access
– need a probe to evaluate each object (similar to sequential scan)
Unified abstraction for:
– user-defined functions: functional extensibility
  query conditions can be arbitrary, user-specific
  e.g., cheap(price,size)
– external predicates: data extensibility
  source interface may require one probe per object
  e.g., safe(zip) accesses crime rate from apbnews.com
– fuzzy joins
  associations of relations can be arbitrary
  e.g., close(house.zip, park.zip)

Current Limitations: “Sort-Merge” Framework
Require sorted access of search predicates.
To “simulate” sorted access, require complete probing. Are these probes necessary?
Goal: Minimize probe cost
Figure: Sort step, Merge step, Top-k output, with F = min(new,cheap,large) and k = 1
– new (search predicate): a:0.90, b:0.80, c:0.70, d:0.60, e:0.50
– cheap (expensive predicate): d:0.90, a:0.85, b:0.78, c:0.75, e:0.70
– large (expensive predicate): b:0.90, d:0.90, e:0.80, a:0.75, c:0.20
– Merge Algorithm output: b:0.78
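A minimal sketch of the complete-probing baseline this slide criticizes, using the scores from the figure. The function name `complete_probing_top_k` is hypothetical, and which sorted list belongs to `cheap` versus `large` is an assumption read off the later probe values (pr(a,p1)=0.85, pr(a,p2)=0.75). Because F = min is symmetric, that assignment does not change the result.

```python
import heapq

# Scores from the slide's figure. `new` is the search predicate (sorted access);
# `cheap` and `large` are expensive predicates that need one probe per object.
NEW = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}
CHEAP = {"d": 0.90, "a": 0.85, "b": 0.78, "c": 0.75, "e": 0.70}
LARGE = {"b": 0.90, "d": 0.90, "e": 0.80, "a": 0.75, "c": 0.20}


def complete_probing_top_k(k):
    """Probe every expensive predicate on every object, then rank by F = min."""
    probes = 0
    scores = {}
    for obj, new_score in NEW.items():
        cheap_score = CHEAP[obj]  # stands in for one expensive probe
        large_score = LARGE[obj]  # stands in for one expensive probe
        probes += 2
        scores[obj] = min(new_score, cheap_score, large_score)
    top = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return top, probes


top, probes = complete_probing_top_k(1)
print(top, probes)
```

With five objects and two expensive predicates, this baseline always pays ten probes, regardless of k.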

Motivation: Solution Space
Assume sequential probing.
Algorithm skeleton:
  do:
    schedule next obj o, pred p
    probe pr(o,p)
  until (top-k identified)
Figure: grid of objects (a, b, c) against predicates (p1, p2, p3)

Our framework: Separate, Global Predicate Scheduling
Two important decisions on framework:
Separate predicate scheduling
– scheduling as separate “optimization” phase before probing
– avoid run-time scheduling overhead
Global predicate scheduling
– scheduling based on global info (predicate selectivities)
– lack of per-object information to justify per-object scheduling
– avoid per-object scheduling overhead
Simple framework and algorithm, and efficient!
– allow essentially A* framework, for given predicate schedule
– enable formal analysis: optimality, scalability

Simple Framework
Separate, global predicate scheduling
Algorithm skeleton:
  find global schedule H
  do:
    schedule next obj o
    probe pr(o, next(o,H))
  until (top-k identified)
Figure: grid of objects (a, b, c) against predicates (p1, p2, p3), with H=(p1,p2,p3)

Challenges for Minimizing Probing
Predicate scheduling before probing
– how to identify the best H?
Object scheduling during probing
– how to find next object to probe, for achieving “minimal probing” with respect to H?
Algorithm skeleton:
  find global schedule H        ?
  do:
    schedule next obj o         ?
    probe pr(o, next(o,H))
  until (top-k identified)

Challenge 1: Object Scheduling
Goal: Perform only necessary probes
Necessary probes:
– A probe is necessary if top-k answers cannot be determined by any algorithm without it, regardless of the outcomes of other probes.
Question 1: Given a probe pr(o, next(o,H)), how to determine if it is necessary?
Probe-optimal algorithm
– An algorithm is probe-optimal if it performs only the necessary probes.
Question 2: How to identify necessary probes in order to design such an algorithm?

Question 1: Is this Probe Necessary?
k=1, F=min(x,p1,p2); suppose H=(p1,p2)
OID   x     p1    p2    F=min(x,p1,p2)
a     0.9
b     0.8
c     0.7
d     0.6
e     0.5
Figure callouts: ? ? ? ? ?; top 1; Maybe Not!; 0.8

Question 1: Is this Probe Necessary?
k=1, F=min(x,p1,p2); suppose H=(p1,p2)
OID   x     p1    p2    F=min(x,p1,p2)
a     0.9
b     0.8
c     0.7
d     0.6
e     0.5
Figure callouts: ?; top 1?; Necessary!; 1; 1; 0.8
Theorem: Probe pr(o,p) is absolutely necessary if o is among the current top-k in terms of ceiling scores.
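The theorem's test can be sketched as follows, assuming (as the slides' ceiling values suggest, though they do not define it) that a ceiling score is F with every unprobed predicate set to its maximum of 1.0. The helper names `ceiling_score` and `is_necessary` are hypothetical, and ties between ceiling scores are broken arbitrarily by sort order.

```python
def ceiling_score(known_scores):
    """Upper bound on F = min(...): unprobed predicates are assumed to score 1.0."""
    return min(list(known_scores) + [1.0])


def is_necessary(obj, known, k, num_predicates):
    """Apply the slide's sufficient condition for a necessary probe.

    `known` maps each object to the list of scores evaluated so far.
    Returns True when `obj` still has an unprobed predicate and ranks among
    the current top-k by ceiling score. False only means the condition does
    not hold at this point.
    """
    if len(known[obj]) == num_predicates:
        return False  # fully evaluated, nothing left to probe
    ranked = sorted(known, key=lambda o: ceiling_score(known[o]), reverse=True)
    return obj in ranked[:k]


# Only the search predicate x is known at the start (values from the slide).
known = {"a": [0.9], "b": [0.8], "c": [0.7], "d": [0.6], "e": [0.5]}
print(is_necessary("a", known, k=1, num_predicates=3))
print(is_necessary("b", known, k=1, num_predicates=3))
```

At the start only object a passes the test for k=1. Object b passes it once a's ceiling has dropped below 0.8.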

Question 2: Probe-optimal object scheduling
Objects in current top-k must be further probed
Probe-optimal object scheduling: Algorithm MPro
– use a priority queue with ceiling scores as priorities
Queue trace (k = 1):
  initial            a:0.9   b:0.8   c:0.7  d:0.6  e:0.5
  pr(a,p1) = 0.85    a:0.85  b:0.8   c:0.7  d:0.6  e:0.5
  pr(a,p2) = 0.75    b:0.8   a:0.75  c:0.7  d:0.6  e:0.5
  pr(b,p1) = 0.78    b:0.78  a:0.75  c:0.7  d:0.6  e:0.5
  pr(b,p2) = 0.90    b:0.78 (top 1)  a:0.75  c:0.7  d:0.6  e:0.5
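The queue trace on this slide can be reproduced with a short sketch. This illustrates the priority-queue idea under stated assumptions and is not the authors' implementation: F is fixed to min, unprobed predicates are bounded by 1.0, the schedule H is the order of the `predicates` list, and the function name `mpro` and its data layout are hypothetical.

```python
import heapq

# Scores from the slides: x is the search predicate, p1 and p2 are expensive.
X = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}
P1 = {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70}
P2 = {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80}


def mpro(search_scores, predicates, k):
    """Return (top-k results, probes performed) for F = min over all predicates.

    `predicates` is a list of (name, scores) pairs in schedule order H.
    Heap entries are (-ceiling, object, index of the next predicate to probe).
    """
    heap = [(-score, obj, 0) for obj, score in search_scores.items()]
    heapq.heapify(heap)
    results, probes = [], []
    while heap and len(results) < k:
        neg_ceiling, obj, next_pred = heapq.heappop(heap)
        if next_pred == len(predicates):
            # Fully probed and still on top: its score is final.
            results.append((obj, -neg_ceiling))
            continue
        name, scores = predicates[next_pred]
        probes.append((obj, name))  # the only place a probe is paid for
        new_ceiling = min(-neg_ceiling, scores[obj])
        heapq.heappush(heap, (-new_ceiling, obj, next_pred + 1))
    return results, probes


results, probes = mpro(X, [("p1", P1), ("p2", P2)], k=1)
print(results)
print(probes)
```

On this data the sketch performs the same four probes as the trace, pr(a,p1), pr(a,p2), pr(b,p1), pr(b,p2), where probing both predicates on all five objects would take ten.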

Challenge 2: Predicate Scheduling
Scheduling problem
– find minimal cost schedule from permutations
Challenges
– selectivity estimation: dynamic predicates; aggregate selectivities (context-dependent)
– scheduling computation: NP-hard
Our approach:
– on-line sampling to estimate selectivities
– greedy selection to schedule predicates
0.1% sampling achieves almost the best schedule
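A rough sketch of sampling-based schedule selection, under several assumptions the slides do not spell out: every probe has unit cost, k is scaled down by the sample ratio, and a schedule's cost is estimated by counting the probes an MPro-style run would perform on the sample. For brevity the sketch enumerates all permutations rather than using the greedy selection the slide names, so it only suits a handful of predicates. The names `count_probes` and `pick_schedule` are hypothetical.

```python
import heapq
import itertools
import random


def count_probes(rows, schedule, k):
    """Count probes of a priority-queue run with F = min over x and `schedule`."""
    heap = [(-row["x"], i, 0) for i, row in enumerate(rows)]
    heapq.heapify(heap)
    done = probes = 0
    while heap and done < k:
        neg_ceiling, i, nxt = heapq.heappop(heap)
        if nxt == len(schedule):
            done += 1  # fully probed object confirmed in the top-k
            continue
        probes += 1
        ceiling = min(-neg_ceiling, rows[i][schedule[nxt]])
        heapq.heappush(heap, (-ceiling, i, nxt + 1))
    return probes


def pick_schedule(rows, predicates, k, sample_ratio, seed=0):
    """Pick the permutation with the fewest estimated probes on a random sample."""
    rng = random.Random(seed)
    size = max(1, round(len(rows) * sample_ratio))
    sample = rng.sample(rows, size)
    sample_k = max(1, round(k * sample_ratio))  # assumption: scale k with the sample
    return min(itertools.permutations(predicates),
               key=lambda h: count_probes(sample, h, sample_k))


# Synthetic data: p1 never lowers a score, p2 always does.
rows = [{"x": 1.0 - 0.01 * i, "p1": 1.0, "p2": 0.4 - 0.01 * i} for i in range(20)]
print(pick_schedule(rows, ["p1", "p2"], k=1, sample_ratio=0.5))
```

In the synthetic data, probing p2 first lowers each ceiling immediately, so that order needs roughly half the probes.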

Experiment Results
Practical performance of MPro
– proportional cost to the retrieval size k
– significant speedup for small k
Impact of performance factors
– database size: sublinear cost scalability
– score distribution and scoring function: see paper
Figure callouts: 6 hour; 2 min

Demo: House Search
Data: All houses on sale in Illinois (N=20990)
– from
– objects: house(id, price, size, bed, bath, zip, city)
Query: F = Average(n, c, r)
– n nearcity: close to Chicago
– c cheap: “reasonable” price for its size
– r roomy: prefer 4-6 rooms

Summary of Contributions (more in the paper)
Abstraction:
– for user-defined, external, and fuzzy join predicates
Framework and algorithm:
– sampling-based global scheduling
– probe-optimal algorithm MPro
– extensions of MPro: fuzzy joins, parallel MPro, approximation
Principles/Theorems:
– necessary-probe principle
– probe-optimality of MPro
– analytical scalability of MPro
Extensive experiments

Thank You!

Parallel MPro: Overview
Probe-parallel MPro
– Probe k necessary probes concurrently
– Up to k-fold speedup
Data-parallel MPro
– Partition data into s chunks
– Up to s-fold speedup
Figure labels: MPro; Merge; top-k
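The data-parallel variant can be pictured with a small sketch: split the objects into s chunks, compute a local top-k per chunk, then merge the partial results. Here `local_top_k` is a hypothetical stand-in that ranks already-scored objects, where the deck would run MPro on each chunk. The round-robin partitioning and the thread pool are likewise assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def local_top_k(chunk, k):
    """Stand-in for running MPro on one chunk: return its k best (obj, score) pairs."""
    return heapq.nlargest(k, chunk, key=lambda item: item[1])


def data_parallel_top_k(scored, k, s):
    """Partition into s chunks, take local top-k results concurrently, then merge."""
    chunks = [scored[i::s] for i in range(s)]  # round-robin partitioning
    with ThreadPoolExecutor(max_workers=s) as pool:
        partials = list(pool.map(lambda chunk: local_top_k(chunk, k), chunks))
    merged = [item for part in partials for item in part]
    # The global top-k is contained in the union of the local top-k lists.
    return heapq.nlargest(k, merged, key=lambda item: item[1])


scored = [("a", 0.75), ("b", 0.78), ("c", 0.2), ("d", 0.6), ("e", 0.5)]
print(data_parallel_top_k(scored, k=2, s=2))
```

The merge is correct because any object in the global top-k must also be in the top-k of its own chunk.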

Scalability
Figure: plots labeled (k=100, N=1000), (k=1000, N=10000), (k=10000, N=); axis values N=1000, N=10000, N=100000

Comparison