Depth Estimation for Ranking Query Optimization. Karl Schnaitter, UC Santa Cruz; Joshua Spiegel, BEA Systems, Inc.; Neoklis Polyzotis, UC Santa Cruz.

Presentation transcript:

Depth Estimation for Ranking Query Optimization
Karl Schnaitter, UC Santa Cruz
Joshua Spiegel, BEA Systems, Inc.
Neoklis Polyzotis, UC Santa Cruz

2 Relational Ranking Queries
SELECT h.hid, r.rid, e.eid
FROM Hotels h, Restaurants r, Events e
WHERE h.city = r.city AND r.city = e.city
RANK BY 0.3/h.price + 0.5*r.rating + 0.2*isMusic(e)
LIMIT 10
A base score in [0,1] for each table
– Hotels: $b_H(h) = 1/h.price$
– Restaurants: $b_R(r) = r.rating$
– Events: $b_E(e) = isMusic(e)$
Combined with a scoring function S
– $S(b_H, b_R, b_E) = 0.3\,b_H + 0.5\,b_R + 0.2\,b_E$
Return top k results based on S
– In this case, k = 10
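
The semantics of the example query can be sketched in a few lines of Python. This is only a naive reference evaluation (join everything, score everything, keep the k best), not a rank-aware plan; the sample rows and the dictionary field names are hypothetical and do not come from the slides.

```python
import heapq
from itertools import product

def combined_score(b_h, b_r, b_e):
    """Scoring function S from the slide: a weighted sum of base scores."""
    return 0.3 * b_h + 0.5 * b_r + 0.2 * b_e

def top_k(hotels, restaurants, events, k):
    """Naive evaluation: join on city, score every result, keep the k best."""
    scored = []
    for h, r, e in product(hotels, restaurants, events):
        if h["city"] == r["city"] == e["city"]:
            s = combined_score(1.0 / h["price"], r["rating"], float(e["is_music"]))
            scored.append((s, h["hid"], r["rid"], e["eid"]))
    return heapq.nlargest(k, scored)

# Hypothetical sample rows; prices are >= 1 so that 1/price stays in [0, 1].
hotels = [{"hid": 1, "city": "SC", "price": 2.0},
          {"hid": 2, "city": "SC", "price": 4.0}]
restaurants = [{"rid": 10, "city": "SC", "rating": 0.8},
               {"rid": 11, "city": "SF", "rating": 1.0}]
events = [{"eid": 100, "city": "SC", "is_music": True},
          {"eid": 101, "city": "SC", "is_music": False}]

best = top_k(hotels, restaurants, events, k=2)
```

Every joining combination is scored here, which is exactly the cost that a rank-aware plan tries to avoid.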

3 Ranking Query Execution
SELECT h.hid, r.rid, e.eid
FROM Hotels h, Restaurants r, Events e
WHERE h.city = r.city AND r.city = e.city
RANK BY 0.3/h.price + 0.5*r.rating + 0.2*isMusic(e)
LIMIT 10
Conventional plan: join H, R, and E; sort on S; fetch 10 results
Rank-aware plan: rank joins over H, R, and E with inputs ordered by score; fetch 10 results

4 Depth Estimation
Depth: number of accessed tuples
– Indicates execution cost
– Linked to memory consumption
The problem: Estimate depths for each operator in a rank-aware plan
[Figure: a rank join over inputs H and R, annotated with its left depth and right depth]

5 Depth Estimation Methods
Ilyas et al. (SIGMOD 2004)
– Uses probabilistic model of data
– Assumes relations of equal size and a scoring function that sums scores
– Limited applicability
Li et al. (SIGMOD 2005)
– Samples a subset of rows from each table
– Independent samples give a poor model of join results

6 Our Solution: DEEP
DEpth Estimation for Physical plans
Strengths of DEEP
– A principled methodology
  – Uses statistical model of data distribution
  – Formally computes depth over statistics
– Efficient estimation algorithms
– Widely applicable
  – Works with state-of-the-art physical plans
  – Realizable with common data synopses

7 Outline Preliminaries DEEP Framework Experimental Results

8 Outline Preliminaries DEEP Framework Experimental Results

9 Monotonic Functions
A function $f(x_1, \dots, x_n)$ is monotonic if $\forall i\,(x_i \le y_i) \Rightarrow f(x_1, \dots, x_n) \le f(y_1, \dots, y_n)$
[Figure: plot of a monotonic function f(x) against x]

10 Monotonic Functions
A function $f(x_1, \dots, x_n)$ is monotonic if $\forall i\,(x_i \le y_i) \Rightarrow f(x_1, \dots, x_n) \le f(y_1, \dots, y_n)$
Most scoring functions are monotonic
– E.g. sum, product, avg, max, min
Monotonicity enables bound on score
– In example query, score was 0.3/h.price + 0.5*r.rating + 0.2*isMusic(e)
– Given a restaurant r, upper bound is 0.3*1 + 0.5*r.rating + 0.2*1
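
The bound follows from substituting the largest possible base score for every input that has not been fixed yet. A minimal sketch, assuming base scores lie in [0, 1] as stated on slide 2; the function names and the test grid are illustrative only.

```python
from itertools import product

def score(b_h, b_r, b_e):
    """The example scoring function; monotonic because all weights are positive."""
    return 0.3 * b_h + 0.5 * b_r + 0.2 * b_e

def upper_bound_for_restaurant(rating):
    """Best score any join result containing this restaurant can reach.

    The unknown hotel and event scores are replaced by their maximum, 1.0.
    """
    return score(1.0, rating, 1.0)

# Check the bound against a grid of possible hotel and event scores.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
rating = 0.6
bound = upper_bound_for_restaurant(rating)
bound_holds = all(score(h, rating, e) <= bound for h, e in product(grid, grid))
```

For a non-monotonic scoring function this substitution would not give a valid bound, which is why the rank join operators rely on monotonicity.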

11 Hash Rank Join [IAE04]
The Hash Rank Join algorithm
– Joins inputs sorted by score
– Returns results with highest score
Main ideas
– Alternate between inputs based on pull strategy
– Score bounds allow early termination
Example
– Query: Top result from $L \bowtie R$ with scoring function $S(b_L, b_R) = b_L + b_R$
– L (a, $b_L$): (x, 1.0), (y, 0.8), ... Bound: 1.8
– R (a, $b_R$): (y, 1.0), (z, 0.9), (w, 0.7), ... Bound: 1.7
– Result: y, Score: 1.8

12 HRJN* [IAE04]
The HRJN* pull strategy:
a) Pull from the input with highest bound
b) If (a) is a tie, pull from input with the smaller number of pulls so far
c) If (b) is a tie, pull from the left
Example
– Query: Top result from $L \bowtie R$ with scoring function $S(b_L, b_R) = b_L + b_R$
– L (a, $b_L$): (x, 1.0), (y, 0.8)
– R (a, $b_R$): (y, 1.0), (z, 0.9), (w, 0.7)
– Result: y, Score: 1.8
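
The pull strategy can be illustrated with a small Python sketch of a hash rank join for a sum scoring function. It is a simplified illustration, not the operator from [IAE04]: it assumes base scores in [0, 1] and uses 1.0 as the best possible score of the other input when computing a bound, and the trailing tuples (u, 0.5) and (v, 0.6) are hypothetical stand-ins for the elided rows of the example.

```python
import heapq
from collections import defaultdict

MAX_SCORE = 1.0  # assumption: base scores lie in [0, 1]

def hrjn_star(left, right, k):
    """Top-k equi-join of two inputs sorted by descending base score.

    left, right: lists of (join_key, base_score); the score of a join
    result is the sum of its two base scores.
    Returns (results, depth): results as (score, join_key), best first,
    and depth as [tuples pulled from left, tuples pulled from right].
    """
    inputs = (left, right)
    depth = [0, 0]
    last = [MAX_SCORE, MAX_SCORE]                  # last score seen per input
    seen = (defaultdict(list), defaultdict(list))  # hash tables on join key
    buffer = []                                    # heap of (-score, key)
    results = []

    def bound(i):
        # Best score that an unseen tuple of input i could still produce.
        if depth[i] == len(inputs[i]):
            return float("-inf")
        return last[i] + MAX_SCORE

    while len(results) < k:
        threshold = max(bound(0), bound(1))
        # Emit buffered results that no future join result can beat.
        while buffer and len(results) < k and -buffer[0][0] >= threshold:
            neg, key = heapq.heappop(buffer)
            results.append((-neg, key))
        if len(results) == k or threshold == float("-inf"):
            break
        # HRJN* strategy: highest bound, then fewer pulls so far, then left.
        i = min((0, 1), key=lambda j: (-bound(j), depth[j], j))
        key, s = inputs[i][depth[i]]
        depth[i] += 1
        last[i] = s
        seen[i][key].append(s)
        for other in seen[1 - i][key]:
            heapq.heappush(buffer, (-(s + other), key))
    return results, depth

L = [("x", 1.0), ("y", 0.8), ("u", 0.5)]
R = [("y", 1.0), ("z", 0.9), ("w", 0.7), ("v", 0.6)]
results, depth = hrjn_star(L, R, k=1)
```

On this input the sketch stops after two pulls from L and three from R, when the bounds have dropped to 1.8 and 1.7, and returns y with score 1.8. The returned `depth` pair is the quantity that depth estimation tries to predict without running the join.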

13 Outline Preliminaries DEEP Framework Experimental Results

14 Supported Operators
Evidence in favor of HRJN*
– Pull strategy has strong properties
  – Within constant factor of optimal cost
  – Optimal for a significant class of inputs
  – More details in the paper
– Efficient in experiments [IAE04]
⇒ DEEP explicitly supports HRJN*
– Easily extended to other join operators
– Selection operators too

15 DEEP: Conceptual View
Formalization: Depth Computation, defined in terms of the Statistical Data Model
Implementation: Estimation Algorithms, defined in terms of the Statistics Interface
Data Synopsis

16 Statistics Model
Statistics yield the distribution of scores for base tables and joins
[Figure: frequency tables $F_L(b_L)$ and $F_R(b_R)$ for the base tables, and $F_{L \bowtie R}(b_L, b_R)$ for the join]

17 Statistics Interface
DEEP accesses statistics with two methods
– getFreq(b): Return frequency of b
– nextScore(b,i): Return next lowest score on dimension i
Example for a point b in $F_{L \bowtie R}(b_L, b_R)$: getFreq(b) = 3, nextScore(b,1) = 0.9, nextScore(b,2) = 0.5
The interface allows for efficient algorithms
– Abstracts the physical statistics format
– Allows statistics to be generated on-the-fly
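
One way to picture the two methods is an in-memory frequency table keyed by score vectors. The class below is a hypothetical stand-in, not the DEEP implementation: the class name, the dictionary layout, and the sample frequencies are assumptions chosen so that the calls reproduce the example values on the slide.

```python
from bisect import bisect_left

class JoinStatistics:
    """Hypothetical dictionary-backed version of the statistics interface."""

    def __init__(self, freq):
        # freq maps a score vector (b_L, b_R) to its number of join results.
        self._freq = dict(freq)
        # Distinct scores on each dimension, in ascending order.
        self._axes = [sorted({b[d] for b in self._freq}) for d in (0, 1)]

    def get_freq(self, b):
        """Frequency of the score vector b (0 if b does not occur)."""
        return self._freq.get(tuple(b), 0)

    def next_score(self, b, i):
        """Next lowest score below b on dimension i (1-based), or None."""
        axis = self._axes[i - 1]
        pos = bisect_left(axis, b[i - 1])
        return axis[pos - 1] if pos > 0 else None

# Hypothetical frequencies chosen to match the slide's example values.
stats = JoinStatistics({(1.0, 0.8): 3, (0.9, 0.8): 2,
                        (1.0, 0.5): 4, (0.9, 0.5): 1})
b = (1.0, 0.8)
```

Because callers only see `get_freq` and `next_score`, the same estimation code could run over a histogram or another synopsis that answers these two calls.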

18 Statistics Implementation
Interface can be implemented over common types of data synopses
Can use a histogram if
a) Base score function is invertible, or
b) Base score measures distance
Assume uniformity & independence if
a) Base score function is too complex, or
b) Sufficient statistics are not available

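19 Depth Estimation Overview
[Figure: top-k query plan over tables A, B, and C with two rank joins; the top join is annotated with $s_1$, $l_1$, $r_1$ and the lower join with $s_2$, $l_2$, $r_2$]
Estimates made (value)
1. Score of the k-th best tuple out of the top join ($s_1$)
2. Depths of the top join needed to output score of $s_1$ ($l_1$ and $r_1$)
3. Score of the $l_1$-th best tuple out of the lower join ($s_2$)
4. Depths of the lower join needed to output score of $s_2$ ($l_2$ and $r_2$)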

20 Estimating Terminal Score
Suppose we want the 10th best score
Idea
– Sort by total score
– Sum frequencies
[Figure: the table $F_{L \bowtie R}(b_L, b_R)$ sorted by $b_L + b_R$ with a running sum of frequencies; the result is $S_{term} = 1.6$]
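
The sort-and-sum idea can be written down directly. The sketch below is the naive version that materializes and sorts the whole frequency table; the frequencies are hypothetical and were picked so that the 10th best score comes out as 1.6, as in the slide.

```python
def terminal_score_naive(freq, k):
    """Score of the k-th best join result under a sum scoring function.

    freq maps (b_L, b_R) to the number of join results with those scores.
    Returns None if the join has fewer than k results.
    """
    rows = sorted(((b_l + b_r, count) for (b_l, b_r), count in freq.items()),
                  reverse=True)
    running = 0
    for total, count in rows:
        running += count          # sum frequencies in score order
        if running >= k:
            return total
    return None

# Hypothetical join statistics (not the table from the slide).
freq = {(1.0, 1.0): 2, (1.0, 0.8): 3, (0.8, 1.0): 1,
        (0.8, 0.8): 5, (1.0, 0.5): 4, (0.8, 0.5): 2}
s_term = terminal_score_naive(freq, k=10)
```

Here the running sum reaches 2, 5, 6, and then 11 at total score 1.6, so the 10th best result has score 1.6.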

21 Estimation Algorithm
Idea: Only process necessary statistics
Algorithm relies solely on getFreq and nextScore
– Avoids materializing complete table
Worst-case complexity equivalent to sorting table
– More efficient in practice
[Figure: the table $F_{L \bowtie R}(b_L, b_R)$ with only the processed entries highlighted; $S_{term} = 1.6$]
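
One plausible way to process only the necessary statistics is a best-first walk over the score grid that starts at the highest score vector and expands neighbors through nextScore. The sketch below is an assumption about how such an algorithm could look, not the algorithm from the paper; `make_interface` is a hypothetical in-memory stand-in for the statistics interface, and the frequencies are invented.

```python
import heapq
from bisect import bisect_left

def make_interface(freq):
    """Hypothetical in-memory stand-in for getFreq and nextScore."""
    axes = [sorted({b[d] for b in freq}) for d in (0, 1)]

    def get_freq(b):
        return freq.get(b, 0)

    def next_score(b, i):
        pos = bisect_left(axes[i - 1], b[i - 1])
        return axes[i - 1][pos - 1] if pos > 0 else None

    top = (axes[0][-1], axes[1][-1])   # highest score on each dimension
    return get_freq, next_score, top

def estimate_terminal_score(get_freq, next_score, top, k):
    """Visit score vectors in descending order of b_L + b_R until k results."""
    heap = [(-(top[0] + top[1]), top)]
    visited = {top}
    running = 0
    while heap:
        neg_total, b = heapq.heappop(heap)
        running += get_freq(b)
        if running >= k:
            return -neg_total
        for i in (1, 2):               # step down on each dimension
            nxt = next_score(b, i)
            if nxt is None:
                continue
            nb = (nxt, b[1]) if i == 1 else (b[0], nxt)
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(heap, (-(nb[0] + nb[1]), nb))
    return None

freq = {(1.0, 1.0): 2, (1.0, 0.8): 3, (0.8, 1.0): 1,
        (0.8, 0.8): 5, (1.0, 0.5): 4, (0.8, 0.5): 2}
get_freq, next_score, top = make_interface(freq)
s_term = estimate_terminal_score(get_freq, next_score, top, k=10)
```

Since the sum is monotonic, a vector is never popped before a vector that dominates it, so the walk can stop as soon as the running frequency reaches k and leave the rest of the table untouched.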

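22 Depth Estimation Overview
[Figure: top-k query plan over tables A, B, and C with two rank joins; the top join is annotated with $s_1$, $l_1$, $r_1$ and the lower join with $s_2$, $l_2$, $r_2$]
Estimates made (value)
1. Score of the k-th best tuple out of the top join ($s_1$)
2. Depths of the top join needed to output score of $s_1$ ($l_1$ and $r_1$)
3. Score of the $l_1$-th best tuple out of the lower join ($s_2$)
4. Depths of the lower join needed to output score of $s_2$ ($l_2$ and $r_2$)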

23 Estimating Depth for HRJN*
Theorem: $i \le$ depth of HRJN* $\le j$
[Figure: input scores $b_L$ with frequencies $F_L(b_L)$ and the column $b_L + 1$, split into regions $> S_{term}$, $= S_{term}$, and $< S_{term}$; i and j mark positions in this ordering]
Example: $S_{term}$ = … , … ≤ depth ≤ 15
Estimation algorithm
– Access via getFreq and nextScore
– Similar to estimation of $S_{term}$

24 Outline Preliminaries DEEP Framework Experimental Results

25 Experimental Setting
TPC-H data set
– Total size of 1 GB
– Varying amount of skew
Workloads of 250 queries
– Top-10, top-100, top-1000 queries
– One or two joins per query
Error metric: absolute relative error
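
The slides do not spell out the error metric; a minimal sketch, assuming the usual definition of absolute relative error as |estimate - actual| / |actual|, with a hypothetical function name:

```python
def absolute_relative_error(estimated_depth, actual_depth):
    """Absolute relative error of a depth estimate (assumed definition)."""
    if actual_depth == 0:
        raise ValueError("actual depth must be nonzero")
    return abs(estimated_depth - actual_depth) / abs(actual_depth)

# Hypothetical numbers: an estimate of 120 tuples against an actual depth of 100.
error = absolute_relative_error(120, 100)
```

Multiplying the value by 100 gives a percentage, which matches the axis label used in the result plots.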

26 Depth Estimation Techniques
DEEP
– Uses 150 KB TuG synopsis [SP06]
Probabilistic [IAE04]
– Uses same TuG synopsis
– Modified to handle single-join queries with varying table sizes
Sampling [LCIS05]
– 5% sample = 4.6 MB

27 Error for Varying Skew
[Figure: percentage error plotted against the Zipfian skew parameter]

28 Error at Each Input
[Figure: percentage error for each input of a two-join query]

29 Conclusions
Depth estimation is necessary to optimize relational ranking queries
DEEP is a principled and practical solution
– Takes data distribution into account
– Applies to many common scenarios
– Integrates with data summarization techniques
New theoretical results for HRJN*
Next steps
– Accuracy guarantees
– Data synopses for complex base scores (especially text predicates)

30 Thank You

31 Related Work
Selectivity estimation is a similar idea
It is the inverse problem
– Selectivity Estimation: given depth = |L| and depth = |R| for a join of L and R, what is the selectivity?
– Depth Estimation: given selectivity = k for a join of L and R, what are the depths?

32 Other Features
DEEP can be extended to NLRJ and selection operators
DEEP can be extended to other pulling strategies
– Block-based HRJN*
– Block-based alternation

33 Analysis of HRJN*
Within the class of all HRJN variants:
– HRJN* is optimal for many cases
  – With no ties of score bound between inputs
  – With no ties of score bound within one input
– HRJN* is instance optimal