Join Synopses for Approximate Query Answering. Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy. Presented by Bhushan Pachpande.

Presentation transcript:

Contents
1. Introduction
2. Need for approximate answers
3. Problem with joins
4. Join synopses
5. Allocation
6. Maintenance of join synopses
7. Experimental evaluation

Introduction
This paper:
- demonstrates the difficulty of providing good approximate answers to join queries
- proposes join synopses as an efficient solution to this problem
- presents a strategy for allocating the available space among join synopses
- provides an efficient algorithm for maintaining join synopses in the presence of updates to the base relations

Why Approximate Answers?
- Reduce overhead for large databases and improve response time
- Reduce access to the base relations
Examples of queries for which approximate answers suffice:
- Initial queries in data mining, which are used to determine what the interesting queries are
- Queries requesting numerical answers where the full precision of the exact answer is not needed, e.g. totals and averages
The research in this paper was conducted while developing an efficient approximate query answering system, Aqua.

Aqua System
- Improves response time by avoiding frequent access to the original data
- Maintains small statistical summaries, called synopses, on the warehouse
- Sits on top of the DBMS
Three key components:
- Statistics Collection: collects all the synopses used to answer queries posed by the user
- Query Rewriting: parses the SQL input and rewrites queries to run against the synopses, scaling certain operators accordingly
- Maintenance: keeps the synopses up to date as the original data is updated
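As an illustration of what scaling an operator can mean, the sketch below estimates SUM and COUNT from a uniform random sample by multiplying by the inverse of the sampling fraction. The function names and the numbers are assumptions made for this example; this is not Aqua's actual rewriter.

```python
def scaled_sum(sample_values, table_size):
    """Estimate SUM over the full table from a uniform sample of one column."""
    scale = table_size / len(sample_values)  # inverse of the sampling fraction
    return scale * sum(sample_values)


def scaled_count(sample_values, table_size, predicate):
    """Estimate COUNT(*) WHERE predicate holds, using the same sample."""
    scale = table_size / len(sample_values)
    return scale * sum(1 for v in sample_values if predicate(v))


def sample_avg(sample_values):
    """AVG needs no scaling: the sample mean estimates the table mean."""
    return sum(sample_values) / len(sample_values)
```

With a 3-tuple sample of a 300-tuple table, every sampled value stands in for 100 tuples, so sums and counts are multiplied by 100 while averages are left unchanged.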

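Problem with Joins
The natural set of synopses for approximate query answering includes uniform random samples of each base relation. Joining such samples runs into two problems:
- Non-uniform result samples: for the join of samples to be a uniform random sample of the join, the probabilities of tuples appearing in the join sample must be equal
- Small join result: the join of two samples typically contains far fewer tuples than either sample
Example: join of relations R and S on attribute X
[Figure: X values (a, b) of the tuples in R and S, with joining pairs labeled a1, a2 and b1]
Under uniform random sampling, the probability that a1 and a2 are both selected should equal the probability that a1 and b1 are both selected. When each base tuple is sampled with probability 1/r, this fails:
P(a1, a2) = (1/r) * (1/r) * (1/r) = 1/r^3
P(a1, b1) = (1/r) * (1/r) * (1/r) * (1/r) = 1/r^4
The three factors for (a1, a2) suggest that these two join tuples share one base tuple, whereas (a1, b1) needs four distinct base tuples. Obtaining uniform join samples from base samples is therefore very difficult.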
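The non-uniformity can be checked exactly on a tiny instance. The sketch below is an illustration under assumed data, not the slide's original figure: R has two tuples, S has three, and each base tuple is kept independently with probability p (playing the role of 1/r). It enumerates every possible pair of samples and sums the probability of those in which two given join tuples both survive.

```python
from fractions import Fraction
from itertools import product

# Hypothetical base relations: tuple name -> value of join attribute X.
R = {"r0": "a", "r1": "b"}
S = {"s0": "a", "s1": "a", "s2": "b"}

# Join tuples as (R tuple, S tuple) pairs: a1 and a2 share r0, b1 uses r1.
JOIN = [(r, s) for r, x in R.items() for s, y in S.items() if x == y]


def prob_both_in_join_of_samples(t1, t2, p):
    """Exact probability that join tuples t1 and t2 both appear in
    sample(R) JOIN sample(S) when every base tuple is kept with probability p."""
    base = list(R) + list(S)
    total = Fraction(0)
    for picks in product([False, True], repeat=len(base)):
        chosen = {name for name, keep in zip(base, picks) if keep}
        weight = Fraction(1)
        for keep in picks:
            weight *= p if keep else 1 - p
        if set(t1) <= chosen and set(t2) <= chosen:
            total += weight
    return total


p = Fraction(1, 10)
a1, a2, b1 = JOIN
p_shared = prob_both_in_join_of_samples(a1, a2, p)    # needs 3 base tuples
p_disjoint = prob_both_in_join_of_samples(a1, b1, p)  # needs 4 base tuples
```

Here `p_shared` equals p^3 and `p_disjoint` equals p^4, so pairs of join tuples that share a base tuple are over-represented in the join of samples.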

Join Synopses
- Naïve way: execute all possible join queries and collect samples of the results
- Join synopses: samples are taken from a small set of distinguished joins
- From these, random samples of all possible joins in the schema can be obtained
- This scheme is for foreign key joins
The database schema is modeled as a graph:
- vertex: a base relation
- directed edge (u to v): u has at least one attribute that is a foreign key referencing v
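The graph model can be sketched as an adjacency map in which each relation points to the relations its foreign keys reference. The schema below is a hypothetical, simplified warehouse schema chosen for illustration, and the function names are assumptions.

```python
# Hypothetical schema: relation -> relations referenced by its foreign keys.
SCHEMA = {
    "lineitem": ["orders", "supplier"],
    "orders": ["customer"],
    "customer": ["nation"],
    "supplier": ["nation"],
    "nation": ["region"],
    "region": [],
}


def descendants(schema, u):
    """Relations reachable from u along foreign-key edges, in first-visit order."""
    seen = []
    stack = list(reversed(schema[u]))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.append(v)
            stack.extend(reversed(schema[v]))
    return seen


def maximal_join(schema, u):
    """Relations taking part in the largest foreign-key join with source u."""
    return [u] + descendants(schema, u)
```

For example, `maximal_join(SCHEMA, "customer")` lists customer, nation and region, while a relation with no outgoing edges, such as region, joins with nothing but itself.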

Join Synopses
Key result proved: there is a one-to-one correspondence between a tuple in a relation r and a tuple in the output of any foreign key join involving r and any of its descendants in the graph.
A sample S_r of a relation r can therefore be used to produce another relation J(S_r), called a join synopsis of r, which provides random samples of these joins.
Example: the join synopsis of R is simply a sample of R, whereas the join synopsis of C is the join of N, R and a sample of C.

Join Synopses
- For each node u in the database schema graph G, corresponding to a relation r1, define J(u) to be the output of the maximum foreign key join r1 ⋈ r2 ⋈ ... ⋈ rk with source r1.
- Let S_u be a uniform random sample of r1. The join synopsis J(S_u) is the output of S_u ⋈ r2 ⋈ ... ⋈ rk.
- J(S_u) is a uniform random sample of J(u) with |S_u| tuples.
- Thus a uniform random sample of the output of any k-way foreign key join can be extracted from the synopses.
- From a single join synopsis for a node whose maximum foreign key join has k relations, a uniform random sample of the output of between k-1 and 2^(k-1) - 1 distinct foreign key joins can be extracted.
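A minimal sketch of building J(S_u) for one source relation follows. The three toy relations, their column names and the helper functions are assumptions made for illustration. Because every foreign key matches exactly one referenced tuple, each sampled source tuple yields exactly one joined tuple, which is why the synopsis has |S_u| tuples.

```python
import random

# Toy base relations keyed by primary key (hypothetical data).
REGION = {1: {"r_name": "ASIA"}, 2: {"r_name": "EUROPE"}}
NATION = {10: {"n_name": "JAPAN", "n_regionkey": 1},
          20: {"n_name": "FRANCE", "n_regionkey": 2}}
CUSTOMER = {k: {"c_name": f"cust{k}", "c_nationkey": 10 if k % 2 else 20}
            for k in range(1, 101)}


def join_tuple(c_key):
    """Follow the foreign keys customer -> nation -> region for one customer."""
    c = CUSTOMER[c_key]
    n = NATION[c["c_nationkey"]]
    r = REGION[n["n_regionkey"]]
    return {"c_key": c_key, **c, **n, **r}


def join_synopsis(sample_size, rng):
    """J(S_u): join a uniform sample of the source relation with its descendants."""
    sample_keys = rng.sample(sorted(CUSTOMER), sample_size)
    return [join_tuple(k) for k in sample_keys]


def project(synopsis, columns):
    """A sample of a smaller join with the same source is a projection."""
    return [{c: row[c] for c in columns} for row in synopsis]


synopsis = join_synopsis(10, random.Random(0))
customer_nation = project(synopsis, ["c_key", "c_name", "c_nationkey", "n_name"])
```

The projection step shows how one stored synopsis can serve several joins: dropping the region columns gives a sample of the customer-nation join without touching the base relations.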

Allocation
- Goal: allocate space among the various join synopses when certain properties of the query workload are known
- Identify heuristics for the common case when such properties are not known
- Let S be a set of queries with selects, aggregates, group-bys and foreign key joins
- For each relation R_i, find the fraction f_i of queries in S for which R_i is the source relation in a foreign key join
- The error bounds are known to be inversely proportional to sqrt(n), where n is the number of tuples in the join sample
- Select the join synopsis sizes so as to minimize the average relative error

Allocation
- The average relative error bound over the queries is proportional to sum_i (f_i / sqrt(n_i))
- The n_i are selected to minimize this sum within the total memory allocated for join synopses
- If s_i is the size of a single join synopsis tuple for relation R_i, the sizes must satisfy sum_i (n_i * s_i) <= total memory allocated
In the absence of query workload information, heuristic strategies can be used:
- EqJoin: divides the space equally among the relations
- CubeJoin: divides the space in proportion to the cube root of the join synopsis tuple sizes
- PropJoin: divides the space in proportion to the join synopsis tuple sizes
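The allocation problem can be sketched numerically. Minimizing sum_i (f_i / sqrt(n_i)) subject to sum_i (n_i * s_i) = M with a Lagrange multiplier gives n_i proportional to (f_i / s_i)^(2/3); the code below uses that closed form and assumes every f_i is positive and that fractional tuple counts are acceptable. The function names and the example workload are assumptions for illustration.

```python
def optimal_allocation(f, s, memory):
    """Tuple counts n_i minimizing sum(f_i / sqrt(n_i)) with sum(n_i * s_i) == memory."""
    weights = [(fi / si) ** (2 / 3) for fi, si in zip(f, s)]
    c = memory / sum(w * si for w, si in zip(weights, s))
    return [c * w for w in weights]


def heuristic_allocation(s, memory, exponent):
    """Split space in proportion to s_i ** exponent, then convert space to tuples.

    exponent = 0 gives EqJoin, 1/3 gives CubeJoin, 1 gives PropJoin.
    """
    shares = [si ** exponent for si in s]
    total = sum(shares)
    return [memory * share / total / si for share, si in zip(shares, s)]


def error_bound(f, n):
    """The quantity the allocation tries to minimize (up to a constant factor)."""
    return sum(fi / ni ** 0.5 for fi, ni in zip(f, n))


f = [0.5, 0.3, 0.2]   # hypothetical workload fractions
s = [100, 40, 10]     # hypothetical join synopsis tuple sizes
M = 10_000            # hypothetical memory budget, same unit as s
best = optimal_allocation(f, s, M)
eq_join = heuristic_allocation(s, M, 0)
cube_join = heuristic_allocation(s, M, 1 / 3)
prop_join = heuristic_allocation(s, M, 1)
```

Under this formulation CubeJoin coincides with the optimal allocation when all f_i are equal, and PropJoin gives every synopsis the same number of tuples.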

Maintenance of Join Synopses
- The join synopses must be maintained when a base relation is updated (insert or delete)
- Maintenance does not require frequent access to the base relations
When a new tuple T is inserted into relation u:
- Let P_u be the probability with which a newly arrived tuple for relation u is included in the random sample S_u
- Let u ⋈ r2 ⋈ r3 ⋈ ... ⋈ rk be the maximum foreign key join with source u
- Add T to S_u with probability P_u
- If T is added to S_u, add the tuple T ⋈ r2 ⋈ r3 ⋈ ... ⋈ rk to J(S_u)

Maintenance of Join Synopses
- If T is added to S_u and S_u exceeds its target size, select a tuple T' uniformly at random to evict from S_u, and remove the tuple corresponding to T' from J(S_u)
When a tuple T is deleted from u:
- If T is in S_u, delete it from S_u and remove the corresponding tuple from J(S_u)
- If the sample becomes too small because of many deletions, repopulate it by scanning relation u
This algorithm performs lookups on the base relations only with the small probability P_u.
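The insert and delete steps can be sketched as a small class. This is an illustration under assumptions, not the paper's implementation: P_u is treated as a fixed parameter, `lookup_join` is a hypothetical callback that follows the foreign keys of one tuple into the base relations, and repopulation by rescanning u is left out.

```python
import random


class JoinSynopsis:
    def __init__(self, lookup_join, p_u, target_size, rng=None):
        self.lookup_join = lookup_join  # hypothetical: base tuple -> joined tuple
        self.p_u = p_u                  # inclusion probability for new tuples
        self.target_size = target_size
        self.rng = rng or random.Random()
        self.sample = {}                # S_u: key -> base tuple
        self.joined = {}                # J(S_u): key -> joined tuple
        self.base_lookups = 0           # how often base relations were accessed

    def insert(self, key, tup):
        if self.rng.random() >= self.p_u:
            return  # tuple not sampled, so no base-relation access is needed
        self.sample[key] = tup
        self.joined[key] = self.lookup_join(tup)
        self.base_lookups += 1
        if len(self.sample) > self.target_size:
            # Evict a uniformly random tuple from S_u and its joined counterpart.
            self._remove(self.rng.choice(list(self.sample)))

    def delete(self, key):
        if key in self.sample:
            self._remove(key)

    def _remove(self, key):
        del self.sample[key]
        del self.joined[key]
```

The `base_lookups` counter makes the cost argument visible: base relations are touched only for the fraction of inserts that pass the P_u test, and deletes never touch them.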

Experimental Evaluation
Two classes of experiments:
- Accuracy experiments
- Maintenance experiments
Accuracy experiments:
- Compare the accuracy of techniques based on join synopses with that of techniques based on base samples
- Parameters varied: query selectivity and the total space allocated to precomputed summaries (summary size / join synopsis size)
Maintenance experiments:
- Study the cost of keeping join synopses up to date in the presence of insertions and deletions to the underlying data

Experimental Evaluation
- Experiments were run on the TPC-D decision support benchmark
- The query is an aggregate computed on the join of Lineitem, Customer, Order, Supplier, Nation and Region
- In the query, the region parameter is set to ‘ASIA’ and the selection predicate restricts the o_orderdate column to the range [1/1/94, 1/1/95]
- Query selectivity is varied using these parameters

Experimental Evaluation
Accuracy Experiments
[Figure not preserved]

Experimental Evaluation
Maintenance Experiments
[Figure not preserved; surviving label: tuples inserted in lineitem table]

Conclusion
- Join synopses provide uniform random samples of joins in databases with foreign key joins
- The focus is on computing approximate answers to aggregates computed on multi-way joins
- Join synopses can be maintained efficiently during updates