Join Synopses for Approximate Query Answering Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy.

Purpose: Provide good approximate answers for join queries by using join synopses. Allocate space among the join synopses. Maintain the join synopses under updates.

Why? Reduce overhead for large databases. An approximate answer is often sufficient. Reduce access to the base relations. Provide fast response times.

Proposed: Use precomputed samples of distinguishing joins (DJs), called join synopses. For queries with foreign key joins, provide good approximate join aggregates using a small number of join synopses. Allocate space among the join synopses, using workload characteristics if they are available. Provide confidence bounds.

Aqua System: The goal is to improve query response time by reducing access to the base relations. Aqua maintains small statistical summaries of the data, called synopses, and provides confidence bounds. It has three components: (1) statistics collection, (2) query rewriting, and (3) maintenance. It sits on top of a DBMS.

Problem with Joins: The natural set of synopses would be random samples of each base relation. This has two problems. Non-uniform result sample: for the join of base samples to be a uniform random sample of the original join, the inclusion probabilities of the result tuples must be equal, which generally does not hold. Small join result sizes: joining samples produces few result tuples. Using samples of the base relations is therefore not feasible.
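The small-result problem can be illustrated with a minimal sketch. The relation names (`orders`, `customers`), the column names, and the Bernoulli sampling rate are assumptions made for the illustration and do not come from the slides. Each relation is sampled independently and the samples are then joined on the foreign key.

```python
import random

def join_on_fk(left, right_by_key, fk):
    """Foreign key join: pair each left row with its referenced right row, if present."""
    return [(row, right_by_key[row[fk]]) for row in left if row[fk] in right_by_key]

def sampled_join_size(orders, customers, rate, seed=0):
    """Join independent Bernoulli samples of both relations and return the result size."""
    rng = random.Random(seed)
    s_orders = [o for o in orders if rng.random() < rate]
    s_customers = {c["id"]: c for c in customers if rng.random() < rate}
    return len(join_on_fk(s_orders, s_customers, "cust_id"))

# Hypothetical data: 1000 customers, 10000 orders, each order references one customer.
customers = [{"id": i} for i in range(1000)]
orders = [{"id": i, "cust_id": i % 1000} for i in range(10000)]

full = len(join_on_fk(orders, {c["id"]: c for c in customers}, "cust_id"))
approx = sampled_join_size(orders, customers, rate=0.1)
print(full, approx)  # a result tuple survives only if BOTH of its inputs were sampled
```

With a 10% sample of each relation, a join tuple is kept only when both sides were sampled, so the expected result holds about 1% of the full join rather than 10%.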

Join Synopses: Precompute samples of join results. Samples of the results of a small set of distinguishing joins are enough to obtain random samples for all possible joins in the schema. This scheme applies to foreign key joins. Model the database as a graph in which each node corresponds to a relation.

Join Synopses: There is a one-to-one correspondence between a tuple in a relation r and a tuple in the output of any foreign key join involving r and any of its descendants in the graph. A sample Sr of a relation r can therefore be used to produce another relation J(Sr), called a join synopsis of r, which provides random samples of those joins. The subgraph of G on the k nodes in any k-way foreign key join must be a connected subgraph with a single root node.

Join Synopses: For each node u in G, corresponding to a relation r1, define J(u) to be the output of the maximum foreign key join r1 ⋈ r2 ⋈ ... ⋈ rk with source r1. Let Su be a uniform random sample of r1. The join synopsis J(Su) is the output of Su ⋈ r2 ⋈ r3 ⋈ ... ⋈ rk. J(Su) is a uniform random sample of J(u) with |Su| tuples. Thus we can extract from our synopses a uniform random sample of the output of any k-way foreign key join. From one join synopsis for a node whose maximum foreign key join has k relations, we can extract a uniform random sample of the output of between k-1 and 2^(k-1) - 1 distinct foreign key joins.
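The construction of J(Su) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_join_synopsis`, the `fk_path` representation, and the three toy relations are all hypothetical, and the foreign key path is a simple chain.

```python
import random

def build_join_synopsis(source_rows, fk_path, sample_size, seed=0):
    """Sample the source relation uniformly, then follow foreign keys.

    fk_path is a list of (fk_column, lookup) pairs, where lookup maps a
    primary key to the referenced row. Each step extends every sampled
    tuple by exactly one referenced row, so the output has sample_size tuples.
    """
    rng = random.Random(seed)
    sample = rng.sample(source_rows, sample_size)  # S_u: uniform, without replacement
    synopsis = []
    for row in sample:
        joined = dict(row)
        for fk_column, lookup in fk_path:
            # Referential integrity guarantees the referenced row exists.
            joined.update(lookup[joined[fk_column]])
        synopsis.append(joined)
    return synopsis

# Hypothetical schema: lineitems -> orders -> customers.
customers = {c: {"cust_id": c, "nation": "N%d" % (c % 5)} for c in range(20)}
orders = {o: {"order_id": o, "cust_id": o % 20} for o in range(200)}
lineitems = [{"line_id": i, "order_id": i % 200, "qty": 1 + i % 7} for i in range(2000)]

synopsis = build_join_synopsis(
    lineitems, [("order_id", orders), ("cust_id", customers)], sample_size=50
)
```

Because every source tuple joins with exactly one tuple along each foreign key, projecting the synopsis onto the columns of lineitems and orders gives a uniform sample of the two-way join as well, which is how one synopsis serves several joins.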

Allocation: Allocate space among the various join synopses when certain properties of the query workload are known, and identify heuristics for the common case when such properties are not known. Let S be a set of queries with selects, aggregates, group-bys, and foreign key joins. For each relation Ri, find the fraction Fi of queries in S for which Ri is the source relation in a foreign key join. Select the join synopsis sizes so as to minimize the average relative error. The error bounds are known to be inversely proportional to sqrt(n), where n is the number of tuples in the join sample.

Allocation: The average relative error bound over the queries is proportional to sum(Fi/sqrt(ni)), where ni is the number of tuples in the join synopsis for Ri. In the absence of query workload information, heuristic strategies can be used. There are three: EqJoin, CubeJoin, and PropJoin.
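One way to turn the objective sum(Fi/sqrt(ni)) into concrete sizes is sketched below. The slides do not state the constraint, so this sketch assumes a total space budget N and a per-tuple size si for each join synopsis, with sum(si * ni) = N. Under that assumption, setting the derivative of the Lagrangian to zero gives ni proportional to (Fi/si)^(2/3). The function names and the example numbers are hypothetical.

```python
def allocate_synopsis_sizes(fractions, tuple_sizes, budget):
    """Minimize sum(F_i / sqrt(n_i)) subject to sum(s_i * n_i) == budget.

    Closed form from the Lagrange condition: n_i is proportional to
    (F_i / s_i) ** (2/3). The budget and tuple sizes are assumptions
    of this sketch.
    """
    weights = [(f / s) ** (2.0 / 3.0) for f, s in zip(fractions, tuple_sizes)]
    denom = sum(s * w for s, w in zip(tuple_sizes, weights))
    return [budget * w / denom for w in weights]

def average_error_bound(fractions, sizes):
    """Objective from the slide, up to a constant factor."""
    return sum(f / n ** 0.5 for f, n in zip(fractions, sizes) if f > 0)

# Hypothetical workload: three relations, query fractions F_i, tuple sizes s_i in bytes.
fractions = [0.6, 0.3, 0.1]
tuple_sizes = [100, 200, 50]
budget = 100000

optimal = allocate_synopsis_sizes(fractions, tuple_sizes, budget)
equal_space = [budget / (3 * s) for s in tuple_sizes]  # each synopsis gets a third of the space
print(average_error_bound(fractions, optimal), average_error_bound(fractions, equal_space))
```

Relations that are queried more often or whose join tuples are smaller receive more sample tuples; an equal split of the space never gives a lower value of the objective than the closed-form allocation.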

Maintenance of Join Synopses: The join synopses must be kept synchronized when a base relation is updated. When a tuple is inserted, proceed as follows. Let Pu be the current probability of including a newly arrived tuple for relation u in the random sample Su. Let u ⋈ r2 ⋈ r3 ⋈ ... ⋈ rk be the maximum foreign key join with source u. Add the new tuple Tp to Su with probability Pu. If Tp is added to Su, add to J(Su) the tuple Tp ⋈ r2 ⋈ r3 ⋈ ... ⋈ rk.
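The insertion steps above can be sketched as follows. This is only an illustration of the rule as stated on the slide: the function name `on_insert`, the `fk_path` representation, and the injectable `rng` are assumptions, and how Pu is chosen or adjusted over time is left out.

```python
import random

def on_insert(new_tuple, sample, synopsis, p_u, fk_path, rng=random):
    """Add new_tuple to S_u with probability p_u; if added, extend J(S_u).

    fk_path is a list of (fk_column, lookup) pairs (a hypothetical
    representation), where lookup maps a primary key to the referenced row.
    Returns True when the tuple was added to the sample.
    """
    if rng.random() >= p_u:
        return False  # not sampled: the base relations are not touched
    sample.append(new_tuple)
    joined = dict(new_tuple)
    for fk_column, lookup in fk_path:
        # Only here do we look up rows in the referenced relations.
        joined.update(lookup[joined[fk_column]])
    synopsis.append(joined)
    return True
```

The lookups into the referenced relations happen only for tuples that are actually sampled, so most insertions cost a single random draw.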

Maintenance of Join Synopses: When a tuple Tp is deleted from u, check whether Tp is in Su. If it is, delete it from Su and remove the tuple in J(Su) that corresponds to Tp. If the sample becomes too small, repopulate it by rescanning relation u. This algorithm performs lookups to the base relations with probability Pu. We never update the join synopses of any ancestors of u.
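The deletion steps can be sketched in the same style. The names `on_delete`, `key_column`, `min_size`, and the `rebuild` callback are hypothetical; in particular, the slide does not say what counts as "too small", so the threshold is a parameter here.

```python
def on_delete(key, sample, synopsis, key_column, min_size, rebuild):
    """Remove the deleted tuple from S_u and J(S_u) if it was sampled.

    rebuild is a hypothetical callback that rescans relation u and returns
    a fresh (sample, synopsis) pair; it is called only when the sample
    shrinks below min_size.
    """
    if not any(row[key_column] == key for row in sample):
        return sample, synopsis  # the tuple was never sampled: nothing to do
    sample = [row for row in sample if row[key_column] != key]
    synopsis = [row for row in synopsis if row[key_column] != key]
    if len(sample) < min_size:
        sample, synopsis = rebuild()  # repopulate by rescanning relation u
    return sample, synopsis
```

A deletion of an unsampled tuple leaves both structures untouched, which keeps the common case cheap.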

Experimental Evaluation: The test bed is the TPC-D decision support benchmark, with a database of around 300 MB. The machine is a 296 MHz UltraSPARC-II with 256 MB of memory, running Solaris 5.6. The query used is based on Q5 and computes an aggregate on the join of Lineitem, Customer, Order, Supplier, Nation, and Region.

Join Synopsis Accuracy

Join Synopsis Maintenance

Conclusion: The focus is on computing approximate answers to aggregates computed on multi-way joins. For databases whose schemas involve only foreign key joins, join synopses have been proposed as a solution. Join synopses can be maintained effectively during updates.