AQUA: Approximate Query Answering


AQUA: Approximate Query Answering
Gibbons et al.
Information Sciences Research Center, Bell Laboratories

Motivation
- Decision Support Systems (DSS): an SQL query returns an exact answer, but with long response times!
- Exact answers are NOT always required:
  - DSS applications are usually exploratory: early feedback helps identify "interesting" regions.
  - Aggregate queries: precision to the "last decimal" is not needed, e.g., "What percentage of the US sales are in NJ?" (displayed as a bar graph).
  - Preview answers while waiting; trial queries.
  - Base data can be remote or unavailable.

Fast Approximate Answers
- Primarily for aggregate queries.
- Goal: quickly report an approximate answer (the leading digits of the exact answer), in seconds instead of minutes or hours.
- Most useful if error guarantees can be provided, e.g., average salary $59,000 +/- $500 (with 95% confidence) in 10 seconds vs. $59,152.25 in 10 minutes.
- Achieved by answering the query from samples of the data, or online (in phases).

What is an approximate answer?
- For aggregate queries such as sum and avg: <estimated value, accuracy measure>
  - Accuracy measure = confidence interval and confidence probability, e.g., $59,000 +/- $500 (with 95% confidence).
  - The accuracy measure can be guaranteed or heuristic-based.
- For set-valued queries: <representative tuples, metadata about the complete data>
  - Representative tuples may be certain or possible.
  - They may be selected randomly, with bias, or arbitrarily.

Metrics
- Coverage: range of queries
- Response time
- Accuracy
- Update time
- Footprint: storage requirements for the sample

Online Computation
Online:
+ Continuous refinement of answers (online aggregation)
+ User control: what to refine, when to stop
+ Seeing the query is very helpful for fast approximate results
+ No maintenance overheads

Pre-Computation
- Construct and store synopses prior to query time; at query time, use the synopses to answer the query.
- Often faster: better access patterns, and small synopses can reside in memory or cache.
- Middleware: can be used with any DBMS; no special index striding.
- Also effective for remote or streaming data.
- The synopses must be kept up to date.

Aqua Architecture: Picture without Aqua
[Diagram labels: SQL Query Q; Network; Data Warehouse (e.g., Oracle); Result; HTML, XML; Browser, Excel; Warehouse Data Updates]
- The user poses a query Q.
- The data warehouse executes Q and returns the result.
- The warehouse is periodically updated with new data.

Aqua Architecture: Picture with Aqua
[Diagram labels: SQL Query Q; Rewriter; Q'; Network; Data Warehouse (e.g., Oracle); Result (w/ error bounds); HTML, XML; Browser, Excel; Warehouse Data Updates; AQUA Synopses; AQUA Tracker]
- Aqua is middleware, between the user and the warehouse.
- Aqua synopses are stored in the warehouse.
- Aqua intercepts the user query and rewrites it to be a query Q' on the synopses.
- The data warehouse returns an approximate answer.

Aqua: Key Features
- Maintains a number of synopses on the data.
- Updates these synopses primarily by observing the new data.
- Provides discrete reporting.
- Ensures guaranteed accuracy measures.
- Has a footprint orders of magnitude smaller than the data warehouse.

Synopses
- For each relation, store the count of tuples.
- Smaller relations: all tuples are stored.
- Other relations: random tuples, chosen by biased sampling.
- For each attribute in an aggregate query: MAX and MIN.
- For each tuple, only the attributes of interest are stored.
- Incremental maintenance:
  - Batch mode.
  - Online maintenance: too complex.
  - Counts, MAX, and MIN are updated as tuples are added or deleted.

Sampling: Basics
- Idea: a small random sample S of the data often represents all the data well.
- For a fast approximate answer, apply the query to S and "scale" the result.
- E.g., R.a is in {0,1} and S is a 20% sample:
    select count(*) from R where R.a = 0
  becomes
    select 5 * count(*) from S where S.a = 0
- R.a: 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 1 1 0 (on the slide, red = in S)
- Est. count = 5*2 = 10, Exact count = 10
- Unbiased: for expressions involving count, sum, avg, the estimator is unbiased, i.e., the expected value of the answer is the actual answer, even for (most) queries with predicates!
- Leverage the extensive literature on confidence intervals for sampling: the actual answer is within the interval [a,b] with a given probability, e.g., 54,000 ± 600 with prob ≥ 90%.
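
The scale-up idea on this slide can be sketched in a few lines of Python. This is only an illustration: the R.a values are copied from the slide, but the helper name scaled_count, the fixed seed, and the number of trials are assumptions made for the example, and the sample drawn here is not the one marked in red on the slide.

```python
import random

# R.a values copied from the slide: 30 tuples, 10 of which are 0.
R_a = [1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0]

def scaled_count(sample, population_size, predicate):
    """Count matching tuples in the sample and scale up to the full relation."""
    scale = population_size / len(sample)   # 30 / 6 = 5 for a 20% sample
    return scale * sum(1 for a in sample if predicate(a))

def is_zero(a):
    return a == 0

exact_count = sum(1 for a in R_a if is_zero(a))

rng = random.Random(0)          # fixed seed, only for repeatability
sample_size = len(R_a) // 5     # 20% sample -> 6 tuples

# One uniform sample gives one (noisy) estimate.
one_estimate = scaled_count(rng.sample(R_a, sample_size), len(R_a), is_zero)

# Averaging over many independent samples approaches the exact count,
# which is what "unbiased" means for this estimator.
trials = 10_000
mean_estimate = sum(
    scaled_count(rng.sample(R_a, sample_size), len(R_a), is_zero)
    for _ in range(trials)
) / trials

print(exact_count, one_estimate, round(mean_estimate, 2))
```

A single estimate is always a multiple of 5 and can be well off; it is the average over repeated samples that converges to the exact count.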

One-Pass Uniform Sampling
- Reservoir sampling maintains a sample S of fixed size M.
- Create a "reservoir" sample of size M and populate it with the first M items of R.
- Add each new item to S with probability M/N, where N is the current number of data items.
- If an item is added, evict a random item from S.
- To handle deletions, permit |S| to drop to L < M, e.g., L = M/2:
  - If the deleted item is in S, remove it from S; otherwise ignore the deletion.
  - If |S| = M/2, get a new S using another pass (this happens only if roughly half the items are deleted, and the cost is fully amortized).
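
A minimal sketch of the insertion-only part of reservoir sampling, assuming the data arrives as a Python iterable. The function name reservoir_sample and the rng parameter are choices made for this example; the deletion handling described on the slide is not implemented here.

```python
import random

def reservoir_sample(stream, M, rng=None):
    """Return a uniform random sample of up to M items from a single pass."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):   # n = items seen so far (N)
        if n <= M:
            # The first M items fill the reservoir.
            reservoir.append(item)
        elif rng.random() < M / n:
            # Keep the new item with probability M/N and evict a
            # uniformly chosen current member to make room.
            reservoir[rng.randrange(M)] = item
    return reservoir

sample = reservoir_sample(range(1000), M=10, rng=random.Random(1))
print(len(sample))
```

If the stream holds fewer than M items, the function simply returns all of them.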

Problems with joins
Joining uniform random samples gives:
- Non-uniform result samples
- Small join result sizes

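Join(Samples) != Sample(Join)
[Diagram: tuples a and b in R.X; tuples a1, a2, b1, b2 in S.X]
- Join result = {a1, a2, b1, b2}
- Probability for a base tuple to be selected = 1/r
- Prob[select a1 and a2] = 1/r^3
- Prob[select a1 and b1] = 1/r^4
- The two pairs are not equally likely, so the join of the samples is not a uniform sample of the join.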

Small Results for Join(Samples)
- Foreign key join of R and S (R ⋈ S): join result size = |R|.
- A 1% sample from both R and S ⇒ a 0.01% sample of the join result!!
- Each tuple from sample(R) joins with a single tuple from S, and the probability that this tuple is kept in the sample of S is only 1%!
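
The size gap can be checked with a small simulation. This is a sketch on synthetic data: the relation sizes, the seed, and the variable names are assumptions for the example, with R holding a foreign key into S.

```python
import random

rng = random.Random(42)          # fixed seed, only for repeatability
num_S, num_R, sample_pct = 1_000, 100_000, 1

S_keys = list(range(num_S))
# Each R tuple is (id, foreign key into S).
R = [(i, rng.randrange(num_S)) for i in range(num_R)]

R_sample = rng.sample(R, num_R * sample_pct // 100)            # 1% of R
S_sample = set(rng.sample(S_keys, num_S * sample_pct // 100))  # 1% of S

# Join(Samples): a sampled R tuple survives only if its S partner
# also happens to be in the sample of S.
join_of_samples = [(rid, fk) for rid, fk in R_sample if fk in S_sample]

# Sample(Join): every R tuple joins with exactly one S tuple, so joining
# sample(R) with all of S keeps every sampled R tuple.
sample_of_join_size = len(R_sample)

print(len(join_of_samples), sample_of_join_size)
```

With these sizes the join of the two samples is expected to hold about 10 tuples (0.01% of |R|), against 1,000 tuples for a 1% sample of the join itself.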

Join Samples: Assumptions
- Foreign key joins
- Acyclic data warehouse
- A two-way join r1 ⋈ r2 is a foreign key join if the join attribute is a foreign key in r1 and a key in r2.
- For k > 2, a k-way join is a k-way foreign key join if there is an ordering r1, r2, ..., rk such that, for i = 2, ..., k, if s(i-1) is the relation obtained by joining r1, r2, ..., r(i-1), then s(i-1) ⋈ ri is a two-way foreign key join.

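Join Sample: TPC-D schema
[Diagram: schema graph over the TPC-D relations L (Lineitem), PS (PartSupp), S (Supplier), N (Nation), R (Region), C (Customer), O (Orders), P (Part)]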

Join Sample (cont.)
[Diagram: graph G over the relations L, O, PS, P, C, S, N, R]
Lemma 1: The subgraph of G on the k nodes in any k-way foreign key join must be a connected graph with a single root node.

Join Sample (cont.)
[Diagram: graph G over the relations L, O, PS, P, C, S, N, R]
Lemma 2: There is a 1-1 correspondence between the tuples in a relation r1 and the tuples in any k-way foreign key join whose source relation is r1.

Join Samples: Key Observations
[Diagram: chain of FK-joins from the "source relation" R1 through ... Rk]
- One-to-one correspondence between tuples in the source relation and those in the result of the chain of FK-joins.
- Sample(R1) joined with R2, ..., Rk = Sample(FK-join chain).
- To get a sample of a sub-chain of FK-joins "rooted" at the source, just project away the irrelevant attributes!
- Join synopses = the set of such sample joins for every source and maximal FK-join chain in the schema!
- Can be used to answer ANY FK-join query over the given schema!
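
The observations above can be sketched on a toy foreign-key chain. Everything in this example is hypothetical: the three relations (lineitem referencing orders referencing customer), their attribute names, the 10% sampling rate, and the helper join_synopsis are made up for illustration and are not taken from the Aqua implementation.

```python
import random

# Hypothetical toy schema: lineitem -> orders -> customer (FK chain).
customer = {c: {"c_nation": f"N{c % 3}"} for c in range(5)}
orders = {o: {"o_cust": o % 5, "o_priority": o % 2} for o in range(20)}
lineitem = [{"l_id": i, "l_order": i % 20, "l_qty": 1 + i % 7}
            for i in range(200)]

def join_synopsis(source_sample):
    """Join each sampled source tuple with the FULL referenced relations."""
    rows = []
    for l in source_sample:
        o = orders[l["l_order"]]       # exactly one matching order
        c = customer[o["o_cust"]]      # exactly one matching customer
        rows.append({**l, **o, **c})
    return rows

rng = random.Random(7)                 # fixed seed, only for repeatability
sample = rng.sample(lineitem, 20)      # 10% uniform sample of the source
synopsis = join_synopsis(sample)       # uniform sample of the 3-way join

# A sample of the sub-chain lineitem-orders: project away customer columns.
sub_chain = [{k: v for k, v in row.items() if not k.startswith("c_")}
             for row in synopsis]

# Approximate SUM(l_qty) over the join for orders with o_priority = 1.
scale = len(lineitem) / len(sample)
estimate = scale * sum(r["l_qty"] for r in synopsis if r["o_priority"] == 1)
exact = sum(l["l_qty"] for l in lineitem
            if orders[l["l_order"]]["o_priority"] == 1)
print(estimate, exact)
```

Because each source tuple joins with exactly one tuple further along the chain, the synopsis has as many rows as the source sample, which is the one-to-one correspondence stated above.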

Join Samples: Maintenance
- On adding a new tuple t to a base relation u:
  - Determine whether t is added to S(u).
  - If yes, add to J(S(u)) the join of t, r2, r3, ..., rk.
  - If S(u) then exceeds its size limit, evict one tuple from S(u) and its corresponding join tuple from J(S(u)).
- On deleting a tuple from S(u), delete its corresponding entry in the join sample.
- If the sample becomes too small, resample.
- To reduce space, renormalize the join sample.
- Maximum tuples = k|S(u)|

Sampling: Confidence Intervals

Method                     90% Confidence Interval (±)      Guarantees?
Central Limit Theorem      1.65 * σ(S) / sqrt(|S|)          as |S| → ∞
Hoeffding                  1.22 * (MAX-MIN) / sqrt(|S|)     always
Chebychev (known σ(R))     3.16 * σ(R) / sqrt(|S|)          always
Chebychev (est. σ(R))      3.16 * σ(S) / sqrt(|S|)          as σ(S) → σ(R)

- Aqua uses Hoeffding-based upper bounds: guaranteed bounds, unlike the heuristic bounds based on the Central Limit Theorem.
- Count the number of tuples in a relation.
- For average/sum aggregate attributes, MAX and MIN are stored.
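
The constants in the table can be reproduced from the standard bounds at a 10% failure probability: sqrt(10) for Chebychev, sqrt(ln(20)/2) for Hoeffding, and the normal quantile 1.645 for the Central Limit Theorem. The sketch below computes the three sample-based half-widths for an estimated average. The function name, the use of the sample standard deviation for σ(S), and the example values are assumptions made for illustration.

```python
import math
import statistics

CHEBYCHEV_K = math.sqrt(10)                # 1/k^2 = 0.10  -> k  ~ 3.16
HOEFFDING_K = math.sqrt(math.log(20) / 2)  # 2*exp(-2c^2) = 0.10 -> c ~ 1.22
CLT_Z = 1.645                              # two-sided 90% normal quantile

def ci_half_widths_90(sample, min_value, max_value):
    """90% confidence half-widths for the mean, one per method."""
    root_n = math.sqrt(len(sample))
    sigma_s = statistics.stdev(sample)     # sigma(S), estimated from S
    return {
        "chebychev_est": CHEBYCHEV_K * sigma_s / root_n,
        "hoeffding": HOEFFDING_K * (max_value - min_value) / root_n,
        "clt": CLT_Z * sigma_s / root_n,
    }

# Made-up sample values; MIN and MAX are the stored attribute bounds.
widths = ci_half_widths_90([10, 20, 30, 40], min_value=0, max_value=50)
for method, half_width in widths.items():
    print(f"{method}: +/- {half_width:.2f}")
```

The Hoeffding interval needs only the stored MAX and MIN, which is why the synopses keep them; it is usually the widest of the three but holds for any sample size.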

Experimental Results: Computing aggregates on base relations

Experimental Results: Computing aggregates on joins

Experimental Results: Execution time on joins

Experimental Results: Maintenance, result accuracy

Experimental Results: Maintenance cost

Related Work
- Approximate query processing systems
- Statistical techniques
- Data reduction techniques
- Sample-based estimation algorithms
- Maintenance algorithms

Conclusion
- Approximate query answering is becoming extremely important in new applications of data warehouses.
- Aqua is the first system to provide fast approximate answers to a broad class of queries.
- Does not access the base data.
- Schemes based on join synopses provide better answers than those based only on samples of the base relations.
- Guaranteed analytical bounds.
- Incremental maintenance technique.