Download presentation
Presentation is loading. Please wait.
1
AQUA: Approximate Query Answering
Gibbons et al. Information Sciences Research Center Bell Laboratories
2
Motivation Long Response Times! SQL Query Exact Answer
Decision Support Systems (DSS) SQL Query Exact Answer Long Response Times! Exact answers NOT always required DSS applications usually exploratory: early feedback to help identify “interesting” regions Aggregate queries: precision to “last decimal” not needed e.g., “What percentage of the US sales are in NJ?” (display as bar graph) Preview answers while waiting. Trial queries Base data can be remote or unavailable
3
Fast Approximate Answers
Primarily for Aggregate queries Goal is to quickly report the approximate estimated answer leading digits of answers In seconds instead of minutes or hours Most useful if can provide error guarantees E.g., Average salary $59,000 +/- $500 (with 95% confidence) in 10 seconds vs. $59, in 10 minutes Achieved by answering the query based on samples of the data or online (in phases)
4
What is an approximate answer?
For aggregate queries like sum, avg: < estimated value, accuracy measure> accuracy measure= confidence interval, confidence probability $59,000 +/- $500 (with 95% confidence) can be guaranteed/heuristic based For Set Value queries <representative tuples, metadata about the complete data> Representative tuples Certain or possible Randomly selected Biased selected Arbitrary selected
5
Metrics Coverage: Range of queries Response time Accuracy Update time
Footprint: storage requirements for sample
6
Online Computation Online:
+ Continuous refinement of answers (online aggregation) + User control: what to refine, when to stop + Seeing the query is very helpful for fast approximate results + No maintenance overheads
7
Pre-Computation Construct & store synopses prior to query time
At query time, use synopses to answer the query Often faster: better access patterns, small synopses can reside in memory or cache Middleware: Can use with any DBMS, no special index striding Also effective for remote or streaming data Need to maintain synopses up-to-date
8
Aqua Architecture Network Picture without Aqua: User poses a query Q
SQL Query Q Q Data Warehouse (e.g., Oracle) Network Result HTML XML Browser Excel Warehouse Data Updates Picture without Aqua: User poses a query Q Data Warehouse executes Q and returns result Warehouse is periodically updated with new data
9
Aqua Architecture Picture with Aqua: Network
Aqua is middleware, between the user and the warehouse Aqua Synopses are stored in the warehouse Aqua intercepts the user query and rewrites it to be a query Q’ on the synopses. Data warehouse returns approximate answer SQL Query Q Rewriter Data Warehouse (e.g., Oracle) Q’ Network Result (w/ error bounds) HTML XML Browser Excel Warehouse Data Updates AQUA Synopses AQUA Tracker
10
Aqua: Key Features Maintains a number of synopses on the data
Updates these synopses primarily by observing the new data Provides discrete reporting Ensures guaranteed accuracy measures Have a footprint orders of magnitude smaller than the data warehouse
11
Synopses For each relation store the counts of tuples
Smaller relations – all tuples are stored Other relations- random tuples based on biased sampling. For each attribute in aggregate query- MAX and MIN For each tuple , only interested attributes are stored Incremental maintenance Batch mode Online maintenance – too complex. Counts, Max and Min updated as new tuples are added/deleted
12
Est. count = 5*2 = 10, Exact count = 10
Sampling: Basics Idea: A small random sample S of the data often well-represents all the data For a fast approx answer, apply the query to S & “scale” the result E.g., R.a is {0,1}, S is a 20% sample select count(*) from R where R.a = 0 select 5 * count(*) from S where S.a = 0 R.a Red = in S Est. count = 5*2 = 10, Exact count = 10 Unbiased: For expressions involving count, sum, avg: the estimator is unbiased, i.e., the expected value of the answer is the actual answer, even for (most) queries with predicates! Leverage extensive literature on confidence intervals for sampling Actual answer is within the interval [a,b] with a given probability E.g., 54,000 ± 600 with prob 90%
13
One-Pass Uniform Sampling
Reservoir Sampling : Maintains a sample S of a fixed-size M creates a "reservoir" sample of size M and populates it with the first M items of R Add each new item to S with probability M/N, where N is the current number of data items If add an item, evict a random item from S To handle deletions, permit |S| to drop to L < M, e.g., L = M/2 remove from S if deleted item is in S, else ignore If |S| = M/2, get a new S using another pass (happens only if delete roughly half the items & cost is fully amortized)
14
Problems with joins Uniform random samples provide:
Non uniform result samples Small join results sizes
15
Join(Samples) != Sample(Join)
R.X a b a1 a2 b1 S.X b2 Join result = {a1, a2, b1, b2} Probability for a base tuple to be selected = 1/r Prob[select a1 and a2] = 1/r^3 Prob[select a1 and b1] = 1/r^4
16
Small Results for Join(samples)
Foreign key join of R and S (RS) Join result size = |R| 1% sample from both R and S 0.01% sample from the join result!! Each tuple from sample(R) joins with a single tuple from S Probability that tuple is kept is only 1% !
17
Join Samples Assumptions Foreign Key Joins Acyclic data warehouse
Two way join r1 r2 is a foreign key join if the join attribute : is a foreign key in r1 a key in r2 For k > 2, a k-way foreign join there is an ordering r1,r2..rk and for j = 1,2,.. K, If si-1 is a relation obtained joining r1, r2, … ri-1 then si-1 ri is a 2-way foreign join where
18
Join Sample TPC-D scheme L PS S N R C O P
19
Join sample (Cont..) L O PS P C S N R Lemma 1:
The subgraph of G on the k nodes in any k - way foreign key join must be a connected graph with a single root node
20
Join sample (Cont..) L O PS P C S N R Lemma 2:
There is a 1-1 correspondence between tuples in a relation r1 and tuple in any k-way foreign key join.
21
Join Samples: Key Observations
… Rk “Source relation” One-to-one correspondence between tuples in source relation and those in result of chain of FK-joins Sample(R1) joined with R2, …, Rk = sample(FK-join chain) To get a sample of a sub-chain of FK-joins “rooted” at source, just project away irrelevant attributes! Join synopses = set of such sample joins for every source and maximal FK-join-chain in the schema! Can be used to answer ANY FK-join query over the given schema!
22
Join Samples: Maintenance
On adding a new tuple t to a base relation u, we do the following: Determine if t is added to s(u) If yes, we add to J(S(u)) the join of t,r2,r3,…rk If size of S(u) increases we evict one tuple from S(u) and its corresponding join from J(S(u). On deleting any tuple from S(u) we delete it corresponding entry in the join sample. If the size becomes too , resample. To reduce space, renormalize the join sample. Maximum tuples = k|S(u)|
23
Sampling: Confidence Intervals
Guarantees? 90% Confidence Interval (±) Method as (S) (R) 3.16 * (S) / sqrt(|S|) Chebychev (est. (R)) always 3.16 * (R) / sqrt(|S|) Chebychev (known (R)) 1.22 * (MAX-MIN) / sqrt(|S|) Hoeffding as |S| 1.65 * (S) / sqrt(|S|) Central Limit Theorem Aqua uses Hoeffding based upper bounds: Guranteed bounds unlike heuristic bounds based on central limit theorem Count number of tuples in a Relation For average/sum aggregate attributes, MAX and MIN stored.
24
Experiments Results Computing aggregates on base relations
25
Experiments Results Computing aggregates on Joins
26
Experiment Results Execution time on Joins
27
Experiments Results Maintenance – Result accuracy
28
Experiments Results Maintenance Cost
29
Related Work Approximate Query processing systems
Statistical techniques Data reduction techniques Sample based estimation algorithms Maintenance algorithms
30
Conclusion Approximate answering is becoming extremely important in new application of data warehouses Aqua first system to provide fast approximate answers to a broad class of queries Does not access base data Schemas based on join synopses provide better answer than those, based only on the basic relations samples. Guaranteed analytical bounds Incremental maintenance technique
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.