Published by Rosalyn Waters. Modified over 9 years ago.
1
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries
By: Surajit Chaudhuri, Gautam Das, Vivek Narasayya
Presented by: Sayed Muchallil, September 21st, 2010
2
CONTENTS
1. INTRODUCTION
2. ARCHITECTURE FOR APPROXIMATE QUERY PROCESSING
3. FIXED WORKLOAD
4. STRATIFIED SAMPLING
5. SOLUTION
6. SUMMARY
3
Pre-computed samples can give approximate answers very efficiently. A workload is used to make sure that the errors are acceptable.
4
Previous Studies
The solutions are difficult to evaluate theoretically.
They do not formally deal with uncertainty in the expected workload.
They ignore the variance in the data distribution.
5
Sample
Table R (ProductID, Revenue): (1, 10), (2, 10), (3, 10), (4, 1000)
Only 50% of the records in R can be used as a sample.
Query: "SELECT SUM(Revenue) FROM R"
The exact answer is 1030.
6
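Sample (cont.)
Sample table S1: two records with Revenue 10 each.
Sample table S2 (ProductID, Revenue): (1, 10), (4, 1000)
The answer to the query is 40 for S1 and 2020 for S2.
How are these answers obtained? Each sample sum is scaled by the inverse of the sampling fraction (1 / 0.5 = 2): 2 × 20 = 40 and 2 × 1010 = 2020.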
7
Sample (cont.)
A large variance in the aggregate column can lead to large relative errors.
Relative error = |y - y'| / y, where y is the exact answer and y' is the estimate.
Relative error for S1 = |1030 – 40| / 1030 ≈ 0.96
Relative error for S2 = |1030 – 2020| / 1030 ≈ 0.96
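The arithmetic on this slide can be replayed with a minimal Python sketch. It assumes that products 2 and 3 each have Revenue 10 (consistent with the stated total of 1030) and that S1 holds two Revenue-10 records; the function names are illustrative only.

```python
# Table R from the example: ProductID -> Revenue.
# The values for products 2 and 3 are an assumption consistent with SUM = 1030.
R = {1: 10, 2: 10, 3: 10, 4: 1000}

def estimate_sum(sample_revenues, sampling_fraction):
    """Scale the sample's SUM up by the inverse of the sampling fraction."""
    return sum(sample_revenues) / sampling_fraction

def relative_error(exact, estimate):
    """Relative error |y - y'| / y."""
    return abs(exact - estimate) / exact

exact = sum(R.values())                    # exact answer over R
est_s1 = estimate_sum([10, 10], 0.5)       # sample S1 (assumed contents)
est_s2 = estimate_sum([10, 1000], 0.5)     # sample S2
err_s1 = relative_error(exact, est_s1)
err_s2 = relative_error(exact, est_s2)
```

Both samples miss the exact answer by 990, one by underestimating and one by overestimating, which is why the two relative errors coincide.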
8
What's New?
The goal is to pick a sample that minimizes the error.
If the actual workload is identical to the given (fixed) workload, the error will be smaller.
The approach works for queries that are identical or similar to those in the given workload.
9
Sampling
Two ways of selecting samples:
– Randomized
– Deterministic
A workload W is a set of pairs of queries and their weights:
– W = {<Q_1, w_1>, <Q_2, w_2>, …, <Q_q, w_q>}
– Σ_i w_i = 1
10
Architecture for Approximate Query Processing
11
Architecture (cont.)
Offline component: selects a sample of records from relation R.
Online component: rewrites an incoming query to use the sample (what does "rewrite" mean?) and reports the answer with an estimate of the error.
12
Architecture (cont.)
A new method for automatically lifting a given workload.
It is unrealistic to assume that incoming queries will be identical to those in the given workload.
The key: the ability to compute a probability distribution p_W over incoming queries.
13
Error Metrics
Relative error: |y - y'| / y
Squared error: SE(Q) = (|y - y'| / y)²
Squared error for a GROUP BY query with g groups: SE(Q) = (1/g) Σ_i ((y_i – y_i') / y_i)²
Given a probability distribution of queries p_W:
Mean squared error for the distribution: MSE(p_W) = Σ_Q p_W(Q) · SE(Q)
Root mean squared error: RMSE(p_W) = √MSE(p_W)
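These metrics translate directly into code. The following Python sketch is illustrative; the function names and the representation of the distribution as (probability, squared error) pairs are assumptions, not part of the slides.

```python
import math

def squared_error(y, y_est):
    """SE(Q) for a single-answer query: squared relative error."""
    return ((y - y_est) / y) ** 2

def squared_error_group_by(ys, y_ests):
    """SE(Q) for a GROUP BY query: mean squared relative error over g groups."""
    g = len(ys)
    return sum(squared_error(y, ye) for y, ye in zip(ys, y_ests)) / g

def mse(dist):
    """MSE over a query distribution given as (probability, SE) pairs."""
    return sum(p * se for p, se in dist)

def rmse(dist):
    """Root mean squared error over the same distribution."""
    return math.sqrt(mse(dist))

# Two equally likely queries with squared errors 0.04 and 0.16.
example_dist = [(0.5, 0.04), (0.5, 0.16)]
example_mse = mse(example_dist)
example_rmse = rmse(example_dist)
```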
14
Fixed Workload
A special case: the incoming queries are identical to the queries in the given workload.
Problem: FIXEDSAMP
Input: R, W, k
Output: a sample of k records (with appropriate additional columns) such that MSE(W) is minimized.
15
Fundamental Regions
Relation R contains 9 records.
W consists of 2 queries:
Q1 = select records with C values between 10 and 50
Q2 = select records with C values between 40 and 70
These queries divide relation R into 4 fundamental regions.
16
Fundamental Regions (cont.)
17
Fundamental regions partition the records in R into a minimum number of regions R_1, R_2, …, R_r such that, for any region R_j, each query in W selects either all records in R_j or none.
Total number of fundamental regions: at most min(2^|W|, n), where n is the number of records in R.
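One way to picture the definition: two records fall in the same fundamental region exactly when the same subset of workload queries selects them. The Python sketch below groups records by that "signature". The nine C values are hypothetical (the slides do not list them), and the predicates mirror Q1 and Q2 from the earlier slide.

```python
from collections import defaultdict

def fundamental_regions(records, queries):
    """Group records by the tuple of booleans saying which queries select them."""
    regions = defaultdict(list)
    for rec in records:
        signature = tuple(q(rec) for q in queries)
        regions[signature].append(rec)
    return dict(regions)

# Hypothetical C values for the 9 records of R.
c_values = [5, 15, 25, 35, 45, 55, 65, 75, 85]

def q1(c):
    return 10 <= c <= 50   # Q1: C between 10 and 50

def q2(c):
    return 40 <= c <= 70   # Q2: C between 40 and 70

regions = fundamental_regions(c_values, [q1, q2])
```

With these values the signatures are: neither query, Q1 only, both, and Q2 only, giving 4 regions, which is within the bound min(2^|W|, n) = min(4, 9).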
18
FIXEDSAMP Solution
Step 1. Identify the fundamental regions in R (r regions).
Step 2. Pick sample records (two cases: r <= k and r > k).
Step 3. Assign values to the additional columns.
19
LIFTING WORKLOAD TO QUERY DISTRIBUTION
For a query Q' that is not identical to any workload query, p_W(Q') is high if Q' is similar to queries in the workload and low otherwise.
Q' and Q are similar if the sets of records they select overlap significantly.
20
LIFTED WORKLOAD
p_{Q}(R') is the probability of occurrence of any query that selects exactly the set of records R'.
Let R_Q be the set of records selected by Q. For any given record inside (resp. outside) R_Q, the parameter δ (resp. γ) represents the probability that an incoming query will select this record.
21
LIFTED WORKLOAD (Cont.)
22
δ → 1 and γ → 0: incoming queries are identical to workload queries.
δ → 1 and γ → ½: incoming queries are supersets of workload queries.
δ → ½ and γ → 0: incoming queries are subsets of workload queries.
δ → ½ and γ → ½: incoming queries are unrestricted.
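The role of δ and γ can be made concrete with a small Python sketch. It assumes that each record is selected independently (with probability δ inside R_Q and γ outside), so the probability of selecting exactly R' is a product over all records; this is one reading of the definition above, and the function name is illustrative.

```python
from itertools import combinations

def lifted_probability(r_prime, r_q, relation, delta, gamma):
    """Probability that an incoming query selects exactly r_prime,
    assuming independent per-record selection (delta inside r_q, gamma outside)."""
    p = 1.0
    for rec in relation:
        prob_select = delta if rec in r_q else gamma
        p *= prob_select if rec in r_prime else (1.0 - prob_select)
    return p

relation = {1, 2, 3, 4}
r_q = {3, 4}  # records selected by the workload query Q

# Every subset of the relation is a possible R'.
all_subsets = [set(s) for size in range(len(relation) + 1)
               for s in combinations(sorted(relation), size)]
total = sum(lifted_probability(s, r_q, relation, 0.9, 0.1) for s in all_subsets)
```

With δ = 1 and γ = 0 all probability mass sits on R' = R_Q, matching the "identical to workload queries" case; moving γ toward ½ spreads mass onto supersets of R_Q.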
23
RATIONALE FOR STRATIFIED SAMPLING
A population is partitioned into multiple strata, and samples are selected uniformly from each stratum.
24
STRATIFIED SAMPLING
A stratified sampling scheme partitions R into r strata containing n_1, …, n_r records (where Σ n_j = n), with k_1, …, k_r records uniformly sampled from each stratum (where Σ k_j = k).
Table R (ProductID, Revenue): (1, 10), (2, 10), (3, 10), (4, 1000)
Q1 = SELECT COUNT(*) FROM R WHERE ProductID IN (3, 4);
POP_Q is the population of query Q: one value per record, 1 if Q selects the record and 0 otherwise.
POP_Q1 = {0, 0, 1, 1}, which has non-zero variance.
Dividing it into the two strata {0, 0} and {1, 1} leaves zero variance within each stratum.
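The benefit of the two strata can be checked by enumerating every possible sample. The Python sketch below compares uniform sampling of k = 2 records with stratified sampling of one record per stratum for the COUNT query Q1; the helper names are illustrative, and the estimator simply scales each stratum's sample count by n_j / k_j.

```python
from itertools import combinations, product

# POP_Q1: ProductID -> 1 if Q1 selects the record, else 0.
pop = {1: 0, 2: 0, 3: 1, 4: 1}

def uniform_estimates(pop, k):
    """All distinct COUNT estimates over uniform samples of size k."""
    n = len(pop)
    return sorted({sum(pop[i] for i in s) * n / k
                   for s in combinations(pop, k)})

def stratified_estimates(strata, pop, ks):
    """All distinct COUNT estimates when k_j records are drawn from stratum j."""
    per_stratum = [
        [sum(pop[i] for i in s) * len(stratum) / kj
         for s in combinations(stratum, kj)]
        for stratum, kj in zip(strata, ks)
    ]
    return sorted({sum(parts) for parts in product(*per_stratum)})

uniform = uniform_estimates(pop, 2)
stratified = stratified_estimates([[1, 2], [3, 4]], pop, [1, 1])
```

Uniform sampling can return 0, 2, or 4 depending on which records are drawn, while the stratified scheme returns the exact count 2 for every possible sample.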
25
SOLUTION FOR SINGLE-TABLE SELECTION QUERIES WITH AGGREGATION
Stratification: how many strata, and how many records in each stratum.
Allocation: determines how to divide k among the strata.
Sampling: forms the final sample of k records.
26
SOLUTION FOR COUNT AGGREGATE
Stratification (Lemma 1): r is not known in advance; divide R into fundamental regions and treat them as strata.
Allocation (Lemma 2): MSE(p_W) = Σ_i w_i · MSE(p_{Q_i}), that is, MSE(p_W) can be expressed as a weighted sum of the MSE of each query in the workload.
27
SOLUTION FOR COUNT AGGREGATE (Cont.)
For any Q ∈ W, we express MSE(p_{Q}) as a function of the k_j's.
Lemma 3: ApproxMSE(p_{Q}) = [formula shown on the slide; not preserved in this transcript]
28
SOLUTION FOR COUNT AGGREGATE (Cont.)
Since we have an (approximate) formula for MSE(p_{Q}), we can express MSE(p_W) as a function of the variables k_j.
Corollary 1: MSE(p_W) = Σ_j (α_j / k_j), where each α_j is a function of n_1, …, n_r, δ, and γ.
α_j captures the "importance" of a region; it is positively correlated with n_j as well as with the frequency of queries in the workload that access R_j.
Now we can minimize MSE(p_W).
29
SOLUTION FOR COUNT AGGREGATE (Cont.)
Lemma 4: Σ_j (α_j / k_j) is minimized subject to Σ_j k_j = k if k_j = k · (√α_j / Σ_i √α_i).
This provides a closed-form and computationally inexpensive solution to the allocation problem, since α_j depends only on δ, γ, and the number of tuples in each fundamental region.
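The closed form in Lemma 4 is short enough to sketch in Python. The α values below are made up for illustration (the slides do not give concrete numbers), and the result is fractional; rounding to integers is a separate step.

```python
import math

def allocate(alphas, k):
    """Lemma 4 allocation: k_j = k * sqrt(alpha_j) / sum_i sqrt(alpha_i)."""
    roots = [math.sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]

def objective(alphas, ks):
    """The quantity being minimized: sum_j alpha_j / k_j."""
    return sum(a / kj for a, kj in zip(alphas, ks))

alphas = [1.0, 4.0, 9.0]      # hypothetical importance of three regions
k = 12                        # total sample size
ks = allocate(alphas, k)      # square-root-proportional allocation
best = objective(alphas, ks)
equal = objective(alphas, [k / len(alphas)] * len(alphas))
```

In this example the square-root allocation gives a lower objective than splitting k equally across the three regions, as the lemma predicts.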
30
SOLUTION FOR SUM AGGREGATE
Stratification: bucketing technique. Divide fundamental regions with large variance into a set of finer regions, and treat each resulting region as a stratum.
Allocation: Y_j is the average (sum) of the aggregate column values of all records in region R_j.
31
SOLUTION FOR SUM AGGREGATE (Cont.)
Each value in the region can be approximated as y_j.
This yields an approximate formula for MSE(p_{Q}) for a SUM query Q in W.
32
Pragmatic Issues
Identifying fundamental regions
Handling a large number of fundamental regions
Obtaining an integer solution
Obtaining an unbiased error estimate
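For the integer-solution issue, one common option is largest-remainder rounding: floor every fractional k_j, then hand the leftover records to the strata with the largest fractional parts. The Python sketch below shows that technique; it is an illustrative choice, not necessarily the rounding scheme used in the paper, and it assumes the fractional allocation sums to k.

```python
import math

def round_allocation(fractional, k):
    """Largest-remainder rounding of a fractional allocation that sums to k."""
    floors = [math.floor(x) for x in fractional]
    leftover = k - sum(floors)
    # Strata ordered by decreasing fractional part.
    order = sorted(range(len(fractional)),
                   key=lambda j: fractional[j] - floors[j],
                   reverse=True)
    for j in order[:leftover]:
        floors[j] += 1
    return floors

rounded = round_allocation([2.4, 4.3, 3.3], 10)
```

The result always sums to k, and each stratum's size changes by less than one record relative to the fractional solution.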
33
STRAT ALGORITHM
34
IMPLEMENTATION AND EXPERIMENTAL RESULTS
The experiments compare the STRAT method with other methods:
USAMP – uniform random sampling
WSAMP – weighted sampling
OTLIDX – outlier indexing combined with weighted sampling
CONG – congressional sampling
35
COUNT AGGREGATE
36
SUM AGGREGATE
37
COUNT AGGREGATE
38
THANK YOU