1 Improving the Accuracy of Continuous Aggregates & Mining Queries Under Load Shedding
Yan-Nei Law* and Carlo Zaniolo
Computer Science Dept., UCLA; *Bioinformatics Institute, Singapore
Samples in more complex query graphs

2 Continuous Aggregate Queries on Data Streams: Sampling & Load Shedding
Only random samples are available for computing aggregate queries because of:
- Limitations of remote sensors or transmission lines
- Load-shedding policies implemented when overloads occur
When overloads occur (e.g., due to a burst of arrivals) we can either
1. drop queries altogether, or
2. sample the input (much preferable).
Key objective: achieve answer accuracy with sparse samples for complex aggregates on windows. Can we improve answer accuracy with minimal overhead?
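As a rough illustration (ours, not from the slides; the stream, tuple format, and rate are hypothetical), a load shedder that samples the input at rate P can be sketched as a Bernoulli filter:

import random

def load_shed(stream, p):
    # Bernoulli load shedder: keep each tuple independently with probability p
    # (the sampling rate); the remaining tuples are dropped.
    for tup in stream:
        if random.random() < p:
            yield tup

# Usage (hypothetical): an unbiased SUM estimate from the surviving sample is
#   sum(load_shed(values, 0.2)) / 0.2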

3 General Architecture
Basic idea:
- Optimize the sampling rates of the load shedders for accurate answers.
Previous work [BDM04]:
- Find an error bound for each aggregate query.
- Determine sampling rates that minimize query inaccuracy within the limits imposed by resource constraints.
- Works only for SUM and COUNT.
- No error model provided.
[Diagram: data streams S_1 ... S_n pass through load shedders into a query network of aggregate query operators (∑)]

4 A New Approach
Correlation between answers at different points in time
- Example: sensor data [VAA04, AVA04]
Objective: the current answer can be adjusted by the past answers so that:
- Low sampling rate → current answer less accurate → more dependent on history.
- High sampling rate → current answer more accurate → less dependent on history.
We propose a Bayesian quality-enhancement module that achieves this objective automatically and reduces the uncertainty of the approximate answers.
A larger class of queries will be considered:
- SUM, COUNT, AVG, quantiles.

5 Our Model
The observed answer à is computed from random samples of the complete stream with sampling rate P.
We propose a Bayesian method to obtain the improved answer by combining:
- the observed answer
- the error model
- the history of the answer
[Diagram: data streams S_1 ... S_n pass through load shedders into the query network of aggregate operators (∑); each observed answer à and its sampling rate P feed the quality enhancement module, which uses the answer history to produce the improved answer]

6 Error Model of the Aggregate Answers
à – approximate answer obtained by a random sample with sampling rate P.
Key result: error model for SUM, COUNT, AVG, quantiles:
- SUM:
- COUNT:
- AVG:
- p-th quantile [B86]:
where F is the cumulative distribution and f = F′ is the density function.
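The formula images did not survive the transcript. As a hedged placeholder (ours, not necessarily the exact expressions on the original slide), the standard large-sample approximations for Bernoulli sampling at rate P over a window of N values v_1, ..., v_N with mean μ and variance σ² read roughly:

\[
\widetilde{\mathrm{SUM}} \sim N\!\Big(\textstyle\sum_i v_i,\; \tfrac{1-P}{P}\sum_i v_i^2\Big),
\qquad
\widetilde{\mathrm{COUNT}} \sim N\!\Big(N,\; \tfrac{(1-P)N}{P}\Big),
\]
\[
\widetilde{\mathrm{AVG}} \approx N\!\Big(\mu,\; \tfrac{1-P}{PN}\,\sigma^2\Big),
\qquad
\widetilde{Q}_p \approx N\!\Big(F^{-1}(p),\; \tfrac{p(1-p)}{PN\,f(F^{-1}(p))^2}\Big),
\]

with PN the expected sample size; the quantile result is the usual asymptotic normality of sample quantiles.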

7 Use of Error Model
- Derive accuracy estimates for a larger class of queries, for optimizing the load-shedding policy.
  Idea: minimize the variance of each query.
- Enhance the quality of the query answer on the basis of statistical information derived from the past.

8 Learning the Prior Distribution from the Past
Statistical information on the answers:
- Spatial – e.g., readings from the neighbors.
- Temporal – e.g., the past answers {x_i}.
Model the distribution of the answer by a normal distribution:
- By MLE, pdf ~ N(μ_s, σ_s²) where μ_s = Σ x_i / n and σ_s² = Σ (x_i − μ_s)² / n; only μ_s and σ_s need to be stored.
- Requires only a minimal amount of computation time.
- Assumes that there is no concept change.
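A minimal sketch (names are ours, not from the slides) of maintaining the prior's μ_s and σ_s incrementally from the past answers, so only a few running scalars are stored:

class NormalPrior:
    # Running MLE estimates of the prior N(mu_s, sigma_s^2) learned from the
    # past answers x_i; stores only a count and two running sums.
    def __init__(self):
        self.n, self.s, self.s2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        self.s += x
        self.s2 += x * x

    def mu(self):
        return self.s / self.n

    def var(self):
        m = self.mu()
        return self.s2 / self.n - m * m   # MLE (biased) variance, as on the slide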

9 Observations
- Reduced uncertainty: a small σ_s gives a posterior spread σ_t much smaller than the observation error σ (σ_t << σ).
- Compromise between the prior and the observed answer:
  - large σ → less accurate à → more dependent on μ_s
  - small σ → more accurate à → less dependent on μ_s
  - an uncertain prior (i.e., large σ_s) will not have much effect on the improved answer.
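These observations match the standard normal-normal Bayesian update; a sketch of that update (ours, assuming the prior N(μ_s, σ_s²) from the previous slide and an observation à with error standard deviation σ from the error model):

def bayes_improve(a_obs, sigma, mu_s, sigma_s):
    # Combine the observed answer a_obs (error std sigma) with the prior
    # N(mu_s, sigma_s^2); returns the improved answer and its std sigma_t.
    w = sigma_s**2 / (sigma_s**2 + sigma**2)        # weight on the observation
    mu_t = w * a_obs + (1.0 - w) * mu_s             # improved (posterior) answer
    var_t = (sigma_s**2 * sigma**2) / (sigma_s**2 + sigma**2)
    return mu_t, var_t**0.5

A large σ drives the weight w toward 0, so the answer leans on μ_s; a vague prior (large σ_s) drives w toward 1, leaving à almost unchanged; and σ_t is always smaller than both σ and σ_s, which is the "reduced uncertainty" bullet.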

10 Generalizing to Mining Functions: K-means is a Generalized AVG Query
[Plots: relative error for the first mean and for the second mean]
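A short sketch (ours) of why this generalization works: each k-means centroid coordinate is just an AVG over the points currently assigned to that cluster, so the AVG error model and the Bayesian correction above can be applied to every centroid computed from the sampled stream.

import numpy as np

def cluster_means(points, assignment, k):
    # points: (n, d) array of sampled tuples; assignment: (n,) array of cluster ids.
    # Each centroid is a per-cluster AVG, i.e., a generalized AVG query.
    return np.array([points[assignment == j].mean(axis=0) for j in range(k)])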

11 Quantiles (dataset with concept drifts)
Average relative error for every quantile:
p-th Quantile | Approximate | Posterior
20%           | 3.30        | 2.67
40%           | 1.59        | 1.56
60%           | 0.30        | 0.20
80%           | 0.21        | 0.13

12 Changing Distribution
- Corrections are effective for distributions besides normal ones.
- Changing distributions (a.k.a. concept changes) can be easily detected; we used a two-sample test.
- The old prior is then dropped and a new prior is constructed.
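The slides do not say which two-sample test was used; as one common choice (an assumption, with hypothetical window arguments), a Kolmogorov-Smirnov test over an older and a recent window of answers could be sketched as:

from scipy.stats import ks_2samp

def concept_changed(old_answers, recent_answers, alpha=0.05):
    # Two-sample KS test; a small p-value suggests the answer distribution has
    # shifted, after which the old prior is dropped and a new one accumulated.
    stat, p_value = ks_2samp(old_answers, recent_answers)
    return p_value < alpha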

13 Minimal Overheads
Computation costs introduced by:
- Calculating the posterior distribution
- Detecting changes
Time (in ms) for each query:
Query    | Approximate | Posterior
SUM      | 2           | 0
Quantile | 1230        | 1240
K-means  | 960         | 990

14 Summary
Proposed a Bayesian quality-enhancement method for approximate aggregates in the presence of sampling.
Our method:
- works for ordered statistics and data mining functions as well as traditional aggregates, and
- handles concept changes in the data streams.

15 Sampling rate = 20%
p-th Quantile | Approximate | Posterior
20%           | 0.682674    | 0.680351
40%           | 0.564766    | 0.408557
60%           | 0.083705    | 0.080721
80%           | 0.059469    | 0.061146

