1 Improving the Accuracy of Continuous Aggregates & Mining Queries Under Load Shedding
Yan-Nei Law* and Carlo Zaniolo
Computer Science Dept., UCLA; *Bioinformatics Institute, Singapore
Samples in more complex query graphs

2 Continuous Aggregate Queries on Data Streams: Sampling & Load Shedding
Only random samples are available for computing aggregate queries because of:
- Limitations of remote sensors or transmission lines
- Load-shedding policies implemented when overloads occur
When overloads occur (e.g., due to a burst of arrivals) we can either
1. drop queries altogether, or
2. sample the input (much preferable).
Key objective: achieve answer accuracy with sparse samples for complex aggregates on windows. Can we improve answer accuracy with minimal overhead?
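As a rough illustration (ours, not from the slides; the stream, tuple format, and rate are hypothetical), a load shedder that samples the input at rate P can be sketched as a Bernoulli filter:

import random

def load_shed(stream, p):
    # Bernoulli load shedder: keep each tuple independently with probability p
    # (the sampling rate); the remaining tuples are dropped.
    for tup in stream:
        if random.random() < p:
            yield tup

# Usage (hypothetical): an unbiased SUM estimate from the surviving sample is
#   sum(load_shed(values, 0.2)) / 0.2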

3 General Architecture
Basic idea:
- Optimize the sampling rates of the load shedders for accurate answers.
Previous work [BDM04]:
- Find an error bound for each aggregate query.
- Determine sampling rates that minimize query inaccuracy within the limits imposed by resource constraints.
- Works only for SUM and COUNT.
- No error model provided.
[Diagram: data streams S_1 ... S_n pass through load shedders into a query network of aggregate query operators (∑)]

4 A New Approach
Correlation between answers at different points in time
- Example: sensor data [VAA04, AVA04]
Objective: the current answer can be adjusted by the past answers so that:
- Low sampling rate → current answer less accurate → more dependent on history.
- High sampling rate → current answer more accurate → less dependent on history.
We propose a Bayesian quality-enhancement module that achieves this objective automatically and reduces the uncertainty of the approximate answers.
A larger class of queries will be considered:
- SUM, COUNT, AVG, quantiles.

5 Our Model
The observed answer à is computed from random samples of the complete stream with sampling rate P.
We propose a Bayesian method to obtain the improved answer by combining:
- the observed answer
- the error model
- the history of the answer
[Diagram: data streams S_1 ... S_n pass through load shedders into the query network of aggregate operators (∑); each observed answer à and its sampling rate P feed the quality enhancement module, which uses the answer history to produce the improved answer]

6 Error Model of the Aggregate Answers
à – approximate answer obtained by a random sample with sampling rate P.
Key result: error model for SUM, COUNT, AVG, quantiles:
- SUM:
- COUNT:
- AVG:
- p-th quantile [B86]:
where F is the cumulative distribution and f = F′ is the density function.
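The formula images did not survive the transcript. As a hedged placeholder (ours, not necessarily the exact expressions on the original slide), the standard large-sample approximations for Bernoulli sampling at rate P over a window of N values v_1, ..., v_N with mean μ and variance σ² read roughly:

\[
\widetilde{\mathrm{SUM}} \sim N\!\Big(\textstyle\sum_i v_i,\; \tfrac{1-P}{P}\sum_i v_i^2\Big),
\qquad
\widetilde{\mathrm{COUNT}} \sim N\!\Big(N,\; \tfrac{(1-P)N}{P}\Big),
\]
\[
\widetilde{\mathrm{AVG}} \approx N\!\Big(\mu,\; \tfrac{1-P}{PN}\,\sigma^2\Big),
\qquad
\widetilde{Q}_p \approx N\!\Big(F^{-1}(p),\; \tfrac{p(1-p)}{PN\,f(F^{-1}(p))^2}\Big),
\]

with PN the expected sample size; the quantile result is the usual asymptotic normality of sample quantiles.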

7 Use of Error Model
- Derive accuracy estimates for a larger class of queries, for optimizing the load-shedding policy.
  Idea: minimize the variance of each query.
- Enhance the quality of the query answer on the basis of statistical information derived from the past.

8 Learning the Prior Distribution from the Past
Statistical information on the answers:
- Spatial – e.g., readings from the neighbors.
- Temporal – e.g., the past answers {x_i}.
Model the distribution of the answer by a normal distribution:
- By MLE, pdf ~ N(μ_s, σ_s²) where μ_s = Σ x_i / n and σ_s² = Σ (x_i − μ_s)² / n; only μ_s and σ_s need to be stored.
- Requires only a minimal amount of computation time.
- Assumes that there is no concept change.
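A minimal sketch (names are ours, not from the slides) of maintaining the prior's μ_s and σ_s incrementally from the past answers, so only a few running scalars are stored:

class NormalPrior:
    # Running MLE estimates of the prior N(mu_s, sigma_s^2) learned from the
    # past answers x_i; stores only a count and two running sums.
    def __init__(self):
        self.n, self.s, self.s2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        self.s += x
        self.s2 += x * x

    def mu(self):
        return self.s / self.n

    def var(self):
        m = self.mu()
        return self.s2 / self.n - m * m   # MLE (biased) variance, as on the slide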

9 Observations
- Reduced uncertainty: a small σ_s gives a posterior spread σ_t much smaller than the observation error σ (σ_t << σ).
- Compromise between the prior and the observed answer:
  - large σ → less accurate à → more dependent on μ_s
  - small σ → more accurate à → less dependent on μ_s
  - an uncertain prior (i.e., large σ_s) will not have much effect on the improved answer.
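These observations match the standard normal-normal Bayesian update; a sketch of that update (ours, assuming the prior N(μ_s, σ_s²) from the previous slide and an observation à with error standard deviation σ from the error model):

def bayes_improve(a_obs, sigma, mu_s, sigma_s):
    # Combine the observed answer a_obs (error std sigma) with the prior
    # N(mu_s, sigma_s^2); returns the improved answer and its std sigma_t.
    w = sigma_s**2 / (sigma_s**2 + sigma**2)        # weight on the observation
    mu_t = w * a_obs + (1.0 - w) * mu_s             # improved (posterior) answer
    var_t = (sigma_s**2 * sigma**2) / (sigma_s**2 + sigma**2)
    return mu_t, var_t**0.5

A large σ drives the weight w toward 0, so the answer leans on μ_s; a vague prior (large σ_s) drives w toward 1, leaving à almost unchanged; and σ_t is always smaller than both σ and σ_s, which is the "reduced uncertainty" bullet.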

10 Generalizing to Mining Functions: K-means is a Generalized AVG Query
[Plots: relative error for the first mean and for the second mean]
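A short sketch (ours) of why this generalization works: each k-means centroid coordinate is just an AVG over the points currently assigned to that cluster, so the AVG error model and the Bayesian correction above can be applied to every centroid computed from the sampled stream.

import numpy as np

def cluster_means(points, assignment, k):
    # points: (n, d) array of sampled tuples; assignment: (n,) array of cluster ids.
    # Each centroid is a per-cluster AVG, i.e., a generalized AVG query.
    return np.array([points[assignment == j].mean(axis=0) for j in range(k)])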

11 Quantiles (dataset with concept drifts)
Average relative error for every quantile:
p-th Quantile | Approximate | Posterior
20%           | 3.30        | 2.67
40%           | 1.59        | 1.56
60%           | 0.30        | 0.20
80%           | 0.21        | 0.13

12 Changing Distribution
- Corrections are effective for distributions besides normal ones.
- Changing distributions (a.k.a. concept changes) can be easily detected; we used a two-sample test.
- The old prior is then dropped and a new prior is constructed.
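The slides do not say which two-sample test was used; as one common choice (an assumption, with hypothetical window arguments), a Kolmogorov-Smirnov test over an older and a recent window of answers could be sketched as:

from scipy.stats import ks_2samp

def concept_changed(old_answers, recent_answers, alpha=0.05):
    # Two-sample KS test; a small p-value suggests the answer distribution has
    # shifted, after which the old prior is dropped and a new one accumulated.
    stat, p_value = ks_2samp(old_answers, recent_answers)
    return p_value < alpha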

13 Minimal Overheads
Computation costs introduced by:
- Calculating the posterior distribution
- Detecting changes
Time (in ms) for each query:
Query    | Approximate | Posterior
SUM      | 2           | 0
Quantile | 1230        | 1240
K-means  | 960         | 990

14 Summary
Proposed a Bayesian quality-enhancement method for approximate aggregates in the presence of sampling.
Our method:
- works for ordered statistics and data mining functions as well as traditional aggregates, and
- handles concept changes in the data streams.

15 Sampling rate = 20%
p-th Quantile | Approximate | Posterior
20%           | 0.682674    | 0.680351
40%           | 0.564766    | 0.408557
60%           | 0.083705    | 0.080721
80%           | 0.059469    | 0.061146

