Improving the Accuracy of Continuous Aggregates & Mining Queries Under Load Shedding
Yan-Nei Law* and Carlo Zaniolo, Computer Science Dept., UCLA; *Bioinformatics Institute, Singapore

Samples in more complex query graphs

Continuous Aggregate Queries on Data Streams: Sampling & Load Shedding. Only random samples are available for computing aggregate queries because of:
- limitations of remote sensors or transmission lines, or
- load-shedding policies implemented when overloads occur.
When overloads occur (e.g., due to a burst of arrivals) we can either (1) drop queries altogether, or (2) sample the input, which is much preferable.
Key objective: achieve answer accuracy with sparse samples for complex aggregates on windows. Can we improve answer accuracy with minimal overhead?
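The sampling alternative above can be sketched as a Bernoulli load shedder that keeps each tuple with probability P and rescales the result. This is a minimal illustration of the standard technique, not code from the paper; the function name is ours.

```python
import random

def sampled_sum(stream, p, seed=0):
    """Bernoulli-sample a stream at rate p and scale the total by 1/p,
    giving an unbiased estimate of the true SUM over the full stream."""
    rng = random.Random(seed)
    total = 0.0
    for v in stream:
        if rng.random() < p:  # keep each tuple with probability p
            total += v
    return total / p

# Estimate the sum of 1..10000 from a 20% sample; the true sum is 50005000.
est = sampled_sum(range(1, 10001), p=0.2)
```

With p = 1.0 the estimate is exact; as p shrinks, the variance of the estimate grows, which is precisely the accuracy loss the rest of the talk tries to recover.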

General Architecture. Basic idea: optimize the sampling rates of the load shedders for accurate answers.
Previous work [BDM04]:
- Find an error bound for each aggregate query.
- Determine sampling rates that minimize query inaccuracy within the limits imposed by resource constraints.
- Works only for SUM and COUNT.
- No error model provided.
[Diagram: data streams S_1 ... S_n pass through load shedders and the query network into aggregate query operators ∑.]

A New Approach. Exploit the correlation between answers at different points in time (example: sensor data [VAA04, AVA04]).
Objective: adjust the current answer using the past answers, so that:
- a low sampling rate makes the current answer less accurate, hence more dependent on history;
- a high sampling rate makes the current answer more accurate, hence less dependent on history.
We propose a Bayesian quality-enhancement module that achieves this objective automatically and reduces the uncertainty of the approximate answers. A larger class of queries is considered: SUM, COUNT, AVG, quantiles.

Our Model. The observed answer Ã is computed from random samples of the complete stream with sampling rate P. We propose a Bayesian method that obtains an improved answer by combining:
- the observed answer,
- the error model, and
- the history of the answers.
[Diagram: streams S_1 ... S_n pass through load shedders into the query network; each aggregate answer Ã, together with its sampling rate P and the stored history, feeds the quality-enhancement module, which outputs the improved answer.]

Error Model of the Aggregate Answers. Ã is the approximate answer obtained from a random sample with sampling rate P. Key result: an error model for SUM, COUNT, AVG, and quantiles. For the p-th quantile [B86], F is the cumulative distribution and f = F' is the density function. (The formulas on this slide were images and do not appear in the transcript.)
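The four error formulas were lost with the slide images. As a hedged reconstruction, the standard results for Bernoulli sampling at rate P over N tuples v_1, ..., v_N (expected sample size m = NP), consistent with [BDM04] for SUM/COUNT and with the classical quantile asymptotics cited as [B86], would read:

```latex
% Reconstructed standard results; not verified against the original slide.
\mathrm{SUM:}\quad \tilde{A} = \frac{1}{P}\sum_{i\in\mathrm{sample}} v_i,
\qquad \mathrm{Var}(\tilde{A}) = \frac{1-P}{P}\sum_{i=1}^{N} v_i^2
\\[4pt]
\mathrm{COUNT:}\quad \tilde{A} = \frac{|\mathrm{sample}|}{P},
\qquad \mathrm{Var}(\tilde{A}) = N\,\frac{1-P}{P}
\\[4pt]
\mathrm{AVG:}\quad \mathrm{Var}(\tilde{A}) \approx \frac{(1-P)\,\sigma^2}{NP},
\quad \sigma^2 = \text{variance of the } v_i
\\[4pt]
\text{$p$-th quantile:}\quad
\tilde{\xi}_p \sim N\!\left(\xi_p,\; \frac{p(1-p)}{m\, f(\xi_p)^2}\right),
\quad \xi_p = F^{-1}(p)
```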

Use of the Error Model. Derive accuracy estimates for a larger class of queries to optimize the load-shedding policy (idea: minimize the variance of each query). Enhance the quality of the query answer on the basis of statistical information derived from the past.

Learning the Prior Distribution from the Past. Statistical information on the answers:
- spatial, e.g., readings from the neighbors;
- temporal, e.g., the past answers {x_i}.
Model the distribution of the answer by a normal distribution: by MLE, the pdf is N(μ_s, σ_s²), where μ_s = Σ x_i / n and σ_s² = Σ (x_i − μ_s)² / n; only μ_s and σ_s need be stored. This requires only a minimal amount of computation time, assuming there is no concept change.
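Storing only μ_s and σ_s can be done incrementally with Welford's online algorithm, which matches the slide's constant-memory claim. This is our illustrative sketch, not the paper's implementation.

```python
class RunningPrior:
    """Incrementally maintain the MLE normal prior N(mu, var) over the
    past answers x_i, using Welford's online algorithm (constant memory)."""

    def __init__(self):
        self.n = 0
        self.mu = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        d = x - self.mu
        self.mu += d / self.n
        self._m2 += d * (x - self.mu)

    @property
    def var(self):
        # MLE (population) variance, matching sigma_s^2 = sum (x_i - mu_s)^2 / n
        return self._m2 / self.n if self.n else 0.0
```

Each new window answer is folded in with one `add` call, so the prior stays current without retaining the history itself.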

Observations.
- Reduced uncertainty: when σ_s is small, the posterior deviation σ_t satisfies σ_t ≪ σ.
- Compromise between prior and observed answer: a large σ means a less accurate Ã, which therefore depends more on μ_s; a small σ means a more accurate Ã, which depends less on μ_s; an uncertain prior (large σ_s) has little effect on the improved answer.
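The behavior above is exactly what the standard conjugate-normal update delivers: a prior N(μ_s, σ_s²) from the history combined with an observed answer Ã whose error model gives variance σ². This is a sketch of that textbook update, not the paper's code.

```python
def bayes_update(mu_s, var_s, a_obs, var_obs):
    """Combine a normal prior N(mu_s, var_s) with an observed answer a_obs
    of variance var_obs (from the error model). Returns the posterior
    mean and variance; standard conjugate-normal result."""
    var_t = (var_s * var_obs) / (var_s + var_obs)
    mu_t = (var_obs * mu_s + var_s * a_obs) / (var_s + var_obs)
    return mu_t, var_t

# Equal confidence in prior and observation: the answer lands halfway,
# and the posterior variance is smaller than either input variance.
mu_t, var_t = bayes_update(10.0, 4.0, 14.0, 4.0)
```

Note that as var_obs grows (a sparser sample), mu_t slides toward the prior mean mu_s, and as var_obs shrinks it slides toward the observation, which is the compromise the slide describes.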

Generalizing to Mining Functions: K-means is a Generalized AVG Query.
[Charts: relative error over time for the first and second cluster means, comparing the approximate and posterior answers.]

Quantiles (dataset with concept drifts): average relative error for each quantile.
[Table: p-th quantiles (20% and higher), approximate vs. posterior average relative error; the numeric values did not survive the transcript.]

Changing Distribution. The corrections are effective for distributions besides normal ones. Changing distributions (a.k.a. concept changes) can be easily detected; we used a two-sample test. The old prior is then dropped and a new prior is constructed.
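The slide does not say which two-sample test was used; a simple choice that fits the stored summaries is a two-sample z-test on the means of an old and a recent window of answers. The sketch below illustrates that one option, with an assumed threshold.

```python
import math

def concept_change(old, new, z_thresh=3.0):
    """Two-sample z-test on window means (one possible two-sample test;
    the paper may use a different one). Returns True when the windows
    differ enough that the old prior should be dropped and rebuilt."""
    def stats(xs):
        n = len(xs)
        m = sum(xs) / n
        v = sum((x - m) ** 2 for x in xs) / n
        return n, m, v

    n1, m1, v1 = stats(old)
    n2, m2, v2 = stats(new)
    se = math.sqrt(v1 / n1 + v2 / n2)  # std. error of the mean difference
    return abs(m1 - m2) > z_thresh * se
```

When this fires, the `RunningPrior`-style summary would simply be reset and re-learned from the post-change answers.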

Minimal Overheads. Computation costs are introduced by:
- calculating the posterior distribution, and
- detecting changes.
[Table: time in ms per query for SUM, quantile, and K-means, approximate vs. posterior; the numeric values did not survive the transcript.]

Summary. We proposed a Bayesian quality-enhancement method for approximate aggregates in the presence of sampling. Our method:
- works for order statistics and data-mining functions as well as traditional aggregates, and
- handles concept changes in the data streams.

Sampling rate = 20%.
[Table: p-th quantiles (20% and higher), approximate vs. posterior; the numeric values did not survive the transcript.]