Load Shedding Techniques for Data Stream Systems Brian Babcock Mayur Datar Rajeev Motwani Stanford University.

Presentation transcript:

Load Shedding Techniques for Data Stream Systems
Brian Babcock, Mayur Datar, Rajeev Motwani
Stanford University

Differences from Previous Talk
Our focus: Aggregation queries
No quality of service specifications
– Instead, focus on accuracy of query answers
Compensate for dropped data by scaling answers
Random drops only (no semantic drops)

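Problem Setting
[Figure: query plan in which streams S1 and S2 and relation R feed shared operators, ending in aggregate (Σ) operators for queries Q1, Q2, and Q3]
Sliding Window Aggregate Queries (SUM and COUNT)
Operator Sharing
Filters, UDFs, and Joins w/ Relations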

Inputs to the Problem
[Figure: the same query plan over S1, S2, and R for queries Q1, Q2, and Q3, annotated with the inputs below]
Std Dev σ
Mean μ
Stream Rate r
Processing Time t
Selectivity s

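Load Shedding via Random Drops
[Figure: stream S with rate $r$ flows through operator 1 $(t_1, s_1)$, operator 2 $(t_2, s_2)$, and aggregate $\Sigma_3$ $(t_3, s_3)$; each operator is labeled with (time, selectivity)]
Without shedding: $\text{Load} = r t_1 + r s_1 t_2 + r s_1 s_2 t_3$
With a random-drop operator at Sampling Rate $p$: $\text{Load} = r t_1 + p\,(r s_1 t_2 + r s_1 s_2 t_3)$
Scale answer by $1/p$
Need $\text{Load} \le 1$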
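The load arithmetic on this slide can be sketched in Python. The function names and the example numbers are illustrative assumptions, not part of the talk. The sketch assumes a single linear path with one random-drop operator placed directly after the first operator, which is what the slide's second load formula implies.

```python
def path_load(rate, ops, p=1.0):
    """Load of a linear operator path with one shedder after the first operator.

    rate: input stream rate r (tuples per unit time)
    ops:  list of (time, selectivity) pairs, one per operator, in order
    p:    sampling rate of the shedder (assumed to sit after operator 1)
    """
    load = 0.0
    flow = rate  # tuples per unit time reaching the current operator
    for index, (time, selectivity) in enumerate(ops):
        load += flow * time
        flow *= selectivity
        if index == 0:
            flow *= p  # the shedder keeps each tuple with probability p
    return load


def max_sampling_rate(rate, ops):
    """Largest p in [0, 1] with Load <= 1, or None if operator 1 alone overloads."""
    first = path_load(rate, ops[:1])      # r * t1, unaffected by the shedder
    rest = path_load(rate, ops) - first   # downstream load at p = 1
    if first > 1.0:
        return None
    if rest <= 0.0:
        return 1.0
    return min(1.0, (1.0 - first) / rest)


# Hypothetical numbers: r = 10 and three operators given as (t, s).
ops = [(0.05, 0.5), (0.1, 0.8), (0.2, 1.0)]
p = max_sampling_rate(10, ops)
print(path_load(10, ops), p, path_load(10, ops, p))
```

Because the load is linear in p once the shedder position is fixed, the largest feasible sampling rate has a closed form and needs no search.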

Problem Statement
Relative error is metric of choice: $\frac{|\text{Estimate} - \text{Actual}|}{\text{Actual}}$
Goal: Minimize the maximum relative error across queries, subject to $\text{Load} \le 1$
– Want low error with high probability
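As a small illustration of the objective, the following Python sketch computes the relative error of one query and the maximum across several queries. The function names and sample values are hypothetical. Dividing by the absolute value of the actual answer is an assumption made here so that negative sums do not produce a negative error.

```python
def relative_error(estimate, actual):
    """|estimate - actual| / |actual|; undefined when the actual answer is zero."""
    if actual == 0:
        raise ValueError("relative error is undefined for an actual answer of 0")
    return abs(estimate - actual) / abs(actual)


def max_relative_error(estimates, actuals):
    """The quantity the problem statement asks to minimize across queries."""
    return max(relative_error(e, a) for e, a in zip(estimates, actuals))


# Hypothetical answers for two queries.
print(max_relative_error([90.0, 52.0], [100.0, 50.0]))
```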

Relating Load Shedding and Error
[Equation not recovered from the slide; its labeled terms are: relative error for query i, sampling rate for query i, and a query-dependent constant $C_i$]
Equation derived from Hoeffding bounds
Constant $C_i$ depends on:
– Variance of aggregated attribute
– Sliding window size

Calculate Ratio of Sampling Rates
Minimize maximum relative error → Equal relative error across queries
Express all sampling rates in terms of common variable λ
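The equation relating error and sampling rate is not reproduced in this transcript. The Python sketch below therefore assumes, purely for illustration, that relative error is inversely proportional to sampling rate: error_i = C_i / P_i. Under that assumption, equal error across queries means P_i = C_i · λ for a common λ. The function names and the constants are hypothetical.

```python
def target_sampling_rates(constants, lam):
    """ASSUMPTION: error_i = C_i / P_i, so equal errors give P_i = C_i * lam.

    Rates are capped at 1.0 (no shedding). A capped query ends up with a
    lower error than the others, so errors are equal only among uncapped ones.
    """
    return [min(1.0, c * lam) for c in constants]


def assumed_relative_errors(constants, rates):
    """Relative error of each query under the assumed inverse relationship."""
    return [c / p for c, p in zip(constants, rates)]


# Hypothetical query-dependent constants C_1 = 0.8 and C_2 = 0.6.
constants = [0.8, 0.6]
rates = target_sampling_rates(constants, lam=0.5)
print(rates, assumed_relative_errors(constants, rates))
```

With every rate written as a multiple of λ, the remaining work is a one-dimensional choice of λ that brings the load down to the required level.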

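Placing Load Shedders
[Figure: two aggregate (Σ) queries that share part of their operator path, with target sampling rates 0.8λ and 0.6λ]
Shedder on the shared path: Sampling Rate 0.8λ
Shedder on the branch to the 0.6λ query: Sampling Rate 0.75 = 0.6λ / 0.8λ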
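The placement arithmetic on this slide can be sketched in Python. The sketch assumes one shedder on the shared path followed by one shedder per branch, and it sets the shared shedder to the largest target rate so that no query is sampled below its target. The function name and the query labels Q1 and Q2 are hypothetical.

```python
def place_shedders(targets):
    """Split per-query target sampling rates into a shared and a per-branch rate.

    targets: dict mapping query name -> target sampling rate in (0, 1]
    Returns (shared_rate, branch_rates), where for every query
    shared_rate * branch_rates[query] equals its target rate.
    """
    if not targets or any(not 0.0 < t <= 1.0 for t in targets.values()):
        raise ValueError("targets must be a non-empty map of rates in (0, 1]")
    shared = max(targets.values())
    branches = {query: t / shared for query, t in targets.items()}
    return shared, branches


# The slide's numbers with a hypothetical lambda of 0.5.
lam = 0.5
shared, branches = place_shedders({"Q1": 0.8 * lam, "Q2": 0.6 * lam})
print(shared, branches)
```

A branch rate of 1.0 means that branch needs no shedder of its own. Dropping as early as possible on the shared path saves work for every operator downstream of it.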

Conclusion
Load shedding helps cope with bursts
Minimizing relative error is a natural objective for aggregate queries
Algorithm for load shedding:
– Relate target sampling rates for all queries
– Place random drop operators based on target sampling rates
– Adjust sampling rates to achieve desired load

Thanks for listening! Questions?

Choosing Target Sampling Rates
[Equation not recovered from the slide; its labeled terms are: sampling rate for query, variance of aggregated attribute, sliding window size, and relative error]

Measuring Inaccuracy
[Figure: operator 1, operator 2, and aggregate $\Sigma_3$ in sequence, with load shedders at Sampling Rate $p_1$ and Sampling Rate $p_2$]
Scale answer by $1/(p_1 p_2)$
Tuple w/ value $x$ contributes:
– $x / (p_1 p_2)$ with pr. $p_1 p_2$
– $0$ with pr. $1 - p_1 p_2$
Key point: Product of sampling rates determines quality of approximate answer
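A short simulation can illustrate why scaling by 1/(p_1 p_2) compensates for the drops. The Python sketch below passes each tuple through two independent random-drop stages and scales the survivors. The function name, the seed, and the sample values are illustrative assumptions, and filters between the shedders are left out for simplicity.

```python
import random


def sampled_sum(values, p1, p2, rng):
    """Estimate sum(values) after two independent random-drop stages.

    Each tuple survives with probability p1 * p2; survivors are scaled by
    1 / (p1 * p2), so the expected contribution of a tuple with value x is x.
    """
    total = 0.0
    for x in values:
        if rng.random() < p1 and rng.random() < p2:
            total += x / (p1 * p2)
    return total


rng = random.Random(0)      # fixed seed so runs are repeatable
values = [1.0] * 1000       # hypothetical window contents; true SUM is 1000
trials = [sampled_sum(values, 0.8, 0.5, rng) for _ in range(200)]
print(sum(trials) / len(trials))  # close to 1000 on average
```

Only the product p_1 p_2 appears in the per-tuple distribution, so two shedders at 0.8 and 0.5 give the same answer quality as a single shedder at 0.4.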