Load Shedding Techniques for Data Stream Systems

Presentation transcript:

Load Shedding Techniques for Data Stream Systems Brian Babcock Mayur Datar Rajeev Motwani Stanford University

Differences from Previous Talk
Our focus: aggregation queries
No quality-of-service specifications; instead, focus on accuracy of query answers
Compensate for dropped data by scaling answers
Random drops only (no semantic drops)

Problem Setting
[Figure: query plan over streams S1, S2 and relation R, feeding aggregate (Σ) queries Q1, Q2, Q3]
Sliding window aggregate queries (SUM and COUNT)
Filters, UDFs, and joins with relations
Operator sharing

Inputs to the Problem
[Figure: the query plan for Q1, Q2, Q3 over R, S1, S2, annotated with the inputs below]
Std dev σ and mean μ
Processing time t and selectivity s
Stream rate r

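Load Shedding via Random Drops
[Figure: stream S with rate r passes through operators 1 and 2 into aggregate Σ3; each operator is labeled (time, selectivity): (t1, s1), (t2, s2), (t3, s3); a random drop operator with sampling rate p follows operator 1]
Load without shedding: Load = r·t1 + r·s1·t2 + r·s1·s2·t3
Load with sampling rate p: Load = r·t1 + p·(r·s1·t2 + r·s1·s2·t3)
Need Load ≤ 1
Scale answer by 1/p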
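
The two load formulas on this slide can be checked with a short sketch. The function name `chain_load`, the parameter `shed_before`, and the example numbers are illustrative assumptions, not part of the talk; the code only models a linear chain of operators with at most one random drop operator.

```python
from typing import Optional, Sequence, Tuple


def chain_load(
    rate: float,
    ops: Sequence[Tuple[float, float]],
    p: float = 1.0,
    shed_before: Optional[int] = None,
) -> float:
    """Load of a linear operator chain fed by one stream.

    ops holds (processing time t, selectivity s) per operator, in order.
    shed_before is the 0-based index of the operator that the random
    drop operator (sampling rate p) is placed in front of, or None.
    """
    load = 0.0
    flow = rate  # tuples per unit time arriving at the current operator
    for i, (t, s) in enumerate(ops):
        if shed_before is not None and i == shed_before:
            flow *= p  # the shedder keeps each tuple with probability p
        load += flow * t  # time spent by this operator per unit time
        flow *= s  # only a fraction s of tuples pass on
    return load


# Hypothetical numbers: r = 10 and three operators (t1, s1), (t2, s2), (t3, s3).
ops = [(0.05, 0.5), (0.1, 0.5), (0.2, 1.0)]
overloaded = chain_load(10.0, ops)  # r*t1 + r*s1*t2 + r*s1*s2*t3, about 1.5
shed = chain_load(10.0, ops, p=0.5, shed_before=1)  # about 1.0, so Load <= 1
```

Placing the shedder later in the chain leaves the cost of the earlier operators untouched, which is why the r·t1 term is not multiplied by p.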

Problem Statement
Relative error is the metric of choice: |Estimate - Actual| / Actual
Goal: minimize the maximum relative error across queries, subject to Load ≤ 1
Want low error with high probability
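
As a small worked example of the objective, the sketch below computes the relative error per query and the maximum across queries. The function names and the sample values are hypothetical; dividing by the absolute value of the actual answer is an assumption made here so that negative answers also work.

```python
from typing import Iterable, Tuple


def relative_error(estimate: float, actual: float) -> float:
    """|Estimate - Actual| / |Actual|; undefined when the actual answer is 0."""
    if actual == 0:
        raise ValueError("relative error is undefined for an actual value of 0")
    return abs(estimate - actual) / abs(actual)


def max_relative_error(answers: Iterable[Tuple[float, float]]) -> float:
    """Worst relative error over (estimate, actual) pairs, one pair per query."""
    return max(relative_error(est, act) for est, act in answers)


# Hypothetical (estimate, actual) pairs for three queries.
worst = max_relative_error([(90.0, 100.0), (205.0, 200.0), (48.0, 50.0)])
```

The load shedding problem is then to choose sampling rates that keep this worst-case value small while respecting Load ≤ 1.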

Relating Load Shedding and Error
Equation relates three quantities: relative error for query i, sampling rate for query i, and a query-dependent constant Ci
Equation derived from Hoeffding bounds
Constant Ci depends on:
Variance of aggregated attribute
Sliding window size

Calculate Ratio of Sampling Rates
Minimize maximum relative error → equal relative error across queries
Express all sampling rates in terms of a common variable λ
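
The step from equal errors to rates tied to one variable can be sketched as follows. The transcript does not preserve the equation itself, so this sketch assumes the inverse form ε_i = C_i / P_i (relative error ε_i, sampling rate P_i, query-dependent constant C_i). Normalizing so that the largest coefficient is 1 is also an assumption, and the constants are made up.

```python
from typing import List, Sequence


def rate_coefficients(constants: Sequence[float]) -> List[float]:
    """Coefficients a_i such that the target rate of query i is a_i * lambda.

    Assumes eps_i = C_i / P_i, so equal errors force P_i proportional to C_i.
    The largest constant is normalized to coefficient 1 (an assumption).
    """
    c_max = max(constants)
    return [c / c_max for c in constants]


def target_rates(constants: Sequence[float], lam: float) -> List[float]:
    """Target sampling rates for a given common variable lambda in (0, 1]."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lambda must lie in (0, 1]")
    return [a * lam for a in rate_coefficients(constants)]


# Hypothetical constants for three queries.
constants = [0.05, 0.04, 0.03]
rates = target_rates(constants, lam=0.5)
# Under the assumed relation, every query ends up with the same error.
errors = [c / p for c, p in zip(constants, rates)]
```

Once the coefficients are fixed, only λ remains to be chosen, and it can be lowered until the load constraint is met.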

Placing Load Shedders
[Figure: two Σ queries over a shared operator path, with targets .8λ and .6λ]
Shared path: sampling rate .8λ
Branch with target .6λ: sampling rate .75 = .6λ / .8λ
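
The numbers on this slide are consistent with a simple rule: shed on the shared path at the largest target rate, then add a shedder on each branch that makes up the difference. The sketch below assumes that rule and a single shared path with independent branches; the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def place_shedders(branch_targets: Sequence[float]) -> Tuple[float, List[float]]:
    """Split target sampling rates into a shared rate and per-branch rates.

    The shared shedder uses the largest target. Each branch shedder uses
    target / shared, so the product along a path equals the branch target.
    """
    shared = max(branch_targets)
    return shared, [target / shared for target in branch_targets]


# Targets .8*lambda and .6*lambda from the slide, shown here with lambda = 1.
shared_rate, branch_rates = place_shedders([0.8, 0.6])
# shared_rate is 0.8; branch_rates are about 1.0 (no shedder needed) and 0.75.
```

Dropping as early as possible on the shared path saves the most work, while the branch shedder only removes the extra tuples that the less demanding query does not need.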

Conclusion
Load shedding helps cope with bursts
Minimizing relative error is a natural objective for aggregate queries
Algorithm for load shedding:
1. Relate target sampling rates for all queries
2. Place random drop operators based on target sampling rates
3. Adjust sampling rates to achieve desired load
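
The last step of the algorithm, adjusting the sampling rates until the load fits, can be sketched as a search over the common variable λ. The talk does not say how the adjustment is done; bisection is swapped in here as one simple option, assuming the load never decreases as λ grows. The function name and the example load function are hypothetical.

```python
from typing import Callable


def largest_feasible_lambda(
    load_fn: Callable[[float], float],
    capacity: float = 1.0,
    iterations: int = 50,
) -> float:
    """Largest lambda in [0, 1] with load_fn(lambda) <= capacity.

    Assumes load_fn is nondecreasing in lambda, which holds when every
    sampling rate is a fixed multiple of lambda.
    """
    if load_fn(1.0) <= capacity:
        return 1.0  # no shedding needed
    if load_fn(0.0) > capacity:
        raise ValueError("load exceeds capacity even when everything is dropped")
    lo, hi = 0.0, 1.0  # invariant: lo is feasible, hi is not
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if load_fn(mid) <= capacity:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical plan: 0.5 units of load cannot be shed, 2.0 units scale with lambda.
lam = largest_feasible_lambda(lambda x: 0.5 + 2.0 * x)  # close to 0.25
```

Choosing the largest feasible λ keeps every sampling rate as high as the load constraint allows, which keeps the common relative error as low as possible.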

Thanks for listening! Questions?

Choosing Target Sampling Rates
Equation terms:
Relative error
Sampling rate for query
Variance of aggregated attribute
Sliding window size

Measuring Inaccuracy
[Figure: operators 1 and 2 feed aggregate Σ3, with sampling rate p1 at operator 1 and sampling rate p2 at operator 2]
Scale answer by 1/(p1p2)
Tuple w/ value x contributes x / (p1p2) with pr. p1p2, and 0 with pr. 1 - p1p2
Key point: product of sampling rates determines quality of approximate answer
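
A small simulation illustrates why scaling by the inverse of the product of sampling rates compensates for the drops. This is a sketch with made-up data for a SUM over two independent random drop operators; the function name `shed_and_sum` and the seed are assumptions.

```python
import random
from typing import Sequence


def shed_and_sum(values: Sequence[float], rates: Sequence[float], rng: random.Random) -> float:
    """Estimate SUM(values) after passing tuples through random drop operators.

    Each tuple survives shedder k independently with probability rates[k].
    The surviving total is scaled by 1 / (product of rates), so every tuple
    contributes x / (p1*p2*...) with probability p1*p2*... and 0 otherwise.
    """
    keep_prob = 1.0
    for p in rates:
        keep_prob *= p
    total = 0.0
    for x in values:
        if all(rng.random() < p for p in rates):
            total += x
    return total / keep_prob


# Hypothetical window of 100,000 tuples with value 1.0, p1 = 0.8, p2 = 0.5.
rng = random.Random(42)
estimate = shed_and_sum([1.0] * 100_000, [0.8, 0.5], rng)
# The exact sum is 100,000; the estimate should land close to it.
```

Only the product p1·p2 enters the scaling factor and the survival probability, which matches the key point on the slide: where the drops happen affects load, but the product of rates determines accuracy.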