Load Shedding CS240B notes.

Load Shedding in a DSMS
A DSMS must respond online to boundless and bursty data streams. How?
By using approximations and synopses, and even by shedding load when arrival rates become impossible to sustain.
Approximations and synopses are often used under normal load.
Shedding is used for bursty streams and overload situations.

QoS and Load Shedding
When the input stream rate exceeds system capacity, a stream manager can shed load (tuples).
Load shedding affects queries and their answers: drop the tasks and the tuples that will cause the least loss.
Introducing load shedding in a data stream manager is a challenging problem.
Two broad options: random load shedding or semantic load shedding.

Problems to Address
When to shed load: overload should be detected quickly.
Where to shed load: avoid wasted work (upstream drop vs. downstream drop).
How much to shed: the magnitude of the drop.
Which tuples to shed.

Loss-tolerance QoS function
The loss function is not linear (the slide's graph is not preserved in this transcript).

Value-based QoS
Shows which values of the output tuple space are most important.
Example: in a medical application that monitors patient heartbeats, extreme values are certainly more interesting than normal ones, so they are assigned a correspondingly higher utility.
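
The contrast between random and semantic (value-based) shedding can be sketched as follows. This is only an illustration, not Aurora's implementation: the function names, the `utility` callback, and the resting heart rate of 72 are all assumptions made for the example.

```python
import random

def random_shed(tuples, drop_prob, rng=None):
    """Random shedding: drop each tuple independently with probability drop_prob."""
    rng = rng or random.Random()
    return [t for t in tuples if rng.random() >= drop_prob]

def semantic_shed(tuples, keep_count, utility):
    """Semantic shedding: keep the keep_count tuples with the highest utility,
    preserving their arrival order."""
    ranked = sorted(range(len(tuples)),
                    key=lambda i: utility(tuples[i]),
                    reverse=True)
    keep = set(ranked[:keep_count])
    return [t for i, t in enumerate(tuples) if i in keep]

# Hypothetical value-based utility: the farther a heartbeat reading is from
# an assumed normal value of 72, the more interesting it is.
def heartbeat_utility(bpm):
    return abs(bpm - 72)

readings = [72, 75, 180, 70, 40, 74]
kept = semantic_shed(readings, 2, heartbeat_utility)
```

With these readings, the semantic policy retains the two extreme values, whereas the random policy has no preference among tuples.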

Load Shedding in Aurora
QoS for each application is given as a function relating output to its utility: delay-based, drop-based, or value-based.
Techniques for introducing load shedding operators in a plan such that QoS is disrupted the least: determining when, where, and how much load to shed.

Which Query to Drop First?
Models and algorithms proposed include:
Greedy algorithms, or formulation as a Fractional Knapsack Problem.
Other operations research (OR) techniques.
All must deal with nonlinearities.
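
A minimal sketch of the greedy fractional-knapsack idea, under strong simplifying assumptions that are not in the slides: each query has a fixed processing load and a utility that scales linearly with the fraction of its input that is kept. The linearity assumption is exactly what the nonlinear QoS functions violate, so this shows only the baseline case. All names are hypothetical.

```python
def greedy_keep_fractions(queries, capacity):
    """queries: list of (name, load, utility) with load > 0.
    Returns {name: fraction of the query's input to keep}, filling the
    available capacity in decreasing order of utility per unit of load."""
    fractions = {}
    remaining = capacity
    by_density = sorted(queries, key=lambda q: q[2] / q[1], reverse=True)
    for name, load, utility in by_density:
        if remaining > 0:
            frac = min(1.0, remaining / load)
        else:
            frac = 0.0
        fractions[name] = frac
        remaining -= frac * load
    return fractions

# Three hypothetical queries competing for 5 units of capacity.
plan = greedy_keep_fractions(
    [("q1", 4.0, 8.0), ("q2", 2.0, 6.0), ("q3", 4.0, 4.0)],
    capacity=5.0,
)
```

Here `q2` has the best utility per unit of load and is kept in full, `q1` is sampled at 75%, and `q3` is dropped.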

Load Shedding in STREAM
Formulate load shedding as an optimization problem for multiple sliding-window aggregate queries: minimize inaccuracy in answers, subject to the output rate matching or exceeding the arrival rate.
Consider the placement of load shedding operators in the query plan: each operator i sheds load uniformly with probability p_i.
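
A uniform load shedder feeding a SUM or COUNT aggregate can be sketched as below. Two assumptions are made here that the slide does not state: `p` is taken to be the probability of keeping a tuple (the slide's p_i could equally denote the drop probability), and the aggregate compensates for dropped tuples by scaling with 1/p, which gives an unbiased estimate for SUM and COUNT under independent sampling.

```python
import random

def shed(stream, p, rng):
    """Uniform load shedder: keep each tuple independently with probability p."""
    for value in stream:
        if rng.random() < p:
            yield value

def estimate_sum_count(sampled, p):
    """Scale the sampled SUM and COUNT by 1/p to estimate the full-stream values."""
    sampled = list(sampled)
    return sum(sampled) / p, len(sampled) / p

rng = random.Random(0)           # fixed seed only to make the sketch repeatable
stream = [1.0] * 10_000
est_sum, est_count = estimate_sum_count(shed(stream, 0.5, rng), 0.5)
```

The estimates fluctuate around the true values; the lower the sampling rate, the larger the fluctuation, which is why the choice of each rate is treated as an optimization problem.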

Window-Oriented Load Shedding
The input stream is divided into windows of size w.
Use fewer slides per window to compute aggregates; tumbling windows are the extreme case.
Window-based sampling: reservoir sampling for incoming tuples. Expiring tuples pose a more difficult problem.
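
For reference, the classic reservoir sampling scheme (often called Algorithm R) for incoming tuples looks roughly like this. It handles arrivals only; it does not address the harder problem of expiring tuples in a sliding window. The function name and signature are illustrative.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform random sample of size k over a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # The n-th item replaces a random slot with probability k/n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1000), 10, random.Random(42))
```

Each item seen so far ends up in the reservoir with equal probability k/n, using memory proportional to k only.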

Load Shedding by Sampling for Continuous Aggregate Queries on Data Streams
Only random samples are available for computing aggregate queries, because of:
  Limitations of remote sensors or transmission lines.
  Load shedding policies implemented when overloads occur.
When overloads occur (e.g., due to a burst of arrivals), we can drop queries altogether or sample the input, which is much preferable.
Key objective: achieve answer accuracy with sparse samples for complex aggregates on windows.
Can we improve answer accuracy with minimal overhead?

Load Shedding
To cope with bursty arrivals of high-volume data, a DSMS has to shed load while minimizing the degradation of the Quality of Service (QoS).
The goal then becomes determining when, where, and how much load to shed.
An intelligent scheme can improve the quality of our mining results under bursty arrivals.

A First Architecture
Basic idea [BDM04]: optimize the sampling rates of load shedders for accurate answers.
Find an error bound for each aggregate query.
Determine sampling rates that minimize query inaccuracy within the limits imposed by resource constraints.
This approach works for SUM and COUNT. Generalization to other functions?
Figure (not preserved): data streams S1 ... Sn enter a query network; the legend lists data stream Si, load shedder, query operator, and aggregate (∑).

Query Network: arbitrary placement of aggregates and shedder after any aggregate
Figure (not preserved). Labels: Sn; L1, L2, L4, L5; Q1, Q2, Q3, Q4, Q5. Legend: Data Stream, Load Shedder, Aggregate Operator.

Generalized Load Shedding in Stream Mill
A general framework that achieves optimal load shedding policies while accommodating different requirements for different users, different query sensitivities, and different penalties.
Applicability to a wide spectrum of aggregate functions, formally characterized using a new notion called reciprocal-error queries.
An extensible architecture that allows UDAs to benefit from the system-provided load shedding functions.
Significant improvements (in absolute error, false positives, and false negatives) compared to the common uniform approach.
An efficient (linear-time) algorithm to handle severe overloads without losing optimality.

Goals to Achieve
Lightweight overhead handling: react to overload immediately.
Minimize QoS degradation.
Deliver subset results: only omit tuples from the correct answer; never produce incorrect answers.

Prediction & Improvements
A larger class of queries was considered in [LZ08]: SUM, COUNT, AVG, quantiles.
Temporal correlation between answers can be used to improve the answer. Example: sensor data.
The current answer can be adjusted by the past answers so that:
  Low sampling rate → current answer less accurate → more dependent on history.
  High sampling rate → current answer more accurate → less dependent on history.
A Bayesian quality enhancement module can achieve this objective automatically and reduce the uncertainty of the approximate answers.
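
The intuition that a noisier current answer should lean more on history can be illustrated with a standard Gaussian update. This is a textbook sketch, not the error model of [LZ08]: it assumes the history is summarized by a mean and a variance, and that the observed answer has a known variance that grows as the sampling rate falls. All names are hypothetical.

```python
def combine(observed, obs_var, hist_mean, hist_var):
    """Precision-weighted combination of the observed answer with the
    history-based expectation (normal prior, normal likelihood).
    Returns the improved answer and its variance."""
    w_obs = 1.0 / obs_var      # precision of the sampled answer
    w_hist = 1.0 / hist_var    # precision of the history-based prediction
    mean = (w_obs * observed + w_hist * hist_mean) / (w_obs + w_hist)
    var = 1.0 / (w_obs + w_hist)
    return mean, var

# Equal confidence: the improved answer lies midway between the two.
balanced = combine(10.0, 1.0, 20.0, 1.0)
# Low sampling rate (large obs_var): the answer moves toward the history.
sparse = combine(10.0, 4.0, 20.0, 1.0)
```

In both cases the combined variance is smaller than either input variance, which is the sense in which the uncertainty of the approximate answer is reduced.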

Improved Model Using History
The observed answer à is computed from random samples of the complete stream with sampling rate P.
A Bayesian method obtains the improved answer by combining:
  the observed answer,
  the error model,
  the history of the answer.
Figure (not preserved): data streams Si ... Sn pass through load shedders and query operators in the query network to aggregates (∑); the observed answer Ã, the history, and the sampling rate P feed a Quality Enhancement Module that outputs the improved answer.

Summary
An error model that works for order statistics and data mining functions as well as for traditional aggregates, and is computationally very efficient.
A Bayesian quality enhancement method for approximate aggregates in the presence of sampling.
No correction is applied when concept changes are suspected; a two-sample test is used to detect suspected changes.
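
The slides do not say which two-sample test is used. As one possible illustration, the sketch below uses Welch's t statistic to compare recent answers with historical ones and suppresses the history-based correction when the statistic is large. The threshold of 3.0 and the function names are assumptions, and both samples are assumed to have at least two values and nonzero variance.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

def concept_change_suspected(recent, history, threshold=3.0):
    """Flag a suspected concept change when the two samples differ strongly."""
    return abs(welch_t(recent, history)) > threshold

history = [10.0, 10.5, 9.5, 10.2, 9.8]
stable = [10.1, 10.4, 9.6, 10.0, 9.9]
shifted = [20.0, 20.5, 19.5, 20.2, 19.8]
```

With these values the stable window would keep the correction enabled, while the shifted window would disable it until the history catches up.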

References: Sampling and load shedding
[Tatbul03] Nesime Tatbul, Ugur Cetintemel, Stanley B. Zdonik, Mitch Cherniack, Michael Stonebraker: Load Shedding in a Data Stream Manager. VLDB 2003: 309-320.
[BDM04] Brian Babcock, Mayur Datar, Rajeev Motwani: Load Shedding for Aggregation Queries over Data Streams. ICDE 2004: 350-361.
[Tatbul07] Nesime Tatbul, Ugur Cetintemel, Stanley B. Zdonik: Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing. VLDB 2007: 159-170.
[LZ08] Yan-Nei Law and Carlo Zaniolo: Improving the Accuracy of Continuous Aggregates and Mining Queries on Data Streams under Load Shedding. International Journal of Business Intelligence and Data Mining, 2008.
[ICDE 2010] Barzan Mozafari and Carlo Zaniolo: Optimal Load Shedding with Aggregates and Mining Queries. In Proceedings of the 26th International Conference on Data Engineering (ICDE 2010), Long Beach, California, USA, March 1-6, 2010.