Adaptivity in continuous query systems Luis A. Sotomayor & Zhiguo Xu Professor Carlo Zaniolo CS240B - Spring 2003.


Outline
- Introduction
- Adapting to the “burstiness” of data streams by using a smart operator scheduling strategy
- Adapting to high volumes of data streamed by multiple data sources through the use of “adaptive filters”
- Conclusion

Introduction
- Two distinguishing characteristics of data streams:
  - Volume of data is extremely high
  - Decisions are made in close to real time
- Traditional solutions are impractical
  - Data cannot be stored in static databases for offline querying
- Importance of data streams is due to the variety of applications

Applications of data streams
- Network monitoring
- Intrusion detection systems
- Fraud detection
- Financial monitoring
- E-commerce
- Sensor networks

Research efforts
- The large number of applications has led to many efforts to construct full-fledged data stream management systems (DSMS)
- Efforts have concentrated on issues of:
  - System architectures
  - Query languages
  - Algorithm efficiency
- Issues such as efficient resource allocation and communication overhead have received less attention

Importance of adaptivity
- DSMS deal with multiple long-running continuous queries
- Data streams do not usually arrive at a regular rate
  - Considerable “burstiness” and variation over time
- The conditions under which queries execute frequently differ from the conditions for which the query plans were generated
- DSMS may face an increasing number of data sources and therefore an increased volume of traffic

The “Chain” operator scheduling strategy

The classic solution
- Buffer the backlog of unprocessed tuples
- Work through them during periods of light load
- Problem: under heavy load the backlog could exceed physical memory (causing paging to disk)
- The memory used for these backlogs has to be minimized

Finding a better solution
- Claim: the operator scheduling strategy can have a significant impact on run-time resource consumption
- Use an operator scheduling strategy that minimizes the amount of memory used during query execution
  - I.e. reduce the size of the backlogs

Chain scheduling
- A near-optimal operator scheduling strategy
- Outperforms competing operator scheduling strategies
- Strategy concentrates on:
  - Single stream queries involving
    - Selection
    - Projection
    - Foreign-key joins with stored relations
  - Sliding window queries over multiple streams

The model
- Query execution is conceptualized as a data flow diagram (a directed acyclic graph)
  - Nodes correspond to pipelined operators
  - Edges represent compositions of operators
- An edge from A to B indicates that the output of operator A is the input to operator B
- Another interpretation: an edge represents an input queue that buffers the output of A before it is input to B

An example
- Suppose the query is: SELECT Name FROM EmployeeStream WHERE ID = '12345';
- Operators are:
  - Projection (SELECT …)
  - Selection (WHERE …)
- Operator path: Input stream → Select → Project → Output stream

Main ideas
- Operators are thought of as filters
  - Operate on a set of tuples
  - Produce $s$ tuples per input tuple, where $s$ is the selectivity of the operator
- If $s = 0.2$, the value can be interpreted in two ways:
  - Out of every 10 tuples, the operator outputs 2 tuples
  - If the input requires 1 unit of memory, the output will require 0.2 units of memory

Example
- Consider an operator path with two operators $O_1$ and $O_2$
- Assume that $O_1$ takes one unit of time to process a tuple and that its selectivity is 0.2
- Assume that $O_2$ takes one unit of time to process 0.2 tuples and that its selectivity is 0
  - I.e. $O_2$ outputs tuples out of the system

Example (cont)
- Now consider two strategies
- FIFO
  - A tuple is passed through both operators in two consecutive time units
  - No other tuples are processed during that time
- Greedy strategy
  - If there is a tuple buffered before $O_1$, it is operated on using one time unit
  - Otherwise, if there are tuples buffered before $O_2$, 0.2 tuples are processed using one time unit

Example (cont)
[Table: memory usage over time under Greedy scheduling and FIFO scheduling]
- Need to consider the growth or reduction of data as it travels along the operator path
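The two-operator example can be replayed with a short simulation. This is a minimal sketch, not the authors' code: the arrival pattern (one tuple at every time instant) is an assumption the slides do not state, and `memory_usage` is a hypothetical helper name. Memory is measured right after each arrival, and exact fractions avoid rounding noise.

```python
from fractions import Fraction

def memory_usage(strategy, steps):
    """Return the total queued memory at time instants 0..steps-1."""
    sel = Fraction(1, 5)   # selectivity of O1 (0.2)
    q1 = Fraction(0)       # memory buffered before O1 (whole tuples)
    q2 = Fraction(0)       # memory buffered before O2 (shrunken tuples)
    mid_tuple = False      # FIFO only: current tuple has passed O1 but not O2
    usage = []
    for _ in range(steps):
        q1 += 1                        # assumed arrival: one tuple per instant
        usage.append(float(q1 + q2))   # measure right after the arrival
        if strategy == "greedy":
            if q1 >= 1:                # a tuple waits before O1: run O1
                q1 -= 1
                q2 += sel
            elif q2 > 0:               # otherwise drain 0.2 tuples through O2
                q2 -= min(q2, sel)
        elif strategy == "fifo":
            if mid_tuple:              # finish the current tuple through O2
                q2 -= sel
                mid_tuple = False
            else:                      # start the oldest waiting tuple on O1
                q1 -= 1
                q2 += sel
                mid_tuple = True
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return usage

print(memory_usage("greedy", 5))
print(memory_usage("fifo", 5))
```

Under this assumed arrival pattern the Greedy backlog grows by 0.2 units per step, while the FIFO backlog grows by a full unit every second step, which is the kind of gap the slide's table illustrates.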

Progress charts
- The behavior of data is captured by progress charts
- Each point represents an operator
- The $i$th operator takes $(t_i - t_{i-1})$ units of time to process a tuple of size $s_{i-1}$
- The result is a tuple of size $s_i$

Progress charts (cont)
- Selectivity can be defined through the drop in tuple size across an operator
- In other words, the selectivity of the $i$th operator is equal to $s_i / s_{i-1}$

The lower envelope
- Consider some point $(t, s)$ on the progress chart
- Imagine there is a line from this point to every operator point $(t_i, s_i)$ to its right
- The operator point that corresponds to the line with the steepest (most negative) slope is called the “steepest descent operator point”

The lower envelope (cont)
- By starting at the first point $(t_0, s_0)$ and repeatedly calculating the steepest descent operator point, we find the lower envelope $P'$ for a progress chart $P$
- Notice that the slopes of the segments are non-increasing in magnitude: the envelope never gets steeper from left to right

The lower envelope (cont)
- So what is it?
  - A way to find which segments of the operator path yield the biggest drops in tuple size
  - It allows us to consider changes in selectivity across groups of operators
  - We call these groups “chains”
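The lower-envelope construction described above can be sketched as follows. This is an illustrative reading of the slides, not the published algorithm verbatim: the function names are hypothetical, the chart is assumed to be a list of `(t_i, s_i)` points with strictly increasing `t_i`, and ties between equally steep lines are broken in favor of the point furthest to the right (an assumption).

```python
def lower_envelope(points):
    """Return the indices of the progress-chart points on the lower envelope.

    points: [(t_0, s_0), ..., (t_n, s_n)] with strictly increasing t.
    """
    envelope = [0]
    current = 0
    while current < len(points) - 1:
        t0, s0 = points[current]
        best, best_slope = None, None
        for j in range(current + 1, len(points)):
            t, s = points[j]
            slope = (s - s0) / (t - t0)
            # steepest descent = most negative slope;
            # assumption: on ties, prefer the point furthest to the right
            if best_slope is None or slope <= best_slope:
                best, best_slope = j, slope
        envelope.append(best)
        current = best
    return envelope

def chain_priorities(points):
    """Map each operator (1-based index) to the steepness of its envelope segment."""
    env = lower_envelope(points)
    priorities = {}
    for a, b in zip(env, env[1:]):
        (t0, s0), (t1, s1) = points[a], points[b]
        steepness = (s0 - s1) / (t1 - t0)
        for op in range(a + 1, b + 1):   # operators a+1..b form one chain
            priorities[op] = steepness
    return priorities

chart = [(0, 10), (1, 9), (2, 8), (3, 1), (4, 0)]  # hypothetical 4-operator chart
print(lower_envelope(chart))
print(chain_priorities(chart))
```

In this hypothetical chart the third operator is very selective, so operators 1 to 3 fall into one chain and share its steepness, while operator 4 forms a second, flatter chain.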

“Chain” scheduling
- Chain assigns each operator a priority equal to the slope (steepness) of the lower envelope segment to which the operator belongs
- At any time, out of all the operators with tuples in their input queues, the one with the highest priority is chosen
- When there are “ties,” the operator with the oldest tuples is chosen (based on arrival time)
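The selection rule can be written as a small function. This is a minimal sketch under stated assumptions: `pick_operator` is a hypothetical name, priorities are taken as already computed, and each input queue is represented only by the arrival times of its buffered tuples.

```python
def pick_operator(priorities, queues):
    """Choose the next operator to run, or None if all input queues are empty.

    priorities: {operator: priority (steepness of its envelope segment)}
    queues:     {operator: [arrival times of the tuples in its input queue]}
    """
    ready = [op for op, queue in queues.items() if queue]
    if not ready:
        return None
    # highest priority wins; ties go to the operator holding the oldest tuple
    return min(ready, key=lambda op: (-priorities[op], min(queues[op])))

priorities = {"A": 0.3, "B": 0.3, "C": 0.1}
queues = {"A": [5, 7], "B": [2], "C": [1]}
print(pick_operator(priorities, queues))
```

Here A and B tie on priority, and B is chosen because its oldest tuple arrived earlier. C holds the oldest tuple overall but has a lower priority, so it waits.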

The Chain strategy along the progress chart
- Tuples don’t actually move along the lower envelope
- They instead move along the operator path
- When the Chain strategy moves along the actual progress chart $P$, the memory requirements are not much greater than they would be along the lower envelope

Multiple stream queries
- Queries that have at least one tuple-based sliding window join between two streams

Multiple stream query execution
- Query is first broken up into parallel operator paths
[Figure: a query over streams R and S split into parallel operator paths with a shared component]

Experimental results
- Compared the performance of Chain, FIFO, Greedy, and Round-Robin
- 2 data sets (network data)
  - Synthetic data set
  - Real data set
- Queries used IP addresses and packet sizes in selection and projection predicates

Experiment: single stream queries (4 operators)
- Query: 4 operators
  - Third operator is very selective
  - It sits in between two less selective operators

Experiment results

Multiple stream experiment
- Three simultaneous queries
  - A sliding window join
  - Two single stream queries with selectivities less than one
- Results show Chain outperforms other strategies by a large margin

Multiple stream experiment results

Summary
- Proved that the choice of operator scheduling strategy has a significant impact on resource consumption
- Proved that the Chain scheduling strategy outperforms competing strategies
- Future work
  - Latency and starvation issues
  - Consider query plans that change over time
  - Consider the sharing of computation and memory in query plans

“Adaptive filters” for continuous queries over distributed data streams

What’s the problem?
- Distributed data sources continuously stream updates to a centralized processor where continuous queries are evaluated
- Because of the high volume of data updates, the communication overhead jeopardizes system performance
  - E.g. path latency computed by monitoring queuing latency at routers: the volume of monitoring traffic from routers may exceed that of normal traffic
- Can we reduce the communication overhead to make continuous queries based on multiple data streams feasible and efficient?

Important observations
- Exact precision for continuous queries is not always needed
  - E.g. path latency application: accuracy to within 5 ms is enough
- Approximate answers of sufficient precision can usually be computed from a small fraction of the input stream
  - E.g. average network traffic volume received by all hosts within the organization
- The precision constraint for queries may change over time
  - E.g. more precise traffic volume is needed in the face of an attack

Overview of Approach
- Reduce communication overhead at the cost of query precision
- Quantitative precision constraints are specified with the continuous queries
  - Bounded approximate answer $[L, H]$
  - Precision constraint $\delta$: $0 \le H - L \le \delta$
- Filters are installed at the remote data sources by the stream processor
  - Filter at data object $O$’s source: a bound $[L_O, H_O]$ of width $W_O$ centered around the most recent numeric update $V$
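A source-side filter of this kind might look like the sketch below. The class and method names are hypothetical, and the suppression rule (an update is sent only when the new value leaves the current bound, after which the bound is re-centered on it) is an assumption consistent with the slide rather than something it spells out.

```python
class SourceFilter:
    """Bound [low, high] of fixed width, centered on the last transmitted value."""

    def __init__(self, width, initial_value):
        self.width = width
        self._recenter(initial_value)

    def _recenter(self, value):
        self.low = value - self.width / 2
        self.high = value + self.width / 2

    def offer(self, value):
        """Return True if this update must be streamed to the processor."""
        if self.low <= value <= self.high:
            return False           # still inside the bound: suppress the update
        self._recenter(value)      # outside the bound: re-center and transmit
        return True

f = SourceFilter(width=4, initial_value=10)    # bound is [8, 12]
print([f.offer(v) for v in (11, 12, 13, 11, 10.5)])
```

Only the values 13 and 10.5 are transmitted in this run. The other updates stay within the bound in force at the time, so the processor can keep answering from the cached bound.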

Naive filtering policy
- Uniform allocation
  - E.g. a single CQ: AVG($O_1, O_2, \ldots, O_n$)
  - Precision constraint $\delta$ → filters with a bound of width $\delta$
- The wider a bound, the more restrictive a filter and consequently the more imprecise the query answers
- Cons
  - Multiple CQs are issued on one object: if the smallest bound width is chosen for the filter, the resulting higher update stream rate may be wasted on some of the CQs
  - Data update rates and magnitudes are not taken into account
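A short derivation shows why uniform bounds of width $\delta$ meet an AVG constraint of $\delta$. The per-object symbols $L_i$, $H_i$, and $W_i = H_i - L_i$ are notation introduced here for illustration, extending the $[L, H]$ notation of the approach overview.

```latex
\begin{align*}
L &= \frac{1}{n}\sum_{i=1}^{n} L_i, \qquad
H = \frac{1}{n}\sum_{i=1}^{n} H_i \\
H - L &= \frac{1}{n}\sum_{i=1}^{n} (H_i - L_i)
       = \frac{1}{n}\sum_{i=1}^{n} W_i \\
W_i = \delta \ \text{for all } i
  \;&\Longrightarrow\; H - L = \frac{1}{n}\cdot n\,\delta = \delta
\end{align*}
```

The answer width is the average of the bound widths, so any allocation whose widths average to at most $\delta$ also satisfies the constraint. That freedom is what a non-uniform, adaptive allocation can exploit.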

System structure
- Data source
  - Filters
- Stream coordinator
  - Precision manager
  - Bound cache
  - CQ evaluator

System structure (cont)
[Figure: system structure diagram]

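Adaptive filter setting algorithm
- Goal: set bound widths for stream filters adaptively to reduce communication costs while guaranteeing the precision constraints of CQs
- Only AVG queries are analyzed
- Queries $Q_1, Q_2, \ldots, Q_m$ with sets $S_1, S_2, \ldots, S_m$
  - $S_j$ is a subset of a set of $n$ data objects $O_1, O_2, \ldots, O_n$
- Query result $Q_j$: the bounded answer $[L, H]$ with $L = \frac{1}{|S_j|}\sum_{O_i \in S_j} L_i$ and $H = \frac{1}{|S_j|}\sum_{O_i \in S_j} H_i$
- Precision constraint: $\frac{1}{|S_j|}\sum_{O_i \in S_j} W_i \le \delta_j$
- Basic idea:
  - Implicit bound width shrinking
  - Explicit bound width growing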

Bound shrinking
- Filtering bound width $W_i$ for object $O_i$
  - Maintained both at the central stream coordinator and at the source filter
- $W_i \leftarrow W_i \cdot (1 - S)$ every $\Gamma$ time units
  - $\Gamma$: adjustment period
  - $S$: shrink percentage
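One periodic shrink step can be sketched as below. The function name is hypothetical, the shrink percentage $S$ is assumed to be expressed as a fraction in $[0, 1)$, and the sketch leaves out the timing logic that triggers the step every $\Gamma$ time units.

```python
def shrink_bounds(widths, shrink_fraction):
    """Apply W_i <- W_i * (1 - S) to every bound width and return the new widths.

    widths:          {object: current bound width W_i}
    shrink_fraction: S, assumed to be a fraction in [0, 1)
    """
    if not 0 <= shrink_fraction < 1:
        raise ValueError("shrink_fraction must be in [0, 1)")
    return {obj: w * (1 - shrink_fraction) for obj, w in widths.items()}

widths = {"O1": 10.0, "O2": 4.0}
for period in range(2):            # two adjustment periods with S = 0.5
    widths = shrink_bounds(widths, 0.5)
print(widths)
```

Because the rule is deterministic, the coordinator and each source can apply it independently and stay in agreement without exchanging messages, which fits the note that the width is maintained at both ends.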

Bound growing
- Burden score $B_i$: the degree to which an object is contributing to the overall communication cost due to streamed updates
  - It is computed from $C_i$, the communication cost for $O_i$; $W_i$, the current bound width; and $N_i$, the number of updates of $O_i$ received by the stream coordinator in the last $\Gamma$ time units
- Burden target: the lowest overall burden required of the objects in the query in order to meet the precision constraint at all times

Bound growing (Cont)
- Burden deviation: the degree to which an object is “over-burdened” with respect to the burden targets of the queries that access it
- Queried objects are considered in order of decreasing deviation, and each is assigned the maximum possible bound growth when it is considered

Bound growing (Summary)
- Each object is assigned a burden score
- Each query is assigned a burden target by either averaging burden scores or invoking an iterative linear solver
- Each object is assigned a deviation value based on the difference between its burden score and the burden targets of the queries that access it
- The objects are considered in order of decreasing deviation, and each object is assigned the maximum possible bound growth when it is considered
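The last two steps of the summary can be sketched as follows. This is an illustration under explicit assumptions, not the published algorithm: burden scores and burden targets are taken as given inputs, the deviation is taken to be the burden score minus the sum of the targets of the queries over the object, and "maximum possible growth" is interpreted as the largest increase that keeps every AVG query over the object within its precision constraint (average width at most `deltas[q]`). All names are hypothetical.

```python
def grow_bounds(widths, burden, targets, queries, deltas):
    """Grow bound widths in order of decreasing burden deviation.

    widths:  {object: current bound width}
    burden:  {object: burden score}            (taken as given)
    targets: {query: burden target}            (taken as given)
    queries: {query: set of objects it reads}
    deltas:  {query: precision constraint on the AVG answer width}
    """
    # assumption: deviation = burden score - sum of targets of accessing queries
    deviation = {
        obj: burden[obj] - sum(targets[q] for q, objs in queries.items() if obj in objs)
        for obj in widths
    }
    # slack of an AVG query in total-width terms: delta * |S| - sum of widths
    slack = {
        q: deltas[q] * len(objs) - sum(widths[o] for o in objs)
        for q, objs in queries.items()
    }
    new_widths = dict(widths)
    for obj in sorted(widths, key=lambda o: deviation[o], reverse=True):
        touching = [q for q, objs in queries.items() if obj in objs]
        if not touching:
            continue                   # no query constrains this object
        growth = max(0.0, min(slack[q] for q in touching))
        new_widths[obj] += growth
        for q in touching:
            slack[q] -= growth         # growth consumes slack in every query
    return new_widths

result = grow_bounds(
    widths={"O1": 1.0, "O2": 1.0, "O3": 1.0},
    burden={"O1": 5.0, "O2": 9.0, "O3": 1.0},
    targets={"Q1": 3.0, "Q2": 2.0},
    queries={"Q1": {"O1", "O2"}, "Q2": {"O2", "O3"}},
    deltas={"Q1": 2.0, "Q2": 1.5},
)
print(result)
```

In this made-up instance O2 has the largest deviation and is served first, limited by the tighter query Q2. O1 then takes what slack remains in Q1, and O3 gets nothing because Q2 is already at its constraint.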

Burden Target Computation
- Single AVG query $Q_k$ over every object $O_1, \ldots, O_n$
- Ideally $B_1 = B_2 = \cdots = B_n = T_k$; otherwise the target is the average burden score, $T_k = \frac{1}{n}\sum_{i=1}^{n} B_i$
- Intuitive explanation behind this formula:
  - Objects having higher than average burden scores will be given a higher priority for bound width growth to lower their burden scores
  - Objects having lower than average burden scores will shrink by default, thereby raising their burden scores

Burden Target Computation (Cont)
- Multiple queries over different sets of objects
- $\theta_{i,j}$: the portion of object $O_i$’s burden score corresponding to query $Q_j$
- The goal for adjusting burden scores in the presence of overlapping queries is to have the burden score $B_i$ of each object $O_i$ equal the sum of the burden targets of the queries over $O_i$
- The burden targets are chosen to meet this goal

Validation against optimized strategy
- The adaptive bound width setting algorithm converges on bounds that are on par with those selected by an optimizer

Implementation and experimental validation: single query

Implementation and experimental validation: multiple queries

Summary
- Trade the precision of query results for lower communication costs
- The specification of precision for continuous queries
- Adaptive filters
- Future work
  - How imprecision propagates through more complex query plans
  - Develop appropriate optimization techniques for adapting remote filter predicates in more complex environments

Conclusion
- The problem
  - DSMS must consider the high volume as well as the “burstiness” of data streams
  - Effectiveness of systems depends on being able to gracefully adapt to environmental conditions (i.e. resource availability)
- Two different approaches for adaptivity
  - Minimizing the amount of memory at all times
  - Controlling the amount of data sent from multiple data sources

Conclusion (cont)
- Chain operator scheduling minimizes the amount of memory used during execution, making the system more adaptable to variation in arrival rates
- Adaptive filters reduce the volume of data so that a system can perform efficiently while providing a certain level of precision
- Overall, adaptivity in DSMS is necessary due to the unpredictability of data streams

References
- J. M. Hellerstein et al. Adaptive Query Processing: Technology in Evolution. IEEE, 2000.
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and Issues in Data Stream Systems. ACM SIGMOD/PODS 2002 Conference.
- B. Babcock, S. Babu, M. Datar, and R. Motwani. Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. SIGMOD 2003.
- C. Olston, J. Jiang, and J. Widom. Adaptive Filters for Continuous Queries Over Distributed Data Streams. SIGMOD 2003.