1
Adaptivity in Continuous Query Systems
Luis A. Sotomayor & Zhiguo Xu
Professor Carlo Zaniolo
CS240B - Spring 2003
2
Outline
- Introduction
- Adapting to the "burstiness" of data streams by using a smart operator scheduling strategy
- Adapting to high volumes of data streamed by multiple data sources through the use of "adaptive filters"
- Conclusion
3
Introduction
- Two distinguishing characteristics of data streams:
  - The volume of data is extremely high
  - Decisions must be made in close to real time
- Traditional solutions are impractical: the data cannot be stored in static databases for offline querying
- The importance of data streams stems from the wide variety of applications that rely on them
4
Applications of data streams
- Network monitoring
- Intrusion detection systems
- Fraud detection
- Financial monitoring
- E-commerce
- Sensor networks
5
Research efforts
- The large number of applications has led to many efforts to build full-fledged data stream management systems (DSMS)
- These efforts have concentrated on:
  - System architectures
  - Query languages
  - Algorithm efficiency
- Issues such as efficient resource allocation and communication overhead have received less attention
6
Importance of adaptivity
- A DSMS deals with multiple long-running continuous queries
- Data streams do not usually arrive at a regular rate: there is considerable "burstiness" and variation over time
- The conditions under which queries are executed frequently differ from the conditions for which the query plans were generated
- A DSMS may face a growing number of data sources and therefore an increasing volume of traffic
7
The “Chain” operator scheduling strategy
8
The classic solution
- Buffer the backlog of unprocessed tuples
- Work through the backlog during periods of light load
- Problem: a heavy load could exceed physical memory (forcing pages out to disk)
- The memory used for these backlogs therefore has to be minimized
9
Finding a better solution
- Claim: the operator scheduling strategy can have a significant impact on run-time resource consumption
- Use an operator scheduling strategy that minimizes the amount of memory used during query execution, i.e., one that reduces the size of the backlogs
10
Chain scheduling
- A near-optimal operator scheduling strategy that outperforms competing operator scheduling strategies
- The strategy concentrates on:
  - Single-stream queries involving selection, projection, and foreign-key joins with stored relations
  - Sliding-window queries over multiple streams
11
The model
- Query execution is conceptualized as a data flow diagram (a directed acyclic graph)
- Nodes correspond to pipelined operators
- Edges represent compositions of operators: an edge from A to B indicates that the output of operator A is the input to operator B
- Another interpretation: an edge represents an input queue that buffers the output from A before it is input to B
12
An example
- Suppose the query is:
  SELECT Name FROM EmployeeStream WHERE ID = '12345';
- The operators are a projection (the SELECT clause) and a selection (the WHERE clause)
- Diagram: the operator path runs Input stream -> Select -> Project -> Output stream
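To make the model concrete, here is a minimal sketch in Python (our own illustration, not from the paper: the Operator class and the per-operator costs and selectivities are assumed values) that describes each operator on the path by its processing time and selectivity; later slides use exactly this kind of (time, selectivity) description.

```python
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    time_per_tuple: float  # time units needed to process one full input tuple
    selectivity: float     # output size produced per unit of input size

# Hypothetical operator path for:
#   SELECT Name FROM EmployeeStream WHERE ID = '12345';
operator_path = [
    Operator("Select(ID = '12345')", time_per_tuple=1.0, selectivity=0.2),
    Operator("Project(Name)",        time_per_tuple=0.5, selectivity=1.0),
]
```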
13
Main ideas
- Operators are thought of as filters: they operate on a set of tuples and return a fraction s of them, where s is the selectivity of the operator
- If s = 0.2 we can interpret the value in two ways:
  - Out of every 10 input tuples, the operator outputs 2 tuples
  - If the input requires 1 unit of memory, the output will require 0.2 units of memory
14
Example
- Consider an operator path with two operators, O1 and O2
- Assume that O1 takes one unit of time to process a tuple and that its selectivity is 0.2
- Assume that O2 takes one unit of time to process 0.2 tuples and that its selectivity is 0, i.e., O2 sends its output out of the system
15
Example (cont.)
- Now consider two scheduling strategies:
- FIFO: a tuple is passed through both operators in two consecutive time units; no other tuples are processed during that time
- Greedy: if there is a tuple buffered before O1, it is processed for one time unit; otherwise, if there are tuples buffered before O2, 0.2 tuples are processed for one time unit
16
Example (cont.)
Memory usage over time (in tuple-size units), assuming one new tuple arrives at every time instant:

Time | Greedy scheduling | FIFO scheduling
  0  |       1.0         |      1.0
  1  |       1.2         |      1.2
  2  |       1.4         |      2.0
  3  |       1.6         |      2.2
  4  |       1.8         |      3.0
  5  |       2.0         |      3.2
  6  |       2.2         |      4.0

- We need to consider the growth or reduction of data as it travels along the operator path
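The table can be reproduced with a small simulation sketch (under the same assumptions the table implies: one unit-size tuple arrives at every time instant, and memory is measured right after each arrival; the function and variable names are ours, not the paper's):

```python
def simulate(strategy, horizon=7):
    """Memory footprint over time for the 2-operator path of the example.

    O1: 1 time unit per tuple, selectivity 0.2 (1.0 in -> 0.2 out).
    O2: 1 time unit per 0.2-tuple block, selectivity 0 (output leaves the system).
    One new tuple of size 1.0 arrives at every integer time instant.
    """
    q1 = 0.0          # total tuple size buffered before O1
    q2 = 0.0          # total tuple size buffered before O2
    fifo_stage = 1    # for FIFO: which operator the oldest tuple is at
    memory = []
    for t in range(horizon):
        q1 += 1.0                      # arrival at time t
        memory.append(round(q1 + q2, 1))
        if strategy == "greedy":       # prefer the operator that sheds memory fastest (O1)
            if q1 > 0:
                q1 -= 1.0; q2 += 0.2   # one unit of work at O1
            elif q2 > 0:
                q2 -= 0.2              # one unit of work at O2
        else:                          # FIFO: finish the oldest tuple before starting the next
            if fifo_stage == 1 and q1 > 0:
                q1 -= 1.0; q2 += 0.2; fifo_stage = 2
            elif q2 > 0:
                q2 -= 0.2; fifo_stage = 1
    return memory

print("greedy:", simulate("greedy"))   # [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2]
print("fifo:  ", simulate("fifo"))     # [1.0, 1.2, 2.0, 2.2, 3.0, 3.2, 4.0]
```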
17
Progress charts
- The behavior of data is captured by progress charts
- Each point (t_i, s_i) represents an operator
- The i-th operator takes (t_i - t_{i-1}) units of time to process a tuple of size s_{i-1}; the result is a tuple of size s_i
18
Progress charts (cont.)
- We can define the selectivity of an operator as the relative drop in tuple size across it
- In other words, the selectivity of operator i is equal to s_i / s_{i-1}
19
The lower envelope
- Consider some point (t, s) on the progress chart
- Imagine a line from this point to every operator point (t_i, s_i) to its right
- The operator point whose line has the steepest downward slope is called the "steepest descent operator point"
20
The lower envelope (cont.)
- By starting at the first point (t_0, s_0) and repeatedly computing the steepest descent operator point, we obtain the lower envelope P' of a progress chart P
- Notice that the segments become progressively less steep: the magnitudes of their slopes are non-increasing
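A short sketch of this steepest-descent construction (our own illustrative code; the progress-chart coordinates below are invented examples):

```python
def lower_envelope(points):
    """Compute the lower envelope of a progress chart.

    points: operator points (t_i, s_i), ordered by increasing t, starting
    at (t_0, s_0).  From the current point, repeatedly jump to the point
    to its right whose connecting line has the most negative slope.
    """
    envelope = [points[0]]
    i = 0
    while i < len(points) - 1:
        t0, s0 = points[i]
        best_j = min(range(i + 1, len(points)),
                     key=lambda j: (points[j][1] - s0) / (points[j][0] - t0))
        envelope.append(points[best_j])
        i = best_j
    return envelope

# Running example: O1 (1 time unit, size 1.0 -> 0.2), O2 (1 time unit, 0.2 -> 0).
print(lower_envelope([(0, 1.0), (1, 0.2), (2, 0.0)]))
# [(0, 1.0), (1, 0.2), (2, 0.0)]  -- each operator lies on its own segment

# A chart where the envelope skips a point: operators 1 and 2 form one group.
print(lower_envelope([(0, 1.0), (1, 0.9), (2, 0.1), (3, 0.0)]))
# [(0, 1.0), (2, 0.1), (3, 0.0)]
```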
21
The lower envelope (cont.)
- So what is it? A way to find which segments of the operator path yield the biggest drops in tuple size
- It allows us to consider changes in selectivity across groups of operators
- We call these groups "chains"
22
"Chain" scheduling
- Chain assigns each operator a priority given by the slope of the lower-envelope segment to which it belongs: the steeper the segment, the higher the priority
- At any time, out of all the operators with tuples in their input queues, the one with the highest priority is chosen
- When there are ties, the operator holding the oldest tuples (by arrival time) is chosen
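One way to picture the scheduling decision itself (a sketch; the dictionary layout, and the priority values taken as the steepness of each operator's envelope segment, are our assumptions):

```python
def pick_next_operator(operators):
    """Chain's decision at one time instant: among operators with queued
    tuples, pick the highest priority; break ties by oldest buffered tuple."""
    ready = [op for op in operators if op["queue"]]
    if not ready:
        return None
    return max(ready, key=lambda op: (op["priority"], -min(op["queue"])))

ops = [
    # 'priority' = steepness of the lower-envelope segment the operator is on,
    # 'queue'    = arrival times of the tuples waiting in its input queue.
    {"name": "O1", "priority": 0.8, "queue": [3, 5]},
    {"name": "O2", "priority": 0.2, "queue": [1]},
]
print(pick_next_operator(ops)["name"])   # O1 -- higher priority wins over tuple age
```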
23
The Chain strategy along the progress chart
- Tuples do not actually move along the lower envelope; they move along the actual operator path
- When the Chain strategy moves along the actual progress chart P rather than the envelope P', the memory requirements are not much greater
24
Multiple stream queries
- Queries that have at least one tuple-based sliding-window join between two streams
25
Multiple stream query execution
- The query is first broken up into parallel operator paths
- Diagram: a sliding-window join of streams R and S is split into one operator path per input stream, with a shared component
26
Experimental results
- Compared the performance of Chain, FIFO, Greedy, and Round-Robin scheduling
- Two data sets of network data: one synthetic, one real
- Queries used IP addresses and packet sizes in selection and projection predicates
27
Experiment: single stream queries (4 operators)
- The query consists of 4 operators
- The third operator is very selective and sits in between two less selective operators
28
Experiment results
29
Multiple stream experiment
- Three simultaneous queries: a sliding-window join and two single-stream queries with selectivities less than one
- Results show Chain outperforms the other strategies by a large margin
30
Multiple stream experiment results
31
Summary
- Proved that the choice of operator scheduling strategy has a significant impact on resource consumption
- Proved that the Chain scheduling strategy outperforms competing strategies
- Future work:
  - Latency and starvation issues
  - Query plans that change over time
  - Sharing of computation and memory across query plans
32
"Adaptive filters" for continuous queries over distributed data streams
33
What's the problem?
- Distributed data sources continuously stream updates to a centralized processor, where continuous queries are evaluated
- Because of the high volume of data updates, the communication overhead jeopardizes system performance
- E.g., path latency computed by monitoring queuing latency at routers: the volume of monitoring traffic from the routers may exceed that of the normal traffic
- Can we reduce the communication overhead enough to make continuous queries over multiple data streams feasible and efficient?
34
Important observations
- Exact precision for continuous queries is not always needed
  - E.g., the path-latency application may only need accuracy to within 5 ms
- Approximate answers of sufficient precision can usually be computed from a small fraction of the input stream
  - E.g., the average network traffic volume received by all hosts within the organization
- The precision constraint of a query may change over time
  - E.g., more precise traffic volumes are needed when facing an attack
35
Overview of approach
- Reduce communication overhead at the cost of query precision
- Quantitative precision constraints are specified with the continuous queries:
  - A bounded approximate answer [L, H]
  - A precision constraint δ, with 0 ≤ H - L ≤ δ
- Filters are installed at the remote data sources by the stream processor
  - The filter at data object O's source is a bound [L_O, H_O] of width W_O centered around the most recent numeric update V
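A sketch of the filter kept at a data source (illustrative only; the class and method names are ours, not the paper's API): an update is streamed to the coordinator only when it escapes the current bound, and the bound is then re-centered around it.

```python
class SourceFilter:
    """Source-side filter: stream a value only when it leaves [low, high]."""

    def __init__(self, initial_value, width):
        self.width = width
        self._recenter(initial_value)

    def _recenter(self, value):
        self.low = value - self.width / 2
        self.high = value + self.width / 2

    def on_update(self, value, send):
        if not (self.low <= value <= self.high):
            send(value)             # update escapes the filter
            self._recenter(value)   # bound re-centered around the new value

# Example: only the updates that leave the bound are streamed.
f = SourceFilter(initial_value=10.0, width=4.0)     # bound [8, 12]
for v in [11.0, 12.5, 12.0, 20.0]:
    f.on_update(v, send=lambda x: print("stream", x))
# prints: stream 12.5   then   stream 20.0
```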
36
Naive filtering policy: uniform allocation
- E.g., a single CQ AVG(O_1, O_2, ..., O_n) with precision constraint δ: give every filter a bound of width δ
- The wider a bound, the more restrictive the filter, and consequently the more imprecise the query answers
- Cons:
  - When multiple CQs are issued over one object, the smallest required bound width must be chosen for the filter, so the resulting higher update stream rate benefits only a few CQs and is wasted on the rest
  - Update rates and update magnitudes are not taken into account
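A quick check of why width-δ filters suffice for a single AVG query (our own sketch): the width of the AVG answer interval is the mean of the per-object bound widths, so uniform widths of δ yield an answer of width exactly δ.

```python
def avg_bound(bounds):
    """Bounded AVG answer computed from the per-object bounds cached at the coordinator."""
    n = len(bounds)
    low = sum(l for l, _ in bounds) / n
    high = sum(h for _, h in bounds) / n
    return low, high

# Uniform allocation with delta = 4.0: every bound has width 4.0,
# and so does the resulting AVG interval.
low, high = avg_bound([(8.0, 12.0), (0.0, 4.0)])
print(low, high, high - low)   # 4.0 8.0 4.0
```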
37
System structure: components
- Data sources with filters
- Stream coordinator
- Precision manager
- Bound cache
- CQ evaluator
38
System structure
39
Adaptive filter setting algorithm
- Goal: set the bound widths of the stream filters adaptively, so as to reduce communication cost while guaranteeing the precision constraints of the CQs
- Only AVG queries are analyzed: Q_1, Q_2, ..., Q_m with sets S_1, S_2, ..., S_m, where each S_j is a subset of the n data objects O_1, O_2, ..., O_n
- Query result for Q_j: a bounded answer [L_j, H_j] for the average of the values of the objects in S_j
- Precision constraint: H_j - L_j ≤ δ_j
- Basic idea: implicit bound width shrinking combined with explicit bound width growing
40
Bound shrinking
- The filtering bound width W_i for object O_i is maintained both at the central stream coordinator and at the source filter
- Every Γ time units, each width shrinks: W_i ← W_i · (1 - S)
  - Γ: adjustment period
  - S: shrink percentage
41
Bound growing
- Burden score B_i: the degree to which an object is contributing to the overall communication cost due to streamed updates, where C_i is the communication cost for O_i, W_i is the current bound width, and N_i is the number of updates of O_i received by the stream coordinator in the last Γ time units
- Burden target: the lowest overall burden required of the objects in a query in order to meet its precision constraint at all times
42
Bound growing (cont.)
- Burden deviation: the degree to which an object is "over-burdened" with respect to the burden targets of the queries that access it
- Queried objects are considered in order of decreasing deviation, and each object is assigned the maximum possible bound growth when it is considered
43
Bound growing (summary)
1. Each object is assigned a burden score
2. Each query is assigned a burden target, by either averaging burden scores or invoking an iterative linear solver
3. Each object is assigned a deviation value based on the difference between its burden score and the burden targets of the queries that access it
4. The objects are considered in order of decreasing deviation, and each object is assigned the maximum possible bound growth when it is considered
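Putting the four steps together, here is a rough sketch of one adjustment period for the simplest case: a single AVG query over all objects, with the burden target computed by averaging. The burden-score formula and the growth rule used here are our own assumptions built from the quantities named on the slides (C_i, N_i, W_i, Γ), not the paper's exact definitions.

```python
def adjust_bounds(objects, delta, shrink_pct, period):
    """One shrink/grow cycle for a single AVG query over all objects (sketch).

    objects: dicts with 'W' (bound width), 'C' (cost of streaming an update),
             'N' (updates received from that source in the last period).
    delta:   precision constraint of the AVG query (mean width must stay <= delta).
    """
    n = len(objects)
    # 1. Implicit shrinking: every bound narrows, freeing up width "budget".
    for o in objects:
        o["W"] *= (1.0 - shrink_pct)
    # 2. Burden scores (assumed here to be C*N / (W*period)) and one burden
    #    target obtained by simple averaging.
    for o in objects:
        o["B"] = o["C"] * o["N"] / (o["W"] * period)
    target = sum(o["B"] for o in objects) / n
    if target <= 0:
        return objects
    # 3. Width budget left by shrinking: the mean width may grow back up to delta.
    budget = n * delta - sum(o["W"] for o in objects)
    # 4. Grow bounds in order of decreasing burden deviation (B - target),
    #    giving each object at most the growth that brings its score to target.
    for o in sorted(objects, key=lambda o: o["B"] - target, reverse=True):
        if budget <= 0:
            break
        desired = o["C"] * o["N"] / (target * period)   # width at which B == target
        growth = min(budget, max(0.0, desired - o["W"]))
        o["W"] += growth
        budget -= growth
    return objects

# A frequently updated object ends up with a wider bound than a quiet one.
objs = [{"W": 4.0, "C": 1.0, "N": 50}, {"W": 4.0, "C": 1.0, "N": 2}]
print([round(o["W"], 2) for o in adjust_bounds(objs, delta=4.0, shrink_pct=0.1, period=10.0)])
# [4.4, 3.6]
```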
44
Burden target computation
- Single AVG query Q_k over every object O_1, ..., O_n: the target is chosen so that B_1 = B_2 = ... = B_n = T_k, i.e., T_k is the average of the current burden scores
- Intuitive explanation behind this formula:
  - Objects with higher-than-average burden scores are given a higher priority for bound width growth, to lower their burden scores
  - Objects with lower-than-average burden scores will shrink by default, thereby raising their burden scores
45
Burden target computation (cont.)
- Multiple queries over different sets of objects
- θ_{i,j}: the portion of object O_i's burden score corresponding to query Q_j; these portions sum to B_i
- The goal when adjusting burden scores in the presence of overlapping queries is to have the burden score B_i of each object O_i equal the sum of the burden targets of the queries over O_i
- The burden targets that satisfy this goal are found with the iterative linear solver mentioned above
46
Validation against an optimized strategy
- The adaptive bound-width setting algorithm converges on bounds that are on par with those selected by an optimizer
47
Implementation and experimental validation: single query
48
Implementation and experimental validation: multiple queries
49
Summary
- Trade the precision of query results for lower communication costs
- Contributions: the specification of precision constraints for continuous queries, and adaptive filters
- Future work:
  - How imprecision propagates through more complex query plans
  - Appropriate optimization techniques for adapting remote filter predicates in more complex environments
50
Conclusion
- The problem: a DSMS must cope with the high volume as well as the "burstiness" of data streams
- The effectiveness of such systems depends on being able to gracefully adapt to environmental conditions (e.g., resource availability)
- Two different approaches to adaptivity:
  - Minimizing the amount of memory used at all times
  - Controlling the amount of data sent from multiple data sources
51
Conclusion (cont.)
- Chain operator scheduling minimizes the amount of memory used during execution, making the system more adaptable to variations in arrival rates
- Adaptive filters reduce the volume of transmitted data so that a system can perform efficiently while still providing a specified level of precision
- Overall, adaptivity in a DSMS is necessary because of the unpredictability of data streams
52
References
- J. M. Hellerstein et al. Adaptive Query Processing: Technology in Evolution. IEEE Data Engineering Bulletin, 2000.
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and Issues in Data Stream Systems. ACM PODS 2002.
- B. Babcock, S. Babu, M. Datar, and R. Motwani. Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. ACM SIGMOD 2003.
- C. Olston, J. Jiang, and J. Widom. Adaptive Filters for Continuous Queries over Distributed Data Streams. ACM SIGMOD 2003.