1 STREAM: The Stanford Data Stream Management System (STanford stREam datA Manager) http://infolab.stanford.edu/stream/ 69521001 陳盈君, 69521038 吳哲維, 69521040 林冠良
2 Outline: Introduction; The Continuous Query Language (CQL); Query Plans and Execution; Performance Issues (for query plans): synopsis sharing, exploiting constraints, operator scheduling; How to Use Operator Scheduling; Adaptivity; Approximation; DSMS Interface & Future Directions
3 Introduction
4 Data Streams: continuous, unbounded, rapid, time-varying streams of data elements. They occur in a variety of modern applications: network monitoring and traffic engineering; sensor networks and RFID tags; telecom call records; financial analysis; manufacturing processes.
5 STREAM system: the Stanford Data Stream Manager. The traditional view is simple: first load the data, then index it, then run queries. STREAM instead deals with continuous data streams and continuous queries.
6 The Continuous Query Language (CQL)
7 CQL & SQL: CQL starts with SQL and then adds streams as a new data type, continuous instead of one-time semantics, windows on streams, sampling on streams, and operators between streams and relations.
8 CQL: abstract semantics. The abstract semantics is based on two data types, streams and relations, both defined over a discrete, ordered time domain.
9 Data type: stream. A stream S is an unbounded bag of pairs ⟨s, τ⟩, where s is a tuple and τ is a timestamp denoting the logical arrival time of tuple s on stream S. A stream is thus a collection of timestamped tuples; the element ⟨s, τ⟩ indicates that tuple s arrives on S at time τ.
10 Data type: relation. A relation R is a time-varying bag of tuples. The bag of tuples at time τ is denoted R(τ), and we call R(τ) an instantaneous relation. For example: at time 0, R(0) = ∅; at time 1, R(1) = {⟨a⟩}; at time 2, R(2) = {⟨a⟩, ⟨b⟩}; at time 3, R(3) = {⟨b⟩, ⟨c⟩}.
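The following is a minimal Python sketch, not part of the STREAM system, of how streams and instantaneous relations can be modeled; the tuple values and the names stream_S, R, and instantaneous are made up for illustration.

```python
from collections import defaultdict

# A stream: an unbounded bag of (tuple, timestamp) pairs, here just a list.
stream_S = [(("a",), 1), (("b",), 2), (("c",), 3)]   # hypothetical elements

# A relation: a time-varying bag of tuples; R[tau] is the instantaneous relation R(tau).
R = defaultdict(list)
R[1] = [("a",)]
R[2] = [("a",), ("b",)]
R[3] = [("b",), ("c",)]

def instantaneous(relation, tau):
    """Return R(tau), the bag of tuples in the relation at time tau (empty if undefined)."""
    return list(relation[tau])

print(instantaneous(R, 2))   # [('a',), ('b',)]
print(instantaneous(R, 0))   # []  (R(0) is the empty bag)
```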
11 Three classes of operators over streams and relations A relation-to-relation operator takes one or more relations as input and produces a relation as output. A stream-to-relation operator takes a stream as input and produces a relation as output. A relation-to-stream operator takes a relation as input and produces a stream as output.
12 Data types and operator classes (diagram): streams and relations are connected by stream-to-relation, relation-to-relation, and relation-to-stream operators.
13 Operator classes. Stream-to-stream operators? They are absent: they are composed from operators of the above three classes. A continuous query Q is a tree of operators belonging to the above three classes. Streams and relations are the inputs to the leaf operators, and the output of Q is the output of the root operator.
14 Illustration of a continuous query tree: stream inputs feed stream-to-relation operators, whose output relations feed relation-to-relation operators; a relation-to-stream operator at the root produces the output stream.
15 Relation-to-relation operators in CQL. CQL uses SQL constructs to express its relation-to-relation operators. Some relation-to-relation operators: select, project, binary-join, union, except, etc.
16 Stream-to-relation operators in CQL. The stream-to-relation operators in CQL are based on the concept of a sliding window over a stream. There are three sliding window types: tuple-based, time-based, and partitioned.
17 Tuple-based sliding window. A tuple-based sliding window on a stream S takes an integer N > 0 as a parameter and produces a relation R. At time τ, R(τ) contains the N tuples of S with the largest timestamps ≤ τ. It is specified by following S with "[Rows N]". As a special case, "[Rows Unbounded]" denotes the append-only window "[Rows ∞]".
18 Time-based sliding window. A time-based sliding window on a stream S takes a time interval w as a parameter and produces a relation R. At time τ, R(τ) contains all tuples of S with timestamps between τ − w and τ. It is specified by following S with "[Range w]". As a special case, "[Now]" denotes the window with w = 0.
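A minimal sketch of the two window semantics over a list of (tuple, timestamp) elements; the function names and the sample stream are assumptions for illustration, not STREAM's implementation.

```python
def rows_window(stream, tau, n):
    """[Rows N]: the N tuples of the stream with the largest timestamps <= tau."""
    arrived = sorted((e for e in stream if e[1] <= tau), key=lambda e: e[1])
    return [s for s, _ in arrived[-n:]]

def range_window(stream, tau, w):
    """[Range w]: all tuples with timestamps between tau - w and tau."""
    return [s for s, t in stream if tau - w <= t <= tau]

S = [(("a",), 1), (("b",), 2), (("c",), 5)]      # hypothetical stream elements
print(rows_window(S, 5, 2))    # the 2 most recent tuples: [('b',), ('c',)]
print(range_window(S, 5, 3))   # tuples from time 2 to 5: [('b',), ('c',)]
```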
19 Partitioned sliding window. A partitioned sliding window on a stream S takes an integer N and a set of attributes {A1, ..., Ak} of S as parameters. It logically partitions S into sub-streams by equality on A1, ..., Ak, computes a tuple-based window of size N on each sub-stream, and takes the union of those windows as the resulting relation. It is specified by following S with "[Partition By A1, ..., Ak Rows N]".
20 Illustration of a partitioned window. For [Partition By A1, A2, A3 Rows N]: for each distinct combination of (A1, A2, A3) values, the window retains the N most recent (by timestamp) tuples of stream S.
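A sketch of the partitioned-window semantics, assuming tuples are represented as dictionaries keyed by attribute name; the names below are illustrative only.

```python
from collections import defaultdict

def partition_window(stream, tau, n, key_attrs):
    """[Partition By A1,...,Ak Rows N]: a Rows-N window for each distinct key value."""
    groups = defaultdict(list)
    for tup, t in stream:
        if t <= tau:
            groups[tuple(tup[a] for a in key_attrs)].append((tup, t))
    result = []
    for elems in groups.values():
        elems.sort(key=lambda e: e[1])               # order each sub-stream by timestamp
        result.extend(tup for tup, _ in elems[-n:])  # keep the N most recent per group
    return result

S = [({"A1": 1, "A2": "x", "A3": 0, "B": 10}, 1),
     ({"A1": 1, "A2": "x", "A3": 0, "B": 20}, 2),
     ({"A1": 2, "A2": "y", "A3": 0, "B": 30}, 3)]
print(partition_window(S, 3, 1, ["A1", "A2", "A3"]))   # one most-recent tuple per group
```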
21 Relation-to-stream operators in CQL. CQL has three relation-to-stream operators: Istream (for "insert stream"), Dstream (for "delete stream"), and Rstream (for "relation stream").
22 Istream. Istream applied to a relation R contains a stream element ⟨s, τ⟩ whenever tuple s is in R(τ) − R(τ−1), i.e., whenever s is inserted into R at time τ. Assuming R(−1) = ∅ for notational simplicity, we have: Istream(R) = ∪_{τ ≥ 0} ((R(τ) − R(τ−1)) × {τ}).
23 Dstream. Dstream applied to a relation R contains a stream element ⟨s, τ⟩ whenever tuple s is in R(τ−1) − R(τ), i.e., whenever s is deleted from R at time τ. Formally: Dstream(R) = ∪_{τ > 0} ((R(τ−1) − R(τ)) × {τ}).
24 Istream example. Suppose that at time τ, R(τ) = {⟨a⟩, ⟨b⟩, ⟨c⟩, ⟨d⟩}, and at time τ−1, R(τ−1) = {⟨a⟩, ⟨b⟩, ⟨c⟩, ⟨e⟩}. Then the Istream elements at time τ come from R(τ) − R(τ−1) = {⟨d⟩}.
25 Dstream example. With the same relation, R(τ) = {⟨a⟩, ⟨b⟩, ⟨c⟩, ⟨d⟩} and R(τ−1) = {⟨a⟩, ⟨b⟩, ⟨c⟩, ⟨e⟩}, the Dstream elements at time τ come from R(τ−1) − R(τ) = {⟨e⟩}.
26 Rstream. Rstream applied to a relation R contains a stream element ⟨s, τ⟩ whenever tuple s is in R(τ). Formally: Rstream(R) = ∪_{τ ≥ 0} (R(τ) × {τ}).
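A sketch of the three relation-to-stream operators, using Python sets rather than bags for simplicity; the dictionary R mapping times to instantaneous relations and the sample values are hypothetical.

```python
def istream(R, tau):
    """Tuples inserted into R at time tau: R(tau) - R(tau-1), with R(-1) = empty."""
    return {(s, tau) for s in R.get(tau, set()) - R.get(tau - 1, set())}

def dstream(R, tau):
    """Tuples deleted from R at time tau: R(tau-1) - R(tau)."""
    return {(s, tau) for s in R.get(tau - 1, set()) - R.get(tau, set())}

def rstream(R, tau):
    """Every tuple in R(tau), emitted with timestamp tau."""
    return {(s, tau) for s in R.get(tau, set())}

# Instantaneous relations mirroring the slides' example (values are illustrative):
R = {4: {"a", "b", "c", "e"}, 5: {"a", "b", "c", "d"}}
print(istream(R, 5))   # {('d', 5)}  -- inserted at time 5
print(dstream(R, 5))   # {('e', 5)}  -- deleted at time 5
```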
27 Example CQL queries. Example 1: a continuous query that filters a stream S: Select Istream(*) From S [Rows Unbounded] Where S.A > 10. Stream S is converted into a relation by applying an unbounded window, the relation-to-relation filter "S.A > 10" is applied, and the tuples inserted into the filtered relation are streamed as the result using the relation-to-stream operator Istream(*). Note: the query can be rewritten in the following more intuitive form: Select * From S Where S.A > 10.
28 Example 2: a window join. Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10. The answer to this query is a relation: at any given time, the answer relation contains the join of the last 1000 tuples of S1 with the tuples of S2 that have arrived in the previous 2 minutes. If we want a stream result instead, we wrap the select list in a relation-to-stream operator, e.g. Select Istream(S1.A).
29 Exercise (Example 3). We have a stored table R and a stream S. We want a stream result that pairs each newly arrived S tuple with the R tuples whose attribute A matches, and we are interested only in attribute A of S and attribute B of R. Answer: Select Rstream(S.A, R.B) From S [Now], R Where S.A = R.A.
30 Query Plans and Execution
31 Introduction. When a continuous query specified in CQL is registered with the STREAM system, a query plan is compiled from it. Query plans are composed of operators, which perform the actual processing; queues, which buffer tuples as they move between operators; and synopses, which store operator state.
32 Operators Each query plan operator reads from one or more input queues, processes the input based on its semantics, and writes its output to an output queue.
33 Table 1: Operators used in STREAM query plans.
34 Queues. A queue in a query plan connects its "producing" plan operator Op to its "consuming" operator Oc (see the query tree above). A queue logically contains a sequence of elements representing either a stream or a relation. Many of the operators in the STREAM system require that elements on their input queues be read in non-decreasing timestamp order, for example the stream-to-relation (sliding-window) operators.
35 Synopses. Logically, a synopsis belongs to a specific plan operator, storing state that may be required for future evaluation of that operator. Different operators need different numbers of synopses: for example, a binary join has 2 synopses (one per input), while a select has 0. Synopses store summaries of the tuples seen so far. Synopses can also be shared, as discussed under synopsis sharing below.
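A minimal structural sketch of how operators, queues, and synopses fit together; the class names and interfaces are made up for illustration and are not STREAM's actual classes.

```python
from collections import deque

class Queue:
    """Buffers timestamped elements flowing from a producing to a consuming operator."""
    def __init__(self):
        self.elements = deque()
    def enqueue(self, element):
        self.elements.append(element)
    def dequeue(self):
        return self.elements.popleft() if self.elements else None

class Operator:
    """Reads from its input queues, updates its synopses, and writes to its output queue."""
    def __init__(self, inputs, output, synopses=None):
        self.inputs = inputs            # list of Queue
        self.output = output            # Queue
        self.synopses = synopses or []  # per-operator state, e.g. current window contents
    def run(self, batch_size):
        raise NotImplementedError       # each concrete operator implements its own semantics
```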
36 Query plan. When a CQL query is registered, STREAM constructs a query plan: a tree of operators, connected by queues, with synopses attached to operators as needed. An example: Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10.
37 Query plan example (cont.)
38 An alternative plan for the same query
39 Query plan execution. Add a flag "+" or "−" to each pair ⟨s, τ⟩ so that it becomes ⟨s, τ, +⟩ or ⟨s, τ, −⟩; "+" means insert and "−" means delete. How are the flags used? We show this by executing the example query plan: Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10.
40 Query plan execution (figure). At time τ, the S1 window holds tuples s1, s2, s3, ..., s1000 and the S2 window holds t1, t2, t3, ..., tn. At time τ+1 a new tuple s1001 is inserted and the oldest tuple is deleted, producing "+" and "−" elements on the output queue q3 that feeds the join of the two window relations.
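A sketch of a [Rows N] window operator that tags window insertions with "+" and expirations with "−", in the spirit of the execution described above; the class and queue names are assumptions, not STREAM's implementation.

```python
from collections import deque

class RowsWindowOperator:
    """Converts a stream into a relation, emitting (tuple, timestamp, '+') on insertion
    and (tuple, timestamp, '-') when a tuple falls out of the N-tuple window."""
    def __init__(self, n, output_queue):
        self.n = n
        self.window = deque()            # synopsis: current window contents
        self.output = output_queue

    def process(self, s, tau):
        self.window.append(s)
        self.output.append((s, tau, "+"))
        if len(self.window) > self.n:
            expired = self.window.popleft()
            self.output.append((expired, tau, "-"))

q3 = []
op = RowsWindowOperator(n=2, output_queue=q3)
for tau, s in enumerate(["s1", "s2", "s3"], start=1):
    op.process(s, tau)
print(q3)   # [('s1', 1, '+'), ('s2', 2, '+'), ('s3', 3, '+'), ('s1', 3, '-')]
```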
41 Performance Issues
42 Performance Issues. Simply generating the straightforward query plans and executing them as described can be very inefficient. The ideas are: eliminating data redundancy; discarding data that will not be used; and selectively scheduling operators to most efficiently reduce intermediate state. All of these reduce memory overhead.
43 Synopsis sharing (1/2). Synopsis sharing comes in two flavors: sharing within a single query plan and sharing across multiple query plans. Purpose: to eliminate data redundancy, we replace synopses with lightweight stubs and a single store that holds the actual tuples (Figure 4.2). Elements: stubs implement the same interface as synopses, and a single synopsis store can present different views of the data to different operators. The store must track the progress of each stub and present the appropriate view (subset of tuples) to each stub.
44 Synopsis sharing (2/2). A tuple is inserted into the store as soon as it is inserted by any one of the stubs, and it is removed only when it has been removed from all of the stubs. To decrease state redundancy, multiple query plans involving similar intermediate relations can share synopses as well. Example: Select A, Max(B) From S1 [Rows 200] Group By A (Fig. 3); the idea is that the window S1 [Rows 200] is a subset of the window S1 [Rows 1000] used by another plan, so both can be served from one shared store, as sketched below.
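A sketch of the shared-store idea: one store holds the actual tuples while lightweight stubs keep only references, so a tuple is dropped only after every stub has removed it. The class names are illustrative, not STREAM's.

```python
class SharedStore:
    """Holds the actual tuples once; tracks which stubs still reference each tuple."""
    def __init__(self):
        self.owners = {}                      # tuple -> set of stub ids referencing it
    def insert(self, stub_id, tup):
        self.owners.setdefault(tup, set()).add(stub_id)
    def remove(self, stub_id, tup):
        refs = self.owners.get(tup, set())
        refs.discard(stub_id)
        if not refs:                          # removed from all stubs -> drop the tuple
            self.owners.pop(tup, None)
    def view(self, stub_id):
        return [t for t, refs in self.owners.items() if stub_id in refs]

class SynopsisStub:
    """Implements the synopsis interface but delegates storage to the shared store."""
    def __init__(self, stub_id, store):
        self.stub_id, self.store = stub_id, store
    def insert(self, tup): self.store.insert(self.stub_id, tup)
    def remove(self, tup): self.store.remove(self.stub_id, tup)
    def contents(self): return self.store.view(self.stub_id)
```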
46 Exploiting constraints (1/2). Streams may exhibit data or arrival patterns that can be exploited to reduce run-time synopsis sizes. Data constraints can either be specified at stream-registration time or inferred by gathering statistics over time. Example: a continuous query that joins a stream Orders with a stream Fulfillments on attributes orderID and itemID, perhaps to monitor average fulfillment delays.
47 Exploiting constraints (2/2). Idea: in the general case, this query requires synopses of unbounded size. However, if we know that all elements for a given orderID and itemID arrive on Orders before the corresponding elements arrive on Fulfillments, then we need not maintain a join synopsis for the Fulfillments operand at all. If Fulfillments elements arrive clustered by orderID, then we need only save Orders tuples for a given orderID until the next orderID is seen.
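A sketch of how the clustered-arrival constraint can be exploited for the Orders/Fulfillments join: once Fulfillments moves on to a new orderID, Orders tuples for the finished orderID can be purged from the synopsis. The class and attribute handling are assumptions for illustration, not STREAM's code.

```python
class ConstrainedJoin:
    """Join Orders with Fulfillments on (orderID, itemID), assuming Fulfillments
    elements arrive clustered by orderID."""
    def __init__(self):
        self.current_order = None
        self.orders_synopsis = []         # only Orders tuples that may still join

    def on_order(self, order):
        self.orders_synopsis.append(order)

    def on_fulfillment(self, f):
        if f["orderID"] != self.current_order:
            # Fulfillments moved to a new orderID: tuples for the finished orderID
            # can never join again, so purge them from the synopsis.
            self.orders_synopsis = [o for o in self.orders_synopsis
                                    if o["orderID"] != self.current_order]
            self.current_order = f["orderID"]
        return [(o, f) for o in self.orders_synopsis
                if o["orderID"] == f["orderID"] and o["itemID"] == f["itemID"]]
```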
48 Operator scheduling (1/6). An operator consumes elements from its input queues and produces elements on its output queue. The global operator-scheduling policy can have a large effect on memory utilization. Two scheduling strategies are compared: FIFO scheduling and greedy scheduling. FIFO scheduling: when a batch of n elements has been accumulated, it is passed through both operators in two consecutive time units, during which no other elements are processed.
49 Operator scheduling (2/6). Greedy scheduling gives preference to the operator with the greatest rate of reduction in total queue size per unit time. Example: a query plan has two operators, O1 followed by O2. O1 takes one time unit to process a batch of n elements and outputs 0.2n elements per input batch (i.e., its selectivity is 0.2). O2 takes one time unit to operate on 0.2n elements and sends its output out of the system (its selectivity is 0). Consider the following arrival pattern: n elements arrive at every time instant from t = 0 to t = 6, then no elements arrive from t = 7 through t = 13.
50 Operator scheduling (3/6). Result (t0–t6), total queue size in units of n: under FIFO the queues grow, e.g. 1, (1 − 1 + 0.2) + 1 = 1.2, 1.2 − 0.2 + 1 = 2.0, (2 − 1 + 0.2) + 1 = 2.2, ...; under greedy they stay small, e.g. 1, (1 − 1 + 0.2) + 1 = 1.2, (1.2 − 1 + 0.2) + 1 = 1.4, ... Issue: the greedy strategy performs better because it runs O1 whenever O1 has input, reducing total queue size by 0.8n elements each time step. A small simulation of the two policies is sketched below.
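A small simulation of the two policies on this example (queue sizes in units of n). It is only a sketch: FIFO is approximated by alternating O1 and O2 on accumulated batches, and the exact numbers depend on when queue sizes are sampled relative to arrivals, so they differ slightly from the slide's trace, but the qualitative gap between the two policies is the same.

```python
def simulate(policy, arrivals, steps=14):
    """Total queue size under a scheduling policy.
    O1: 1 batch/unit, selectivity 0.2; O2: 0.2 batch/unit, selectivity 0 (output leaves)."""
    q1 = q2 = 0.0                              # input queues of O1 and O2, in units of n
    sizes = []
    for t in range(steps):
        q1 += arrivals(t)
        if policy == "greedy":
            # O1 frees 0.8n per unit and O2 only 0.2n, so run O1 whenever it has a batch
            if q1 >= 1.0:
                q1 -= 1.0; q2 += 0.2
            else:
                q2 = max(0.0, q2 - 0.2)
        else:                                   # "fifo": alternate O1 and O2 on each batch
            if t % 2 == 0 and q1 >= 1.0:
                q1 -= 1.0; q2 += 0.2
            else:
                q2 = max(0.0, q2 - 0.2)
        sizes.append(round(q1 + q2, 2))
    return sizes

arrivals = lambda t: 1.0 if t <= 6 else 0.0     # n elements per instant for t = 0..6
print("FIFO  ", simulate("fifo", arrivals))
print("Greedy", simulate("greedy", arrivals))   # stays much smaller than FIFO
```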
51 Operator scheduling (4/6). Exercise: a query plan has three operators O1, O2, O3, executed in the order O1 → O2 → O3. O1 produces 0.9n elements per n input elements in one time unit; O2 processes 0.9n elements in one time unit without changing the input size; O3 processes 0.9n elements in one time unit and sends its output out of the system. Give the result table for t0–t6.
52 Operator scheduling (5/6). Result (t0–t6): greedy scheduling keeps choosing O1 (O2 offers no reduction and O3 never receives input), so total queue size grows as (1 − 1 + 0.9) + 1 = 1.9, (1.9 − 1 + 0.9) + 1 = 2.8, (2.8 − 1 + 0.9) + 1 = 3.7, and so on.
53 Operator scheduling (6/6). Issue: under greedy scheduling, operator O3 must wait until O1 and O2 have run. Under FIFO scheduling we can view O1 through O3 as one block, whose average reduction is 0.33n per time unit, better than the 0.1n per time unit achieved by running O1 alone. This is the greedy algorithm's shortcoming: although operator O3 offers the highest reduction (highest priority), it is blocked by upstream operators with the lowest priority.
54 How to use operator scheduling: the Chain scheduling algorithm. Start by marking the first operator in the query plan as the "current" operator. Find the block of consecutive operators, starting at the "current" operator, that maximizes the reduction in total queue size per unit time. Mark the first operator following this block as the new "current" operator and repeat the previous step until all operators have been assigned to chains. A sketch of this chain construction follows.
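A sketch of the chain construction step described above, with operators given as (name, time per batch, selectivity) triples. The time accounting is simplified (each operator is charged one invocation per batch regardless of the batch's current size), so it reproduces the 0.33n figure from the previous slide but is not the exact Chain algorithm from the literature.

```python
def build_chains(operators):
    """Group consecutive operators into chains that maximize queue reduction per unit time."""
    chains, i = [], 0
    while i < len(operators):
        best_j, best_rate = i, -1.0
        surviving, total_time = 1.0, 0.0
        for j in range(i, len(operators)):
            _, t, sel = operators[j]
            total_time += t                        # time to run operator j on the batch
            surviving *= sel                       # fraction of the batch still in queues
            rate = (1.0 - surviving) / total_time  # reduction in total queue size per unit
            if rate > best_rate:
                best_rate, best_j = rate, j
        chains.append([op[0] for op in operators[i:best_j + 1]])
        i = best_j + 1                             # first operator after the chosen block
    return chains

# The exercise's operators: O1 (selectivity 0.9), O2 (1.0), O3 (0, output leaves the system)
ops = [("O1", 1.0, 0.9), ("O2", 1.0, 1.0), ("O3", 1.0, 0.0)]
print(build_chains(ops))   # [['O1', 'O2', 'O3']] -- one chain, reduction 1/3 = 0.33n per unit
```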
55 Adaptivity
56 Adaptivity (1/4). Because data and arrival characteristics of streams may vary significantly over time in long-running stream applications, an adaptive approach to query processing is necessary. The STREAM system includes a monitoring and adaptive query-processing infrastructure called StreaMon.
57 Adaptivity (2/4). StreaMon has three components: (1) the Executor, which runs query plans to produce results; (2) the Profiler, which collects and maintains statistics about stream and plan characteristics; (3) the Reoptimizer, which ensures that the plans and memory structures are the most efficient for the current characteristics.
58 Adaptivity (3/4). The Profiler and the Reoptimizer are both essential for adaptivity, but they compete for resources with the Executor.
59 Adaptivity (4/4). There is a clear three-way tradeoff among run-time overhead, speed of adaptivity, and provable convergence if conditions stabilize. StreaMon supports multiple adaptive algorithms that lie at different points along this tradeoff spectrum.
60 Approximation
61 Approximation: environment. A data stream environment is the combination of: multiple unbounded and possibly rapid incoming data streams; multiple complex continuous queries with timeliness requirements; and finite computation and memory resources.
62 Approximation: goal. Our goal is to build a system that, under these circumstances, degrades gracefully to approximate query answers. Because there is a close relationship between resource management and approximation, our overall goal is to maximize query precision by making the best use of available resources.
63 Approximation: comparison. We propose some static approximation (compile-time) techniques and some dynamic approximation (run-time, adaptive) techniques. In comparison with other systems: both the Telegraph and Niagara projects do consider resource management, but not in the context of providing approximate query answers when available resources are insufficient; Aurora addresses it with "QoS graphs" that capture tradeoffs among precision, response time, resource usage, and usefulness to the application.
64 Static approximation. In static approximation, queries are modified when they are submitted to the system so that they use fewer resources at execution time. The two static approximation techniques we consider are window reduction and sampling-rate reduction.
65 Dynamic approximation. In dynamic approximation, queries are unchanged, but the system may not always provide precise query answers. The three dynamic approximation techniques we consider are synopsis compression, sampling, and load shedding. For example, the memory-limited case can use synopsis compression to reduce memory use (e.g., incorporating a window into a synopsis where no window is being used, or shrinking an existing window, will shrink the synopsis).
66 Dynamic approximation: CPU-limited. As another example, load in the CPU-limited case can be reduced by load shedding: dropping elements from query plans and saving the CPU time that would be required to process them to completion. We implement load shedding by introducing sampling operators that probabilistically drop stream elements, as sketched below.
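A sketch of such a sampling operator; the class name, the downstream callback, and the sampling rate are illustrative assumptions rather than STREAM's actual interface.

```python
import random

class SampleOperator:
    """Load-shedding sampling operator: passes each stream element through
    with probability p and drops it otherwise."""
    def __init__(self, p, downstream):
        self.p = p
        self.downstream = downstream      # callable that consumes surviving elements

    def process(self, element):
        if random.random() < self.p:
            self.downstream(element)      # element survives the sample
        # else: dropped, saving the CPU time needed to process it to completion

survivors = []
op = SampleOperator(p=0.5, downstream=survivors.append)
for tau in range(10):
    op.process(("s%d" % tau, tau))
print(len(survivors), "of 10 elements survived the 50% sample")
```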
67 Dynamic approximation: CPU-limited (cont.). If we know a few basic statistics on the distribution of values in our streams, probabilistic guarantees on the accuracy of sliding-window aggregation queries for a given sampling rate can be derived mathematically. Assume that for a given query Qi we know the mean μi and standard deviation σi of the values we are aggregating, as well as the window size Ni. We can then use the Hoeffding inequality to derive a bound on the probability δ that our relative error exceeds a given threshold εmax for a given sampling rate Pi, and hence a minimum sampling rate Pi that meets a target (δ, εmax).
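For reference, a generic form of the Hoeffding inequality from which such a bound is derived, for n independent samples X1, ..., Xn with values in [a, b] and mean μ; the exact expression for Pi in terms of μi, σi, Ni, δ, and εmax is given in the load-shedding paper cited in the references and is not reproduced here.

```latex
\Pr\!\left[\,\Bigl|\tfrac{1}{n}\textstyle\sum_{j=1}^{n} X_j - \mu\Bigr| \ge \varepsilon\,\right]
\;\le\; 2\exp\!\left(\frac{-2\,n\,\varepsilon^{2}}{(b-a)^{2}}\right)
```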
68 Static approximation vs. dynamic approximation. Static approximation's advantages: a user might even participate in the process of static approximation, guiding or approving the system's query modifications; and adaptive approximation techniques and continuous monitoring of system activity are not required, since the query is modified once, before it begins execution. Dynamic approximation's advantages: the level of approximation can vary with fluctuations in data rates and distributions, query workload, and resource availability; and approximation can occur at the plan-operator level, with decisions made based on the global set of (possibly shared) query plans running in the system.
69 The Stream System Interface (1/3). It is important for users, system administrators, and system developers to be able to inspect the system while it is running and to experiment with adjustments.
70 The Stream System Interface (2/3). The visualizer allows them to: (a) view the structure of query plans and their component entities (operators, queues, and synopses); (b) view the detailed properties of each entity (e.g., the amount of memory used); (c) dynamically adjust entity properties in real time; (d) view monitoring graphs that display time-varying entity properties.
71 The Stream System Interface (3/3)
72 Future Directions and Conclusion. Distributed stream processing. Crash recovery: (1) transactions: one-time queries vs. continuous queries; (2) down-time data: fixed vs. changing. Improved approximation. Relationship to publish-subscribe systems.
73 References
1. The STREAM Group. STREAM: The Stanford Stream Data Manager. IEEE Data Engineering Bulletin, March 2003 (short overview paper).
2. Arasu, A.; Babcock, B.; Babu, S.; Cieslewicz, J.; Datar, M.; Ito, K.; Motwani, R.; Srivastava, U.; Widom, J. STREAM: The Stanford Data Stream Management System. Book chapter, 2004.
3. R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In Proc. of CIDR 2003.
4. S. Babu and J. Widom. StreaMon: An Adaptive Engine for Stream Query Processing. In Proc. of the 2004 ACM SIGMOD Intl. Conf. on Management of Data, June 2004 (demonstration description).
5. B. Babcock, M. Datar, and R. Motwani. Load Shedding for Aggregation Queries over Data Streams. In Proc. of the 20th Intl. Conf. on Data Engineering, March 2004.
6. http://infolab.stanford.edu/stream/
74 Thank You!