Published by Elfrieda Philippa Richard. Modified over 9 years ago.
1
Data Streams: Lecture 10
Window Aggregates in NiagaraST
Kristin Tufte, Jin Li
Thanks to the NiagaraST Group @ PSU
2
Outline: Review of Window Aggregate Evaluation WID Window Semantics Panes Disordered Data and Out-of-order Processing
3
Window Aggregate – Buffering
Query: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid
Input tuples (aid, amt, ts), ts in hh:mm:ss:
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  t6 (a6, 46, 01:11:02)
Windows:
  01:06:00 – 01:11:00
  01:07:00 – 01:12:00
  01:08:00 – 01:13:00
Operators: window, Max
Output (aid, max): (a5, 47), (a6, 48)
4
Window Aggregate Evaluation in NiagaraST - WID
Operators: Bucket (Range 5 minutes, Slide 1 minute, Wattr ts), then Max (groups on: aid, wid)
Windows:
  01:06:00 – 01:11:00
  01:07:00 – 01:12:00
  01:08:00 – 01:13:00
Bucket input (aid, amt, ts):
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  p1 (a6, *, 01:11:00)
  t6 (a6, 46, 01:11:02)
Bucket output (aid, amt, ts, window-id):
  t1 (a5, 40, 01:06:30, 70-74)
  t2 (a6, 42, 01:07:45, 70-74)
  t3 (a5, 45, 01:08:15, 70-74)
  t4 (a5, 47, 01:10:10, 70-74)
  t5 (a6, 48, 01:10:30, 70-74)
  p1 (a6, *, *, 70)
  t6 (a6, 46, 01:11:02, 71-75)
Max state (wid, aid, max) as the tuples are processed:
  70 a5 max: 40 … 74 a5 max: 40
  70 a6 max: 42 … 74 a6 max: 42
  70 a5 max: 45 … 74 a5 max: 45
  70 a5 max: 47 … 74 a5 max: 47
  70 a6 max: 48 … 74 a6 max: 48
  71 a6 max: 48 … 74 a6 max: 48, 75 a6 max: 46
Max output (aid, window-id, max): (a6, 70, 48)
5
What's the Difference
  Window semantics: assumptions of data arrival order vs. window-id
  Data arrival (query answer and result production) vs. punctuation
  Query evaluation performance: space, time, latency
6
Bucket
Bucket maps each tuple to windows
Window specification: [Range 5 minutes Slide 1 minute Wattr ts]
Tuples (aid, amt, ts):
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  t6 (a6, 46, 01:11:02)
Windows:
  A 01:03:00 – 01:08:00
  B 01:04:00 – 01:09:00
  C 01:05:00 – 01:10:00
  D 01:06:00 – 01:11:00
  E 01:07:00 – 01:12:00
  F 01:08:00 – 01:13:00
7
Window Semantics Framework in NiagaraST
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
In this lecture, we assume time starts at 0
windows: (T, S) → W
  Defines the set of window-ids to be used, e.g., 0, 1, 2, …
extent: (T, S, w) → U ⊆ T, where w ∈ W
  Specifies which tuples belong to a given window
wids: (T, S, t) → V ⊆ W, where t ∈ T
  Determines the set of window-ids to which a tuple belongs
  Is the dual of extent
8
Extent
Window 74's content
Window specification: [Range 5 minutes Slide 1 minute Wattr ts]
Windows:
  71 01:03:00 – 01:08:00
  72 01:04:00 – 01:09:00
  73 01:05:00 – 01:10:00
  74 01:06:00 – 01:11:00
  75 01:07:00 – 01:12:00
  76 01:08:00 – 01:13:00
Tuples (aid, amt, ts):
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  t6 (a6, 46, 01:11:02)
With these labels, window 74 (01:06:00 – 01:11:00) contains t1 through t5; t6 falls outside it.
9
Wids
t3's window membership
Window specification: [Range 5 minutes Slide 1 minute Wattr ts]
Windows:
  71 01:03:00 – 01:08:00
  72 01:04:00 – 01:09:00
  73 01:05:00 – 01:10:00
  74 01:06:00 – 01:11:00
  75 01:07:00 – 01:12:00
  76 01:08:00 – 01:13:00
Tuples (aid, amt, ts):
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  t6 (a6, 46, 01:11:02)
With these labels, t3 (01:08:15) belongs to windows 72 through 76.
10
Defining Window Semantics - sliding window
windows (T, S [RANGE, SLIDE, WATTR]) = {0, 1, 2, …}
extent (w, T, S [RANGE, SLIDE, WATTR]) = { t ∈ T | (w+1) * SLIDE - RANGE ≤ t.WATTR < (w+1) * SLIDE }
wids (t, T, S [RANGE, SLIDE, WATTR]) = { w ∈ W | t.WATTR / SLIDE - 1 < w ≤ (t.WATTR + RANGE) / SLIDE - 1 }
Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid
windows (T, S [5, 1, ts]) = {0, 1, 2, …}
extent (w, T, S [5, 1, ts]) = { t ∈ T | (w+1) * 1 - 5 ≤ t.ts < (w+1) * 1 }
wids (t, T, S [5, 1, ts]) = { w ∈ W | t.ts / 1 - 1 < w ≤ (t.ts + 5) / 1 - 1 }
For t4 (a5, 47, 01:10:10), where t4.ts is 01:10:10 ≈ 70.17 minutes:
wids (t4, T, S [5, 1, ts]) = { w ∈ W | t4.ts / 1 - 1 < w ≤ (t4.ts + 5) / 1 - 1 }
  = { w ∈ W | 69.17 < w ≤ 74.17 }
  = { w ∈ W | 70 ≤ w ≤ 74 }
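The sliding-window definitions above can be sketched in Python. This is a minimal illustration, not NiagaraST code: the function names, the tuple layout, and the choice to measure time in whole seconds (so RANGE = 300 and SLIDE = 60 for Q1) are assumptions made for the example.

```python
def to_seconds(hhmmss):
    """Convert 'hh:mm:ss' to seconds since time 0."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s


def wids(wattr, range_, slide):
    """Window-ids w with wattr/slide - 1 < w <= (wattr + range_)/slide - 1."""
    lowest = wattr // slide                  # smallest integer above wattr/slide - 1
    highest = (wattr + range_) // slide - 1  # largest integer within the upper bound
    return range(lowest, highest + 1)


def extent(w, tuples, range_, slide):
    """Tuples whose timestamp lies in [(w+1)*slide - range_, (w+1)*slide)."""
    start, end = (w + 1) * slide - range_, (w + 1) * slide
    return [t for t in tuples if start <= t[2] < end]


RANGE, SLIDE = 300, 60  # Q1: Range 5 minutes, Slide 1 minute, in seconds
bids = [
    ("a5", 40, to_seconds("01:06:30")),  # t1
    ("a6", 42, to_seconds("01:07:45")),  # t2
    ("a5", 45, to_seconds("01:08:15")),  # t3
    ("a5", 47, to_seconds("01:10:10")),  # t4
    ("a6", 48, to_seconds("01:10:30")),  # t5
    ("a6", 46, to_seconds("01:11:02")),  # t6
]

t4 = bids[3]
print(list(wids(t4[2], RANGE, SLIDE)))  # [70, 71, 72, 73, 74]
```

Because wids only needs the tuple's own timestamp, it keeps no state, and extent is its dual: a tuple is in extent(w) exactly when w is in wids(t).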
11
Partitioned Window
Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid
Q1': SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts Pattr aid]
Q1 == Q1'
Q2: SELECT aid, MAX(amt) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid]
12
Defining Window Semantics - partitioned window
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
Q2: SELECT aid, MAX(amt) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid]
windows (T, S [RANGE, SLIDE, row-num, PATTR]) = { (i, p) | i ∈ {0, 1, 2, …}, p ∈ T.PATTR }
extent ((i, p), T, S [RANGE, SLIDE, row-num, PATTR]) = { t ∈ T | t.PATTR = p, (i+1) * SLIDE - RANGE ≤ rank(t, row-num, PATTR, T) < (i+1) * SLIDE }
13
Rank function
rank(t, row-num, PATTR, T): tuple t's arrival position in partition PATTR
Tuples (aid, amt, ts):
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  t6 (a6, 46, 01:11:02)
Partition a5 (ranks 0, 1, 2): t1 (a5, 40, 01:06:30), t3 (a5, 45, 01:08:15), t4 (a5, 47, 01:10:10)
Partition a6 (ranks 0, 1, 2): t2 (a6, 42, 01:07:45), t5 (a6, 48, 01:10:30), t6 (a6, 46, 01:11:02)
14
Defining Window Semantics - partitioned window (cont.)
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
Q2: SELECT aid, MAX(amt) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid]
wids (t, T, S [RANGE, SLIDE, row-num, PATTR]) = { (i, p) ∈ W | t.PATTR = p, r / SLIDE - 1 < i ≤ (r + RANGE) / SLIDE - 1 }
  where r = rank (t, row-num, PATTR, T)
15
Defining Window Semantics - slide-by-tuple window
Q3: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Rattr ts Slide 1 row Sattr row-num] GROUP BY aid
windows (T, S [RANGE, RATTR, 1, row-num]) = { w | t ∈ T, w = t.RATTR }
extent (w, T, S [RANGE, RATTR, 1, row-num]) = { t ∈ T | (w+1) - RANGE ≤ t.RATTR < (w+1) }
wids (t, T, S [RANGE, RATTR, 1, row-num]) = { w ∈ W | t.RATTR ≤ w < t.RATTR + RANGE }
16
Discussion: Landmark Windows
"Incrementally compute the max bid price of each auction; update the results every 1 minute."
Assume time starts from 0
windows (T, S [-inf, 1, ts]) = {0, 1, 2, …}
17
Implementation of wids functions - sliding window
wids (t, T, S [RANGE, SLIDE, WATTR]) = { w ∈ W | t.WATTR / SLIDE - 1 < w ≤ (t.WATTR + RANGE) / SLIDE - 1 }
Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid
wids (t, T, S [5, 1, ts]) = { w ∈ W | t.ts / 1 - 1 < w ≤ (t.ts + 5) / 1 - 1 }
Context-free: no state, no delay to determine the window-ids for each tuple
18
Implementation of wids functions - partitioned window
Q2: SELECT aid, MAX(amt) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid]
wids (t, T, S [1000, 100, row-num, aid]) = { (i, p) ∈ W | t.aid = p, r / 100 - 1 < i ≤ (r + 1000) / 100 - 1 }
  where r = rank (t, row-num, PATTR, T)
Backward context: need to maintain the number of arrived tuples for each partition
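A minimal Python sketch of the backward context for a partitioned window such as Q2. It is an illustration under stated assumptions, not NiagaraST code: the class name PartitionedBucket is hypothetical, rank is taken to be the 0-based arrival position within the partition, and the bounds on i follow from the partitioned extent definition ((i+1) * SLIDE - RANGE ≤ rank < (i+1) * SLIDE).

```python
from collections import defaultdict


class PartitionedBucket:
    """Assigns (i, p) window-ids using a per-partition arrival counter."""

    def __init__(self, range_rows, slide_rows):
        self.range_rows = range_rows
        self.slide_rows = slide_rows
        self.arrived = defaultdict(int)  # backward context: tuples seen per partition

    def wids(self, partition):
        r = self.arrived[partition]      # rank of this tuple within its partition
        self.arrived[partition] += 1
        lowest = r // self.slide_rows                           # r/SLIDE - 1 < i
        highest = (r + self.range_rows) // self.slide_rows - 1  # i <= (r+RANGE)/SLIDE - 1
        return [(i, partition) for i in range(lowest, highest + 1)]


bucket = PartitionedBucket(range_rows=1000, slide_rows=100)
first = bucket.wids("a5")          # rank 0 in partition a5
print(first[0], first[-1])         # (0, 'a5') (9, 'a5')
```

The only state kept is one counter per partition value, which is what the slide calls backward context.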
19
Implementation of wids functions - slide-by-tuple window
Q3: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Rattr ts Slide 1 row Sattr row-num] GROUP BY aid
windows (T, S [5, ts, 1, row-num]) = { w | t ∈ T, w = t.ts }
wids (t, T, S [5, ts, 1, row-num]) = { w ∈ W | t.ts ≤ w < t.ts + 5 }
Forward context
20
Backward Context and Forward Context
  Backward context: state to be kept
  Forward context: the mapping from tuples to window-ids has to be delayed
21
WID vs. Buffering – execution time comparison (overview)
22
WID vs. Buffering – execution time comparison (zoom-in)
23
Outline: Review of Window Aggregate Evaluation WID Window Semantics Panes Disordered Data and Out-of-order Processing
24
Sharing Panes
Q4: SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid
[Figure: windows W1 … W5 drawn over panes P1 … P8]
Pane size = GCD (SLIDE, RANGE)
Li, J., Maier, D., Tufte, K., Papadimos, V., Tucker, P. No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams. http://www.cs.pdx.edu/datalab/niagara/nopanenogain.pdf
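The pane-size rule is easy to check numerically. The helper below is a hypothetical illustration (pane_layout is not a NiagaraST function); it returns the pane size together with the number of panes per window and per slide, with all values in the same unit.

```python
from math import gcd


def pane_layout(range_, slide):
    """Return (pane size, panes per window, panes per slide)."""
    pane = gcd(slide, range_)
    return pane, range_ // pane, slide // pane


print(pane_layout(4, 1))  # Q4, in minutes: (1, 4, 1)
print(pane_layout(6, 4))  # (2, 3, 2)
```

For Q4 each window is made of four one-minute panes, and each pane is shared by up to four consecutive windows.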
25
Query Transformation
Q4: SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid
Q5: SELECT aid, sum(cnt) FROM Q6 [Range 4 Slide 1 Wattr pid] GROUP BY aid
Q6: SELECT aid, count(*) as cnt, window-id as pid FROM Bids [Range 1 minute Slide 1 minute Wattr ts] GROUP BY aid
26
Pane Implementation
Operator plan, from input to output:
  streamscan, output (aid, amt, ts):
    t1 (a5, 47, 01:10:10)
    t2 (a6, 48, 01:10:30)
    p1 (a5, *, 01:11:00)
  bucket B1 as pane-id, RANGE 1 min SLIDE 1 min WATTR ts, output (aid, amt, pane-id):
    t1 (a5, 47, 70-70)
    t2 (a6, 48, 70-70)
    p1 (a5, *, 70)
  count(*) (group on pane-id, aid), Q6, output (aid, pane-id, count):
    m0 (a5, 70, 8)
  bucket B2 as window-id, RANGE 4 SLIDE 1 WATTR pane-id, output (aid, pane-id, count, window-id):
    m0 (a5, 70, 8, 70-74)
  sum(*) (group on window-id, aid), Q5, output (aid, sum, window-id):
    s0 (a5, 8, 70)
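The two-level plan can be compared against direct window evaluation with a short Python sketch. This is an illustration only: time is measured in seconds, the sample bids are made up for the test, and window-ids follow the sliding-window wids formula from earlier in the lecture rather than the numbers drawn in the figure.

```python
from collections import Counter


def wids(value, range_, slide):
    """Window-ids w with value/slide - 1 < w <= (value + range_)/slide - 1."""
    return range(value // slide, (value + range_) // slide)


def direct_counts(bids):
    """Q4 evaluated directly: Range 4 minutes, Slide 1 minute on ts."""
    out = Counter()
    for aid, ts in bids:
        for w in wids(ts, 240, 60):
            out[(aid, w)] += 1
    return out


def pane_counts(bids):
    """Q6 (tumbling count per pane) followed by Q5 (sum over 4 panes)."""
    panes = Counter()
    for aid, ts in bids:
        for pid in wids(ts, 60, 60):     # Range 1 minute, Slide 1 minute: one pane
            panes[(aid, pid)] += 1
    out = Counter()
    for (aid, pid), cnt in panes.items():
        for w in wids(pid, 4, 1):        # Range 4, Slide 1 on pane-id
            out[(aid, w)] += cnt
    return out


# Hypothetical sample input: (aid, ts in seconds)
bids = [("a5", 3990), ("a6", 4065), ("a5", 4095), ("a5", 4210), ("a6", 4230), ("a6", 4262)]
print(direct_counts(bids) == pane_counts(bids))  # True
```

In the direct plan every tuple updates four window entries; in the pane plan it updates one pane entry, and only the per-pane partial counts are fanned out to windows.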
27
Aggregate Functions
Distributive, Algebraic, Holistic
  Distributive: f(X1), f(X2) → f(X1 ∪ X2)
  Algebraic: g(X) → f(X), where g is a "synopsis function"
    g(X) can be stored in constant memory
    g(X1), g(X2) → g(X1 ∪ X2)
  Holistic: otherwise
Differential
  g(Y), g(X) → f(Y - X)
  g(Y - X), g(X) → f(Y)
  g(X) can be stored in constant memory
  distributive or algebraic
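A small Python sketch of these categories, using standard examples that are not taken from the slides: MAX as a distributive function, and AVG as an algebraic one whose synopsis g is the pair (sum, count). The same synopsis also supports subtraction, which is what makes it usable differentially. All function names are hypothetical.

```python
def merge_max(m1, m2):
    """Distributive: max(X1 ∪ X2) from max(X1) and max(X2)."""
    return max(m1, m2)


def synopsis(xs):
    """g(X) for AVG: a constant-size (sum, count) pair."""
    return (sum(xs), len(xs))


def merge(g1, g2):
    """g(X1 ∪ X2) from g(X1) and g(X2)."""
    return (g1[0] + g2[0], g1[1] + g2[1])


def subtract(gy, gx):
    """Differential use: g(Y - X) from g(Y) and g(X), for X contained in Y."""
    return (gy[0] - gx[0], gy[1] - gx[1])


def average(g):
    """f(X) computed from the synopsis g(X)."""
    return g[0] / g[1]


x1, x2 = [40, 45], [47]
print(merge_max(max(x1), max(x2)))               # 47
print(average(merge(synopsis(x1), synopsis(x2))))  # 44.0
print(subtract(synopsis(x1 + x2), synopsis(x1)))   # (47, 1)
```

MAX has no such subtraction: knowing max(Y) and max(X) is not enough to recover max(Y - X).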
28
Pane Implementation - Another
SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid
  Tumbling Count: Range 1 minute Slide 1 minute Wattr ts
  Window operator: Range 4 rows Slide 1 row Wattr row-num
  Window Sum
29
When are panes better than windows?
SELECT aid, max(amt) FROM Bids [Range X rows Slide Y rows Wattr row-num] GROUP BY aid
1. Panes are better when the cost ratio is less than 1
2. The number of tuples per pane affects whether using panes is better
30
Outline: Review of Window Aggregate Evaluation WID Window Semantics Panes Handling Disorder
31
Sources of Disorder
  Merging different data sources
  Varying network transmission delays
  Data prioritization
  Query processing algorithms, e.g., shared window joins [Hammad, et al.]
  Multiple possible windowing attributes, e.g., two timestamps
32
Example of Disorder: Band Disorder
Network flow: source IP + port, destination IP + port
[Figure: timestamp of 8th packet of network flow vs. flow start time]
Passive Measurement and Analysis project, San Diego Supercomputer Center, http://pma.nlanr.net/PMA
33
Example of Disorder: Block-sorted Disorder
Scatter plot of netflow records emitted by a router in the Abilene Network (http://abilene.internet2.edu/observatory)
X-axis is position of flow record in stream, y-axis is network flow start time (s)
34
Handling Disorder
Sort (in-order processing)
  Slack – BSort in Aurora
  Heartbeat in STREAM
  Sort-based Merge in Gigascope
  Output buffering and sorting in a shared-window join
Space and time cost
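To make the space and time cost of sorting concrete, here is a minimal sketch of slack-bounded reordering in Python. It illustrates the general idea of a slack buffer and is not Aurora's BSort implementation: the function name, the use of bare timestamps as tuples, and the policy of dropping tuples that arrive too late are all assumptions.

```python
import heapq


def slack_sort(stream, slack):
    """Reorder timestamps using a buffer that holds at most `slack` tuples.

    Returns (emitted, dropped): emitted is in nondecreasing order; dropped
    holds tuples that arrived after a later timestamp was already emitted.
    """
    buffer, emitted, dropped = [], [], []
    last_emitted = None
    for ts in stream:
        if last_emitted is not None and ts < last_emitted:
            dropped.append(ts)           # too late to be placed in order
            continue
        heapq.heappush(buffer, ts)
        if len(buffer) > slack:          # buffer full: release the smallest
            last_emitted = heapq.heappop(buffer)
            emitted.append(last_emitted)
    while buffer:                        # end of stream: flush in order
        emitted.append(heapq.heappop(buffer))
    return emitted, dropped


print(slack_sort([1, 3, 2, 5, 4, 7, 6], slack=2))  # ([1, 2, 3, 4, 5, 6, 7], [])
print(slack_sort([5, 6, 7, 1], slack=1))           # ([5, 6, 7], [1])
```

A larger slack tolerates more disorder but holds more tuples in memory and delays every output, which is the trade-off the slide points to.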
35
Out-of-Order Processing
Out-of-order processing in Join
  M. A. Hammad, W. G. Aref, and A. K. Elmagarmid. Optimizing In-Order Execution of Continuous Queries over Streamed Sensor Data. SSDBM 2005
NiagaraST: Punctuation + Window-Id
Analogous to a CPU's out-of-order processing of instructions
36
Disorder Handling - WID
Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid
Operators: bucket, then Max (group on: wid, aid)
Bucket input (aid, amt, ts):
  p1 (a6, *, 01:11:00)
  t6 (a6, 46, 01:11:02)
  t7 (a5, 52, 01:10:15)
Bucket output (aid, amt, ts, window-id):
  p1 (a6, *, *, 70)
  t6 (a6, 46, 01:11:02, 71-75)
  t7 (a5, 52, 01:10:15, 70-74)
Max state (wid, aid, max):
  70 a5 max: 47 … 74 a5 max: 47
  70 a6 max: 48 … 74 a6 max: 48
  71 a6 max: 48 … 74 a6 max: 48, 75 a6 max: 46
Max state after t7:
  70 a5 max: 52 … 74 a5 max: 52
  70 a6 max: 48 … 74 a6 max: 48
Max output (aid, window-id, max): (a6, 70, 48)
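The interplay of window-ids and punctuation can be sketched in Python as follows. This is an illustration under stated assumptions, not NiagaraST code: the class WidMax is hypothetical, time is in seconds, a punctuation (aid, ts) is read as "no further tuples for aid with timestamp before ts", and window-ids come from the sliding-window wids formula, so windows earlier than 70 also appear in the output.

```python
RANGE, SLIDE = 300, 60  # Q1: Range 5 minutes, Slide 1 minute, in seconds


def secs(hhmmss):
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s


def wids(ts):
    """Window-ids w with ts/SLIDE - 1 < w <= (ts + RANGE)/SLIDE - 1."""
    return range(ts // SLIDE, (ts + RANGE) // SLIDE)


class WidMax:
    """MAX grouped on (wid, aid); tuples may arrive in any order."""

    def __init__(self):
        self.state = {}  # (wid, aid) -> running max

    def on_tuple(self, aid, amt, ts):
        for w in wids(ts):
            key = (w, aid)
            self.state[key] = max(amt, self.state.get(key, amt))

    def on_punctuation(self, aid, ts):
        """Emit and purge every window of `aid` that ends at or before ts."""
        last_closed = ts // SLIDE - 1    # (w + 1) * SLIDE <= ts
        closed = sorted(w for (w, a) in self.state if a == aid and w <= last_closed)
        return [(aid, w, self.state.pop((w, aid))) for w in closed]


op = WidMax()
for aid, amt, ts in [("a5", 40, "01:06:30"), ("a6", 42, "01:07:45"),
                     ("a5", 45, "01:08:15"), ("a5", 47, "01:10:10"),
                     ("a6", 48, "01:10:30")]:
    op.on_tuple(aid, amt, secs(ts))
out = op.on_punctuation("a6", secs("01:11:00"))  # p1
op.on_tuple("a6", 46, secs("01:11:02"))          # t6
op.on_tuple("a5", 52, secs("01:10:15"))          # t7, out of order
print(out[-1])                                   # ('a6', 70, 48)
print(op.state[(70, "a5")])                      # 52
```

No sorting or buffering of input tuples is needed: the late tuple t7 simply updates the entries for its own window-ids, and results are released only when a punctuation says a window is complete.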
37
Sources of Punctuation
External punctuation
  Data sources, e.g., Gigascope
Internal punctuation
  Mechanisms that can be used to generate punctuations
    Slack
    Heartbeat
38
Latency vs. Accuracy: Band Disorder
  Compare external punctuation and two flavors of slack
  As slack increases, error decreases and latency increases
  External punctuation has better latency and accuracy than slack
39
Latency vs. Accuracy: Block-Sorted Disorder
[Figure: latency vs. accuracy (percentage of incorrect answers) under block-sorted disorder]
SELECT count(*) FROM Bids [Range 10 minutes Slide 1 minute Wattr ts]