Data Streams: Lecture 10
Window Aggregates in NiagaraST
Kristin Tufte, Jin Li
Thanks to the NiagaraST PSU

Outline
  Review of Window Aggregate Evaluation
  WID
  Window Semantics
  Panes
  Disordered Data and Out-of-order Processing

Window Aggregate – Buffering
SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid
Input (aid, amt, ts (hh:mm:ss)):
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  t6 (a6, 46, 01:11:02)
Windows:
  01:06:00 – 01:11:00
  01:07:00 – 01:12:00
  01:08:00 – 01:13:00
Output (aid, max): (a5, 47), (a6, 48)
[Figure: tuples t1 to t6 are buffered in the window operator and passed to Max]

Window Aggregate Evaluation in NiagaraST - WID
Windows:
  01:06:00 – 01:11:00
  01:07:00 – 01:12:00
  01:08:00 – 01:13:00
Input (aid, amt, ts):
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  p1 (a6, *, 01:11:00)
  t6 (a6, 46, 01:11:02)
Bucket: Range 5 minutes Slide 1 minute Wattr ts
  Output (aid, amt, ts, window-id); p1 becomes (a6, *, *, 70)
Max groups on: aid, wid
  State (wid, aid, max):
    after t1: 70 … 74, a5, max: 40
    after t2: 70 … 74, a6, max: 42
    after t3: 70 … 74, a5, max: 45
    after t4: 70 … 74, a5, max: 47
    after t5: 70 … 74, a6, max: 48
    after t6: 70 … 74, a6, max: 48, plus an a6 entry with max: 46
Output (aid, window-id, max): (a6, 70, 48)

What’s the Difference
Window Semantics
  Assumptions of data arrival order vs. Window-id
  Data arrival (query answer and result production) vs. Punctuation
Query evaluation performance
  Space
  Time
  Latency

Bucket
Bucket maps each tuple to windows
[Range 5 minutes Slide 1 minute Wattr ts]
Tuples (aid, amt, ts):
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  t6 (a6, 46, 01:11:02)
Windows:
  A 01:03:00 – 01:08:00
  B 01:04:00 – 01:09:00
  C 01:05:00 – 01:10:00
  D 01:06:00 – 01:11:00
  E 01:07:00 – 01:12:00
  F 01:08:00 – 01:13:00

Window Semantics Framework in NiagaraST
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
In this lecture, we assume time starts at 0
windows: (T, S) → W
  Defines the set of window-ids to be used, e.g., 0, 1, 2, …
extent: (T, S, w) → U ⊆ T, where w ∈ W
  Specifies which tuples belong to a given window
wids: (T, S, t) → V ⊆ W, where t ∈ T
  Determines the set of window-ids to which a tuple belongs
  Is the dual of extent

Extent
Window 74’s content
[Range 5 minutes Slide 1 minute Wattr ts]
Tuples (aid, amt, ts):
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  t6 (a6, 46, 01:11:02)
[Figure: the extent of window 74, shown against the list of windows from 01:03:00 – 01:08:00 through 01:08:00 – 01:13:00]

Wids
t3’s window membership
[Range 5 minutes Slide 1 minute Wattr ts]
Tuples (aid, amt, ts):
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  t6 (a6, 46, 01:11:02)
[Figure: the windows that contain t3, shown against the list of windows from 01:03:00 – 01:08:00 through 01:08:00 – 01:13:00]

Defining Window Semantics - sliding window
windows (T, S[RANGE, SLIDE, WATTR]) = {0, 1, 2, …}
extent (w, T, S[RANGE, SLIDE, WATTR]) = { t ∈ T | (w+1) * SLIDE - RANGE ≤ t.WATTR < (w+1) * SLIDE }
wids (t, T, S[RANGE, SLIDE, WATTR]) = { w ∈ W | t.WATTR / SLIDE - 1 < w ≤ (t.WATTR + RANGE) / SLIDE - 1 }
Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid
windows (T, S[5, 1, ts]) = {0, 1, 2, …}
extent (w, T, S[5, 1, ts]) = { t ∈ T | (w+1) * 1 - 5 ≤ t.ts < (w+1) * 1 }
wids (t, T, S[5, 1, ts]) = { w ∈ W | t.ts / 1 - 1 < w ≤ (t.ts + 5) / 1 - 1 }
For t4 (a5, 47, 01:10:10), where t4.ts is 01:10:10 ≈ minute 70.17:
wids (t4, T, S[5, 1, ts]) = { w ∈ W | t4.ts / 1 - 1 < w ≤ (t4.ts + 5) / 1 - 1 }
  = { w ∈ W | 69.17 < w ≤ 74.17 }
  = { w ∈ W | 70 ≤ w ≤ 74 }

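The sliding-window definitions can be checked with a few lines of Python. This is a minimal sketch, not NiagaraST code: the function names, the use of integer seconds, and the `hms_to_seconds` helper are assumptions made for illustration. It computes wids for t4 and tests extent membership with the same inequalities.

```python
def hms_to_seconds(text: str) -> int:
    """Convert an 'hh:mm:ss' timestamp to seconds since time 0."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def wids(ts: int, range_: int, slide: int) -> range:
    """Window-ids w with ts/slide - 1 < w <= (ts + range_)/slide - 1."""
    first = ts // slide                  # smallest integer above ts/slide - 1
    last = (ts + range_) // slide - 1    # largest integer within the upper bound
    return range(first, last + 1)


def in_extent(w: int, ts: int, range_: int, slide: int) -> bool:
    """True if (w+1)*slide - range_ <= ts < (w+1)*slide."""
    return (w + 1) * slide - range_ <= ts < (w + 1) * slide


RANGE, SLIDE = 5 * 60, 60                # Q1, expressed in seconds
t4_ts = hms_to_seconds("01:10:10")
print(list(wids(t4_ts, RANGE, SLIDE)))   # [70, 71, 72, 73, 74]
```

Because wids is the dual of extent, `in_extent(w, ts, ...)` is true exactly for the ids that `wids(ts, ...)` returns.
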
Partitioned Window
Q2: SELECT aid, MAX(amt) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid]
Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid
Q1': SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts Pattr aid]
Q1 == Q1'

Defining Window Semantics - partitioned window
Q2: SELECT aid, MAX(amt) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid]
windows (T, S[RANGE, SLIDE, row-num, PATTR]) = { (i, p) | i ∈ {0, 1, 2, …}, p ∈ T.PATTR }
extent ((i, p), T, S[RANGE, SLIDE, row-num, PATTR]) = { t ∈ T | t.PATTR = p, (i+1) * SLIDE - RANGE ≤ rank(t, row-num, PATTR, T) < (i+1) * SLIDE }
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids

Rank function
rank(t, row-num, PATTR, T): tuple t’s arrival position within its PATTR partition
Input stream (aid, amt, ts):
  t1 (a5, 40, 01:06:30)
  t2 (a6, 42, 01:07:45)
  t3 (a5, 45, 01:08:15)
  t4 (a5, 47, 01:10:10)
  t5 (a6, 48, 01:10:30)
  t6 (a6, 46, 01:11:02)
Partition a5: t1, t3, t4
Partition a6: t2, t5, t6

Defining Window Semantics - partitioned window (cont.)
Q2: SELECT aid, MAX(amt) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid]
wids (t, T, S[RANGE, SLIDE, row-num, PATTR]) = { (i, p) ∈ W | t.PATTR = p, r / SLIDE - 1 < i ≤ (r + RANGE) / SLIDE - 1 }
  where r = rank (t, row-num, PATTR, T)
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids

Defining Window Semantics - slide-by-tuple window
Q3: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Rattr ts Slide 1 row Sattr row-num] GROUP BY aid
windows (T, S[RANGE, RATTR, 1, row-num]) = { w | ∃ t ∈ T, w = t.RATTR }
extent (w, T, S[RANGE, RATTR, 1, row-num]) = { t ∈ T | (w+1) - RANGE ≤ t.RATTR < (w+1) }
wids (t, T, S[RANGE, RATTR, 1, row-num]) = { w ∈ W | t.RATTR ≤ w < t.RATTR + RANGE }

Discussion: Landmark Windows
“Incrementally compute the max bid price of each auction; update the results every 1 minute.”
Assume time starts from 0
windows (T, S[-inf, 1, ts]) = {0, 1, 2, …}

Implementation of wids functions - sliding window
Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid
wids (t, T, S[RANGE, SLIDE, WATTR]) = { w ∈ W | t.WATTR / SLIDE - 1 < w ≤ (t.WATTR + RANGE) / SLIDE - 1 }
wids (t, T, S[5, 1, ts]) = { w ∈ W | t.ts / 1 - 1 < w ≤ (t.ts + 5) / 1 - 1 }
No state, no delay to determine the window-ids for each tuple
Context-free

Implementation of wids functions - partitioned window
Q2: SELECT aid, MAX(amt) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid]
wids (t, T, S[1000, 100, row-num, aid]) = { (i, p) ∈ W | t.aid = p, r / 100 - 1 < i ≤ (r + 1000) / 100 - 1 }
  where r = rank (t, row-num, aid, T)
Need to maintain the number of arrived tuples for each partition
Backward context

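A small sketch of the backward context a partitioned window needs: one arrival counter per partition, from which rank and then the window-ids follow. The class name `PartitionedWids` is hypothetical, and rank is assumed to be 0-based (consistent with time starting at 0); the lecture does not state which base it uses.

```python
from collections import defaultdict


class PartitionedWids:
    """Maps each arriving tuple to (i, p) window-ids for a row-based partitioned window."""

    def __init__(self, range_rows: int, slide_rows: int):
        self.range_rows = range_rows
        self.slide_rows = slide_rows
        # Backward context: number of tuples seen so far in each partition.
        self.arrived = defaultdict(int)

    def assign(self, partition):
        r = self.arrived[partition]          # rank, assumed to start at 0
        self.arrived[partition] += 1
        first = r // self.slide_rows         # r/SLIDE - 1 < i
        last = (r + self.range_rows) // self.slide_rows - 1   # i <= (r+RANGE)/SLIDE - 1
        return [(i, partition) for i in range(first, last + 1)]


bucket = PartitionedWids(range_rows=4, slide_rows=2)   # small numbers for readability
print(bucket.assign("a5"))   # [(0, 'a5'), (1, 'a5')]
print(bucket.assign("a6"))   # [(0, 'a6'), (1, 'a6')]
```

With Q2's parameters (Range 1000 rows, Slide 100 rows), the same formula places every tuple in 10 windows of its own partition.
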
Implementation of wids functions - slide-by-tuple window
Q3: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Rattr ts Slide 1 row Sattr row-num] GROUP BY aid
windows (T, S[5, ts, 1, row-num]) = { w | ∃ t ∈ T, w = t.ts }
wids (t, T, S[5, ts, 1, row-num]) = { w ∈ W | t.ts ≤ w < t.ts + 5 }
Forward context

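Forward context means a tuple's window-ids depend on tuples that have not arrived yet. The sketch below makes that delay visible. It is illustrative only: it assumes an in-order stream of integer timestamps, and the function name is hypothetical. A tuple's id list is released only when an arrival at or beyond t.ts + RANGE shows that no further window can contain it.

```python
from collections import deque
from typing import Iterable, Iterator


def slide_by_tuple_wids(timestamps: Iterable[int], range_: int) -> Iterator[tuple[int, list[int]]]:
    """Yield (ts, window-ids) for an in-order stream, once the ids are final."""
    pending = deque()                    # (ts, ids so far) for tuples still open
    for w in timestamps:                 # each arrival defines window-id w
        # A pending tuple is final once w has reached ts + range_.
        while pending and pending[0][0] + range_ <= w:
            yield pending.popleft()
        for _, ids in pending:           # here ts <= w < ts + range_ holds
            if ids[-1] != w:             # equal timestamps share one window-id
                ids.append(w)
        pending.append((w, [w]))
    yield from pending                   # end of stream: flush what is left


for ts, ids in slide_by_tuple_wids([0, 2, 3, 7, 8], range_=5):
    print(ts, ids)
# 0 [0, 2, 3]
# 2 [2, 3]
# 3 [3, 7]
# 7 [7, 8]
# 8 [8]
```
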
Backward Context and Forward Context
Backward context: state has to be kept
Forward context: the mapping from tuples to window-ids has to be delayed

WID vs. Buffering – execution time comparison (overview)

WID vs. Buffering – execution time comparison (zoom-in)

Outline
  Review of Window Aggregate Evaluation
  WID
  Window Semantics
  Panes
  Disordered Data and Out-of-order Processing

Sharing Panes
Q4: SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid
Pane size = GCD (SLIDE, RANGE)
[Figure: windows W1 … W5 laid over panes P1 … P8]
Li, J., Maier, D., Tufte, K., Papadimos, V., Tucker, P. No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams.

Query Transformation
Q4: SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid
Q5: SELECT aid, sum(cnt) FROM Q6 [Range 4 Slide 1 Wattr pid] GROUP BY aid
Q6: SELECT aid, count(*) as cnt, window-id as pid FROM Bids [Range 1 minute Slide 1 minute Wattr ts] GROUP BY aid

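The rewrite of Q4 into Q5 over Q6 can be sketched as a two-level count: first per pane, then summed per window. This is an illustrative sketch rather than the NiagaraST operators; `paned_count` and the integer-second timestamps are assumptions. It also uses the GCD rule for pane size, so it covers cases where the range is not a multiple of the slide.

```python
import math
from collections import Counter


def paned_count(bids, range_, slide):
    """COUNT(*) per (window-id, aid), computed through panes."""
    pane = math.gcd(range_, slide)            # pane size = GCD(SLIDE, RANGE)
    r, s = range_ // pane, slide // pane      # window range and slide, in panes
    # Inner query (like Q6): tumbling count per (pane-id, aid).
    per_pane = Counter((ts // pane, aid) for aid, ts in bids)
    # Outer query (like Q5): add each pane count to every window that holds the pane.
    per_window = Counter()
    for (pid, aid), cnt in per_pane.items():
        for w in range(pid // s, (pid + r) // s):
            per_window[(w, aid)] += cnt
    return per_window


bids = [("a5", 10), ("a5", 70), ("a6", 75), ("a5", 200)]   # (aid, ts in seconds)
result = paned_count(bids, range_=240, slide=60)
print(result[(3, "a5")])   # 3: window 3 covers seconds 0 to 239
```

Each tuple is touched once in the inner step; only the much smaller set of pane counts is fanned out to the overlapping windows.
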
Pane Implementation
streamscan
  (aid, amt, ts): t1 (a5, 47, 01:10:10), t2 (a6, 48, 01:10:30), p1 (a5, *, 01:11:00)
Q6:
  bucket B1 as pane-id: RANGE 1 min SLIDE 1 min WATTR ts
    (aid, amt, pane-id): t1 (a5, 47, ), t2 (a6, 48, ), p1 (a5, *, 70)
  count(*) (group on pane-id, aid)
    (aid, pane-id, count): m0 (a5, 70, 8)
Q5:
  bucket B2 as window-id: RANGE 4 SLIDE 1 WATTR pane-id
    (aid, pane-id, count, window-id): m0 (a5, 70, 8, )
  sum(*) (group on window-id, aid)
    (aid, sum, window-id): s0 (a5, 8, 70)

Aggregate Functions
Distributive, Algebraic, Holistic
Distributive: f(X1), f(X2) → f(X1 ∪ X2)
Algebraic: g(X) → f(X), g is a “synopsis function”
  g(X) can be stored in constant memory
  g(X1), g(X2) → g(X1 ∪ X2)
Holistic: otherwise
Differential
  g(Y), g(X) → f(Y - X)
  g(Y - X), g(X) → f(Y)
  g(X) can be stored in constant memory
  distributive or algebraic

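Average is a standard example of an algebraic and differential aggregate: its synopsis g(X) is the constant-size pair (sum, count). The sketch below is an illustration with hypothetical names (`AvgSynopsis`, `merge`, `subtract`), not code from the lecture. Max, by contrast, is distributive but has no such subtract step.

```python
from typing import NamedTuple


class AvgSynopsis(NamedTuple):
    """g(X) for f = average: a constant-size (sum, count) pair."""
    total: float
    count: int


def g(values) -> AvgSynopsis:
    values = list(values)
    return AvgSynopsis(sum(values), len(values))


def merge(a: AvgSynopsis, b: AvgSynopsis) -> AvgSynopsis:
    """g(X1), g(X2) -> g(X1 U X2) for disjoint bags X1, X2."""
    return AvgSynopsis(a.total + b.total, a.count + b.count)


def subtract(y: AvgSynopsis, x: AvgSynopsis) -> AvgSynopsis:
    """g(Y), g(X) -> g(Y - X), where X is contained in Y."""
    return AvgSynopsis(y.total - x.total, y.count - x.count)


def f(s: AvgSynopsis) -> float:
    """g(X) -> f(X): the final average."""
    return s.total / s.count


print(f(merge(g([40, 45]), g([47]))))          # 44.0
print(f(subtract(g([40, 45, 47]), g([40]))))   # 46.0
```
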
Pane Implementation - Another
SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid
Tumbling Count: Range 1 minute Slide 1 minute Wattr ts
Window operator: Range 4 rows Slide 1 row Wattr row-num
Window Sum

When are panes better than windows?
SELECT aid, max(amt) FROM Bids [Range X rows Slide Y rows Wattr row-num] GROUP BY aid
1. Panes are better when the cost ratio is less than 1
2. The number of tuples per pane affects whether using panes is better

Outline
  Review of Window Aggregate Evaluation
  WID
  Window Semantics
  Panes
  Handling Disorder

Sources of Disorder
  Merging different data sources
  Varying network transmission delays
  Data prioritization
  Query processing algorithms, e.g., shared window joins [Hammad, et al.]
  Multiple possible windowing attributes, e.g., two timestamps

Example of Disorder: Band Disorder
Network flow: source IP + port, destination IP + port
[Figure: timestamp of 8th packet of network flow vs. flow start time]
Passive Measurement and Analysis project, San Diego Supercomputer Center

Example of Disorder: Block-sorted Disorder
[Figure: scatter plot of netflow records emitted by a router in the Abilene Network; x-axis is position of flow record in stream, y-axis is network flow start time (s)]

Handling Disorder
Sort (In-order processing)
  Slack – BSort in Aurora
  Heartbeat in STREAM
  Sort-based Merge in Gigascope
  Output buffering and sorting in a shared-window join
Space and time cost

Out-of-Order Processing
Out-of-order processing in Join
  M. A. Hammad, W. G. Aref, and A. K. Elmagarmid. Optimizing In-Order Execution of Continuous Queries over Streamed Sensor Data. SSDBM 2005
NiagaraST: Punctuation + Window-Id
Analogous to a CPU’s out-of-order processing of instructions

Disorder Handling - WID
Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid
Input (aid, amt, ts):
  t6 (a6, 46, 01:11:02)
  t7 (a5, 52, 01:10:15)
  p1 (a6, *, 01:11:00)
bucket
  Output (aid, amt, ts, window-id):
    t6 (a6, 46, 01:11:02, 71-75)
    t7 (a5, 52, 01:10:15, 70-74)
    p1 (a6, *, *, 70)
Max, group on: wid, aid
  State (wid, aid, max):
    before t7: 70 … 74, a5, max: 47; 70 … 74, a6, max: 48
    after t7: 70 … 74, a5, max: 52; 70 … 74, a6, max: 48
Output (aid, window-id, max): (a6, 70, 48)

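The example on this slide can be replayed with a small operator that keeps one max per (wid, aid) and emits a result only when a punctuation closes that group. This is a sketch under assumptions, not the NiagaraST Max operator: the class `WidMax` and its method names are hypothetical, and a punctuation is simplified to an (aid, window-id) pair.

```python
class WidMax:
    """Max grouped on (wid, aid); arrival order of tuples does not matter."""

    def __init__(self):
        self.state = {}                       # (wid, aid) -> current max

    def on_tuple(self, aid, amt, wids):
        for wid in wids:
            key = (wid, aid)
            if key not in self.state or amt > self.state[key]:
                self.state[key] = amt

    def on_punctuation(self, aid, wid):
        """No more tuples for (aid, wid): emit the result and purge the entry."""
        if (wid, aid) not in self.state:
            return None
        return (aid, wid, self.state.pop((wid, aid)))


op = WidMax()
op.on_tuple("a5", 47, range(70, 75))          # t4
op.on_tuple("a6", 48, range(70, 75))          # t5
op.on_tuple("a6", 46, range(71, 76))          # t6, window-ids 71-75
op.on_tuple("a5", 52, range(70, 75))          # t7 arrives late, window-ids 70-74
emitted = op.on_punctuation("a6", 70)         # p1
print(emitted)                                # ('a6', 70, 48)
```

The late tuple t7 simply updates its groups; no sorting or buffering of input tuples is needed, and only the punctuation triggers output.
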
Sources of Punctuation
External Punctuation
  Data sources, e.g., Gigascope
Internal Punctuation
  Mechanisms that can be used to generate punctuations
    Slack
    Heartbeat

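One way to picture slack as an internal punctuation source is a bound that trails the highest timestamp seen so far by a fixed amount. The sketch below shows that reading only; it is an assumption for illustration (class and method names are hypothetical) and may differ from Aurora's BSort and from the slack flavors measured in this lecture.

```python
class SlackPunctuator:
    """Generates bounds meaning 'no future tuple has ts < bound'."""

    def __init__(self, slack: int):
        self.slack = slack
        self.bound = 0                        # time is assumed to start at 0

    def observe(self, ts: int):
        """Return (late, new_bound); new_bound is None if the bound did not move."""
        late = ts < self.bound                # tuple violates an earlier punctuation
        candidate = ts - self.slack
        if candidate > self.bound:
            self.bound = candidate
            return late, candidate
        return late, None


p = SlackPunctuator(slack=30)
print(p.observe(100))   # (False, 70)
print(p.observe(90))    # (False, None)
print(p.observe(150))   # (False, 120)
print(p.observe(110))   # (True, None)
```

A larger slack makes fewer tuples late but holds each bound back longer, which matches the trade-off between error and latency.
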
Latency vs. Accuracy: Band Disorder
Compare external punctuation and two flavors of slack
As slack increases, error decreases and latency increases
External punctuation has better latency and accuracy than slack

Latency vs. Accuracy: Block-Sorted Disorder
[Figure: latency vs. accuracy (percentage of incorrect answers) under block-sorted disorder]
SELECT count(*) FROM Bids [Range 10 minutes Slide 1 minute Wattr ts]