1 Out of Order Processing for Stream Query Evaluation. Jin Li (Portland State University). Joint work with Theodore Johnson, Vladislav Shkapenyuk, David Maier, and Kristin Tufte
2 Stream (monitoring) applications: network packets, transportation data, sensor data, stock quotes, … They process data online; often require (near) real-time response; often involve data from multiple sources; have no control over the physical properties of the data. Challenges for database-style support of stream applications: continuous query vs. one-time query, e.g., “count the number of packets in the past minute”; adapting traditional query operators to data streams, which requires a time attribute. [Diagram: source A, source B, and source C feed the Stream Query System; tuple schema (srcIP, dstIP, len, ts)]
3 Stream Query – Windowed Aggregation. Query 1: SELECT tb, srcIP, destIP, count(*) FROM A union B union C GROUP BY ts/60 as tb, srcIP, destIP. “Count the number of packets in each minute; update the result every minute.” [Diagram: sources A, B, C feed two union operators, then count(*) group by tb, srcIP, destIP, with windows on ts]
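To make the semantics of Query 1 concrete, here is a minimal batch sketch in Python. It assumes packets are (srcIP, dstIP, len, ts) tuples with ts in seconds; the function name `windowed_counts` and the sample data are hypothetical. It deliberately ignores the streaming question of when a window can be emitted.

```python
from collections import Counter

WINDOW_SECONDS = 60  # corresponds to ts/60 in Query 1


def windowed_counts(packets):
    """Count packets per (tb, srcIP, destIP), where tb = ts // 60."""
    counts = Counter()
    for src_ip, dst_ip, _length, ts in packets:
        tb = ts // WINDOW_SECONDS  # window (time bucket) the packet falls into
        counts[(tb, src_ip, dst_ip)] += 1
    return counts


# Hypothetical sample input: (srcIP, dstIP, len, ts)
packets = [
    ("a", "b", 64, 30),
    ("a", "b", 32, 45),
    ("a", "c", 64, 61),
    ("a", "b", 128, 119),
]
result = windowed_counts(packets)
```

Each distinct (tb, srcIP, destIP) combination becomes one output row; a streaming evaluator additionally has to decide when a given tb is complete.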
4 IOP Evaluation – Current Approach. Merge is an order-preserving implementation of UnionAll. How to determine end of windows? Query 1: SELECT tb, srcIP, destIP, count(*) FROM A union B union C GROUP BY ts/60 as tb, srcIP, destIP. [Diagram: sources A, B, C pass through sort and merge operators into count(*) group by tb, srcIP, destIP, with windows on ts]
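The in-order approach can be sketched as follows, assuming each source is already sorted on ts. The sketch uses Python's `heapq.merge` as a stand-in for the order-preserving Merge operator and closes a window as soon as a tuple from a later window arrives; `iop_counts` and the sample sources are hypothetical names, not Gigascope code.

```python
import heapq


def iop_counts(*sources, window=60):
    """Yield (tb, srcIP, destIP, count) rows in window order from ts-sorted sources."""
    # Order-preserving union: always emits the smallest pending ts next.
    merged = heapq.merge(*sources, key=lambda pkt: pkt[3])
    current_tb, counts = None, {}
    for src_ip, dst_ip, _length, ts in merged:
        tb = ts // window
        if current_tb is not None and tb > current_tb:
            # Input is ordered, so a tuple from a later window ends the current one.
            for (s, d), n in sorted(counts.items()):
                yield (current_tb, s, d, n)
            counts = {}
        current_tb = tb
        counts[(src_ip, dst_ip)] = counts.get((src_ip, dst_ip), 0) + 1
    for (s, d), n in sorted(counts.items()):  # flush the last window at end of input
        yield (current_tb, s, d, n)


# Hypothetical ts-sorted sources: (srcIP, dstIP, len, ts)
A = [("a", "b", 64, 10), ("a", "b", 64, 70)]
B = [("a", "b", 32, 20), ("c", "d", 32, 65)]
C = [("a", "b", 16, 59)]
rows = list(iop_counts(A, B, C))
```

The merge can only release a tuple once every input has shown something at least as late, so a slow or silent source holds back the whole plan.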
5 Problems: performance penalty. Burst: caused by maintaining stream order; may overload the stream system. Memory overhead: sort; order-preserving merge (time skew, lulls). Latency: maintaining stream order delays tuple processing. [Diagram: query plan over sources A, B, C with sort, merge, and count(*) group by tb, srcIP, destIP]
6 Do We Have to Maintain Stream Order?
7 Stream Query – Windowed Aggregation. Query 1: SELECT tb, srcIP, destIP, count(*) FROM A union B union C GROUP BY ts/60 as tb, srcIP, destIP. Stream query evaluation essentially requires information on stream progress. [Diagram: sources A, B, C feed two union operators, then count(*) group by tb, srcIP, destIP, with windows on ts]
8 Outline: stream query and existing evaluation approach (IOP); the out-of-order processing (OOP) alternative; the OOP implementation in Gigascope; initial performance results
9 Disorder. External sources: merging multiple data sources; different transmission routes (e.g., sensor networks); multiple possible windowing attributes, e.g., start time and end time of netflows. Internal sources: data prioritization [Urhan and Franklin, 2001]; query processing algorithms, e.g., shared window joins [Hammad et al., 2003]
10 OOP Stream Query Evaluation – Leveraging Punctuation. Punctuation is a special tuple embedded in a data stream that indicates the progress of the stream, e.g., (*, *, *, 10:02:03am). [Animation residue: data tuples ( , , 64, 10:01:30am), ( , , 64, 10:01:45am), and ( , , 32, 10:02:05am) flow from sources A, B, C through union into count(*) group by tb, srcIP, destIP; punctuations (*, *, *, 10:02:00am) and (*, *, *, 10:03:00am) travel with them; result table columns srcIP, destIP, tb, cnt; the figure also shows (118, , , 58)]
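A minimal sketch of punctuation-driven evaluation, under the assumption that a punctuation with timestamp t promises no later tuple has ts below t, and that a union forwards the minimum of the latest punctuations seen on its inputs. The class `OOPCount`, the helper `union_punctuation`, and the integer timestamps are hypothetical illustrations, not the Gigascope implementation.

```python
from collections import defaultdict


class OOPCount:
    """Windowed count that accepts out-of-order tuples and closes windows on punctuation."""

    def __init__(self, window=60):
        self.window = window
        # tb -> (srcIP, destIP) -> count; several windows may be open at once
        self.open_windows = defaultdict(lambda: defaultdict(int))

    def on_tuple(self, src_ip, dst_ip, length, ts):
        self.open_windows[ts // self.window][(src_ip, dst_ip)] += 1

    def on_punctuation(self, ts):
        """Emit every window that ends at or before ts; later windows stay open."""
        done = sorted(tb for tb in self.open_windows if (tb + 1) * self.window <= ts)
        out = []
        for tb in done:
            groups = self.open_windows.pop(tb)
            out.extend((tb, s, d, n) for (s, d), n in sorted(groups.items()))
        return out


def union_punctuation(latest_per_input):
    """Assumed rule: a union can only promise progress up to its slowest input."""
    return min(latest_per_input.values())


agg = OOPCount()
agg.on_tuple("a", "b", 64, 90)    # window 1
agg.on_tuple("a", "b", 32, 125)   # window 2, arrives early
agg.on_tuple("a", "b", 64, 105)   # window 1 again, out of order
p = union_punctuation({"A": 120, "B": 180, "C": 180})
closed = agg.on_punctuation(p)
```

No sorting or order-preserving merge is needed: tuples are counted as they arrive, and only the punctuation decides when a window's result is final.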
11 Outline: stream query and existing evaluation approach (IOP); the out-of-order processing (OOP) alternative; the OOP implementation in Gigascope; initial performance results
12 Gigascope Architecture. The bulk of the processing is performed at the RTS. Low-level queries read directly from the packet buffer, which avoids copying the packet data to multiple queries. Low-level queries are small and lightweight (selection, projection, partial aggregation), which ensures timely processing and a small cache footprint. [Diagram: NIC, circular buffer, RTS with low-level queries q1, q2, q3, …, high-level queries Q1 and Q2, App]
13 Aggregation in Gigascope (IOP). Low-level aggregation: maintains a small, fixed-size hash table and outputs on collisions. Slow flush to smooth output traffic: flushes the results of window n-1 gradually while processing input tuples in window n. However, it can still create bursts in order to maintain output order. Query 2: SELECT tb, srcIP, destIP, count(*) FROM TCP GROUP BY ts/60 as tb, srcIP, destIP. Low-level query q1: select tb, srcIP, destIP, count(*) from TCP group by ts/60 as tb, srcIP, destIP. High-level query Q1: SELECT tb, srcIP, destIP, sum (Cnt) FROM q1 GROUP BY tb, srcIP, destIP. [Animation residue: successive states of the low-level hash table with columns tb, destIP, srcIP, cnt; input tuples (a, c, 128, 4870) and (x, y, 32, 4880); output tuples (80, a, b, 26) and (80, a, c, 78)]
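The split between low-level partial aggregation and high-level re-aggregation can be sketched as below. This is a simplified illustration, assuming one entry per hash slot and eviction of the resident group on collision; `LowLevelAgg`, `high_level_sum`, and the slot count are hypothetical, and the slow-flush logic is omitted.

```python
class LowLevelAgg:
    """Fixed-size table of partial counts; a colliding key evicts the resident group."""

    def __init__(self, slots=4):
        self.slots = [None] * slots  # each slot holds [key, count] or None

    def add(self, key):
        """Count key; return an evicted (key, count) partial on collision, else None."""
        i = hash(key) % len(self.slots)
        entry = self.slots[i]
        if entry is None:
            self.slots[i] = [key, 1]
            return None
        if entry[0] == key:
            entry[1] += 1
            return None
        self.slots[i] = [key, 1]  # collision: output the old partial, keep the new key
        return (entry[0], entry[1])

    def flush(self):
        out = [tuple(e) for e in self.slots if e is not None]
        self.slots = [None] * len(self.slots)
        return out


def high_level_sum(partials):
    """High-level query: sum partial counts per (tb, srcIP, destIP)."""
    totals = {}
    for key, cnt in partials:
        totals[key] = totals.get(key, 0) + cnt
    return totals


low = LowLevelAgg(slots=1)  # a single slot forces collisions for the demo
partials = []
for key in [(80, "a", "b"), (80, "a", "b"), (80, "a", "c"), (80, "a", "b")]:
    evicted = low.add(key)
    if evicted is not None:
        partials.append(evicted)
partials.extend(low.flush())
totals = high_level_sum(partials)
```

Because a group can be evicted and re-created several times within one window, the high-level query sums the partial counts rather than counting again.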
14 Aggregation in Gigascope (OOP). Low-level aggregation: does not need to maintain stream order; allows a delay of k windows; smooths output traffic better. Heartbeat carries punctuation: initially generated by the callback function of a timer in low-level queries; in high-level queries, each operator propagates heartbeat/punctuation. Query 2: SELECT tb, srcIP, destIP, count(*) FROM TCP GROUP BY ts/60 as tb, srcIP, destIP. Low-level query q1: select tb, srcIP, destIP, count(*) from TCP group by ts/60 as tb, srcIP, destIP. High-level query Q1: SELECT tb, srcIP, destIP, sum (Cnt) FROM q1 GROUP BY tb, srcIP, destIP. [Animation residue: successive states of the low-level hash table with columns tb, destIP, srcIP, cnt, holding groups from windows 80 through 84 at the same time; input tuple (a, b, 32, 5050); output tuples (80, d, a, 64) and (83, c, b, 99)]
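One possible reading of "a delay of k windows" combined with heartbeat-carried punctuation is sketched below. It assumes that on each heartbeat the low-level operator emits groups that are at least k windows old and then a punctuation covering them. The class `OOPLowLevelAgg`, the tagged `("data", …)` and `("punct", …)` outputs, and the unbounded dictionary (instead of a fixed-size hash table) are simplifications made for illustration, not the Gigascope design.

```python
class OOPLowLevelAgg:
    """Partial counts for several windows at once; old windows leave on heartbeat."""

    def __init__(self, k=2):
        self.k = k       # how many windows a group may lag behind
        self.table = {}  # (tb, srcIP, destIP) -> count, windows freely mixed

    def add(self, tb, src_ip, dst_ip):
        key = (tb, src_ip, dst_ip)
        self.table[key] = self.table.get(key, 0) + 1

    def on_heartbeat(self, current_tb):
        """Emit groups with tb <= current_tb - k, followed by a punctuation for that tb."""
        limit = current_tb - self.k
        old = sorted(key for key in self.table if key[0] <= limit)
        out = [("data", key + (self.table.pop(key),)) for key in old]
        out.append(("punct", limit))  # downstream operators may now close windows <= limit
        return out


agg = OOPLowLevelAgg(k=2)
agg.add(80, "d", "a")
agg.add(82, "n", "m")
agg.add(81, "c", "a")
agg.add(80, "d", "a")
out = agg.on_heartbeat(82)
remaining = sorted(agg.table)
```

Since output order is no longer tied to window order, the operator has slack in choosing when to release each group, which is what lets it spread output over time.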
15 Outline: stream query and existing evaluation approach (IOP); the out-of-order processing (OOP) alternative; the OOP implementation in Gigascope; initial performance results
16 Performance Study – Traffic Shaping. Data skew: 90% of data goes to 10% of groups. Query 2: SELECT tb, srcIP, destIP, count(*) FROM TCP GROUP BY time/60 as tb, srcIP, destIP. [Plot axes: number of groups; max data rate (kilo pkts/sec)]
17 Performance Study – Memory. Data rate: 110,000 pkts/sec. #. of groups: Query 3: SELECT tb, srcIP, destIP, count(*) FROM A union B GROUP BY time/10 as tb, srcIP, destIP. [Plot axes: time skew (sec); memory usage (MB)]
18 Conclusion and Future Work. Conclusion: the performance study verifies the benefits of OOP with high-volume data. Future work: other operators such as join; more performance numbers
19 Questions?