Out of Order Processing for Stream Query Evaluation. Jin Li (Portland State University). Joint work with Theodore Johnson, Vladislav Shkapenyuk, David Maier, and Kristin Tufte.

Presentation transcript:

1 Out of Order Processing for Stream Query Evaluation
Jin Li (Portland State University)
Joint work with Theodore Johnson, Vladislav Shkapenyuk, David Maier, and Kristin Tufte

2 Stream (monitoring) applications
Network packets, transportation data, sensor data, stock quotes, …
• Process data online
• Often require (near) real-time response
• Often involve data from multiple sources
• Have no control over the physical properties of the data
Challenges for database-style support of stream applications
• Continuous query vs. one-time query, e.g. "count the number of packets in the past minute"
• Adapt traditional query operators to data streams (requires a time attribute)
[Figure: sources A, B, and C send tuples (srcIP, dstIP, len, ts) to the Stream Query System]

3 Stream Query – Windowed Aggregation
Query 1: SELECT tb, srcIP, destIP, count(*) FROM A union B union C GROUP BY ts/60 as tb, srcIP, destIP
"Count the number of packets in each minute; update the result every minute"
[Figure: query plan. Sources A, B, and C feed two union operators, whose output feeds count(*) group by tb, srcIP, destIP over ts windows]
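
To make the grouping in Query 1 concrete, here is a minimal sketch in Python. It assumes each packet is a tuple (srcIP, dstIP, len, ts) with ts in integer seconds; the function name and the sample packets are made up for illustration. It computes the counts over a finite list, so it shows only the grouping, not the "update every minute" behaviour, which needs some way to know that a window has ended.

```python
from collections import Counter

WINDOW_SECONDS = 60  # Query 1 groups by ts/60

def count_per_window(packets):
    """Count packets per (tb, srcIP, destIP), where tb = ts // 60."""
    counts = Counter()
    for src_ip, dst_ip, _length, ts in packets:
        tb = ts // WINDOW_SECONDS  # integer division assigns the tuple to a window
        counts[(tb, src_ip, dst_ip)] += 1
    return dict(counts)

# Hypothetical input: (srcIP, dstIP, len, ts)
packets = [
    ("a", "b", 64, 4805),
    ("a", "b", 32, 4830),
    ("c", "d", 64, 4859),
    ("a", "b", 128, 4870),
]
result = count_per_window(packets)
```

The first three packets fall into window 80 (seconds 4800 to 4859) and the last into window 81.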

4 IOP Evaluation – Current Approach
Query 1: SELECT tb, srcIP, destIP, count(*) FROM A union B union C GROUP BY ts/60 as tb, srcIP, destIP
• Merge is an order-preserving implementation of UnionAll
• How to determine end of windows?
[Figure: query plan. Sources A, B, and C pass through sort and merge operators before count(*) group by tb, srcIP, destIP over ts windows]
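
The in-order approach can be sketched as follows. This is an illustration under assumptions, not Gigascope code: each source is assumed to be already sorted by ts, Python's `heapq.merge` stands in for the order-preserving merge, and the function name is hypothetical. Because the merged stream is ordered, the first tuple of a later window is what signals that the current window has ended.

```python
import heapq
from collections import Counter

def iop_window_counts(*sources, window=60):
    """Yield (tb, srcIP, destIP, cnt) rows, one window at a time, in tb order."""
    # Order-preserving union: every source must already be sorted by ts.
    merged = heapq.merge(*sources, key=lambda t: t[3])
    current_tb, counts = None, Counter()
    for src_ip, dst_ip, _length, ts in merged:
        tb = ts // window
        if current_tb is not None and tb > current_tb:
            # A tuple from a later window proves the current window is complete.
            for (s, d), cnt in sorted(counts.items()):
                yield (current_tb, s, d, cnt)
            counts.clear()
        current_tb = tb
        counts[(src_ip, dst_ip)] += 1
    for (s, d), cnt in sorted(counts.items()):  # end of input closes the last window
        yield (current_tb, s, d, cnt)

# Hypothetical sources, each sorted by ts
A = [("a", "b", 64, 4805), ("a", "b", 64, 4870)]
B = [("c", "d", 32, 4830)]
C = [("a", "b", 64, 4859)]
rows = list(iop_window_counts(A, B, C))
```

Note the cost hidden in the merge: it cannot release a tuple until it has seen the head of every input, so a slow or idle source holds back the others.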

5 Problems
Performance penalty
• Burst: caused by maintaining stream order; may overload the stream system
• Memory overhead: sort; order-preserving merge (time skew, lulls)
• Latency: maintaining stream order delays tuple processing
[Figure: query plan with sort and merge operators over sources A, B, and C, feeding count(*) group by tb, srcIP, destIP]

6 Do We Have to Maintain Stream Order?

7 Stream Query – Windowed Aggregation
Query 1: SELECT tb, srcIP, destIP, count(*) FROM A union B union C GROUP BY ts/60 as tb, srcIP, destIP
Stream query evaluation essentially requires information on stream progress
[Figure: query plan. Sources A, B, and C feed two union operators, whose output feeds count(*) group by tb, srcIP, destIP over ts windows]

8 Outline
• Stream query and existing evaluation approach (IOP)
• The out-of-order processing (OOP) alternative
• The OOP implementation in Gigascope
• Initial performance results

9 Disorder
External sources
• Merging multiple data sources
• Different transmission routes (e.g., sensor networks)
• Multiple possible windowing attributes, e.g., start time and end time of netflows
Internal sources
• Data prioritization [Urhan and Franklin, 2001]
• Query processing algorithms, e.g., shared window joins [Hammad, et al., 2003]

10 OOP Stream Query Evaluation – Leveraging Punctuation
• Punctuation is a special tuple embedded in a data stream that indicates the progress of the stream; e.g. (*, *, *, 10:02:03am)
[Figure: animation. Data tuples with timestamps 10:01:30am, 10:01:45am, and 10:02:05am, together with punctuations (*, *, *, 10:02:00am) and (*, *, *, 10:03:00am), flow from sources A, B, and C through union into count(*) group by tb, srcIP, destIP; the result table has columns srcIP, destIP, tb, cnt]
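
A minimal Python sketch of punctuation-driven evaluation follows. The class names, the event encoding, and the punctuation semantics are assumptions made for illustration: a punctuation ("punct", t) is taken to promise that its source will send no more tuples with ts < t. The union forwards data tuples immediately, in arrival order, and forwards a punctuation only once every input has passed that point. The aggregate emits a window as soon as a punctuation covers its end.

```python
from collections import Counter

WINDOW = 60

class PunctuatedUnion:
    """Union that does not reorder data; it only combines progress information."""
    def __init__(self, source_ids):
        self.progress = {sid: None for sid in source_ids}
        self.emitted = None

    def push(self, source_id, event):
        kind, payload = event
        if kind == "data":
            return [event]  # forwarded immediately, possibly out of order
        self.progress[source_id] = payload
        if any(p is None for p in self.progress.values()):
            return []  # some input has not reported progress yet
        low = min(self.progress.values())  # the slowest input bounds overall progress
        if self.emitted is None or low > self.emitted:
            self.emitted = low
            return [("punct", low)]
        return []

class WindowCount:
    """count(*) group by tb, srcIP, destIP, closed by punctuation."""
    def __init__(self):
        self.counts = Counter()

    def push(self, event):
        kind, payload = event
        if kind == "data":
            src, dst, _length, ts = payload
            self.counts[(ts // WINDOW, src, dst)] += 1
            return []
        # Emit every group whose window ends at or before the punctuation.
        closed = sorted(k for k in self.counts if (k[0] + 1) * WINDOW <= payload)
        return [(tb, s, d, self.counts.pop((tb, s, d))) for tb, s, d in closed]

# Hypothetical interleaving of two sources; timestamps are integer seconds.
events = [
    ("A", ("data", ("a", "b", 64, 4850))),
    ("B", ("data", ("c", "d", 32, 4865))),  # window 81 arrives before window 80 is done
    ("B", ("punct", 4860)),
    ("A", ("data", ("a", "b", 64, 4810))),  # out of order, still counted in window 80
    ("A", ("punct", 4860)),
]
union, agg, rows = PunctuatedUnion(["A", "B"]), WindowCount(), []
for source_id, event in events:
    for out in union.push(source_id, event):
        rows.extend(agg.push(out))
```

Window 80 is emitted only after both sources have punctuated 4860; the window 81 group stays in the aggregate's state until a later punctuation arrives.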

11 Outline
• Stream query and existing evaluation approach (IOP)
• The out-of-order processing (OOP) alternative
• The OOP implementation in Gigascope
• Initial performance results

12 Gigascope Architecture
Bulk of the processing is performed at the RTS.
Low-level queries read directly from the packet buffer.
• Avoid copying the packet data to multiple queries.
Low-level queries are small and lightweight.
• Selection, projection, partial aggregation.
• Ensure timely processing, small cache footprint.
[Figure: architecture diagram with NIC, circular buffer, RTS, low-level queries q1, q2, q3, …, high-level queries Q1 and Q2, and App]

13 Aggregation in Gigascope (IOP)
Low-level aggregation
• Maintains a small, fixed-size hash table; outputs on collisions
• Slow flush to smooth output traffic: flushes the results of window n-1 gradually while processing input tuples in window n
• However, can still create bursts in order to maintain output order
Query 2: SELECT tb, srcIP, destIP, count(*) FROM TCP GROUP BY ts/60 as tb, srcIP, destIP
q1 (low-level): select tb, srcIP, destIP, count(*) from TCP group by ts/60 as tb, srcIP, destIP
Q1 (high-level): SELECT tb, srcIP, destIP, sum(Cnt) FROM q1 GROUP BY tb, srcIP, destIP
[Figure: animation of the low-level hash table (columns tb, destIP, srcIP, cnt), starting with rows (80, b, a, 26), (80, c, a, 78), (80, b, c, 99), (80, a, d, 64). Input tuples (a, c, 128, 4870) and (x, y, 32, 4880) arrive; tuples (80, a, b, 26) and (80, a, c, 78) are output]
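
The split between q1 and Q1 can be sketched in Python as below. This is a simplified illustration, not Gigascope's implementation: the direct-mapped table, the class and function names, and the sample keys are assumptions, and the slow flush and output-ordering logic are left out. It shows why partial counts ejected on collisions are safe: the high-level query sums them back into exact totals.

```python
from collections import Counter

class LowLevelCount:
    """Fixed-size, direct-mapped table: a colliding group evicts the resident one."""
    def __init__(self, slots=4, slot_of=hash):
        self.slots = [None] * slots  # each slot holds [key, cnt] or None
        self.slot_of = slot_of

    def add(self, key):
        i = self.slot_of(key) % len(self.slots)
        entry = self.slots[i]
        if entry is not None and entry[0] == key:
            entry[1] += 1  # same group: update in place
            return []
        self.slots[i] = [key, 1]  # new group takes over the slot
        return [] if entry is None else [(entry[0], entry[1])]  # evicted partial count

    def flush(self):
        out = [(e[0], e[1]) for e in self.slots if e is not None]
        self.slots = [None] * len(self.slots)
        return out

def high_level_sum(partials):
    """Q1: sum the partial counts per (tb, srcIP, destIP)."""
    totals = Counter()
    for key, cnt in partials:
        totals[key] += cnt
    return dict(totals)

# Hypothetical group keys (tb, srcIP, destIP), one per arriving packet
keys = [(80, "a", "b"), (80, "a", "c"), (80, "a", "b"),
        (81, "a", "c"), (80, "a", "b"), (80, "a", "c")]
low, partials = LowLevelCount(slots=2), []
for key in keys:
    partials.extend(low.add(key))
partials.extend(low.flush())
totals = high_level_sum(partials)
```

How many partial rows are produced depends on which groups collide, but the totals do not.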

14 Aggregation in Gigascope (OOP)
Low-level aggregation
• Does not need to maintain stream order
• Allows a delay of k windows
• Smooths output traffic better
Heartbeat carries punctuation
• Initially generated by the callback function of a timer in low-level queries
• In high-level queries, each operator propagates heartbeat/punctuation
Query 2: SELECT tb, srcIP, destIP, count(*) FROM TCP GROUP BY ts/60 as tb, srcIP, destIP
q1 (low-level): select tb, srcIP, destIP, count(*) from TCP group by ts/60 as tb, srcIP, destIP
Q1 (high-level): SELECT tb, srcIP, destIP, sum(Cnt) FROM q1 GROUP BY tb, srcIP, destIP
[Figure: animation of the low-level hash table (columns tb, destIP, srcIP, cnt) holding groups from several windows at once, starting with rows (82, n, m, 16), (81, c, a, 78), (83, b, c, 99), (80, a, d, 64). Input tuple (a, b, 32, 5050) arrives; tuples (80, d, a, 64) and (83, c, b, 99) are output]
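
One way to picture the OOP low-level aggregate is the Python sketch below. It is a guess at the behaviour under stated assumptions, not Gigascope's code: the class name, the `slot_of` function, and the exact punctuation rule are hypothetical. Groups evicted by collisions are output at once, whatever their window. A timer-driven heartbeat flushes groups older than k windows and then emits a punctuation ("punct", tb) promising that no later output has a window id below tb. The sketch also assumes the timer clock and the packet timestamps agree.

```python
class OopLowLevelCount:
    def __init__(self, slots=4, k=1, window=60, slot_of=hash):
        self.slots = [None] * slots  # each slot holds [key, cnt] or None
        self.k, self.window, self.slot_of = k, window, slot_of

    def on_tuple(self, src, dst, _length, ts):
        key = (ts // self.window, src, dst)
        i = self.slot_of(key) % len(self.slots)
        entry = self.slots[i]
        if entry is not None and entry[0] == key:
            entry[1] += 1
            return []
        self.slots[i] = [key, 1]
        if entry is None:
            return []
        # Collision: emit the evicted group now, without waiting for older windows.
        return [("data", entry[0] + (entry[1],))]

    def on_heartbeat(self, now_ts):
        bound = now_ts // self.window - self.k  # windows below this must be closed
        out = []
        for i, entry in enumerate(self.slots):
            if entry is not None and entry[0][0] < bound:
                out.append(("data", entry[0] + (entry[1],)))
                self.slots[i] = None
        out.append(("punct", bound))  # no later output will have tb < bound
        return out

# Deterministic slot function for the demo only: hash on the first letter of srcIP.
agg = OopLowLevelCount(slots=4, k=1, slot_of=lambda key: ord(key[1][0]))
out = []
out += agg.on_tuple("d", "a", 64, 4805)  # window 80, slot 0
out += agg.on_tuple("a", "b", 64, 4865)  # window 81, slot 1
out += agg.on_tuple("e", "f", 64, 4870)  # window 81, slot 1: evicts (81, a, b)
out += agg.on_heartbeat(4925)            # clock is in window 82, so bound = 81
```

The window 81 group leaves the table before the window 80 group does, which an order-preserving aggregate would not allow; the punctuation is what lets the high-level query know when window 80 is complete.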

15 Outline
• Stream query and existing evaluation approach (IOP)
• The out-of-order processing (OOP) alternative
• The OOP implementation in Gigascope
• Initial performance results

16 Performance Study – Traffic Shaping
Data skew: 90% of data goes to 10% of groups
Query 2: SELECT tb, srcIP, destIP, count(*) FROM TCP GROUP BY time/60 as tb, srcIP, destIP
[Figure: max data rate (kilo pkts/sec) vs. number of groups]

17 Performance Study – Memory
Data rate: 110,000 pkts/sec
#. of groups:
Query 3: SELECT tb, srcIP, destIP, count(*) FROM A union B GROUP BY time/10 as tb, srcIP, destIP
[Figure: memory usage (MB) vs. time skew (sec)]

18 Conclusion and Future Work
• Verifies the benefits of OOP with high-volume data
• Other operators such as join
• More performance numbers

19 Questions?