Continuous Stream Monitoring Technology Elke A. Rundensteiner Database Systems Research Laboratory Department of Computer Science Worcester Polytechnic Institute, USA October 2006
2 Project Topics in a Nutshell Distributed Data Sources: EVE : Data Warehousing over Distributed Data TOTAL-ETL : Distributed Extract Transform Load [NSF’96,NSF02,IBM] XML/Web Data Systems: RAINBOW : XML to Relational Databases MASS : Native XQuery Processing System [Verizon,IBM,NSF05] Databases & Visualization: Scalable Visual High-Dim. Data Exploration Data and Visual Quality Support in XMDV [NSF’97,NSF01,NSF05] Stream Monitoring System: Scalable Query Engine for Data Streams Fire Prediction and Monitoring Appl. [NSF06, NEC ]
3 Why Database Technology? Vast amount of electronic information in organisations, companies, and scientific institutes that needs to be organized, stored securely, and accessed efficiently Database management systems (DBMSs) provide: Model for logical structure of information Query languages to access and modify data Persistent data storage over long time Index technologies Efficient query processing and optimization Concurrent access for multiple users Access rights and security Scalability in query workload and data size Stored Database DBMS Select name from employee;
4 Generations of DBMSs Early DBMSs Navigational access Relational DBMSs Traditional tables and SQL queries Object-oriented DBMSs Object modeling and extensibility Object-relational DBMSs Combine declarative queries with OO modeling XML DBMSs Support web and semi-structured data types
5 Question... What is common among these DBMSs?
6 Answer... Three common steps: Make schema design Load database Query static database Key Differences: Different data models
7 So what next?
8 A Look at Modern Applications Digital radio telescopes Network traffic monitoring Environmental Monitoring Tracking using RFID Tags Sensor networks Analyses of web usage logs Financial analysis of stock exchanges Out-patient critical care ... Filter & Transform select fft(s) from radiosignal s where source(s)= “Antenna1”;
9 A Look at Modern Applications What do those applications have in common?
10 Continuous Queries on Data Streams Online Stream Monitoring
11 Databases : A Paradigm Shift ! data Query data streams of data static data Ad-hoc one-time queries Continuous standing queries
12 Data Streams and Continuous Queries Data streams: Continuous on-line ordered sequences Produced by sensors, simulations, and instruments Data pushed to reactive applications Result also continuous output streams Stream queries: Continuous long-running or even infinite queries On-the-fly real-time processing as data arrives Constrained processing time and memory usage Selective stream storage (often of recent past)
13 Requirements for Data Stream Management Systems (DSMSs) Non-blocking operators in query plans Windows: Infinite streams into finite sub-streams One-pass query algorithms Approximate query answers Real-time response for unusual behavior detected Adaptation to environmental changes
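The window and one-pass requirements above can be illustrated with a minimal sketch. It assumes a time-based sliding window and a running average as the query; the class name SlidingWindowAverage and its parameters are hypothetical and not taken from any of the systems in this talk.

```python
from collections import deque


class SlidingWindowAverage:
    """One-pass average over the most recent `width` time units of a stream."""

    def __init__(self, width):
        self.width = width
        self.items = deque()  # (timestamp, value) pairs still inside the window
        self.total = 0.0

    def push(self, timestamp, value):
        # Assumption: timestamps arrive in non-decreasing order.
        self.items.append((timestamp, value))
        self.total += value
        # Expire tuples that have fallen out of the window.
        while self.items and self.items[0][0] <= timestamp - self.width:
            _, old_value = self.items.popleft()
            self.total -= old_value
        # Non-blocking: a result is emitted for every arriving tuple.
        return self.total / len(self.items)
```

Each tuple is touched once on arrival and once on expiry, so memory stays bounded by the window contents rather than by the length of the stream.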
14 DSMS Provides: High-level query language (declarative interface) Data independence from physical stream implementations Query optimization (for performance) Scalability in data volume and query workload Shared execution of similar queries Adaptive distributed processing
15 Real-time Stream Query Processing: Parallelism Process Queries on shared-nothing architectures (cluster or Grid ) Make use of aggregated resources (main memory, CPU) Network Clusters of Machines Query Workload Acquired NSF Equipment grant 2006 for Purchase of High-Performance Cluster For Stream Processing Applications
16 Three Types of Parallelism We Exploit Pipelined: Operators are composed into producer-consumer relationships Independent: Independent operators run simultaneously on distinct machines Partitioned: A single operator is replicated and run on multiple machines Adaptation Considered Within Each Processing Paradigm
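Partitioned parallelism can be sketched as routing each tuple to one replica of an operator by hashing its key. This is only an illustration under assumptions: the function names and the choice of CRC32 as the hash are hypothetical, and real replicas would run on separate machines rather than in local lists.

```python
import zlib


def partition_of(key, num_partitions):
    # CRC32 is used because it is stable across processes,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def route(tuples, num_partitions):
    """Split a batch of (key, payload) tuples across operator replicas."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in tuples:
        partitions[partition_of(key, num_partitions)].append((key, payload))
    return partitions
```

Tuples with the same key always reach the same replica, which is what lets a replicated join or aggregate work on its partition independently.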
17 Project 1 : Mobile Wireless Application Streams - moving objects - dynamic range query - dynamic kNN query
18 Scuba Project : Mobile Application Streams Scalability Large number of objects Large number of queries Limited Resources Memory CPU Real-time Response Requirement The challenge is to provide fast query response in update-intensive environments - moving objects - dynamic range query - dynamic kNN query Novel Idea: Exploit the fact that objects naturally move in groups (i.e., clusters) to optimize query evaluation
19 Spatio-Temporal Continuous Tracking Monitor the traffic in the red areas Continuously return the area covered by the herd during the migration
20 Main Idea: Moving Clusters Main Idea: Abstracting individual objects into a cluster based on common attributes - Direction - Speed - Spatial Position With cluster abstractions, minimize the number of unnecessary individual object/query joins, thus optimizing query evaluation Continuously retrieve closest police car next to me Police Car Scalable Cluster-Based Algorithm for Evaluating Continuous Spatio-Temporal Queries on Moving Objects (SCUBA)
21 Advantage of Moving Cluster Abstraction When clusters don’t overlap, we avoid many joins of individual objects within those clusters [Figure: two non-overlapping moving clusters m1 and m2; legend: moving object, spatio-temporal range query] No need to join objects/queries in m1 with queries/objects in m2 If two abstractions do not overlap, then we can discard negative candidates and avoid individual joins for spatio-temporal range queries. Scuba presented April 2006 at EDBT’06
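The pruning idea can be sketched as follows. This is not the SCUBA implementation: it assumes each cluster is abstracted as a circle (center and radius) that covers its member objects and the full extent of its member range queries, and all names are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MovingCluster:
    cx: float
    cy: float
    radius: float  # assumed to cover all member objects and query ranges
    objects: list = field(default_factory=list)  # (object_id, x, y)
    queries: list = field(default_factory=list)  # (query_id, x, y, range_radius)


def clusters_overlap(a, b):
    return math.hypot(a.cx - b.cx, a.cy - b.cy) <= a.radius + b.radius


def join_clusters(a, b):
    """Return (query_id, object_id) matches between two distinct clusters."""
    if not clusters_overlap(a, b):
        return []  # pruned: no individual object/query joins are evaluated
    matches = []
    for src, dst in ((a, b), (b, a)):
        for qid, qx, qy, qrange in src.queries:
            for oid, ox, oy in dst.objects:
                if math.hypot(qx - ox, qy - oy) <= qrange:
                    matches.append((qid, oid))
    return matches
```

The single circle test replaces a nested loop over members whenever two clusters are far apart, which is where the savings come from in update-intensive workloads.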
22 Stream Queries for Mobile Traffic Services Monitor the traffic in the red areas Range Query Send e-coupons to all cars for which I am the nearest gas station Reverse-NN Query How many cars are in the highlighted area? Range Query
23 Raindrop : XQueries on XML Streams (or, Automaton Meets Algebra) Funded by NSF 2005; In collaboration with Prof. Mani
24 What’s Special for XML Stream Processing? Token-by-token access manner along the timeline: … Dream Catcher … King S. … Bt Bound … 30 … Pattern retrieval + Filtering + Restructuring FOR $b in stream(biditems.xml) //book LET $p := $b/price $t := $b/title WHERE $p < 20 Return $t Token: not a direct counterpart of a tuple [Figure: one tuple with columns price, publisher, first, last, title, year and values 30, Bt Bound, S., King, Dream, 2001] Pattern Retrieval on Token Streams
25 Automata-Based Paradigm FOR $b in stream(biditems.xml) //book LET $p := $b/price $t := $b/title WHERE $p < 20 Return $t [Figure: automaton with states 1 to 4 and transitions labeled *, book, title, and price, matching //book, //book/title, and //book/price] Auxiliary structures for: 1. Buffering data 2. Filtering 3. Restructuring …
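A rough sketch of token-by-token evaluation of the query above. It is a simplification, not the Raindrop automaton: a stack of open element names stands in for the automaton state, the token format (kind, value) is a hypothetical one, and nested book elements are assumed not to occur.

```python
def cheap_titles(tokens, limit=20):
    """Yield titles of books priced below `limit`, reading one token at a time."""
    stack = []         # names of the currently open elements
    book_depth = None  # stack depth at which the current book element was opened
    title = price = None
    for kind, value in tokens:
        if kind == "start":
            stack.append(value)
            if value == "book" and book_depth is None:
                book_depth = len(stack)
                title = price = None  # fresh buffer for this book
        elif kind == "text":
            # Pattern retrieval: direct children //book/title and //book/price.
            if book_depth is not None and len(stack) == book_depth + 1:
                if stack[-1] == "title":
                    title = value
                elif stack[-1] == "price":
                    price = float(value)
        elif kind == "end":
            if value == "book" and len(stack) == book_depth:
                # Filtering and restructuring happen once the book is complete.
                if title is not None and price is not None and price < limit:
                    yield title
                book_depth = None
            stack.pop()
```

The buffered title and price play the role of the auxiliary structures on the slide: the pattern matching alone cannot decide what to output until the closing book token arrives.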
26 Observations Either paradigm has deficiencies Both paradigms complement each other
Automata paradigm: Good for pattern retrieval on tokens. Algebra paradigm: Does not support token inputs.
Automata paradigm: Needs patches for filtering and restructuring. Algebra paradigm: Good for filtering and restructuring.
Automata paradigm: Presents all details on the same low level. Algebra paradigm: Supports multiple descriptive levels (declarative -> procedural).
Automata paradigm: Little studied as a query processing paradigm. Algebra paradigm: Well studied as a query processing paradigm.
27 Towards One Uniform Algebraic View Token-based plan (automata plan) Tuple-based plan Tuple stream XML data stream Query answer Algebraic Stream Plan
28 Example Algebraic Plan FOR $b in stream(biditems.xml) //book LET $p := $b/price $t := $b/title WHERE $p < 30 Return $t Tuple-based plan Token-based plan (automata plan)
29 Example Uniform Algebraic Plan FOR $b in stream(biditems.xml) //book LET $p := $b/price $t := $b/title WHERE $p < 30 Return $t StructuralJoin $b ExtractNest $b, $p ExtractNest $b, $t Navigate $b, /price->$p Navigate $b, /title->$t Navigate $S1, //book ->$b Tuple-based plan
30 Example Uniform Algebraic Plan FOR $b in stream(biditems.xml) //book LET $p := $b/price $t := $b/title WHERE $p < 30 Return $t StructuralJoin $b ExtractNest $b, $p ExtractNest $b, $t Navigate $b, /price->$p Navigate $b, /title->$t Navigate $S1, //book ->$b Select $p<30 Tagger “Inexpensive”, $t->$r
31 Plan Rewriting : In or Out? Token-based plan (automata plan) Tuple-based Plan Tuple stream XML data stream Query answer Pattern retrieval in Semantics-focused plan Apply “push into automata”
32 Raindrop Plan Alternatives [Figure: three alternative plans for the same query, with arrows labeled Out and In between them]
Plan 1: Nav $S1, //book->$b; Nav $b, /price->$p; Nav $b, /title->$t; ExtractNest $b, $p; ExtractNest $b, $t; SJoin //book; Select price < 30; Tagger
Plan 2: Nav $S1, //book->$b; ExtractNest $S1, $b; Navigate /price; Select price<30; Navigate book/title; Tagger
Plan 3: NavUnnest $S1, //book ->$b; NavNest $b, /price ->$p; NavNest $b, /title ->$t; Select $p<30; Tagger “Inexpensive”, $t->$r
Statistics Collection and On-line Plan Migration
33 Raindrop : Research Contributions and Issues Costing/query optimization of plans On-the-fly migration into/out of automaton Physical implementation strategies of operators Exploit XML schema constraints for query optimization Load-shedding from an automaton Early memory release optimization Published in CIKM’03, ER’03, DKE’06 Journal, VLDB’05, VLDB’06.
34 FireEngine Project : Sensors in Buildings
35 Fire Monitoring Queries Ambient Queries: What are typical temperature and humidity in given rooms based on environment? Detection Queries: Unusual behaviors or patterns detected? Tracking Queries: Track smoke and heat clouds (moving clusters) in terms of their sizes and speeds. Analysis Queries: Is there an outlier (prank), or an actual fire? Reliability Assessment: Are any sensors faulty, and thus should be ignored? Prediction Queries: Match sensor readings of fire with a fire stream simulation to determine similarity. FireStream Demo to be presented at ICDE’07
36 Project : RFID Event Stream Monitoring Given potentially infinite, heterogeneous, high-speed event streams Goal: detect interesting patterns among events Supply chain management, e.g., (“insufficient inventory” → “no-backup”) or “inventory overflow” Business service optimization, e.g., “search ticket” → “timeout” Anomaly detection, e.g., “pick item” → “no checkout” → “exit” And more … Complex query patterns to be answered in real-time Supported by NEC Cupertino and NSF Princeton
37 Event Processing Example Event stream pick(1), pick(2), pick(3), checkout(3), pick(4), exit(2), … Event Pattern Query EVENT SEQ(PICK p, !(CHECKOUT c), EXIT e) WHERE p.id=c.id AND c.id=e.id WITHIN 12 hours Processing Sequence scan & construction : (p, e) pairs Selection : apply predicates Window : check time constraints Negation : check for negation Transformation : make complex output event Time
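The pattern query above can be approximated in a single pass. The sketch below is a simplification, not the engine's operator pipeline: it folds sequence construction, selection, window, and negation into one loop, assumes events arrive as (timestamp, type, id) triples in time order, and uses hypothetical names throughout.

```python
def detect_unpaid_exits(events, within):
    """Yield (id, pick_time, exit_time) for PICK followed by EXIT without CHECKOUT."""
    picked = {}  # id -> timestamp of a pick that has not been checked out
    for timestamp, event_type, item_id in events:
        if event_type == "pick":
            picked[item_id] = timestamp
        elif event_type == "checkout":
            # Negation: a checkout with the same id cancels the pending pattern.
            picked.pop(item_id, None)
        elif event_type == "exit":
            pick_time = picked.pop(item_id, None)
            # Window: the whole sequence must fit inside `within` time units.
            if pick_time is not None and timestamp - pick_time <= within:
                yield (item_id, pick_time, timestamp)
```

On the example stream pick(1), pick(2), pick(3), checkout(3), pick(4), exit(2), only id 2 matches, because item 3 was checked out and items 1 and 4 have not exited.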
38 Challenges for High-Performance Processing Use “Workflows” to Early Terminate Pattern Queries Optimize Event Pattern Queries Using Rewriting Prefix Sharing of Multiple Event Pattern Queries Scalable Processing Using Cluster
39 CAPE: Uncertainties in Stream Query Processing Register Continuous Queries: High workload of queries Streaming Data (push-based paradigm): May have time-varying rates and high volumes Scalable Stream Query Engine: Memory and CPU resource limitations (continuous evaluation) Available resources for executing each operator may vary over time. Streaming Result: Real-time and accurate responses required Distribution and Adaptations are required.
40 CAPE : Continuous Adaptive Processing Engine -- Adaptation at all Layers Reactive Operator Algorithms Adaptive Scheduling of Operators On-Line Query Plan Reshaping Multi-Query Pipeline Sharing Synchronized Data Tree Spilling Adaptive Cluster-Driven Load Shedding Dynamic Workload Distribution over Cluster Data-Partitioning for Parallel Stream Processing
41 Adaptation Techniques in CAPE On-Line Query Plan Reshaping (with Yali Zhu and G. Heineman) Published in ACM SIGMOD 2004, and in submission to the TODS journal
42 Run-time Plan Re-Optimization Step1 - Decide when to optimize Statistics monitoring Step2 – Generate new query plan Query optimization Step3 – Replace current plan by new plan Plan Migration
43 Naïve Plan Migration Strategy Migration Steps Pause execution of old plan Drain out all tuples inside old plan Replace old plan by new plan Resume execution of new plan [Figure: old and new join plans over inputs A, B, and C, built from joins AB and BC] Problem: Works for stateless operators only
44 Stateful Operator in CQ Why stateful? Need non-blocking operators in CQ Operator needs to output partial results [Figure: join AB with State A and State B] Symmetric hash join: For each new tuple of A, purge state B, join with state B, insert into state A Key Observation: The purge of tuples in states relies on processing of new tuples.
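The purge, join, insert cycle of a symmetric hash join can be sketched as below. This is a minimal illustration, assuming an equi-join on a single key, a time-based window of length `window`, and timestamps that are non-decreasing across both inputs; the class and method names are hypothetical.

```python
from collections import defaultdict, deque


class SymmetricHashJoin:
    def __init__(self, window):
        self.window = window
        # One state per input: join key -> deque of (timestamp, payload), oldest first.
        self.state = {"A": defaultdict(deque), "B": defaultdict(deque)}

    def _purge(self, side, now):
        for key in list(self.state[side]):
            bucket = self.state[side][key]
            while bucket and bucket[0][0] < now - self.window:
                bucket.popleft()
            if not bucket:
                del self.state[side][key]

    def process(self, side, timestamp, key, payload):
        other = "B" if side == "A" else "A"
        # 1. Purge the opposite state using the new tuple's timestamp.
        self._purge(other, timestamp)
        # 2. Join the new tuple with the opposite state; output is always (A, B).
        partners = self.state[other].get(key, ())
        if side == "A":
            results = [(payload, p) for _, p in partners]
        else:
            results = [(p, payload) for _, p in partners]
        # 3. Insert the new tuple into its own state.
        self.state[side][key].append((timestamp, payload))
        return results
```

Note that `_purge` only ever runs inside `process`: a state shrinks only when a new tuple arrives on the other input, which is the key observation on this slide.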
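45 Naïve Migration Strategy Revisited Steps (1) Pause execution of old plan (2) Drain out all tuples inside old plan (3) Replace old plan by new plan (4) Resume execution of new plan [Figure: old and new join plans over inputs A, B, and C, annotated with (2) all tuples drained, (3) old replaced by new, (4) processing resumed] Problem: Deadlock waiting. Draining in step (2) relies on purging, purging relies on processing of new tuples, and no new tuples are processed while the plan is paused in step (1).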
46 Proposed Dynamic Migration Strategies Moving State Strategy Parallel Track Strategy
47 Moving State Strategy Basic idea Share common states between two boxes Key Steps Identify common states State matching Share common states State moving Recompute unmatched states State recomputing
48 Moving State Strategy State Matching State in old box has unique ID During rewriting, new ID given to newly generated state in new box When rewriting done, match states based on IDs. State Moving Between matched states On same machine, creates new pointers for matched states in new box What’s left? Unmatched states in new box [Figure: old box with joins AB, BC, CD and states S_A, S_B, S_AB, S_C, S_ABC, S_D; new box with joins AB, BC, CD and states S_A, S_B, S_C, S_BC, S_D, S_BCD; both boxes read input queues Q_A, Q_B, Q_C, Q_D and write output queue Q_ABCD]
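State matching and moving can be sketched as a lookup by state ID. The sketch assumes, hypothetically, that a state's ID is a string naming the inputs it covers (for example "AB") and that both boxes live on the same machine, so moving is just sharing a reference.

```python
def match_and_move(old_states, new_state_ids):
    """Share matched states with the new box and report the unmatched ones.

    old_states: dict mapping state ID -> state contents in the old box.
    new_state_ids: IDs of the states required by the new box.
    """
    new_states = {}
    unmatched = []
    for state_id in new_state_ids:
        if state_id in old_states:
            # Matched: point at the existing state, no tuples are copied.
            new_states[state_id] = old_states[state_id]
        else:
            unmatched.append(state_id)
    return new_states, unmatched
```

For the boxes in the figure, the states on the four inputs match and S_BC and S_BCD are left over for recomputation.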
49 Unmatched States State Recomputing Recursively recompute unmatched S_BC and S_BCD by joining matched states Why always possible? Old and new boxes have same input queues The states associated with input queues always match Why necessary? [Figure: new box with joins AB, BC, CD and states S_A, S_B, S_C, S_BC, S_D, S_BCD over input queues Q_A, Q_B, Q_C, Q_D and output queue Q_ABCD]
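Recursive recomputation of unmatched states might look like the sketch below. It is an illustration under simplifying assumptions: tuples are (key, parts) pairs, two tuples join when their keys are equal, window constraints are ignored, and the `children` mapping describing the new plan is a hypothetical structure.

```python
def recompute_state(state_id, children, states):
    """Return states[state_id], rebuilding it from its child states if missing."""
    if state_id not in states:
        left_id, right_id = children[state_id]
        left = recompute_state(left_id, children, states)
        right = recompute_state(right_id, children, states)
        # Hypothetical join semantics: equal keys join, parts are concatenated.
        states[state_id] = [
            (left_key, left_parts + right_parts)
            for left_key, left_parts in left
            for right_key, right_parts in right
            if left_key == right_key
        ]
    return states[state_id]
```

The recursion always bottoms out because the states on the input queues are matched and therefore already present in `states`.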
50 MS Migration Pros and Cons Pros Fast when # of tuples in states is small Low input rates or small window size Cons Output silence during entire migration stage Can we output results even during migration? Motivation for Parallel Track Strategy
51 Parallel Track Strategy Basic idea Execute both old and new plans in parallel Gradually “push” old tuples out of old box by purging Key Steps Connect new box Execute both boxes in parallel Remove old box once “expired” Contains only new tuples No old tuples or sub-tuples
52 Parallel Track Strategy Connect boxes Execute in parallel Until all old tuples purged Disconnect old box [Figure: old box and new box connected in parallel to input queues Q_A, Q_B, Q_C, Q_D and output queue Q_ABCD; a tuple ABC held in state S_ABC of the old box]
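The test for when the old box can be disconnected can be sketched as a scan over its states. The data layout is an assumption made for illustration: each stored tuple is represented only by the arrival timestamps of its components, so a joined tuple ABC carries three timestamps.

```python
def old_box_expired(old_states, migration_start):
    """Return True once no state in the old box holds an old tuple or sub-tuple.

    old_states: dict mapping state name -> list of tuples, where each tuple is
    a sequence of component arrival timestamps (hypothetical representation).
    """
    return all(
        min(component_times) >= migration_start
        for tuples in old_states.values()
        for component_times in tuples
    )
```

A tuple counts as old if any of its components arrived before the migration started, which is why the minimum timestamp is the one compared.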
53 PT Migration Pros and Cons Pros Keeps producing results even during migration (MS migration produces no results) Cons Migration duration is at least 2W MS may be faster, depending on # of tuples in states
54 Summary : Stream Plan Migration First run-time solution for stateful operators Two migration methods: Moving State Strategy Parallel Track Strategy Cost Models and Experimental Evaluations What next ? Scope of optimization ? Support of other stateful operators ? Migration in distributed stream systems ?
55 Overall Summary : So Much Left to Do ! Large variety of challenging stream applications Generic core technology for stream processing engines Our central theme : Optimization via Adaptation Part I: Plan migration Part II: Plan distribution Part III: Plan-level spill Many open questions remain...
56 Thank You For Your Patience ! The End
57 Acknowledgments All the students (Ph.D., MS, and undergraduate) in the DSRG lab who have contributed to this research project directly or indirectly. Most notably: Luping Ding, Yali Zhu, Bin Liu, Tim Sutherland, Brad Pielech, Rimma Nehme, Mariana Jbantova, Brad Momberger, Venky Raghavan, Song Wang, Natasha Bogdanova, Mingzhu Wei, Ming Li, and others. To the National Science Foundation for partial support via IDM grants, to WPI for the RDC grant, and to IBM and NEC.