Stream Processing
Zachary G. Ives
University of Pennsylvania
CIS 650 – Database & Information Systems
March 30, 2005
2 Administrivia
- Thursday, L101, 3PM: Muthian Sivathanu, U. Wisc., Semantically Smart Disk Systems
- Next readings:
  - Monday – read and review the Madden paper
  - Wednesday – read and summarize the Brin and Page paper
3 Today’s Trivia Question
4 Data Stream Management
- Basic idea: static queries, dynamic data
- Applications:
  - Publish-subscribe systems
  - Stock tickers, news headlines
  - Data acquisition, e.g., from sensors, traffic monitoring, …
- The two main projects that are purely “stream processors”:
  - Stanford STREAM
  - MIT/Brown/Brandeis Aurora/Medusa
5 Summary from Last Time
- Streams are time-varying data series
  - STREAM maps them into timestamped sets (Aurora doesn’t seem to do this)
- Most operations on streams resemble normal DB queries:
  - Filtering, projection; grouping and aggregation; join
  - (Though the latter few are over windows)
- STREAM started with an SQL-like language called CQL
  - All stream operations go “through” relations
- Query plan operators have queues and synopses
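To make "aggregation over windows" and "synopses" concrete, here is a minimal sketch of a time-based sliding-window average. It is an illustration only, not STREAM's or Aurora's operator: the class name `SlidingWindowAvg`, the `width` parameter, and the assumption that timestamps arrive in non-decreasing order are all hypothetical. The deque plays the role of the operator's synopsis.

```python
from collections import deque


class SlidingWindowAvg:
    """Average over the tuples whose timestamp lies in (now - width, now]."""

    def __init__(self, width):
        self.width = width      # window width, in the same units as timestamps
        self.items = deque()    # synopsis: (timestamp, value) pairs in the window
        self.total = 0.0        # running sum of the values in the synopsis

    def insert(self, ts, value):
        """Add one tuple (assumes ts is non-decreasing) and return the new average."""
        self.items.append((ts, value))
        self.total += value
        # Expire tuples that have slid out of the window.
        while self.items[0][0] <= ts - self.width:
            _, old = self.items.popleft()
            self.total -= old
        return self.total / len(self.items)
```

Each arrival updates the answer incrementally, so the query is static while the data changes underneath it.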
6 Some Tricks for Performance
- Sharing synopses across multiple operators
  - In a few cases, more than one operator may join with the same synopsis
- Can exploit punctuations or “k-constraints”
  - Analogous to interesting orders
  - Referential integrity k-constraint: bound of k between arrival of “many” element and its corresponding “one” element
  - Ordered-arrival k-constraint: need window of at most k to sort
  - Clustered-arrival k-constraint: bound on distance between items with same grouping attributes
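The ordered-arrival k-constraint can be illustrated with a bounded buffer. The sketch below assumes one reading of the constraint: any element that arrives at least k+1 positions after an element s is no smaller than s. Under that assumption a min-heap holding at most k+1 elements is enough to emit the stream in sorted order. The function name `sort_k_ordered` is hypothetical.

```python
import heapq
from typing import Iterable, Iterator


def sort_k_ordered(stream: Iterable[int], k: int) -> Iterator[int]:
    """Emit a k-ordered stream in sorted order using O(k) memory."""
    heap = []
    for item in stream:
        heapq.heappush(heap, item)
        # Once the buffer holds k+1 elements, its minimum can no longer be
        # undercut by a later arrival, so it is safe to emit.
        if len(heap) > k:
            yield heapq.heappop(heap)
    # End of stream: flush whatever is still buffered, smallest first.
    while heap:
        yield heapq.heappop(heap)
```

For example, `list(sort_k_ordered([2, 1, 4, 3, 6, 5], 1))` yields `[1, 2, 3, 4, 5, 6]`. Without the constraint, sorting an unbounded stream would need unbounded state.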
7 Query Processing – “Chain Scheduling”
- Similar in many ways to eddies
- Combination of locally greedy and FIFO scheduling
- Apply operators to data as follows:
  - Assume we know how many tuples can be processed in a time unit
  - Cluster groups of operators into “chains” that maximize reduction in queue size per unit time (i.e., most selective operators per time unit)
  - Greedily forward tuples into the most selective chain
  - Within a chain, process the data in FIFO order
- STREAM also does a form of join reordering
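One way to read the chain construction: track, along an operator path, the cumulative processing time and the fraction of a tuple's size that remains, then group operators so that each group gives the steepest available drop in size per unit time. The sketch below follows that reading. It is an illustration under stated assumptions, not STREAM's code: the per-operator costs and selectivities are made-up inputs, costs are assumed positive, and `build_chains` and `pick_chain` are hypothetical names.

```python
from typing import List, Optional, Tuple

Chain = Tuple[List[int], float]  # (operator indices, size reduction per unit time)


def build_chains(ops: List[Tuple[float, float]]) -> List[Chain]:
    """ops[i] = (cost per tuple, selectivity); costs are assumed to be > 0."""
    # Cumulative time spent and remaining size after each operator in the path.
    times, sizes = [0.0], [1.0]
    for cost, selectivity in ops:
        times.append(times[-1] + cost)
        sizes.append(sizes[-1] * selectivity)

    chains: List[Chain] = []
    start = 0
    while start < len(ops):
        # Extend the chain to the end point with the steepest average descent.
        best_end, best_slope = start + 1, None
        for end in range(start + 1, len(ops) + 1):
            slope = (sizes[start] - sizes[end]) / (times[end] - times[start])
            if best_slope is None or slope > best_slope:
                best_end, best_slope = end, slope
        chains.append((list(range(start, best_end)), best_slope))
        start = best_end
    return chains


def pick_chain(chains: List[Chain], queue_lengths: List[int]) -> Optional[int]:
    """Greedy step: among chains with queued input, pick the steepest one."""
    ready = [i for i, n in enumerate(queue_lengths) if n > 0]
    if not ready:
        return None
    return max(ready, key=lambda i: chains[i][1])
```

With `ops = [(1, 0.9), (1, 0.1), (2, 0.5)]`, the first two operators form one chain because running them together removes far more data per time unit than the first operator alone; the third operator forms its own, much flatter chain. Tuples inside the chosen chain's queue would then be processed in FIFO order.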
8 Scratching the Surface: Approximation
- They point out two areas where we might need to approximate output:
  - CPU is limited, and we need to drop some stream elements according to some probabilistic metric
    - Collect statistics via a profiler
    - Use the Hoeffding inequality to derive a sampling rate in order to maintain a confidence interval
    - This is generally termed load shedding
  - May need to do similar things if memory usage is a constraint
- Are there other options? When might they be useful?
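As a rough illustration of how a sampling rate can fall out of the Hoeffding inequality: for values bounded in a range of width R, a sample mean over n values deviates from the true mean by more than ε with probability at most 2·exp(−2nε²/R²). Requiring that probability to be at most δ gives n ≥ R²·ln(2/δ)/(2ε²). The sketch below uses this textbook form for an average aggregate; it is not necessarily the derivation used in STREAM, and the function names and the per-window tuple count are assumptions.

```python
import math


def hoeffding_sample_size(value_range: float, epsilon: float, delta: float) -> int:
    """Samples needed so the sample mean is within epsilon of the true mean
    with probability at least 1 - delta, for values spanning value_range."""
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * epsilon ** 2))


def sampling_rate(window_size: int, value_range: float,
                  epsilon: float, delta: float) -> float:
    """Fraction of a window's tuples to keep; 1.0 means no shedding is possible."""
    needed = hoeffding_sample_size(value_range, epsilon, delta)
    return min(1.0, needed / window_size)
```

For values in [0, 1] with ε = 0.05 and δ = 0.05, 738 samples suffice, so a window of 10,000 tuples could shed all but about 7.4% of its input. Note that the bound does not depend on the window size, which is why shedding pays off most on high-rate streams.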
9 STREAM in General
- “Logical semantics first”
- Starts with a basic data model: streams as timestamped sets
- Develops a language and semantics
  - Heavily based on SQL
- Proposes a relatively straightforward implementation
  - Interesting ideas like k-constraints
  - Interesting approaches like chain scheduling
- No real consideration of distributed processing
10 Aurora
- “Implementation first; mix and match operations from past literature”
- Basic philosophy: most of the ideas in streams existed in previous research
  - Sliding windows, load shedding, approximation, …
  - So let’s borrow those ideas and focus on how to build a real system with them!
- Emphasis is on building a scalable, robust system
- Distributed implementation: Medusa
11 Queries in Aurora
- Oddly: no declarative query language!
- Queries are workflows of physical query operators (SQuAl)
- Many operators resemble relational algebra ops
12 Example Query
13 Some Interesting Aspects
- A relatively simple adaptive query optimizer
  - Can push filtering and mapping into many operators
  - Can reorder some operators (e.g., joins, unions)
- Need built-in error handling
  - If a data source fails to respond in a certain amount of time, create a special alarm tuple
  - This propagates through the query plan
- Incorporates built-in load shedding and real-time scheduling to support QoS
- Has a notion of combining a query over historical data with data from a stream
  - Switches from a pull-based mode (reading from disk) to a push-based mode (reading from network)
14 The Medusa Processor
- Distributed coordinator between many Aurora nodes
- Scalability through federation and distribution
- Fail-over
- Load balancing
15 Main Components
- Lookup
  - Distributed catalog – schemas, where to find streams, where to find queries
- Brain
  - Query setup, load monitoring via I/O queues and stats
  - Uses a load distribution and balancing scheme
  - Very reminiscent of Mariposa!
16 Load Balancing
- Migration – an operator can be moved from one node to another
  - Initial implementation didn’t support moving of state
  - The state is simply dropped, and operator processing resumes
  - Implications for semantics?
  - Plans to support state migration
- “Agoric system model to create incentives”
  - Clients pay nodes for processing queries
  - Nodes pay each other to handle load – pairwise contracts negotiated offline
  - Bounded-price mechanism – price for migration of load, spec for what a node will take on
- Does this address the weaknesses of the Mariposa model?
17 Some Applications They Tried
- Financial services (stock ticker)
  - Main issue is not volume, but problems with feeds
  - Two-level alarm system, where higher-level alarm helps diagnose problems
  - Shared computation among queries
  - User-defined aggregation and mapping
- Linear Road (sensor monitoring)
  - Traffic sensors on a toll road – change toll depending on how many cars are on the road
  - Combination of historical and continuous queries
- Environmental monitoring
  - Sliding-window calculations
18 The Big Application?
- Military battalion monitoring
  - Positions & images of friends and foes
  - Load shedding is important
  - Randomly drop data vs. semantic, predicate-based dropping to maintain QoS
    - Based on a QoS utility function
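The contrast between random and semantic load shedding can be sketched as follows. This is a simplified illustration, not Aurora's mechanism: the function names, the `drop_fraction` parameter, and the caller-supplied `utility` function standing in for a QoS utility function are all assumptions.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def random_shed(tuples: List[T], drop_fraction: float,
                rng: random.Random) -> List[T]:
    """Drop each tuple independently with probability drop_fraction."""
    return [t for t in tuples if rng.random() >= drop_fraction]


def semantic_shed(tuples: List[T], drop_fraction: float,
                  utility: Callable[[T], float]) -> List[T]:
    """Drop the lowest-utility tuples, keeping the survivors in arrival order."""
    keep = len(tuples) - int(len(tuples) * drop_fraction)
    # Rank positions by utility, highest first, and keep the top `keep`.
    ranked = sorted(range(len(tuples)),
                    key=lambda i: utility(tuples[i]), reverse=True)[:keep]
    return [tuples[i] for i in sorted(ranked)]
```

For instance, `semantic_shed([5, 1, 9, 3], 0.5, utility=lambda x: x)` keeps `[5, 9]`. Random shedding treats all tuples alike, whereas the semantic variant spends the limited capacity on the tuples the application values most, for example nearby foes rather than distant friends.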
19 Lessons Learned
- Historical data is important – not just stream data
  - (Summaries?)
- Sometimes need synchronization for consistency
  - “ACID for streams”?
- Streams can be out of order, bursty
  - “Stream cleaning”?
- Adaptors and XML are important
  - … But we already knew that!
- Performance is critical
  - They spent a great deal of time using microbenchmarks and optimizing
20 Borealis
- Aurora is now commercial
- Borealis follows up with some new directions:
  - Dynamic revision of results, i.e., corrections to stream data
  - Dynamic query modification – change on the fly
    - “Control lines”: change parameters
    - “Time travel”: support execution of multiple queries, starting from different points in time (past through future)
  - Distributed optimization
    - Combine stream and sensor processing ideas (we’ll talk about sensor nets next time)
    - Sensor-heavy vs. server-heavy optimization
21 Streams and Integration
- How do streams and data integration relate?
- Are streams the future, or just an interesting vista point on the side of the road?