Continuous Analytics Over Discontinuous Streams
  Sailesh Krishnamurthy, Michael Franklin, Jeff Davis, Daniel Farina, Pasha Golovko, Alan Li, Neil Thombre
  June 10, 2010
  SIGMOD, Indianapolis
Truviso
  Founded in 2005
  Roots in the TelegraphCQ project from UC Berkeley
  HQ in Foster City, CA
  Focus on “Continuous Analytics”
  Fortune 100 and web-based Big Data customers
Stream Query Processing (Traditional View)
  [Diagram labels: Source Data; Data Records / “Events”; CQ Processor; Real-Time Analysis; Update Display]
SQL Execution On Streaming Data
  A stream is an unbounded sequence of records
  A table is a set of records
  Window operators convert streams to tables
  SQL queries apply to tables
  Each window produces a set of records (a table)
  Semantics: Repeatedly apply generic SQL to the results of window operators
  Results are continuously appended to the output stream
  [Diagram label: Window Operator]
Example: SQL Queries over Streams
  Every 3 seconds, compute the revenue by advertiser based on impression data, over a 5 second “sliding window”
    SELECT I.Advertiser, SUM(I.price*I.volume)
    FROM Impressions I, Campaigns C
    WHERE I.campaign_id = C.campaign_id and C.type = 'CPM'
    GROUP BY I.Advertiser
  Window clause annotations:
    “I want to look at 5 seconds worth of impressions”
    “I want results every 3 seconds”
  [Diagram labels: Impression Data Stream; Window Operator; Window Clause; Result(s) …]
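The window semantics on this slide can be sketched in plain Python. This is a minimal illustration, not Truviso's engine: the sample rows, the `CAMPAIGNS` lookup, and the function names are hypothetical, and the "stream" is a finite list so that every window can be computed up front.

```python
from collections import defaultdict

# Hypothetical stand-ins for the Campaigns table and the Impressions stream.
CAMPAIGNS = {1: "CPM", 2: "CPC"}  # campaign_id -> type

IMPRESSIONS = [  # (timestamp_seconds, advertiser, campaign_id, price, volume)
    (0.5, "acme", 1, 2.0, 10),
    (2.0, "zeta", 1, 1.0, 5),
    (3.5, "acme", 2, 9.0, 1),  # CPC campaign, filtered out below
    (4.0, "acme", 1, 3.0, 2),
    (7.0, "zeta", 1, 4.0, 1),
]


def sliding_windows(records, visible, advance, end_time):
    """Yield (close_time, rows): a window `visible` seconds wide every `advance` seconds."""
    close = advance
    while close <= end_time:
        # Each window is a table: the rows whose timestamp is in [close - visible, close).
        rows = [r for r in records if close - visible <= r[0] < close]
        yield close, rows
        close += advance


def revenue_by_advertiser(rows):
    """Join against CAMPAIGNS, keep CPM rows, and sum price * volume per advertiser."""
    totals = defaultdict(float)
    for _ts, advertiser, campaign_id, price, volume in rows:
        if CAMPAIGNS.get(campaign_id) == "CPM":
            totals[advertiser] += price * volume
    return dict(totals)


# The same query is applied to every window; results are appended in order.
results = [
    (close, revenue_by_advertiser(rows))
    for close, rows in sliding_windows(IMPRESSIONS, visible=5, advance=3, end_time=9)
]
```

With a 5-second window advancing every 3 seconds, the impression at t=4.0 falls into both the window closing at 6 and the one closing at 9, which is what makes the window "sliding".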
Assumptions About Streams
  Continuous sequences
  Arriving mostly in order
The Reality
  Late-arriving data: minutes, hours, days
  Multiple streams out of sync, with gaps, …
Traditional (in Order) Solution #1: “Slack”
  [Animation: columns Tuple #, Time Stamp, 3-Second Slack Buffer, OUTPUT]
Slack
  Pros
    Simple
    Handles “jitter” (slightly out-of-order arrival)
  Cons
    Introduces delay
    Permanently drops arrivals later than buffer
    Unbounded buffer size
    Permanently drops arrivals when there are lulls in multiple input streams
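A slack buffer of the kind described above might look like the following sketch. It is an assumption-laden illustration rather than the system's actual operator: `slack_reorder` is a hypothetical name, tuples are reduced to bare timestamps, and the buffer is flushed at end of input.

```python
import heapq


def slack_reorder(arrivals, slack):
    """Reorder timestamps through a slack buffer; return (emitted, dropped)."""
    buffer = []  # min-heap of buffered timestamps
    emitted, dropped = [], []
    high_water = float("-inf")  # largest timestamp seen so far
    released = float("-inf")    # largest timestamp already emitted
    for ts in arrivals:
        if ts < released:
            # Too late: output has already moved past this timestamp.
            dropped.append(ts)
            continue
        heapq.heappush(buffer, ts)
        high_water = max(high_water, ts)
        # Release everything at least `slack` behind the newest arrival.
        while buffer and buffer[0] <= high_water - slack:
            released = heapq.heappop(buffer)
            emitted.append(released)
    while buffer:  # end of input: flush what is left, still in order
        emitted.append(heapq.heappop(buffer))
    return emitted, dropped


emitted, dropped = slack_reorder([1, 2, 4, 3, 5, 9, 2, 10], slack=3)
```

In this run the slightly late 3 is repaired, but the second 2 arrives after timestamp 5 has been released and is dropped for good. Every tuple also waits until something `slack` seconds newer shows up, which is the delay listed under Cons.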
Traditional (in Order) Solution #2: “Drift”
  [Animation: columns Source 1, Source 2, 2-Second Drift Buffer, OUTPUT; tuples are (value, timestamp) pairs such as (A,1) and (a,2)]
Drift
  Pros
    Simple
    Handles multiple streams with short “lulls” in arrival
  Cons
    Doesn’t handle streams with dramatically different arrival rates
    Permanently drops data that arrives after drift window has expired
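One plausible reading of drift, sketched below, is a merge of several individually ordered sources in which a lagging source can hold back output by at most `drift` seconds. The watermark rule, the `drift_merge` name, and the sample tuples are assumptions made for illustration; the slides do not spell out the exact rule.

```python
import heapq


def drift_merge(arrivals, sources, drift):
    """Merge per-source ordered streams; return (emitted, dropped).

    arrivals: (source, timestamp, payload) tuples in arrival order.
    """
    latest = {s: float("-inf") for s in sources}  # newest timestamp per source
    buffer = []  # min-heap of (timestamp, arrival_seq, payload)
    emitted, dropped = [], []
    watermark = float("-inf")
    for seq, (source, ts, payload) in enumerate(arrivals):
        if ts < watermark:
            # The drift window for this timestamp has expired.
            dropped.append((ts, payload))
            continue
        latest[source] = max(latest[source], ts)
        heapq.heappush(buffer, (ts, seq, payload))
        # Wait for the slowest source, but never more than `drift` behind the fastest.
        candidate = max(min(latest.values()), max(latest.values()) - drift)
        watermark = max(watermark, candidate)
        while buffer and buffer[0][0] <= watermark:
            out_ts, _seq, out_payload = heapq.heappop(buffer)
            emitted.append((out_ts, out_payload))
    while buffer:  # end of input: flush the remainder in timestamp order
        out_ts, _seq, out_payload = heapq.heappop(buffer)
        emitted.append((out_ts, out_payload))
    return emitted, dropped


arrivals = [
    ("s1", 1, "A"), ("s2", 2, "a"), ("s1", 2, "B"), ("s1", 3, "C"),
    ("s1", 6, "D"), ("s2", 3, "b"), ("s2", 7, "c"),
]
emitted, dropped = drift_merge(arrivals, sources=["s1", "s2"], drift=2)
```

Here source `s2` falls more than two seconds behind `s1`, so its tuple `(3, "b")` is dropped. That matches the second Con above: streams with very different arrival rates lose data.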
Traditional Solution #3: Order-agnostic Operators
  Slack and Drift aim to order streams before presenting them to order-sensitive operators
  Many operators don’t care about order
    SELECT count(*), cq_close(*) ts FROM S
Out of Order Processing: Count Example
  [Animation: columns Tuple #, Time Stamp, Heartbeat, Count State, OUTPUT; outputs shown: (4,t=5), (3,t=10)]
Order-agnostic Operators
  Pros
    No buffering
    No extra delays
    Handles out-of-order tuples that make it before the heartbeat
  Cons
    Some operators do care about order
    Permanently drops data that arrives after the heartbeat
  Note: Lost data also impacts bigger “roll-up” queries, e.g. with sharing
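The trade-off above can be made concrete with a small count operator driven by heartbeats. This is a hedged sketch: the window width, the `("data", ts)` / `("heartbeat", ts)` event encoding, and the function name are assumptions, and empty windows are simply not reported.

```python
from collections import defaultdict


def count_with_heartbeats(events, width):
    """Count tuples per window; return (outputs, dropped).

    events: ("data", ts) or ("heartbeat", ts) in arrival order, integer ts.
    outputs: (count, window_close_time) pairs, emitted when a heartbeat closes a window.
    """
    counts = defaultdict(int)  # window close time -> running count
    closed_upto = 0            # windows closing at or before this are final
    outputs, dropped = [], []
    for kind, ts in events:
        if kind == "data":
            close = (ts // width + 1) * width  # window [close - width, close)
            if close <= closed_upto:
                dropped.append(ts)  # its window was already emitted
            else:
                counts[close] += 1  # arrival order inside the window is irrelevant
        else:
            # Heartbeat: no more data before ts is expected, so close those windows.
            for close in sorted(c for c in counts if c <= ts):
                outputs.append((counts.pop(close), close))
            closed_upto = max(closed_upto, (ts // width) * width)
    return outputs, dropped


events = [
    ("data", 1), ("data", 3), ("data", 2), ("data", 4), ("heartbeat", 5),
    ("data", 6), ("data", 9), ("data", 2), ("data", 7), ("heartbeat", 10),
]
outputs, dropped = count_with_heartbeats(events, width=5)
```

The first window tolerates the disorder 1, 3, 2, 4 with no buffering, but the 2 that arrives after the heartbeat at 5 is lost, so the emitted count for that window stays one short.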
So, how to handle very late data and discontinuous streams?
“Stream-Relational” Architecture [CIDR 09]
  [Diagram labels: Integration Framework; Shared Stream Query Processor; Persistent Data Store (Raw Data, Aggregates); SQL Interface]
  [Connected to: JDBC / JMS; XML; Flat files; ETL tools; SOAP APIs; Data Warehouse; App Logic / UDFs; Other TruCQs]
Order-Independent Processing: Overview
  Answers that have already been delivered can only be compensated
  Need to preserve all arriving data
  Queries return answers based on all relevant data that has arrived:
    CQs: Continuous Queries
    SQs: SQL queries on archived streams & answers
  Approach: Leverage benefits of SQL(!):
    Data-parallel processing w/ on-demand consolidation
    Powerful “View” mechanisms
  Basically, create parallel partitions for late data
  Rewrite queries as views over partial results
Out of Order Processing: Count Example
  [Animation: columns Tuple #, Data TS, Control, Count State Partitions, OUTPUT; control tuples include flush-2 and flush-3; outputs shown: (6,t=5), (3,t=10), (2,t=5), (1,t=15), (2,t=10), (2,t=5)]
Out of Order Processing: Count Example
  OUTPUT: (6,t=5) (3,t=10) (2,t=5) (1,t=15) (2,t=10) (2,t=5)
  Treat output as “Partial State Records” (PSRs)
  Rewrite queries using views over PSRs, i.e., consolidate on demand
  Paper goes into substantial detail on how rewrites work
  Same answer as Order-Insensitive as roll-up
  Answer contains all data
  Subsequent SQs over archived results and raw data contain all data too!
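The partial-state idea can be sketched as follows. Late tuples go into a fresh partial count instead of being dropped, each flush appends partial state records, and a consolidating "view" sums them on demand. The function names and event encoding are hypothetical, `t` is read as the window close time, and in the real system the consolidation is a rewritten SQL view rather than Python.

```python
from collections import defaultdict


def count_with_partitions(events, width):
    """Return PSRs as (partial_count, window_close_time) pairs; nothing is dropped.

    events: ("data", ts) or ("flush", ts) in arrival order, integer ts.
    """
    open_counts = defaultdict(int)  # window close time -> count in the open partition
    psrs = []
    for kind, ts in events:
        if kind == "data":
            close = (ts // width + 1) * width
            # Late or not, the tuple lands in a partial count for its window.
            open_counts[close] += 1
        else:
            # Control tuple: emit every partial count for windows closing by ts.
            for close in sorted(c for c in open_counts if c <= ts):
                psrs.append((open_counts.pop(close), close))
    return psrs


def consolidate(psrs):
    """On-demand consolidation: sum the partial counts per window."""
    totals = defaultdict(int)
    for count, close in psrs:
        totals[close] += count
    return dict(sorted(totals.items()))


events = [
    ("data", 1), ("data", 2), ("data", 3), ("flush", 5),
    ("data", 6), ("data", 4), ("data", 7), ("flush", 10),
]
psrs = count_with_partitions(events, width=5)

# The PSRs shown on the slide, consolidated the same way.
slide_psrs = [(6, 5), (3, 10), (2, 5), (1, 15), (2, 10), (2, 5)]
slide_totals = consolidate(slide_psrs)
```

Applied to the slide's output, consolidation gives 10 for t=5, 5 for t=10, and 1 for t=15, so the late tuples are reflected in the final answer. Because summing partial counts is order-independent, the same consolidation works whether the partial records come from late data or from parallel workers.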
Handles Very Late Data, Plus You Get…
  Parallel Processing – Multicore and Cluster
  [Diagram labels: Client; High-bandwidth Network Interconnect; D = Distributed Processing Node; U = Unified Processing Node]
Other Details in the Paper
  Beyond late data and parallelism, the approach is also key to supporting:
    Fault Tolerance using replication
    High-Availability via fast restart
    “Nostalgic” continuous queries that start in the past and catch up to the present
    Fast concurrent creation of archives for new CQs
  Algorithmic/Systems details on:
    Integration with overall system architecture
    Interaction with Transaction Mechanism
    Need for Background Reducer task
    Hybrid Plans for non-parallelizable parts of queries
Conclusions
  Early Stream Processing Systems were based on simplistic assumptions about ordering
  Truviso’s 3.2 engine incorporates a new mechanism so no data is permanently dropped
  Approach leverages strengths of SQL:
    Data-parallel processing models
    Sophisticated and efficient view functionality
  Key is On-Demand Consolidation
  Of course, you can only do it if you have an integrated stream-relational system
  For more info: or