Query Processing, Resource Management, and Approximation in a Data Stream Management System
Jennifer Widom
Stanford University
stanfordstreamdatamanager

Formula for a Database Research Project
Pick a simple but fundamental assumption underlying traditional database systems
  –Drop it
Reconsider all aspects of data management and query processing
  –Many Ph.D. theses
  –Prototype from scratch

Following the Formula
We followed this formula once before
  –The LORE project
  –Dropped assumption: Data has a fixed schema declared in advance
  –Semistructured data
The STREAM Project
  –Dropped assumption: First load data, then index it, then run queries
  –Continuous data streams (+ continuous queries)

Data Streams
Continuous, unbounded, rapid, time-varying streams of data elements
Occur in a variety of modern applications
  –Network monitoring and traffic engineering
  –Sensor networks
  –Telecom call records
  –Financial applications
  –Web logs and click-streams
  –Manufacturing processes
DSMS = Data Stream Management System

DBMS versus DSMS
DBMS | DSMS
Persistent relations | Transient streams (and persistent relations)
One-time queries | Continuous queries
Random access | Sequential access
Access plan determined by query processor and physical DB design | Unpredictable data characteristics and arrival patterns

The STREAM System
Data streams and stored relations
Declarative language for registering continuous queries
Flexible query plans
Designed to cope with high data rates and query workloads
  –Graceful approximation when needed
  –Careful resource allocation and usage
Relational, centralized (for now)

Contributions to Date
Semantics for continuous queries
Query plans
Exploiting stream constraints
Operator scheduling
Approximation techniques
Resource allocation to maximize precision
Initial running prototype

The (Simplified) Big Picture
[Figure: architecture diagram. Labels: Input streams, Register Query, DSMS, Scratch Store, Streamed Result, Stored Result, Archive, Stored Relations]

(Simplified) Network Monitoring
[Figure: the same architecture diagram for network monitoring. Labels: Network measurements, Packet traces, Register Monitoring Queries, DSMS, Scratch Store, Intrusion Warnings, Online Performance Metrics, Archive, Lookup Tables]

Using Conventional DBMS
Data streams as relation inserts, continuous queries as triggers or materialized views
Problems with this approach
  –Inserts are typically batched, high overhead
  –Expressiveness: simple conditions (triggers), no built-in notion of sequence (views)
  –No notion of approximation, resource allocation
  –Current systems don't scale to large # of triggers
  –Views don't provide streamed results
But we (and others) plan to compare

Declarative Language for Continuous Queries
A distinction between STREAM and the Aurora project
  –Aurora users directly manipulate one large execution plan
  –STREAM compiles declarative queries into individual plans, system may merge plans
  –STREAM also supports direct entry of plans
Syntax based on SQL, additional constructs for sliding windows and sampling

Example Query 1
Two streams, contrived for ease of examples:
  Orders (orderID, customer, cost)
  Fulfillments (orderID, clerk)
Total cost of orders fulfilled over the last day by clerk "Sue" for customer "Joe":
  Select Sum(O.cost)
  From Orders O, Fulfillments F [Range 1 Day]
  Where O.orderID = F.orderID And F.clerk = "Sue" And O.customer = "Joe"

Example Query 2
Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost:
  Select F.clerk, Max(O.cost)
  From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample
  Where O.orderID = F.orderID
  Group By F.clerk

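The partitioned window in Example Query 2 can be illustrated with a minimal Python sketch. It is not the STREAM implementation: the function names are hypothetical, the 10% sampling step is left out so the result is deterministic, and Orders is treated as an in-memory dictionary standing in for the whole Orders stream seen so far.

```python
from collections import defaultdict, deque


def partitioned_rows_window(stream, key, n):
    """Contents of S [Partition By key Rows n] after reading `stream` in order.

    Each partition keeps only its n most recent tuples.
    """
    parts = defaultdict(lambda: deque(maxlen=n))
    for tup in stream:
        parts[key(tup)].append(tup)
    return {k: list(q) for k, q in parts.items()}


def max_cost_per_clerk(orders, fulfillments, n=5):
    """Join the windowed fulfillments with orders and take Max(cost) per clerk.

    orders: dict orderID -> cost (assumed stand-in for the Orders stream).
    fulfillments: list of (orderID, clerk) in arrival order.
    """
    window = partitioned_rows_window(fulfillments, key=lambda f: f[1], n=n)
    result = {}
    for clerk, rows in window.items():
        costs = [orders[oid] for oid, _ in rows if oid in orders]
        if costs:
            result[clerk] = max(costs)
    return result


orders = {1: 10, 2: 50, 3: 20, 4: 5}
fulfillments = [(2, "Sue"), (1, "Sue"), (3, "Sue"), (4, "Bob")]
# With n=2, Sue's oldest fulfillment (order 2, cost 50) has left her window.
print(max_cost_per_clerk(orders, fulfillments, n=2))
```

With a window of 2 rows per clerk, the expensive order 2 no longer contributes to Sue's maximum, which is the behavior the partitioned window is meant to give.
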
Semantics of Database Languages
An often neglected topic
Traditional relational databases are in reasonable shape
  –Relational algebra, SQL
But triggers were a mess
The semantics of an innocent-looking continuous query over data streams may not be obvious

A Nonobvious Continuous Query
Stream of stock quotes: Stocks(ticker, price)
Monitor last 10 minutes of quotes:
  Select * From Stocks [Range 10 minutes]
Is result a relation, a stream, or something else?
If a relation, what exactly does it contain?
If a stream, how does query differ from:
  Select * From Stocks [Range 1 minute]
or
  Select * From Stocks [∞]

Our Semantics and Language for Continuous Queries
Abstract: interpretation for CQs based on certain "black boxes"
Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences
Goals
  –CQs over multiple streams and relations
  –Exploit relational semantics to the extent possible
  –Easy queries should be easy to write, simple queries should do what you expect

Relations and Streams
Assume global, discrete, ordered time domain (more on this later)
Relation
  –Maps time T to set-of-tuples R
Stream
  –Set of (tuple, timestamp) elements

Conversions
Streams → Relations: window specification
Relations → Streams: special operators Istream, Dstream, Rstream
Relations → Relations: any relational query language

Conversion Definitions
Stream-to-relation
  –S [W] is a relation: at time T it contains all tuples in window W applied to stream S up to T
  –When W = ∞, contains all tuples in stream S up to T
Relation-to-stream
  –Istream(R) contains all (r, T) where r ∈ R at time T but r ∉ R at time T–1
  –Dstream(R) contains all (r, T) where r ∈ R at time T–1 but r ∉ R at time T
  –Rstream(R) contains all (r, T) where r ∈ R at time T

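The conversion definitions can be made concrete with a short Python sketch. It assumes integer timestamps, models a relation as a function from time T to a set of tuples, and models a stream as a list of (tuple, timestamp) pairs; the function names are illustrative and are not CQL syntax.

```python
def rows_window(stream, n, t):
    """S [Rows n] at time t: the n most recent tuples with timestamp <= t."""
    seen = [tup for tup, ts in sorted(stream, key=lambda e: e[1]) if ts <= t]
    return set(seen[-n:]) if n > 0 else set()


def unbounded_window(stream, t):
    """S [∞] at time t: every tuple of the stream with timestamp <= t."""
    return {tup for tup, ts in stream if ts <= t}


def istream(rel, times):
    """(r, T) for each r in R at time T that was not in R at time T-1."""
    return [(r, t) for t in times for r in sorted(rel(t) - rel(t - 1))]


def dstream(rel, times):
    """(r, T) for each r in R at time T-1 that is no longer in R at time T."""
    return [(r, t) for t in times for r in sorted(rel(t - 1) - rel(t))]


def rstream(rel, times):
    """(r, T) for every r in R at time T."""
    return [(r, t) for t in times for r in sorted(rel(t))]


S = [("a", 1), ("b", 2), ("c", 3)]


def last_two(t):
    # The relation S [Rows 2], viewed as a function of time.
    return rows_window(S, 2, t)


print(istream(last_two, [1, 2, 3]))  # [('a', 1), ('b', 2), ('c', 3)]
print(dstream(last_two, [1, 2, 3]))  # [('a', 3)]
print(rstream(last_two, [1, 2, 3]))
```

Here "a" falls out of the two-row window at time 3, so it is the only element of the Dstream, while the Istream simply reproduces the input stream.
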
Abstract Semantics
Take any relational query language
Can reference streams in place of relations
  –But must convert to relations using any window specification language (default window = [∞])
Can convert relations to streams
  –For streamed results
  –For windows over relations (note: converts back to relation)

Query Result at Time T
Use all relations at time T
Use all streams up to T, converted to relations
Compute relational result
Convert result to streams if desired

Time
Easiest: global system clock
  –Stream elements and relation updates timestamped on entry to system
Application-defined time
  –Streams and relation updates contain application timestamps, may be out of order
  –Application generates "heartbeat"
  –Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress
  –Query results in application time

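One way to picture the role of a heartbeat is the following Python sketch. It assumes a heartbeat with value tau is a promise that no later-arriving element carries a timestamp at or below tau; the class and method names are hypothetical and do not come from the STREAM prototype.

```python
import heapq


class HeartbeatBuffer:
    """Holds out-of-order elements until a heartbeat makes them safe to emit."""

    def __init__(self):
        self._heap = []  # entries are (timestamp, arrival sequence, tuple)
        self._seq = 0    # tie-breaker so equal timestamps keep arrival order

    def add(self, timestamp, tup):
        heapq.heappush(self._heap, (timestamp, self._seq, tup))
        self._seq += 1

    def heartbeat(self, tau):
        """Release, in timestamp order, every element with timestamp <= tau."""
        out = []
        while self._heap and self._heap[0][0] <= tau:
            ts, _, tup = heapq.heappop(self._heap)
            out.append((tup, ts))
        return out


buf = HeartbeatBuffer()
buf.add(3, "c")
buf.add(1, "a")
buf.add(2, "b")
print(buf.heartbeat(2))   # [('a', 1), ('b', 2)]
buf.add(5, "e")
print(buf.heartbeat(10))  # [('c', 3), ('e', 5)]
```

Elements are held back until the heartbeat passes their timestamp, so downstream operators see application time advance monotonically even though arrivals were scrambled.
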
Abstract Semantics – Example 1
  Select F.clerk, Max(O.cost)
  From O [∞], F [Rows 1000]
  Where O.orderID = F.orderID
  Group By F.clerk
Maximum-cost order fulfilled by each clerk in last 1000 fulfillments
At time T: entire stream O and last 1000 tuples of F as relations
Evaluate query, update result relation at T

Abstract Semantics – Example 1 (streamed result)
  Select Istream(F.clerk, Max(O.cost))
  From O [∞], F [Rows 1000]
  Where O.orderID = F.orderID
  Group By F.clerk
At time T: entire stream O and last 1000 tuples of F as relations
Evaluate query, update result relation at T
Streamed result: new element (<clerk, max>, T) whenever <clerk, max> changes from T–1

Abstract Semantics – Example 2
Relation CurPrice(stock, price)
  Select stock, Avg(price)
  From Istream(CurPrice) [Range 1 Day]
  Group By stock
Average price over last day for each stock
Istream provides history of CurPrice
Window on history, back to relation, group and aggregate

Concrete Language – CQL
Relational query language: SQL
Window spec. language derived from SQL-99
  –Tuple-based, time-based, partitioned
Syntactic shortcuts and defaults
  –So easy queries are easy to write and simple queries do what you expect
Equivalences
  –Basis for query-rewrite optimizations
  –Includes all relational equivalences, plus new stream-based ones

Two Extremely Simple CQL Examples
  Select * From Strm
Had better return Strm (It does)
  –Default [∞] window for Strm
  –Default Istream for result
  Select * From Strm, Rel Where Strm.A = Rel.B
Often want "NOW" window for Strm
But may not want as default

Query Execution
When a continuous query is registered, generate a query plan
  –Users can also register plans directly
Plans composed of three main components:
  –Operators (as in most conventional DBMS's)
  –Inter-operator Queues (as in many conventional DBMS's)
  –State (synopses)
Global scheduler for plan execution

Operators and State
State (synopses)
  –Summarize tuples seen so far (exact or approximate) for operators requiring history
  –To implement windows
Example: synopsis join
  –Sliding-window join
  –Approximation of full join
[Figure: join operator (⋈) with synopses State 1 and State 2]

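A sliding-window join with one synopsis per input can be sketched in a few lines of Python. This is an illustration only: it assumes tuple-based (row-count) windows and an equality join on a single key, and the class and parameter names are hypothetical.

```python
from collections import deque


class WindowJoin:
    """Symmetric join keeping the most recent rows of each input as its state."""

    def __init__(self, rows_left, rows_right):
        # state[0] and state[1] play the role of State 1 and State 2.
        self.state = (deque(maxlen=rows_left), deque(maxlen=rows_right))

    def insert(self, side, key, payload):
        """Process one arrival; side 0 is the left input, side 1 the right.

        Returns the (left payload, right payload) pairs produced by this arrival.
        """
        other = self.state[1 - side]
        matches = [p for k, p in other if k == key]
        if side == 0:
            out = [(payload, m) for m in matches]
        else:
            out = [(m, payload) for m in matches]
        # Appending may silently evict the oldest row of this side's window.
        self.state[side].append((key, payload))
        return out


join = WindowJoin(rows_left=2, rows_right=2)
join.insert(0, 1, "L1")
print(join.insert(1, 1, "R1"))   # [('L1', 'R1')]
join.insert(0, 2, "L2")
join.insert(0, 3, "L3")          # evicts L1 from the left window
print(join.insert(1, 1, "R1b"))  # []
```

The last call returns nothing even though a full join would pair L1 with R1b, which is the sense in which the bounded synopsis approximates the full join.
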
Simple Query Plan
[Figure: query plan diagram. Labels: Stream 1, Stream 2, Stream 3; two join operators (⋈); State 1, State 2, State 3, State 4; Q1, Q2; Scheduler]

Some Issues in Query Plan Generation
Compatibility and conversions for streams and relations (+/- streams)
State sharing, incremental computation
Windowed joins: Multiway versus 2-way
Windows in general: push down, pull up, split, merge, …
Time coordination, operator-level heartbeats

Memory Overhead in Query Processing
Queues + State
Continuous queries keep state indefinitely
Online requirements suggest using memory rather than disk
  –But we realize this assumption is shaky
Goal: minimize memory use while providing timely, accurate answers

Reducing Memory Overhead
Two main techniques to date
1) Exploit constraints on streams to reduce state
2) Clever operator scheduling to reduce queue sizes

Exploiting Stream Constraints
For most queries, unbounded memory is required for arbitrary streams [PODS ’01]
But streams may exhibit constraints that reduce, bound, or even eliminate state
Conventional database constraints
  –Redefined for streams
  –"Relaxed" for stream environment

Stream Constraints
Each constraint type defines adherence parameter k
Clustered(k) for attribute S.A
Ordered(k) for attribute S.A
Referential-Integrity(k) for join S1 ⋈ S2

Algorithm for Exploiting Constraints
Input
  –Any Select-Project-Join query over streams
  –Any set of k-constraints
Output
  –Query execution plan that reduces or eliminates state based on k-constraints
  –If constraints violated, get approximate result

Constraint Examples
Orders (orderID, cost)
Fulfillments (orderID, portion, clerk)
Query: Many-one join F ⋈ O
Clustered(k) on F.orderID
  –Matched O tuples discarded after k arrivals of non-matching F's
Referential-Integrity(k)
  –F tuples retained for at most k arrivals of O tuples

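The Clustered(k) discard rule can be sketched in Python as follows. The sketch rests on an assumption about the constraint that the slides do not spell out: at most k Fulfillments tuples with other orderID values separate two Fulfillments tuples with the same orderID, so a matched Orders tuple is dropped once more than k non-matching fulfillments have arrived since its last match. Class and method names are hypothetical, and every order is assumed to arrive before its fulfillments.

```python
class ClusteredJoinState:
    """State for the many-one join F ⋈ O under an assumed Clustered(k) on F.orderID."""

    def __init__(self, k):
        self.k = k
        self.orders = {}  # orderID -> cost: O tuples still held in state
        self.misses = {}  # orderID -> non-matching F arrivals since last match

    def on_order(self, order_id, cost):
        self.orders[order_id] = cost

    def on_fulfillment(self, order_id, portion, clerk):
        """Join one F tuple, then discard O tuples that can no longer match."""
        out = []
        if order_id in self.orders:
            out.append((order_id, self.orders[order_id], portion, clerk))
            self.misses[order_id] = 0
        for oid in list(self.misses):
            if oid != order_id:
                self.misses[oid] += 1
                if self.misses[oid] > self.k:
                    # Under the assumed constraint, no further F tuple
                    # for this order can arrive, so its state is dropped.
                    del self.misses[oid]
                    del self.orders[oid]
        return out


state = ClusteredJoinState(k=1)
state.on_order(1, 100)
state.on_order(2, 200)
state.on_fulfillment(1, 0.5, "Sue")
state.on_fulfillment(2, 1.0, "Bob")
state.on_fulfillment(3, 1.0, "Bob")  # second non-matching arrival for order 1
print(sorted(state.orders))          # order 1 has been discarded
```

If the stream violates the constraint, a late fulfillment for a discarded order finds no partner and the join result becomes approximate, matching the behavior described for violated k-constraints.
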
Operator Scheduling
Global scheduler invokes run method of query plan operators with "timeslice" parameter
Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, …
  –First scheduler: round-robin
  –Second scheduler: minimize queue sizes
  –Third scheduler: minimize combination of queue sizes and latency

Scheduling Algorithm
Goal: minimize total queue size for unpredictable, bursty stream arrival patterns
Proven: within constant factor of any "clairvoyant" strategy for some queries
Empirical results: large savings over naive strategies for many queries
But minimizing queue sizes is at odds with minimizing latency

Scheduling Algorithm (contd.)
Plan progress chart: memory delta per unit time
Independent scheduling of operators makes mistakes
Consider chains of operators
Always schedule ready chain with steepest downslope
[Figure: progress chart plotting Memory against Computation time for operators Op1, Op2, Op3, Op4]

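The chain idea can be sketched in Python under one plausible reading of "steepest downslope": operators along a path are grouped by walking the lower envelope of the progress chart, and the scheduler runs the ready chain whose envelope segment falls fastest. The chart values, function names, and queue representation below are illustrative assumptions, not the STREAM scheduler's code.

```python
def build_chains(chart):
    """Group operators into chains along the lower envelope of a progress chart.

    chart[i] is (cumulative computation time, memory per tuple) before operator i
    runs; the last point is the state after the final operator.
    """
    chains = []
    i = 0
    while i < len(chart) - 1:
        t0, s0 = chart[i]
        best_j = i + 1
        best_slope = (chart[best_j][1] - s0) / (chart[best_j][0] - t0)
        for j in range(i + 2, len(chart)):
            slope = (chart[j][1] - s0) / (chart[j][0] - t0)
            if slope < best_slope:  # more negative means a steeper downslope
                best_j, best_slope = j, slope
        chains.append({"ops": list(range(i, best_j)), "slope": best_slope})
        i = best_j
    return chains


def pick_chain(chains, queue_lengths):
    """Pick the ready chain (non-empty input queue) with the steepest downslope."""
    ready = [c for c in chains if queue_lengths[c["ops"][0]] > 0]
    return min(ready, key=lambda c: c["slope"]) if ready else None


# Four operators, one time unit each; memory per tuple after each operator.
chart = [(0, 1.0), (1, 0.9), (2, 0.2), (3, 0.2), (4, 0.0)]
chains = build_chains(chart)
print([c["ops"] for c in chains])              # [[0, 1], [2, 3]]
print(pick_chain(chains, [1, 0, 3, 0])["ops"])  # [0, 1]
```

Operator 0 alone frees little memory, but running it together with operator 1 frees memory quickly, which is why scheduling operators independently would undervalue it.
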
Approximation
Why approximate?
  –Memory requirement too high, even with constraints and clever scheduling
  –Can't process streams fast enough for query load

Approximation (cont'd)
Static: rewrite queries to add (or shrink) sampling or windows
  +User can participate, predictable behavior
  –Doesn't consider dynamic conditions
Dynamic: modify query plan – insert sampling operators, shrink windows, load shedding
  +Adapts to current resource availability (major open issue)
  –How to convey to user? (major open issue)

Resource Allocation to Maximize Precision
Query plan: tree of operators
Each operator has function from resource allocation R to precision P
Simple (FP, FN) precision model
  –FP = probability of answer being false positive
  –FN = # false negatives per correct answer

Precision of an Operator
[Figure: join operator (⋈) over State 1 and State 2; its Approx. Answer is compared with the Exact Answer]
Precision: FP = False Positives, FN = False Negatives

Precision of Two Operators
[Figure: two join operators (⋈) over State 1, State 2, and State 3; each operator's Approx. Answer is compared with its Exact Answer, giving precisions (FP1, FN1) and (FP2, FN2)]

The Problem Statement
Given a plan, precision function for each operator in the plan, and R total resources, allocate R to operators to maximize result precision
Solution
  –For each operator type: formula for calculating output precision given input precision and operator resource allocation
  –Assume precision for input streams
  –Becomes optimization problem

The Holy Grail
Given:
  –Declarative query
  –Resources
  –Constraints on streams
Generate plan and resource allocation that takes advantage of constraints and maximizes precision
Do it for multiple (weighted) queries, dynamically and adaptively, and convey what's happening to the user

The Stream Systems Landscape
(At least) three general-purpose DSMS prototypes underway
  –STREAM (Stanford)
  –Aurora (Brown, Brandeis, MIT)
  –TelegraphCQ (Berkeley)
All will be demo'd at SIGMOD ’03
Cooperating to develop stream system benchmark
Goal: demonstrate that conventional systems are far inferior for data stream applications

Contributors: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Justin Rosenstein, Rohit Varma