Download presentation
Presentation is loading. Please wait.
Published byRebecca Chandler Modified over 9 years ago
1
Query Processing, Resource Management, and Approximation in a Data Stream Management System Jennifer Widom Stanford University stanfordstreamdatamanager
2
2 Formula for a Database Research Project Pick a simple but fundamental assumption underlying traditional database systems –Drop it Reconsider all aspects of data management and query processing –Many Ph.D. theses –Prototype from scratch
3
stanfordstreamdatamanager 3 Following the Formula We followed this formula once before –The LORE project –Dropped assumption: Data has a fixed schema declared in advance –Semistructured data The STREAM Project –Dropped assumption: First load data, then index it, then run queries –Continuous data streams (+ continuous queries)
4
stanfordstreamdatamanager 4 Data Streams Continuous, unbounded, rapid, time-varying streams of data elements Occur in a variety of modern applications –Network monitoring and traffic engineering –Sensor networks –Telecom call records –Financial applications –Web logs and click-streams –Manufacturing processes DSMSDSMS = Data Stream Management System
5
stanfordstreamdatamanager 5 DBMS versus DSMS
6
stanfordstreamdatamanager 6 DBMS versus DSMS Persistent relationsTransient streams (and persistent relations)
7
stanfordstreamdatamanager 7 DBMS versus DSMS Persistent relations One-time queries Transient streams (and persistent relations) Continuous queries
8
stanfordstreamdatamanager 8 DBMS versus DSMS Persistent relations One-time queries Random access Transient streams (and persistent relations) Continuous queries Sequential access
9
stanfordstreamdatamanager 9 DBMS versus DSMS Persistent relations One-time queries Random access Access plan determined by query processor and physical DB design Transient streams (and persistent relations) Continuous queries Sequential access Unpredictable data characteristics and arrival patterns
10
stanfordstreamdatamanager 10 The STREAM System Data streams and stored relations Declarative language for registering continuous queries Flexible query plans Designed to cope with high data rates and query workloads –Graceful approximation when needed –Careful resource allocation and usage Relational, centralized (for now)
11
stanfordstreamdatamanager 11 Contributions to Date Semantics for continuous queries Query plans Exploiting stream constraints Operator scheduling Approximation techniques Resource allocation to maximize precision Initial running prototype
12
stanfordstreamdatamanager 12 DSMS Scratch Store The (Simplified) Big Picture Input streams Register Query Streamed Result Stored Result Archive Stored Relations
13
stanfordstreamdatamanager 13 (Simplified) Network Monitoring Register Monitoring Queries DSMS Scratch Store Network measurements, Packet traces Intrusion Warnings Online Performance Metrics Archive Lookup Tables
14
stanfordstreamdatamanager 14 Using Conventional DBMS relation inserts triggers materialized viewsData streams as relation inserts, continuous queries as triggers or materialized views Problems with this approach –Inserts are typically batched, high overhead –Expressiveness: simple conditions (triggers), no built-in notion of sequence (views) –No notion of approximation, resource allocation –Current systems don’t scale to large # of triggers –Views don’t provide streamed results But we (and others) plan to compareBut we (and others) plan to compare
15
stanfordstreamdatamanager 15 Declarative Language for Continuous Queries A distinction between STREAM and the Aurora project –Aurora users directly manipulate one large execution plan –STREAM compiles declarative queries into individual plans, system may merge plans –STREAM also supports direct entry of plans Syntax based on SQL, additional constructs for sliding windows and sampling
16
stanfordstreamdatamanager 16 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)
17
stanfordstreamdatamanager 17 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
18
stanfordstreamdatamanager 18 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) Fulfillments F [Range 1 Day] From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
19
stanfordstreamdatamanager 19 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) Orders O From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
20
stanfordstreamdatamanager 20 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] And F.clerk = “Sue” Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe” And O.customer = “Joe”
21
stanfordstreamdatamanager 21 Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
22
stanfordstreamdatamanager 22 Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk
23
stanfordstreamdatamanager 23 Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk
24
stanfordstreamdatamanager 24 Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk
25
stanfordstreamdatamanager 25 Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk
26
stanfordstreamdatamanager 26 Semantics of Database Languages An often neglected topic Traditional relational databases are in reasonable shape –Relational algebra SQL But triggers were a mess The semantics of an innocent-looking continuous query over data streams may not be obvious
27
stanfordstreamdatamanager 27 A Nonobvious Continuous Query Stream of stock quotes: Stocks(ticker,price) Monitor last 10 minutes of quotes: Select From Stocks [Range 10 minutes] Is result a relation, a stream, or something else? If a relation, what exactly does it contain? If a stream, how does query differ from: Select From Stocks [Range 1 minute] or Select From Stocks [ ]
28
stanfordstreamdatamanager 28 Our Semantics and Language for Continuous Queries Abstract:Abstract: interpretation for CQs based on certain “black boxes” Concrete:Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences Goals –CQs over multiple streams and relations –Exploit relational semantics to the extent possible –Easy queries should be easy to write, simple queries should do what you expect
29
stanfordstreamdatamanager 29 Relations and Streams Assume global, discrete, ordered time domain (more on this later) Relation –Maps time T to set-of-tuples R Stream –Set of (tuple,timestamp) elements
30
stanfordstreamdatamanager 30 Conversions StreamsRelations Window specification Special operators: Istream, Dstream, Rstream Any relational query language
31
stanfordstreamdatamanager 31 Conversion Definitions Stream-to-relation –S [W] is a relation — at time T it contains all tuples in window W applied to stream S up to T –When W = , contains all tuples in stream S up to T Relation-to-stream –Istream(R) contains all (r,T ) where r R at time T but r R at time T–1 –Dstream(R) contains all (r,T ) where r R at time T–1 but r R at time T –Rstream(R) contains all (r,T ) where r R at time T
32
stanfordstreamdatamanager 32 Abstract Semantics Take any relational query language Can reference streams in place of relations –But must convert to relations using any window specification language ( default window = [ ] ) Can convert relations to streams –For streamed results –For windows over relations (note: converts back to relation)
33
stanfordstreamdatamanager 33 Query Result at Time T Use all relations at time T Use all streams up to T, converted to relations Compute relational result Convert result to streams if desired
34
stanfordstreamdatamanager 34 Time Easiest: global system clock –Stream elements and relation updates timestamped on entry to system Application-defined time –Streams and relation updates contain application timestamps, may be out of order –Application generates “heartbeat” Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress –Query results in application time
35
stanfordstreamdatamanager 35 Abstract Semantics – Example 1 Select F.clerk, Max(O.cost) From O [ ], F [Rows 1000] Where O.orderID = F.orderID Group By F.clerk Maximum-cost order fulfilled by each clerk in last 1000 fulfillments
36
stanfordstreamdatamanager 36 Abstract Semantics – Example 1 Select F.clerk, Max(O.cost) From O [ ], F [Rows 1000] Where O.orderID = F.orderID Group By F.clerk At time T: entire stream O and last 1000 tuples of F as relations Evaluate query, update result relation at T
37
stanfordstreamdatamanager 37 Abstract Semantics – Example 1 Istream Select Istream(F.clerk, Max(O.cost)) From O [ ], F [Rows 1000] Where O.orderID = F.orderID Group By F.clerk At time T: entire stream O and last 1000 tuples of F as relations Evaluate query, update result relation at T Streamed result:Streamed result: New element (,T) whenever changes from T–1
38
stanfordstreamdatamanager 38 Abstract Semantics – Example 2 Relation CurPrice(stock, price) Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock Average price over last day for each stock
39
stanfordstreamdatamanager 39 Abstract Semantics – Example 2 Relation CurPrice(stock, price) Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock Istream provides history of CurPrice Window on history, back to relation, group and aggregate
40
stanfordstreamdatamanager 40 Concrete Language – CQL Relational query language: SQL Window spec. language derived from SQL-99 –Tuple-based, time-based, partitioned Syntactic shortcuts and defaultsSyntactic shortcuts and defaults –So easy queries are easy to write and simple queries do what you expect EquivalencesEquivalences –Basis for query-rewrite optimizations –Includes all relational equivalences, plus new stream-based ones
41
stanfordstreamdatamanager 41 Two Extremely Simple CQL Examples Select From Strm Had better return Strm (It does) –Default window for Strm –Default Istream for result Select From Strm, Rel Where Strm.A = Rel.B Often want “ NOW ” window for Strm But may not want as default
42
stanfordstreamdatamanager 42 Query Execution query planWhen a continuous query is registered, generation a query plan –Users can also register plans directly Plans composed of three main components: –Operators –Operators (as in most conventional DBMS’s) Queues –Inter-operator Queues (as in many conventional DBMS’s) –State (synopses) schedulerGlobal scheduler for plan execution
43
stanfordstreamdatamanager 43 Operators and State synopsesState (synopses) –Summarize tuples seen so far (exact or approximate) for operators requiring history –To implement windows Example: synopsis join –Sliding-window join –Approximation of full join State 1 State 2 ⋈
44
stanfordstreamdatamanager 44 Simple Query Plan State 4 ⋈ State 3 Stream 1 Stream 2 Stream 3 Q1Q1 Q2Q2 State 1 State 2 ⋈ Scheduler
45
stanfordstreamdatamanager 45 Some Issues in Query Plan Generation +/- streamsCompatibility and conversions for streams and relations (+/- streams) State sharing, incremental computation Windowed joinsWindowed joins: Multiway versus 2-way Windows in general: push down, pull up, split, merge, … Time coordination, operator-level heartbeats
46
stanfordstreamdatamanager 46 Memory Overhead in Query Processing Queues + State Continuous queries keep state indefinitely Online requirements suggest using memory rather than disk –But we realize this assumption is shaky
47
stanfordstreamdatamanager 47 Memory Overhead in Query Processing Queues + State Continuous queries keep state indefinitely Online requirements suggest using memory rather than disk –But we realize this assumption is shaky Goal: minimize memory use while providing timely, accurate answersGoal: minimize memory use while providing timely, accurate answers
48
stanfordstreamdatamanager 48 Reducing Memory Overhead Two main techniques to date constraints on streamsreduce state 1)Exploit constraints on streams to reduce state operator schedulingreduce queue sizes 2)Clever operator scheduling to reduce queue sizes
49
stanfordstreamdatamanager 49 Exploiting Stream Constraints For most queries, unbounded memory is required for arbitrary streams [PODS ’01]
50
stanfordstreamdatamanager 50 Exploiting Stream Constraints arbitraryFor most queries, unbounded memory is required for arbitrary streams [PODS ’01] But streams may exhibit constraints that reduce, bound, or even eliminate state
51
stanfordstreamdatamanager 51 Exploiting Stream Constraints For most queries, unbounded memory is required for arbitrary streams [PODS ’01] But streams may exhibit constraints that reduce, bound, or even eliminate state Conventional database constraints –Redefined for streams –“Relaxed” for stream environment
52
stanfordstreamdatamanager 52 Stream Constraints adherence parameterEach constraint type defines adherence parameter k Clustered(k) for attribute S.A Ordered(k) for attribute S.A Referential-Integrity(k) for join S 1 S 2
53
stanfordstreamdatamanager 53 Algorithm for Exploiting Constraints Input –Any Select-Project-Join query over streams –Any set of k-constraints Output –Query execution plan that reduces or eliminates state based on k-constraints –If constraints violated, get approximate result
54
stanfordstreamdatamanager 54 Constraint Examples Orders (orderID, cost) Fulfillments (orderID, portion, clerk) Query: Many-one join F O
55
stanfordstreamdatamanager 55 Constraint Examples Orders (orderID, cost) Fulfillments (orderID, portion, clerk) Query: Many-one join F O Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non- matching F’s
56
stanfordstreamdatamanager 56 Constraint Examples Orders (orderID, cost) Fulfillments (orderID, portion, clerk) Query: Many-one join F O Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non- matching F’s Referential-Integrity(k) F tuples retained for at most k arrivals of O tuples
57
stanfordstreamdatamanager 57 Operator Scheduling Global scheduler invokes run method of query plan operators with “timeslice” parameter Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, … –First scheduler: round-robin Second scheduler: minimize queue sizes –Third scheduler: minimize combination of queue sizes and latency
58
stanfordstreamdatamanager 58 Scheduling Algorithm Goal: minimize total queue size for unpredictable, bursty stream arrival patterns Proven: within constant factor of any “clairvoyant” strategy for some queries Empirical results: large savings over naive strategies for many queries But minimizing queue sizes is at odds with minimizing latency
59
stanfordstreamdatamanager 59 Scheduling Algorithm (contd.) ready chain Always schedule ready chain with steepest downslope progress chart Plan progress chart: memory delta per unit time Independent scheduling of operators makes mistakes chains Consider chains of operators Memory 1 3 48 1 0.9 2 0.2 Op1 Op2 Op3 Op4 Computation time
60
stanfordstreamdatamanager 60 Approximation Why approximate? –Memory requirement too high, even with constraints and clever scheduling –Can’t process streams fast enough for query load
61
stanfordstreamdatamanager 61 Approximation (cont’d) Static:Static: rewrite queries to add (or shrink) sampling or windows +User can participate, predictable behavior –Doesn’t consider dynamic conditions Dynamic:Dynamic: modify query plan – insert sampling operators, shrink windows, load shedding +Adapts to current resource availability (major open issue) –How to convey to user? (major open issue)
62
stanfordstreamdatamanager 62 Resource Allocation to Maximize Precision Query plan: tree of operators function from resource allocation R to precision PEach operator has function from resource allocation R to precision P Simple (FP,FN) precision model –FP = probability of answer being false positive –FN = # false negatives per correct answer
63
stanfordstreamdatamanager 63 Precision of an Operator State 1 State 2 ⋈ Approx. Answer Exact Answer Precision: FP = False Positives FP = False Positives FN = False Negatives FN = False Negatives
64
stanfordstreamdatamanager 64 Precision of Two Operators (FP1,FN1) State 2 State 1 ⋈ Approx. Answer Exact Answer State 3 Approx. Answer Exact Answer (FP2,FN2)
65
stanfordstreamdatamanager 65 The Problem Statement Given a plan, precision function for each operator in the plan, and R total resources, allocate R to operators to maximize result precision Solution –For each operator type: formula for calculating output precision given input precision and operator resource allocation –Assume precision for input streams –Becomes optimization problem
66
stanfordstreamdatamanager 66 The Holy Grail Given: –Declarative query –Resources –Constraints on streams Generate plan and resource allocation that takes advantage of constraints and maximizes precision Do it for multiple (weighted) queries, dynamically and adaptively, and convey what’s happening to the user
67
stanfordstreamdatamanager 67 The Stream Systems Landscape (At least) three general-purpose DSMS prototypes underway –STREAM –STREAM (Stanford) –Aurora –Aurora (Brown, Brandeis, MIT) –TelegraphCQ –TelegraphCQ (Berkeley) All will be demo’d at SIGMOD ’03 stream system benchmarkCooperating to develop stream system benchmark Goal: demonstrate that conventional systems are far inferior for data stream applications
68
stanfordstreamdatamanager 68 http://www-db.stanford.edu/stream Contributors: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Justin Rosenstein, Rohit Varma
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.