1 Continuous Queries over Data Streams Vitaly Kroivets, Lyan Marina Presentation for The Seminar on Database and Internet The Hebrew University of Jerusalem,


1

2 1 Continuous Queries over Data Streams. Vitaly Kroivets, Lyan Marina. Presentation for the Seminar on Database and Internet, The Hebrew University of Jerusalem, Fall 2002.

3 2 Contents of the lecture: Introduction; Proposed Architecture of Data Stream Management System; Research Problems; Query Optimization; Bibliography.

4 3 Data Streams vs. Data Sets. Data sets: updates are infrequent; old data is required many times; example: an employees' personal data table. Data streams: data changes constantly (sometimes additions only); mostly only the freshest data is used; examples: financial tickers, data feeds from sensors, network monitoring, etc.

5 4 Using a Traditional Database. [Figure: a loader populates the database; the user/application issues a query and receives a result, then repeats (query, result, ...).]

6 5 Data Streams Paradigm. [Figure: the user/application registers a query with the stream query processor, which returns results.]

7 6 Data Streams Paradigm. [Figure: the user/application registers a query with the stream query processor of a Data Stream Management System (DSMS); the processor uses scratch space (memory and/or disk) and returns results.]

8 7 What Is a Continuous Query? A query that is issued once and logically runs continuously.

9 8 What Is a Continuous Query? A query that is issued once and runs continuously. Example: detect abnormalities in network traffic behavior in real time, together with their cause, such as link congestion due to hardware failure.

10 9 What Is a Continuous Query? A query that is issued once and runs continuously. More examples: continuous queries are used to support load balancing and online automatic trading at a stock exchange.

11 10 Special Challenges. Timely online answers even for rapid data streams; fast access to large portions of data; processing of multiple streams simultaneously.

12 11 Making Things Concrete. Streams: Outgoing(call_ID, caller, time, event) and Incoming(call_ID, callee, time, event), where event = start or end. [Figure: Bob and Alice each connect through a central office; the central offices feed the DSMS.]

13 12 Making Things Concrete. Database = two streams of mobile call records: Outgoing(connectionID, caller, start, end) and Incoming(connectionID, callee, start, end). Query language = SQL; FROM clauses can refer to streams and/or relations.

14 13 Query 1 (self-join). Find all outgoing calls longer than 2 minutes: SELECT O1.call_ID, O1.caller FROM Outgoing O1, Outgoing O2 WHERE (O2.time - O1.time > 2 AND O1.call_ID = O2.call_ID AND O1.event = start AND O2.event = end). The result requires unbounded storage. The result can be provided as a data stream. A call can be output after 2 minutes, without seeing its end.
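
The last point of Query 1 (a call can be reported after 2 minutes without waiting for its end event) can be sketched in Python as follows. This is only an illustration under stated assumptions: tuples are plain `(call_id, caller, time, event)` tuples, times are in minutes, and tuples arrive in time order. The function name `long_calls` and the `threshold` parameter are hypothetical, not part of the slides.

```python
def long_calls(events, threshold=2):
    """Yield (call_id, caller) as soon as a call is known to exceed `threshold` minutes.

    Assumes tuples arrive ordered by time (an assumption of this sketch).
    """
    open_calls = {}  # call_id -> (caller, start_time): the only state we keep
    for call_id, caller, time, event in events:
        # Any call still open whose start lies more than `threshold` back is already long.
        overdue = [c for c, (_, start) in open_calls.items() if time - start > threshold]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # Either a short call, or one that was already reported above.
            open_calls.pop(call_id, None)
```

In this sketch the state is limited to the calls that are currently open and not yet reported, while the output itself is an append-only stream.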

15 14 Query 2 (join). Pair up callers and callees: SELECT O.caller, I.callee FROM Outgoing O, Incoming I WHERE O.call_ID = I.call_ID. The result can still be provided as a data stream. The query requires unbounded temporary storage, unless the streams are near-synchronized.
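
A minimal sketch of how Query 2 could be evaluated over interleaved streams, using a symmetric hash join. The assumptions are mine, not the slides': each call_ID appears exactly once on each stream (for example, only start events are joined), and the input is a single interleaved sequence of `(stream_name, call_id, party)` tuples. The name `pair_up` is hypothetical.

```python
def pair_up(tuples):
    """Yield (caller, callee) pairs from interleaved Outgoing/Incoming tuples.

    Assumes one Outgoing and one Incoming tuple per call_id, so a tuple
    can be dropped from temporary storage as soon as it has been matched.
    """
    pending_out = {}  # call_id -> caller, waiting for the Incoming tuple
    pending_in = {}   # call_id -> callee, waiting for the Outgoing tuple
    for stream, call_id, party in tuples:
        if stream == "Outgoing":
            if call_id in pending_in:
                yield party, pending_in.pop(call_id)
            else:
                pending_out[call_id] = party
        else:  # "Incoming"
            if call_id in pending_out:
                yield pending_out.pop(call_id), party
            else:
                pending_in[call_id] = party
```

When the two streams are near-synchronized, matching tuples arrive close together and the two pending tables stay small; if one stream lags arbitrarily, the other table grows without bound.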

16 15 Query 3 (group-by aggregation). Total connection time for each caller: SELECT O1.caller, sum(O2.time - O1.time) FROM Outgoing O1, Outgoing O2 WHERE (O1.call_ID = O2.call_ID AND O1.event = start AND O2.event = end) GROUP BY O1.caller. The result cannot be provided as an (append-only) stream. Alternatives: output a stream with updates; provide the current value on demand; keep the answer in memory.
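
The first alternative for Query 3, an output stream with updates, might look like the sketch below. Each output tuple replaces the caller's previous total rather than being appended to a final answer. The tuple layout `(call_id, caller, time, event)` and the name `connection_time_updates` are assumptions made for this illustration.

```python
from collections import defaultdict


def connection_time_updates(events):
    """Yield (caller, new_total) every time a call ends.

    Each yielded pair is an update that supersedes the caller's previous total.
    """
    starts = {}                # call_id -> start time of calls still in progress
    totals = defaultdict(int)  # caller -> total connection time so far
    for call_id, caller, time, event in events:
        if event == "start":
            starts[call_id] = time
        elif event == "end" and call_id in starts:
            totals[caller] += time - starts.pop(call_id)
            yield caller, totals[caller]
```

The `totals` table is also what "keep the answer in memory" or "provide the current value on demand" would read from.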

17 16 Conclusions. Conventional DBMS technology is inadequate. We need to reconsider all aspects of data management and processing in the presence of data streams.

22 21 DBMS versus DSMS
DBMS | DSMS
Persistent relations | Transient streams (and persistent relations)
One-time queries | Continuous queries
Random access | Sequential access
Access plan determined by query processor and physical DB design | Unpredictable data arrival and characteristics
"Unbounded" disk store | Bounded main memory

23 22 Related work. Tapestry system: content-based filtering of email messages; restricted subset of SQL; append-only query results. Chronicle data model: append-only ordered sequences of tuples; restricted view-definition language; does not store any chronicles. Alert system: event-condition-action triggers in a conventional SQL DB; continuous queries over append-only "active tables".

24 23 Related work: Materialized Views. Materialized views are queries that need to be reevaluated whenever the database changes. Materialized views vs. continuous queries: continuous queries may stream rather than store the result; may deal with append-only relations; may provide approximate answers; and their processing strategy may adapt to the characteristics of the data stream.

25 24 Architecture for continuous queries. A single stream of tuples D, a single continuous query Q, and the answer to the query A. Q is issued once and operates continuously. [Figure: data stream D feeds continuous query Q, which produces answer A.]

26 25 Architecture for continuous queries. We consider data streams that adhere to the relational model (i.e. streams of tuples), although many of the ideas and techniques are independent of the data model being considered. [Figure: data stream D feeds continuous query Q, which produces answer A.]

27 26 Architecture for continuous queries. Scenario 1 (simplest): data stream D is append-only, with no updates or deletions. How to handle Q? 1) Always store the current answer A to Q. D is of unbounded size, so A may be too. 2) Do not store A, but make new tuples in A available as another continuous stream. No unbounded storage is needed for A, but unbounded storage may be needed to determine new tuples in A.

28 27 Architecture for continuous queries. Scenario 2: the input stream is append-only, but may cause updates and deletions in answer A, so tuples in the output data stream may need to be updated or deleted. Scenario 3 (most general): input stream D includes updates and deletions, so much of the stream's data must be stored to determine the answer.

29 28 Architecture for continuous queries. How to solve? 1) Restrict the expressiveness of Q. 2) Impose constraints on the data stream to guarantee that the answer to Q, and the amount of data needed to compute Q, are bounded. 3) Provide an approximate answer.

30 29 Architecture for processing continuous queries. [Figure: streams 1 to N feed the stream query processor, which uses Stream, Store, Scratch and Throw.]

31 30 Architecture for continuous queries. STREAM is a data stream containing tuples appended to A. It is an append-only stream (it should not include updates or deletions). STREAM and STORE together define the current answer A.

32 31 Architecture for continuous queries. When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive. 1) t causes new tuples in A: if tuple a will remain in A forever, send a to STREAM. 2) If a should be in A, but may be removed at some moment, add a to STORE. [Figure: stream query processor with Stream, Store, Scratch and Throw.]

33 32 Architecture for continuous queries (actions, continued). 3) t may cause an update or deletion of answer tuples in STORE; answer tuples may be moved from STORE to STREAM. 4) t, or data derived from it, may need to be saved to ensure that the query result can be computed in the future: send t to SCRATCH. [Figure: stream query processor with Stream, Store, Scratch and Throw.]

34 33 Architecture for continuous queries (actions, continued). 5) t is not needed and will not be needed: send it to THROW (unless we want to archive it). 6) As a result of t, we may move data from STORE or SCRATCH to THROW. [Figure: stream query processor with Stream, Store, Scratch and Throw.]

35 34 Architecture for continuous queries. Scenario 1: data stream D is append-only, with no updates or deletions, and the current answer A to Q is always stored. STREAM is empty; STORE always contains A; SCRATCH contains whatever is needed to keep the answer in STORE up to date.

36 35 Architecture for continuous queries. Scenario 2: answer A is made available exclusively as a data stream. STREAM is used to stream answer A; STORE is empty; SCRATCH contains whatever is needed to keep the answer up to date.

37 36 Architecture for continuous queries. Scenario 3: the input stream is append-only, but answer A may have updates and deletions. Example: Q is a group-by with a Min aggregation function. Answer A is maintained in STORE; SCRATCH is empty.
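
The group-by Min example of Scenario 3 can be sketched as below: the answer lives in STORE, nothing is kept in SCRATCH, and a tuple that cannot change a minimum goes straight to THROW. Modelling STORE as a Python dict and returning the destination as a string are choices of this sketch only; `process_min` is a hypothetical name.

```python
def process_min(store, group, value):
    """Apply one appended tuple (group, value) to a group-by MIN answer.

    `store` maps each group to its current minimum (the answer A in STORE).
    Returns the name of the component that received the tuple's effect.
    """
    if group not in store or value < store[group]:
        store[group] = value  # insert or update the answer tuple in STORE
        return "STORE"
    return "THROW"            # cannot lower the minimum, so nothing is kept
```

Because the input is append-only, a discarded value can never become the minimum later, which is why no SCRATCH state is needed in this scenario.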

38 37 Architecture for continuous queries. Scenario 4: input streams may include updates and deletions. Unbounded storage is required for SCRATCH to ensure that Min can always be computed. In both 3 and 4, data is moved to STREAM only when it is known that no further updates or deletions of tuples of this group will occur.

39 38 The Architecture and Related Work. Implementing triggers in terms of the proposed architecture (for launching triggered actions, assume the actions are performed by SQL stored procedures): STREAM and STORE are empty; SCRATCH is used for data required to monitor complex events. Benefits: complex multi-table events and conditions can be monitored; trigger processing benefits from efficient data management and processing techniques (see below).

40 39 The Architecture and Related Work. Implementing materialized views in terms of the proposed architecture: the view itself is maintained in STORE; base data is kept in SCRATCH; data expiration is used to expedite cleanup of SCRATCH; there is no way to ensure that the sizes of STORE and SCRATCH are bounded.

41 40 End of Part I

42 41 Research Problems. Designing a query language; online processing of rapid streams; approximation techniques; storage constraints vs. performance requirements; summarization; query planning / optimization; building a good query plan; scheduling; sub-plan sharing; resource management; adaptation.

43 42 Research Problems: Languages for Continuous Queries. Bounding the size of scratch/store. Open problem: to determine, for an arbitrary SQL query, whether these properties are satisfied.

44 43 Query Language. The query language allows both streams and relations. Assumptions: streams are ordered, append-only and unbounded, and multiple streams are allowed; relations are unordered and support updates and deletions.

45 44 SQL Extensions for Continuous Queries. FROM may refer to both streams and relations. Sliding window for the FROM clause (for streams): optional "partitioning" clause; mandatory "window size"; optional "filtering predicate".

46 45 Window specification. Using ROWS: ROWS 50 PRECEDING. Using RANGE: RANGE 15 minutes PRECEDING.
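
The two window kinds can be sketched in Python as follows. This is an illustration rather than the semantics of any particular system: here a ROWS window holds the last n tuples including the newest one, and a RANGE window holds tuples whose timestamp is strictly newer than `now - span`; both boundary choices, and the class names, are assumptions of this sketch.

```python
from collections import deque


class RowsWindow:
    """Count-based sliding window: keeps the most recent `n` tuples."""

    def __init__(self, n):
        self.items = deque(maxlen=n)  # deque drops the oldest item automatically

    def add(self, item):
        self.items.append(item)


class RangeWindow:
    """Time-based sliding window: keeps tuples newer than `now - span`."""

    def __init__(self, span):
        self.span = span
        self.items = deque()  # (timestamp, item), in arrival (time) order

    def add(self, timestamp, item):
        self.items.append((timestamp, item))
        # Expire everything that has fallen out of the time range.
        while self.items[0][0] <= timestamp - self.span:
            self.items.popleft()
```

A ROWS window has a fixed memory bound, while the memory used by a RANGE window depends on the arrival rate within the time span.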

47 46 Example 1. [Figure: clients on the Internet send requests to the web servers; the DSMS receives the stream S(client_id, URL, domain, time).]

48 47 Example 1 (CQL): "FROM" with "RANGE". Stream "Requests" of requests to a web server, with attributes (client_id, URL, domain, time). Query counting the number of requests for pages from domain "cs.huji.ac.il" in the last day: SELECT COUNT(*) FROM Requests S [RANGE 1 DAY PRECEDING] WHERE S.domain = "cs.huji.ac.il"

49 48 Partitioning Clause. Partitions the data into several groups; computes a separate window for each group; merges the windows into a single result. It is syntactically the same as a GROUP BY clause.

50 49 Example 2: "Partition By". How many pages were served from the CS website in response to requests from domain CS.HUJI.AC.IL, counting only each client's 10 most recent requests? SELECT COUNT(*) FROM requests S [PARTITION BY s.Client_id ROWS 10 PRECEDING WHERE s.Domain = 'CS.HUJI.AC.IL'] WHERE s.URL LIKE 'http://cs.huji.Ac.Il/%'
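
One possible reading of the Example 2 query, sketched in Python. The assumptions are mine: the filtering predicate is applied before tuples enter the per-client window, string comparisons ignore case, and the count is computed once over a finite input rather than being updated continuously. The function name `count_cs_pages` is hypothetical.

```python
from collections import defaultdict, deque


def count_cs_pages(requests, rows=10):
    """Count CS-website URLs among each client's `rows` most recent
    requests from domain cs.huji.ac.il.

    `requests` is an iterable of (client_id, url, domain) in arrival order.
    """
    # One ROWS window per client_id: this is the PARTITION BY step.
    windows = defaultdict(lambda: deque(maxlen=rows))
    for client_id, url, domain in requests:
        if domain.lower() == "cs.huji.ac.il":  # filtering predicate of the window
            windows[client_id].append(url)
    # Merge all partitions, then apply the outer WHERE ... LIKE predicate.
    return sum(
        url.lower().startswith("http://cs.huji.ac.il/")
        for window in windows.values()
        for url in window
    )
```

Partitioning keeps a busy client from pushing other clients' requests out of the window, which a single global ROWS window could not guarantee.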

51 50 Example 3: Join with a relation. Classify domains by the primary type of web content they serve: .ac.il EDUCATION; .gov.il GOVERNMENT; .co.il COMMERCE; .com COMMERCE. Count the number of requests from "commerce" domains out of the last 10000 records. A 10% sample of the requests stream is used.

52 51 Example 3 (cont.). SELECT COUNT(*) FROM (SELECT R.class FROM Requests S 10% SAMPLE, Domains R WHERE S.Domain = R.Domain) T [ROWS 10000 PRECEDING] WHERE T.class = "commerce". Note: the stream of Requests is joined with the Domains relation, resulting in stream T, before the sliding window is applied.

53 52 Performance Challenge. Multiple rapid incoming data streams; multiple complex queries with timeliness requirements; finite resources.

54 53 Solution: Approximation. Approximate answers; graceful degradation; maximize precision based on available resources.

55 54 Approximation: Static vs. Dynamic. Static: queries are modified at submission time to use fewer resources; the user is guaranteed certain query behavior; the user can configure the approximation mechanism; adaptation mechanisms are not needed. Dynamic: queries are modified at run time; not suitable for some applications.

56 55 Approximation Techniques. Window reduction; sampling rate reduction; summarization (synopses).

57 56 Window reduction. Decrease the size of the window, or introduce a window where none was specified originally. May increase the output rate (with duplicate elimination, for example); such bad cases must be detected statically. Affects the resources used by the operator.

58 57 Sampling rate reduction. Will reduce the output rate; will not influence the resource requirements of the operation. Introduce SAMPLE if it is not specified; otherwise reduce the sampling rate.

59 58 Summarization. Summaries (data synopses): concise representation at the expense of accuracy. Techniques: sampling, histograms, wavelets. Open questions: How to make guarantees about query results based on summaries? How to maintain summaries efficiently over rapid data streams? Which summarization techniques are better?

60 59 Dynamic approximation: challenges. Some applications will not tolerate unpredictable and variable accuracy. Extend the language to specify tolerable imprecision.

61 60 Dynamic approximation techniques. Synopsis compression; sampling; load shedding.

62 61 Synopsis compression. Synopses: concise representation at the expense of accuracy. Goal: reducing memory overhead. Methods: histograms, wavelets, etc.

63 62 Load shedding. Drop tuples from queues when they grow too large. Load shedding drops chunks of tuples at a time; this differs from sampling, which eliminates tuples probabilistically. Load shedding is biased, but easier to implement.
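
The contrast between sampling and load shedding can be sketched as below. Which chunk gets dropped is not stated on the slide; dropping the oldest tuples first is an assumption of this sketch, as are the function names and parameters.

```python
import random


def sample(tuples, rate, rng):
    """Sampling: keep each tuple independently with probability `rate`."""
    return [t for t in tuples if rng.random() < rate]


def shed_load(queue, capacity, chunk):
    """Load shedding: while the queue is over `capacity`, drop a whole
    chunk of `chunk` tuples at once (oldest first, by assumption here)."""
    queue = list(queue)
    while len(queue) > capacity:
        del queue[:chunk]
    return queue
```

Sampling thins the stream evenly, so estimates can be scaled back up by `1 / rate`; shedding removes a contiguous run of tuples, which is cheap but biases the result towards the tuples that survive.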

64 63 Query Plans: How Does a DSMS Process a Query? Issues to consider: a separate query plan for each continuous query vs. one mega-query plan for all computations for all users; plan components may be shared; a query registers before streams start to produce data; what about adding queries over existing streams?; queries over archived / discarded data.

65 64 STREAM System: Query Plans. Query operators: each reads a stream of tuples from a set of input queues, processes them, and writes output tuples into a single output queue. [Figure: input queue, operator, output queue.]

66 65 Query Plans (cont.). Inter-operator queues: queues connect different operators and define the flow of tuples. Synopses: summarize the tuples seen so far at an intermediate operator, as needed for future evaluation.

67 66 When Are Synopses Needed? Join operator: must remember the tuples seen so far on each of its input streams, so it maintains a synopsis for each. Filter operator (selection): does not maintain state, so it needs no synopsis.

68 67 Example. [Figure: streams R and S feed join operator O1 through Queue1 and Queue2; O1's output queue, Queue3, is shared by select operator O2 (Query 1: selection over the join of R and S) and join operator O3 (Query 2: join of R, S and T), which also reads stream T through Queue 4; the join operators keep synopses Synop1 to Synop4; a scheduler controls the operators.]

69 68 Explanations to Example. The two plans (for Q1 and Q2) share a sub-plan joining streams R and S by sharing its output queue q3. Execution of operators is controlled by a global scheduler. When operator O is scheduled, control passes to O for a period determined by a number of tuples. Time-slice-based scheduling is also possible.

70 69 Resource Sharing for Query Plans. Applies when continuous queries share common sub-expressions; similar to a traditional DBMS. Resource sharing and approximation are considered separately: do not share if sharing introduces approximation, such as merging sub-expressions with different window sizes.

71 70 Implementation of Shared Queue. The queue maintains, for each operator, a pointer to the first tuple that operator has not yet read. Tuples are discarded once they have been read by all operators. [Figure: shared queue holding tuples t1 to t8, read by operators Op1 to Op4.]
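
The two rules above (one read pointer per operator, discard once everyone has read) can be sketched as a small Python class. The class and method names are hypothetical and the list-based buffer is chosen for clarity rather than speed.

```python
class SharedQueue:
    """Queue read independently by several operators.

    A tuple stays buffered until every registered reader has consumed it.
    """

    def __init__(self, readers):
        self.buffer = []  # tuples not yet read by all operators
        self.base = 0     # absolute position of buffer[0] in the stream
        self.next_pos = {r: 0 for r in readers}  # first unread position per operator

    def put(self, item):
        self.buffer.append(item)

    def get(self, reader):
        pos = self.next_pos[reader]
        if pos - self.base >= len(self.buffer):
            return None   # this operator has read everything available
        item = self.buffer[pos - self.base]
        self.next_pos[reader] = pos + 1
        # Discard the prefix that the slowest operator has already passed.
        slowest = min(self.next_pos.values())
        del self.buffer[: slowest - self.base]
        self.base = slowest
        return item
```

The buffer length is governed by the slowest reader, which is exactly why very different consumption rates make sharing a queue expensive.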

72 71 Resource Sharing (cont.). A base data stream accessed by multiple queries is shared as a common sub-expression. The number of tuples in a shared queue depends on the rate of addition to the queue and the rate at which the slowest operator consumes tuples. Problem case: a common sub-expression of two queries with very different consumption rates.

73 72 Shared Queue Issues. P1 and P2 are parents of operator J. J will be scheduled frequently for the sake of P1; J should be scheduled less frequently for P2 (to avoid proliferation of tuples in q). [Figure: join operator J writes to queue q, which is read by P1 (heavy consumer) and P2 (light consumer).]

74 73 Sub-Plan Sharing. Formally proven: sub-plan sharing may be sub-optimal for common sub-expressions with joins; for common sub-expressions without joins, sharing is always preferable.

75 74 Synopses Sharing. Issues to consider: Which operator is responsible for managing shared synopses? When synopses are required by different operators, how to choose the size of the common synopsis? If synopses are identical, how to cope with different consumption rates?

76 75 Scheduling. Objectives for the scheduler: stream-based variation of response time; throughput; weighted fairness among queues; minimizing intermediate queue sizes. Granularity for the scheduler: maximum number of tuples consumed by an operator; time unit. Open question: parallelism in the scheduling algorithm?

77 76 Scheduling: Example. O1 takes 1 time unit to operate on n tuples from q1 and, with 20% selectivity, produces n/5 tuples in q2. O2 takes 1 time unit to operate on n/5 tuples from q2, and it does not produce tuples. [Figure: q1 feeds operator O1, which feeds q2, which feeds operator O2.]

78 77 Scheduling Example (cont.). Assume the average arrival rate on q1 is no more than n tuples per 2 time units, so queues are bounded; arrivals may be bursty. Possible scheduling strategies: Algorithm 1 (time-slicing): each operator in turn processes tuples for 1 time unit; O1 consumes n tuples, O2 consumes n/5; then O1, O2, ... Algorithm 2: O1 operates until its queue is empty, and afterwards O2 runs.
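
The two strategies can be compared with a small discrete-time simulation. This sketch does not reproduce the charts of the following slides: the arrival pattern passed in, the choice n = 5, and the simplification that an operator occupies a full time unit even with a partly empty queue are all assumptions made here. The function name `simulate` and the policy labels are hypothetical.

```python
def simulate(arrivals, policy, n=5):
    """Return the total queue size (q1 + q2) after each time unit.

    arrivals[t] tuples join q1 at the start of time unit t.
    policy "time_slice": O1 and O2 alternate, one time unit each.
    policy "o1_first":   O1 runs whenever q1 is non-empty, otherwise O2.
    """
    q1 = q2 = 0
    sizes = []
    o1_turn = True  # only used by the time-slicing policy
    for arrived in arrivals:
        q1 += arrived
        if policy == "time_slice":
            run_o1 = o1_turn
            o1_turn = not o1_turn
        else:
            run_o1 = q1 > 0
        if run_o1:
            done = min(q1, n)      # O1 handles up to n tuples per time unit
            q1 -= done
            q2 += done // 5        # 20% selectivity
        else:
            q2 -= min(q2, n // 5)  # O2 handles up to n/5 tuples per time unit
        sizes.append(q1 + q2)
    return sizes
```

For a burst of 2n tuples followed by silence, running O1 first shrinks the total backlog sooner, because O1 removes n tuples per time unit while adding only n/5.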

79 78 Algorithm 1. [Chart: queue size over time (time units 1 to 8), with arrivals of 2n tuples and of n tuples marked; orange: tuples in Q1, yellow: tuples in Q2.]

80 79 Algorithm 2. [Chart: queue size over time (time units 1 to 8), with arrivals of 2n tuples and of n tuples marked; orange: tuples in Q1, yellow: tuples in Q2.]

81 80 Comparison: which is better? [Chart: total size of both queues over time (time units 1 to 8), with arrivals of 2n tuples and of n tuples marked; orange: Algorithm 1, yellow: Algorithm 2.]

82 81 Greedy Scheduler Rule. Schedule the operator that consumes the largest number of tuples per time unit and is the most selective (produces the fewest tuples). Operators with full batches in their input queues are favored over high-priority operators with under-full inputs (better utilization of the time slice). A high-priority operator may be underutilized if its feeders are low priority, so consider chains of operators.

83 82 Scheduling Algorithm Discussion. Queue size minimization. Increased time to initial results: strategy 1 would produce initial results faster. Incorporate response time and weighted fairness into the algorithm. Flexible time slices. Taking context switching into account.

84 83 Resource Management. Relevant resources: memory; CPU; I/O (if disk is used); network (in a distributed DSMS). Our goal: maximize query precision by making the best use of available resources, and be able to do so dynamically and adaptively.

85 84 Resource Management (cont.). Focus on memory used by synopses and queues. Algorithms developed in STREAM: allocating memory to a query plan; incorporating known constraints on input streams to reduce synopses without compromising precision; operator scheduling to minimize queue size.

86 85 Resource Management Approaches (cont.). Exploiting constraints over data streams: when additional information about streams is available (gathered statistics, constraint specifications), reduce resource utilization with the same result precision.

87 86 Adaptation: why? Queries are long-running. Parameters such as stream flow rate, stream data characteristics, and environment (available RAM) may vary. How to adapt?

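88 87 Exploiting Constraints over Data Streams. Query Q: a join of stream Orders (O) and stream Fulfillments (F), with attributes Order_ID and Item_ID, to monitor fulfillment delays. Answering Q requires synopses of unbounded size! [Figure: streams Orders and Fulfillments feed a join with synopses Synop-O and Synop-F.]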

89 88 Constraints (cont.). Tuples for a given (orderID, itemID) arrive on stream O before the corresponding tuples arrive on F, so there is no need to maintain a join synopsis for F. Another constraint: tuples arrive on O clustered by orderID, so we need only save the tuples for a given orderID until the next orderID is seen. [Figure: example sequences of (order, item) tuples such as (Ord1, item 4), (Ord1, item 2), ..., (Ord3, item 2), with the note "More RAM needed for synopsis".]
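
The first constraint (an order tuple always arrives before its fulfillment tuples) can be illustrated with a join that keeps a synopsis for Orders only. The tuple layout `(stream_name, order_id, item_id, time)`, including the `time` attribute used to compute the delay, is an assumption of this sketch, as is the name `fulfillment_delays`. The clustering constraint is not modelled here.

```python
def fulfillment_delays(tuples):
    """Yield (order_id, item_id, delay) for each fulfillment tuple.

    Relies on the constraint that an Orders tuple for a given
    (order_id, item_id) arrives before the matching Fulfillments tuples,
    so Fulfillments tuples never have to be remembered.
    """
    orders = {}  # Synop-O: (order_id, item_id) -> order time
    for stream, order_id, item_id, time in tuples:
        key = (order_id, item_id)
        if stream == "Orders":
            orders[key] = time
        elif key in orders:  # "Fulfillments": probe Synop-O, store nothing
            yield order_id, item_id, time - orders[key]
```

Without the constraint, a fulfillment could arrive first and would have to wait in a second synopsis (Synop-F) for its order.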

90 89 Constraints (cont.). Referential integrity; unique-value; clustered-arrival; ordered-arrival.

91 90 Summary. Architecture for DSMS; query language; common design problems; tradeoff: efficiency, accuracy, storage.

92 91 References. "Continuous Queries over Data Streams" by S. Babu and J. Widom (Stanford University). "Query Processing, Approximation, and Resource Management in a Data Stream Management System" by R. Motwani, J. Widom and others (Stanford University). http://www.db.stanford.edu/stream

93 Questions ?

94 93

