1 Continuous Query Languages for DSMS CS240B Notes by Carlo Zaniolo.


Slide 1: Continuous Query Languages for DSMS. CS240B Notes by Carlo Zaniolo

Slide 2: Outline
- Design objectives for Data Stream Management Systems (DSMS)
- Languages for expressing continuous queries
  - The blocking problem
  - The expressive power problem
- The Expressive Stream Language: ESL

Slide 3: Blocking Operators
- A blocking query operator is 'one that is unable to produce the first tuple of the output until it has seen the entire input' [Babcock et al., PODS 2002].
- But continuous queries cannot wait for the end of the stream: they must return results while the data is streaming in. Blocking operators cannot be used.
- Therefore, only non-blocking (nb) queries and operators can be used on data streams (i.e., those that return their results before they have detected the end of the input).
- Current DBMSs make heavy use of blocking computations:
  1. For operators that are intrinsically blocking, e.g., SQL aggregates.
  2. For operators that are not, e.g., sort-based implementations of joins and group-by.
  3. We only need to be concerned with 1: find a characterization of blocking and nonblocking that is independent of the implementation.

Slide 4: Partial Ordering
- Let $S = [t_1, \ldots, t_n]$ be a sequence and $0 \le k \le n$. Then $[t_1, \ldots, t_k]$ is said to be the presequence of $S$ of length $k$, denoted by $S_k$.
- We write $L \sqsubseteq S$ to denote that $L$ is a presequence of $S$.
- $\sqsubseteq$ defines a partial order: it is reflexive, antisymmetric, and transitive.
- $\sqsubseteq$ generalizes to the subset notion when order and duplicates are immaterial.
- The empty sequence, $[\,]$, is a presequence of every other sequence.
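The presequence ordering can be illustrated with a minimal Python sketch. The helper names `presequence` and `is_presequence` are hypothetical and not part of the notes; sequences are modeled as Python lists.

```python
from typing import Sequence


def presequence(s: Sequence, k: int) -> list:
    """Return S_k, the presequence of s of length k (0 <= k <= len(s))."""
    if not 0 <= k <= len(s):
        raise ValueError("k must satisfy 0 <= k <= len(s)")
    return list(s[:k])


def is_presequence(l: Sequence, s: Sequence) -> bool:
    """Return True when l is a presequence (prefix) of s."""
    return len(l) <= len(s) and list(s[:len(l)]) == list(l)


S = ["t1", "t2", "t3"]
empty_ok = is_presequence([], S)                  # the empty sequence is a presequence
reflexive_ok = is_presequence(S, S)               # every sequence is a presequence of itself
prefix_ok = is_presequence(presequence(S, 2), S)  # S_2 is a presequence of S
middle_ok = is_presequence(["t2"], S)             # a subsequence, but not a presequence
```

Only prefixes qualify: `["t2"]` occurs inside `S` but does not start it, so the last check is false.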

Slide 5
employees(E#, Dept)
- Traditional SQL-2 aggregates: blocking
    select dept, count(E#) from employees group by dept
- SQL:2003 OLAP functions: non-blocking
    select dept, count(E#) over (partition by dept range unbounded preceding) from employees
- Continuous count returns, for each new tuple, the count so far. Consider a sequence of length $n$. At each step $j \le n$, $j$ is returned, so the cumulative output up to $j$ is $\mathit{count}_j(S) = [1, 2, \ldots, j]$, independent of whether $j = n$ or $j < n$.
- Traditional count: for each $j < n$, nothing is returned: $\mathit{count}_j(S) = [\,]$. Final: $\mathit{count}_n(S) = [n]$.
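The contrast between the two counts can be sketched with Python generators. This is an illustration under stated assumptions, not how a DBMS or DSMS implements these aggregates: the function names and the sample tuples are made up, and employees are modeled as (E#, Dept) pairs.

```python
import itertools
from collections import defaultdict


def continuous_count(employees):
    """Non-blocking: emit (dept, count so far) for each arriving tuple."""
    counts = defaultdict(int)
    for _emp_no, dept in employees:
        counts[dept] += 1
        yield dept, counts[dept]


def traditional_count(employees):
    """Blocking: nothing is emitted until the whole input has been read."""
    counts = defaultdict(int)
    for _emp_no, dept in employees:
        counts[dept] += 1
    yield from sorted(counts.items())


stream = [(1, "toys"), (2, "books"), (3, "toys")]
running = list(continuous_count(stream))
final = list(traditional_count(stream))

# The continuous count also works on an unbounded stream; the
# traditional count would never return on this input.
endless = ((i, "toys") for i in itertools.count(1))
first_three = list(itertools.islice(continuous_count(endless), 3))
```

Here `running` holds one output per input tuple, while `final` holds one row per department and only exists once the input is exhausted.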

Slide 6: Operators on Sequences: $S \rightarrow G \rightarrow G(S)$
Operators are viewed as incremental transducers on a sequence $S$ of length $n$.
- $G(S)$: the result of applying $G$ to the whole of $S$.
- $S_j$: the first $j$ elements of $S$ (the presequence of length $j \le n$).
- $G_j(S)$: the cumulative output produced up to and including the $j$-th input tuple.
Then $G$ is said to be:
- Blocking when $G_j(S) = [\,]$ for $j < n$, and $G_n(S) = G(S)$.
- Nonblocking when $G_j(S) = G(S_j)$ for every $j \le n$.
For example, say that $G$ produces one output tuple for each input tuple: then $G_j(S)$ contains $j$ tuples and coincides with $G(S_j)$, so $G$ is nonblocking.
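The two definitions can be checked mechanically on a finite sample. The sketch below assumes a hypothetical operator protocol (a `push` method called per tuple and a `close` method called at end of input); neither the protocol nor the class names come from the notes.

```python
class RunningCount:
    """Emits the count so far after every input tuple."""
    def __init__(self):
        self.n = 0

    def push(self, t):
        self.n += 1
        return [self.n]

    def close(self):
        return []


class FinalCount:
    """Emits a single count, and only at end of input."""
    def __init__(self):
        self.n = 0

    def push(self, t):
        self.n += 1
        return []

    def close(self):
        return [self.n]


def run_whole(op_class, s):
    """G(S): all output produced on the whole of s, end of input included."""
    op, out = op_class(), []
    for t in s:
        out += op.push(t)
    return out + op.close()


def cumulative(op_class, s, j):
    """G_j(S): output produced up to and including the j-th tuple of s."""
    if j == len(s):
        return run_whole(op_class, s)
    op, out = op_class(), []
    for t in s[:j]:
        out += op.push(t)
    return out


def is_blocking_on(op_class, s):
    """G_j(S) = [] for every j < n (G_n(S) = G(S) holds by construction)."""
    return all(cumulative(op_class, s, j) == [] for j in range(len(s)))


def is_nonblocking_on(op_class, s):
    """G_j(S) = G(S_j) for every j <= n."""
    return all(cumulative(op_class, s, j) == run_whole(op_class, s[:j])
               for j in range(len(s) + 1))


S = ["a", "b", "c"]
running_is_nb = is_nonblocking_on(RunningCount, S)
final_is_blocking = is_blocking_on(FinalCount, S)
final_is_nb = is_nonblocking_on(FinalCount, S)
```

A check on one sample sequence can refute a property but cannot prove it for all inputs; it is meant only to make the two definitions concrete.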

Slide 7: Examples
- Traditional SQL-2 aggregates are blocking; SQL:2003 OLAP functions are not.
- Selection is nonblocking.
- Continuous count (i.e., the unbounded preceding count of OLAP functions) is nonblocking.
- Window aggregates are also nonblocking.
- In-between cases: e.g., traditional aggregates on input that is already sorted on the group-by values.

Slide 8: Characterization of NonBlocking (nb)
- Many functions expressible by nb-computations can also be expressed by blocking ones. E.g., joins can be implemented using sorting. Ditto for projections with duplicate elimination.
- But many functions implemented using blocking computations cannot be given an nb-implementation.
- We must distinguish between the two kinds of functions, since one can be supported in our DSMS (via a suitable nb-implementation) and the other cannot.
Theorem: Queries can be expressed via nb computations iff they are monotonic w.r.t. the presequence ordering.
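Monotonicity with respect to the presequence ordering means that the output on a prefix is itself a prefix of the output on any longer input. A minimal sketch of a sample-based check follows; `monotonic_on`, `selection`, and `total` are hypothetical names, and passing the check on one sample does not prove monotonicity in general.

```python
def is_prefix(l, s):
    """True when list l is a presequence (prefix) of list s."""
    return list(s[:len(l)]) == list(l)


def monotonic_on(g, s):
    """Check that G(S_j) is a presequence of G(S_k) for all j <= k <= len(s)."""
    outs = [g(s[:k]) for k in range(len(s) + 1)]
    return all(is_prefix(outs[j], outs[k])
               for k in range(len(outs))
               for j in range(k + 1))


def selection(s):
    """Selection: keep the tuples greater than 10, in arrival order."""
    return [t for t in s if t > 10]


def total(s):
    """Traditional aggregate: a single tuple holding the sum of the input."""
    return [sum(s)]


sample = [5, 20, 7, 30]
selection_monotonic = monotonic_on(selection, sample)
total_monotonic = monotonic_on(total, sample)
```

Selection only ever appends to its earlier output, so it passes; the traditional sum replaces its single answer as the input grows, so it fails.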

Slide 9: NB-completeness
- A query language L can express a given set of functions F on its input (DB, sequences, data streams); the larger F, the greater the expressive power of L.
- Non-monotonic functions are intrinsically blocking and cannot be used on data streams. Thus, if we use L in a DSMS, we give up the non-monotonic subset of F with no regret. However, let us make sure that we do not give up anything more!
- More? Yes, because for continuous queries on streams, we will normally disallow L's blocking (i.e., nonmonotonic) operators and constructs, and only allow nb (i.e., monotonic) operators.
- But are ALL the monotonic functions expressible by L using the nb-operators of L? Or, by disallowing blocking operators, did we also lose the ability to express some monotonic queries?
Definition: L is said to be nb-complete when it can express all the monotonic queries expressible by L using only its nb-operators.

Slide 10: Expressive Power and NB-Completeness
- NB-completeness is a test that a language is as suitable for continuous queries on data streams as it is for queries on stored databases.
- In a language L lacking nb-completeness, there are monotonic functions that L cannot express as continuous queries, but that L could express if the stream had been stored in a database.
- For instance, Relational Algebra and SQL are not nb-complete (in addition to the shortcomings they might have on DBs).

Slide 11: Sets versus Sequences
- Sets are sequences where duplicates are allowed and order is immaterial.
- Thus S1 is a subset of S2 iff S1 can be reordered into a presequence of S2.
- Theorem [lifted from sequences to sets]: a function is nb iff it is monotonic.
- NB = monotonic: selection, projection, and OLAP functions.
- Blocking = non-monotonic: e.g., traditional aggregates.
- Operators of more than one argument:
  - Joins are monotonic (i.e., NB) in both arguments.
  - R-S is monotonic on R and antimonotonic on S: i.e., it will block on S but not on R (after it has seen the whole S, though).
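The behavior of the two-argument operators can be checked on small sets. This is a minimal sketch using Python sets; the relation contents and the function names `difference` and `join` are illustrative assumptions.

```python
def difference(r, s):
    """Set difference R - S."""
    return r - s


def join(r, s):
    """Join of (a, b) pairs in r with (b, c) pairs in s on the shared b value."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


R_small, R_big = {1, 2}, {1, 2, 3}
S_small, S_big = {2}, {2, 3}

# Monotonic on R: growing R can only grow R - S.
monotonic_on_r = difference(R_small, S_small) <= difference(R_big, S_small)

# Antimonotonic on S: growing S can only shrink R - S.
antimonotonic_on_s = difference(R_big, S_big) <= difference(R_big, S_small)

# Join is monotonic in both arguments.
A_small, A_big = {(1, "x")}, {(1, "x"), (2, "x")}
B_small, B_big = {("x", 9)}, {("x", 9), ("x", 8)}
join_monotonic_left = join(A_small, B_small) <= join(A_big, B_small)
join_monotonic_right = join(A_small, B_small) <= join(A_small, B_big)
```

Because a tuple of R can be emitted only once it is known that it will never appear in S, the difference has to wait for all of S, which is the blocking behavior described for R-S.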

Slide 12: Relational Algebra (RA)
- Set difference can produce monotonic queries. Intersection: $R_1 \cap R_2 = R_1 - (R_1 - R_2)$
- Are these still expressible without set difference?
- Intersection can be expressed as a join: product + select.
- But interval coalescing and Until queries are monotonic queries that can be expressed in RA but not in nb-RA.
- Example: the temporal domain is isomorphic to the nonnegative integers. Intervals are closed on the left but open on the right:
    p(0, 3).  % 0, 1, and 2 are in p but 3 is not
    p(2, 4).  % 3 is not a hole because it is covered by this interval
    p(4, 5).  % 5 is a hole because it is not covered by any other interval
    p(6, 8).

Slide 13: Coalesce p (cp) and p Until q
Input intervals:
    p(0, 3). p(2, 4). p(4, 5). p(6, 8).
Coalesced intervals:
    cp(0, 3). cp(2, 4). cp(4, 5). cp(6, 8). cp(0, 4). cp(2, 5). cp(0, 5).
cp contains intervals from the start point of any p interval to the endpoint of any p interval, unless the endpoint of an interval in between is a hole.
    cp(I1, J2) ← p(I1, J1), p(I2, J2), J1 < J2, ¬hole(I1, J2).
    hole(I1, J2) ← p(I1, J1), p(I2, J2), p(_, K), J1 ≤ K, K < I2, ¬cep(K).
    cep(K) ← p(_, K), p(I, J), I ≤ K, K < J.
For q(5, _), p Until q holds if cp has an interval that starts at 0 and contains 5.
    p Until q(yes) ← q(0, J).
    p Until q(yes) ← cp(0, I), q(J, _), I ≥ J.
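The coalescing rules can be evaluated directly over the four p intervals of the example. The Python sketch below is one possible reading, under these assumptions: intervals are (start, end) tuples, the comparisons in the hole and cep rules are read as ≤, the comparison in the second Until rule is read as I ≥ J, and the original p intervals enter cp through a base rule cp(I, J) ← p(I, J) that is assumed here because the cp rule as transcribed requires J1 < J2. All function names are hypothetical.

```python
p = {(0, 3), (2, 4), (4, 5), (6, 8)}


def cep(p):
    """Covered endpoints: K ends some interval and lies inside some [I, J)."""
    return {k for (_, k) in p if any(i <= k < j for (i, j) in p)}


def hole(p):
    """Pairs (I1, J2) with an uncovered endpoint K such that J1 <= K < I2."""
    uncovered = {k for (_, k) in p} - cep(p)
    return {(i1, j2)
            for (i1, j1) in p
            for (i2, j2) in p
            if any(j1 <= k < i2 for k in uncovered)}


def coalesce(p):
    """cp: spans from the start of one p interval to the end of another, no hole in between."""
    holes = hole(p)
    merged = {(i1, j2)
              for (i1, j1) in p
              for (i2, j2) in p
              if j1 < j2 and (i1, j2) not in holes}
    # Assumed base rule cp(I, J) <- p(I, J); not among the transcribed rules.
    return merged | set(p)


def until(cp, q):
    """p Until q: q starts at 0, or some cp(0, I) reaches the start J of a q interval."""
    return (any(j == 0 for (j, _) in q)
            or any(start == 0 and i >= j for (start, i) in cp for (j, _) in q))


cp = coalesce(p)
until_5 = until(cp, {(5, None)})  # q starts at 5
until_6 = until(cp, {(6, None)})  # q starts at 6, beyond the hole at 5
```

On this input the merged intervals are (0, 4), (0, 5), and (2, 5); nothing spans past 5 because 5 is an endpoint that no interval covers.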

Slide 14: Relational Algebra
- Nonmonotonic (i.e., blocking) RA operators: set difference and division.
- We are left with select, project, join, and union. Can these express all FO monotonic queries?
- Some interesting temporal queries: coalesce and until.
  - They are expressible in RA (by double negation).
  - They are monotonic.
  - They cannot be expressed in nb-RA.
Theorem: RA and SQL are not nb-complete.
SQL faces two problems: (i) the exclusion of EXCEPT/NOT EXISTS, and (ii) the exclusion of aggregates.

Slide 15: E-Bay Example
- Auctions: a stream of bids on an item.
    bidStream(Item#, BidValue, Time)
- Items for which the sum of bids is > 100K:
    SELECT Item# FROM bidStream GROUP BY Item# HAVING SUM(BidValue) > 100000;
- This is a monotonic query. Thus it can be expressed in a language containing suitable query operators, but not in SQL-2. SQL-2 is not nb-complete; thus it is ill-suited for continuous queries on data streams.
- So SQL-2 is not nb-complete because of its blocking aggregates. What about relational algebra?
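A non-blocking evaluation of the bid query can be sketched as a Python generator that reports an item the moment its running total first exceeds the threshold. This is an illustration only, not ESL or any DSMS API: the function name and sample bids are made up, and the sketch assumes bid values are never negative, so an item that has crossed the threshold cannot fall back below it.

```python
from collections import defaultdict


def items_over_threshold(bid_stream, threshold=100_000):
    """Yield each item once, as soon as the sum of its bids exceeds threshold."""
    totals = defaultdict(int)
    reported = set()
    for item, bid_value, _time in bid_stream:
        totals[item] += bid_value
        if totals[item] > threshold and item not in reported:
            reported.add(item)
            yield item


bids = [
    ("A", 60_000, 1),
    ("B", 100_000, 2),  # exactly 100K: not yet over the threshold
    ("A", 50_000, 3),   # A reaches 110K and is reported here
    ("B", 1, 4),        # B reaches 100,001 and is reported here
]
over = list(items_over_threshold(bids))
```

Each answer is emitted without waiting for the end of the stream, and no emitted answer is ever retracted, which is what makes the query usable as a continuous query.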

Slide 16: Incompleteness of Relational QL
- The coalesce and until queries:
  - can be expressed in safe nonrecursive Datalog; thus
  - they are expressible in RA,
  - they are monotonic,
  - they cannot be expressed in nb-RA.
Theorem: RA and SQL are not nb-complete.
- A new limitation for DB query languages (which were already severely challenged in terms of expressive power).

Slide 17: Embedding SQL Queries in a PL
- In DB applications, SQL can be embedded in a PL (Java, C++, …) where the PL accesses the tuples returned by SQL using a Get Next of Cursor statement.
- Operations that could not be expressed in SQL can then be expressed in the PL:
  - an effective remedy for the lack of expressive power of SQL.
- But cursors are a 'pull-based' mechanism and cannot be used on data streams: the DSMS cannot hold tuples until the PL requests them.
- The DSMS can only deliver its output to the PL as a stream. This is OK to drive a GUI. But if most of the work has not been done yet, who needs the DSMS?
  - Contrast this with DBMSs, which are useful even with a weak QL.

Slide 18: Reviewing the Situation
- SQL's lack of expressive power is a major problem for database-centric applications.
- These problems are significantly more serious for data streams, since:
  - only monotonic queries can be used,
  - actually, not even all the monotonic ones, since SQL is not nb-complete,
  - these problems cannot really be solved by using PLs with embedded SQL statements on streams.
- DSMS will be impaired unless significant improvements can be made.

Slide 19: UDAs to the Rescue
- Full support for UDAs with all window combinations; effective on UDAs written in SQL, PLs, and even built-ins
- Support for continuous queries and ad hoc queries, under a simple and unified semantics
- Turing completeness: all possible queries
- nb-completeness: all monotonic queries using only non-blocking operators (e.g., window UDAs and those without TERMINATE)
- Effective on a broad range of data-intensive applications: data/stream mining, approximate queries, sequential patterns (XML not there)
- Making a strong case for the DB-oriented approach to data streams.

Slide 20: Conclusion
- Language Technology:
  - ESL: a very powerful language for data stream and DB applications
  - Simple semantics and unified syntax conforming to SQL:2003 standards
  - Strong case for the DB-oriented approach to data streams
- System Technology:
  - Some performance-oriented techniques are well developed, e.g., buffer management for windows
  - For others, work is still in progress; stay tuned for the latest news
- Stream Mill is up and running: http://wis.cs.ucla.edu/stream-mill

Slide 21: References
[1] ATLaS user manual. http://wis.cs.ucla.edu/atlas.
[2] SQL/LPP: A Time Series Extension of SQL Based on Limited Patience Patterns, volume 1677 of Lecture Notes in Computer Science. Springer, 1999.
[4] A. Arasu, S. Babu, and J. Widom. An abstract semantics and concrete language for continuous queries over streams and relations. Technical report, Stanford University, 2002.
[5] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002.
[9] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong Kong, China, 2002.
[10] J. Celko. SQL for Smarties, chapter Advanced SQL Programming. Morgan Kaufmann, 1995.
[11] S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In VLDB, 2002.
[12] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In SIGMOD, pages 379-390, May 2000.
[13] C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk, and O. Spatscheck. Gigascope: A stream database for network applications. In SIGMOD Conference, pages 647-651. ACM Press, 2003.
[14] Lukasz Golab and M. Tamer Özsu. Issues in data stream management. ACM SIGMOD Record, 32(2):5-14, 2003.
[15] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997.
[16] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, and Carlo Zaniolo. A Data Stream Language and System Designed for Power and Extensibility. Proc. of the ACM 15th Conference on Information and Knowledge Management (CIKM'06), 2006.
[17] Yijian Bai, Hetal Thakkar, Haixun Wang, and Carlo Zaniolo. Optimizing Timestamp Management in Data Stream Management Systems. ICDE 2007.

Slide 22: References (Cont.)
[18] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Query Languages and Data Models for Database Sequences and Data Streams. VLDB 2004: 492-503.
[19] Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously adaptive continuous queries over streams. In SIGMOD, pages 49-61, 2002.
[20] R. Motwani, J. Widom, A. Arasu, B. Babcock, M. Datar, S. Babu, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data stream management system. In First CIDR 2003 Conference, Asilomar, CA, 2003.
[21] R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SRQL: Sorted relational query language, 1998.
[23] Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, and Jafar Adibi. A sequential pattern query language for supporting instant data mining for e-services. In VLDB, pages 653-656, 2001.
[24] Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in database systems. In PODS, Santa Barbara, CA, May 2001.
[25] P. Seshadri. Predator: A resource for database research. SIGMOD Record, 27(1):16-20, 1998.
[26] P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In ICDE, pages 232-239, Taipei, Taiwan, March 1995.
[27] Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. Sequence query processing. In ACM SIGMOD 1994, pages 430-441. ACM Press, 1994.
[28] M. Sullivan. Tribeca: A stream database manager for network traffic analysis. In VLDB, 1996.
[29] D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In SIGMOD, pages 321-330, June 1992.
[30] Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowl. Data Eng., 15(3):555-568, 2003.
[31] Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data mining. In Proceedings of Third SIAM Int. Conference on Data Mining, pages 130-141, 2003.

Slide 23: DSMS Research Projects
- Aurora (Brandeis/Brown/MIT): http://www.cs.brown.edu/research/aurora/
- Cougar (Cornell): http://www.cs.cornell.edu/database/cougar/
- Telegraph (Berkeley): http://telegraph.cs.berkeley.edu
- STREAM (Stanford): http://www-db.stanford.edu/stream
- Niagara (OGI/Wisconsin): http://www.cs.wisc.edu/niagara/
- OpenCQ (Georgia Tech): http://disl.cc.gatech.edu/CQ
- Tapestry (Xerox): electronic documents stream filtering
- Hancock (AT&T): http://www.research.att.com/~kfisher/hancock/
- Cape (WPI): http://davis.wpi.edu/dsrg/CAPE/home.html
- Tribeca (Bellcore): network monitoring
- Stream Mill (UCLA): http://wis.cs.ucla.edu/stream-mill
- Gigascope …

Slide 24: CQLs for DSMS
- Most DSMS projects use SQL for continuous queries, for good reasons, since:
  - Many applications span data streams and DB tables
  - A CQL based on SQL will be easier to learn and use
  - Moreover: the fewer the differences the better!
- But DBMSs were designed for persistent data and transient queries, not for persistent queries on transient data
- Adaptation of SQL and its enabling technology presents difficult research challenges
- These combine with traditional SQL problems, such as the inability to deal with sequences, DM tasks, and other complex query tasks, i.e., lack of expressive power

Slide 25: Language Problems
- Most DSMS projects use SQL: queries spanning both data streams and DBs will be easier. But …
- Even for persistent data, SQL is far from perfect. Important application areas that are poorly supported include:
  - Data mining, and we need to mine data streams,
  - Sequence queries, and data streams are infinite time series!
- Major new problems for SQL on data stream applications. (After all, it was designed for persistent data on secondary store, not for streaming data.)
  - Only nonblocking operators in DSMS: blocking is forbidden
  - The distinction is not clear in DBMSs, which often use blocking implementations for nonblocking operators
  - The distinction needs to be formally characterized,
  - and so does the loss of query power of the QL.

