Published by Rüdiger Bauer. Modified over 5 years ago.
1
Continuous Query Languages (CQL): Blocking Operators and the Expressive Power Problem
Carlo Zaniolo, UCLA CSD, 2017
2
CQLs for DSMS
Most DSMS projects use SQL for continuous queries, for good reasons:
- Many applications span data streams and DB tables.
- A CQL based on SQL will be easier to learn and use.
- Moreover, the fewer the differences the better.
But DBMSs were designed for persistent data and transient queries, not for persistent queries on transient data.
Adapting SQL and its enabling technology presents difficult research challenges.
These combine with traditional SQL problems, such as the inability to deal with sequences, data mining (DM) tasks, and other complex query tasks, i.e., a lack of expressive power.
3
Language Problems
Most DSMSs use SQL: queries spanning both data streams and DBs will be easier. But:
Even for persistent data, SQL is far from perfect. Important application areas that are poorly supported include:
- Data mining, and we need to mine data streams.
- Sequence queries, and data streams are unbounded sequences.
There are major new problems for SQL on data stream applications. (After all, it was designed for persistent data on secondary storage, not for streaming data.)
- Only nonblocking operators can be used in a DSMS: blocking operators are forbidden.
- The distinction is not clear in DBMSs, which often use blocking implementations for nonblocking operators.
- The distinction needs to be formally characterized, and so does the loss of query power that it causes for CQLs.
4
Blocking Operators
A blocking query operator is 'one that is unable to produce the first tuple of the output until it has seen the entire input' [Babcock et al. PODS02].
But continuous queries cannot wait for the end of the stream: they must return results while the data is streaming in, so blocking operators cannot be used.
Only nonblocking (nb) queries and operators can be used on data streams, i.e., those that return their results before they have detected the end of the input.
Current DBMSs make heavy use of blocking computations:
1. For operators that are intrinsically blocking.
2. For those that are not, i.e., operators that are only implemented that way.
To exclude 1, we need a characterization of blocking and nonblocking that is independent of implementation.
5
Partial Ordering
Let S = [t_1, …, t_n] be a sequence and 0 ≤ k ≤ n.
Then S_k = [t_1, …, t_k] is said to be the presequence of S of length k, for k > 0. Also, S_0 = [ ] denotes the empty sequence.
L ⊑ S denotes that L is a presequence of S.
⊑ defines a partial order: it is reflexive, antisymmetric, and transitive.
The notion of subset is different from that of presequence: for sets, order and duplicates are immaterial.
The empty sequence [ ] is a presequence of every other sequence.
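The presequence relation is the prefix relation on lists. Here is a minimal Python sketch; the function names `presequence` and `is_presequence` are illustrative and do not come from the slides.

```python
def presequence(s, k):
    """Return S_k, the presequence of s of length k (0 <= k <= len(s))."""
    if not 0 <= k <= len(s):
        raise ValueError("k must satisfy 0 <= k <= len(s)")
    return s[:k]


def is_presequence(l, s):
    """Return True when list l is a presequence (prefix) of list s."""
    return len(l) <= len(s) and s[:len(l)] == l


S = ["a", "b", "c"]
print(presequence(S, 2))               # the first two tuples of S
print(is_presequence([], S))           # the empty sequence precedes every sequence
print(is_presequence(["b", "a"], S))   # order matters for presequences
print(set(["b", "a"]) <= set(S))       # but not for subsets
```

The last two lines show the difference between the two notions: `["b", "a"]` is a subset of `S` when both are read as sets, but it is not a presequence of `S`.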
6
Operators on Sequences: S → G → G(S)
G_j(S) denotes the cumulative output produced up to and including the j-th input tuple; S_j denotes the input up to step j.
Let S be a sequence of length n. Then G is said to be:
- Blocking when G_j(S) = [ ] for j < n, and G_n(S) = G(S).
- Nonblocking when G_j(S) = G(S_j) for every j ≤ n.
Here G(S) is the result of applying G to the whole of S.
Operators are viewed as incremental transducers.
7
employees(E#, Sal, ...)
Traditional SQL-2 aggregates are blocking:
    select count(E#) from employees group by Sal
SQL:2003 OLAP functions are nonblocking:
    select Sal, count(E#) over (order by Sal range unbounded preceding) from employees
Continuous count returns, for each new tuple, the count so far. On a sequence of length n, at each step j ≤ n the count up to j is returned, so the cumulative output is
count_1(S) = [1], count_2(S) = [1, 2], ..., count_j(S) = [1, 2, …, j],
independently of whether j = n or j < n.
Traditional count, cumulative return:
For each j < n: nothing, count_j(S) = [ ].
Final: count_n(S) = [n].
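The two cumulative-return behaviours can be simulated with a small transducer interface. This is only a sketch: the class names, the `push`/`end` methods, and `cumulative_output` are assumptions made for illustration, not part of any DSMS API. `push` is called once per input tuple and `end` once when the input is known to be finished.

```python
class ContinuousCount:
    """Nonblocking count: emits the count so far on every input tuple."""

    def __init__(self):
        self.n = 0

    def push(self, _tuple):
        self.n += 1
        return [self.n]

    def end(self):
        return []


class TraditionalCount:
    """Blocking count: emits nothing until the end of the input."""

    def __init__(self):
        self.n = 0

    def push(self, _tuple):
        self.n += 1
        return []

    def end(self):
        return [self.n]


def cumulative_output(op_class, s, j):
    """Compute G_j(S): the output produced after the first j tuples of s.

    The end-of-input signal is delivered only when j equals len(s).
    """
    op = op_class()
    out = []
    for t in s[:j]:
        out.extend(op.push(t))
    if j == len(s):
        out.extend(op.end())
    return out


S = ["e1", "e2", "e3"]
for j in range(len(S) + 1):
    print(j, cumulative_output(ContinuousCount, S, j),
          cumulative_output(TraditionalCount, S, j))
```

For the continuous count, the output at step j depends only on the first j tuples. The traditional count stays empty until the final step.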
8
Examples
Selection is nonblocking.
Projection is nonblocking, even if we eliminate the resulting duplicates.
Traditional SQL-2 aggregates are blocking (for arbitrarily ordered input).
SQL:2003 OLAP functions are not: e.g., continuous count, sum, max, etc. (i.e., the unbounded preceding versions of the OLAP functions) are nonblocking.
Intermediate cases are also possible.
9
Characterization of NonBlocking (NB)
Theorem: A query can be expressed via nonblocking computations iff it is monotonic w.r.t. the presequence ordering.
Proof.
NB G implies monotonic G: We need to prove that if S_j ⊑ S_k then G(S_j) ⊑ G(S_k). Since j ≤ k, it is always true that G_j(S_k) ⊑ G_k(S_k). But if G is NB, then G_j(S_k) = G_j(S_j) = G(S_j) and G_k(S_k) = G(S_k). QED.
Monotonic G implies NB G: the incremental transducer for G, at step j+1, adds the difference between G(S_{j+1}) and G(S_j).
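The "monotonic implies NB" direction is constructive, and the construction can be sketched in a few lines of Python. The name `nb_steps` is illustrative. The sketch recomputes G on the whole presequence at every step, which is deliberately naive: it shows that a nonblocking computation exists, not how to implement one efficiently.

```python
def nb_steps(g, stream):
    """Drive a batch function g as an incremental transducer.

    g maps a list (a presequence S_j) to a list G(S_j). The first value
    yielded is G(S_0); each later value is the difference between
    G(S_{j+1}) and G(S_j). A ValueError is raised if g turns out not to
    be monotonic w.r.t. the presequence ordering.
    """
    seen = []
    emitted = list(g(seen))
    yield emitted
    for t in stream:
        seen.append(t)
        result = list(g(seen))
        if result[:len(emitted)] != emitted:
            raise ValueError("g is not monotonic on presequences")
        yield result[len(emitted):]
        emitted = result


def select_even(s):
    """Selection: monotonic, so it can be run step by step."""
    return [x for x in s if x % 2 == 0]


def running_count(s):
    """Continuous count: G(S_j) = [1, 2, ..., j]."""
    return list(range(1, len(s) + 1))


def total_count(s):
    """Traditional count: G(S) = [len(S)], which is not monotonic."""
    return [len(s)]


print(list(nb_steps(select_even, [1, 2, 4])))
print(list(nb_steps(running_count, ["a", "b"])))
```

Passing `total_count` to `nb_steps` raises `ValueError` on the first tuple, because `[0]` is not a presequence of `[1]`.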
10
NonBlocking Iff Monotonic
The theorem generalizes from presequences to sets, i.e., presequences where duplicates are not allowed and order is immaterial. In fact, S1 is a subset of S2 iff S1 is a presequence of S2 after proper reordering and elimination of duplicates.
NB = monotonic: e.g., selection, projection, and OLAP functions.
Blocking = nonmonotonic: e.g., traditional aggregates.
The results hold for operators of more than one argument:
- Joins are monotonic (i.e., NB) in both arguments.
- R - S is monotonic on R and antimonotonic on S: i.e., it will block on S but not on R (but it will unblock on R only after it has seen the whole of S).
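The behaviour of set difference can be checked on small sets. In the sketch below, `difference` and `stream_difference` are illustrative names. `stream_difference` assumes that S is already complete, which is the condition under which R - S becomes nonblocking on R.

```python
def difference(r, s):
    """Batch set difference R - S."""
    return set(r) - set(s)


def stream_difference(r_stream, s_all):
    """Emit R - S tuple by tuple on R, given the whole of S up front."""
    s_set = set(s_all)
    seen = set()
    for t in r_stream:
        if t not in s_set and t not in seen:
            seen.add(t)
            yield t


R_small, R_big = {1, 2}, {1, 2, 3}
S_small, S_big = {2}, {2, 3}

# Monotonic on R: growing R can only add result tuples.
print(difference(R_small, S_small) <= difference(R_big, S_small))
# Antimonotonic on S: growing S can only remove result tuples.
print(difference(R_big, S_big) <= difference(R_big, S_small))
# Once S is fully known, the difference can be streamed on R.
print(list(stream_difference([1, 2, 3, 1], [2])))
```

Because a later S tuple could retract a tuple that was already emitted, nothing can be returned safely while S is still arriving.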
11
NB-Completeness
A query language L can express a given set of functions on its input (DB, sequences, data streams).
Nonmonotonic functions are intrinsically blocking, so they cannot be used on data streams.
For continuous queries on data streams, we should disallow blocking (i.e., nonmonotonic) operators and constructs and allow only nonblocking (i.e., monotonic) operators: nb-operators for short.
But can ALL the monotonic functions expressible by L be expressed using only its nb-operators? Or did we also lose some monotonic queries?
Definition: If L, using only its nb-operators, can express all the monotonic queries expressible in L, then L is said to be NB-complete.
12
Expressive Power and NB-Completeness
Consider a (DB) language L. The expressive power of L is the set of functions F that can be computed on the DB using its operators (or constructs).
On data streams, we are only interested in the monotonic functions: F' ⊆ F.
Also, let O be the set of operators of L, and let O' ⊆ O be the subset of those operators that are monotonic.
L is said to be NB-complete if all functions in F' can be expressed using only the operators in O'.
NB-completeness is a test that O is as suitable for continuous queries on data streams as it is for queries on the database.
If L is not NB-complete, then there exist monotonic functions that L can express on data stored in the DB but can no longer express on the same data presented as a stream.
13
Is SQL NB-complete?
E-Bay example. Auctions: a stream of positive bids on an item.
bidStream(Item#, BidValue, Time)
Items for which the sum of bids is > 100K:
    SELECT Item# FROM bidStream GROUP BY Item# HAVING SUM(BidValue) > 100000;
This is a monotonic query, since bids are positive. Thus it can be expressed in a language containing suitable nonblocking query operators.
But it cannot be expressed in SQL-2 without its blocking aggregates.
So SQL-2 is not nb-complete because of its blocking aggregates; thus it is ill-suited for continuous queries on data streams.
What about RA without aggregates?
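To see why the bid query is monotonic, it helps to write it as a nonblocking computation. The Python sketch below is an illustration, not SQL-2 and not any particular DSMS. The function name, the tuple layout `(item, bid_value, time)`, and the default threshold of 100,000 are assumptions taken from the example above. Each item is emitted once, as soon as its running sum crosses the threshold. Because bids are positive, an item that has been emitted never needs to be retracted.

```python
from collections import defaultdict


def items_over_threshold(bid_stream, threshold=100_000):
    """Yield each item as soon as the sum of its bids exceeds threshold."""
    totals = defaultdict(int)
    reported = set()
    for item, bid_value, _time in bid_stream:
        if bid_value <= 0:
            # Monotonicity relies on bids being positive.
            raise ValueError("bids must be positive")
        totals[item] += bid_value
        if totals[item] > threshold and item not in reported:
            reported.add(item)
            yield item


bids = [
    ("vase", 60_000, 1),
    ("lamp", 10_000, 2),
    ("vase", 50_000, 3),   # the running sum for "vase" is now 110,000
    ("vase", 5_000, 4),
]
print(list(items_over_threshold(bids)))
```

If negative bids were allowed, a sum could drop back below the threshold and the query would no longer be monotonic.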
14
Relational Algebra (RA)
Set difference can produce monotonic queries. Are these still expressible without set difference?
Intersection is monotonic: R1 ∩ R2 = R1 - (R1 - R2). But intersection can also be expressed as a join (product + select), so it is not lost if we disallow set difference.
But interval coalescing and Until queries are monotonic queries that can be expressed in RA but not in nb-RA.
Example: the temporal domain is isomorphic to the nonnegative integers. Intervals are closed on the left and open on the right:
p(0, 3) % 0, 1, and 2 are in p but 3 is not
p(2, 4) % 3 is not a hole because it is covered by this interval
p(4, 5) % 5 is a hole because it is not covered by any other interval
p(6, 8)
15
Coalesce p (cp) & p Until q
p(0, 3). p(2, 4). p(4, 5). p(6, 8).
cp(0, 3). cp(2, 4). cp(4, 5). cp(6, 8). cp(0, 4). cp(2, 5). cp(0, 5).
cp contains intervals from the start point of any p interval to the endpoint of any p interval, unless the endpoint of some interval in between is a hole.
cp(I1, J2) ← p(I1, J1), p(I2, J2), J1 < J2, ¬hole(I1, J2).
hole(I1, J2) ← p(I1, J1), p(I2, J2), p(_, K), J1 ≤ K, K < I2, ¬cep(K).
cep(K) ← p(_, K), p(I, J), I ≤ K, K < J.
For example, if q(5, _) holds, then p Until q holds if cp has an interval that starts at 0 and contains 5.
pUntil q(yes) ← q(0, J).
pUntil q(yes) ← cp(0, I), q(J, _), I ≥ J.
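The rules can be evaluated directly on the example facts. The Python sketch below is a batch transcription of the three cp rules plus the two Until rules. It is a blocking computation, since it needs the whole of p before it can test the negated goals. The function names are illustrative. One assumption is made explicitly: the cp rule is evaluated with J1 ≤ J2 rather than J1 < J2, so that the original p intervals also appear in cp, as they do in the listed cp facts.

```python
def coalesce(p):
    """Evaluate the cp / hole / cep rules over a set of (start, end) pairs."""
    p = set(p)
    endpoints = {j for _, j in p}
    # cep(K): endpoint K is covered by some interval [I, J) of p.
    cep = {k for k in endpoints if any(i <= k < j for i, j in p)}

    def hole(i1, j2):
        # hole(I1, J2): an uncovered endpoint K lies between an interval
        # starting at I1 and an interval ending at J2.
        return any(
            j1 <= k < i2 and k not in cep
            for (a, j1) in p if a == i1
            for (i2, b) in p if b == j2
            for k in endpoints
        )

    # Assumption: J1 <= J2 (not strict), so that p itself is included in cp.
    return {
        (i1, j2)
        for (i1, j1) in p
        for (i2, j2) in p
        if j1 <= j2 and not hole(i1, j2)
    }


def p_until_q(p, q):
    """Evaluate the two Until rules: q holds at 0, or cp(0, I) reaches q."""
    if any(start == 0 for start, _ in q):
        return True
    cp = coalesce(p)
    return any(i0 == 0 and end >= j for (i0, end) in cp for (j, _) in q)


p = {(0, 3), (2, 4), (4, 5), (6, 8)}
print(sorted(coalesce(p)))
print(p_until_q(p, {(5, 9)}))
print(p_until_q(p, {(7, 9)}))
```

On the example facts this reproduces the seven cp intervals listed above. The pairs (0, 8), (2, 8), and (4, 8) are rejected because 5 is an uncovered endpoint.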
16
Relational Algebra
Nonmonotonic (i.e., blocking) RA operators: set difference and division.
We are left with select, project, join, and union. Can these express all FO monotonic queries?
Some interesting temporal queries: coalesce and until.
- They are expressible in RA (by double negation).
- They are monotonic.
- But they cannot be expressed in NB-RA.
Theorem: RA and SQL are not NB-complete.
SQL faces two problems: (i) the exclusion of EXCEPT/NOT EXISTS, and (ii) the exclusion of aggregates.
17
Real Applications Require REAL Power
SQL's lack of expressive power is a major problem for database-centric applications.
These problems are significantly more serious for data streams, since:
- Only monotonic queries can be used.
- Actually, not even all the monotonic ones, since SQL is not nb-complete.
- These problems cannot be solved by embedding SQL statements in a PL program (see the next slide).
18
Embedding SQL Queries in a PL
In DB applications, SQL can be embedded in a PL (Java, C++, ...), where the PL accesses the tuples returned by SQL using a 'Get Next of Cursor' statement.
Operations that could not be expressed in SQL can then be expressed in the PL: an effective remedy for SQL's lack of expressive power.
But cursors are a 'pull-based' mechanism and cannot be used on data streams: the DSMS cannot hold tuples until the PL requests them.
The DSMS can only deliver its output to the PL as a stream.
This might be OK for simple situations. But if the core of the work has not been done yet, the PL system must do the actual DSMS work.
Conclusion: to support applications of any complexity, we must have a DSMS with real expressive power, as opposed to DBMSs, which are useful even with a weak QL.
19
Real Applications Require Real Power
Embedding CQL in PL programs does not work well ... BUT embedding PL programs in CQL works:
User-Defined Functions with BLOBs: good for DBMSs, but DSMSs require incremental computation.
User-Defined Aggregates (UDAs):
- Incremental computation model.
- Can be defined using a PL or SQL itself.
- With natively defined UDAs, SQL becomes Turing-complete.
- And NB-complete: it can express all monotonic functions.
- Simple syntactic characterization for NB aggregates.
- Effective on a broad range of data-intensive applications, KDD in particular.
A few extensions are still needed (more later).
20
Why UDAs are Important
We have seen how new aggregates can be defined by the initialize, iterate, terminate scheme, using SQL itself (native UDAs) or an external language (C++, Java, etc.).
Theorem [Law-Wang-Zaniolo 2011]: SQL with natively defined UDAs is Turing-complete.
With nonblocking UDAs, SQL becomes NB-complete: it can express all monotonic computable functions on a single stream.
It is also complete on multiple streams if union becomes a sort-merge operator.
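The initialize, iterate, terminate scheme can be mimicked in an external language. The Python sketch below is a hypothetical rendering, not the syntax of any actual UDA facility: the class names, the method signatures, and the `run_uda` driver are all assumptions. Each method returns the list of tuples it emits. In this sketch, the blocking aggregate emits only in `terminate`, and the nonblocking one emits in `initialize` and `iterate` and returns nothing from `terminate`.

```python
class BlockingAvg:
    """Traditional average: a single result, produced at end of input."""

    def initialize(self, v):
        self.total, self.n = v, 1
        return []

    def iterate(self, v):
        self.total += v
        self.n += 1
        return []

    def terminate(self):
        return [self.total / self.n]


class ContinuousAvg:
    """Nonblocking average: the average so far after every tuple."""

    def initialize(self, v):
        self.total, self.n = v, 1
        return [self.total / self.n]

    def iterate(self, v):
        self.total += v
        self.n += 1
        return [self.total / self.n]

    def terminate(self):
        return []


def run_uda(uda, values):
    """Feed values to a UDA: initialize on the first, iterate on the rest."""
    out = []
    it = iter(values)
    try:
        first = next(it)
    except StopIteration:
        return out
    out.extend(uda.initialize(first))
    for v in it:
        out.extend(uda.iterate(v))
    out.extend(uda.terminate())
    return out


print(run_uda(BlockingAvg(), [2, 4, 6]))
print(run_uda(ContinuousAvg(), [2, 4, 6]))
```

On an unbounded stream `terminate` is never reached, so only the second style can return anything.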
21
References
D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong Kong, China, 2002.
Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, and Carlo Zaniolo. A Data Stream Language and System Designed for Power and Extensibility. In Proc. of the ACM 15th Conference on Information and Knowledge Management (CIKM'06), 2006.
Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Query Languages and Data Models for Database Sequences and Data Streams. In VLDB, 2004.
Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data mining. In Proceedings of the Third SIAM Int. Conference on Data Mining, 2003.
Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Relational languages and data models for continuous queries on sequences and data streams. ACM Trans. Database Syst. 36(2): 8:1-8:32 (2011).