Continuous Query Languages (CQL) Blocking Operators and the expressive power problem Carlo Zaniolo UCLA CSD 2017.



CQLs for DSMS
Most DSMS projects use SQL for continuous queries, for good reasons:
Many applications span data streams and DB tables.
A CQL based on SQL will be easier to learn and use.
Moreover, the fewer the differences the better!
But DBMSs were designed for persistent data and transient queries, not for persistent queries on transient data.
Adapting SQL and its enabling technology presents difficult research challenges.
These combine with traditional SQL problems, such as the inability to deal with sequences, data mining (DM) tasks, and other complex query tasks, i.e., a lack of expressive power.

Language Problems
Most DSMSs use SQL, which makes queries spanning both data streams and DBs easier. But ...
Even for persistent data, SQL is far from perfect. Important application areas that are poorly supported include:
Data mining, and we need to mine data streams.
Sequence queries, and data streams are unbounded sequences!
There are major new problems for SQL on data stream applications. (After all, it was designed for persistent data on secondary store, not for streaming data.)
Only nonblocking operators can be used in a DSMS: blocking operators are forbidden.
The distinction is not clear in DBMSs, which often use blocking implementations for nonblocking operators.
The distinction needs to be formally characterized, and so does the loss of query power that it causes for CQLs.

Blocking Operators
A blocking query operator is ‘one that is unable to produce the first tuple of the output until it has seen the entire input’ [Babcock et al. PODS02].
But continuous queries cannot wait for the end of the stream: they must return results while the data is streaming in, so blocking operators cannot be used.
Only nonblocking (nb) queries and operators can be used on data streams (i.e., those that return their results before they have detected the end of the input).
Current DBMSs make heavy use of blocking computations:
1. For operators that are intrinsically blocking.
2. For operators that are not intrinsically blocking, i.e., they are only implemented that way.
To exclude 1, we need to find a characterization of blocking and nonblocking that is independent of implementation.

Partial Ordering
Let $S = [t_1, \ldots, t_n]$ be a sequence and $0 \le k \le n$. Then $S_k = [t_1, \ldots, t_k]$ is said to be the presequence of $S$ of length $k$; for $k = 0$, $S_0 = [\,]$ denotes the empty sequence.
$L \sqsubseteq S$ denotes that $L$ is a presequence of $S$.
$\sqsubseteq$ defines a partial order: it is reflexive, antisymmetric, and transitive.
The notion of subset is different from that of presequence: for sets, order and duplicates are immaterial.
The empty sequence $[\,]$ is a presequence of every other sequence.
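
The difference between presequence and subset can be checked directly. The following is a minimal sketch, assuming sequences are modeled as Python lists; the function name `is_presequence` is hypothetical and not part of the slides.

```python
def is_presequence(l, s):
    """Return True if list l is a presequence (prefix) of list s."""
    return len(l) <= len(s) and s[:len(l)] == l

s = ["a", "b", "a"]

# The empty sequence is a presequence of every sequence.
print(is_presequence([], s))
# The first two tuples of s form a presequence of s.
print(is_presequence(["a", "b"], s))
# ["b", "a"] uses only elements of s, but in the wrong position.
print(is_presequence(["b", "a"], s))
# As sets, order and duplicates are immaterial, so the subset test succeeds.
print(set(["b", "a"]) <= set(s))
```

The last two lines show the point made on the slide: the same elements pass the subset test but fail the presequence test.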

Operators on Sequences: $S \to G \to G(S)$
Operators are viewed as incremental transducers.
$G(S)$ is the result of applying $G$ to the whole of $S$.
$S_j$ is the input up to step $j$, and $G_j(S)$ denotes the cumulative output produced up to and including the $j$-th input tuple.
Let $S$ be a sequence of length $n$. Then $G$ is said to be:
Blocking when $G_j(S) = [\,]$ for $j < n$, and $G_n(S) = G(S)$.
Nonblocking when $G_j(S) = G(S_j)$ for every $j \le n$.

employees(E#, Sal, ...)
Traditional SQL-2 aggregates are blocking:
    select count(E#) from employees group by Sal
SQL:2003 OLAP functions are nonblocking:
    select Sal, count(E#) over (range unbounded preceding) from employees order by Sal
Continuous count returns, for each new tuple, the count so far.
Continuous count, cumulative return: on a sequence of length $n$, at each step $j \le n$ the count up to $j$ is returned: $count_1(S) = [1]$, $count_2(S) = [1, 2]$, ..., $count_j(S) = [1, 2, \ldots, j]$, independent of whether $j = n$ or $j < n$.
Traditional count, cumulative return: for each $j < n$ nothing is returned, $count_j(S) = [\,]$; at the end, $count_n(S) = [n]$.
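
The two cumulative returns can be tabulated side by side. This is a sketch only, assuming the input is a Python list; the function names are hypothetical, and the traditional count is given the total length explicitly to model "has seen the end of the input".

```python
def continuous_count(prefix):
    """Cumulative output of the continuous count after consuming `prefix`."""
    return list(range(1, len(prefix) + 1))

def traditional_count(prefix, total_length):
    """Cumulative output of the traditional count: empty until the input ends."""
    return [len(prefix)] if len(prefix) == total_length else []

S = ["t1", "t2", "t3", "t4"]
n = len(S)

# One row per step j: j, continuous count so far, traditional count so far.
for j in range(n + 1):
    print(j, continuous_count(S[:j]), traditional_count(S[:j], n))
```

The continuous count depends only on the prefix seen so far, which is the nonblocking condition; the traditional count needs to know that the input has ended, which is what makes it blocking.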

Examples
Selection is nonblocking.
Projection is nonblocking, even if we eliminate resulting duplicates.
Traditional SQL-2 aggregates are blocking (for arbitrarily ordered input).
SQL:2003 OLAP functions are not: e.g., continuous count, sum, max, etc. (i.e., the unbounded preceding versions of the OLAP functions) are nonblocking.
Intermediate cases are also possible.

Characterization of NonBlocking (NB)
Theorem: Queries can be expressed via nonblocking computations iff they are monotonic w.r.t. the presequence ordering.
Proof:
NB G implies monotonic G: we need to prove that if $S_j \sqsubseteq S_k$ then $G(S_j) \sqsubseteq G(S_k)$. Since $j \le k$, it is always true that $G_j(S_k) \sqsubseteq G_k(S_k)$, because the cumulative output only grows. But if $G$ is NB then $G_j(S_k) = G_j(S_j) = G(S_j)$ and $G_k(S_k) = G(S_k)$. QED
Monotonic G implies NB G: ... the incremental $G$ transducer, at step $j+1$, adds the difference between $G(S_{j+1})$ and $G(S_j)$.
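
The second half of the proof describes a construction that can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: `G` is any function from a list to a list, the names `incremental`, `select_even`, and `total` are hypothetical, and `G` is naively recomputed on every prefix for clarity rather than efficiency.

```python
def incremental(G, stream):
    """Yield, for each input tuple, the output tuples that G adds at that step.

    Assumes G is monotonic w.r.t. presequences; raises ValueError otherwise.
    """
    prefix = []
    produced = list(G(prefix))  # output owed on the empty input
    for t in stream:
        prefix.append(t)
        out = list(G(prefix))
        # The old output must be a presequence of the new output.
        if out[:len(produced)] != produced:
            raise ValueError("G is not monotonic on this input")
        yield out[len(produced):]  # difference between G(S_{j+1}) and G(S_j)
        produced = out

def select_even(s):
    """Selection: monotonic, hence nonblocking."""
    return [x for x in s if x % 2 == 0]

def total(s):
    """Traditional sum: not monotonic, hence blocking."""
    return [sum(s)]

print(list(incremental(select_even, [1, 2, 3, 4])))  # [[], [2], [], [4]]
```

Running `incremental(total, [1, 2])` raises `ValueError` on the first tuple, because the earlier output `[0]` is not a presequence of `[1]`.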

NonBlocking Iff Monotonic
The theorem generalizes from presequences to sets, i.e., presequences where duplicates are not allowed and order is immaterial. In fact, $S_1$ is a subset of $S_2$ iff $S_1$ is a presequence of $S_2$ after proper reordering and elimination of duplicates.
NB = monotonic: e.g., selection, projection, and OLAP functions.
Blocking = nonmonotonic: e.g., traditional aggregates.
The results hold for operators of more than one argument:
Joins are monotonic (i.e., NB) in both arguments.
$R - S$ is monotonic in $R$ and antimonotonic in $S$, i.e., it will block on $S$ but not on $R$ (but it will unblock on $R$ only after it has seen the whole of $S$!).
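
The behavior of set difference in each argument can be seen on a tiny example. This sketch assumes relations are modeled as Python sets; the relation names and contents are made up for illustration.

```python
def difference(R, S):
    """Set difference R - S on Python sets."""
    return R - S

R_small, R_big = {1, 2}, {1, 2, 3}
S_small, S_big = {2}, {2, 3}

# Growing R can only add tuples to R - S (monotonic in R).
print(difference(R_small, S_small), difference(R_big, S_small))

# Growing S can only remove tuples from R - S (antimonotonic in S):
# tuple 3, emitted after seeing only S_small, would have to be retracted.
print(difference(R_big, S_small), difference(R_big, S_big))
```

Since a stream operator cannot take back a tuple it has already emitted, it has to wait for the whole of S before it can safely output anything from R.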

NB-Completeness
A query language L can express a given set of functions on its input (DB, sequences, data streams).
Nonmonotonic functions are intrinsically blocking, and thus they cannot be used on data streams.
For continuous queries on data streams, we should disallow blocking (i.e., nonmonotonic) operators and constructs and only allow nonblocking (i.e., monotonic) operators: nb-operators for short.
But can ALL the monotonic functions expressible by L be expressed using only its nb-operators? Or did we also lose some monotonic queries?
Definition: If, using only its nb-operators, L can express all the monotonic queries expressible in L, then L is said to be NB-complete.

Expressive Power and NB-Completeness
Consider a (DB) language L. The expressive power of L is the set of functions $F$ that can be computed on the DB using its operators (or constructs).
On data streams, we are only interested in monotonic functions: $F' \subseteq F$.
Also let $O$ be the operators of L, and $O' \subseteq O$ be the subset of those operators that are monotonic.
L is said to be NB-complete if all functions in $F'$ can be expressed using only the operators in $O'$.
NB-completeness is a test that $O$ is as suitable for continuous queries on data streams as it is on the database.
Say that L is not NB-complete: then there exist monotonic functions that L can express on the data stored in the DB but can no longer express on the same data presented as a stream.

Is SQL NB-complete?
eBay example, auctions: a stream of positive bids on an item.
bidStream(Item#, BidValue, Time)
Items for which the sum of bids is > 100K:
    SELECT Item#
    FROM bidStream
    GROUP BY Item#
    HAVING SUM(BidValue) > 100000;
This is a monotonic query. Thus it can be expressed in a language containing suitable query operators.
But it cannot be expressed in SQL-2 using only nonblocking operators.
So SQL-2 is not nb-complete, because of its blocking aggregates; thus it is ill-suited for continuous queries on data streams.
What about RA without aggregates?
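
To see why the bid query is monotonic, it helps to evaluate it one tuple at a time. The following is a minimal sketch, assuming bids arrive as `(item, bid_value, time)` tuples; the function name and the sample bids are hypothetical.

```python
from collections import defaultdict

def items_over_threshold(bid_stream, threshold=100_000):
    """Yield each item once, as soon as its running sum of bids exceeds threshold.

    Assumes every bid value is positive, so a running sum never decreases
    and an emitted item never has to be retracted.
    """
    totals = defaultdict(int)
    reported = set()
    for item, bid_value, _time in bid_stream:
        totals[item] += bid_value
        if totals[item] > threshold and item not in reported:
            reported.add(item)
            yield item

bids = [("A", 60_000, 1), ("B", 30_000, 2), ("A", 50_000, 3), ("B", 20_000, 4)]
print(list(items_over_threshold(bids)))  # ['A']
```

The set of items emitted at the end of any finite input matches the answer of the SQL query on that input, but each item is reported as soon as it qualifies instead of after the whole input has been read.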

Relational Algebra (RA)
Set difference can produce monotonic queries: are these still expressible without set difference?
Intersection is monotonic: $R_1 \cap R_2 = R_1 - (R_1 - R_2)$.
But intersection can also be expressed as a join (product + select), so it is not lost if we disallow set difference.
But interval coalescing and Until queries are monotonic queries that can be expressed in RA but not in nb-RA.
Example: the temporal domain is isomorphic to the nonnegative integers. Intervals are closed on the left but open on the right:
    p(0, 3). % 0, 1, and 2 are in p but 3 is not
    p(2, 4). % 3 is not a hole because it is covered by this interval
    p(4, 5). % 5 is a hole because it is not covered by any other interval
    p(6, 8).

Coalesce p (cp) & p Until q
p(0, 3). p(2, 4). p(4, 5). p(6, 8).
cp(0, 3). cp(2, 4). cp(4, 5). cp(6, 8). cp(0, 4). cp(2, 5). cp(0, 5).
cp contains intervals from the start point of any p interval to the endpoint of any p interval, unless the endpoint of some interval in between is a hole.
    cp(I1, J2) ← p(I1, J1), p(I2, J2), J1 < J2, ¬hole(I1, J2).
    hole(I1, J2) ← p(I1, J1), p(I2, J2), p(_, K), J1 ≤ K, K < I2, ¬cep(K).
    cep(K) ← p(_, K), p(I, J), I ≤ K, K < J.
With q(5, _), p Until q holds if cp has an interval that starts at 0 and contains 5:
    pUntil q(yes) ← q(0, J).
    pUntil q(yes) ← cp(0, I), q(J, _), I ≥ J.
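
The rules can be evaluated by hand on the four p intervals; the sketch below does so with Python set comprehensions, one per rule. Two things here are assumptions rather than part of the slide: the rules are read literally (including the strict `J1 < J2`), and the original p intervals are unioned into the result, since on this data the literal cp rule derives only the merged intervals. The interval `q(5, 9)` is hypothetical.

```python
p = {(0, 3), (2, 4), (4, 5), (6, 8)}

# cep(K) <- p(_, K), p(I, J), I <= K, K < J.
# Endpoints of p intervals that are covered by some p interval.
cep = {k for (_, k) in p for (i, j) in p if i <= k < j}

# hole(I1, J2) <- p(I1, J1), p(I2, J2), p(_, K), J1 <= K, K < I2, not cep(K).
hole = {(i1, j2)
        for (i1, j1) in p for (i2, j2) in p for (_, k) in p
        if j1 <= k < i2 and k not in cep}

# cp(I1, J2) <- p(I1, J1), p(I2, J2), J1 < J2, not hole(I1, J2).
merged = {(i1, j2)
          for (i1, j1) in p for (i2, j2) in p
          if j1 < j2 and (i1, j2) not in hole}

# Assumption: the original intervals also belong to cp, as in the listing.
cp = p | merged
print(sorted(cp))

# pUntil q(yes) <- q(0, J).
# pUntil q(yes) <- cp(0, I), q(J, _), I >= J.
q = {(5, 9)}  # hypothetical q interval
p_until_q = any(start == 0 for (start, _) in q) or any(
    i0 == 0 and i >= j for (i0, i) in cp for (j, _) in q)
print(p_until_q)
```

On this input the covered endpoints are 3 and 4, the holes block every interval ending at 8 that starts before 6, and the resulting cp set is the seven intervals listed on the slide.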

Relational Algebra
Nonmonotonic (i.e., blocking) RA operators: set difference and division.
We are left with select, project, join, and union. Can these express all FO monotonic queries?
Some interesting temporal queries, coalesce and until:
They are expressible in RA (by double negation).
They are monotonic.
But they cannot be expressed in NB-RA.
Theorem: RA and SQL are not NB-complete.
SQL faces two problems: (i) the exclusion of EXCEPT/NOT EXISTS, and (ii) the exclusion of aggregates.

Real Applications Require REAL Power
SQL’s lack of expressive power is a major problem for database-centric applications.
These problems are significantly more serious for data streams, since:
Only monotonic queries can be used.
Actually, not even all the monotonic ones can be used, since SQL is not nb-complete.
These problems cannot be solved by embedding SQL statements in a PL program (next slide!).

Embedding SQL Queries in a PL
In DB applications, SQL can be embedded in a PL (Java, C++, ...) where the PL accesses the tuples returned by SQL using a ‘Get Next of Cursor’ statement.
Operations that could not be expressed in SQL can then be expressed in the PL: an effective remedy for the lack of expressive power of SQL.
But cursors are a ‘pull-based’ mechanism and cannot be used on data streams: the DSMS cannot hold tuples until the PL requests them!
The DSMS can only deliver its output to the PL as a stream.
This might be OK for simple situations.
But if the core of the work has not been done yet, the PL system must do the actual DSMS work!
Conclusion: to support applications of any complexity we must have a DSMS with real expressive power, as opposed to DBMSs, which are useful even with a weak QL.

Real Applications Require Real Power
Embedding CQL in PL programs does not work well ...
BUT embedding PL programs in CQL works:
User-Defined Functions with BLOBs: good for DBMSs, but DSMSs require incremental computation.
User-Defined Aggregates (UDAs):
Incremental computation model.
Can be defined using a PL or SQL itself.
With natively defined UDAs, SQL becomes Turing complete, and NB-complete: it can express all monotonic functions.
Simple syntactic characterization for NB aggregates.
Effective on a broad range of data-intensive applications, KDD in particular.
A few extensions are still needed (more later).

Why UDAs Are Important
We have seen how new aggregates can be defined by the initialize, iterate, terminate scheme, using SQL itself (native UDAs) or an external language (C++, Java, etc.).
Theorem [Law-Wang-Zaniolo 2011]: SQL with natively defined UDAs is Turing-complete.
With nonblocking UDAs, SQL becomes NB-complete: it can express all monotonic computable functions on a single stream.
It is also complete on multiple streams if union becomes a sort-merge operator.
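
The initialize, iterate, terminate scheme can be mimicked in an external language. The sketch below is written in Python and is not ATLaS or SQL syntax; the class `OnlineAverage`, the driver `run_uda`, and the convention that each method returns a list of output tuples are all assumptions made for illustration. The aggregate is nonblocking because everything it emits comes from initialize and iterate, and terminate returns nothing.

```python
class OnlineAverage:
    """Sketch of a nonblocking UDA: the running average of the values seen so far."""

    def initialize(self, value):
        # Called on the first tuple.
        self.count, self.total = 1, value
        return [self.total / self.count]

    def iterate(self, value):
        # Called on every later tuple; emits the average so far.
        self.count += 1
        self.total += value
        return [self.total / self.count]

    def terminate(self):
        # Called at end of input; nothing is held back until then.
        return []

def run_uda(uda, stream):
    """Drive a UDA over an iterable, yielding outputs as soon as they are produced."""
    first = True
    for value in stream:
        yield from (uda.initialize(value) if first else uda.iterate(value))
        first = False
    if not first:
        yield from uda.terminate()

print(list(run_uda(OnlineAverage(), [10, 20, 60])))  # [10.0, 15.0, 30.0]
```

A traditional (blocking) average would use the same state but return `[]` from initialize and iterate and emit its single result from terminate.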

References
D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong Kong, China, 2002.
Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, Carlo Zaniolo: A Data Stream Language and System Designed for Power and Extensibility. Proc. of the ACM 15th Conference on Information and Knowledge Management (CIKM'06), 2006.
Yan-Nei Law, Haixun Wang, Carlo Zaniolo: Query Languages and Data Models for Database Sequences and Data Streams. VLDB 2004: 492-503.
Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data mining. In Proceedings of Third SIAM Int. Conference on Data Mining, pages 130-141, 2003.
Yan-Nei Law, Haixun Wang, Carlo Zaniolo: Relational languages and data models for continuous queries on sequences and data streams. ACM Trans. Database Syst. 36(2): 8:1-8:32 (2011).