Continuous Query Languages for DSMS


Continuous Query Languages for DSMS CS240B Notes by Carlo Zaniolo

CQLs for DSMS
Most DSMS projects use SQL for continuous queries, and for good reasons:
  Many applications span data streams and DB tables.
  A CQL based on SQL will be easier to learn and use.
  Moreover, the fewer the differences, the better.
But DBMS were designed for persistent data and transient queries, not for persistent queries on transient data.
Adapting SQL and its enabling technology presents many research challenges.
Lack of expressive power: even worse now, since only nonblocking operators are allowed.

Continuous Query Graph: many components, arbitrary DAGs
[Figure: example query graphs in which Sources feed Sinks through operators (O1, O2, O3) such as selection (σ), union (U), and aggregates (∑1, ∑2).]

Relational Algebra Operators
Stored data:
  Selection, Projection
  Union
  Join (including X) on tables
  Set Difference
  Aggregates:
    Traditional blocking aggregates
    OLAP functions on windows or unlimited preceding
Data streams:
  Selection, Projection: same as for stored data
  Union by sort-merging on timestamps
  Join of stream with table
  Window joins on streams (timestamps merged into 1 column)
  No stream difference (blocking; difference of stream with table is OK)
  Aggregates:
    No blocking aggregates
    OLAP functions on windows or unlimited preceding
    Slides and tumbles
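The "union by sort-merging on timestamps" entry can be illustrated with a small Python sketch. It assumes each input stream is already ordered by timestamp and represents tuples as (timestamp, payload) pairs; the function name union_by_timestamp and the sample data are hypothetical, not part of ESL.

```python
import heapq

def union_by_timestamp(*streams):
    """Lazily merge timestamp-ordered streams into one timestamp-ordered stream.

    Assumption: every input is an iterable of (timestamp, payload) tuples
    that is already sorted by timestamp.
    """
    return heapq.merge(*streams, key=lambda t: t[0])

s1 = [(1, "a"), (4, "b"), (7, "c")]
s2 = [(2, "x"), (3, "y"), (9, "z")]
merged = list(union_by_timestamp(s1, s2))
print(merged)
```

A merge of this kind cannot emit a tuple until it has seen the current head of every input, which is one way to picture why a slow input delays the output of a stream union.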

Bolts and Nuts
  create stream bids(bid#, item, offer, Time)
  create stream mybids as
    (select bid#, offer, Time from bids where item='bolt'
     union
     select bid#, offer, Time from bids where item='nut')
Result same as:
  select bid#, offer, Time from bids where item='bolt' or item='nut'

Joins
We could create a stream called interesting_bids by, say, joining bids with the 'interesting_items' table.
We next find the bolt bids for which there was a nut bid offered in the last 5 minutes for the same price.
  create stream selfjoinbids as
    (select S1.bid#, S1.offer, S2.bid#, S2.Time
     from bids as S1, bids as S2 [window of 5 minutes]
     where S1.item='bolt' and S2.item='nut' and S1.offer=S2.offer)
The window condition implies that S1.Time >= S2.Time and S2.Time >= S1.Time - 5 minutes.
Windows on both streams are used very often.
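The window condition above can be sketched in Python as follows. This is only an illustration under stated assumptions: bids arrive in timestamp order, Time is measured in seconds, and the names Bid and bolt_nut_window_join are hypothetical. Tuples with equal timestamps are matched in arrival order.

```python
from collections import deque
from typing import NamedTuple

class Bid(NamedTuple):
    bid_no: int
    item: str
    offer: float
    time: int  # seconds; assumed non-decreasing across the stream

def bolt_nut_window_join(bids, window=300):
    """For each bolt bid, emit the nut bids of the last `window` seconds with the same offer."""
    recent_nuts = deque()  # nut bids still inside the window, oldest first
    for b in bids:
        # Drop nut bids with Time < current Time - window.
        while recent_nuts and recent_nuts[0].time < b.time - window:
            recent_nuts.popleft()
        if b.item == "nut":
            recent_nuts.append(b)
        elif b.item == "bolt":
            for n in recent_nuts:
                if n.offer == b.offer:
                    yield (b.bid_no, b.offer, n.bid_no, n.time)

bids = [
    Bid(1, "nut", 10.0, 0),
    Bid(2, "bolt", 10.0, 120),
    Bid(3, "nut", 5.0, 200),
    Bid(4, "bolt", 10.0, 400),
    Bid(5, "bolt", 5.0, 450),
]
matches = list(bolt_nut_window_join(bids))
print(matches)
```

Only the nut side needs to be buffered here, because the window is applied to S2 alone; a window on both streams would need a buffer per side.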

Processing Union and Joins
Special techniques are needed to process unions and joins on data streams.
The main problem is slow response while waiting to synchronize multiple data streams, i.e., idle waiting.
This will be discussed later, after we discuss the UDAs that solve the expressive power problem, as needed for more complex queries such as mining queries.

Relational Algebra Operators
Stored data:
  Selection, Projection
  Union
  Join (including X) on tables
  Set Difference
  Aggregates:
    Traditional blocking aggregates
    OLAP functions on windows or unlimited preceding
Data streams:
  Selection, Projection: same as for stored data
  Union by sort-merging on timestamps
  Join of stream with table
  Window joins on streams (timestamps merged into 1 column)
  No stream difference (blocking; difference of stream with table is OK)
  Aggregates:
    No blocking aggregates
    OLAP functions on windows or unlimited preceding
    Slides and tumbles
    Including UDAs

User-Defined Aggregates: Max Power via Min SQL Extensions
Windows (logical, physical, slides, tumbles, …): flexible synopses that solve the blocking problem for aggregates.
  Other DSMS only support these constructs on built-in aggregates.
User-Defined Aggregates (UDAs): the key to power and extensibility.
ESL is the first to support the complete integration of these two, and thus can support data mining, XML, and sequences not supported by other DSMS.
One framework for aggregates and windows, whether they are built-in or user-defined, and independent of the language used to define them.

Defining Traditional Aggregates
The specification consists of 3 blocks of code, written in an external PL (as DBMS and other DSMS do) or in SQL itself (SQL becomes Turing complete!):
  INITIALIZE: executed upon the arrival of the first tuple
  ITERATE: executed upon the arrival of each subsequent tuple (an incremental computation suitable for streams)
  TERMINATE: executed after the end of the relation/stream has been reached
Invocation:
  SELECT myavg(start_price) FROM OpenAuction
Notes: The previous are simple SQL extensions. Recently, people have proposed alternative semantics and a new set of operators based on that semantics. We find that, using our language constructs, we can support this semantics in a natural fashion, using Union.

The UDA AVG in SQL
  AGGREGATE avg(Next Int) : Real
  { TABLE state(tsum Int, cnt Int);
    INITIALIZE : { INSERT INTO state VALUES (Next, 1); }
    ITERATE : { UPDATE state SET tsum=tsum+Next, cnt=cnt+1; }
    TERMINATE : { INSERT INTO RETURN SELECT tsum/cnt FROM state; }
  }
"INSERT INTO RETURN" in TERMINATE means this is a blocking UDA.
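For readers less familiar with the UDA syntax, the same three-block structure can be mirrored in Python. This is a rough analogy, not ESL: the class BlockingAvg and the driver run_uda are hypothetical names, and the driver stands in for what the query engine would do.

```python
class BlockingAvg:
    """Python analogue of the avg UDA: state is (tsum, cnt)."""

    def initialize(self, nxt):
        # Runs on the first tuple.
        self.tsum, self.cnt = nxt, 1

    def iterate(self, nxt):
        # Runs on every subsequent tuple.
        self.tsum += nxt
        self.cnt += 1

    def terminate(self):
        # Runs once the input has ended; this is what makes the aggregate blocking.
        return self.tsum / self.cnt


def run_uda(uda, values):
    """Hypothetical driver: feed all values, then return the TERMINATE result."""
    it = iter(values)
    try:
        first = next(it)
    except StopIteration:
        return None  # no tuples, no result
    uda.initialize(first)
    for v in it:
        uda.iterate(v)
    return uda.terminate()


result = run_uda(BlockingAvg(), [10, 20, 60])
print(result)
```

Because the result is produced only in terminate, nothing is returned until the input is exhausted, which never happens on an unbounded stream.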

NonBlocking UDA: AVG of last 200 Values
  AGGREGATE myavg(Next Int) : Real
  { TABLE state(tsum Int, cnt Int);
    INITIALIZE : { INSERT INTO state VALUES (Next, 1); }
    ITERATE : {
      UPDATE state SET tsum=tsum+Next, cnt=cnt+1;
      INSERT INTO RETURN SELECT tsum/cnt FROM state WHERE cnt % 200 = 0;
      UPDATE state SET tsum=Next, cnt=1 WHERE cnt % 200 = 1;
    }
    TERMINATE : { }
  }
An empty TERMINATE denotes a non-blocking UDA.
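A Python generator gives a rough picture of the non-blocking behaviour: results are emitted from inside the per-tuple step, so the aggregate works on an unbounded input. The sketch below is an assumption-laden analogue, not ESL. The name nonblocking_avg and the parameter n are hypothetical, and the state is reset right after each emission rather than on the following tuple, which yields the same averages.

```python
from itertools import count, islice

def nonblocking_avg(values, n=200):
    """Emit the average of each successive block of n values as soon as the block is complete."""
    tsum = 0
    cnt = 0
    for nxt in values:
        tsum += nxt
        cnt += 1
        if cnt == n:
            yield tsum / cnt   # corresponds to INSERT INTO RETURN inside ITERATE
            tsum = 0           # start the next block
            cnt = 0

# count(1) is an unbounded stream 1, 2, 3, ...; we take only the first two results.
first_two = list(islice(nonblocking_avg(count(1)), 2))
print(first_two)
```

Nothing here depends on reaching the end of the input, which is the property the empty TERMINATE block expresses.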

UDAs in ESL
In ESL, user-defined aggregates (UDAs) can be defined directly in SQL, rather than in a PL.
Native extensibility in SQL via UDAs (which can also be defined in a PL for better performance):
  No impedance mismatch
  Access to DB tables from UDAs
  Data independence and optimization
  Good ease of use and performance
  Turing completeness and nb-completeness

Data Intensive Applications & UDAs
Complex applications can be expressed concisely, with good performance.
ATLAS: a single-user DBMS developed at UCLA
  Support for SQL with UDAs
  Built on top of the Berkeley-DB record manager
Data mining algorithms in ATLAS:
  Decision tree classifiers: 18 lines of code
  APriori: 40 lines of code
  Modest overhead: <50% w.r.t. procedural UDAs
Data stream applications in ESL:
  Data stream mining, approximate aggregates, sketches, histograms, …

SQL:2003 OLAP Functions
Aggregates on windows
  CREATE STREAM ClosedAuction ( /* auction closings */
    itemID,            /* id of the item in this auction */
    buyerID,           /* buyer of this item */
    Final_price real,  /* final price of the item */
    Current_time)
  order by … source … Auctions
For each seller, show the average selling price over the last 10 items sold (physical window):
  CREATE STREAM LastTenAvg
    SELECT sellerID, AVG(price) OVER (PARTITION BY sellerID ROWS 9 PRECEDING), Current_time
    FROM ClosedPrice;
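The effect of PARTITION BY sellerID ROWS 9 PRECEDING (the current row plus the 9 before it, per seller) can be sketched in Python. The function last_n_avg and the (seller, price, time) tuple layout are assumptions made for illustration; this version simply recomputes the average from the buffered rows on every tuple.

```python
from collections import defaultdict, deque

def last_n_avg(sales, n=10):
    """For each incoming sale, emit the seller's average price over their last n sales."""
    # One bounded buffer per partition; deque(maxlen=n) discards the oldest row automatically.
    windows = defaultdict(lambda: deque(maxlen=n))
    for seller, price, ts in sales:
        w = windows[seller]
        w.append(price)
        yield seller, sum(w) / len(w), ts

sales = [("s1", 10.0, 1), ("s2", 4.0, 2), ("s1", 20.0, 3), ("s1", 30.0, 4)]
out = list(last_n_avg(sales, n=2))
print(out)
```

With n=2 the third s1 sale sees only prices 20.0 and 30.0, because the first one has slid out of that seller's window.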

Optimizing Window AVG in ESL
For each expired tuple, decrease the count by one and the sum by the expired value. This works for both logical and physical windows.
  WINDOW AGGREGATE avg(Next Real) : Real
  { TABLE state(tsum Real, cnt Int);
    TABLE inwindow(wnext Real);
    INITIALIZE : { INSERT INTO state VALUES (Next, 1); }
    ITERATE : {
      UPDATE state SET tsum=tsum+Next, cnt=cnt+1;
      INSERT INTO RETURN SELECT tsum/cnt FROM state;
    }
    EXPIRE : { /* if there are expired tuples, take the oldest */
      UPDATE state SET cnt=cnt-1,
        tsum = tsum - (SELECT wnext FROM inwindow WHERE oldest(inwindow));
    }
  }
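The incremental maintenance described above can be mimicked in Python. This is a sketch for a physical (row-count) window only; the class WindowAvg is hypothetical, and it manages its own inwindow buffer and triggers expire itself, whereas in ESL the system is said to do both.

```python
from collections import deque

class WindowAvg:
    """Sliding average over the last `size` values, maintained incrementally."""

    def __init__(self, size):
        self.size = size
        self.inwindow = deque()  # stand-in for the system-maintained inwindow table
        self.tsum = 0.0
        self.cnt = 0

    def iterate(self, nxt):
        # Add the new value to the running state.
        self.inwindow.append(nxt)
        self.tsum += nxt
        self.cnt += 1
        # Expire the oldest values that no longer fit in the window.
        while self.cnt > self.size:
            self.expire()
        return self.tsum / self.cnt

    def expire(self):
        # Undo the contribution of the oldest value: O(1) per expired tuple.
        oldest = self.inwindow.popleft()
        self.tsum -= oldest
        self.cnt -= 1

w = WindowAvg(3)
averages = [w.iterate(v) for v in [3.0, 6.0, 9.0, 12.0]]
print(averages)
```

Each tuple costs a constant amount of work regardless of the window size, in contrast with recomputing the sum over the whole window.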

MAX
The system maintains inwindow.
Remove dominated (smaller and older) values; the oldest remaining value is then always the max.
  WINDOW AGGREGATE max(Next Real) : Real
  { TABLE inwindow(wnext Real);
    INITIALIZE : { etc. }
    /* system adds new tuples to inwindow */
    ITERATE : {
      DELETE FROM inwindow WHERE wnext < Next;
      INSERT INTO RETURN SELECT wnext FROM inwindow WHERE oldest(inwindow);
    }
    EXPIRE : { } /* expired tuples removed automatically */
  }
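The "remove dominated values" idea corresponds to the well-known monotonic-deque technique. The Python sketch below applies it to a physical window of a fixed number of rows; the class WindowMax and its fields are hypothetical, and expiry is done by the class itself rather than by the system.

```python
from collections import deque

class WindowMax:
    """Sliding maximum over the last `size` values using dominated-value removal."""

    def __init__(self, size):
        self.size = size
        self.inwindow = deque()  # (index, value) pairs; values never increase from oldest to newest
        self.i = -1              # index of the most recent tuple

    def iterate(self, nxt):
        self.i += 1
        # Older values smaller than the new one can never be the max again: drop them.
        while self.inwindow and self.inwindow[-1][1] < nxt:
            self.inwindow.pop()
        self.inwindow.append((self.i, nxt))
        # Expire entries that fell out of the window of the last `size` tuples.
        while self.inwindow[0][0] <= self.i - self.size:
            self.inwindow.popleft()
        # The oldest surviving entry is the maximum of the window.
        return self.inwindow[0][1]

wm = WindowMax(3)
maxima = [wm.iterate(v) for v in [5, 3, 4, 1, 2, 0]]
print(maxima)
```

Since dominated values are always at the newest end of the buffer, removing them takes amortized constant time per tuple.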

For Each Aggregate, Two Versions
The traditional base aggregate, with TERMINATE.
The window aggregate, with inwindow and EXPIRE.
These definitions take care of both logical and physical windows.
But there are more complications: slides and tumbles.

Slides and Tumbles
Every two minutes, show the average selling price over the last 10 minutes (logical window):
  CREATE STREAM LastTenAvg
    SELECT sellerID, AVG(price) OVER (RANGE 10 MINUTE PRECEDING SLIDE 2 MINUTE), Current_time
    FROM ClosedPrice;
Here the window is W=10 and the slide is S=2.
Tumble: when S ≥ W

SLIDEs
[Figure: a window divided into slides/panes, each reduced to a summary tuple.]
The SLIDE construct divides a window into panes; results are only returned at the end of each pane.
Algebraic properties make slides conducive to optimization: summaries are combined into the desired aggregate.
  E.g.: MAX(1, 2, 3, 4) = MAX(MAX(1, 2), MAX(3, 4)) = 4
  I.e., for MAX, we can compute MAX on subsets of the numbers as local summaries, then combine them to get the true MAX.
Used for built-in aggregates in SQL:2003, but what constructs should be used to integrate these concepts into a language for user-defined aggregates?

Slides & Tumbles: Examples
Tumble: the SLIDE size is equal to or larger than the window size
  E.g., once every 50 tuples, compute and return the average over the last 10 tuples
  Easy to optimize: skip the first 40 tuples of every 50, and compute the blocking base version of the aggregate on the last 10
Slide: the slide size is smaller than the window size
  E.g., once every 10 tuples, compute and return the average over the last 50 tuples
  Naïve implementation (not optimized):
    Perform incremental maintenance on every incoming tuple
    Ignore RETURN statements for most incoming tuples
    Only invoke RETURN once every 10 tuples
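The tumble optimization (skip, then aggregate only the tail of each slide) can be sketched in Python. The function tumble_avg and its parameters are hypothetical; the sketch assumes a physical window counted in tuples.

```python
def tumble_avg(values, slide=50, window=10):
    """Once every `slide` tuples, emit the average of the last `window` tuples of that slide."""
    if slide < window:
        raise ValueError("a tumble requires slide >= window")
    buf = []
    for i, v in enumerate(values):
        pos = i % slide
        # Skip the first (slide - window) tuples of every slide; buffer only the tail.
        if pos >= slide - window:
            buf.append(v)
        # At the end of the slide, compute the base (blocking) aggregate on the buffered tail.
        if pos == slide - 1:
            yield sum(buf) / len(buf)
            buf = []

tumbled = list(tumble_avg(range(1, 11), slide=5, window=2))
print(tumbled)
```

With slide=5 and window=2, only the 4th and 5th tuple of every group of five are ever touched, which is why the tumble case is described as easy to optimize.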

Pane-Based SLIDE Optimization
Two-level cascading aggregates, using two existing aggregates:
  Perform sub-aggregation inside each pane using the base aggregate
    No need for incremental maintenance here
    Computed with a blocking aggregate once for each pane
  Combine the summary tuples using the window aggregate, which returns on every incoming tuple (non-blocking)
    With incremental maintenance here
At any time, only the last unfinished pane needs to store data tuples; all finished panes are reduced to one reusable summary tuple each.
[Figure: a window of panes processed by Agg1 (base), whose summary tuples feed Agg2 (window).]

Pane-based SLIDE optimization
ClosedAuction (itemID, buyerID, Final_price, Current_time)
Computing the MAX on a window of 50 tuples with a slide size of 10 tuples:
  CREATE STREAM temp AS
    (SELECT itemID, max(sale_price) OVER (PARTITION BY itemID ROWS 49 PRECEDING SLIDE 10)
     FROM Auction);
This is computed as the cascade of
  a tumble of 10 rows (returning the max of those 10 rows),
  followed by a max on a window of 5 rows.
Notes: complex aggregates are possible using this model, e.g. ensemble voting. The same mechanism can be used for both logical and physical windows.

Pane-based SLIDE optimization
MAX with a window size of 50 tuples and a slide size of 10 tuples
1. First create a stream of summary tuples using the base aggregate:
  CREATE STREAM temp AS
    (SELECT itemID, max(sale_price) OVER (PARTITION BY itemID ROWS 9 PRECEDING SLIDE 10) AS msp
     FROM Auction);
   This is computed as a tumble using the base version of the UDA.
2. Then apply the window version of the aggregate on the five (4+1=5) tuples produced in step 1:
  SELECT itemID, window_max(msp) OVER (PARTITION BY itemID ROWS 4 PRECEDING)
  FROM temp;
Notes: complex aggregates are possible using this model, e.g. ensemble voting. The same mechanism can be used for both logical and physical windows.
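The two-step cascade can be sketched in Python for a single partition. The function pane_based_max is hypothetical and assumes the window size is a multiple of the slide size; for brevity the second level recomputes max over the few pane summaries instead of using an incremental window aggregate.

```python
from collections import deque

def pane_based_max(values, window=50, slide=10):
    """Every `slide` tuples, emit the max over the last `window` tuples using pane summaries."""
    if window % slide != 0:
        raise ValueError("this sketch assumes window is a multiple of slide")
    panes = deque(maxlen=window // slide)  # one summary per finished pane; oldest pane drops out
    pane_max = None
    count = 0
    for v in values:
        # Step 1: base aggregate inside the current pane (only one running value is stored).
        pane_max = v if pane_max is None else max(pane_max, v)
        count += 1
        if count == slide:
            panes.append(pane_max)
            # Step 2: window aggregate over the pane summaries (5 of them for 50/10).
            yield max(panes)
            pane_max = None
            count = 0

results = list(pane_based_max([1, 9, 2, 3, 4, 5, 0, 1], window=4, slide=2))
print(results)
```

The state kept is one running value for the open pane plus window/slide summaries, rather than all the tuples in the window.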

Checkpoint
{Logical | Physical} x {tumble | slide | unlimited_preceding}
Six different types of calls, supported by two definitions.
Either SQL or procedural languages can be used in the definition.
This simple approach can be used to implement very complex aggregations (e.g. ensemble classifiers).
Applies uniformly to logical/physical windows defined in SQL or in an external language.

Window UDAs vs. Base UDAs
Base UDAs: called as traditional SQL-2 aggregates, with optional GROUP BY
Window UDAs: called with the SQL:2003 OVER clause
  Optional PARTITION BY clause
  Logical or physical windows
  Optional SLIDE clauses in ESL
Clear semantics and optimization rules unify:
  UDAs: SQL- or PL-defined, algebraic or not, …
  Windows (logical and physical), slides, tumbles, etc.
System vs. user roles in optimization are clearly defined.

Window UDAs: Physical Optimization
The Stream Mill system provides efficient support for:
  Management of new and expiring tuples in the buffer
  Main memory and intelligent paging to disk
  Events caused by tuple expiration
Users can access the buffer as the table called inwindow.

Conclusion
Language technology:
  ESL is a very powerful language for data stream and DB applications
  Simple semantics and unified syntax conforming to SQL:2003 standards
  Strong case for the DB-oriented approach to data streams
System technology:
  Some performance-oriented techniques are well developed, e.g., buffer management for windows
  For others, work is still in progress; stay tuned for the latest news
Stream Mill is up and running: http://wis.cs.ucla.edu/stream-mill

