Data Streams COMP3017 Advanced Databases Dr Nicholas Gibbins – 2012-2013.

Slides:



Advertisements
Similar presentations
Chapter 10: Designing Databases
Advertisements

Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker SIGMOD.
Analysis of : Operator Scheduling in a Data Stream Manager CS561 – Advanced Database Systems By Eric Bloom.
Temporal Databases S. Srinivasa Rao April 12, 2007
1 11. Streaming Data Management Chapter 18 Current Issues: Streaming Data and Cloud Computing The 3rd edition of the textbook.
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O ’ Callaghan, Stanford University ACM Symp. on Principles.
A Data Stream Management System for Network Traffic Management Shivnath Babu Stanford University Lakshminarayanan Subramanian Univ. California, Berkeley.
Data Stream Computation Lecture Notes in COMP 9314 modified from those by Nikos Koudas (Toronto U), Divesh Srivastava (AT & T), and S. Muthukrishnan (Rutgers)
Data Streams & Continuous Queries The Stanford STREAM Project stanfordstreamdatamanager.
1 Continuous Queries over Data Streams Vitaly Kroivets, Lyan Marina Presentation for The Seminar on Database and Internet The Hebrew University of Jerusalem,
ONE PASS ALGORITHM PRESENTED BY: PRADHYUMAN RAOL ID : 114 Instructor: Dr T.Y. LIN.
An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations Presenter: Liyan Zhang Presentation of ICS
1 Stream-based Data Management IS698 Min Song 2 Characteristics of Data Streams  Data Streams Data streams — continuous, ordered, changing, fast, huge.
ONE PASS ALGORITHM PRESENTED BY: PRADHYUMAN RAOL ID : 114 Instructor: Dr T.Y. LIN.
Building a Data Stream Management System Prof. Jennifer Widom Joint project with Prof. Rajeev Motwani and a team of graduate studentshttp://www-db.stanford.edu/stream.
1 PODS 2002 Motivation. 2 PODS 2002 Data Streams data sets Traditional DBMS – data stored in finite, persistent data sets data streams New Applications.
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Models and Issues in Data Streaming Presented By :- Ankur Jain Department of Computer Science 6/23/03 A list of relevant papers is available at
STREAM The Stanford Data Stream Management System.
NiagaraCQ A Scalable Continuous Query System for Internet Databases Jianjun Chen, David J DeWitt, Feng Tian, Yuan Wang University of Wisconsin – Madison.
NiagaraCQ : A Scalable Continuous Query System for Internet Databases (modified slides available on course webpage) Jianjun Chen et al Computer Sciences.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
Chapter 7 Advanced SQL Database Systems: Design, Implementation, and Management, Sixth Edition, Rob and Coronel.
Query Processing, Resource Management, and Approximation in a Data Stream Management System.
CSCE Database Systems Chapter 15: Query Execution 1.
Master’s Thesis (30 credits) By: Morten Lindeberg Supervisors: Vera Goebel and Jarle Søberg Design, Implementation, and Evaluation of Network Monitoring.
A new model and architecture for data stream management.
1 Data Warehouses BUAD/American University Data Warehouses.
1 STREAM: The Stanford Data Stream Management System STanfordstREamdatAManager 陳盈君 吳哲維 林冠良.
8 1 Chapter 8 Advanced SQL Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel.
Database Systems Design, Implementation, and Management Coronel | Morris 11e ©2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or.
Data Stream Management Systems
Aum Sai Ram Security for Stream Data Modified from slides created by Sujan Pakala.
A new model and architecture for data stream management.
Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro.
Data Mining: Concepts and Techniques Mining data streams
Triggers and Streams Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems March 28, 2005.
IMS 4212: Database Implementation 1 Dr. Lawrence West, Management Dept., University of Central Florida Physical Database Implementation—Topics.
1 Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter Tucker This work.
CPT-S Advanced Databases 11 Yinghui Wu EME 49.
Query Processing COMP3017 Advanced Databases Nicholas Gibbins
Streaming Semantic Data COMP6215 Semantic Web Technologies Dr Nicholas Gibbins –
1 Advanced Database Systems: DBS CB, 2 nd Edition Advanced Topics of Interest: DB the Cloud, and SQL & Stream Processing.
Fundamental of Database Systems
More SQL: Complex Queries, Triggers, Views, and Schema Modification
Mining Data Streams (Part 1)
Advanced Database Systems: DBS CB, 2nd Edition
S. Sudarshan CS632 Course, Mar 2004 IIT Bombay
Efficient Evaluation of XQuery over Streaming Data
COMP3211 Advanced Databases
Dieter Gawlick, Oracle October, 2005 (GGF15 in Boston)
CPS216: Data-intensive Computing Systems
Database System Concepts and Architecture
Database Management System
Database Systems: Design, Implementation, and Management Tenth Edition
Big Data Infrastructure
SQL: Advanced Options, Updates and Views Lecturer: Dr Pavle Mogin
Chapter 12: Query Processing
Evaluation of Relational Operations
Sidharth Mishra Dr. T.Y. Lin CS 257 Section 1 MH 222 SJSU - Fall 2016
Selected Topics: External Sorting, Join Algorithms, …
The Relational Model Textbook /7/2018.
One-Pass Algorithms for Database Operations (15.2)
Chapter 8 Advanced SQL.
Lecture 23: Query Execution
Database Systems: Design, Implementation, and Management Tenth Edition
Theppatorn rhujittawiwat
Adaptive Query Processing (Background)
An Analysis of Stream Processing Languages
Lecture 20: Query Execution
Presentation transcript:

Data Streams COMP3017 Advanced Databases Dr Nicholas Gibbins –

From Databases to Data Streams 2 Traditional DBMS makes several assumptions: –persistent data storage –relatively static records –(typically) no predefined notion of time –complex one-off queries

From Databases to Data Streams 3 Some applications have very different requirements: –data arrives in real-time –data is ordered (implicitly by arrival time or explicitly by timestamp) –too much data to store! –data never stops coming –ongoing analysis of rapidly changing data

Example Application: MIDAS 4

Application Domains 5 Network monitoring and traffic engineering Sensor networks, RFID tags Telecommunications call records Financial applications Web logs and click-streams Manufacturing processes

Data Streams A (potentially unbounded) sequence of tuples Transactional data streams: log interactions between entities – Credit card: purchases by consumers from merchants – Telecommunications: phone calls by callers to dialed parties – Web: accesses by clients of resources at servers Measurement data streams: monitor evolution of entity states – Sensor networks: physical phenomena, road traffic – IP network: traffic at router interfaces – Earth climate: temperature, moisture at weather stations 6

One-Time versus Continuous Queries One-time queries Run once to completion over the current data set Continuous queries Issued once and then continuously evaluated over a data stream –“Notify me when the temperature drops below X” –“Tell me when prices of stock Y > 300” 7

Database Management System 8 query processor stored data on disk query results

Data Stream Management System (DSMS) 9 query processor continuous query stream of results data streams data streams

DBMS Persistent relations (relatively static, stored) One-time queries Random access “Unbounded” disk store Only current state matters DSMS Transient streams (on-line analysis) Continuous queries (CQs) Sequential access Bounded main memory Historical data is important DBMS versus DSMS 10

DBMS No real-time services Relatively low update rate Data at any granularity Assume precise data Access plan determined by query processor, physical DB design DSMS Real-time requirements Possibly multi-GB arrival rate Data at fine granularity Data stale/imprecise Unpredictable/variable data arrival and characteristics DBMS versus DSMS 11

A Motivation for Stream Processing 12 Over the past twenty-five years: –CPU performance has increased by a factor of >1,000,000 –Typical RAM capacity increased by a factor of >1,000,000 –RAM access time has decreased by a factor of >50,000 –Typical HD capacity increased by a factor of >50,000 –HD access time has decreased by a factor of ~10

DBMS Resource (memory, disk, per- tuple computation) rich Extremely sophisticated query processing, analysis Useful to audit query results of data stream systems. Query Evaluation: Arbitrary Query Plan: Fixed. DSMS Resource (memory, per-tuple computation) limited Reasonably complex, near real time, query processing Useful to identify what data to populate in database Query Evaluation: One pass Query Plan: Adaptive Architectural Issues 13

Query Processing 14

Example: Continuous Query Language 15 Queries produce/refer to relations and streams Based on SQL, with the addition of: –Streams as new data type –Continuous instead of one-time semantics –Windows on streams (derived from SQL-99) –Sampling on streams (basic)

Query Processing 16 Construct query plan based on relational operators, as in an RDBMS –Selection –Projection –Join –Aggregation (group by) Combine plans from continuous queries (reduce redundancy) Stream tuples through the resulting network of operators

17 Evaluation only requires consideration of one tuple at a time –Selection and projection Tuple-at-a-time Operators op input streamoutput stream

18 Some full relation operators can work on a tuple at a time –Count, sum, average, max, min (even with group by) –(order by, however, can’t) Full Relation Operators op input streamoutput stream accumulator

19 Other (binary) full relation operators can’t –Intersection, difference, product, join –(union, however, can be evaluated tuple-by-tuple) Full Relation Operators op input stream output stream input stream

20 May block when applied to streams –no output until entire input seen, but streams are unbounded –joins may need to join tuples that are arbitrarily far apart Full Relation Operators op input stream output stream input stream

Relation/Stream Translation 21 Some relational operators can work directly on streams –Selection, projection, union, some aggregates Some relational operators need to work on relations –Join, product, difference, intersection, other aggregates Stream-to-relation operators –Windows Relation-to-stream operators –Istream, Dstream, Rstream

Windows 22 Mechanism for extracting a finite relation (synopsis) from an infinite stream Various window proposals for restricting operator scope. –Windows based on ordering attribute (e.g. last 5 minutes of tuples) –Windows based on tuple counts (e.g. last 1000 tuples) –Windows based on explicit markers (e.g. punctuations) –Variants (e.g., partitioning tuples in a window) Various window behaviours –Sliding, tumbling

Sliding Windows 23 data stream time t1t2t3t4t5 windows

Tumbling Windows 24 data stream time t1t2t3t4t5 windows

25 Consider a stream-based join operation: –a conventional join over a pair of windows on the input streams –outputs a stream of tuples joined from the input streams Join Evaluation input stream input stream output stream

Scalability and Completeness 26 DBMS deals with finite relations –query evaluation should produce all results for a given query DSMS deals with unbounded data streams –may not be possible to return all results for a given query –trade-off between resource use and completeness of result set –size of buffers used for windows is one example of a parameter that affects resource use and completeness –can further reduce resource use by randomly sampling from streams

Relation-to-Stream Operators 27 Insert Stream (Istream) –Whenever a tuple is inserted into the relation, emit it on the stream Delete Stream (Dstream) –Whenever a tuple is deleted from the relation, emit it on the stream Relation Stream (Rstream) –At every time instant, emit every tuple in relation on the stream

Example CQL Query 28 SELECT Istream(*) FROM S [rows unbounded] WHERE S.A > 10 S is converted into a relation (of unbounded size!) Resulting relation is converted back to a stream via Istream

Example CQL Query 29 SELECT * FROM S WHERE S.A > 10 S is a stream – query plan involves only selection, so window is now unnecessary

Example CQL Query 30 SELECT * FROM S1 [rows 1000], S2 [range 2 minutes] WHERE S1.A = S2.A AND S1.A > 10 Windows specified on streams –Tuple-based sliding window – [rows 1000] –Time-based sliding window – [range 2 minutes]

Example CQL Query 31 SELECT Rstream(S.A, R.B) FROM S [now], R WHERE S.A = R.A Query probes a stored table R based on each tuple in stream S and streams the result –[now] – time-based sliding window containing tuples received in last time step

Query Optimisation 32 Traditionally relation cardinalities used in query optimiser –Minimize the size of intermediate results. Problematic in a streaming environment –All streams are unbounded = infinite size!

Query Optimisation 33 Need novel optimisation objectives that are relevant when input sources are streams –Stream rate based (e.g. NiagaraCQ) –Resource-based (e.g. STREAM) –QoS based (e.g. Aurora) Continuous adaptive optimisation

Notable DSMS Projects 34 Aurora, Borealis (Brown/MIT) – sensor monitoring Niagara (OGI/Wisconsin) – Internet XML databases OpenCQ (Georgia) – triggers, incr. view maintenance STREAM (Stanford) – general-purpose DSMS Telegraph (Berkeley) – adaptive engine for sensors