CPT-S 580-06 Advanced Databases 11 Yinghui Wu EME 49.

Slides:



Advertisements
Similar presentations
Sampling From a Moving Window Over Streaming Data Brian Babcock * Mayur Datar Rajeev Motwani * Speaker Stanford University.
Advertisements

Chapter 10: Designing Databases
Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker SIGMOD.
Analysis of : Operator Scheduling in a Data Stream Manager CS561 – Advanced Database Systems By Eric Bloom.
C6 Databases.
1 11. Streaming Data Management Chapter 18 Current Issues: Streaming Data and Cloud Computing The 3rd edition of the textbook.
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O ’ Callaghan, Stanford University ACM Symp. on Principles.
Mining Data Streams.
A Data Stream Management System for Network Traffic Management Shivnath Babu Stanford University Lakshminarayanan Subramanian Univ. California, Berkeley.
Adaptive Monitoring of Bursty Data Streams Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani.
Data Stream Computation Lecture Notes in COMP 9314 modified from those by Nikos Koudas (Toronto U), Divesh Srivastava (AT & T), and S. Muthukrishnan (Rutgers)
1 Continuous Queries over Data Streams Vitaly Kroivets, Lyan Marina Presentation for The Seminar on Database and Internet The Hebrew University of Jerusalem,
Engine Design: Stream Operators Everywhere Theodore Johnson AT&T Labs – Research Contributors: Chuck Cranor Vladislav Shkapenyuk.
An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations Presenter: Liyan Zhang Presentation of ICS
Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Aurora Proponent Team Wei, Mingrui Liu, Mo Rebuttal Team Joshua M Lee Raghavan, Venkatesh.
Chapter 10: Stream-based Data Management Title: Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core Authors:
1 Stream-based Data Management IS698 Min Song 2 Characteristics of Data Streams  Data Streams Data streams — continuous, ordered, changing, fast, huge.
Building a Data Stream Management System Prof. Jennifer Widom Joint project with Prof. Rajeev Motwani and a team of graduate studentshttp://www-db.stanford.edu/stream.
1 PODS 2002 Motivation. 2 PODS 2002 Data Streams data sets Traditional DBMS – data stored in finite, persistent data sets data streams New Applications.
How to Build a Stream Database Theodore Johnson AT&T Labs - Research.
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
The Stanford Data Streams Research Project Profs. Rajeev Motwani & Jennifer Widom And a cast of full- and part-time students: Arvind Arasu, Brian Babcock,
SWIM 1/9/20031 QoS in Data Stream Systems Rajeev Motwani Stanford University.
Models and Issues in Data Streaming Presented By :- Ankur Jain Department of Computer Science 6/23/03 A list of relevant papers is available at
CS 580S Sensor Networks and Systems Professor Kyoung Don Kang Lecture 7 February 13, 2006.
CS 591 A11 Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2 nd Dec, 2002.
Morten Lindeberg University of Oslo (With slides from Vera Goebel)
Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman.
Database System Concepts and Architecture Lecture # 3 22 June 2012 National University of Computer and Emerging Sciences.
Mirek Riedewald Department of Computer Science Cornell University Efficient Processing of Massive Data Streams for Mining and Monitoring.
An Integration Framework for Sensor Networks and Data Stream Management Systems.
Query Processing, Resource Management, and Approximation in a Data Stream Management System.
Master’s Thesis (30 credits) By: Morten Lindeberg Supervisors: Vera Goebel and Jarle Søberg Design, Implementation, and Evaluation of Network Monitoring.
Data Stream Systems Reynold Cheng 12 th July, 2002 Based on slides by B. Babcock et.al, “Models and Issues in Data Stream Systems”, PODS’02.
Heartbeat Mechanism and its Applications in Gigascope Vladislav Shkapenyuk (speaker), Muthu S. Muthukrishnan Rutgers University Theodore Johnson Oliver.
PODS Models and Issues in Data Stream Systems Rajeev Motwani Stanford University (with Brian Babcock, Shivnath Babu, Mayur Datar, and Jennifer Widom)
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
Design, Implementation, and Evaluation of Network Monitoring Tasks with the TelegraphCQ Data Stream Management System INF5100, Autumn 2006 Jarle Søberg.
Database Systems Design, Implementation, and Management Coronel | Morris 11e ©2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or.
Adaptive Query Processing in Data Stream Systems Paper written by Shivnath Babu Kamesh Munagala, Rajeev Motwani, Jennifer Widom stanfordstreamdatamanager.
Data Stream Management Systems
Aum Sai Ram Security for Stream Data Modified from slides created by Sujan Pakala.
Load Shedding Techniques for Data Stream Systems Brian Babcock Mayur Datar Rajeev Motwani Stanford University.
INNOV-10 Progress® Event Engine™ Technical Overview Prashant Thumma Principal Software Engineer.
An Example Data Stream Management System: TelegraphCQ INF5100, Autumn 2009 Jarle Søberg.
CS 257 Chapter – 15.9 Summary of Query Execution Database Systems: The Complete Book Krishna Vellanki 124.
CS4432: Database Systems II Query Processing- Part 2.
Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro.
Data Mining: Concepts and Techniques Mining data streams
Review Lecture DB A/18-849B/95-811A/19-729A Internet-Scale Sensor Systems: Design and Policy Review Lecture Databases Phil Gibbons May 1, 2003.
Streaming Semantic Data COMP6215 Semantic Web Technologies Dr Nicholas Gibbins –
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.
Data Streams COMP3017 Advanced Databases Dr Nicholas Gibbins –
Mining Data Streams (Part 1)
Big Data Infrastructure
Advanced Database Systems: DBS CB, 2nd Edition
S. Sudarshan CS632 Course, Mar 2004 IIT Bombay
Efficient Evaluation of XQuery over Streaming Data
COMP3211 Advanced Databases
CPT-S 415 Big Data Yinghui Wu EME B45.
The Stream Model Sliding Windows Counting 1’s
Load Shedding Techniques for Data Stream Systems
Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani
Feifei Li, Ching Chang, George Kollios, Azer Bestavros
Chapter 8 Advanced SQL.
Introduction to Stream Computing and Reservoir Sampling
Evaluation of Relational Operations: Other Techniques
Adaptive Query Processing (Background)
Presentation transcript:

CPT-S Advanced Databases 11 Yinghui Wu EME 49

Data stream management systems (DSMS) 2 Introduction –Research field –DBMS vs. DSMS –Motivation Concepts and Issues –Requirements –Architecture –Data model –Queries –Data reduction Examples Courtesy (slides adapted) from Vera Goebel

The DSMS Research Field Active research field (~ 15 years) derived from the database community –Stream algorithms –Application and database perspective Two syllabus articles: –Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom: "Models and issues in data stream systems" –Lukasz Golab, M. Tamer Ozsu: "Issues in data stream management” Future: Complex Event Processing (CEP) 3

DBMS vs. DSMS #1 Query Processing Continuous Query (CQ) Result Query Processing Main Memory Data Stream(s) Disk Main Memory SQL Query Result 4

DBMS vs. DSMS #2 Traditional DBMS: –stored sets of relatively static records with no pre-defined notion of time –good for applications that require persistent data storage and complex querying DSMS: support on-line analysis of rapidly changing data streams data stream: real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items, too large to store entirely, not ending continuous queries 5

6 DBMS vs. DSMS #3 Persistent relations One-time queries Random access “Unbounded” disk store Only current state matters Passive repository Relatively low update rate No real-time services Assume precise data Access plan determined by query processor, physical DB design Transient streams Continuous queries Sequential access Bounded main memory History/arrival-order is critical Active stores Possibly multi-GB arrival rate Real-time requirements Data stale/imprecise Unpredictable/variable data arrival and characteristics

DSMS Applications Network security (e.g., iPolicy, NetForensics/Cisco, Niksun ) –Network packet streams, user session information –URL filtering, detecting intrusions & DOS attacks & viruses Sensor Networks –E.g. TinyDB. –Network Traffic Analysis –Real time analysis of Internet traffic. E.g., Traffic statistics and critical condition detection. Financial Tickers –On-line analysis of stock prices, discover correlations, identify trends. Transaction Log Analysis –E.g. Web click streams and telephone calls Pull-based Push-based 7

Data Streams - Terms A data stream is a (potentially unbounded) sequence of tuples Each tuple consist of a set of attributes, similar to a row in database table Transactional data streams: log interactions between entities – Credit card: purchases by consumers from merchants – Telecommunications: phone calls by callers to dialed parties – Web: accesses by clients of resources at servers Measurement data streams: monitor evolution of entity states – Sensor networks: physical phenomena, road traffic – IP network: traffic at router interfaces – Earth climate: temperature, moisture at weather stations 8

Motivation #1 Massive data sets: –Huge numbers of users, e.g., AT&T long-distance: ~ 300M calls/day AT&T IP backbone: ~ 10B IP flows/day –Highly detailed measurements, e.g., NOAA: satellite-based measurements of earth geodetics –Huge number of measurement points, e.g., Sensor networks with huge number of sensors 9

Motivation #2 Near real-time analysis –ISP: controlling service levels –NOAA: tornado detection using weather radar –Hospital: Patient monitoring Traditional data feeds –Simple queries (e.g., value lookup) needed in real-time –Complex queries (e.g., trend analyses) performed off-line 10

Motivation #3 Stig Støa, Morten Lindeberg and Vera Goebel. Online Analysis of Myocardial Ischemia From Medical Sensor Data Streams with Esper, 2009 Queries over sensor traces from surgical procedures on pigs performed at IVS, Rikshospitalet, running a open source java system called Esper Successful identification of occlusion to the heart (heart attack) 11 Heart attack! SELECT y, timestamp FROM Accelerometer.win:ext_timed(t, 5 s) HAVING count(y) BETWEEN 2 AND 200

Concepts and issues: Requirement 12 Courtesy (slides adapted) from Vera Goebel

Requirements Data model and query semantics: order- and time-based operations –Selection –Nested aggregation –Multiplexing and demultiplexing –Frequent item queries –Joins –Windowed queries Query processing: –Streaming query plans must use non-blocking operators –Only single-pass algorithms over data streams Data reduction: approximate summary structures –Synopses, digests => no exact answers Real-time reactions for monitoring applications => active mechanisms Long-running queries: variable system conditions Scalability: shared execution of many continuous queries, monitoring multiple streams 13

Concepts and issues: Architecture 14 Courtesy (slides adapted) from Vera Goebel

Generic DSMS Architecture Input Monitor Output Buffer Query Processor Query Reposi- tory Working Storage Summary Storage Static Storage Streaming Inputs Streaming Outputs Updates to Static Data User Queries [Golab & Özsu 2003] 15

Architecture #2 buffer input module buffer output module Query processor user query static dB Query Optimizer query tree Load Shedder System monitor Concepts from Borealis 16 “Load Shedding techniques for data stream systems, B.Babcock, et.al 2003”

3-Level Architecture Reduce tuples through several layered operations (several DSMSs) Store results in static DB for later analysis E.g., distributed DSMSs VLDB 2003 Tutorial [Koudas & Srivastava 2003] 17

Concepts and issues: Data models 18 Courtesy (slides adapted) from Vera Goebel

Data Models Real-time data stream: sequence of items that arrive in some order and may only be seen once. Stream items: like relational tuples –Relation-based: e.g., STREAM, TelegraphCQ and Borealis –Object-based: e.g., COUGAR, Tribecca Window models –Direction of movements of the endpoints: fixed window, sliding window, landmark window –Time-based vs. Tuple-based –Update interval: eager (for each new arriving), lazy (batch processing), non-overlapping tumbling windows. 19

More on Windows window Sliding: Jumping: Overlapping (adapted from Jarle Søberg) Mechanism for extracting a finite relation from an infinite stream Desirable query semantic; data freshness; blocking operators.. 20

Timestamps Used for tuple ordering and by the DSMS for defining window sizes (time-based) Useful for the user to know when the the tuple originated Explicit: set by the source of data Implicit: set by DSMS, when it has arrived Ordering is an issue Distributed systems: no exact notion of time 21

Concepts and issues: Queries 22 Courtesy (slides adapted) from Vera Goebel

Queries #1 DBMS: one-time (transient) queries DSMS: continuous (persistent) queries Unbounded memory requirements Blocking operators: window techniques Queries referencing past data 23

Queries #2 DBMS: (mostly) exact query answer DSMS: (mostly) approximate query answer –Approximate query answers have been studied: sampling, synopses, sketches, wavelets, histograms, … Data reduction: Batch processing 24

One-pass Query Evaluation DBMS: –Arbitrary data access –One/few pass algorithms have been studied: Limited memory selection/sorting: n-pass quantiles Tertiary memory databases: reordering execution Complex aggregates: bounding number of passes DSMS: –Per-element processing: single pass to reduce drops –Block processing: multiple passes to optimize I/O cost 25

Query Plan DBMS: fixed query plans optimized at beginning DSMS: adaptive query operators –Adaptive plans plans have been studied: Query scrambling: wide-area data access –Update query plan for unexpected delay Eddies: volatile, unpredictable environments –Pipelined query plan tree Borealis: High Availability monitors and query distribution 26

Query Languages #1 Stream query language issues (compositionality, windows) SQL-like proposals suitably extended for a stream environment: – Composable SQL operators – Queries reference relations or streams – Queries produce relations or streams Query operators (selection/projection, join, aggregation) Examples: –GSQL (Gigascope) –CQL (STREAM) –EPL (ESPER) 27

Query Languages #2 3 querying paradigms for streaming data: 1. Relation-based: SQL-like syntax and enhanced support for windows and ordering, e.g., CQL (STREAM), StreaQuel (TelegraphCQ), AQuery, GigaScope 2. Object-based: object-oriented stream modeling, classify stream elements according to type hierarchy, e.g., Tribeca, or model the sources as ADTs, e.g., COUGAR 3. Procedural: users specify the data flow, e.g., Borealis, users construct query plans via a graphical interface (1) and (2) are declarative query languages, currently, the relation- based paradigm is mostly used. 28

Sample Stream Traffic ( sourceIP -- source IP address sourcePort -- port number on source destIP -- destination IP address destPort -- port number on destination length -- length in bytes time -- time stamp ); 29

Procedural Query (Borealis) Simple DoS (SYN Flooding) identification query 30

Selections and Projections Selections, (duplicate preserving) projections are straightforward – Local, per-element operators – Duplicate eliminating projection is like grouping Projection needs to include ordering attribute – No restriction for position ordered streams SELECT sourceIP, time FROM Traffic WHERE length >

Joins General case of join operators problematic on streams –May need to join arbitrarily far apart stream tuples –Equijoin on stream ordering attributes is tractable Majority of work focuses on joins between streams with windows specified on each stream SELECT A.sourceIP, B.sourceIP FROM Traffic1 A [window T1], Traffic2 B [window T2] WHERE A.destIP = B.destIP 32

Aggregations General form: –select G, F1 from S where P group by G having F2 op ϑ –G: grouping attributes, F1,F2: aggregate expressions –Window techniques are needed! Aggregate expressions: – distributive: sum, count, min, max – algebraic: avg – holistic: count-distinct, median 33

Query Optimization DBMS: table based cardinalities used in query optimization => Problematic in a streaming environment Cost metrics and statistics: accuracy and reporting delay vs. memory usage, output rate, power usage Query optimization: query rewriting to minimize cost metric, adaptive query plans, due to changing processing time of operators, selectivity of predicates, and stream arrival rates Query optimization techniques – stream rate based – resource based – QoS based Continuously adaptive optimization Possibility that objectives cannot be met: –resource constraints –bursty arrivals under limited processing capability 34

Data Reduction Techniques Aggregation: approximations e.g., mean or median Load Shedding: drop random tuples Sampling: only consider samples from the stream (e.g., random selection). Used in sensor networks. Sketches: summaries of stream that occupy small amount of memory, e.g., randomized sketching Wavelets: hierarchical decomposition Histograms: approximate frequency of element values in stream 35