Building a Data Stream Management System
Prof. Jennifer Widom
Joint project with Prof. Rajeev Motwani and a team of graduate students
http://www-db.stanford.edu/stream

Presentation transcript:

Building a Data Stream Management System
Prof. Jennifer Widom
Joint project with Prof. Rajeev Motwani and a team of graduate students
http://www-db.stanford.edu/stream
stanfordstreamdatamanager

2 Data Streams
Continuous, unbounded, rapid, time-varying streams of data elements
Occur in a variety of modern applications
–Network monitoring and traffic engineering
–Sensor networks, RFID tags
–Telecom call records
–Financial applications
–Web logs and click-streams
–Manufacturing processes
DSMS = Data Stream Management System

3 DBMS versus DSMS
DBMS
–Persistent relations
–One-time queries
–Random access
–Access plan determined by query processor and physical DB design
DSMS
–Transient streams (and persistent relations)
–Continuous queries
–Sequential access
–Unpredictable data characteristics and arrival patterns

4 The (Simplified) Big Picture
[Diagram labels: Input streams, Register Query, DSMS, Scratch Store, Streamed Result, Stored Result, Archive, Stored Relations]

5 (Simplified) Network Monitoring
[Diagram labels: Network measurements, Packet traces, Register Monitoring Queries, DSMS, Scratch Store, Intrusion Warnings, Online Performance Metrics, Archive, Lookup Tables]

6 The STREAM System
Data streams and stored relations
Declarative language for registering continuous queries
Designed to cope with high data rates and query workloads
–Graceful approximation when needed
–Careful resource allocation and usage

7 The STREAM System
Designed to cope with bursty streams and changing workloads
–Continuous monitoring of data and system behavior
–Continuous reoptimization
Relational, centralized (for now)

8 Contributions to Date
Semantics for continuous queries
Query plans
Query optimization and dynamic reoptimization
Exploiting stream constraints
Operator scheduling
Approximation techniques
Resource allocation to maximize precision
Initial running prototype


10 Semantics for Continuous Queries
Abstract: window specification language and relational query language as “black boxes”
Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences
Goals
–CQs over multiple streams and relations
–Exploit relational semantics to the extent possible
–Easy queries should be easy to write
–Queries should do what you expect

11 Concrete Language – CQL
Relational query language: SQL
Window spec. language derived from SQL-99
–Tuple-based, time-based, partitioned
Syntactic shortcuts and defaults
–So easy queries are easy to write and queries do what you expect
Equivalences
–Basis for query-rewrite optimizations
–Includes all relational equivalences, plus new stream-based ones


13 CQL Example
Two streams, contrived for example:
Orders (orderID, cost)
Fulfillments (orderID, clerk)
Maximum cost of orders fulfilled by each clerk over his/her last 100 fulfillments:
Select F.clerk, Max(O.cost)
From Orders O, Fulfillments F [Partition By clerk Rows 100]
Where O.orderID = F.orderID
Group By F.clerk

14 CQL Example
Two streams, contrived for example:
Orders (orderID, cost)
Fulfillments (orderID, clerk)
Maximum cost of orders fulfilled by each clerk over his/her last 100 fulfillments (streamed result):
Select Istream(F.clerk, Max(O.cost))
From Orders O, Fulfillments F [Partition By clerk Rows 100]
Where O.orderID = F.orderID
Group By F.clerk
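To make the window semantics of the example concrete, here is a minimal Python sketch of what the query computes. It is not how STREAM evaluates CQL. The class `MaxCostPerClerk`, its methods, and the helper `istream` are hypothetical names, and the sketch assumes that Orders, which has no window in the query, is treated as an unbounded window.

```python
from collections import defaultdict, deque


class MaxCostPerClerk:
    """Toy evaluator for the example query (the slide uses rows=100)."""

    def __init__(self, rows=100):
        # Orders has no window in the query; assumed unbounded here.
        self.cost_by_order = {}
        # [Partition By clerk Rows N]: the last N fulfillments per clerk.
        self.window = defaultdict(lambda: deque(maxlen=rows))

    def on_order(self, order_id, cost):
        self.cost_by_order[order_id] = cost

    def on_fulfillment(self, order_id, clerk):
        # deque(maxlen=N) silently drops the oldest entry when full.
        self.window[clerk].append(order_id)

    def result(self):
        """Current relation: clerk -> max cost over the clerk's window."""
        out = {}
        for clerk, order_ids in self.window.items():
            # Join on orderID; fulfillments with no known order are skipped.
            costs = [self.cost_by_order[o] for o in order_ids
                     if o in self.cost_by_order]
            if costs:
                out[clerk] = max(costs)
        return out


def istream(previous, current):
    """Tuples present in the current result but not in the previous one."""
    return sorted(set(current.items()) - set(previous.items()))
```

With `rows=2`, orders costing 80, 50, and 20, and all three fulfilled in that order by one clerk, the result for that clerk moves from 80 to 50 once the first fulfillment slides out of the window, and `istream` reports only the new tuple.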

15 Query Execution
When a continuous query is registered, generate a query plan
–Users can also register plans directly
Plans composed of three main components:
–Operators (as in most conventional DBMS’s)
–Inter-operator Queues (as in many conventional DBMS’s)
–State (synopses)
Global scheduler for plan execution

16 Simple Query Plan
[Figure: query plan over Stream 1, Stream 2, and Stream 3 with join operators (⋈), synopses State 1 to State 4, queries Q1 and Q2, and a Scheduler]


18 Memory Overhead in Query Processing
Queues + State
Continuous queries keep state indefinitely
Online requirements suggest using memory rather than disk
–But we realize this assumption is shaky
Goal: minimize memory use while providing timely, accurate answers

19 Reducing Memory Overhead
Two main techniques to date
1)Exploit constraints on streams to reduce state
2)Clever operator scheduling to reduce queue sizes


22 Exploiting Stream Constraints
For most queries, unbounded memory is required for arbitrary streams
But streams may exhibit constraints that reduce, bound, or even eliminate state
Conventional database constraints
–Redefined for streams
–“Relaxed” for stream environment

23 Stream Constraints
Each constraint type defines adherence parameter k
Clustered(k) for attribute S.A
Ordered(k) for attribute S.A
Referential-Integrity(k) for join S1 ⋈ S2

24 Algorithm for Exploiting Constraints
Input
–Any Select-Project-Join query over streams
–Any set of k-constraints
Output
–Query execution plan that reduces or eliminates state based on k-constraints
–If constraints violated, get approximate result
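The slides do not define the constraints formally, so the following Python sketch only illustrates the general idea of trading a constraint for bounded state. It is not the project's algorithm, which targets Select-Project-Join queries; a grouped count is used here because it is shorter. The sketch assumes a reading of Clustered(k) in which at most k tuples with other values arrive between two tuples carrying the same value of S.A. The function `grouped_counts` is a hypothetical name.

```python
def grouped_counts(stream, k):
    """Yield (value, count) as soon as a group is closed under Clustered(k).

    Assumed reading of Clustered(k): at most k tuples with other values
    arrive between two tuples that share a value. A group that has seen
    more than k foreign tuples since its last member can be discarded.
    """
    open_groups = {}  # value -> [count, foreign tuples since last member]
    for value in stream:
        for other, group in open_groups.items():
            if other != value:
                group[1] += 1
        if value in open_groups:
            open_groups[value][0] += 1
            open_groups[value][1] = 0
        else:
            open_groups[value] = [1, 0]
        # Close every group whose gap now exceeds k and free its state.
        for closed in [v for v, g in open_groups.items() if g[1] > k]:
            yield closed, open_groups.pop(closed)[0]
    # End of input: flush whatever is still open.
    for value, group in open_groups.items():
        yield value, group[0]
```

If the stream violates the assumed constraint, a value can reappear after its group was closed and is then reported in two pieces, which mirrors the approximate result mentioned on the slide. For example, `list(grouped_counts(["a", "b", "a"], 0))` splits the count for "a".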

25 Operator Scheduling
Global scheduler invokes run method of query plan operators with “timeslice” parameter
Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, …
–First scheduler: round-robin
–Second scheduler: minimize queue sizes
–Third scheduler: minimize combination of queue sizes and latency
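A minimal Python sketch of the scheduler and operator interaction described above, using the round-robin policy. The names `Operator` and `round_robin` are hypothetical, and the timeslice is assumed to be a tuple budget rather than wall-clock time, which keeps the example deterministic.

```python
from collections import deque


class Operator:
    """A plan operator reading from an input queue and writing to an output."""

    def __init__(self, name, fn, inbox, outbox):
        self.name = name
        self.fn = fn          # returns an output tuple, or None to drop
        self.inbox = inbox    # inter-operator queue (deque)
        self.outbox = outbox  # next operator's queue, or a result list

    def run(self, timeslice):
        """Process at most `timeslice` tuples; return how many were consumed."""
        done = 0
        while self.inbox and done < timeslice:
            result = self.fn(self.inbox.popleft())
            if result is not None:
                self.outbox.append(result)
            done += 1
        return done


def round_robin(operators, timeslice):
    """Give each operator a turn in plan order until every queue is empty."""
    while any(op.inbox for op in operators):
        for op in operators:
            op.run(timeslice)
```

A two-operator plan that selects even numbers and multiplies them by ten can be wired as `Operator("select_even", ..., q0, q1)` followed by `Operator("times_ten", ..., q1, out)`, where `q0` and `q1` are deques and `out` is a list. Other scheduling objectives would replace only `round_robin`.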

26 CHAIN Scheduling Algorithm
Goal: minimize total queue size for unpredictable, bursty stream arrival patterns
Proven: within constant factor of any “clairvoyant” strategy for some queries
Empirical results: large savings over naive strategies for many queries
But minimizing queue sizes is at odds with minimizing latency

27 CHAIN Algorithm (contd.)
Always schedule ready chain with steepest downslope
Plan progress chart: memory delta per unit time
Independent scheduling of operators makes mistakes
Consider chains of operators
[Progress chart: axes Memory and Computation time; operators Op1, Op2, Op3, Op4]
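The slide's rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation. The progress chart is assumed to be a list of `(time, size)` points, one before the first operator and one after each operator, with strictly increasing time. The functions `chains` and `pick_chain`, and the idea of tracking waiting tuples only at chain entry points, are hypothetical simplifications.

```python
def chains(progress):
    """Split a progress chart into chains along its steepest downslopes.

    Returns (start, end, slope) triples, where operators start+1..end form
    one chain and slope is the memory released per unit of computation time.
    """
    segments = []
    i = 0
    while i < len(progress) - 1:
        t0, s0 = progress[i]

        def downslope(j):
            t, s = progress[j]
            return (s0 - s) / (t - t0)

        # Steepest downslope from point i; ties go to the farthest point.
        best = max(range(i + 1, len(progress)),
                   key=lambda j: (downslope(j), j))
        segments.append((i, best, downslope(best)))
        i = best
    return segments


def pick_chain(segments, waiting):
    """Choose the ready chain with the steepest downslope.

    `waiting` maps a chain's start index to the number of queued tuples.
    """
    ready = [seg for seg in segments if waiting.get(seg[0], 0) > 0]
    return max(ready, key=lambda seg: seg[2]) if ready else None
```

For a four-operator plan with chart `[(0, 100), (1, 90), (2, 30), (3, 25), (4, 0)]`, the first operator alone frees little memory, but the first two together give the steepest drop, so they are grouped into one chain and scheduled ahead of the remaining two whenever both chains have tuples waiting.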

28 Approximation
Why approximate?
–Memory requirement too high, even with constraints and clever scheduling
–Can’t process streams fast enough for query load

29 Approximation (cont’d)
Static: rewrite queries to add (or shrink) sampling or windows
+User can participate, predictable behavior
–Doesn’t consider dynamic conditions
Dynamic: modify query plan: insert sampling operators, shrink windows, load shedding
+Adapts to current resource availability
–How to convey to user? (major open issue)
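One way to picture the dynamic option is a sampling operator whose keep probability follows the current load, with the aggregate scaled to compensate. The Python sketch below is a generic illustration of random load shedding, not the technique used in STREAM. The names `keep_probability`, `Shedder`, and `approximate_sum`, and the capacity and arrival-rate parameters, are hypothetical.

```python
import random


def keep_probability(capacity, arrival_rate):
    """Fraction of tuples the system can afford to keep at the current load."""
    if arrival_rate <= capacity:
        return 1.0
    return capacity / arrival_rate


class Shedder:
    """Sampling operator that admits each tuple with probability p."""

    def __init__(self, p, seed=None):
        self.p = p
        self.rng = random.Random(seed)

    def admit(self, item):
        # random() lies in [0, 1), so p = 1.0 admits every tuple.
        return self.rng.random() < self.p


def approximate_sum(values, shedder):
    """Sum the admitted tuples and scale by 1/p to estimate the full sum."""
    kept = [v for v in values if shedder.admit(v)]
    return sum(kept) / shedder.p
```

With `p = 1.0` the estimate equals the exact sum. With a smaller `p` it is an unbiased estimate whose error the user cannot see directly, which is the open issue the slide raises.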

30 The Stream Systems Landscape
(At least) three general-purpose DSMS prototypes underway
–STREAM (Stanford)
–Aurora (Brown, Brandeis, MIT)
–TelegraphCQ (Berkeley)
All will be demo’d at SIGMOD ’03
Cooperating to develop stream system benchmark
Goal: demonstrate that conventional systems are far inferior for data stream applications

Conclusion: We’re building a Data Stream Management System