The Stanford Data Streams Research Project Profs. Rajeev Motwani & Jennifer Widom And a cast of full- and part-time students: Arvind Arasu, Brian Babcock,

Presentation transcript:

The Stanford Data Streams Research Project
Profs. Rajeev Motwani & Jennifer Widom
And a cast of full- and part-time students: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Liadan O’Callaghan, Justin Rosentein, Qi Sun, Rohit Varma
stanfordstreamdatamanager

2 Data Streams
Traditional DBMS -- data stored in finite, persistent data sets
New applications -- data as multiple, continuous, rapid, time-varying data streams
  – Network monitoring and traffic engineering
  – Security applications
  – Telecom call records
  – Financial applications
  – Web logs and click-streams
  – Sensor networks
  – Manufacturing processes

3 Challenges
Multiple, continuous, rapid, time-varying streams of data
Queries may be continuous (not just one-time)
  – Evaluated continuously as stream data arrives
  – Answer updated over time
Queries may be complex
  – Beyond element-at-a-time processing
  – Beyond stream-at-a-time processing

4 Using Traditional Database
[Diagram labels: User/Application, Loader, Query, Result, …]

5 New Approach for Data Streams
[Diagram labels: User/Application, Register Query, Stream Query Processor, Result]

6 New Approach for Data Streams
[Diagram labels: User/Application, Register Query, Stream Query Processor, Result, Scratch Space (Memory and/or Disk), Data Stream Management System (DSMS)]

12 DBMS versus DSMS
DBMS | DSMS
Persistent relations | Transient streams (and persistent relations)
One-time queries | Continuous queries
Random access | Sequential access
Access plan determined by query processor and physical DB design | Unpredictable data arrival and characteristics
“Unbounded” disk store | Bounded main memory

13 Sample Applications
Network management and traffic engineering (e.g., Sprint)
  – Streams of measurements and packet traces
  – Queries: detect anomalies, adjust routing
Telecom call data (e.g., AT&T)
  – Streams of call records
  – Queries: fraud detection, customer call patterns, billing

14 Sample Applications (cont’d)
Network security (e.g., iPolicy, NetForensics/Cisco, Netscreen)
  – Network packet streams, user session information
  – Queries: URL filtering, detecting intrusions & DOS attacks & viruses
Financial applications (e.g., Traderbot)
  – Streams of trading data, stock tickers, news feeds
  – Queries: arbitrage opportunities, analytics, patterns

15 Sample Applications (cont’d)
Web tracking and personalization (e.g., Yahoo, Google, Akamai)
  – Clickstreams, user query streams, log records
  – Queries: monitoring, analysis, personalization
Truly massive databases (e.g., Astronomy Archives)
  – Stream the data by once (or over and over)
  – Queries do the best they can

16 Making Things Concrete
Database = two streams of mobile call records
  – Outgoing(connectionID, caller, start, end)
  – Incoming(connectionID, callee, start, end)
Query language = SQL
FROM clauses can refer to streams and/or relations

17 Query Example 1
Find all outgoing calls longer than 2 minutes (relational selection)
    SELECT O.connectionID, O.caller
    FROM Outgoing O
    WHERE O.end - O.start > 2
Result requires unbounded storage
Can provide result as data stream
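
A selection like Query Example 1 can be evaluated one element at a time, which is why its result can itself be delivered as a stream. The following is a minimal sketch in Python, not the project's implementation. The `Outgoing` record type, the function name `long_calls`, and the assumption that `start` and `end` are measured in minutes are all illustrative.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Outgoing(NamedTuple):
    """One element of the Outgoing stream (times assumed to be in minutes)."""
    connection_id: int
    caller: str
    start: float
    end: float


def long_calls(stream: Iterable[Outgoing],
               min_minutes: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Emit (connectionID, caller) for each call longer than min_minutes.

    The operator keeps no state: each input element is tested and either
    forwarded to the output stream or dropped.
    """
    for call in stream:
        if call.end - call.start > min_minutes:
            yield call.connection_id, call.caller


calls = [
    Outgoing(1, "alice", 0.0, 3.5),
    Outgoing(2, "bob", 1.0, 2.0),
    Outgoing(3, "carol", 2.0, 4.5),
]
result = list(long_calls(calls))
```

Because the generator holds nothing between elements, the only unbounded storage is whatever the consumer uses if it chooses to keep the whole result.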

18 Query Example 2
Pair up callers and callees (relational join)
    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.connectionID = I.connectionID
Can still provide result as data stream
Requires unbounded temporary storage (without additional assumptions)
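
One common way to join two streams incrementally is a symmetric hash join: every arriving element is stored in a hash table for its own side and probed against the table for the other side. The sketch below assumes a single interleaved input of `(side, connection_id, party)` tuples, a layout chosen only for illustration. It shows why the output is a stream while the two tables grow without bound unless something more is known about the input.

```python
from collections import defaultdict


def symmetric_hash_join(events):
    """Join Outgoing ('O') and Incoming ('I') elements on connection ID.

    events: iterable of (side, connection_id, party), where party is the
    caller for side 'O' and the callee for side 'I'.
    Yields (caller, callee) as soon as both halves have arrived.
    """
    seen = {"O": defaultdict(list), "I": defaultdict(list)}
    for side, conn_id, party in events:
        other = "I" if side == "O" else "O"
        # Remember this element: a matching element may arrive later.
        seen[side][conn_id].append(party)
        # Probe everything already seen on the other side.
        for match in seen[other].get(conn_id, []):
            yield (party, match) if side == "O" else (match, party)


events = [
    ("O", 1, "alice"),
    ("I", 2, "dave"),
    ("I", 1, "bob"),
    ("O", 2, "carol"),
]
pairs = list(symmetric_hash_join(events))
```

Nothing is ever evicted from `seen` here. An additional assumption, for example that each connection ID appears at most once per stream, would allow entries to be dropped after their first match.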

19 Query Example 3
Find total connection time for each caller (relational grouping and aggregation)
    SELECT O.caller, sum(O.end - O.start)
    FROM Outgoing O
    GROUP BY O.caller
Cannot provide result in (append-only) stream
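
The reason an append-only output does not work for Query Example 3 is that a caller's total is never final: any later call revises it. A minimal sketch, with an input layout of `(caller, start, end)` tuples assumed for illustration, is to emit an update each time a total changes, so that later output elements supersede earlier ones for the same caller.

```python
def running_totals(stream):
    """Emit (caller, total_so_far) after every call.

    The output is a stream of updates, not an append-only answer:
    a later element for the same caller replaces the earlier one.
    """
    totals = {}
    for caller, start, end in stream:
        totals[caller] = totals.get(caller, 0) + (end - start)
        yield caller, totals[caller]


calls = [("alice", 0, 3), ("bob", 1, 2), ("alice", 5, 9)]
updates = list(running_totals(calls))
```

Here the first element emitted for `alice` is overtaken by the third, which is exactly the revision an append-only stream cannot express. The `totals` dictionary also grows with the number of distinct callers.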

20 Project Goal
Reconsider all aspects of data management and processing in presence of data streams

21 Remainder of Talk
Data stream model
Queries over data streams
  – Language, semantics, evaluation & optimization
DSMS query processing architecture and system internals
Results to date
Ongoing work
Related work

22 Data Model
Database: relations + data streams
Stream characteristics
  – Type of data (schema)
  – Data distribution
  – Flow rate
  – Stability of distribution and flow
  – Ordering and other constraints
  – Synchronization of multiple streams
  – Distributed streams

23 Data Stream Queries -- Basic Issues
Answer availability
  – One-time
  – Multiple-time
  – Continuous (“standing”), stored or streamed
Registration time
  – Predefined
  – Ad hoc
Stream access
  – Arbitrary
  – Sliding window (special case: size = 1)
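
To make the sliding-window access mode concrete, here is a small sketch of a count-based window that maintains a running sum over the most recent `size` elements. The function name and the choice of a sum as the aggregate are illustrative. Time-based windows follow the same pattern, with eviction driven by timestamps instead of a count.

```python
from collections import deque


def sliding_sum(stream, size):
    """Yield the sum of the most recent `size` elements after each arrival."""
    window = deque()
    total = 0
    for x in stream:
        window.append(x)
        total += x
        if len(window) > size:
            # Evict the oldest element so memory stays bounded by `size`.
            total -= window.popleft()
        yield total


sums_3 = list(sliding_sum([1, 2, 3, 4, 5], size=3))
sums_1 = list(sliding_sum([1, 2, 3, 4, 5], size=1))
```

With `size=1` the window holds only the current element, so the operator degenerates to element-at-a-time processing, the special case noted on the slide.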

25 Query Language & Semantics
Specifying queries over streams
  – SQL-like versus dataflow network of operators
  – Sliding windows as first-class query construct
Semantic issues
  – Blocking operators, e.g., aggregation, order-by
  – Streams as sets versus lists
  – Timestamping

26 Query Evaluation -- Approximation
Why approximate?
  – Streams are coming too fast
  – Exact answer requires unbounded storage or significant computational resources
  – Ad hoc queries reference history
Issues in approximation
  – Sliding windows, sampling, synopses, …
  – How is approximation controlled?
  – How is it understood by user?
Accuracy-efficiency-storage tradeoff
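
As one illustration of sampling as a bounded-memory synopsis, the sketch below uses reservoir sampling (Algorithm R), a standard technique chosen here as an example and not one the slides name. It keeps a uniform random sample of `k` elements from a stream of unknown length. The optional `rng` parameter is an assumption added to make runs repeatable.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k elements from the stream.

    Memory use is O(k) regardless of how many elements arrive.
    """
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k elements.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


sample = reservoir_sample(range(1000), k=5, rng=random.Random(0))
short = reservoir_sample(["a", "b"], k=5)
```

Queries answered from the sample are approximate, and `k` is the knob that trades storage against accuracy.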

27 Query Evaluation -- Adaptivity
Why adaptivity?
  – Queries are long-running
  – Fluctuating stream arrival & data characteristics
  – Evolving query loads
Issues in adaptivity
  – Adaptive resource allocation (memory, computation)
  – Adaptive query execution plans

28 Query Evaluation -- Multiple Queries
Possibly large number of continuous queries
Long-running
Shared resources
Multi-query optimization

29 Query Evaluation -- Distributed Streams
1. Many physical streams but one logical stream
  – E.g., maintain top 100 visited pages at Yahoo
2. Correlate streams at distributed servers
  – E.g., network monitoring
3. Many streams controlled by a few servers
  – E.g., sensor networks
Issues
  – Move processing to streams, not streams to processor
  – Approximation-bandwidth tradeoff

30 Query Processing Architecture
Users issue continuous and ad-hoc queries
Administrator can monitor query execution and adjust run-time parameters
Applications register continuous queries
[Diagram labels: Input Data Streams, Output Stream, Query Plans, Synopses, Waiting Op, Ready Op, Running Op]

31 DSMS Internals
Query plans: operators, synopses, queues
Memory management
  – Dynamic allocation to buffers, queues, synopses
  – Accuracy vs. memory tradeoff
  – Operators adapt gracefully to memory reallocation
Scheduler
  – Handles variable-rate input streams
  – Handles varying operator and query requirements

32 Some Results to Date
Algorithms on data streams
  – Online clustering [FOCS 2000, ICDE 2002]
  – Online quantiles [SIGMOD 98, SIGMOD 99]
  – Statistics over sliding windows [SODA 2002]
  – Online frequency counting
Theory of stream query processing
  – Memory requirements of stream queries [PODS 2002]
System design
  – STREAM: stanfordstreamdatamanager
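
The slide does not say which frequency-counting algorithm the project developed. As a generic illustration of the idea, the sketch below uses the Misra-Gries summary, a well-known algorithm substituted here for the example. It keeps at most `k - 1` counters and guarantees that every item occurring more than n/k times in a stream of n elements survives in the summary, with each stored count underestimating the true count by at most n/k.

```python
def misra_gries(stream, k):
    """Return a summary with at most k - 1 counters.

    Any item whose true frequency exceeds n / k (n = stream length)
    is guaranteed to appear in the result; counts are underestimates.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
summary = misra_gries(stream, k=3)
```

In this run `a` occurs 6 times out of 12, which exceeds 12/3, so it is guaranteed to remain in `summary`. A second pass over stored data, when one is available, can turn the underestimates into exact counts.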

33 STREAM System Implementation
Comprehensive DSMS query processor
Broad suite of operators and synopses
Sophisticated “developer’s workbench” interface
  – Submit queries in extended SQL or algebra
  – Submit or edit query plans in XML or GUI
  – Query plan execution visualizer
  – On-the-fly modification of memory allocation, scheduling policies, etc.

34 Ongoing Work
Algebra for streams
Synopses and algorithmic issues
Memory management issues
Exploiting constraints on streams
Approximation in query processing
Distributed stream processing
System development

36 Ongoing Work -- Constraints
Exploiting constraints on streams in query processing
  – Foreign-key joins, referential integrity, clustering, ordering
  – Need not be exact (e.g., k-clustered)
  – Reduce memory requirements
  – Unblock blocking operators
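
As a hedged illustration of how a constraint can unblock an operator, suppose the Outgoing stream were exactly clustered by caller, so that all calls of one caller arrive contiguously. This is a hypothetical assumption made for the example and covers only the exact case, not k-clustering. The grouping query of Query Example 3 can then emit each caller's final total as soon as the next caller begins, using constant memory.

```python
def grouped_totals_clustered(stream):
    """Per-caller totals for a stream assumed to be clustered by caller.

    stream: iterable of (caller, start, end). Emits (caller, total) once
    per group, when the first element of the next group arrives.
    """
    current, total, started = None, 0, False
    for caller, start, end in stream:
        if started and caller != current:
            # Clustering guarantees `current` never reappears: total is final.
            yield current, total
            total = 0
        current, started = caller, True
        total += end - start
    if started:
        # Only reached if the input ends; an unbounded stream never gets here.
        yield current, total


calls = [("alice", 0, 3), ("alice", 5, 9), ("bob", 1, 2)]
totals = list(grouped_totals_clustered(calls))
```

The output is append-only and the state is a single running total. If the clustering assumption is violated, a caller will simply appear more than once in the output.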

37 Ongoing Work -- Approximation in Query Processing
Understanding behavior of approximate operators when composed
Memory allocation to operators in a plan, given per-operator memory-accuracy curve
Best query plan, assuming best memory allocation
Multiple (weighted) queries sharing resources

38 Related Work
Triggers, alerters, materialized views, continuous queries on conventional DBs, pub/sub, sequence & temporal databases, …
Telegraph project at UC Berkeley
Niagara project at Wisconsin/OGI
Amazon project at Cornell
Aurora project at Brown/MIT
And others

For Papers and General Info. stanfordstreamdatamanager