Models and Issues in Data Stream Systems

Presentation transcript:

Models and Issues in Data Stream Systems. Rajeev Motwani, Stanford University (with Brian Babcock, Shivnath Babu, Mayur Datar, and Jennifer Widom). STREAM Project Members: Arvind Arasu, Gurmeet Manku, Liadan O’Callaghan, Justin Rosenstein, Qi Sun, Rohit Varma. PODS 2002

Data Streams. Traditional DBMS: data stored in finite, persistent data sets. New applications: data input as continuous, ordered data streams. Examples: network monitoring and traffic engineering; telecom call records; network security; financial applications; sensor networks; manufacturing processes; web logs and clickstreams; massive data sets.

Data Stream Management System. [Figure: the User/Application registers a query with the Data Stream Management System (DSMS) and receives results; the DSMS contains a Stream Query Processor and Scratch Space (memory and/or disk).]

Meta-Questions. Killer-apps: Do application stream rates exceed DBMS capacity? Can a DSMS handle high rates anyway? Motivation: Is there a need for a general-purpose DSMS, rather than ad-hoc, application-specific systems? Non-Trivial: Is a DSMS merely a DBMS with enhanced support for triggers, temporal constructs, and data rate management?

Sample Applications. Network security (e.g., iPolicy, NetForensics/Cisco, Niksun): network packet streams and user session information; queries include URL filtering and detecting intrusions, DOS attacks, and viruses. Financial applications (e.g., Traderbot): streams of trading data, stock tickers, and news feeds; queries include arbitrage opportunities, analytics, and patterns; SEC requirement on closing trades.

Executive Summary. Data Stream Management Systems (DSMS): highlight issues and motivate research; not a tutorial or comprehensive survey. Caveats: personal view of an emerging field; Stanford STREAM Project bias; cannot cover all projects in detail.

DBMS versus DSMS (each line: DBMS property / DSMS counterpart)
Persistent relations / Transient streams
One-time queries / Continuous queries
Random access / Sequential access
“Unbounded” disk store / Bounded main memory
Only current state matters / History/arrival-order is critical
Passive repository / Active stores
Relatively low update rate / Possibly multi-GB arrival rate
No real-time services / Real-time requirements
Assume precise data / Data stale/imprecise
Access plan determined by query processor, physical DB design / Unpredictable/variable data arrival and characteristics

Making Things Concrete. [Figure: ALICE and BOB, each connected to a Central Office; the central offices feed a DSMS.] Streams: Outgoing (call_ID, caller, time, event) and Incoming (call_ID, callee, time, event), where event = start or end.

Query 1 (self-join): find all outgoing calls longer than 2 minutes. SELECT O1.call_ID, O1.caller FROM Outgoing O1, Outgoing O2 WHERE (O2.time - O1.time > 2 AND O1.call_ID = O2.call_ID AND O1.event = 'start' AND O2.event = 'end'). Result requires unbounded storage. Can provide result as data stream. Can output a call after 2 minutes, without seeing its end event.
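The last point on this slide can be illustrated with a minimal Python sketch. It is not the STREAM implementation; the function name `long_calls`, the tuple layout, and the assumption that events arrive in timestamp order are all illustrative choices. A call is emitted as soon as any later event shows that more than 2 minutes have passed since its start, so only calls that are currently open are kept in memory.

```python
from collections import OrderedDict

def long_calls(outgoing, threshold=2):
    """Yield (call_ID, caller) for calls longer than `threshold` minutes.

    `outgoing` is an iterable of (call_ID, caller, time, event) tuples,
    assumed (for this sketch) to arrive in non-decreasing time order.
    """
    open_calls = OrderedDict()  # call_ID -> (caller, start_time), oldest first
    for call_id, caller, time, event in outgoing:
        # The arriving tuple's timestamp acts as the current time: every open
        # call that started more than `threshold` ago is already a result.
        while open_calls:
            oldest_id, (oldest_caller, start) = next(iter(open_calls.items()))
            if time - start > threshold:
                del open_calls[oldest_id]
                yield (oldest_id, oldest_caller)
            else:
                break
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # A call still open here was short; drop its state.
            open_calls.pop(call_id, None)

events = [
    (1, "alice", 0, "start"),
    (2, "bob", 1, "start"),
    (2, "bob", 2, "end"),
    (1, "alice", 5, "end"),
]
print(list(long_calls(events)))
```

Because results are emitted and then forgotten, the operator's state is bounded by the number of concurrently open calls rather than by the length of the stream.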

Query 2 (join): pair up callers and callees. SELECT O.caller, I.callee FROM Outgoing O, Incoming I WHERE O.call_ID = I.call_ID. Can still provide result as data stream. Requires unbounded temporary storage ... unless streams are near-synchronized.
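A hedged sketch of why synchronization matters for this join: the code below is a symmetric hash join over a single interleaved sequence of tagged tuples. The tagging scheme ("O" or "I"), the function name `pair_calls`, and the simplifying assumption that each call_ID appears exactly once per stream are not from the talk. Under that assumption a tuple's state can be discarded once it finds its partner, and the peak number of waiting tuples shows how much temporary storage the arrival order forces.

```python
def pair_calls(events):
    """Join Outgoing and Incoming tuples on call_ID.

    `events` yields ("O", call_ID, caller) or ("I", call_ID, callee).
    Assumption for this sketch: one tuple per call_ID on each stream.
    Returns (list of (caller, callee) pairs, peak number of buffered tuples).
    """
    waiting_out = {}  # call_ID -> caller, waiting for the Incoming tuple
    waiting_in = {}   # call_ID -> callee, waiting for the Outgoing tuple
    results = []
    peak = 0
    for stream, call_id, party in events:
        if stream == "O":
            if call_id in waiting_in:
                results.append((party, waiting_in.pop(call_id)))
            else:
                waiting_out[call_id] = party
        else:
            if call_id in waiting_out:
                results.append((waiting_out.pop(call_id), party))
            else:
                waiting_in[call_id] = party
        peak = max(peak, len(waiting_out) + len(waiting_in))
    return results, peak

synchronized = [("O", 1, "alice"), ("I", 1, "bob"), ("O", 2, "carol"), ("I", 2, "dave")]
skewed = [("O", 1, "alice"), ("O", 2, "carol"), ("I", 1, "bob"), ("I", 2, "dave")]
print(pair_calls(synchronized))
print(pair_calls(skewed))
```

Both inputs produce the same pairs, but the skewed arrival order needs more buffered state; with arbitrary skew between the streams that state has no fixed bound.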

Query 3 (group-by aggregation): total connection time for each caller. SELECT O1.caller, sum(O2.time - O1.time) FROM Outgoing O1, Outgoing O2 WHERE (O1.call_ID = O2.call_ID AND O1.event = 'start' AND O2.event = 'end') GROUP BY O1.caller. Cannot provide result in (append-only) stream. Output updates? Provide current value on demand? Memory?
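One of the options raised here, emitting an update stream instead of an append-only result, might look like the following sketch. The names and tuple layout are assumptions, and each output tuple supersedes the previous one for the same caller.

```python
def connection_time_updates(outgoing):
    """Yield (caller, running_total) each time a call by that caller ends.

    `outgoing` yields (call_ID, caller, time, event) tuples. Each output
    replaces the caller's previous total, so the output is an update stream.
    """
    start_times = {}  # call_ID -> start time of calls still in progress
    totals = {}       # caller -> total connection time so far
    for call_id, caller, time, event in outgoing:
        if event == "start":
            start_times[call_id] = time
        elif event == "end" and call_id in start_times:
            duration = time - start_times.pop(call_id)
            totals[caller] = totals.get(caller, 0) + duration
            yield (caller, totals[caller])

calls = [
    (1, "alice", 0, "start"),
    (2, "bob", 1, "start"),
    (1, "alice", 4, "end"),
    (3, "alice", 5, "start"),
    (2, "bob", 6, "end"),
    (3, "alice", 7, "end"),
]
print(list(connection_time_updates(calls)))
```

The `totals` dictionary grows with the number of distinct callers, which is the memory question the slide ends on.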

Outline of Remaining Talk: Stream Models and DSMS Architectures; Query Processing; Runtime and Systems Issues; Algorithms; Conclusion.

Data Model. Append-only: call records. Updates: stock tickers. Deletes: transactional data. Meta-Data: control signals, punctuations. System internals: probably need all of the above.

Query Model. [Figure: User/Application, Query Processor, DSMS.]

Related Database Technology. A DSMS must use ideas from the following, but none is a substitute: triggers and materialized views in conventional DBMS; main-memory databases; distributed databases; pub/sub systems; active databases; sequence/temporal/timeseries databases; realtime databases; adaptive, online, partial results. Novelty in DSMS: Semantics (input ordering, streaming output, …); State (cannot store unending streams, yet need history); Performance (rate, variability, imprecision, …).

Stream Projects. Amazon/Cougar (Cornell): sensors. Aurora (Brown/MIT): sensor monitoring, dataflow. Hancock (AT&T): telecom streams. Niagara (OGI/Wisconsin): Internet XML databases. OpenCQ (Georgia): triggers, incremental view maintenance. Stream (Stanford): general-purpose DSMS. Tapestry (Xerox): pub/sub content-based filtering. Telegraph (Berkeley): adaptive engine for sensors. Tribeca (Bellcore): network monitoring.

Outline of Remaining Talk: Stream Models and DSMS Architectures; Query Processing; Runtime and Systems Issues; Algorithms; Conclusion.

Blocking Operators. Blocking: no output until entire input seen; with streams, input never ends. Simple aggregates: output an “update” stream. Set output (sort, group-by): at the root, could maintain an output data structure; at intermediate nodes, try non-blocking analogs (example: juggle for sort [Raman, R, Hellerstein]); punctuations and constraints. Join: non-blocking, but intermediate state? Sliding-window restrictions.

Punctuations [Tucker, Maier, Sheard, Fegaras]. An assertion about future stream contents; unblocks operators and reduces state. Future work: Inserted at source or internal (operator signaling)? Does P unblock Q? Exists P? Rewrite Q? Relation between P and memory for Q? [Figure labels: group-by; R.A<10; R.A≥10; State/Index; X over R and S; P: S.A≥10.]
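How a punctuation can unblock a group-by and release its state can be sketched as follows. The encoding of stream items as `("tuple", key, value)` and `("punct", key)` is a hypothetical simplification for illustration, not the representation used by Tucker et al.; here a punctuation asserts that no further tuples with the given key will arrive.

```python
def grouped_sum(stream):
    """Group-by SUM that emits a group's final result when punctuated.

    Items are ("tuple", key, value) or ("punct", key); the latter asserts
    that no more tuples with `key` will appear (hypothetical encoding).
    """
    partial = {}  # key -> running sum for groups that may still grow
    for item in stream:
        if item[0] == "tuple":
            _, key, value = item
            partial[key] = partial.get(key, 0) + value
        else:
            _, key = item
            if key in partial:
                # The group is complete: emit it and free its state.
                yield (key, partial.pop(key))

stream = [
    ("tuple", "a", 1),
    ("tuple", "b", 2),
    ("tuple", "a", 3),
    ("punct", "a"),
    ("tuple", "b", 5),
    ("punct", "b"),
]
print(list(grouped_sum(stream)))
```

Without the punctuations this operator could never emit a final sum on an unending stream, and `partial` would retain every group indefinitely.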

Impact of Limited Memory. Continuous streams grow unboundedly, and queries may require unbounded memory. [ABBMW 02]: a priori memory bounds for a query; conjunctive queries with arithmetic comparisons; queries with joins need domain restrictions; impact of duplicate elimination. Open: general queries.

Approximate Query Evaluation. Why? Handling load (streams coming too fast); avoiding unbounded storage and computation; ad hoc queries need approximate history. How? Sliding windows, synopses, samples, load-shedding. Major issues: metric for set-valued queries; composition of approximate operators; how is it understood/controlled by the user?; integration into the query language; query planning and interaction with resource allocation; accuracy-efficiency-storage tradeoff and global metric.

Sliding Window Approximation. [Figure: example bit stream 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 1 0.] Why? Approximation technique for bounded memory; natural in applications (emphasizes recent data); well-specified and deterministic semantics. Issues: extend relational algebra, SQL, query optimization; algorithmic work; timestamps?
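As a baseline for the sliding-window idea, here is a small sketch that counts the 1's among the last n bits of a stream. It is an exact method chosen for clarity, so it stores the whole window; the function name and the choice of n are illustrative assumptions.

```python
from collections import deque

def window_counts(bits, n):
    """After each arriving bit, yield the number of 1's in the last n bits."""
    window = deque()
    ones = 0
    for b in bits:
        window.append(b)
        ones += b
        if len(window) > n:
            # The oldest bit falls out of the window.
            ones -= window.popleft()
        yield ones

bits = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
print(list(window_counts(bits, 4)))
```

Memory here grows linearly with the window size, which is what motivates the algorithmic work on approximate window summaries mentioned on the slide.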

Timestamps. Explicit: injected by the data source; models the real-world event represented by the tuple; tuples may be out of order, but if near-ordered they can be reordered with small buffers. Implicit: introduced as a special field by the DSMS; arrival time in the system; enables order-based querying and sliding windows. Issues: Distributed streams? Composite tuples created by the DSMS?
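Reordering a near-ordered stream with a small buffer can be sketched with a min-heap. This assumes, for illustration only, that no tuple arrives more than `buffer_size` positions later than its place in timestamp order; the function name and tuple shape are hypothetical.

```python
import heapq

def reorder(tuples, buffer_size):
    """Yield (timestamp, payload) tuples in timestamp order.

    Correct when every tuple is displaced by at most `buffer_size`
    positions from its sorted position (an assumption of this sketch).
    """
    heap = []
    for item in tuples:
        heapq.heappush(heap, item)
        if len(heap) > buffer_size:
            # The smallest buffered timestamp can no longer be preceded
            # by a late arrival, so it is safe to release.
            yield heapq.heappop(heap)
    while heap:
        yield heapq.heappop(heap)

arrivals = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d")]
print(list(reorder(arrivals, 2)))
```

A larger buffer tolerates more disorder at the cost of more memory and higher output latency.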

Timestamps in JOIN Output. [Figure: R x S, output T.] Approach 1: user-specified, with defaults; compute output timestamp; must output in order of timestamps; better for explicit timestamps; needs more buffering; gives precise semantics and user understanding. Approach 2: best-effort, no guarantee; output timestamp is exit time; tuples arriving earlier are more likely to exit earlier; better for implicit timestamps; maximum flexibility to the system; difficult to impose precise semantics.

Approximate via Load-Shedding. Handles scan and processing rate mismatch. Input load-shedding: sample incoming tuples; use when scan rate is the bottleneck; positive result: online aggregation [Hellerstein, Haas, Wang]; negative result: join sampling [Chaudhuri, Motwani, Narasayya]. Output load-shedding: buffer input, infrequent output; use when query processing is the bottleneck; example: XJoin [Urhan, Franklin]; exploit synopses.
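Input load-shedding for a simple aggregate can be sketched as uniform random sampling with rescaling. This is a generic illustration under stated assumptions (a SUM query, independent per-tuple coin flips, a hypothetical `shed_and_sum` function) and not the method of any system named above.

```python
import random

def shed_and_sum(values, keep_prob, rng=None):
    """Estimate SUM(values) while processing only a random fraction of tuples.

    Each tuple is kept independently with probability `keep_prob`; the sum
    of kept tuples is scaled by 1/keep_prob to give an unbiased estimate.
    Returns (estimate, number_of_tuples_processed).
    """
    if not 0 < keep_prob <= 1:
        raise ValueError("keep_prob must be in (0, 1]")
    rng = rng or random.Random()
    kept = 0
    sample_sum = 0.0
    for v in values:
        if rng.random() < keep_prob:  # dropped tuples are never processed
            kept += 1
            sample_sum += v
    return sample_sum / keep_prob, kept

estimate, processed = shed_and_sum(range(100), 0.5, random.Random(0))
print(estimate, processed)
```

Rescaling works for sums and counts, but as the join-sampling reference on the slide indicates, sampling the inputs of a join does not in general give a uniform sample of the join result.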

Stream Query Language? SQL extension: sliding windows as a first-class construct (awkward in SQL, needs reference to timestamps; SQL-99 allows aggregations over sliding windows); sampling/approximation/load-shedding/QoS support? Stream relational algebra and rewrite rules: Aurora and STREAM; sequence/temporal databases.

Outline of Remaining Talk: Stream Models and DSMS Architectures; Query Processing; Runtime and Systems Issues; Algorithms; Conclusion.

DSMS Internals. Query plans: operators, synopses, queues. Memory management: dynamic allocation (queries, operators, queues, synopses); graceful adaptation to reallocation; impact on throughput and precision. Operator scheduling: variable-rate streams, varying operator/query requirements; response time and QoS. Load-shedding: interaction with queue/memory management.

Queue Memory and Scheduling [Babcock, Babu, Datar, Motwani]. Goal: given a query plan and selectivity estimates, schedule tuples through operator chains. Minimize total queue memory: best-slope scheduling is near-optimal, with a danger of starvation for some tuples. Minimize tuple response time: schedule each tuple completely through the operator chain, with a danger of exceeding the memory bound. Open: graceful combination and adaptivity.

Precision-Resource Tradeoff. Resources: memory, computation, I/O. Global optimization problem: the input is a set of queries with alternate plans and importance weights; precision is a function of resource allocation to queries/operators; the goal is to select plans and allocate resources to maximize precision. Memory allocation algorithm [Varma, Widom]: model with a single query plan and a simple precision model; rules for precision of composed operators; non-linear numerical optimization formulation. Open: Combinatorial algorithm? General case?

Outline of Remaining Talk: Stream Models and DSMS Architectures; Query Processing; Runtime and Systems Issues; Algorithms; Conclusion.

Synopses. Queries may access or aggregate past data, so they need a bounded-memory approximation of history. Synopsis? A succinct summary of old stream tuples; like indexes/materialized views, but the base data is unavailable. Examples: sliding windows, samples, sketches, histograms, wavelet representation.
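As one concrete instance of the "samples" synopsis, the sketch below uses reservoir sampling, a standard technique (not attributed to the talk) that maintains a uniform random sample of fixed size k over a stream of unknown length. The function name and parameters are illustrative.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from `stream`.

    Uses O(k) memory regardless of stream length.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 5, random.Random(42))
print(sample)
```

The sample stands in for the unavailable base data: an aggregate over the full history can be estimated from it, with accuracy that depends on k.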

Many other results … Histograms: V-Opt Histograms [Gilbert, Guha, Indyk, Kotidis, Muthukrishnan, Strauss], [Indyk]; End-Biased Histograms (Iceberg Queries) [Manku, Motwani], [Fang, Shiva, Garcia-Molina, Motwani, Ullman]; Equi-Width Histograms (Quantiles) [Manku, Rajagopalan, Lindsay], [Khanna, Greenwald]. Wavelets: seminal work [Vitter, Wang, Iyer] + many others! Data Mining: Stream Clustering [Guha, Mishra, Motwani, O’Callaghan], [O’Callaghan, Meyerson, Mishra, Guha, Motwani]; Decision Trees [Domingos, Hulten], [Domingos, Hulten, Spencer].

Conclusion: Future Work. Query Processing: stream algebra and query languages; approximations; blocking, constraints, punctuations. Runtime Management: scheduling, memory management, rate management; query optimization (adaptive, multi-query, ad-hoc); distributed processing. Synopses and algorithmic problems. Systems: UI, statistics, crash recovery and transaction management; system development and deployment.

Thank You!