Data Stream Systems Reynold Cheng 12 th July, 2002 Based on slides by B. Babcock et.al, “Models and Issues in Data Stream Systems”, PODS’02.

Slides:



Advertisements
Similar presentations
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
Advertisements

1 11. Streaming Data Management Chapter 18 Current Issues: Streaming Data and Cloud Computing The 3rd edition of the textbook.
ABSTRACT We consider the problem of computing information theoretic functions such as entropy on a data stream, using sublinear space. Our first result.
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O ’ Callaghan, Stanford University ACM Symp. on Principles.
Mining Data Streams.
A Data Stream Management System for Network Traffic Management Shivnath Babu Stanford University Lakshminarayanan Subramanian Univ. California, Berkeley.
Adaptive Monitoring of Bursty Data Streams Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani.
1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
1 Continuous Queries over Data Streams Vitaly Kroivets, Lyan Marina Presentation for The Seminar on Database and Internet The Hebrew University of Jerusalem,
Algorithms for Distributed Functional Monitoring Ke Yi HKUST Joint work with Graham Cormode (AT&T Labs) S. Muthukrishnan (Google Inc.)
Probabilistic Aggregation in Distributed Networks Ling Huang, Ben Zhao, Anthony Joseph and John Kubiatowicz {hling, ravenben, adj,
1 Continuous Queries over Data Streams Vitaly Kroivets, Lyan Marina Presentation for The Seminar on Database and Internet The Hebrew University of Jerusalem,
Finding Aggregates from Streaming Data in Single Pass Medha Atre Course Project for CS631 (Autumn 2002) under Prof. Krithi Ramamritham (IITB).
PODS Models and Issues in Data Stream Systems Rajeev Motwani Stanford University (with Brian Babcock, Shivnath Babu, Mayur Datar, and Jennifer Widom)
Processing Data-Stream Joins Using Skimmed Sketches Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.
SWiM Panel on Engine Implementation Jennifer Widom.
1 Stream-based Data Management IS698 Min Song 2 Characteristics of Data Streams  Data Streams Data streams — continuous, ordered, changing, fast, huge.
A survey on stream data mining
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
Building a Data Stream Management System Prof. Jennifer Widom Joint project with Prof. Rajeev Motwani and a team of graduate studentshttp://www-db.stanford.edu/stream.
Chain: Operator Scheduling for Memory Minimization in Data Stream Systems Authors: Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani (Dept.
1 PODS 2002 Motivation. 2 PODS 2002 Data Streams data sets Traditional DBMS – data stored in finite, persistent data sets data streams New Applications.
CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
The Stanford Data Streams Research Project Profs. Rajeev Motwani & Jennifer Widom And a cast of full- and part-time students: Arvind Arasu, Brian Babcock,
SWIM 1/9/20031 QoS in Data Stream Systems Rajeev Motwani Stanford University.
CS 591 A11 Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2 nd Dec, 2002.
One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert,Yannis Kotidis, S. Muthukrishanan, Martin J. Strauss Presented by James Chan.
Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman.
Data Warehouse Operational Issues Potential Research Directions.
Efficient Scheduling of Heterogeneous Continuous Queries Mohamed A. Sharaf Panos K. Chrysanthis Alexandros Labrinidis Kirk Pruhs Advanced Data Management.
Mirek Riedewald Department of Computer Science Cornell University Efficient Processing of Massive Data Streams for Mining and Monitoring.
NiagaraCQ : A Scalable Continuous Query System for Internet Databases (modified slides available on course webpage) Jianjun Chen et al Computer Sciences.
Context Tailoring the DBMS –To support particular applications Beyond alphanumerical data Beyond retrieve + process –To support particular hardware New.
An Integration Framework for Sensor Networks and Data Stream Management Systems.
Query Processing, Resource Management, and Approximation in a Data Stream Management System.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering Nithya N. Vijayakumar, Beth Plale DDE Lab, Indiana University {nvijayak,
PODS Models and Issues in Data Stream Systems Rajeev Motwani Stanford University (with Brian Babcock, Shivnath Babu, Mayur Datar, and Jennifer Widom)
Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002.
Data Stream Management Systems
Aum Sai Ram Security for Stream Data Modified from slides created by Sujan Pakala.
Memory Requirements of Data Streams Reynold Cheng 19 th July, 2002.
Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro.
Data Mining: Concepts and Techniques Mining data streams
Calculating frequency moments of Data Stream
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Review Lecture DB A/18-849B/95-811A/19-729A Internet-Scale Sensor Systems: Design and Policy Review Lecture Databases Phil Gibbons May 1, 2003.
CPT-S Advanced Databases 11 Yinghui Wu EME 49.
Lecture A/18-849B/95-811A/19-729A Internet-Scale Sensor Systems: Design and Policy Lecture 15 Sensor Databases & Data Stream Systems Phil Gibbons.
Safety Guarantee of Continuous Join Queries over Punctuated Data Streams Hua-Gang Li *, Songting Chen, Junichi Tatemura Divykant Agrawal, K. Selcuk Candan.
Streaming Semantic Data COMP6215 Semantic Web Technologies Dr Nicholas Gibbins –
Data Streams COMP3017 Advanced Databases Dr Nicholas Gibbins –
1 Advanced Database Systems: DBS CB, 2 nd Edition Advanced Topics of Interest: DB the Cloud, and SQL & Stream Processing.
Mining Data Streams (Part 1)
Advanced Database Systems: DBS CB, 2nd Edition
S. Sudarshan CS632 Course, Mar 2004 IIT Bombay
COMP3211 Advanced Databases
The Stream Model Sliding Windows Counting 1’s
A paper on Join Synopses for Approximate Query Answering
Bolin Ding Silu Huang* Surajit Chaudhuri Kaushik Chakrabarti Chi Wang
Arvind Arasu, Brian Babcock
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani
Models and Issues in Data Stream Systems
Approximate Frequency Counts over Data Streams
Feifei Li, Ching Chang, George Kollios, Azer Bestavros
Evaluation of Relational Operations: Other Techniques
Heavy Hitters in Streams and Sliding Windows
Presentation transcript:

Data Stream Systems Reynold Cheng 12 th July, 2002 Based on slides by B. Babcock et.al, “Models and Issues in Data Stream Systems”, PODS’02.

Outline of this Talk An Overview of Streams An Overview of Streams Data and Query Models Approximation Queries Other Research Issues

Data Streams data sets Traditional DBMS – data stored in finite, persistent data sets data streams New Applications – data input as continuous, ordered data streams A data stream as a growing relational table of potentially infinite size

Using Traditional Database User/ApplicationUser/Application LoaderLoader QueryResult ResultQuery ……

New Approach for Data Streams User/ApplicationUser/Application Register Query Stream Query ProcessorResults

New Approach for Data Streams User/ApplicationUser/Application Register Query Stream Query Processor Results Scratch Space (Memory and/or Disk) Data Stream Management System (DSMS)

Sample Applications Network management and traffic engineering (e.g., Sprint) Streams of measurements and packet traces Queries: detect anomalies, adjust routing Telecom call data (e.g., AT&T) Streams of call records Queries: fraud, customer call patterns, billing

Sample Applications Sensor Networks Large number of cheap, wireless sensors streams of real-world measurements Queries: monitoring, aggregate, alert Web tracking and personalization (e.g., Yahoo, Google) Clickstreams, user query streams, log records Queries: monitoring, analysis, personalization

Challenges Multiple, continuous, rapid, time-varying, ordered streams Main memory computations Queries may be continuous (not just one-time) Evaluated continuously as stream data arrives Answer updated over time Queries may be ad-hoc Beyond relational queries (scientific, data mining)

Meta-Questions Killer-apps Application stream rates exceed DBMS capacity? Can DSMS handle high rates anyway? Motivation Need for general-purpose DSMS? Not ad-hoc, application-specific systems? Non-Trivial DSMS = merely DBMS with enhanced support for triggers, temporal constructs, data rate mgmt?

DBMS versus DSMS Persistent relations Transient streams

DBMS versus DSMS Persistent relations One-time queries Transient streams Continuous queries

DBMS versus DSMS Persistent relations One-time queries Random access Transient streams Continuous queries Sequential access

DBMS versus DSMS Persistent relations One-time queries Random access “Unbounded” disk store Transient streams Continuous queries Sequential access Bounded main memory

DBMS versus DSMS Persistent relations One-time queries Random access “Unbounded” disk store Only current state matters Transient streams Continuous queries Sequential access Bounded main memory History/arrival-order is critical

DBMS versus DSMS Persistent relations One-time queries Random access “Unbounded” disk store Only current state matters Relatively low update rate Transient streams Continuous queries Sequential access Bounded main memory History/arrival-order is critical Possibly multi-GB arrival rate

DBMS versus DSMS Persistent relations One-time queries Random access “Unbounded” disk store Only current state matters Relatively low update rate No real-time services Transient streams Continuous queries Sequential access Bounded main memory History/arrival-order is critical Possibly multi-GB arrival rate Real-time requirements

DBMS versus DSMS Persistent relations One-time queries Random access “Unbounded” disk store Only current state matters Relatively low update rate No real-time services Assume precise data Transient streams Continuous queries Sequential access Bounded main memory History/arrival-order is critical Possibly multi-GB arrival rate Real-time requirements Data stale/imprecise

Outline of this Talk An Overview of Streams Data and Query Models Data and Query Models Approximation Queries Other Research Issues

Aurora/STREAM Overview Input streams Users issue continuous and ad-hoc queries Administrator monitors query execution and adjusts run-time parameters Applications register continuous queries Output streams  x  x Waiting Op Ready Op Running Op Synopses Query Plans Historical Storage

Data Model Append-only Call records Updates Stock tickers Deletes Transactional data Meta-Data Control signals, punctuations System Internals – probably need all above

Query Model User/ Application DSMS Query Processor

Example Queries DSMS Outgoing (call_ID, caller, time, event) Incoming (call_ID, callee, time, event) event = start or end Central Office Central Office John May

Query 1 ( self-join ) Find all outgoing calls longer than 2 minutes SELECT O1.call_ID, O1.caller FROM Outgoing O1, Outgoing O2 WHERE (O2.time – O1.time > 2 AND O1.call_ID = O2.call_ID AND O1.event = start AND O2.event = end) Result requires unbounded storage Can provide result as data stream Can output after 2 min, without seeing end

Query 2 ( join ) Pair up callers and callees SELECT O.caller, I.callee FROM Outgoing O, Incoming I WHERE O.call_ID = I.call_ID Can still provide result as data stream Requires unbounded temporary storage

Query 3 ( group-by aggregation ) Total connection time for each caller SELECT O1.caller, sum(O2.time – O1.time) FROM Outgoing O1, Outgoing O2 WHERE (O1.call_ID = O2.call_ID AND O1.event = start AND O2.event = end) GROUP BY O1.caller Cannot provide result in (append-only) stream Output updates? Provide current value on demand?

Outline of this Talk An Overview of Streams Data and Query Model Approximation Queries Approximation Queries Other Research Issues

Impact of Limited Memory Continuous streams grow unboundedly Queries may require unbounded memory [ABBMW 02] a priori memory bounds for query Conjunctive queries with arithmetic comparisons Impact of duplication elimination Open – general queries

Approximate Query Evaluation Why? Handling load – streams coming too fast Data stream is archived in a off-site data warehouse, expensive access of archived data Avoid unbounded storage and computation Ad hoc queries need approximate history Try to look at the data items only once and in a fixed order

Approximate Query Evaluation How? Sliding windows, synopsis, samples, load- shed Major Issues? Metric for set-valued queries Composition of approximate operators How is it understood/controlled by user? Integrate into query language Query planning and interaction with resource allocation Accuracy-efficiency-storage tradeoff and global metric

Synopses Queries may access or aggregate past data Need bounded-memory history-approximation Synopsis? Succinct summary of old stream tuples Like indexes/materialized-views, but base data is unavailable Examples Sliding Windows Samples Sketches Histograms Wavelet representation

Sketching Techniques Self-Join Size Estimation Stream of values from D = {1,2,…,n} Let f i = frequency of value i Consider S = Σ f i 2, or Gini’s index of homogeneity. Useful in parallel DB applications, error estimation in query result size estimation and access plan costs. Equivalent query: count (R |><| D R)

Evaluating S = Σ f i 2 To update S, keep a counter f i for each value i in the domain D   (n) space Has to be kept for each self-join Question – estimating S in sub-linear space? (O(log n))

Self-Join Size Estimation AMS Technique (randomized sketches) Given (f 1,f 2,…,f N ) Z i = random{-1,1} X = Σ f i Z i (X incrementally computable) Theorem Exp[X 2 ] = Σ f i 2 Cross-terms f i Z i f j Z j have 0 expectation Square-terms f i Z i f i Z i = f i 2 Space = log (N + Σ f i ) Independent samples X k reduce variance

Estimation Quality How can independent samples X k improve the quality of estimation? Keep s 1 x s 2 samples for X k s 1 reduces variance, s 2 boosts confidence Avg(X 1j 2 ) Avg(X 2j 2 ) Avg(X 3j 2 ) Avg(X 4j 2 ) Avg(X 5j 2 ) Median (sketch) s1s1 s2s2 Atomic Sketch

Sample Run of AMS 36257V = Z 1 = Z 2 = X 1 = 5, X 1 2 = 25 X 2 = 14, X 2 2 = 196 Est = 110.5Σv i 2 = V = Z 1 = Z 2 = X 1 = 6, X 1 2 = 36,X 2 = 12, X 2 2 = 144, Est = 90Σ v i 2 = 130,

Comments on AMS The self-join size can be computed on-line Sufficiently small variance (controlled by s 1 and s 2 ) Can this method be extended to answer other queries?

Complex Aggregate Queries A. Dobra et al. extend the idea of AMS to provide approximate answers to complex aggregate queries. AGGE SELECT AGG FROM R 1,R 2,…,R r where E AGG: COUNT/SUM/AVERAGE E: conjunction of (R i.A j = R k.A l ) It is proved that the error of these estimates is at most ε with probability 1-δ.

Basic Notions of Approximation For aggregate queries (e.g., SUM, COUNT), approximation quality can be measured by relative error: (Estimated value – Actual value) / Actual value Open question: for queries involving more than simple aggregation, how should we define approximation? Consider S |><| B T: (S: {A,B}, T: {B,C}) ABC Doctor 810.3Lawyer 310.2Teacher ABC 810.3Lawyer 310.2Teacher Actual ResultApproximate Result

Basic Notions of Approximation (2) Can we accept this kind of approximation? ABC Doctor 810.3Lawyer 310.2Teacher Actual ResultApproximate Result ABC Doctor 810.3Student 310.2Teacher

Basic Notions of Approximation (3) Can we provide useful (semantically correct) but stale results? ABC Doctor 810.3Lawyer 310.2Teacher Actual Result (at time t) Approximate Result (correct result at time t -  ) ABC Doctor 810.3Lawyer

Outline of this Talk An Overview of Streams Data and Query Model Approximation Queries Other Research Issues Other Research Issues

Data Mining High-Speed Stream Data Mining Association Rules Stream Clustering Decision Trees Single-pass algorithms for infering interesting patterns on-line (as the data stream arrives) Useful for mission-critical tasks like telecom fraud detection

Conclusion: Future Work Query Processing Stream Algebra and Query Languages Approximations Blocking Operators, Constraints, Punctuations Runtime Management Scheduling, Memory Management, Rate Management Query Optimization (Adaptive, Multi-Query, Ad-hoc) Distributed processing Synopses and Algorithmic Problems Systems UI, statistics, crash recovery and transaction management System development and deployment

References B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom. Models and Issues in Data Stream Systems, PODS ’02. (Paper and Talk Slides) A. Arasu, B. Babcock, S. Babu, J. McAlister, J. Widom. Characterizing Memory Requirements for Queries over Continuous Data Streams, PODS ’02. A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries over Data Streams, SIGMOD ’02. N. Alon, Y. Matias, M. Szegedy. The Space Complexity of Approximating the Frequency Moments, STOC ’96.

Thank You!