Memory Requirements of Data Streams. Reynold Cheng, 19th July, 2002.

Presentation transcript:

Data Stream Model [Figure: the User/Application registers a query with the Stream Query Processor and receives Results; the processor uses Scratch Space (memory and/or disk) inside the Data Stream Management System (DSMS).]

Impact of Limited Memory Continuous streams grow unboundedly Question: Can a continuous query be evaluated using a finite amount of memory?

Query Model SPJ (Select-Project-Join) Queries: $\pi_L(\sigma_P(S_1 \times S_2 \times \dots \times S_n))$. L: list of projected attributes. P: selection predicate. $S_1, S_2, \dots, S_n$: input streams. $\pi$: duplicate-eliminating projection; $\pi'$: duplicate-preserving projection.

Query Model: Predicate P. P: conjunction of atomic predicates. An atomic predicate is either $S_i.A \ \mathrm{Op}\ S_j.B$ (i = j or i ≠ j) or $S_i.A \ \mathrm{Op}\ k$ for a constant k. Op ∈ {<, =, >}

Query Model: Attributes All attributes have discrete, ordered domains All attributes are of type integer

Query Model: Monotonic and Exact Answers The queries are monotonic Any tuple that appears in the answer at any point in time continues to do so forever. The answers produced are exact i.e., no approximation.

Motivating Examples When does a SPJ query require only a bounded amount of memory? Assume 2 streams: S(A, B, C) and T(D, E)

Query 1: $\pi'_A(\sigma_{A>10}(S))$. Is a simple filter on S. Tuple-at-a-time processing. No extra memory for saving stream tuples.

Query 1 (again): $\pi_A(\sigma_{A>10}(S))$. Keep track of each distinct value of A > 10 to eliminate duplicates. Requires unbounded memory.
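The contrast between the two versions of Query 1 can be sketched in a few lines. This is only an illustration, assuming tuples of S(A, B, C) arrive as Python triples; the function names are hypothetical. The duplicate-preserving version keeps no state, while the duplicate-eliminating version needs a set that grows with the number of distinct qualifying values of A.

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple[int, int, int]  # one tuple of stream S(A, B, C) (assumed layout)


def query1_preserving(stream: Iterable[Row]) -> Iterator[int]:
    """Duplicate-preserving projection on A with A > 10: a stateless filter."""
    for a, _b, _c in stream:
        if a > 10:
            yield a  # nothing is stored between tuples


def query1_eliminating(stream: Iterable[Row]) -> Iterator[int]:
    """Duplicate-eliminating projection on A with A > 10."""
    seen = set()  # one entry per distinct A > 10: unbounded over an unbounded domain
    for a, _b, _c in stream:
        if a > 10 and a not in seen:
            seen.add(a)
            yield a


rows = [(12, 0, 0), (5, 0, 0), (12, 1, 1), (30, 0, 0)]
preserved = list(query1_preserving(rows))
distinct = list(query1_eliminating(rows))
```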

Query 2: $\pi_A(\sigma_{A=D}(S \times T))$. Save each tuple s in S, since s may join with any tuple in T that arrives in the future. Also need to save each tuple in T. Requires unbounded memory.

Query 3: $\pi'_A(\sigma_{A=D \wedge A>10 \wedge D<20}(S \times T))$. Can be evaluated with bounded memory! For each integer v in [11,19], keep: the current number of tuples in S with A = v (v.S) and the current number of tuples in T with D = v (v.T). For an incoming tuple from stream S with A = v, output v a total of v.T times. For an incoming tuple from stream T with D = v, output v a total of v.S times.
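The counting scheme for Query 3 can be sketched as follows. The class and method names are hypothetical, and tuples are assumed to arrive one at a time through `on_s` and `on_t`. At most nine counters per stream are ever created, regardless of stream length.

```python
from collections import Counter


class Query3:
    """Bounded-memory evaluation of the duplicate-preserving join on A = D, 10 < A, D < 20."""

    def __init__(self):
        self.s_count = Counter()  # v.S: number of S tuples seen with A = v
        self.t_count = Counter()  # v.T: number of T tuples seen with D = v

    def on_s(self, a, b, c):
        """Process one tuple of S(A, B, C); return the values to output."""
        if not 11 <= a <= 19:
            return []  # cannot satisfy A > 10 and A = D < 20
        self.s_count[a] += 1
        return [a] * self.t_count[a]  # one output per matching T tuple

    def on_t(self, d, e):
        """Process one tuple of T(D, E); return the values to output."""
        if not 11 <= d <= 19:
            return []
        self.t_count[d] += 1
        return [d] * self.s_count[d]  # one output per matching S tuple


q3 = Query3()
outputs = [q3.on_s(12, 0, 0), q3.on_t(12, 1), q3.on_s(12, 5, 5), q3.on_t(12, 2)]
```

After two S tuples and two T tuples with the value 12, four copies of 12 have been output in total, one per joining pair.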

Query 4: $\pi_A(\sigma_{B<D \wedge A>10 \wedge A<20}(S \times T))$. Can be evaluated with bounded memory! For each integer v in [11,19], keep the current min value of B among all tuples in S with A = v (v.B_min). Also keep the current max value of D among all tuples in T (D_max); T has no attribute A, so one maximum serves every v. For an incoming tuple from stream S with A = v, check: S.B < D_max. For an incoming tuple from stream T, check for each v: T.D > v.B_min. Output v the first time a check succeeds.
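A minimal sketch of the Query 4 synopsis, under two assumptions that go beyond the slide text: the maximum of D is tracked as one running value over all of T (T has no attribute A), and a small set of already-output values enforces duplicate elimination. All names are hypothetical.

```python
class Query4:
    """Bounded-memory evaluation of the duplicate-eliminating query on B < D, 10 < A < 20."""

    def __init__(self):
        self.b_min = {}       # v -> smallest B seen among S tuples with A = v
        self.d_max = None     # largest D seen in T so far (assumed global)
        self.emitted = set()  # values of A already output (at most 9 entries)

    def _emit_ready(self, candidates):
        """Output every candidate v whose minimum B is below the maximum D."""
        out = []
        for v in sorted(candidates):
            if v in self.emitted or self.d_max is None:
                continue
            if self.b_min[v] < self.d_max:
                self.emitted.add(v)
                out.append(v)
        return out

    def on_s(self, a, b, c):
        if not 11 <= a <= 19:
            return []
        self.b_min[a] = min(b, self.b_min.get(a, b))
        return self._emit_ready([a])

    def on_t(self, d, e):
        self.d_max = d if self.d_max is None else max(d, self.d_max)
        return self._emit_ready(self.b_min.keys())


q4 = Query4()
trace = [q4.on_s(12, 5, 0), q4.on_t(3, 0), q4.on_t(7, 0),
         q4.on_s(12, 1, 0), q4.on_s(15, 6, 0), q4.on_s(19, 7, 0)]
```

The state never exceeds nine minima, one maximum, and nine flags, whatever the stream lengths.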

Our Goals: Some queries involving joins can be answered using a finite amount of memory. Under what conditions can a query be answered using finite memory over all possible instances of streams? Identify a class of queries that can be evaluated with a bounded amount of memory. Consider both duplicate-preserving and duplicate-eliminating projections.

Bounded Memory Computability An SPJ query is bounded-memory computable if we can find: A constant M and, An algorithm that evaluates the query using fewer than M units of memory A unit of memory can store one attribute value or a count

Memory boundedness testing (Outline): Rewrite Q as a union of Locally Totally Ordered queries (LTO queries), based on the following theorem: Any query Q can be rewritten as $Q_1 \cup Q_2 \cup \dots \cup Q_m$, where each $Q_i$ is an LTO query and unions are duplicate-preserving.

Memory boundedness testing (Outline): Q is computable in bounded memory if and only if all its decomposed LTO queries are computable in bounded memory. Key: Develop a theorem for checking the memory boundedness of an LTO query.

Definitions: A query Q can be written as Q(P) when only P is important to the discussion. Element: a constant or an attribute of a stream. C(Q): set of constants appearing in Q. S(Q): set of streams appearing in Q. A(S): set of attributes in stream S. $E(S) = A(S) \cup C(Q)$: set of elements in Q potentially relevant to stream S.

Total Ordering: $P^+$: transitive closure of P. Example: A = 10 and B < A imply B < 10. A set of elements E is totally ordered by a set of predicates P if, for any elements $e_1$ and $e_2$ in E, exactly one of $e_1 < e_2$, $e_1 = e_2$, $e_1 > e_2$ is in $P^+$.

Order-Inducing Predicates A set of predicates P is order-inducing if a set of elements E is totally ordered by P. TO(E): Set of all order-inducing sets of predicates for elements E Example E = {A,B,5} Order-inducing sets: {A < B, 5 < A}, {A = B, B < 5}. These two sets belong to TO(E).
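To make TO(E) concrete, the sketch below enumerates every total ordering (with ties) of a small element set, one per order-inducing choice. It is an illustration only: attributes are assumed to be strings, constants Python integers, and each ordering is rendered as a chain such as `5 < A < B`; the function name is hypothetical.

```python
from itertools import product


def total_orderings(elements):
    """List every total ordering of a non-empty element list, ties included."""
    n = len(elements)
    orderings = []
    for ranks in product(range(n), repeat=n):
        levels = max(ranks) + 1
        if set(ranks) != set(range(levels)):
            continue  # rank vectors with a gap repeat an ordering already listed
        consts = [(e, r) for e, r in zip(elements, ranks) if isinstance(e, int)]
        if not all((c1 < c2) == (r1 < r2)
                   for c1, r1 in consts for c2, r2 in consts if c1 != c2):
            continue  # constants must keep their numeric order
        groups = [sorted(str(e) for e, r in zip(elements, ranks) if r == lvl)
                  for lvl in range(levels)]
        orderings.append(" < ".join(" = ".join(g) for g in groups))
    return orderings


orders = total_orderings(["A", "B", 5])
```

For E = {A, B, 5} this yields 13 orderings, including the two from the example: `5 < A < B` and `A = B < 5`.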

LTO Query: A query Q(P) is an LTO query if for every $S \in S(Q)$, $E(S)$ is totally ordered by P. Here S(Q) is the set of streams appearing in Q, and $E(S) = A(S) \cup C(Q)$ is the set of elements in Q potentially relevant to stream S.

Decomposition of a Query into LTO Queries: $Q = \pi'_L(\sigma_P(S_1 \times S_2 \times \dots \times S_n))$ can be decomposed into $Q_1 \cup Q_2 \cup \dots \cup Q_m$, where each $Q_i$ is an LTO query. Let $TO(E(S_i)) = \{T_i^1, T_i^2, \dots, T_i^{m_i}\}$; each $T_i^j$ is a local total ordering, that is, one possible ordering of the attributes of $S_i$ and the query constants.

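Decomposition Theorem: Q is an exhaustive union of $m_1 \times m_2 \times \dots \times m_n$ queries: $Q(P \cup T_1^1 \cup T_2^1 \cup \dots \cup T_n^1) \cup Q(P \cup T_1^2 \cup T_2^1 \cup \dots \cup T_n^1) \cup \dots \cup Q(P \cup T_1^1 \cup T_2^2 \cup \dots \cup T_n^1) \cup Q(P \cup T_1^2 \cup T_2^2 \cup \dots \cup T_n^1) \cup \dots \cup Q(P \cup T_1^{m_1} \cup T_2^{m_2} \cup \dots \cup T_n^{m_n})$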
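The exhaustive union is a Cartesian product over the local orderings of each stream. The sketch below shows that structure on a toy input; the predicate strings and the three-way orderings against a single constant 10 are assumptions chosen for illustration, not the decomposition of any query in the talk.

```python
from itertools import product


def decompose(p, local_orderings):
    """Return the predicate set of each LTO query: P plus one local ordering per stream."""
    return [frozenset(p).union(*choice) for choice in product(*local_orderings)]


# Hypothetical query with one join predicate and one constant.
P = {"S.A = T.D"}
TO_S = [{"S.A < 10"}, {"S.A = 10"}, {"S.A > 10"}]  # m_1 = 3 local orderings
TO_T = [{"T.D < 10"}, {"T.D = 10"}, {"T.D > 10"}]  # m_2 = 3 local orderings

lto_queries = decompose(P, [TO_S, TO_T])
```

Here the union has 3 x 3 = 9 members. Some of them, such as the one combining `S.A < 10` with `T.D > 10`, are unsatisfiable together with P and contribute no answers.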

Boundedness of Attributes: An attribute A is lower-bounded by a set of predicates P if there is an atomic predicate $A > k \in P^+$ for some constant k. An attribute A is upper-bounded by a set of predicates P if there is an atomic predicate $A < k \in P^+$ for some constant k. An attribute is bounded if it is both upper-bounded and lower-bounded. An attribute is unbounded if it is not bounded.
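One way to approximate the boundedness test is graph reachability over the predicates. The sketch below is a simplification rather than a full computation of the transitive closure: it assumes predicates are given as `(left, op, right)` triples with string attributes and integer constants, and it treats equality with a bounded element as bounding on both sides (reasonable for integer domains, but an assumption here).

```python
from collections import defaultdict


def _reachable(start, graph):
    """Nodes reachable from start by following at least one edge."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bounded_attributes(predicates):
    """Return the attributes that have both a constant below and a constant above."""
    le = defaultdict(set)  # le[x] holds y when the predicates imply x <= y
    ge = defaultdict(set)  # the reverse edges
    elements = set()
    for left, op, right in predicates:
        elements.update((left, right))
        if op in ("<", "="):
            le[left].add(right)
            ge[right].add(left)
        if op in (">", "="):
            le[right].add(left)
            ge[left].add(right)
    bounded = set()
    for e in elements:
        if isinstance(e, int):
            continue  # constants are not attributes
        upper = any(isinstance(x, int) for x in _reachable(e, le))
        lower = any(isinstance(x, int) for x in _reachable(e, ge))
        if upper and lower:
            bounded.add(e)
    return bounded


q3_bounded = bounded_attributes([("S.A", "=", "T.D"), ("S.A", ">", 10), ("T.D", "<", 20)])
q4_bounded = bounded_attributes([("S.B", "<", "T.D"), ("S.A", ">", 10), ("S.A", "<", 20)])
```

For the Query 3 predicates both A and D come out bounded; for the Query 4 predicates only A does, while B and D stay unbounded.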

MaxRef and MinRef: $\mathrm{MaxRef}(S_i)$ is the set of all unbounded attributes A of $S_i$ that participate in a join of the form $S_j.B < S_i.A$, $i \neq j$. $\mathrm{MinRef}(S_i)$ is the set of all unbounded attributes A of $S_i$ that participate in a join of the form $S_i.A < S_j.B$, $i \neq j$.

Boundedness Testing of LTO Query (Duplicate-Preserving): Let $Q = \pi'_L(\sigma_P(S_1 \times S_2 \times \dots \times S_n))$ be an LTO query where n > 1. Q is bounded-memory computable iff: (C1) Every attribute in L is bounded. (C2) For every equality join predicate $S_i.A = S_j.B$ where $i \neq j$, $S_i.A$ and $S_j.B$ are both bounded. (C3) $|\mathrm{MaxRef}(S_i)| = |\mathrm{MinRef}(S_i)| = 0$ for i = 1, …, n.

Proof Outline We create synopses of n data streams For each stream S i, partition the tuples that satisfy the total order condition on S i into distinct buckets based on the values of the bounded attributes Tuples with the same values of all bounded attributes are placed in the same bucket Tuples that differ on at least one bounded attribute are placed in different buckets

Proof Outline (2): For each bucket, store the values of the bounded attributes and the total number of tuples falling into the bucket. The total size of these synopses is bounded by a constant. It can be proved that if the 3 conditions are satisfied, these synopses suffice to evaluate the query.
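The bucket synopsis described in the proof outline can be sketched as a counter keyed by the bounded attribute values. The sketch assumes the caller has already discarded tuples that violate the total-order condition and passes the positions of the bounded attributes; the class name is hypothetical.

```python
from collections import Counter


class StreamSynopsis:
    """One bucket per combination of bounded attribute values, with a tuple count."""

    def __init__(self, bounded_positions):
        self.bounded_positions = list(bounded_positions)
        self.buckets = Counter()  # bucket key -> number of tuples in the bucket

    def add(self, row):
        """Place a tuple in its bucket; unbounded attribute values are dropped."""
        key = tuple(row[i] for i in self.bounded_positions)
        self.buckets[key] += 1


# Stream S(A, B, C) where only A (position 0) is bounded.
synopsis = StreamSynopsis([0])
for row in [(12, 5, 9), (12, 7, 1), (15, 0, 0)]:
    synopsis.add(row)
```

Because each bounded attribute ranges over a fixed interval of integers, the number of distinct keys, and hence the synopsis size, is bounded by a constant.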

Proof (If): Attributes not in project list / join condition can be ignored, because all local selection conditions must be implied by the total order C1 and C2 guarantee that all attributes in equijoin and project list are bounded Each synopsis maintains full information about the values of all bounded attributes for every tuple Equijoin and projections can be handled properly

Proof (If): C3 asserts that for every $S_i.A < S_j.B$, $i \neq j$, one of the following is true: 1. Both attributes are bounded: the synopses have full information for A and B. 2. The total order implies $S_i.A < c < S_j.B$ where c is some constant in Q: no need to store actual attribute values, since all tuples from $S_i$ that satisfy the total order on $S_i$ join with all tuples from $S_j$ that satisfy the total order on $S_j$. No relevant information is lost by consolidating tuples into buckets.

Proof (Only if): If one of the conditions C1, C2 and C3 does not hold, then Q cannot always be evaluated in bounded memory. For each condition, if the condition is violated, one can construct instances and presentations of the input streams that require more than M memory to correctly evaluate Q.

Boundedness Testing of LTO Query (Duplicate-Eliminating): Let $Q = \pi_L(\sigma_P(S_1 \times S_2 \times \dots \times S_n))$ be an LTO query where n > 1. Q is bounded-memory computable iff: (C1) Every attribute in L is bounded. (C2) For every equality join predicate $S_i.A = S_j.B$ where $i \neq j$, $S_i.A$ and $S_j.B$ are both bounded. (C3) $|\mathrm{MaxRef}(S_i)|_{eq} + |\mathrm{MinRef}(S_i)|_{eq} \leq 1$ for i = 1, …, n. $|E|_{eq}$ denotes the number of P-induced equivalence classes in the element set E.
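The two boundedness tests (duplicate-preserving and duplicate-eliminating) differ only in the third condition, which the sketch below makes explicit. It is a simplified check, not the full test: it inspects the predicates as given instead of their transitive closure, takes the set of bounded attributes as an input, and counts attributes rather than P-induced equivalence classes, so it can reject a query that the theorem accepts. Attribute names are assumed to have the form `Stream.Attr`.

```python
from collections import defaultdict


def stream_of(attr):
    """Stream name of an attribute written as 'Stream.Attr' (assumed format)."""
    return attr.split(".")[0]


def bounded_memory_computable(project, predicates, bounded, eliminate_duplicates):
    if any(attr not in bounded for attr in project):
        return False  # condition 1: a projected attribute is unbounded
    max_ref, min_ref = defaultdict(set), defaultdict(set)
    for left, op, right in predicates:
        if isinstance(left, int) or isinstance(right, int):
            continue  # comparison with a constant, not a join
        if stream_of(left) == stream_of(right):
            continue  # both attributes come from the same stream
        if op == "=":
            if left not in bounded or right not in bounded:
                return False  # condition 2: equality join on an unbounded attribute
            continue
        small, large = (left, right) if op == "<" else (right, left)
        if large not in bounded:
            max_ref[stream_of(large)].add(large)
        if small not in bounded:
            min_ref[stream_of(small)].add(small)
    limit = 1 if eliminate_duplicates else 0  # condition 3
    streams = set(max_ref) | set(min_ref)
    return all(len(max_ref[s]) + len(min_ref[s]) <= limit for s in streams)


q4 = [("S.B", "<", "T.D"), ("S.A", ">", 10), ("S.A", "<", 20)]
q4_eliminating = bounded_memory_computable(["S.A"], q4, {"S.A"}, True)
q4_preserving = bounded_memory_computable(["S.A"], q4, {"S.A"}, False)
```

With duplicates eliminated, Query 4 passes because each stream contributes one unbounded join attribute; with duplicates preserved it fails, since counting the joining pairs would require the actual B and D values.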

Comments: Q is computable in bounded memory if and only if all its decomposed LTO queries are computable in bounded memory. The checking algorithm needs $O(|A(Q)|^4)$ time. If a query is not bounded-memory computable, approximation algorithms are necessary. For a bounded-memory computable query, evaluating each LTO query is costly: the query must be evaluated on every tuple arrival.

Comments: Only the space requirement, not query speed, is considered. Can we extend the memory-boundedness checking to general (including aggregate) queries? It is not always possible to provide exact answers to queries, so approximate query algorithms have to be developed. For these approximate queries, we need to consider memory boundedness, execution speed and approximation quality.

Outline of this Talk: Memory Requirements of Queries; Approximation Queries; Other Research Issues

Approximate Query Evaluation: Why? Handling load: streams arrive too fast. The data stream is archived in an off-site data warehouse, and access to archived data is expensive. Avoid unbounded storage and computation. Ad hoc queries need approximate history. Try to look at the data items only once and in a fixed order.

Approximate Query Evaluation How? Sliding windows, synopsis, samples Major Issues? Metric for set-valued queries Composition of approximate operators How is it understood/controlled by user? Integrate into query language Query planning and interaction with resource allocation Accuracy-efficiency-storage tradeoff and global metric

Synopses Queries may access or aggregate past data Need bounded-memory history-approximation Synopsis? Succinct summary of old stream tuples Like indexes/materialized-views, but base data is unavailable Examples Sliding Windows Samples Sketches Histograms Wavelet representation

Sketching Techniques: Self-Join Size Estimation. Stream of values from D = {1, 2, …, n}. Let $f_i$ = frequency of value i. Consider $S = \sum_i f_i^2$, or Gini's index of homogeneity. Useful in parallel DB applications, error estimation in query result size estimation and access plan costs. Equivalent query: COUNT($R \bowtie_D R$)

Evaluating $S = \sum_i f_i^2$: To update S, keep a counter $f_i$ for each value i in the domain D: $\Theta(n)$ space. This has to be kept for each self-join. Question: can S be estimated in sub-linear space (O(log n))?

Self-Join Size Estimation: AMS Technique (randomized sketches). Given $(f_1, f_2, \dots, f_n)$. $Z_i$ = random value in {−1, 1}. $X = \sum_i f_i Z_i$ (X is incrementally computable). Theorem: $\mathrm{Exp}[X^2] = \sum_i f_i^2$. Cross-terms $f_i Z_i f_j Z_j$ with $i \neq j$ have 0 expectation; square-terms $f_i Z_i f_i Z_i = f_i^2$. Space = $\log(n + \sum_i f_i)$. Independent samples $X_k$ reduce variance.
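The identity behind the atomic sketch can be checked exactly on a tiny domain by averaging X squared over every possible sign assignment. The sketch below does that; it stores the signs in a dictionary purely for illustration, whereas a space-efficient implementation would generate them from a small hash function. Function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def atomic_sketch(stream, signs):
    """X = sum of f_i * Z_i, updated one stream item at a time."""
    x = 0
    for value in stream:
        x += signs[value]  # each arrival of value i adds Z_i
    return x


def expected_x_squared(stream, domain):
    """Exact expectation of X^2 over all equally likely sign vectors."""
    total, count = 0, 0
    for choice in product((-1, 1), repeat=len(domain)):
        signs = dict(zip(domain, choice))
        total += atomic_sketch(stream, signs) ** 2
        count += 1
    return Fraction(total, count)


stream = [1, 1, 2, 3, 3, 3]           # frequencies f_1 = 2, f_2 = 1, f_3 = 3
self_join_size = 2**2 + 1**2 + 3**2   # sum of squared frequencies
expectation = expected_x_squared(stream, [1, 2, 3])
```

The cross-terms cancel across the eight sign vectors, so the expectation equals the self-join size.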

Estimation Quality: How can independent samples $X_k$ improve the quality of estimation? Keep $s_1 \times s_2$ samples of $X_k$: $s_1$ reduces variance, $s_2$ boosts confidence. [Figure: the atomic sketches are arranged in $s_2$ groups of $s_1$; each group i yields $\mathrm{Avg}_j(X_{ij}^2)$, and the sketch estimate is the median of the group averages.]
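The median-of-averages arrangement can be sketched as below. This is an illustration under simplifying assumptions: signs are drawn fully independently with Python's `random` module and stored per value, and each atomic sketch is recomputed from a list rather than maintained incrementally. The function name and the `seed` parameter are hypothetical.

```python
import random
from statistics import mean, median


def ams_estimate(stream, domain, s1, s2, seed=0):
    """Median over s2 groups of the average of s1 squared atomic sketches."""
    rng = random.Random(seed)
    group_means = []
    for _ in range(s2):
        squares = []
        for _ in range(s1):
            signs = {v: rng.choice((-1, 1)) for v in domain}  # fresh Z for each sketch
            x = sum(signs[v] for v in stream)                 # X = sum of f_i * Z_i
            squares.append(x * x)
        group_means.append(mean(squares))  # averaging reduces variance
    return median(group_means)             # the median boosts confidence


single_value = ams_estimate([7] * 5, [7], s1=3, s2=3)
mixed = ams_estimate([1, 1, 2, 3, 3, 3], [1, 2, 3], s1=8, s2=5)
```

When the stream holds a single distinct value, every atomic sketch is exact, so the estimate equals the squared frequency; for mixed streams the estimate varies with the random signs.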

Sample Run of AMS [Table residue: the entries of V, $Z_1$ and $Z_2$ are lost apart from the fragment "36257".] Run 1: $X_1 = 5$, $X_1^2 = 25$; $X_2 = 14$, $X_2^2 = 196$; Est = 110.5. Run 2: $X_1 = 6$, $X_1^2 = 36$; $X_2 = 12$, $X_2^2 = 144$; Est = 90; $\sum_i v_i^2 = 130$.

Comments on AMS The self-join size can be computed on-line Sufficiently small variance (controlled by s 1 and s 2 ) Can this method be extended to answer other queries?

Complex Aggregate Queries: A. Dobra et al. extend the idea of AMS to provide approximate answers to complex aggregate queries. $\mathrm{AGG}_E$: SELECT AGG FROM $R_1, R_2, \dots, R_r$ WHERE E. AGG: COUNT/SUM/AVERAGE. E: conjunction of predicates of the form $R_i.A_j = R_k.A_l$. It is proved that the error of these estimates is at most ε with probability at least 1 − δ.

Basic Notions of Approximation: For aggregate queries (e.g., SUM, COUNT), approximation quality can be measured by relative error: (Estimated value − Actual value) / Actual value. Open question: for queries involving more than simple aggregation, how should we define approximation? Consider $S \bowtie_B T$ with S(A, B) and T(B, C). [Tables over columns A, B, C: the actual result has three rows whose C values are Doctor, Lawyer and Teacher; the approximate result keeps only the Lawyer and Teacher rows.]

Basic Notions of Approximation (2): Can we accept this kind of approximation? [Tables over columns A, B, C: the actual result has rows with C values Doctor, Lawyer and Teacher; the approximate result has rows with C values Doctor, Student and Teacher.]

Basic Notions of Approximation (3): Can we provide useful (semantically correct) but stale results? [Tables over columns A, B, C: the actual result at time t has rows with C values Doctor, Lawyer and Teacher; the approximate result, which is the correct result at the earlier time t − δ, has only the Doctor and Lawyer rows.]

Outline of this Talk: An Overview of Streams; Data and Query Model; Approximation Queries; Other Research Issues

Data Mining: High-Speed Stream Data Mining. Association Rules, Stream Clustering, Decision Trees. Single-pass algorithms for inferring interesting patterns on-line (as the data stream arrives). Useful for mission-critical tasks like telecom fraud detection.

Conclusion: Future Work Query Processing Stream Algebra and Query Languages Approximations Blocking Operators, Constraints, Punctuations Runtime Management Scheduling, Memory Management, Rate Management Query Optimization (Adaptive, Multi-Query, Ad-hoc) Distributed processing Synopses and Algorithmic Problems Systems UI, statistics, crash recovery and transaction management System development and deployment

References B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom. Models and Issues in Data Stream Systems, PODS ’02. (Paper and Talk Slides) A. Arasu, B. Babcock, S. Babu, J. McAlister, J. Widom. Characterizing Memory Requirements for Queries over Continuous Data Streams, PODS ’02. A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries over Data Streams, SIGMOD ’02. N. Alon, Y. Matias, M. Szegedy. The Space Complexity of Approximating the Frequency Moments, STOC ’96.

Thank You!