Approximating Data Stream using histogram for Query Evaluation Huiping Cao Jan. 03, 2003.

2 Outline
Introduction
Background
Histogramming a data stream
  V-Optimal Histogram
  Optimal Histogram Construction
  Agglomerative Histogram Algorithm
  Fixed-Window Histogram Algorithm
Experiments
Conclusion

3 Introduction
A data stream is a sequence of data elements that arrive continuously, in a fixed order, and at a variable rate.
Many applications generate streaming data, such as network monitoring records and sensor readings.
Algorithms that handle data streams have new requirements: a single pass over the data, fast processing (in some cases), limited memory, and online operation on unbounded input.
Data stream operations: approximate querying, similarity searching, data mining.
Such operations rely on a good approximation of the data stream; a histogram is a popular way to build one.

4 Background
Histogram
A histogram approximates the data distribution of a data set or data stream by partitioning the underlying data into subsets called buckets.
A good histogram construction algorithm approximates the data as accurately and as quickly as possible.
The accuracy of the approximation depends on:
(1) the partitioning technique used to group values into buckets, i.e., how to partition the data into subsets while inducing as little error as possible;
(2) the approximation technique employed within each bucket, i.e., how to summarize the values in one bucket, e.g., by their mean (average).

5 Background (cont.)
Data stream model
Agglomerative (landmark) model: takes into account every element seen so far. See Figure 1(a).
Fixed-window (sliding-window) model: considers only the last n elements seen, or the elements observed within t time units before the current time. See Figure 1(b).
[Fig. 1(a): sketch over the timeline from t0 = 0 to t current]
[Fig. 1(b): sketch over the last n elements of the timeline from t0 = 0 to t current]
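
The difference between the two models can be illustrated with a small Python sketch. It only shows which elements are in scope under each model; the function name `landmark_and_window` is a hypothetical helper, and storing the whole landmark prefix is done here purely for illustration (a real stream algorithm would keep a summary instead).

```python
from collections import deque
from typing import Iterable, List, Tuple


def landmark_and_window(stream: Iterable[float], n: int) -> Tuple[List[float], List[float]]:
    """Return the elements in scope under the landmark model and under a
    sliding window of the last n elements, after consuming the stream."""
    landmark: List[float] = []          # landmark model: everything seen so far
    window: deque = deque(maxlen=n)     # sliding window: only the last n elements
    for x in stream:
        landmark.append(x)
        window.append(x)                # the oldest element is dropped automatically
    return landmark, list(window)


seen_all, last_n = landmark_and_window([5, 1, 4, 2, 8, 7], n=3)
print(seen_all)  # all six elements
print(last_n)    # the three most recent elements
```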

6 Background (cont.)
Related work
Approximating specific queries: distinct values ([Gib01]), frequency counts ([MM02]), quantiles ([GK01]), general aggregation ([DGG+02]), joins ([KNV03]).
Approximation methods: sampling, histograms, wavelets, and other common synopses (Section 6 in [BBD+02]).
Focus of this talk: query-independent histogram construction methods, concentrating specifically on the partitioning into buckets.

7 Histogramming a data stream
Optimal histogram ([GK02, IP95])
Optimal histogram construction ([GK02, JKM+98])
Agglomerative algorithm ([GK02, GKS01])
Fixed-window algorithm ([GK02])

8 V-Optimal Histogram
Optimal histogram problem: given a sequence of length n, a number of buckets B, and an error function $E_n()$, find a histogram $H_B$ that minimizes $E_n(H_B)$. The problem is independent of queries.
[IP95] showed that V-Optimal is the well-known optimal histogram.
Basic idea: attribute values are grouped into buckets based on the proximity of their frequencies, not of their actual values.
$$E_n(H_B) = \sum_{i=1}^{B} \sum_{v \in b_i} \left( f_v - \frac{C_{b_i}}{V_{b_i}} \right)^2$$
B: maximum number of buckets
$b_i$: the i-th bucket
$f_v$: the frequency of value v in its bucket
$C_{b_i}$, $V_{b_i}$: the sum and the number of the frequencies in bucket $b_i$
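
As a minimal sketch of the error function above, the following Python snippet evaluates the V-Optimal error of one given bucketing. The function name `v_optimal_error` and the representation of buckets as inclusive 0-based index ranges are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def v_optimal_error(freqs: Sequence[float], buckets: List[Tuple[int, int]]) -> float:
    """Sum over buckets of the squared deviation of each frequency from the
    bucket average C_b / V_b. Buckets are inclusive 0-based index ranges."""
    total = 0.0
    for lo, hi in buckets:
        vals = freqs[lo:hi + 1]
        avg = sum(vals) / len(vals)                 # C_b / V_b
        total += sum((f - avg) ** 2 for f in vals)
    return total


freqs = [1, 1, 5, 5]
print(v_optimal_error(freqs, [(0, 1), (2, 3)]))  # similar frequencies grouped together
print(v_optimal_error(freqs, [(0, 2), (3, 3)]))  # a worse grouping has a larger error
```

Grouping the two low frequencies and the two high frequencies separately yields zero error, which is the behavior the "proximity of frequencies" idea aims for.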

9 Optimal Histogram Construction
Problem: constructing the optimal histogram amounts to partitioning the index set 1...n into B intervals (buckets) so as to minimize E().
Main idea [JKM+98]: the algorithm computes OPT[n,B] and obtains the bucket boundaries at the same time.
OPT[i,k] denotes the minimum error of representing [1,...,i] by a histogram with k buckets, where $i \le n$ and $k \le B$.
$$OPT[n,B] = \min_{i<n} \{ OPT[i,B-1] + SSE[i+1,n] \}$$
$E() = \sum_{k \in [1...B]} SSE_k$, and OPT[n,B] is its minimum over all partitions into B buckets.
SSE is the common error metric, the sum of squared errors:
$$SSE([a,b]) = \sum_{i \in [a,b]} (v_i - avg(v))^2 = \sum_{i \in [a,b]} v_i^2 - \frac{1}{b-a+1} \Big( \sum_{i \in [a,b]} v_i \Big)^2$$
$$= SQSUM[1,b] - SQSUM[1,a-1] - \frac{1}{b-a+1} \big( SUM[1,b] - SUM[1,a-1] \big)^2$$
where $SUM[1,i] = \sum_{j \in [1,...,i]} v_j$ and $SQSUM[1,i] = \sum_{j \in [1,...,i]} v_j^2$.
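
The prefix-sum form of SSE can be checked with a short Python sketch. The helper names `prefix_sums` and `sse` are hypothetical; indices are 1-based and inclusive to match the slide, with index 0 holding the empty prefix.

```python
from typing import List, Sequence, Tuple


def prefix_sums(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return SUM and SQSUM, where SUM[i] = v_1 + ... + v_i and SUM[0] = 0."""
    SUM, SQSUM = [0.0], [0.0]
    for v in values:
        SUM.append(SUM[-1] + v)
        SQSUM.append(SQSUM[-1] + v * v)
    return SUM, SQSUM


def sse(SUM: List[float], SQSUM: List[float], a: int, b: int) -> float:
    """SSE of v_a..v_b in O(1), using the prefix-sum identity."""
    s = SUM[b] - SUM[a - 1]
    q = SQSUM[b] - SQSUM[a - 1]
    return q - s * s / (b - a + 1)


SUM, SQSUM = prefix_sums([0, 0, 0, 1, 1, 1, 1, 1])
print(sse(SUM, SQSUM, 1, 4))  # bucket holding 0, 0, 0, 1
print(sse(SUM, SQSUM, 4, 8))  # bucket holding only ones
```

Because each SSE lookup is constant time after one linear pass, the dynamic program only pays for the number of (i, j, k) combinations it examines.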

10 Optimal Histogram Construction (cont.)
Algorithm OptimalHistogram()
Compute SUM[1,i] and SQSUM[1,i] for all $1 \le i \le n$
Initialize OPT[j,1] = SSE[1,j] for all $1 \le j \le n$
For j = 1 to n do
  For k = 2 to B do
    For i = 1 to j-1 do
      OPT[j,k] = min(OPT[j,k], OPT[i,k-1] + SSE[i+1,j])   (OPT[j,k] starts at infinity)
Explanation
For each newly seen element $v_j$, the algorithm computes OPT[j,B] as the minimum cost over all possible positions of the last bucket boundary.
E.g., $OPT[n,B] = \min_{i<n} \{ OPT[i,B-1] + SSE[i+1,n] \}$ is the minimum of
OPT[1,B-1] + SSE[2,n]
OPT[2,B-1] + SSE[3,n]
...
OPT[n-1,B-1] + SSE[n,n]
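
A runnable Python sketch of this dynamic program follows. It is an illustration under stated assumptions rather than the authors' implementation: the function name `optimal_histogram` is hypothetical, and the sketch keeps the full OPT table plus a table of cut positions (O(nB) memory) so that the bucket boundaries can be read back.

```python
from typing import List, Sequence, Tuple


def optimal_histogram(values: Sequence[float], B: int) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (minimum SSE, buckets) for at most B buckets.
    Buckets are 1-based inclusive index ranges."""
    n = len(values)
    SUM = [0.0] * (n + 1)
    SQSUM = [0.0] * (n + 1)
    for i, v in enumerate(values, 1):
        SUM[i] = SUM[i - 1] + v
        SQSUM[i] = SQSUM[i - 1] + v * v

    def sse(a: int, b: int) -> float:
        s = SUM[b] - SUM[a - 1]
        return SQSUM[b] - SQSUM[a - 1] - s * s / (b - a + 1)

    inf = float("inf")
    opt = [[inf] * (B + 1) for _ in range(n + 1)]
    cut = [[0] * (B + 1) for _ in range(n + 1)]   # end of the previous bucket
    for j in range(1, n + 1):
        opt[j][1] = sse(1, j)
        for k in range(2, min(j, B) + 1):
            for i in range(k - 1, j):             # last bucket is [i+1, j]
                cost = opt[i][k - 1] + sse(i + 1, j)
                if cost < opt[j][k]:
                    opt[j][k], cut[j][k] = cost, i

    # Walk the cut table backwards to recover the bucket boundaries.
    k, j, buckets = min(B, n), n, []
    best = opt[n][k]
    while j >= 1:
        i = cut[j][k] if k > 1 else 0
        buckets.append((i + 1, j))
        j, k = i, k - 1
    buckets.reverse()
    return best, buckets


print(optimal_histogram([0, 0, 0, 1, 1, 1, 1, 1], B=2))
```

The three nested loops over j, k and i are what give the quadratic dependence on n.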

11 Optimal Histogram Construction (cont.)
Example: data sequence $\{x_1, x_2, x_3, ..., x_{10}\}$, n = 10, B = 3
j=1: best partition [1,1]
j=2: best partition [1,2]
...
j=5, k=B-1: best partition [1,2][3,5]
j=6, k=B-1: best partition [1,3][4,6]
...
j=9, k=B: OPT[9,B] = OPT[5,B-1] + SSE[6,9], so the best partition is [1,2][3,5][6,9]
j=10, k=B: OPT[10,B] = OPT[6,B-1] + SSE[7,10], so the best partition is [1,3][4,6][7,10]
Time complexity: $O(n^2 B)$. Space complexity: O(n).

12 Agglomerative algorithm
$\epsilon$-approximation algorithm: given a sequence of length n, a number of buckets B, an error function $E_n()$, and a precision $\epsilon > 0$, find $H_B$ with $E_n(H_B) \le (1+\epsilon) \min_H E_n(H)$.
If the data sequence is a data stream, then n is the fixed memory space used to store a portion (n data points) of the stream.
The agglomerative algorithm aims to construct an $\epsilon$-approximate histogram.
Can the optimal construction algorithm be turned into an $\epsilon$-approximation algorithm in the data stream setting? The cost of searching for the minimum approximation error is high [GKS01].

13 Agglomerative algorithm (cont.)
Improvement over the OptimalHistogram algorithm: it reduces the cost of computing OPT[j,k].
OptimalHistogram: $OPT[j,k] = \min_i (OPT[i,k-1] + SSE[i+1,j])$ over all i < j.
Agglomerative algorithm: $OPT[j,k] = \min_i (OPT[b_i,k-1] + SSE[b_i+1,j])$, where the $b_i$ are the end points of the intervals kept for approximating the j data points using k-1 buckets.
E.g., if $\{v_i\} = \{v_1, v_2, v_3, ..., v_9\}$ and $\{b_i\} = \{v_3, v_5, v_9\}$, then the OptimalHistogram algorithm needs to compare 9 values, but the agglomerative algorithm needs to compare only 3.
Reason: $OPT[b,k-1] + SSE[b+1,j] \le (1+\delta)(OPT[i,k-1] + SSE[i+1,j])$ for $a \le i \le b$, where each interval [a,b] is chosen so that $OPT[b,k-1] \le (1+\delta)\,OPT[a,k-1]$, because
SSE[i+1,j] is a non-negative, non-increasing function of i when j is fixed, and
OPT[i,k-1] is a non-negative, non-decreasing function of i.

14 Agglomerative algorithm (cont.)
Main idea: for each $1 \le k \le B$, the algorithm maintains intervals $(a_1^k, b_1^k), ..., (a_l^k, b_l^k)$ such that
$a_1^k = 1$, $b_l^k = n$, and $b_j^k + 1 = a_{j+1}^k$ for $j < l$;
$OPT[b_j^k, k] \le (1+\delta)\,OPT[a_j^k, k]$, with $\delta$ chosen so that $(1+\delta)^B \le 1+\epsilon$.
Store $OPT[a_j^k, k]$ and $OPT[b_j^k, k]$ for all j and k, and also SUM[1,r] and SQSUM[1,r] for every $r \in \bigcup_{k,j} (\{a_j^k\} \cup \{b_j^k\})$.
This gives B-1 queues storing the intervals and the related SUM and SQSUM values.

15 Agglomerative algorithm (cont.)
On seeing the (n+1)-st value $v_{n+1}$, the algorithm:
Computes OPT[n+1,k] for all $1 \le k \le B$:
  for k = 1, OPT[n+1,1] = SSE[1,n+1];
  for $k \ge 2$, $OPT[n+1,k] = \min_i (OPT[b_i^{k-1}, k-1] + SSE[b_i^{k-1}+1, n+1])$.
Updates the intervals $(a_1^k, b_1^k), ..., (a_l^k, b_l^k)$: only the last interval $(a_l^k, b_l^k)$ needs to change, either by setting $b_l^k = n+1$ or by creating a new interval l+1 with $a_{l+1}^k = b_{l+1}^k = n+1$.
Time complexity: $O((nB^2/\epsilon) \log n)$. Space complexity: $O((B^2/\epsilon) \log n)$.
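
The per-element update can be sketched in Python as follows. This is a simplified illustration, not the algorithm as published in [GKS01] or [GK02]: the class name `AgglomerativeHistogram`, the list layout of an interval, and the choice to treat OPT[j,k] as the error with at most k buckets (so the k-1 value is always a candidate) are all assumptions, and no error guarantee is claimed for arbitrary `delta`.

```python
from typing import List


class AgglomerativeHistogram:
    """Keeps, for each k < B, a queue of intervals whose approximate OPT
    values differ by at most a (1 + delta) factor between start and end."""

    def __init__(self, B: int, delta: float) -> None:
        self.B, self.delta = B, delta
        self.n, self.total, self.sq_total = 0, 0.0, 0.0
        # queues[k] holds intervals for k buckets (index 0 is unused).
        # Each interval is [OPT at a, b, OPT at b, SUM[1,b], SQSUM[1,b]].
        self.queues: List[List[list]] = [[] for _ in range(B)]

    def add(self, v: float) -> float:
        """Consume one value and return the approximate OPT[n, B]."""
        self.n += 1
        self.total += v
        self.sq_total += v * v
        n, s, q = self.n, self.total, self.sq_total
        opt = [0.0] * (self.B + 1)
        opt[1] = q - s * s / n                      # SSE[1, n]
        for k in range(2, self.B + 1):
            best = opt[k - 1]                       # using fewer buckets is allowed
            for _, b, opt_b, sum_b, sq_b in self.queues[k - 1]:
                ds, dq = s - sum_b, q - sq_b        # sums over (b, n]
                best = min(best, opt_b + dq - ds * ds / (n - b))
            opt[k] = best
        # Queues are updated only after every OPT[n, k] has been computed.
        for k in range(1, self.B):
            queue = self.queues[k]
            if queue and opt[k] <= (1 + self.delta) * queue[-1][0]:
                queue[-1][1:] = [n, opt[k], s, q]   # extend the last interval to n
            else:
                queue.append([opt[k], n, opt[k], s, q])  # new interval a = b = n
        return opt[self.B]


hist = AgglomerativeHistogram(B=2, delta=0.1)
for value in [1, 1, 1, 9, 9, 9]:
    error = hist.add(value)
print(error, [interval[1] for interval in hist.queues[1]])
```

Only interval end points are examined when a new value arrives, so the work per element depends on the number of intervals rather than on n.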

16 Fixed Window Algorithm
The agglomerative algorithm is not very useful for constructing a fixed-window histogram.
Reason: the histogram computed on [1,...,n] provides no information about the histogram on [2,...,n], which is what remains once the oldest element leaves the window.
Main idea: maintain $\sum_{l \le j \le i} x_j$ and $\sum_{l \le j \le i} x_j^2$ using two arrays SUM' and SQSUM' on [0,n], which are circular buffers. Here $\{x_l, ..., x_i\}$ are the observations of interest.
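
One way to realize the two circular buffers is sketched below in Python. The class name `WindowSums` and its methods are hypothetical. As a simplification, the buffers store running totals over the whole stream for the last n+1 positions and take differences; the totals are never rebased, so this sketch is meant for illustration rather than for very long streams.

```python
from typing import List


class WindowSums:
    """Circular buffers of running sums and squared sums for the last n values."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.count = 0                              # values seen so far
        self.sum_buf: List[float] = [0.0] * (n + 1)  # slot t % (n+1) holds x_1 + ... + x_t
        self.sq_buf: List[float] = [0.0] * (n + 1)

    def add(self, x: float) -> None:
        prev = self.count % (self.n + 1)
        self.count += 1
        cur = self.count % (self.n + 1)             # overwrites the oldest slot
        self.sum_buf[cur] = self.sum_buf[prev] + x
        self.sq_buf[cur] = self.sq_buf[prev] + x * x

    def sse(self, a: int, b: int) -> float:
        """SSE of window positions a..b (1-based within the current window)."""
        size = min(self.count, self.n)
        if not 1 <= a <= b <= size:
            raise ValueError("range outside the current window")
        start = self.count - size                   # stream index just before the window
        lo = (start + a - 1) % (self.n + 1)
        hi = (start + b) % (self.n + 1)
        s = self.sum_buf[hi] - self.sum_buf[lo]
        q = self.sq_buf[hi] - self.sq_buf[lo]
        return q - s * s / (b - a + 1)


sums = WindowSums(n=4)
for x in [9, 9, 0, 0, 0, 1]:
    sums.add(x)
print(sums.sse(1, 4))  # SSE over the current window 0, 0, 0, 1
```

Keeping n+1 slots rather than n leaves room for the prefix just before the window, which every difference needs.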

17 Fixed Window Algorithm (cont.)
FixedWindowHistogram()
Compute SUM' and SQSUM'
Assume 1 to be the first point in the circular buffer
For k = 1 to B-1 {
  Initialize the k-th queue to empty
  CreateList[1,n,k]
    // time complexity: $O((1/\delta)^2 \log^3 n)$, with $\delta = \Theta(\epsilon/B)$
    // creates intervals of [1...n] using k buckets
    // each interval [a,b] satisfies $OPT[b,k] \le (1+\delta)\,OPT[a,k]$, with b maximized
}
Let $b_{l_1}, b_{l_2}, ...$ be the end points in queue B-1
$OPT[n,B] = \min_i \{ OPT[b_{l_i}, B-1] + SSE[b_{l_i}+1, n] \}$
Time complexity: $O((B^3/\epsilon^2) \log^3 n)$. Space complexity: O(n).

18 Fixed Window Algorithm (cont.)
Example: data sequence {0,0,0,1,1,1,1,1}, $\delta = 1$, B = 2
SUM' = SQSUM' = {0,0,0,1,2,3,4,5}
CreateList[1,8,1] (a=1, b=8, k=1), running steps:
a=1: OPT[1,1] = 0. Find the largest index c such that $OPT[c,1] \le 0 = (1+\delta)\,OPT[1,1]$. This gives c=3, so Queue1 = {3}.
Call CreateList[4,8,1], i.e., CreateList(c+1,b,k): OPT[4,1] = 0.75. Find the largest index c such that $OPT[c,1] \le 1.5 = (1+\delta)\,OPT[4,1]$. This gives c=6, so Queue1 = {3,6}.
Call CreateList[7,8,1], which gives Queue1 = {3,6,8}.

19 Fixed Window Algorithm (cont.)
OPT[n=8, B=2] is the minimum of the following 3 values:
OPT[3,1] + SSE[4,8] = 0 + 0 = 0
OPT[6,1] + SSE[7,8] = 1.5 + 0 = 1.5
OPT[8,1] = 15/8
The minimum is 0, so the best partition is {(1,3),(4,8)}.
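
The worked example can be reproduced with the Python sketch below, which rebuilds the queues from scratch for one window. The function name `fixed_window_histogram` is hypothetical, a linear scan is used to find the largest admissible end point for clarity (so the sketch does not achieve the stated time bound), and the single-bucket error SSE[1,c] is always kept as a candidate, which is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple


def fixed_window_histogram(window: Sequence[float], B: int, delta: float):
    """Return (approximate OPT[n, B], queues), where queues[k] lists
    (end point, OPT[end point, k]) pairs for k = 1..B-1."""
    n = len(window)
    SUM, SQSUM = [0.0], [0.0]
    for v in window:
        SUM.append(SUM[-1] + v)
        SQSUM.append(SQSUM[-1] + v * v)

    def sse(a: int, b: int) -> float:
        s = SUM[b] - SUM[a - 1]
        return SQSUM[b] - SQSUM[a - 1] - s * s / (b - a + 1)

    queues: List[List[Tuple[int, float]]] = [[] for _ in range(B)]

    def opt(c: int, k: int) -> float:
        if k == 1:
            return sse(1, c)
        # Only end points stored for k-1 buckets are tried as the last boundary.
        candidates = [o + sse(b + 1, c) for b, o in queues[k - 1] if b < c]
        return min(candidates + [sse(1, c)])

    for k in range(1, B):
        a = 1
        while a <= n:                               # iterative form of CreateList[a, n, k]
            limit = (1 + delta) * opt(a, k)
            c = max(i for i in range(a, n + 1) if opt(i, k) <= limit)
            queues[k].append((c, opt(c, k)))
            a = c + 1
    return opt(n, B), queues


error, queues = fixed_window_histogram([0, 0, 0, 1, 1, 1, 1, 1], B=2, delta=1.0)
print(error, [b for b, _ in queues[1]])
```

With the data from the example, the end points in the first queue come out as 3, 6 and 8, and the three candidates compared for the final answer are the same three values listed on the slide.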

20 Experimental Evaluation
Tests: construction performance, and accuracy of the fixed-window algorithm when evaluating range-sum queries.
Measures: construction time for performance; average results for accuracy.
Data: real data sets extracted from AT&T data warehouses.
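
For context, a range-sum query can be answered from a histogram roughly as in the Python sketch below. This illustrates the general idea only and is not the evaluation code used in the experiments; the function name `range_sum_estimate` and the bucket layout (start, end, mean) with 1-based inclusive positions are assumptions.

```python
from typing import List, Tuple


def range_sum_estimate(buckets: List[Tuple[int, int, float]], a: int, b: int) -> float:
    """Estimate the sum of positions a..b, assuming every position in a
    bucket holds that bucket's mean value."""
    total = 0.0
    for lo, hi, mean in buckets:
        overlap = min(hi, b) - max(lo, a) + 1       # positions shared with [a, b]
        if overlap > 0:
            total += overlap * mean
    return total


# Histogram of the sequence 0, 0, 0, 1, 1, 1, 1, 1 with buckets (1,3) and (4,8).
histogram = [(1, 3, 0.0), (4, 8, 1.0)]
print(range_sum_estimate(histogram, 2, 5))
```

Comparing such estimates with the exact sums over many query ranges is one way to quantify the accuracy of a histogram.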

21 Accuracy test for various $\epsilon$ and B
[Figure: accuracy plot; legend: Exact, Histogram, Wavelets]
Conclusion:
For the fixed-window histogram, accuracy improves with $\epsilon$ and B.
The fixed-window histogram outperforms the wavelet-based histogram.

22 Construction time for various $\epsilon$ and B
Conclusion:
The wavelet-based method is much slower than the fixed-window histogram (so its results are not given here).
Construction time grows as B increases or $\epsilon$ decreases.

23 Conclusion
Background knowledge on data streams.
Three algorithms for constructing optimal ($\epsilon$-approximate) histograms in different scenarios.
Other related work: new operators over a data stream, operations over multiple data streams, sketch techniques, query optimization, etc.

24 Reference 1
[GK02] Sudipto Guha and Nick Koudas. Approximating a data stream for querying and estimation: algorithms and performance evaluation. In ICDE'02.
[GKS01] Sudipto Guha, Nick Koudas and Kyuseok Shim. Data-Streams and Histograms. In STOC'01.
[IP95] Yannis E. Ioannidis and Viswanath Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD'95.
[JKM+98] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik and Torsten Suel. Optimal Histograms with Quality Guarantees. In VLDB'98.

25 Reference 2
[BBD+02] B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and Issues in Data Stream Systems. In PODS'02.
[DGG+02] A. Dobra, M. Garofalakis, J. Gehrke and R. Rastogi. Processing complex aggregate queries over data streams. In SIGMOD'02.
[Gib01] P. B. Gibbons. Distinct Sampling for highly-accurate answers to distinct values queries and event reports. In VLDB'01.
[GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD'01.
[MM02] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In VLDB'02.
[KNV03] Jaewoo Kang, J. F. Naughton and Stratis D. Viglas. Evaluating window joins over unbounded streams. In ICDE'03.

26 Thank you!