Approximating Data Stream using histogram for Query Evaluation Huiping Cao Jan. 03, 2003
2 Outline Introduction Background Histogramming a data stream V-Optimal Histogram Optimal Histogram Construction Agglomerative Histogram Algorithm Fixed-Window Histogram Algorithm Experiments Conclusion
3 Introduction
A data stream is an ordered sequence of data elements that arrive continuously and at a variable rate. Many applications generate streaming data, such as network monitoring records and data generated by sensors.
Algorithms that handle data streams need new features: single pass, fast processing (possibly), limited memory, and online operation over unbounded input.
Data stream operations: approximate querying, similarity searching, data mining. Such operations rely on a good approximation of the data stream, and a histogram is a popular way to approximate one.
4 Background
Histogram
A histogram approximates the data distribution of a data set or data stream by partitioning the underlying data into subsets called buckets. A good histogram construction algorithm approximates the data as accurately and as quickly as possible.
Accuracy of the approximation depends on:
(1) the partitioning technique used to group values into buckets, i.e., how to partition the data into subsets while introducing as little error as possible;
(2) the approximation technique employed within each bucket, i.e., how to summarize the values in one bucket, e.g., by their mean (average).
5 Background(cont.)
Data stream model
Agglomerative (landmark) model: takes into account every element seen so far. Figure 1(a).
Fixed-window (sliding-window) model: considers only the last n elements seen, or the elements observed within t time units before the current time. Figure 1(b).
[Fig. 1(a): sketch of a window running from t0=0 to the current time t. Fig. 1(b): sketch of a window of length n ending at the current time t.]
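The difference between the two models can be illustrated with a running mean. This is a minimal sketch, not taken from the talk: the class names `LandmarkMean` and `SlidingWindowMean` are hypothetical, and the mean stands in for any summary kept over the stream.

```python
from collections import deque


class LandmarkMean:
    """Landmark model: every element seen since t0 contributes."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, x):
        self.total += x
        self.count += 1

    def mean(self):
        return self.total / self.count if self.count else 0.0


class SlidingWindowMean:
    """Sliding-window model: only the last n elements contribute."""

    def __init__(self, n):
        # A deque with maxlen drops the oldest element once it is full.
        self.window = deque(maxlen=n)

    def add(self, x):
        self.window.append(x)

    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0
```

After feeding 1, 2, 3, 4, 5 the landmark mean covers all five values, while a window of n=2 covers only 4 and 5.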
6 Background(cont.)
Related work
Approximating specific queries: distinct values ([Gib01]), frequency counts ([MM02]), quantiles ([GK01]), general aggregation ([DGG+02]), joins ([KNV03]).
Approximation methods: sampling, histograms, wavelets, and more general synopses (Section 6 in [BBD+02]).
Focus of this talk: query-independent histogram construction methods, concentrating specifically on how the data is partitioned into buckets.
7 Histogramming a data stream Optimal histogram([GK02, IP95]) Optimal histogram construction([GK02, JKM+98]) Agglomerative algorithm([GK02,GKS01]) Fixed window algorithm([GK02])
8 V-Optimal Histogram
Optimal Histogram Problem: given a sequence of length n, a number of buckets B, and an error function $E_n()$, find $H_B$ that minimizes $E_n(H_B)$. The problem is independent of the queries.
[IP95] showed that the V-Optimal histogram is the well-known optimal histogram. Basic idea: attribute values are grouped into buckets based on the proximity of their frequencies, not of their actual values.
$$E_n() = \sum_{i=1}^{B} \sum_{v \in b_i} \left( f_v - \frac{C_{b_i}}{V_{b_i}} \right)^2$$
B: maximum number of buckets
$b_i$: the i-th bucket
$f_v$: the frequency of value v in its bucket
$C_{b_i}$, $V_{b_i}$: the sum and the number of the frequencies in bucket $b_i$
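The error function can be evaluated directly for a given bucketing. The sketch below assumes the frequencies are given as a list in value order and that buckets are contiguous index ranges; the function name `v_optimal_error` and the 0-based inclusive `(start, end)` convention are illustrative choices, not part of the talk.

```python
def v_optimal_error(freqs, buckets):
    """Sum over buckets of squared deviations from the bucket's mean frequency.

    freqs   -- list of frequencies f_v, in value order
    buckets -- list of (start, end) index pairs, 0-based and inclusive
    """
    total = 0.0
    for start, end in buckets:
        bucket = freqs[start:end + 1]
        avg = sum(bucket) / len(bucket)  # C_bi / V_bi
        total += sum((f - avg) ** 2 for f in bucket)
    return total
```

For frequencies [1, 1, 1, 5, 5], grouping the three 1s together and the two 5s together gives zero error, while moving the boundary one position to the left mixes a 1 into the second bucket and the error becomes positive.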
9 Optimal Histogram Construction
Problem: constructing an optimal histogram amounts to partitioning the index set 1...n into B intervals (buckets) so as to minimize E().
Main idea [JKM+98]: the algorithm computes OPT[n,B] and obtains the bucket boundaries at the same time. OPT[i,k] denotes the minimum error of representing [1,...,i] by a histogram with k buckets, where $i \le n$ and $k \le B$.
$$OPT[n,B] = \min_{i<n} \{ OPT[i,B-1] + SSE[i+1,n] \}$$
The error of a histogram is $E() = \sum_{k=1}^{B} SSE_k$, where $SSE_k$ is the error of the k-th bucket, and OPT[n,B] is the minimum of E() over all partitions into B buckets.
SSE is the common error metric, the Sum Squared Error:
$$SSE([a,b]) = \sum_{i=a}^{b} (v_i - avg(v))^2 = \sum_{i=a}^{b} v_i^2 - \frac{1}{b-a+1} \Big( \sum_{i=a}^{b} v_i \Big)^2 = SQSUM[1,b] - SQSUM[1,a-1] - \frac{1}{b-a+1} \big( SUM[1,b] - SUM[1,a-1] \big)^2$$
where $SUM[1,i] = \sum_{j=1}^{i} v_j$ and $SQSUM[1,i] = \sum_{j=1}^{i} v_j^2$.
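The prefix-sum form of SSE means any interval error costs O(1) once SUM and SQSUM are built. A minimal sketch follows; the helper names `prefix_sums` and `sse` are hypothetical, indices are 1-based as on the slide, and the second term is squared, as in the expansion of the sum of squared deviations.

```python
def prefix_sums(values):
    """Return (SUM, SQSUM) with SUM[i] = v_1 + ... + v_i and SUM[0] = 0."""
    s, q = [0.0], [0.0]
    for v in values:
        s.append(s[-1] + v)
        q.append(q[-1] + v * v)
    return s, q


def sse(s, q, a, b):
    """SSE of v_a..v_b (1-based, inclusive); an empty range has error 0."""
    if a > b:
        return 0.0
    seg = s[b] - s[a - 1]
    return (q[b] - q[a - 1]) - seg * seg / (b - a + 1)
```

For the sequence {0,0,0,1,1,1,1,1}, `sse(s, q, 1, 4)` is 0.75 and `sse(s, q, 1, 8)` is 15/8.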
10 Optimal Histogram Construction(Cont.)
Algorithm OptimalHistogram()
Compute SUM[1,i], SQSUM[1,i] for all 1 ≤ i ≤ n
Initialize OPT[j,1] = SSE[1,j] for all 1 ≤ j ≤ n
For j=1 to n do
  For k=2 to B do
    For i=1 to j-1 do
      OPT[j,k] = min(OPT[j,k], OPT[i,k-1]+SSE[i+1,j])
Explanation: for each newly seen element v_j, the algorithm computes OPT[j,k] as the minimum cost over all possible positions i of the last bucket boundary.
E.g., OPT[n,B] = min_{i<n} {OPT[i,B-1]+SSE[i+1,n]} is the minimum of
OPT[1,B-1]+SSE[2,n], OPT[2,B-1]+SSE[3,n], ..., OPT[n-1,B-1]+SSE[n,n].
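The dynamic program can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `optimal_histogram` is hypothetical, OPT[j,1] is initialized to SSE[1,j], and a back-pointer table (an addition for this sketch) is kept so the bucket boundaries can be recovered.

```python
def optimal_histogram(values, B):
    """Return (minimum SSE, buckets) for a non-empty sequence.

    Buckets are 1-based inclusive (start, end) pairs.
    """
    n = len(values)
    s, q = [0.0], [0.0]
    for v in values:
        s.append(s[-1] + v)
        q.append(q[-1] + v * v)

    def sse(a, b):
        seg = s[b] - s[a - 1]
        return (q[b] - q[a - 1]) - seg * seg / (b - a + 1)

    inf = float("inf")
    opt = [[inf] * (B + 1) for _ in range(n + 1)]
    back = [[0] * (B + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        opt[j][1] = sse(1, j)
        for k in range(2, min(j, B) + 1):
            # i is the end of the first k-1 buckets; the last bucket is [i+1, j].
            for i in range(k - 1, j):
                cand = opt[i][k - 1] + sse(i + 1, j)
                if cand < opt[j][k]:
                    opt[j][k] = cand
                    back[j][k] = i

    # Walk the back-pointers from (n, k) down to the first bucket.
    k = min(B, n)
    best = opt[n][k]
    buckets, j = [], n
    while k >= 1:
        i = back[j][k] if k > 1 else 0
        buckets.append((i + 1, j))
        j, k = i, k - 1
    buckets.reverse()
    return best, buckets
```

The three nested loops give the O(n^2 B) running time. The full table used here takes O(nB) space, which is the price of recovering the boundaries in this sketch.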
11 Optimal Histogram Construction(Cont.)
Example: data sequence {x_1, x_2, x_3, ..., x_10}, n=10, B=3
j=1: best partition [1,1]
j=2: best partition [1,2]
...
j=5, k=B-1: best partition [1,2][3,5]
j=6, k=B-1: best partition [1,3][4,6]
...
j=9, k=B: OPT[9,B] = OPT[5,B-1]+SSE[6,9], so the best partition is [1,2][3,5][6,9]
j=10, k=B: OPT[10,B] = OPT[6,B-1]+SSE[7,10], so the best partition is [1,3][4,6][7,10]
Time complexity: $O(n^2 B)$. Space complexity: $O(n)$.
12 Agglomerative algorithm
ε-approximation algorithm: given a sequence of length n, a number of buckets B, an error function $E_n()$ and a precision ε > 0, find $H_B$ with $E_n(H_B)$ no more than $(1+\epsilon) \min_H E_n(H)$.
If the data sequence is a data stream, then n is the fixed memory space used to store a portion (n data points) of the stream.
The agglomerative algorithm aims to construct an ε-approximate histogram.
Can the optimal construction algorithm be improved into an ε-approximation algorithm in the data stream setting? The cost of searching for the minimum approximation error is high [GKS01].
13 Agglomerative algorithm(cont.)
Improvement over the OptimalHistogram algorithm: it reduces the cost of computing OPT[j,k].
OptimalHistogram: OPT[j,k] = min_i (OPT[i,k-1]+SSE[i+1,j]), over every i < j.
Agg. algorithm: OPT[j,k] = min_{b_i} (OPT[b_i,k-1]+SSE[b_i+1,j]), where the b_i are the end points of the intervals maintained for approximating the data points with k-1 buckets.
E.g.: if {v_i}={v_1,v_2,v_3,...,v_9} and {b_i}={v_3,v_5,v_9}, then the OptimalHistogram algorithm needs to compare 9 values, but the Agg. algorithm needs to compare only 3.
Reason: each interval [a,b] is chosen so that OPT[b,k-1] ≤ (1+δ)OPT[a,k-1] for a small slack δ. Then, for every i with a ≤ i ≤ b,
$$OPT[b,k-1] + SSE[b+1,j] \le (1+\delta)\,(OPT[i,k-1] + SSE[i+1,j])$$
because SSE[i+1,j] is a non-negative, non-increasing function of i when j is fixed, and OPT[i,k-1] is a non-negative, non-decreasing function of i.
14 Agglomerative algorithm(cont.)
Main idea: for each 1 ≤ k ≤ B, the algorithm maintains intervals $(a_1^k, b_1^k), ..., (a_l^k, b_l^k)$ such that
$a_1^k = 1$, $b_l^k = n$, and $b_j^k + 1 = a_{j+1}^k$ for j < l;
$OPT[b_j^k, k] \le (1+\delta)\, OPT[a_j^k, k]$, where $(1+\delta)^B \le 1+\epsilon$.
Store $OPT[a_j^k, k]$ and $OPT[b_j^k, k]$ for all j and k; also store SUM[1,r] and SQSUM[1,r] for every $r \in \bigcup_{k,j} (\{a_j^k\} \cup \{b_j^k\})$.
B-1 queues store the intervals and the related SUMs and SQSUMs.
15 Agglomerative algorithm(cont.)
On seeing the (n+1)-st value $v_{n+1}$, the algorithm:
Computes OPT[n+1,k] for all 1 ≤ k ≤ B:
  for k=1, OPT[n+1,1] = SSE[1,n+1];
  for k ≥ 2, $OPT[n+1,k] = \min_i ( OPT[b_i^{k-1}, k-1] + SSE[b_i^{k-1}+1, n+1] )$.
Updates the intervals $(a_1^k, b_1^k), ..., (a_l^k, b_l^k)$: only the last interval $(a_l^k, b_l^k)$ needs to be updated, either by setting $b_l^k = n+1$ or by creating a new interval l+1 with $a_{l+1}^k = b_{l+1}^k = n+1$.
Time complexity: $O((nB^2/\epsilon) \log n)$. Space complexity: $O((B^2/\epsilon) \log n)$.
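The per-element update can be sketched as a small streaming class. This is a simplified illustration under stated assumptions, not the algorithm of [GK02, GKS01] verbatim: the class name `AgglomerativeHistogram` is hypothetical, the slack is assumed to be δ = ε/(2B), the last interval is extended while the new OPT value stays within a factor (1+δ) of the OPT value at the interval's start, and only the approximate error is tracked (no bucket boundaries).

```python
class AgglomerativeHistogram:
    """Streaming estimate of the best B-bucket SSE (landmark model)."""

    def __init__(self, B, eps):
        self.B = B
        self.delta = eps / (2.0 * B)  # assumed per-level slack
        self.n = 0
        self.total = 0.0     # SUM[1,n]
        self.sq_total = 0.0  # SQSUM[1,n]
        # queues[k] holds the intervals kept for k buckets, k = 1..B-1.
        self.queues = {k: [] for k in range(1, B)}
        self.error = 0.0

    def _sse_after(self, b, sum_b, sq_b):
        """SSE[b+1, n] computed from the prefix sums stored at end point b."""
        length = self.n - b
        if length <= 0:
            return 0.0
        seg = self.total - sum_b
        return max(0.0, (self.sq_total - sq_b) - seg * seg / length)

    def add(self, v):
        self.n += 1
        self.total += v
        self.sq_total += v * v
        # Candidates come only from stored end points, not from every i.
        opt = {1: self._sse_after(0, 0.0, 0.0)}
        for k in range(2, self.B + 1):
            opt[k] = min(
                (iv["opt_b"] + self._sse_after(iv["b"], iv["sum_b"], iv["sq_b"])
                 for iv in self.queues[k - 1]),
                default=0.0,  # first element: fewer points than buckets
            )
        # Extend the last interval of each queue, or open a new one.
        for k in range(1, self.B):
            queue = self.queues[k]
            if queue and opt[k] <= (1 + self.delta) * queue[-1]["opt_a"]:
                queue[-1].update(b=self.n, opt_b=opt[k],
                                 sum_b=self.total, sq_b=self.sq_total)
            else:
                queue.append(dict(a=self.n, b=self.n, opt_a=opt[k], opt_b=opt[k],
                                  sum_b=self.total, sq_b=self.sq_total))
        self.error = opt[self.B]
        return self.error
```

Each call to `add` touches only the stored end points, so the work per element depends on the queue lengths rather than on n.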
16 Fixed Window Algorithm
The agglomerative algorithm is not very useful for constructing a fixed-window histogram. Reason: the computation of a histogram on [1,...,n] does not provide any information about a histogram on [2,...,n].
Main idea: maintain $\sum_{l \le j \le i} x_j$ and $\sum_{l \le j \le i} x_j^2$ using two arrays SUM' and SQSUM' on [0,n], which are circular buffers. Here {x_l,...,x_i} are the observations of interest.
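One way to realize the two circular buffers is sketched below. It is an assumption-laden illustration rather than the layout used in [GK02]: the class name `WindowSums` is hypothetical, each slot stores a cumulative sum since the start of the stream, and positions passed to the queries are 1-based within the current window. Cumulative values grow without bound, so a long-running version would need to rebase them periodically.

```python
class WindowSums:
    """Cumulative SUM' and SQSUM' over the last n elements, in circular buffers."""

    def __init__(self, n):
        self.n = n
        self.sum_buf = [0.0] * (n + 1)  # n+1 slots: one extra for position 0
        self.sq_buf = [0.0] * (n + 1)
        self.count = 0                  # elements seen so far

    def add(self, x):
        prev = self.count % (self.n + 1)
        self.count += 1
        cur = self.count % (self.n + 1)  # overwrites the oldest cumulative value
        self.sum_buf[cur] = self.sum_buf[prev] + x
        self.sq_buf[cur] = self.sq_buf[prev] + x * x

    def _slot(self, pos):
        """Buffer slot for window position pos (0 = just before the window)."""
        start = max(0, self.count - self.n)
        return (start + pos) % (self.n + 1)

    def range_sum(self, a, b):
        """Sum of window positions a..b, 1-based and inclusive."""
        return self.sum_buf[self._slot(b)] - self.sum_buf[self._slot(a - 1)]

    def sse(self, a, b):
        """Sum squared error of window positions a..b."""
        seg = self.range_sum(a, b)
        sq = self.sq_buf[self._slot(b)] - self.sq_buf[self._slot(a - 1)]
        return sq - seg * seg / (b - a + 1)
```

With n=3 and the stream 1, 2, 3, 4, 5, the window holds 3, 4, 5, so `range_sum(1, 3)` is 12 and `sse(1, 3)` is 2.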
17 Fixed Window Algorithm(Cont.)
FixedWindowHistogram()
Compute SUM' and SQSUM'
Assume 1 to be the first point in the circular buffer
For k=1 to B-1 {
  Initialize the k-th queue to empty
  CreateList[1,n,k]  // time complexity: O((1/δ)^2 log^3 n), δ = ε/(2B)
    // creates intervals of [1...n] using k buckets
    // each interval [a,b] satisfies OPT[b,k] ≤ (1+δ)OPT[a,k], with b maximized
}
{let b_{l1}, b_{l2}, ... be the end points in queue B-1}
OPT[n,B] = min_i {OPT[b_{li},B-1]+SSE[b_{li}+1,n]}
Time complexity: $O((B^3/\epsilon^2) \log^3 n)$. Space complexity: $O(n)$.
18 Fixed Window Algorithm(Cont.)
Example: data sequence {0,0,0,1,1,1,1,1}, δ=1, B=2
SUM' = SQSUM' = {0,0,0,1,2,3,4,5}
CreateList[1,8,1] (a=1, b=8, k=1), running steps:
a=1, OPT[1,1]=0; find the largest index c such that OPT[c,1] ≤ 0 = (1+δ)OPT[1,1]: c=3, Queue1={3}
Call CreateList[4,8,1] // CreateList(c+1,b,k)
OPT[4,1]=0.75; find the largest index c such that OPT[c,1] ≤ 1.5 = (1+δ)OPT[4,1]: c=6, Queue1={3,6}
Call CreateList[7,8,1]; get Queue1={3,6,8}
19 Fixed Window Algorithm(Cont.)
OPT[n=8,B=2] is the minimum of the following 3 values:
OPT[3,1]+SSE[4,8] = 0+0 = 0
OPT[6,1]+SSE[7,8] = 1.5+0 = 1.5
OPT[8,1] = 15/8
The minimum is 0, so the best partition is {(1,3),(4,8)}.
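The worked example can be reproduced with a short sketch of the window computation. Several simplifications are assumed here and are not from the talk: the function name `fixed_window_histogram` is hypothetical, the window is passed as a plain list instead of circular buffers, the largest index c is found by a linear scan that stops at the first violation (rather than the faster search behind the stated complexity), and the slack is passed in directly as `delta`.

```python
def fixed_window_histogram(window, B, delta):
    """Return (approximate OPT[n,B], queues) for one window of values.

    queues[k] is a list of (end point, approximate OPT[end point, k]) pairs.
    """
    n = len(window)
    s, q = [0.0], [0.0]
    for v in window:
        s.append(s[-1] + v)
        q.append(q[-1] + v * v)

    def sse(a, b):
        if a > b:
            return 0.0
        seg = s[b] - s[a - 1]
        return max(0.0, (q[b] - q[a - 1]) - seg * seg / (b - a + 1))

    queues = {}

    def approx_opt(c, k):
        # k = 1 is exact; otherwise only end points of queue k-1 are tried.
        if k == 1:
            return sse(1, c)
        return min((o + sse(b + 1, c) for b, o in queues[k - 1] if b <= c),
                   default=0.0)

    for k in range(1, B):
        queue, a = [], 1
        while a <= n:  # CreateList[a, n, k]
            limit = (1 + delta) * approx_opt(a, k)
            c = a
            while c + 1 <= n and approx_opt(c + 1, k) <= limit:
                c += 1
            queue.append((c, approx_opt(c, k)))
            a = c + 1
        queues[k] = queue

    return approx_opt(n, B), queues
```

Calling it on [0,0,0,1,1,1,1,1] with B=2 and delta=1 yields the end points 3, 6, 8 in queue 1 and an error of 0, matching the example.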
20 Experimental Evaluation
Tests: construction performance; accuracy of the fixed window algorithm when evaluating range sum queries.
Measures: construction performance is measured by time; accuracy is measured by average results.
Data: real data sets extracted from AT&T data warehouses.
21 Accuracy test for various ε and B
Conclusion: for the fixed window histogram, accuracy depends on ε and B, improving as B increases and as ε decreases. The fixed window histogram outperforms the wavelet-based histogram.
[Figure: accuracy plot; legend: Exact, Histogram, Wavelets]
22 Construction time for various ε and B
Conclusion: the wavelet-based method is much slower than the fixed window histogram (so it is not shown here). Construction time grows as B increases or ε decreases.
23 Conclusion
Background knowledge on data streams.
Three algorithms for constructing optimal (ε-approximate) histograms in different scenarios.
Other related work: new operators over a data stream; operations over multiple data streams; sketch techniques, query optimization, etc.
24 Reference 1
[GK02] Sudipto Guha and Nick Koudas. Approximating a data stream for querying and estimation: algorithms and performance evaluation. In ICDE'02.
[GKS01] Sudipto Guha, Nick Koudas and Kyuseok Shim. Data-Streams and Histograms. In STOC'01.
[IP95] Yannis E. Ioannidis and Viswanath Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD'95.
[JKM+98] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik and Torsten Suel. Optimal Histograms with Quality Guarantees. In VLDB'98.
25 Reference 2
[BBD+02] B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and Issues in Data Stream Systems. In PODS'02.
[DGG+02] A. Dobra, M. Garofalakis, J. Gehrke and R. Rastogi. Processing complex aggregate queries over data streams. In SIGMOD'02.
[Gib01] P. B. Gibbons. Distinct Sampling for highly-accurate answers to distinct values queries and event reports. In VLDB'01.
[GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD'01.
[MM02] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In VLDB'02.
[KNV03] Jaewoo Kang, J. F. Naughton and Stratis D. Viglas. Evaluating window joins over unbounded streams. In ICDE'03.
26 Thank you!