Published by Solomon Jones. Modified over 8 years ago.
1
The Haar+ Tree: A Refined Synopsis Data Structure. Panagiotis Karras, HKU, September 7th, 2006.
2
The Need: Approximate query answering (exact answers are not always required). Learning, classification, event detection. Data mining, selectivity estimation. Situations where massive data arrives in a stream: routers, sensors, the Web.
3
The Formal Problem: Given a data set $X$ and a vector set $\{\psi_i\}$, find a representation $F = \sum_i z_i \psi_i$ with $B$ non-zero terms $z_i$ such that an error metric (a function of $X - F$) is minimized. Normalized Minkowski-norm error metric: [formula missing].
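The metric's formula is not present in this transcript. As a hedged illustration only, the sketch below assumes the commonly used normalized form $L_p = \left(\frac{1}{n}\sum_i |x_i - f_i|^p\right)^{1/p}$, with $p = \infty$ read as the maximum absolute error. The function name `lp_error` is hypothetical and does not come from the slides.

```python
import math

def lp_error(x, f, p):
    """Normalized Minkowski (L_p) error between data x and approximation f.

    Assumed form (not taken from the slides):
    ((1/n) * sum |x_i - f_i|^p)^(1/p); p = math.inf gives the maximum
    absolute error.
    """
    if len(x) != len(f) or not x:
        raise ValueError("x and f must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(x, f)]
    if math.isinf(p):
        return max(diffs)
    return (sum(d ** p for d in diffs) / len(diffs)) ** (1.0 / p)
```

For example, `lp_error([5, 3, 12, 4], [4, 4, 12, 4], 1)` averages the absolute errors 1, 1, 0, 0.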
4
Two Principal Methods: Histograms: $B$ non-overlapping constant-value intervals. Haar wavelets (1910): based on binary intervals; $B$ wavelet coefficients.
5
Histograms: Define $B$ buckets $s_i = [b_i, e_i]$. Attribute a value $v_i$ to all items in $s_i$. Assumes neighboring values exhibit slight variations. Advantage: suitable for summarizing smooth natural signals. Disadvantage: unsuitable for data sets with sharp discontinuities. Complexity: at least $O(n^2 B)$ time for an optimal histogram under general error metrics [Bellman 1961, Jagadish et al. 1998, Guha et al. 2004].
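The quadratic-in-$n$ cost comes from a dynamic program that tries every start position for the last bucket. The sketch below is a minimal illustration specialized to the maximum-error metric, where the best value for a bucket is the midpoint of its minimum and maximum; it is not the algorithm of any of the cited papers, and the function name is hypothetical. Its three nested loops give the $O(n^2 B)$ shape.

```python
def optimal_histogram_max_error(data, buckets):
    """Smallest maximum absolute error achievable with at most `buckets` buckets."""
    n = len(data)
    inf = float("inf")
    # best[b][j]: minimal max-error when data[:j] is covered by exactly b buckets.
    best = [[inf] * (n + 1) for _ in range(buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, buckets + 1):
        for j in range(1, n + 1):
            lo = hi = data[j - 1]
            # The last bucket is data[i:j]; grow it leftwards, tracking min and max.
            for i in range(j - 1, -1, -1):
                lo = min(lo, data[i])
                hi = max(hi, data[i])
                bucket_error = (hi - lo) / 2  # midpoint value minimizes max error
                candidate = max(best[b - 1][i], bucket_error)
                if candidate < best[b][j]:
                    best[b][j] = candidate
    return min(best[b][n] for b in range(1, buckets + 1))
```

On `[5, 3, 12, 4]` with two buckets, the best split is `[5, 3]` and `[12, 4]`, giving bucket values 4 and 8.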
6
Haar Wavelets: [Figure: Haar tree for the data 34, 16, 2, 20, 20, 0, 36, 16, with overall average 18 and detail coefficients 0, 7, -8, 9, -9, 10, 10.] Wavelet transform: an orthogonal transform for the hierarchical representation of functions and signals. Haar tree: a structure for visualizing the decomposition and the reconstruction of values. Synopsis: select $B$ non-zero terms. Advantage: approximates discontinuities. Complexity: depends.
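The decomposition can be sketched in a few lines. The code below assumes the unnormalized convention (average = (left + right) / 2, detail = (left - right) / 2), which reproduces the numbers on this slide for the data 34, 16, 2, 20, 20, 0, 36, 16; the function names are illustrative.

```python
def haar_decompose(data):
    """Return [overall average] followed by detail coefficients, coarsest first.

    Assumes len(data) is a power of two and the unnormalized convention
    average = (left + right) / 2, detail = (left - right) / 2.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(data)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level + details  # coarser levels go in front
    return averages + details

def haar_reconstruct(coeffs):
    """Invert haar_decompose: expand one level of detail coefficients at a time."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(values)]
        values = [x for v, d in zip(values, level) for x in (v + d, v - d)]
        pos += len(level)
    return values
```

A synopsis keeps only $B$ of these coefficients and treats the rest as zero during reconstruction.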
7
Previous Classical Haar Approaches: Restricted Haar synopses [Garofalakis and Kumar 2004]: compute the Haar wavelet decomposition of $X$ and preserve the optimal $B$-coefficient subset. Suboptimal quality (usually worse than a histogram). Time for general error metrics: [formula missing]. Unrestricted Haar synopses [Guha and Harb 2005, 2006]: find the optimal coefficient values to assign; quantization by a resolution step $\delta$ is used. Better quality (may be better than a histogram). Time for minimizing $L_p$ error: [formula missing], where $E$ is an upper bound for the normalized $L_p$ error.
8
Deficiencies: A Haar wavelet coefficient contributes its value positively to one interval and negatively to another. Lack of flexibility. Quality constraint. Hard to delimit the value search space for non-maximum-error metrics. Complexity cubic in $n$ for the $L_1$ metric. Conclusion: a different synopsis structure is needed.
9
The Haar+ Tree: [Figure: Haar+ tree over the data values $d_0, \dots, d_3$, with root coefficient $c_0$ and triads $C_1$, $C_2$, $C_3$ holding coefficients $c_1$ through $c_9$.] Triads take the place of single coefficients. Each triad contains one head coefficient and two supplementary coefficients.
10
Basic Properties: A Haar+ synopsis needs to contain at most one coefficient per triad. No additional storage is required. Classical Haar is a special case of Haar+. [Figure: a triad with head coefficient $c_i$ and supplementary coefficients $c_{i+1}$, $c_{i+2}$.]
11
An example: $X = [5, 3, 12, 4]$, $B = 2$. Best restricted synopsis: $\{c_0 = 6, c_7 = 4\}$. Best $L_1$, $L_\infty$ unrestricted synopsis: $\{c_0 = 5.5, c_7 = 4\}$. Best $L_1$, $L_2$ and $L_\infty$ histograms: $\{5, 5, 5, 4\}$, $\{4, 4, 8, 8\}$. Best Haar+ synopsis: $\{c_0 = 4, c_8 = 8\}$. The differences generalize to any gap by multiplication. [Figure: the Haar+ tree over the data 5, 3, 12, 4.]
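The synopses in this example can be checked by reconstructing each one and measuring its maximum error. The sketch below rests on assumptions inferred from the numbers rather than stated on the slide: $c_0$ is added to every position, $c_7$ is the head of the triad over the last two values (added to the third value, subtracted from the fourth), and $c_8$ is that triad's left supplementary coefficient (added to the third value only). The helper names are hypothetical.

```python
def apply_terms(n, terms):
    """Reconstruct n values from (value, plus_positions, minus_positions) terms."""
    out = [0.0] * n
    for value, plus, minus in terms:
        for i in plus:
            out[i] += value
        for i in minus:
            out[i] -= value
    return out

def max_error(x, f):
    """Maximum absolute difference between data x and approximation f."""
    return max(abs(a - b) for a, b in zip(x, f))

X = [5, 3, 12, 4]
ALL, NONE = range(4), range(0)

# Assumed layout: c0 covers all positions; c7 is a head coefficient on
# positions 2 (plus) and 3 (minus); c8 is a left supplementary on position 2.
restricted = apply_terms(4, [(6, ALL, NONE), (4, range(2, 3), range(3, 4))])
unrestricted = apply_terms(4, [(5.5, ALL, NONE), (4, range(2, 3), range(3, 4))])
haar_plus = apply_terms(4, [(4, ALL, NONE), (8, range(2, 3), NONE)])
histogram = [4, 4, 8, 8]

for name, f in [("restricted", restricted), ("unrestricted", unrestricted),
                ("histogram", histogram), ("Haar+", haar_plus)]:
    print(name, f, max_error(X, f))
```

Under these assumptions the maximum errors are 3 (restricted), 2.5 (unrestricted), 4 (histogram) and 1 (Haar+), which matches the ordering the slide is illustrating.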
12
Delimiting the value domains: [Figure: a classical Haar coefficient $c_i$ compared with a Haar+ triad of $c_i$, $c_{i+1}$, $c_{i+2}$.]
13
Delimiting the value domains: Let $m_i$, $M_i$ be the minimum and maximum values in the scope of triad $C_i$. A non-zero head coefficient does not need to be assigned for incoming values $v$ outside $(m_i, M_i)$. Otherwise, [formula missing]. Similarly, [formula missing]. Besides, [formula missing]. In conclusion, the cardinality of [symbol missing] is [formula missing]. Such tight delimitation is not possible with classical Haar. Generalization of the structure simplifies computation. Space that can be allocated at $C_i$: [formula missing].
14
Deriving the answer: A bottom-up recursive process. At each triad $C_i$, it calculates an array $A$ from the pre-calculated arrays $L$, $R$ of its children triads. $A[v, b]$ contains $E(i, v, b)$ for $C_i$, together with the optimal $z$ and $b'$. At most $\log n + 1$ arrays are stored concurrently.
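The recursion over $E(i, v, b)$ can be illustrated on tiny inputs. The sketch below is not the procedure from the slides: it is a memoized top-down search rather than a bottom-up pass over arrays, it handles only the maximum-error metric, and it draws candidate values from a caller-supplied `grid` instead of the delimited value domains. It also assumes that a head coefficient adds $z$ to the left half and subtracts it from the right half, and that a supplementary coefficient adds $z$ to one half only. All names are hypothetical.

```python
from functools import lru_cache

def haar_plus_max_error(data, budget, grid):
    """Minimal max-error of a Haar+-style synopsis with at most `budget` terms.

    Assumes len(data) is a power of two (at least 2). Values a coefficient
    can produce are restricted to the candidates in `grid`.
    """
    n = len(data)
    grid = tuple(grid)

    @lru_cache(maxsize=None)
    def E(lo, hi, v, b):
        # Error in data[lo:hi] given incoming value v and b coefficients left.
        if hi - lo == 1:
            return abs(data[lo] - v)
        mid = (lo + hi) // 2

        def split(v_left, v_right, rest):
            # Try every way of dividing the remaining budget between the halves.
            return min(max(E(lo, mid, v_left, bl), E(mid, hi, v_right, rest - bl))
                       for bl in range(rest + 1))

        best = split(v, v, b)  # no coefficient from this triad
        if b >= 1:
            for a in grid:
                best = min(best,
                           split(a, 2 * v - a, b - 1),  # head: z = a - v
                           split(a, v, b - 1),          # left supplementary
                           split(v, a, b - 1))          # right supplementary
        return best

    best = E(0, n, 0, budget)  # root coefficient c0 unused
    if budget >= 1:
        best = min(best, min(E(0, n, c0, budget - 1) for c0 in grid))
    return best
```

With `data = [5, 3, 12, 4]`, `budget = 2` and `grid = range(13)`, the search settles on a maximum error of 1, the same error as the synopsis $\{c_0 = 4, c_8 = 8\}$. Each triad contributes at most one coefficient in this search, in line with the basic property of Haar+ synopses.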
15
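Complexity Analysis: Basic derivation of the optimal error: time [formula missing], space [formula missing]. Space-efficient synopsis construction: time [formula missing], space [formula missing]. Time-efficient synopsis construction: time [formula missing], space [formula missing]. The expressions in the min are equal when [formula missing]. The same holds for all monotonic distributive error metrics. B time factor reduced to log 2 B for maximum-error metrics.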
16
Time Complexity Comparison ($L_1$ error metric): Optimal histogram: [formula missing]. Restricted Haar: [formula missing]. Unrestricted Haar: [formula missing]. Haar+: [formula missing]. Haar+ computes the synopsis in time linear in $n$.
17
Experiments: Description of Data. Data sets with discontinuities. FR: mean monthly flows of the Fraser River at Hope, B.C.; periodic autoregression features. FC: frequencies of distinct values of an attribute in the forest cover type relation (US Forest Service). Used in previous work.
18
Experiments: Time, B=n/64
19
Experiments: Time, B=32
20
Quality, Maximum Error, FC dataset
21
Quality, Maximum Error, FR dataset
22
Quality, Average Error, FC dataset
23
Quality, Average Error, FR dataset
24
Conclusions: Haar+ achieves higher quality than Haar wavelets (expected, since it cannot be worse). It can also achieve higher quality than the optimal histogram, and it outperforms histogram quality in cases where classical Haar does not. It constructs synopses in time linear in $n$ for any monotonic distributive error metric. It is the first structure to achieve quality higher than the optimal histogram in linear time. Future work: extension to multidimensional data.
25
Related Work:
R. Bellman. On the approximation of curves by line segments using dynamic programming. Communications of the ACM, 1961.
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998.
S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004.
M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum-error metrics. PODS 2004.
S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005.
S. Guha and B. Harb. Wavelet synopses for data streams: Minimizing non-Euclidean error. KDD 2005.
S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006.
26
Thank you! Questions?