Lattice Histograms: A Resilient Synopsis Structure

Panagiotis Karras HKU, February 15th, 2007

Need for Data Approximation
Approximate query answering (exact answers are not always required).
Learning, classification, event detection.
Data mining, selectivity estimation.
Situations where massive data arrives in a stream.
Distributed stream monitoring.

The Formal Problem
Given a data set X and a vector set {ψi}, find a representation F = Σi zi ψi using B non-zero terms zi, such that an error metric (a function of X - F) is minimized.
The error metric is a normalized Minkowski-norm (Lp) error.
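For orientation, one common unweighted form of a normalized Minkowski-norm error is sketched below. The symbols n (number of data values), x_i (original values) and f_i (reconstructed values) are assumptions introduced here, and the talk may use a weighted variant; the maximum-error metric is the limit case p → ∞.

```latex
\mathcal{L}_p(X, F) = \left( \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - f_i \rvert^{p} \right)^{1/p},
\qquad
\mathcal{L}_\infty(X, F) = \max_{1 \le i \le n} \lvert x_i - f_i \rvert
```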

Traditional Methods
Histograms: B non-overlapping constant-value intervals.
Haar wavelets (1910): based on binary intervals; B wavelet coefficients.

Histograms
Define B buckets, si = [bi, ei].
Attribute a single value vi to all items in si.
Assumes that neighboring values exhibit only slight variations.
Advantage: suitable for summarizing smooth natural signals.
Disadvantage: unsuitable for data sets with sharp discontinuities.
Complexity: at least O(n^2 B) time for an optimal histogram under general error metrics [Bellman 1961, Jagadish et al. 1998, Guha et al. 2004].
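The O(n^2 B) bound comes from a dynamic program over bucket boundaries. The following is a minimal sketch of that classic interval DP, specialized here (as an assumption, to keep it short) to maximum absolute error, where the best value for a bucket is the midpoint of its minimum and maximum. The function name and return format are hypothetical and do not come from the cited papers.

```python
def optimal_histogram_max_error(x, B):
    """Return (error, buckets) for a B-bucket histogram minimizing max |x_i - v|.

    Each bucket is a tuple (first_index, last_index, value). Requires 1 <= B <= len(x).
    """
    n = len(x)
    if not 1 <= B <= n:
        raise ValueError("need 1 <= B <= len(x)")
    INF = float("inf")

    # cost[i][j]: max abs error when x[i..j] is represented by its mid-range value.
    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lo = hi = x[i]
        for j in range(i, n):
            lo, hi = min(lo, x[j]), max(hi, x[j])
            cost[i][j] = (hi - lo) / 2

    # best[b][j]: minimal error for covering the prefix x[0..j-1] with exactly b buckets.
    best = [[INF] * (n + 1) for _ in range(B + 1)]
    cut = [[0] * (n + 1) for _ in range(B + 1)]
    best[0][0] = 0.0
    for b in range(1, B + 1):
        for j in range(1, n + 1):
            for i in range(b - 1, j):  # last bucket is x[i..j-1]
                if best[b - 1][i] == INF:
                    continue
                e = max(best[b - 1][i], cost[i][j - 1])
                if e < best[b][j]:
                    best[b][j], cut[b][j] = e, i

    # Walk the stored cut points backwards to recover the buckets.
    buckets, j = [], n
    for b in range(B, 0, -1):
        i = cut[b][j]
        seg = x[i:j]
        buckets.append((i, j - 1, (min(seg) + max(seg)) / 2))
        j = i
    buckets.reverse()
    return best[B][n], buckets


X = [4, 3, 5, 10, 12, 11, 11, 4]
err2, _ = optimal_histogram_max_error(X, 2)
err3, buckets3 = optimal_histogram_max_error(X, 3)
print(err2, err3, buckets3)
```

The three nested loops over b, j and i give the O(n^2 B) running time. Other error metrics would change only the cost table and the way bucket errors are combined.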

Haar Wavelets
Wavelet transform: an orthogonal transform for the hierarchical representation of functions and signals.
Haar tree: a structure for visualizing the decomposition and the reconstruction of values.
Synopsis: select B non-zero terms.
Advantage: approximates discontinuities.
Complexity: depends.
[Figure: example Haar tree]
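As an illustration of the decomposition, the sketch below computes an unnormalized Haar transform by repeated pairwise averaging and differencing, and then inverts it. The sign convention (detail = (left - right) / 2) and the function names are assumptions made for this example; other conventions differ only by sign or scaling.

```python
def haar_decompose(x):
    """Return [overall average, detail coefficients from coarsest to finest]."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    level = list(x)
    while len(level) > 1:
        pairs = [(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        averages = [(a + b) / 2 for a, b in pairs]
        differences = [(a - b) / 2 for a, b in pairs]
        details = differences + details  # coarser levels end up in front
        level = averages
    return level + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose: left child = parent + detail, right child = parent - detail."""
    level = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(level)]
        nxt = []
        for avg, d in zip(level, detail):
            nxt += [avg + d, avg - d]
        pos += len(level)
        level = nxt
    return level


X = [4, 3, 5, 10, 12, 11, 11, 4]
coeffs = haar_decompose(X)
restored = haar_reconstruct(coeffs)
print(coeffs)
```

A B-term synopsis keeps B of these coefficients and treats the rest as zero; which coefficients to keep depends on the error metric.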

The Haar+ Tree
Triads in place of single coefficients.
Each triad contains one head coefficient and two supplementary coefficients.
[Figure: Haar+ tree with root c0 and triads C1 to C3 holding coefficients c1 to c9, over data values d0 to d3]

Compact Hierarchical Histograms (CHH) [Reiss et al., VLDB 2006]
Like the Haar+ tree, but with no head coefficients.
Algorithms try to calculate a single value on each node.
[Figure: CHH tree with nodes c0 to c6 over data values d0 to d3]

Discussion
Quote: «The interaction between histograms and indices presents opportunities but also several technical challenges that need to be investigated.» (Ioannidis, ICDT 2003)
How well have we done?

Discussion
Histograms:
+ choose intervals freely
- do not exploit hierarchy
Hierarchical index structures (Haar+, CHH):
+ exploit hierarchy
- work with predefined (dyadic) intervals
Can we do better?

[Figure: lattice of nodes c0 to c35 over data values d0 to d7]

The Lattice Histogram
Can exploit any hierarchy.
n(n+1)/2 nodes.
k nodes in the k-th level, each affecting n-k+1 values.
Generic LH: any nodes occupied.
Hierarchical LH (HLH): contained nodes occupied.
Plain histogram and CHH: special cases of HLH.
Equal storage for sparse data sets.

An example
X = [4,3,5,10,12,11,11,4], B = 2.
Best L1, Linf histograms:
Best L1, Linf Haar+ synopses:
L1: {c0=4, c8=7}, [4,4,4,4,11,11,4,4]
Linf: {c0=6, c3=2}, [6,6,6,6,8,8,8,8]
Best L1, Linf lattice synopsis: {c0=4, c13=11}, [4,4,4,11,11,11,11,4]
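The quality gap in this example can be checked directly from the reconstructions listed on the slide. The short script below is only an illustration: it computes the sum of absolute errors (the L1 error before normalization) and the maximum absolute error for each reconstruction. The helper name `summarize` is hypothetical.

```python
X = [4, 3, 5, 10, 12, 11, 11, 4]

# Reconstructions exactly as listed in the example.
reconstructions = {
    "Haar+ (L1-optimal)": [4, 4, 4, 4, 11, 11, 4, 4],
    "Haar+ (Linf-optimal)": [6, 6, 6, 6, 8, 8, 8, 8],
    "Lattice": [4, 4, 4, 11, 11, 11, 11, 4],
}


def summarize(x, f):
    """Return (sum of absolute errors, maximum absolute error)."""
    errors = [abs(a - b) for a, b in zip(x, f)]
    return sum(errors), max(errors)


results = {name: summarize(X, f) for name, f in reconstructions.items()}
for name, (total, worst) in results.items():
    print(f"{name}: sum |error| = {total}, max |error| = {worst}")
```

With the same budget B = 2, the lattice synopsis has a sum of absolute errors of 4 and a maximum error of 1, against 16 and 4 for the respective best Haar+ synopses.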

Technical Stuff
Index of the first node in level k is a triangular number: k(k-1)/2, counting levels from k = 1 at the root c0.
Node ci resides in level k = floor((1 + sqrt(1 + 8i)) / 2).
Children of ci are ci+k and ci+k+1.
Nepot of ci: cnep = ci+2k+2.
Complementary pair: a pair of nodes covering the interval of ci.
Linear descendants of ci: members of a complementary pair.
Theorem: linear descendants of an occupied node do not need to be occupied.
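The index arithmetic can be sketched in a few lines. The code below assumes that levels are counted from 1 at the root c0 and that nodes are numbered left to right within a level, which agrees with the child and nepot rules and with the earlier example (c13 covers d3 to d6 when n = 8). The reconstruction rule, where a position takes the value of the narrowest occupied node covering it, is likewise an assumption made for illustration, and all function names are hypothetical.

```python
import math


def level_start(k):
    """Index of the first node in level k (levels counted from 1 at the root)."""
    return k * (k - 1) // 2


def level_of(i):
    """Level of node c_i: the largest k with level_start(k) <= i."""
    return (1 + math.isqrt(1 + 8 * i)) // 2


def children(i):
    """Indices of the two children of c_i."""
    k = level_of(i)
    return i + k, i + k + 1


def nepot(i):
    """Index of the nepot of c_i (two levels down, trimmed by one value on each side)."""
    return i + 2 * level_of(i) + 2


def interval(i, n):
    """Inclusive range (lo, hi) of data positions covered by c_i in a lattice over n values."""
    k = level_of(i)
    lo = i - level_start(k)
    return lo, lo + (n - k)  # a level-k node covers n - k + 1 values


def reconstruct(synopsis, n):
    """Rebuild n values from {node index: value}; narrower nodes override wider ones."""
    out = [None] * n
    for i in sorted(synopsis):  # ascending index visits wide nodes before narrow ones
        lo, hi = interval(i, n)
        for pos in range(lo, hi + 1):
            out[pos] = synopsis[i]
    return out


print(interval(13, 8), reconstruct({0: 4, 13: 11}, 8))
```

Running it on the synopsis {c0=4, c13=11} with n = 8 reproduces the reconstruction [4,4,4,11,11,11,11,4] given in the example. How overlapping nodes on the same level should be resolved in a generic lattice histogram is not covered by this sketch.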

[Figure: a node with its linear descendants and nepotic descendants, over data values d0 to d7]

DP scheme
Bottom-up process.
At each node ci, it calculates an array A from the pre-calculated arrays of the nepot and the linear descendants of ci.
A[i,v,b] contains E(i,v,b) for ci, together with the optimal z and b’.

Special case: Maximum Error Metrics
Solve the error-bounded problem first.
Then employ this solution repeatedly, using binary search, in order to solve the dual, space-bounded problem.
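The reduction from the space-bounded to the error-bounded problem can be sketched independently of the lattice itself. In the example below, the error-bounded solver is a stand-in: a greedy routine for a plain histogram under maximum absolute error, not the lattice algorithm of the talk. The candidate error set (half of every pairwise difference) is also an assumption that fits this stand-in. Only the binary-search wrapper illustrates the technique named on the slide.

```python
def min_buckets_for_error(x, eps):
    """Error-bounded stand-in: fewest plain-histogram buckets with max |error| <= eps."""
    count = 0
    lo = hi = None
    for v in x:
        if lo is None or (max(hi, v) - min(lo, v)) / 2 > eps:
            count += 1          # value does not fit: open a new bucket
            lo = hi = v
        else:
            lo, hi = min(lo, v), max(hi, v)
    return count


def space_bounded(x, B):
    """Smallest max |error| achievable with at most B buckets (B >= 1, x non-empty)."""
    # For this stand-in the optimal error is half the spread of some bucket,
    # so it is enough to search over half of every pairwise difference.
    candidates = sorted({abs(a - b) / 2 for a in x for b in x})
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if min_buckets_for_error(x, candidates[mid]) <= B:
            hi = mid            # feasible: try a tighter error bound
        else:
            lo = mid + 1        # infeasible: the bound must be relaxed
    return candidates[lo]


X = [4, 3, 5, 10, 12, 11, 11, 4]
print(space_bounded(X, 2), space_bounded(X, 3))
```

The binary search is valid because the number of buckets needed never increases as the error bound grows, so feasibility is monotone in the bound.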

Complexity Analysis
[Table: time, total space, and working space, for general error and maximum error metrics]

Time Complexity Comparison
L1 error metric.
[Table: time complexity of Optimal Histogram, Unrestricted Haar, Haar+, CHH, and Lattice]

Experiments: Description of Data
Datasets with discontinuities.
FR: mean monthly flows of the Fraser River at Hope, B.C. Periodic autoregression features.
FC: frequencies of distinct values of an attribute in the forest cover type relation (US Forest Service). Used in previous work.

Quality, Maximum Error, FC dataset

Quality, Maximum Error, FR dataset

Conclusions
Lattice Histogram achieves higher quality than previous methods.
Outperforms Haar+ in its advantageous domain.
Trade-off only with Haar+, in time versus quality.
Next step: space-efficient heuristic.

Related Work
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel: Optimal histograms with quality guarantees. VLDB 1998.
Y. Ioannidis: Approximations in database systems. ICDT 2003.
M. Garofalakis and A. Kumar: Wavelet synopses for general error metrics. TODS, 30(4):888–928, 2005.
S. Guha and B. Harb: Approximation algorithms for wavelet transform coding of data streams. SODA 2006.
F. Reiss, M. Garofalakis, and J. M. Hellerstein: Compact histograms for hierarchical identifiers. VLDB 2006.
P. Karras and N. Mamoulis: The Haar+ tree: a refined synopsis data structure. ICDE 2007.

Thank you! Questions?
