Lattice Histograms: A Resilient Synopsis Structure


Lattice Histograms: A Resilient Synopsis Structure Panagiotis Karras HKU, February 15th, 2007

Need for Data Approximation. Approximate query answering (exact answers are not always required). Learning, classification, event detection. Data mining, selectivity estimation. Situations where massive data arrives in a stream. Distributed stream monitoring.

The Formal Problem. Given a data set $X$ and a vector set $\{\psi_i\}$, find a representation $F = \sum_i z_i \psi_i$ with $B$ non-zero terms $z_i$, such that an error metric (a function of $X - F$) is minimized. The error metric considered is the normalized Minkowski norm.
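As a hedged illustration of the error metric named on this slide, the Python sketch below computes a normalized Minkowski ($L_p$) error between a data vector and its approximation. It assumes that "normalized" means dividing the sum of p-th powers by the number of items n before taking the p-th root, and that p = infinity denotes the maximum absolute error. The function name `minkowski_error` is illustrative and does not come from the slides.

```python
import math


def minkowski_error(x, f, p):
    """Normalized L_p error between data x and approximation f.

    Assumption: 'normalized' means averaging |x_i - f_i|**p over the n items
    before taking the p-th root. p may be any positive number or math.inf,
    in which case the maximum absolute error is returned.
    """
    if len(x) != len(f) or not x:
        raise ValueError("x and f must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(x, f)]
    if p == math.inf:
        return max(diffs)
    return (sum(d ** p for d in diffs) / len(x)) ** (1.0 / p)
```

With p = 1 this is the mean absolute error, and with p = 2 it is the root-mean-square error.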

Traditional Methods. Histograms: B non-overlapping constant-value intervals. Haar wavelets (1910): based on binary intervals; B wavelet coefficients.

Histograms. Define B buckets $s_i = [b_i, e_i]$. Attribute a single value $v_i$ to all items in $s_i$. Assumes that neighboring values exhibit only slight variations. Advantage: suitable for summarizing smooth natural signals. Disadvantage: unsuitable for data sets with sharp discontinuities. Complexity: at least $O(n^2 B)$ time for an optimal histogram under general error metrics [Bellman 1961, Jagadish et al. 1998, Guha et al. 2004].
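The $O(n^2 B)$ bound quoted above corresponds to the classical dynamic program over bucket boundaries. The Python sketch below is a minimal version of that idea, written for the sum-squared error only (an assumption made here for simplicity; the slide speaks of general error metrics). Under that error each bucket is represented by its mean. The name `optimal_histogram` is illustrative.

```python
def optimal_histogram(x, B):
    """Best histogram with exactly B buckets under sum-squared error.

    Returns (error, buckets), where buckets is a list of
    (start, end, value) triples with inclusive indices.
    Runs in O(n^2 * B) time using prefix sums for O(1) bucket costs.
    """
    n = len(x)
    if not 1 <= B <= n:
        raise ValueError("need 1 <= B <= len(x)")
    ps = [0.0] * (n + 1)   # prefix sums of values
    pss = [0.0] * (n + 1)  # prefix sums of squared values
    for i, v in enumerate(x):
        ps[i + 1] = ps[i] + v
        pss[i + 1] = pss[i] + v * v

    def sse(i, j):
        """Squared error of one bucket covering items i..j-1 by their mean."""
        s = ps[j] - ps[i]
        return (pss[j] - pss[i]) - s * s / (j - i)

    inf = float("inf")
    # err[b][j]: least error for the first j items using exactly b buckets.
    err = [[inf] * (n + 1) for _ in range(B + 1)]
    cut = [[0] * (n + 1) for _ in range(B + 1)]
    err[0][0] = 0.0
    for b in range(1, B + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):  # last bucket covers items i..j-1
                cand = err[b - 1][i] + sse(i, j)
                if cand < err[b][j]:
                    err[b][j] = cand
                    cut[b][j] = i

    # Walk the stored cut points backwards to recover the buckets.
    buckets, j = [], n
    for b in range(B, 0, -1):
        i = cut[b][j]
        buckets.append((i, j - 1, (ps[j] - ps[i]) / (j - i)))
        j = i
    buckets.reverse()
    return err[B][n], buckets
```

Other error metrics would need a different per-bucket cost and representative value, but the three nested loops, which give the $n^2 B$ factor, stay the same.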

Haar Wavelets. Wavelet transform: an orthogonal transform for the hierarchical representation of functions and signals. Haar tree: a structure for visualizing the decomposition and the reconstruction of values. Synopsis: select B non-zero terms. Advantage: approximates discontinuities. Complexity: depends. [Figure: Haar tree for the data 34, 16, 2, 20, 20, 0, 36, 16, with pairwise averages 25, 11, 10, 26 and 18, 18, overall average 18, and detail coefficients 9, -9, 10, 10 and 7, -8.]
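The decomposition described on this slide can be sketched in a few lines of Python. The sketch uses the unnormalized convention in which each detail coefficient is half the difference between the left and right averages, which is one common convention and is consistent with the numbers appearing on the slide. The function names are illustrative.

```python
def haar_decompose(x):
    """Unnormalized Haar decomposition of a sequence whose length is a power of 2.

    Returns [overall average, detail coefficients from coarsest to finest].
    Each detail is (left - right) / 2 for a pair of adjacent averages.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    details = []
    cur = list(x)
    while len(cur) > 1:
        avgs = [(cur[i] + cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        diffs = [(cur[i] - cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        details = diffs + details  # coarser levels go in front
        cur = avgs
    return cur + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose: left child = avg + detail, right child = avg - detail."""
    cur = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(cur)]
        nxt = []
        for avg, det in zip(cur, level):
            nxt += [avg + det, avg - det]
        pos += len(cur)
        cur = nxt
    return cur


data = [34, 16, 2, 20, 20, 0, 36, 16]  # data values appearing on the slide
coeffs = haar_decompose(data)
print(coeffs)  # [18.0, 0.0, 7.0, -8.0, 9.0, -9.0, 10.0, 10.0]
```

A B-term synopsis keeps B of these coefficients and treats the rest as zero; which ones to keep depends on the error metric being minimized.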

The Haar+ Tree. Triads in place of single coefficients. Each triad contains one head coefficient and two supplementary coefficients. [Figure: Haar+ tree over data values d0 to d3, with root coefficient c0 and triads C1, C2, C3 holding coefficients c1 to c9.]

Compact Hierarchical Histograms [Reiss et al., VLDB 2006]. Like the Haar+ tree, but without head coefficients. Algorithms try to compute a single value at each node. [Figure: CHH tree with nodes c0 to c6 over data values d0 to d3.]

Discussion. Quote: «The interaction between histograms and indices presents opportunities but also several technical challenges that need to be investigated.» (Ioannidis, ICDT 2003). How well have we done?

Discussion. Histograms: (+) choose intervals freely; (-) do not exploit hierarchy. Hierarchical index structures (Haar+, CHH): (+) exploit hierarchy; (-) work with predefined (dyadic) intervals. Can we do better?

[Figure: lattice of nodes c0 to c35 over data values d0 to d7.]

The Lattice Histogram. Can exploit any hierarchy. n(n+1)/2 nodes in total: k nodes in the k-th level, each affecting n-k+1 values. Generic LH: any nodes may be occupied. Hierarchical LH (HLH): contained nodes occupied. Plain histograms and CHH are special cases of HLH. Equal storage for sparse data sets.


An example. X = [4, 3, 5, 10, 12, 11, 11, 4], B = 2. Best $L_1$, $L_\infty$ histograms: [figure]. Best $L_1$, $L_\infty$ Haar+ synopses: $L_1$: {c0=4, c8=7}, giving [4, 4, 4, 4, 11, 11, 4, 4]; $L_\infty$: {c0=6, c3=2}, giving [6, 6, 6, 6, 8, 8, 8, 8]. Best $L_1$, $L_\infty$ lattice synopsis: {c0=4, c13=11}, giving [4, 4, 4, 11, 11, 11, 11, 4].
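The lattice synopsis in this example can be expanded back into an approximate data vector with a short Python sketch. Two points are assumptions inferred from the slides rather than stated on them: that the j-th node (counting from 0) of level k covers positions j to j+n-k, and that an occupied node overrides any wider occupied node covering the same position. Both are consistent with "k nodes in the k-th level, affecting n-k+1 values" and with the reconstruction shown in the example. The function names are illustrative.

```python
def node_interval(i, n):
    """Inclusive interval of data positions covered by lattice node c_i.

    Assumption: levels are numbered from 1, level k starts at index
    k*(k-1)/2, and the j-th node of level k covers positions j..j+n-k.
    """
    if not 0 <= i < n * (n + 1) // 2:
        raise ValueError("node index out of range for this lattice")
    k = 1
    while (k + 1) * k // 2 <= i:  # first index of level k+1
        k += 1
    j = i - k * (k - 1) // 2
    return j, j + n - k


def reconstruct(synopsis, n):
    """Expand {node index: value} into an approximation of length n.

    Assumption: nodes are applied from widest to narrowest (ascending
    index), so a narrower occupied node overrides its occupied ancestors.
    Positions covered by no occupied node are left at 0.
    """
    out = [0] * n
    for i in sorted(synopsis):
        lo, hi = node_interval(i, n)
        for t in range(lo, hi + 1):
            out[t] = synopsis[i]
    return out


x = [4, 3, 5, 10, 12, 11, 11, 4]
approx = reconstruct({0: 4, 13: 11}, len(x))
print(approx)  # [4, 4, 4, 11, 11, 11, 11, 4]
print(max(abs(a - b) for a, b in zip(x, approx)))  # 1
```

Node c13 is the fourth node of level 5, so it covers positions 3 to 6. No dyadic (Haar-style) interval isolates exactly that run of large values, so the Haar+ synopses above cannot capture it with two terms.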

Technical Stuff. The index of the first node in level k is the k-th triangular number: [formula]. Node $c_i$ resides in level [formula]. The children of $c_i$ are $c_{i+k}$ and $c_{i+k+1}$. The nepot of $c_i$ is $c_{nep} = c_{i+2k+2}$. Complementary pair: a pair of nodes covering the interval of $c_i$. Linear descendants of $c_i$: members of complementary pairs. Theorem: linear descendants of an occupied node do not need to be occupied.
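The index arithmetic on this slide can be sketched in Python as follows. The child and nepot offsets are taken from the slide. The closed forms for the first index of a level and for the level of a node are not legible in the transcript, so the versions below are derived here under the assumption that levels are numbered from 1 and triangular numbers are counted starting from 0 (0, 1, 3, 6, ...). All function names are illustrative.

```python
import math


def first_index(k):
    """Index of the first node in level k (levels numbered from 1)."""
    return k * (k - 1) // 2


def level_of(i):
    """Level k of node c_i, i.e. the largest k with first_index(k) <= i."""
    return (1 + math.isqrt(8 * i + 1)) // 2


def children(i, n):
    """Children c_{i+k} and c_{i+k+1} of c_i, or () at the bottom level n."""
    k = level_of(i)
    if k >= n:
        return ()
    return (i + k, i + k + 1)


def nepot(i, n):
    """Nepot c_{i+2k+2} of c_i: the grandchild shared by both children.

    Returns None when the lattice has no level k+2.
    """
    k = level_of(i)
    if k + 2 > n:
        return None
    return i + 2 * k + 2


n = 8
print(children(0, n), nepot(0, n))  # (1, 2) 4
print(level_of(13), first_index(5))  # 5 10
```

The nepot of c_i is a child of both c_{i+k} and c_{i+k+1}: for c0 the children are c1 and c2, whose children are (c3, c4) and (c4, c5), and c4 is the node they share.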

[Figure: a node with its linear descendants and nepotic descendants, over data values d0 to d7.]

DP scheme. Bottom-up process. At each node $c_i$, it calculates an array A from the pre-calculated arrays of its nepot and linear descendants. A[i,v,b] contains E(i,v,b) for $c_i$, together with the optimal z and b'.

Special case: Maximum Error Metrics. Solve the error-bounded problem: [formula]. Employ this solution repeatedly, using binary search, in order to solve the dual, space-bounded problem.
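The duality idea on this slide, solving the space-bounded problem by binary search over an error-bounded solver, can be illustrated with a deliberately simpler structure. The Python sketch below uses a plain histogram under maximum absolute error, not the lattice histogram, because a plain histogram has a short greedy error-bounded solver. The binary search runs over the finite set of candidate errors (half the value range of every interval). All names are illustrative.

```python
def min_buckets(x, eps):
    """Error-bounded problem for a plain histogram under maximum error.

    Greedily extends each bucket while half its value range stays <= eps,
    which yields the fewest buckets. Returns (start, end, value) triples
    with inclusive indices; each value is the midpoint of the bucket's range.
    """
    buckets, start = [], 0
    lo = hi = x[0]
    for t in range(1, len(x)):
        new_lo, new_hi = min(lo, x[t]), max(hi, x[t])
        if (new_hi - new_lo) / 2 > eps:
            buckets.append((start, t - 1, (lo + hi) / 2))
            start, lo, hi = t, x[t], x[t]
        else:
            lo, hi = new_lo, new_hi
    buckets.append((start, len(x) - 1, (lo + hi) / 2))
    return buckets


def space_bounded(x, B):
    """Space-bounded dual: smallest max error achievable with at most B buckets."""
    if B < 1 or not x:
        raise ValueError("need non-empty data and B >= 1")
    # The optimal error must equal half the value range of some interval.
    cands = set()
    for i in range(len(x)):
        lo = hi = x[i]
        for j in range(i, len(x)):
            lo, hi = min(lo, x[j]), max(hi, x[j])
            cands.add((hi - lo) / 2)
    cands = sorted(cands)
    left, right = 0, len(cands) - 1
    while left < right:  # find the smallest feasible candidate error
        mid = (left + right) // 2
        if len(min_buckets(x, cands[mid])) <= B:
            right = mid
        else:
            left = mid + 1
    return cands[left], min_buckets(x, cands[left])


eps, buckets = space_bounded([4, 3, 5, 10, 12, 11, 11, 4], 2)
print(eps, buckets)
```

The search is valid because the number of buckets needed never increases as the error bound grows. The same monotonicity is what allows an error-bounded solver for another structure to be reused for the space-bounded problem.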

Complexity Analysis. [Table: time, total space, and working space, each for general error and for maximum error; the values are not preserved.]

Time Complexity Comparison. $L_1$ error metric. [Table: optimal histogram, unrestricted Haar, Haar+, CHH, and lattice histogram; the values are not preserved.]

Experiments: Description of Data. Datasets with discontinuities. FR: mean monthly flows of the Fraser River at Hope, B.C.; periodic autoregression features. FC: frequencies of distinct values of an attribute in the forest cover type relation (US Forest Service); used in previous work.

Quality, Maximum Error, FC dataset

Quality, Maximum Error, FR dataset

Conclusions. The Lattice Histogram achieves higher quality than previous methods. It outperforms Haar+ in its advantageous domain. The only trade-off, against Haar+, is time versus quality. Next step: a space-efficient heuristic.

Related Work. H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel: Optimal histograms with quality guarantees. VLDB 1998. Y. Ioannidis: Approximations in database systems. ICDT 2003. M. Garofalakis and A. Kumar: Wavelet synopses for general error metrics. TODS, 30(4):888–928, 2005. S. Guha and B. Harb: Approximation algorithms for wavelet transform coding of data streams. SODA 2006. F. Reiss, M. Garofalakis, and J. M. Hellerstein: Compact histograms for hierarchical identifiers. VLDB 2006. P. Karras and N. Mamoulis: The Haar+ tree: a refined synopsis data structure. ICDE 2007.

Thank you! Questions?