The Haar+ Tree: A Refined Synopsis Data Structure. Panagiotis Karras, HKU, September 7th, 2006.

Presentation transcript:


The Need
Approximate query answering (exact answers are not always required).
Learning, classification, event detection.
Data mining, selectivity estimation.
Situations where massive data arrives in a stream: routers, sensors, Web.

The Formal Problem
Given a data set X and a vector set {ψ_i}, find a representation F = Σ_i z_i ψ_i with at most B non-zero terms z_i, such that an error metric (a function of X − F) is minimized.
The error metric considered is the normalized Minkowski-norm (L_p) error.
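The formula for the error metric did not survive in this transcript. As a hedged illustration, the sketch below uses the standard definition of a normalized L_p norm, (1/n · Σ_i |x_i − f_i|^p)^(1/p), with the maximum absolute error as the p = ∞ case. The function name `normalized_lp_error` is hypothetical, and the exact normalization used on the slide is an assumption.

```python
import math


def normalized_lp_error(x, f, p):
    """Normalized L_p error between data x and approximation f.

    Assumes the standard form (1/n * sum |x_i - f_i|^p)^(1/p);
    p = math.inf gives the maximum absolute error.
    """
    if len(x) != len(f) or not x:
        raise ValueError("x and f must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(x, f)]
    if math.isinf(p):
        return max(diffs)
    return (sum(d ** p for d in diffs) / len(x)) ** (1.0 / p)
```

For instance, with X = [5, 3, 12, 4] and the approximation [4, 4, 12, 4], the normalized L_1 error is 0.5 and the maximum error is 1.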

Two Principal Methods
Histograms: B non-overlapping constant-value intervals.
Haar wavelets (1910): based on binary intervals; B wavelet coefficients.

Histograms
Define B buckets s_i = [b_i, e_i].
Assign a value v_i to all items in s_i.
Assumes that neighboring values exhibit only slight variations.
Advantage: suitable for summarizing smooth natural signals.
Disadvantage: unsuitable for data sets with sharp discontinuities.
Complexity: at least O(n^2 B) time for the optimal histogram under general error metrics [Bellman 1961, Jagadish et al. 1998, Guha et al. 2004].
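To make the O(n^2 B) bound concrete, here is a minimal sketch of the classic dynamic program for an optimal histogram, specialized to the sum-of-squared-errors metric (each bucket is represented by its mean). The function name `optimal_histogram_sse` and the choice of metric are assumptions for illustration; other error metrics would need a different bucket cost.

```python
def optimal_histogram_sse(x, B):
    """Split x into exactly B contiguous buckets minimizing the sum of squared errors.

    Returns (error, buckets), where each bucket is (start, end, value)
    covering x[start:end]. Runs in O(n^2 B) time using prefix sums.
    """
    n = len(x)
    if not 1 <= B <= n:
        raise ValueError("need 1 <= B <= len(x)")
    s = [0.0] * (n + 1)  # prefix sums of values
    q = [0.0] * (n + 1)  # prefix sums of squared values
    for i, v in enumerate(x):
        s[i + 1] = s[i] + v
        q[i + 1] = q[i] + v * v

    def sse(i, j):
        # Squared error of bucket x[i:j] when represented by its mean.
        total = s[j] - s[i]
        return q[j] - q[i] - total * total / (j - i)

    inf = float("inf")
    # best[b][j]: minimum error covering x[:j] with exactly b buckets.
    best = [[inf] * (n + 1) for _ in range(B + 1)]
    cut = [[0] * (n + 1) for _ in range(B + 1)]
    best[0][0] = 0.0
    for b in range(1, B + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                cost = best[b - 1][i] + sse(i, j)
                if cost < best[b][j]:
                    best[b][j] = cost
                    cut[b][j] = i

    # Walk the stored cut points backwards to recover the buckets.
    buckets = []
    j = n
    for b in range(B, 0, -1):
        i = cut[b][j]
        buckets.append((i, j, (s[j] - s[i]) / (j - i)))
        j = i
    buckets.reverse()
    return best[B][n], buckets
```

On X = [5, 3, 12, 4] with B = 2 this yields the buckets [5, 3] and [12, 4] with values 4 and 8, that is, the approximation {4, 4, 8, 8}.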

Haar Wavelets
Wavelet transform: an orthogonal transform for the hierarchical representation of functions and signals.
Haar tree: a structure for visualizing the decomposition and the reconstruction of values.
Synopsis: select B non-zero terms.
Advantage: approximates discontinuities.
Complexity: depends on the synopsis variant.
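A minimal sketch of the unnormalized Haar decomposition and its inverse may help here. It assumes the common convention that each detail coefficient is half the difference between the left and right averages, so it adds to the left half of its interval and subtracts from the right half; the function names and the breadth-first coefficient order are illustrative choices, not taken from the slides.

```python
def haar_decompose(x):
    """Return Haar coefficients in breadth-first order.

    The first entry is the overall average; the rest are detail
    coefficients from the coarsest to the finest level.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = [float(v) for v in x]
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details  # coarser levels go in front
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose; zeroed coefficients give an approximation."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(values)]
        # Each detail d splits an average into (avg + d, avg - d).
        values = [v for avg, d in zip(values, level) for v in (avg + d, avg - d)]
        pos += len(level)
    return values
```

For X = [5, 3, 12, 4] the coefficients are [6, -2, 1, 4]. Keeping only the average 6 and the last detail 4 (B = 2) reconstructs [6, 6, 10, 2]. Note that this breadth-first numbering of classical coefficients differs from the c_0 to c_9 numbering of the Haar+ tree used later in the talk.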

Previous Classical Haar Approaches
Restricted Haar synopses [Garofalakis and Kumar 2004]:
Compute the Haar wavelet decomposition of X.
Preserve the optimal B-coefficient subset.
Suboptimal quality (usually worse than a histogram).
Time for general error metrics: [expression missing from transcript].
Unrestricted Haar synopses [Guha and Harb 2005, 2006]:
Find the optimal coefficient values to assign.
Quantization by a resolution step δ is used.
Better quality (may be better than a histogram).
Time for minimizing L_p error: [expression missing from transcript], where E is an upper bound for the normalized L_p error.

Deficiencies
A Haar wavelet coefficient contributes its value positively to one interval and negatively to another.
Lack of flexibility.
Quality constraint.
Hard to delimit the value search space for non-maximum-error metrics.
Complexity cubic in n for the L_1 metric.
Conclusion: need for a different synopsis structure.

The Haar+ Tree
[Figure: Haar+ tree over data values d_0 to d_3, with root coefficient c_0, coefficients c_1 to c_9, and triads C_1, C_2, C_3]
Triads in place of single coefficients.
Each triad contains one head and two supplementary coefficients.

Basic Properties
A Haar+ synopsis needs to contain at most one coefficient per triad.
No additional storage is required.
Classical Haar is a special case of Haar+.
[Figure: triad coefficients c_i, c_i+1, c_i+2 with their + and − contributions]

An example
X = [5, 3, 12, 4], B = 2.
Best restricted synopsis: {c_0 = 6, c_7 = 4}.
Best L_1, L_inf unrestricted synopsis: {c_0 = 5.5, c_7 = 4}.
Best histogram: {5, 5, 5, 4} for L_1; {4, 4, 8, 8} for L_2 and L_inf.
Best Haar+ synopsis: {c_0 = 4, c_8 = 8}.
The differences generalize to a gap of any size by multiplication.
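The example can be checked with a small reconstruction sketch. It rests on several assumptions that the transcript does not spell out: triad C_k holds coefficients c_(3k−2) (head), c_(3k−1) (left supplementary) and c_(3k) (right supplementary); triads are laid out in heap order (C_1 over the whole data, C_2 over the left half, C_3 over the right half); a head adds its value to the left half of its interval and subtracts it from the right half; a supplementary coefficient adds its value to its own half only. The function names are hypothetical.

```python
def haar_plus_reconstruct(synopsis, n):
    """Reconstruct n values from a Haar+ synopsis {coefficient index: value}.

    Assumed layout: c_0 is the root; triad k (heap order, k >= 1) owns
    indices 3k-2 (head), 3k-1 (left supplementary), 3k (right supplementary).
    """
    values = [synopsis.get(0, 0)] * n
    for index, z in synopsis.items():
        if index == 0:
            continue
        triad, role = divmod(index - 1, 3)
        triad += 1
        level = triad.bit_length() - 1       # depth of the triad in the tree
        width = n >> level                   # number of values it covers
        lo = (triad - (1 << level)) * width  # start of its interval
        mid, hi = lo + width // 2, lo + width
        if role in (0, 1):                   # head or left supplementary
            for i in range(lo, mid):
                values[i] += z
        if role == 0:                        # head: negative on the right half
            for i in range(mid, hi):
                values[i] -= z
        if role == 2:                        # right supplementary
            for i in range(mid, hi):
                values[i] += z
    return values


def abs_errors(x, approx):
    """Per-item absolute errors of an approximation."""
    return [abs(a - b) for a, b in zip(x, approx)]
```

Under these assumptions, {c_0 = 6, c_7 = 4} reconstructs [6, 6, 10, 2] (maximum error 3), {c_0 = 5.5, c_7 = 4} reconstructs [5.5, 5.5, 9.5, 1.5] (maximum error 2.5), and the Haar+ synopsis {c_0 = 4, c_8 = 8} reconstructs [4, 4, 12, 4] (maximum error 1, total absolute error 2).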

Delimiting the value domains
[Figure: coefficients c_i and c_i+1 with their + and − contributions, shown for classical Haar and for Haar+]

Delimiting the value domains
Let m_i, M_i be the minimum and maximum value in the scope of triad C_i.
A non-zero head coefficient does not need to be assigned for incoming values v outside (m_i, M_i).
Otherwise, [formula missing]. Similarly, [formula missing]. Besides, [formula missing].
In conclusion, the cardinality of [set missing] is [formula missing].
Such tight delimitation is not possible with classical Haar.
Generalization of the structure simplifies computation.
Space that can be allocated at C_i: [formula missing].

Deriving the answer
Bottom-up recursive process.
At each triad C_i, it calculates an array A from the pre-calculated arrays L, R of its children triads.
A[v, b] contains E(i, v, b) for C_i, together with the optimal z and b'.
At most log n + 1 arrays are stored concurrently.
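The recursion over incoming value v and budget b can be sketched as a brute-force, memoized search. This is an illustration only, not the algorithm from the talk: it does not use the bottom-up arrays or the value-domain delimitation, it restricts coefficient targets to a caller-supplied `grid` of candidate values, and it assumes the Haar+ semantics that a head adds to the left half and subtracts from the right half while a supplementary coefficient adds to its own half. All names are hypothetical; `error(lo, hi, v, b)` plays the role of E(i, v, b).

```python
from functools import lru_cache
from operator import add


def haar_plus_optimal_error(x, budget, grid, combine=max):
    """Minimum error of a Haar+ synopsis with at most `budget` coefficients.

    combine=max gives the maximum absolute error; combine=operator.add
    gives the total absolute error. Candidate values come from `grid`.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    grid = tuple(grid)

    @lru_cache(maxsize=None)
    def error(lo, hi, v, b):
        # Best error on x[lo:hi] given incoming value v and budget b.
        if hi - lo == 1:
            return abs(x[lo] - v)
        mid = (lo + hi) // 2

        def split(v_left, v_right, remaining):
            # Try every division of the remaining budget between the children.
            return min(
                combine(error(lo, mid, v_left, bl),
                        error(mid, hi, v_right, remaining - bl))
                for bl in range(remaining + 1)
            )

        best = split(v, v, b)  # no coefficient kept in this triad
        if b >= 1:
            for t in grid:
                z = t - v
                if z == 0:
                    continue
                best = min(
                    best,
                    split(v + z, v - z, b - 1),  # head coefficient z
                    split(t, v, b - 1),          # left supplementary z
                    split(v, t, b - 1),          # right supplementary z
                )
        return best

    candidates = [error(0, n, 0, budget)]  # root coefficient c_0 left at zero
    if budget >= 1:
        candidates += [error(0, n, v, budget - 1) for v in grid if v != 0]
    return min(candidates)
```

With X = [5, 3, 12, 4], B = 2 and integer candidates 0 to 12, the search returns a maximum error of 1 and a total absolute error of 2 (with `combine=add`), which agrees with the synopsis {c_0 = 4, c_8 = 8} from the example slide.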

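Complexity Analysis
Basic derivation of optimal error: Time: [formula missing]. Space: [formula missing].
Space-efficient synopsis construction: Time: [formula missing]. Space: [formula missing].
Time-efficient synopsis construction: Time: [formula missing]. Space: [formula missing].
The expressions in the min are equal when [condition missing].
The same bounds hold for all monotonic distributive error metrics.
The B time factor is reduced to log 2 B for maximum-error metrics.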

Time Complexity Comparison
L_1 error metric.
Optimal histogram: [formula missing].
Restricted Haar: [formula missing].
Unrestricted Haar: [formula missing].
Haar+: [formula missing].
Haar+ computes a synopsis in time linear in n.

Experiments: Description of Data
Datasets with discontinuities.
FR: mean monthly flows of the Fraser River at Hope, B.C. Periodic autoregression features.
FC: frequencies of distinct values of an attribute in the forest cover type relation (US Forest Service). Used in previous work.

Experiments: Time, B=n/64

Experiments: Time, B=32

Quality, Maximum Error, FC dataset

Quality, Maximum Error, FR dataset

Quality, Average Error, FC dataset

Quality, Average Error, FR dataset

Conclusions
Haar+ achieves higher quality than Haar wavelets (as expected, since it cannot be worse).
It can also achieve higher quality than the optimal histogram.
It outperforms histogram quality in cases where classical Haar does not.
It constructs synopses in time linear in n for any monotonic distributive error metric.
It is the first structure to achieve quality higher than the optimal histogram in linear time.
Future work: extension to multidimensional data.

Related Work
R. Bellman. On the approximation of curves by line segments using dynamic programming. Communications of the ACM, 1961.
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998.
S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004.
M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum-error metrics. PODS 2004.
S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005.
S. Guha and B. Harb. Wavelet synopses for data streams: Minimizing non-Euclidean error. KDD 2005.
S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006.

Thank you! Questions?