Constructing Optimal Wavelet Synopses Dimitris Sacharidis Timos Sellis

outline
introduction
background
  – wavelet basics
  – example
wavelet synopses
  – example
  – error metrics
  – optimal synopses
  – interesting issues
data streams
  – models
  – streaming wavelet synopses
epilogue

introduction
analyzing massive multi-dimensional datasets
  – complex aggregate queries over large parts of the data
  – exploratory nature
  – promptness over accuracy, but with guarantees
  – resort to approximate query processing over precomputed synopses (e.g., histograms, samples, wavelets)
numerous data management applications need to continuously generate, process and analyze data on-line
  – the data streaming paradigm
  – summarize in real time, using small space and in one pass
  – provide approximate query answers with quality guarantees
provide useful data summarization
  – need to measure inaccuracy, which is application dependent

wavelet basics
wavelet decomposition is a mathematical tool for the hierarchical decomposition of functions
  – applications in signal/image processing
used extensively as a data reduction tool in db scenarios:
  – selectivity estimation for large aggregate queries
  – fast approximate query answers
  – general purpose streaming synopsis
features
  – efficient: runs in linear time and space (vs. ~$N^2$ for histograms)
  – high compression ratio, small-B property
  – generalizes to multiple dimensions

example
assume a data vector d of 8 values
wavelet tree (a.k.a. error tree)
  – every node contributes positively to the leaves in its left subtree and negatively to the leaves in its right subtree
iteratively perform pair-wise averaging and semi-differencing
  – the intermediate averages need not be stored: the overall average and the semi-differences suffice to reconstruct d
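The averaging and semi-differencing procedure can be sketched as below. This is a minimal illustration, not the authors' code: the function name `haar_decompose` and the 8-value input in the usage note are assumptions, since the slide's own data vector did not survive in the transcript.

```python
def haar_decompose(data):
    """Return the unnormalized Haar coefficients in error-tree order:
    [overall average, coarsest detail, ..., finest details]."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = [float(x) for x in data]
    details = []  # detail coefficients collected so far, coarsest first
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        averages = [(a + b) / 2 for a, b in pairs]
        semi_diffs = [(a - b) / 2 for a, b in pairs]
        # Coarser levels are computed later, so they are prepended.
        details = semi_diffs + details
        current = averages  # only the averages feed the next level
    return current + details
```

For the illustrative vector `[2, 2, 0, 2, 3, 5, 4, 4]`, this returns `[2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]`: the overall average followed by seven detail coefficients.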

wavelet synopses
any set of B coefficients constitutes a B-term wavelet synopsis
  – stored as (index, value) pairs
  – implicitly all non-stored coefficients are set to zero
introduces reconstruction error
  – per point estimate: $e_i = |d_i - \hat{d}_i|$, where $\hat{d}_i$ is the value reconstructed from the synopsis
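A hedged sketch of how a synopsis might be expanded back into point estimates follows. It assumes unnormalized Haar coefficients indexed in error-tree order (index 0 is the overall average, node j has children 2j and 2j+1); the dictionary representation and the names `reconstruct` and `point_errors` are illustrative choices, not taken from the slides.

```python
def reconstruct(synopsis, n):
    """Rebuild n point estimates from a {coefficient index: value} synopsis.
    Coefficients absent from the synopsis are implicitly zero."""
    approx = [synopsis.get(0, 0.0)] * n  # the overall average reaches every leaf
    for j, value in synopsis.items():
        if j == 0:
            continue
        level = j.bit_length() - 1            # depth of node j in the error tree
        support = n >> level                  # number of leaves under node j
        start = (j - (1 << level)) * support  # first leaf under node j
        half = support // 2
        for i in range(start, start + half):            # left subtree: add
            approx[i] += value
        for i in range(start + half, start + support):  # right subtree: subtract
            approx[i] -= value
    return approx


def point_errors(data, approx):
    """Absolute reconstruction error at every point."""
    return [abs(d - a) for d, a in zip(data, approx)]
```

With the illustrative vector `[2, 2, 0, 2, 3, 5, 4, 4]` and the 2-term synopsis `{0: 2.75, 1: -1.25}`, the estimates are `1.5` for the first four points and `4.0` for the last four.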

measuring accuracy
use some norm to aggregate individual errors
$L_2$ norm: $\sum_i e_i^2$ is the sum squared error (sse)
  – sse = 224
$L_\infty$ norm: $\max_i e_i$ is the maximum absolute error
  – max-abs-error = 10
generalized to any weighted $L_p$ norm: $\sum_i w_i e_i^p$
  – e.g. max-rel-error = $\max_i (1/d_i)\, e_i$ = 10/4 = 250%
(figure not reproduced: vector of data values d and vector of point errors e)
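The three aggregates named on this slide can be computed as in the sketch below. The function names are illustrative, and the `sanity_bound` parameter is an assumption added here to avoid dividing by zero or near-zero data values; the slide itself simply weights by 1/d_i.

```python
def point_errors(data, approx):
    """Absolute error at every point."""
    return [abs(d - a) for d, a in zip(data, approx)]


def sse(errors):
    """Sum squared error."""
    return sum(e * e for e in errors)


def max_abs_error(errors):
    """Maximum absolute error."""
    return max(errors)


def max_rel_error(data, errors, sanity_bound=1.0):
    """Maximum relative error: each point error is weighted by 1/|d_i|.
    sanity_bound (an assumption, not on the slide) caps the weight
    when |d_i| is very small."""
    return max(e / max(abs(d), sanity_bound) for d, e in zip(data, errors))
```

For example, approximating `[2, 10, 12, 8]` by the constant `8` gives point errors `[6, 2, 4, 0]`, hence an sse of 56, a max-abs-error of 6, and a max-rel-error of 6/2 = 300%.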

optimal synopses
a B-term wavelet synopsis can be optimized for any error metric
sse-optimal synopses are straightforward
  – the wavelet transformation is orthonormal (after normalization), so by Parseval's theorem the $L_2$ norm is preserved
  – keep the B coefficients with the largest absolute normalized values
synopses optimal for other (weighted or unweighted) $L_p$ norms require superlinear (quadratic) time in N
  – dynamic programming over the wavelet tree
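A minimal sketch of sse-optimal thresholding, under the assumption that coefficients come from the averaging and semi-differencing scheme: a coefficient whose subtree covers s leaves has normalized value c·√s, so dropping it adds c²·s to the sse. Ranking by that energy is equivalent to ranking by absolute normalized value. The helper names are hypothetical.

```python
def haar_decompose(data):
    """Unnormalized Haar coefficients in error-tree order (power-of-two length)."""
    current = [float(x) for x in data]
    details = []
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        current = [(a + b) / 2 for a, b in pairs]
    return current + details


def sse_optimal_synopsis(data, budget):
    """Keep the `budget` coefficients of largest energy c^2 * support.
    Returns ({index: value}, sse implied by the dropped coefficients)."""
    n = len(data)
    coeffs = haar_decompose(data)

    def energy(j):
        # Index 0 (overall average) and index 1 both cover all n leaves.
        support = n if j == 0 else n >> (j.bit_length() - 1)
        return coeffs[j] * coeffs[j] * support

    ranked = sorted(range(n), key=energy, reverse=True)
    kept, dropped = ranked[:budget], ranked[budget:]
    synopsis = {j: coeffs[j] for j in kept}
    return synopsis, sum(energy(j) for j in dropped)
```

For `d = [2, 10, 12, 8]` and a budget of 2, the sketch keeps the overall average 8 and the detail coefficient -4 at index 2. The reconstruction is `[4, 12, 8, 8]`, whose sse of 24 equals the energy of the two dropped coefficients, as Parseval's theorem predicts.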

interesting issues
I/O efficiency issues when dealing with massive multidimensional datasets [M. Jahangiri, D. Sacharidis, C. Shahabi '05]
  – during transformation try to minimize I/Os
  – efficient maintenance as new data are appended (requires more than just some updating)
how about optimizing for workloads of range-sum queries?
  – no known results (without using the prefix-sum array)
  – ranges overlap arbitrarily, so no easy dynamic programming formulation exists

working over data streams
main challenges when data are streaming:
  – stream items are seen only once
  – require small working space
  – process stream items quickly
  – provide an answer quickly with quality guarantees
two models, depending on how a data vector a is rendered
time series model
  – stream elements are vector values of type (i, a[i]) and appear ordered in i (e.g., time)
turnstile model
  – stream elements are updates of type (i, ±u), which implies a[i] ← a[i] ± u; further, they do not appear ordered in i

streaming wavelet synopses
time series model
  – at most log N coefficients are affected by each new item
  – a large number of coefficients have finalized values
  – can perform bottom-up dynamic programming (but the space required is prohibitive)
  – greedy techniques should be deployed instead
turnstile model
  – even optimizing for the sse is hard [G. Cormode, M. Garofalakis, D. Sacharidis '06]
  – other error metrics have not been studied
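To illustrate why a single stream item touches so few coefficients, the sketch below applies a turnstile update (i, u) directly to a full array of unnormalized Haar coefficients. It is only an illustration under stated assumptions: it keeps all N coefficients in memory, which a real streaming algorithm would not, and the name `apply_update` is hypothetical. It counts the overall average in addition to the detail coefficients on the root-to-leaf path.

```python
def apply_update(coeffs, i, u):
    """Apply a[i] <- a[i] + u to Haar coefficients stored in error-tree order.
    Only the overall average and the nodes on the path to leaf i change.
    Returns the indices of the coefficients that were touched."""
    n = len(coeffs)
    coeffs[0] += u / n  # the overall average shifts by u / N
    touched = [0]
    j, start, support = 1, 0, n
    while support > 1:
        half = support // 2
        if i < start + half:
            coeffs[j] += u / support  # leaf i lies in the left subtree
            next_j = 2 * j
        else:
            coeffs[j] -= u / support  # leaf i lies in the right subtree
            start += half
            next_j = 2 * j + 1
        touched.append(j)
        j, support = next_j, half
    return touched
```

With N = 8, an update to position 5 touches coefficients 0, 1, 3 and 6 only. Replaying updates in any order yields the same coefficients as decomposing the final vector.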

epilogue
wavelet synopses are a highly successful data summarization technique
yet, several problems remain open:
  – optimize for range query workloads
  – greedy (time-series) streaming algorithms
  – other metrics for general (turnstile) streaming data

thank you!

unrestricted wavelet synopses
the retained coefficients can assume any value, not restricted to their decomposed value (an even harder optimization problem!)
quick example: optimize for max-abs-error, d = {2, 10, 12, 8} and B = 1
  – restricted synopsis: keep the overall average 8 → m.a.e. = 6
  – unrestricted synopsis: keep the overall average but change its value to 7 → m.a.e. = 5
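The arithmetic of this example can be checked with a short sketch. It assumes, as on the slide, that the single retained coefficient is the overall average, so every point is estimated by one constant. The function name is illustrative.

```python
def max_abs_error(data, estimate):
    """Maximum absolute error when every point is estimated by one constant."""
    return max(abs(x - estimate) for x in data)


d = [2, 10, 12, 8]

# Restricted: the coefficient keeps its decomposed value, the mean of d.
restricted_value = sum(d) / len(d)

# Unrestricted: any value is allowed. For a single constant, the midrange
# (min + max) / 2 minimizes the maximum absolute deviation.
unrestricted_value = (min(d) + max(d)) / 2

restricted_mae = max_abs_error(d, restricted_value)
unrestricted_mae = max_abs_error(d, unrestricted_value)
```

Here `restricted_value` is 8 with a maximum absolute error of 6, while `unrestricted_value` is 7 with a maximum absolute error of 5, matching the figures on the slide.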