Wavelets for Efficient Querying of Large Multidimensional Datasets
Cyrus Shahabi

Presentation transcript:

1 Wavelets for Efficient Querying of Large Multidimensional Datasets
Cyrus Shahabi
University of Southern California
Integrated Media Systems Center (IMSC) and Dept. of Computer Science
Los Angeles, CA

2 Outline
- Motivating Applications
- Approach: Wavelets
- ProPolyne: Overview and Features
- ProPolyne: Details
- Current Status
- Conclusion
- Future Work

3 Motivation: New Multidimensional Data Intensive Applications
Multidimensional data sets (with dimensions and measures):
- Remote sensory data (from JPL)
- Sensor readings from GPS ground stations (from NASA)
- Petroleum sales (from Digital-Government research center)
- ACOUSTIC data (from UCLA sensor-network project)
- Market data (from NCR)
Large size, e.g., the current (toy!) NASA/JPL data set:
- Past 10 years, sampled twice a day on a lat-long-alt grid of 64 * 128 * 16, recording 8 bytes of temperature and 16 bytes of dimensions
- This is 6 MB of data per day, for a total of about 21 GB over 10 years
- Increase: sampling twice an hour, 1024 * 4096 * 128 grid, …
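The size estimate on this slide can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; ignoring leap days and using binary units (MiB, GiB) are assumptions made here, not statements from the slides.

```python
# Back-of-the-envelope size of the NASA/JPL grid described on the slide.
GRID_POINTS = 64 * 128 * 16      # lat x long x alt cells
BYTES_PER_POINT = 8 + 16         # temperature (measure) + dimension bytes
SAMPLES_PER_DAY = 2
DAYS = 10 * 365                  # assumption: leap days ignored

bytes_per_day = GRID_POINTS * BYTES_PER_POINT * SAMPLES_PER_DAY
mib_per_day = bytes_per_day / 2**20
gib_total = bytes_per_day * DAYS / 2**30

print(f"{mib_per_day:.1f} MiB/day, {gib_total:.1f} GiB over 10 years")
```

Under these assumptions the daily volume is 6 MiB and the ten-year total lands between 21 and 22 GiB, in line with the figures quoted on the slide.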

4 Motivation: Multidimensional Applications
I/O-intensive and computationally complex queries:
- Range-aggregate queries (with an aggregate function):
  - Average temperature, given an area and time interval
  - Average velocity of upward movement of the station
  - Total petroleum sales volume of a given product in a given location and year
  - Number of jackets sold in Seattle in Sep
- Tougher queries:
  - Covariance of temperature and altitude (correlation)
  - Variance of petroleum sales in 2002 in CA
Quick (interactive) response time:
- The results can be approximate and/or progressively become exact

5 Recap!
- Multidimensional data
- Large data
- Aggregate queries
- Approximate answers
- Progressive answers
- Multi-resolution compression
Wavelets!

6 Approach: Enabling Data Manipulation, Query & Analysis in the WAVELET Domain
Everybody else's idea: let's compress the data
- Reason: to save space? Not really!
- Implicit reason: queries deal with smaller data sets and hence run faster (not always true!)
- More problems: not only can query results never be 100% accurate anymore, but different queries can also have very different error rates depending on their areas of interest
- Why? At data population time, we don't know which coefficients are more or less important to our queries! (Also observed by [Garofalakis & Gibbons, SIGMOD'02], but they proposed other ways to drop coefficients assuming a uniform workload.)
This differs from the signal-processing objective of reconstructing the entire signal as well as possible

7 Approach: Enabling Data Manipulation, Query & Analysis in the WAVELET Domain
Our idea/distinction: storage is cheap and queries are ad hoc, so let's keep all the wavelet coefficients! (no data compression)
Opportunity: at query time, however, we know what is important to the pending query
ProPolyne: Progressive Evaluation of Polynomial Range-Aggregate Query

8 Overview of ProPolyne
- Define a range-sum query as the dot product of a query vector and a data vector (also observed by [Gilbert et al., VLDB'2001])
- Offline: multidimensional wavelet transform of the data
- At query time: "lazy" wavelet transform of the query vector (very fast)
- Dot product of the query and data vectors in the transformed domain → exact result
- Choose only the high-energy query coefficients → fast approximate result (90% accuracy by retrieving < 10% of the data)
- Choose query coefficients in order of energy → progressive result
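The steps on this slide can be illustrated with a minimal one-dimensional sketch. It is not the ProPolyne implementation: it uses a plain full Haar transform instead of the "lazy" transform, and the helper `haar`, the data vector, and the query range are all hypothetical choices made for illustration.

```python
import math

def haar(v):
    """Orthonormal Haar DWT of a list whose length is a power of two."""
    out = [float(x) for x in v]
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:n] = avg + dif          # coarse part first, details after it
        n = half
    return out

N = 16
data = [(7 * i) % 5 for i in range(N)]                # hypothetical frequency vector
query = [1 if 3 <= i < 11 else 0 for i in range(N)]   # COUNT over cells 3..10

exact = sum(q * d for q, d in zip(query, data))       # plain dot product

data_hat = haar(data)     # done once, offline
query_hat = haar(query)   # done per query (full transform here, not the lazy one)

# Visit the nonzero query coefficients from largest to smallest magnitude.
order = sorted((i for i, c in enumerate(query_hat) if abs(c) > 1e-12),
               key=lambda i: -abs(query_hat[i]))

estimate, progressive = 0.0, []
for i in order:
    estimate += query_hat[i] * data_hat[i]   # one more data coefficient retrieved
    progressive.append(estimate)

print(f"exact={exact}, data coefficients retrieved={len(order)} of {N}")
print([round(p, 3) for p in progressive])
```

Each entry of `progressive` is the running estimate after one more data coefficient has been fetched; once all nonzero query coefficients are used, the estimate agrees with the exact dot product up to floating-point error.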

9 ProPolyne Unique Features
- "Measure" can be any polynomial on any combination of attributes
  - Can support COUNT, SUM, AVERAGE
  - Also supports covariance, kurtosis, etc.
  - All using one set of pre-computed aggregates
- Independent of how well the data set can be compressed/approximated by wavelets
  - Because: we show that "range-sum queries" can always be approximated well by wavelets (not always Haar, though!)
- Potentially low update cost
- Can be used for exact, approximate and progressive range-sum query evaluation
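To see why polynomial measures are enough for statistics such as AVERAGE, VARIANCE, and COVARIANCE, the sketch below derives them from a handful of polynomial range-sums. The naive scan in `range_sum` stands in for the pre-computed wavelet aggregates; the function names and the range are hypothetical, and the rows are the (Age, Salary) example used later in these slides.

```python
# (age, salary in $k) rows from the example instance in these slides
rows = [(25, 50), (28, 55), (30, 58), (50, 100), (55, 130), (57, 120)]

def in_range(age, salary):
    # Hypothetical range R, chosen so that several rows qualify.
    return 24 < age < 56

def range_sum(f):
    """Polynomial range-sum: sum of f over the rows that fall inside R."""
    return sum(f(a, s) for a, s in rows if in_range(a, s))

count  = range_sum(lambda a, s: 1)        # degree 0
sum_a  = range_sum(lambda a, s: a)        # degree 1
sum_s  = range_sum(lambda a, s: s)        # degree 1
sum_ss = range_sum(lambda a, s: s * s)    # degree 2
sum_as = range_sum(lambda a, s: a * s)    # degree 2

avg_salary = sum_s / count
var_salary = sum_ss / count - avg_salary ** 2                    # population variance
cov_age_salary = sum_as / count - (sum_a / count) * avg_salary   # population covariance

print(count, avg_salary, var_salary, cov_age_salary)
```

Every statistic here is an arithmetic combination of range-sums of degree at most 2, so no measure has to be fixed before the query arrives.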

10 Outline
- Motivating Applications
- ProPolyne: Overview and Features
- ProPolyne: Details
  - Polynomial Range-Sum Queries as Vector Queries
  - Evaluation of Vector Queries
  - Progressive/Approximate Evaluation of Vector Queries
- Current Status
- Conclusion
- Future Work

11 Polynomial Range-Sum Queries
Polynomial range-sum queries: Q(R, f, I)
- I is a finite instance of schema F
- R ⊆ Dom(F) is the range
- f : Dom(F) → ℝ is a polynomial of degree δ
Example: F = (Age, Salary), R: (25 < age < 40) & (55k < salary < 150k)
Instance I:
Age   Salary
25    $50k
28    $55k
30    $58k
50    $100k
55    $130k
57    $120k

12 Polynomial Range-Sum Queries as "Vector Queries"
The data frequency distribution of I is the function Δ_I : Dom(F) → ℤ that maps a point x to the number of times it occurs in I.
To emphasize that a query is an operator on the data frequency distribution, we write Q(R, f, Δ_I) = Σ_{x ∈ R} f(x) Δ_I(x).
Example: Δ(25,50) = Δ(28,55) = … = Δ(57,120) = 1 and Δ(x) = 0 otherwise, for the instance I:
Age   Salary
25    $50k
28    $55k
30    $58k
50    $100k
55    $130k
57    $120k
where χ_R(x) = 1 if x ∈ R and χ_R(x) = 0 otherwise.
Hence: Q(R, f, Δ_I) = Σ_{x ∈ Dom(F)} f(x) χ_R(x) Δ_I(x)
Or: Q(R, f, Δ_I) = ⟨f χ_R, Δ_I⟩ (vector query: f χ_R is the query vector, Δ_I is the data vector)
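A small sketch of the vector-query view, using the (Age, Salary) instance from the slides and the query SUM(salary) over the example range. The discretized domain (ages 0..63, salaries 0..255 in $k) and the names `delta`, `chi_R`, and `f` are assumptions introduced here for illustration.

```python
from collections import Counter

# (age, salary in $k) rows of the example instance I
rows = [(25, 50), (28, 55), (30, 58), (50, 100), (55, 130), (57, 120)]

# Data frequency distribution: point -> number of occurrences in I.
delta = Counter(rows)

def chi_R(age, salary):
    """Indicator of the range R: (25 < age < 40) & (55k < salary < 150k)."""
    return 1 if (25 < age < 40 and 55 < salary < 150) else 0

def f(age, salary):
    """Measure polynomial for SUM(salary)."""
    return salary

# Hypothetical finite domain, enumerated in a fixed order.
domain = [(a, s) for a in range(64) for s in range(256)]

query_vector = [f(a, s) * chi_R(a, s) for a, s in domain]
data_vector = [delta[(a, s)] for a, s in domain]   # Counter yields 0 for absent points

answer = sum(q * d for q, d in zip(query_vector, data_vector))
print(answer)
```

Because both inequalities are strict, only the row (30, $58k) falls inside R, so the dot product equals that row's salary.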

13 Wavelet Transformation of Data and Query Vectors
â: the DWT of a, aka the wavelet coefficients of a
Hence, for query vector q and data vector Δ, vector queries can be computed in the wavelet-transformed space as: ⟨q, Δ⟩ = ⟨q̂, Δ̂⟩ (the transform is orthonormal, so inner products are preserved)
Transforming the query vector at submission: O(N^d)? No! Not with our "lazy" wavelet transform for range-aggregate queries
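The following sketch checks two things numerically in one dimension: that an orthonormal transform preserves the dot product, and that the transform of a range indicator has few nonzero coefficients, which is what makes a fast query-side transform plausible. It uses a full Haar transform written from scratch, not the lazy transform from the slides; the data vectors and ranges are hypothetical.

```python
import math

def haar(v):
    """Orthonormal Haar DWT of a list whose length is a power of two."""
    out = [float(x) for x in v]
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:n] = avg + dif
        n = half
    return out

def nonzeros(v, tol=1e-9):
    return sum(1 for c in v if abs(c) > tol)

results = []
for N in (16, 256, 4096):
    lo, hi = N // 5, (4 * N) // 5                  # hypothetical range [lo, hi)
    q = [1 if lo <= i < hi else 0 for i in range(N)]
    d = [(i * i) % 7 for i in range(N)]            # hypothetical data vector
    q_hat, d_hat = haar(q), haar(d)
    direct = sum(a * b for a, b in zip(q, d))
    transformed = sum(a * b for a, b in zip(q_hat, d_hat))
    results.append((N, nonzeros(q_hat), direct, transformed))
    print(N, nonzeros(q_hat), direct, round(transformed, 6))
```

For a one-dimensional COUNT range with Haar, at most one detail coefficient per level can be nonzero for each of the two range boundaries, so the nonzero count stays within 2·log2(N) + 1 while N grows by orders of magnitude.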

14 Fast Evaluation of Vector Queries Using Wavelets …
Technical requirements:
- Wavelets should have small support (i.e., the shorter the filter, the better)
- Wavelets must satisfy a "moment condition"
- Supports any polynomial range-sum up to a degree determined by the choice of wavelets
  - E.g., Haar can only support degree 0 (e.g., COUNT), while db4 supports up to degree 1 (e.g., SUM) and db6 up to degree 2 (e.g., VARIANCE)
Standard DWT: O(N)
Our lazy wavelet transform: O(l log N), where l is the length of the filter
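The "moment condition" can be checked directly on the filter coefficients. The sketch below counts the vanishing moments of the Haar filter and of the length-4 Daubechies filter (the "db4" of this slide; some libraries, PyWavelets for example, name the same filter db2). The helper names are hypothetical, and reading "supported degree = vanishing moments minus one" is an interpretation of the slide rather than a quoted definition.

```python
import math

SQRT2, SQRT3 = math.sqrt(2), math.sqrt(3)

# Low-pass (scaling) filters.
h_haar = [1 / SQRT2, 1 / SQRT2]
h_db4 = [c / (4 * SQRT2) for c in (1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3)]

def highpass(h):
    """Wavelet filter obtained from the scaling filter by the alternating flip."""
    L = len(h)
    return [(-1) ** k * h[L - 1 - k] for k in range(L)]

def vanishing_moments(h, tol=1e-12, max_m=6):
    """Number of leading discrete moments sum_k k^m g[k] that are zero."""
    g = highpass(h)
    m = 0
    while m < max_m and abs(sum(gk * k ** m for k, gk in enumerate(g))) < tol:
        m += 1
    return m

for name, h in (("Haar", h_haar), ("length-4 Daubechies", h_db4)):
    p = vanishing_moments(h)
    print(f"{name}: {p} vanishing moment(s), polynomial degree up to {p - 1}")
```

A wavelet with p vanishing moments annihilates polynomials of degree below p wherever its support lies entirely inside the range, which is why a longer filter is needed as the degree of the measure grows.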

15 Exact Evaluation of Vector Queries
Query: SUM(salary) when (25 < age < 40) & (55k < salary < 150k)
# of wavelet coefficients: 837
# of nonzero coordinates: 4380

16 Progressive Evaluation of Vector Queries

17 Current Status: ProPolyne Demonstration

18 Conclusion
- A novel pre-aggregation strategy
  - Supports conventional aggregates (COUNT, SUM) and beyond: multivariate statistics
- First pre-aggregation technique that does not require measures to be specified a priori
  - Measures are treated as functions of the attributes at query time
- Provides a data-independent progressive and approximate query answering technique
  - With provably poly-logarithmic worst-case query and update costs
  - And storage cost comparable to or better than other pre-aggregation methods

19 Future Work
Almost complete:
- Batch queries
- Efficient layout on disk
In progress:
- I/O-efficient wavelet transformation and update
- Hybrid ordering of data and query coefficients
- Error forecasting
Longer term:
- Best basis per dimension
- ProPolyne on GRID using p2p
- ProPolyne on relational algebra operators

20 THANKS! (visit http://infolab.usc.edu)
Acknowledgements
- Students: R. R. Schmidt and M. Jahangiri
- NSF: ERC, ITR & CAREER programs
- NASA/JPL: ESIP program
- Industry: Microsoft, NCR, Okawa Foundation