
Approximate Query Processing (AQP) in Data Streams
Zahid Irfan & Dr. Asim Karim (Advisor)
CS-509, Master of Science (CS) Project
Lahore University of Management Sciences, Lahore, Pakistan
8 May 2004

Acknowledgement
This work is primarily based on the research paper “One-pass wavelet decompositions of data streams” by Gilbert, Muthukrishnan, Strauss and Kotidis, IEEE Trans. Knowledge and Data Engineering, May/June. It also builds on work by Muthukrishnan and Piotr Indyk and, of course, on the Johnson-Lindenstrauss lemma.

AQP in Data Streams
- Introduction
- Streams and Streaming Models
- Wavelet Transform & Embedded Vectors
- Pseudo-Random Number Generator
- Implementation Details
- Test Results
- Conclusions and Future Work

Introduction
Let's solve a puzzle: guess the missing number in a randomly ordered sequence of the numbers [1…N], without repetition, from which one number has been left out.
- Space requirement: O(1).
- Time complexity: O(N).
What about two missing numbers, three missing numbers, and so on?
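One way to solve the puzzle in a single pass is to keep a running sum and subtract it from the known total N(N+1)/2. The sketch below assumes exactly one number is missing; the function name `missing_number` is made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Finds the single value missing from a stream that contains every number
// in [1, n] except one, each exactly once, in any order.
// Only one running sum is kept, so memory does not grow with n.
// Assumes n * (n + 1) / 2 fits in 64 bits.
std::uint64_t missing_number(const std::vector<std::uint64_t>& stream,
                             std::uint64_t n) {
    std::uint64_t expected = n * (n + 1) / 2;  // sum of 1..n
    std::uint64_t seen = 0;
    for (std::uint64_t x : stream) {
        seen += x;  // one pass, item by item
    }
    return expected - seen;
}
```

For two missing numbers, the same idea would need a second running quantity, for example the sum of squares, which gives two equations in two unknowns.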

Data Streams
Data stream: “A sequence of digitally encoded signals used to represent information in transmission”.
The input stream is the sequence a[i], which arrives sequentially, item by item.

Data Streams Applications
- Network data monitoring, applied to traffic flow analysis.
- World Wide Web: website hits, statistics, etc.
- Online transaction processing systems.
- Query processing in large databases.

Stream Models: Time Series
A time series comprises values of the same quantity over successive time intervals.
Typical examples:
- Daily closing values of a stock exchange.
- Traffic on an IP link at successive time intervals.

Stream Models: Cash Register Model
Only positive updates arrive over a period of time.
Typical examples:
- Well, a cash register.
- Cricket scores.
- Website hits or other Internet statistics.

Stream Models: Turnstile Model
A fully dynamic model: updates are both negative and positive, e.g. passengers in an airport.
Relative hardness: Turnstile > Cash Register > Time Series, although this “depends and varies from application to application”.
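The three stream models differ only in which updates to the underlying signal are legal. The following is a minimal sketch of that difference; the `Update` struct and the function names are hypothetical and not taken from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// One stream item: add `delta` to position `index` of the underlying signal.
struct Update {
    std::size_t index;
    long delta;
};

// Time series model: the t-th item is simply the next value of the signal.
void apply_time_series(std::vector<long>& signal, long value) {
    signal.push_back(value);
}

// Cash register model: only positive increments are legal.
void apply_cash_register(std::vector<long>& signal, const Update& u) {
    if (u.delta <= 0) {
        throw std::invalid_argument("cash register updates must be positive");
    }
    signal.at(u.index) += u.delta;
}

// Turnstile model: increments and decrements are both legal.
void apply_turnstile(std::vector<long>& signal, const Update& u) {
    signal.at(u.index) += u.delta;
}
```

An algorithm that works in the turnstile model also handles the other two, which is one way to read the hardness ordering above.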

Wavelet Transform
Wavelets are a mathematical tool for the hierarchical decomposition of signals and functions.
Types of wavelets:
- Haar wavelets
- Daubechies wavelets
- Many more…

Haar Wavelet Example
D = [2, 2, 0, 2, 3, 5, 4, 4]
Resolution   Averages                    Detail Coefficients
8            [2, 2, 0, 2, 3, 5, 4, 4]
4            [2, 1, 4, 4]                [0, -1, -1, 0]
2            [1.5, 4]                    [0.5, 0]
1            [2.75]                      [-1.25]
Haar Wavelet Decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
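The table above can be reproduced with a short routine that repeatedly replaces each pair (x, y) by its average (x + y) / 2 and records the detail coefficient (x - y) / 2. This is a minimal sketch assuming the input length is a power of two; `haar_decompose` is a name chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unnormalised Haar decomposition, matching the worked example:
// output[0] is the overall average, followed by detail coefficients
// from the coarsest level to the finest.
// Assumes data.size() is a power of two.
std::vector<double> haar_decompose(std::vector<double> data) {
    std::vector<double> out(data.size());
    std::size_t len = data.size();
    while (len > 1) {
        std::size_t half = len / 2;
        std::vector<double> avg(half);
        for (std::size_t i = 0; i < half; ++i) {
            avg[i] = (data[2 * i] + data[2 * i + 1]) / 2.0;
            // Details of this level are stored at positions [half, len).
            out[half + i] = (data[2 * i] - data[2 * i + 1]) / 2.0;
        }
        data.assign(avg.begin(), avg.end());  // continue on the averages
        len = half;
    }
    if (!data.empty()) {
        out[0] = data[0];  // overall average
    }
    return out;
}
```

Calling `haar_decompose` on D = [2, 2, 0, 2, 3, 5, 4, 4] yields [2.75, -1.25, 0.5, 0, 0, -1, -1, 0].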

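Wavelet in Space
Haar wavelet coefficients can be written as inner products with basis vectors. Example: a vector A with N = 4 has 4 coefficients.
$W_1 = \frac{1}{N}[1, 1, 1, 1]$, $W_2 = \frac{1}{N}[1, 1, -1, -1]$, $W_3 = \frac{1}{N}[2, -2, 0, 0]$, $W_4 = \frac{1}{N}[0, 0, 2, -2]$
- 1st coefficient = $\langle A, W_1 \rangle$: average coefficient
- 2nd coefficient = $\langle A, W_2 \rangle$: detail coefficient
- 3rd coefficient = $\langle A, W_3 \rangle$: detail coefficient
- 4th coefficient = $\langle A, W_4 \rangle$: detail coefficient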

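Embedding Vectors
Any set of n points in Euclidean space can be embedded into an O(log n)-dimensional Euclidean space with 1+є distortion, for any fixed є > 0; comparable sketches exist for the L1 metric.
The embedding of a vector v, for random vectors r_1…r_k, is $f(v) = (\langle v, r_1 \rangle, \langle v, r_2 \rangle, \ldots, \langle v, r_k \rangle)$.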

Johnson-Lindenstrauss (JL) Lemma
Simply stated: $\langle a, b \rangle \approx \frac{1}{k} \sum_{j=1}^{k} \langle a, r_j \rangle \cdot \langle b, r_j \rangle$, where j = 1…k, k << N, and each r_j is a random vector whose entries are +1 or -1 with equal probability.
Implications: a vector in R^N can be represented in a k-dimensional space.
Benefits: approximate queries?
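The estimate can be tried out directly with random ±1 vectors. The sketch below is illustrative only: it uses `std::mt19937` as the source of randomness, and the function name and the `seed` parameter are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Estimates <a, b> from k random +/-1 projections:
//   <a, b> ~ (1/k) * sum_j <a, r_j> * <b, r_j>
// a and b must have the same length, and k must be at least 1.
double estimate_inner_product(const std::vector<double>& a,
                              const std::vector<double>& b,
                              std::size_t k, unsigned seed) {
    std::mt19937 rng(seed);
    double total = 0.0;
    for (std::size_t j = 0; j < k; ++j) {
        double sa = 0.0;  // <a, r_j>
        double sb = 0.0;  // <b, r_j>
        for (std::size_t i = 0; i < a.size(); ++i) {
            // Entry i of r_j: +1 or -1 with equal probability.
            double r = (rng() & 1u) ? 1.0 : -1.0;
            sa += a[i] * r;
            sb += b[i] * r;
        }
        total += sa * sb;
    }
    return total / static_cast<double>(k);
}
```

Each term has expectation ⟨a, b⟩, so averaging over more random vectors tightens the estimate; a small k trades accuracy for space.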

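AQP & JL-Lemma
$\langle a, b \rangle \approx \frac{1}{k} \sum_{j=1}^{k} \langle a, r_j \rangle \cdot \langle b, r_j \rangle$
Approximate queries can be answered by choosing a special b.
- To query the i-th value, choose b = [0…0, 1, 0…0], with the single 1 at position i.
- For a range query (i, j), choose b with b[x] = 1 for i <= x <= j and b[x] = 0 otherwise.
What's the catch? Each r_j also has size N. So where do we store the random vectors?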

Pseudo-Random Generator
The solution to the large space overhead is to generate the random vectors on the fly, for example:
for (i=0;i<k;i++) { srand (i); for (j=0;j<N;j++) { rand (); } }
This solution works, but there is a more elegant solution to this problem: the Reed-Muller codes extractor.
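The re-seeding idea can be sketched as follows, using `std::mt19937` in place of `srand`/`rand` so the sequence is the same on every platform. The function names are hypothetical; the point is only that the seed j stands in for the whole vector r_j.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Recomputes <data, r_j> without storing r_j: the generator is re-seeded
// with j, so the same +/-1 sequence is replayed on every call.
double project_on_the_fly(const std::vector<double>& data, unsigned j) {
    std::mt19937 rng(j);  // the seed j identifies the random vector r_j
    double dot = 0.0;
    for (double value : data) {
        double r = (rng() & 1u) ? 1.0 : -1.0;
        dot += value * r;
    }
    return dot;
}

// Builds a k-entry sketch; memory is O(k) rather than O(k * N).
std::vector<double> build_sketch(const std::vector<double>& data, unsigned k) {
    std::vector<double> sketch(k);
    for (unsigned j = 0; j < k; ++j) {
        sketch[j] = project_on_the_fly(data, j);
    }
    return sketch;
}
```

A drawback of this approach is that entry i of r_j can only be reached by replaying the first i values of the sequence, which is one reason a generator with direct access to any entry is more attractive.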

Reed-Muller Generator
The matrix values represent RM codes.
RM (x,y)=
Replacing 0 → 1 and 1 → -1, we get wavelet basis vectors.

Reed-Muller PR Generator
Benefits of the Reed-Muller pseudo-random generator:
- Values are generated on the fly.
- Every value is computed independently, without reference to the previous values.
- It most nearly imitates the wavelet basis vectors, hence the sketch contains most of the energy of the signal.
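To illustrate what "independently computed" means, here is a hypothetical stand-in based on a first-order Reed-Muller (Hadamard) code bit: the parity of the bitwise AND of x and y, mapped 0 → +1 and 1 → -1. The exact formula used in the project is not preserved in this transcript, so this is an assumption, not the author's generator.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for RM(x, y): the first-order Reed-Muller bit
// <x, y> mod 2 (parity of x AND y), mapped 0 -> +1 and 1 -> -1.
// Any entry is computed directly from (x, y); no state is carried over.
int rm_sign(std::uint32_t x, std::uint32_t y) {
    std::uint32_t v = x & y;
    int parity = 0;
    while (v != 0) {
        parity ^= 1;
        v &= v - 1;  // clear the lowest set bit
    }
    return parity == 0 ? 1 : -1;
}
```

With this choice, the vectors for different y are mutually orthogonal ±1 patterns (the rows of a Hadamard matrix), which resemble the piecewise-constant shape of Haar basis vectors.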

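Lessons so far
- There is a way to embed the N data values into a sketch built from k << N random vectors.
- JL-Lemma: $\langle a, b \rangle \approx \frac{1}{k} \sum_{j=1}^{k} \langle a, r_j \rangle \cdot \langle b, r_j \rangle$
- Reed-Muller codes are excellent imitators of both wavelet basis vectors and random vectors.
- Query processing is possible thanks to the JL-Lemma.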

Implementation Details
Implementation trivia:
- Implemented in Visual C++.
- The design follows the classes and objects paradigm.
- Test results and graphs were produced with MS Excel.

Data Flow Diagram

Dataset Generator
A synthetic data set was generated using random distributions.
- Normal distribution: calling telephone number ~ (1000 lines), receiving telephone number.
- Exponential distribution: call time, 0~512 minutes.
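A generator along these lines could look like the sketch below. Only the 1000 lines and the 0~512 minute range come from the slide; the mean and spread of the normal distribution, the 30-minute mean call time, the `CallRecord` struct, and the function name are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct CallRecord {
    int caller;    // calling line, in [0, 1000)
    int receiver;  // receiving line, in [0, 1000)
    int minutes;   // call time, in [0, 512]
};

std::vector<CallRecord> generate_calls(std::size_t count, unsigned seed) {
    const int lines = 1000;       // from the slide
    const int max_minutes = 512;  // from the slide
    std::mt19937 rng(seed);
    // Assumed parameters: lines centred on 500, call times averaging 30 min.
    std::normal_distribution<double> line_dist(lines / 2.0, lines / 6.0);
    std::exponential_distribution<double> time_dist(1.0 / 30.0);
    auto pick_line = [&]() {
        double v = std::round(line_dist(rng));
        return static_cast<int>(std::min(std::max(v, 0.0), lines - 1.0));
    };
    std::vector<CallRecord> calls;
    calls.reserve(count);
    for (std::size_t n = 0; n < count; ++n) {
        int caller = pick_line();
        int receiver = pick_line();
        double t = std::min(time_dist(rng), static_cast<double>(max_minutes));
        calls.push_back(CallRecord{caller, receiver, static_cast<int>(t)});
    }
    return calls;
}
```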

Data Streamer
The data streaming class offers methods that imitate a real-time data stream by continuously presenting the program with data.
Type DataStreamer::getData();

Pseudo Random Generator
This class calculates the Reed-Muller based pseudo-random numbers.
type PseudoRandomGenerator::getRandom (int X, int Y);
It uses the Reed-Muller formula RM (x,y).

Data Decomposition
The data is decomposed into a sketch by calculating the dot product of the data stream with O(log N) random vectors. The sketch is stored in main memory to be used by the query processing engine.
Sketch [j] += Data [i] * Random (i, j);
Here i = 1…N and j = 1…k.
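The update rule above can be wrapped in a small class that processes one stream item at a time. In this sketch, `random_sign` is a hypothetical stand-in for Random(i, j) (the parity of i AND j mapped to ±1), indices are 0-based rather than 1-based, and the `Sketch` class is not the project's actual class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Random(i, j): +1 or -1 from the parity of i & j.
int random_sign(std::uint32_t i, std::uint32_t j) {
    std::uint32_t v = i & j;
    int parity = 0;
    while (v != 0) {
        parity ^= 1;
        v &= v - 1;
    }
    return parity == 0 ? 1 : -1;
}

class Sketch {
public:
    explicit Sketch(std::size_t k) : entries_(k, 0.0) {}

    // Processes the stream item `value` at position i:
    // Sketch[j] += Data[i] * Random(i, j) for every j.
    void update(std::uint32_t i, double value) {
        for (std::size_t j = 0; j < entries_.size(); ++j) {
            entries_[j] += value * random_sign(i, static_cast<std::uint32_t>(j));
        }
    }

    const std::vector<double>& entries() const { return entries_; }

private:
    std::vector<double> entries_;  // the only state kept: k numbers
};
```

Because `update` is linear in `value`, the same code also handles negative updates, so it would work unchanged in the turnstile model.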

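Query Processing Engine
The query processing engine uses the sketch and a new vector b. It relies on the same JL-Lemma: $\langle a, b \rangle \approx \frac{1}{k} \sum_{j=1}^{k} \langle a, r_j \rangle \cdot \langle b, r_j \rangle$
Setting b to different values yields, in theory, any query that can be expressed as an inner product with the data.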

Point Query Processing
Point query: asks for any single value in the whole data stream.
Point query algorithm:
- Prepare b[i] = {0 for i != q, 1 for i = q}, where q is the queried position.
- Generate QuerySketch[j] += b[i] * Random (i, j);
- Result = (DataSketch * QuerySketch) / k, where k is the sketch size.
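A point query can be sketched as below. Since b is a unit vector, its sketch entry j is simply Random(q, j), so b never has to be materialised. `random_sign` is again a hypothetical stand-in for Random(i, j), and indices are 0-based; the data sketch is assumed to be non-empty.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Random(i, j): +1 or -1 from the parity of i & j.
int random_sign(std::uint32_t i, std::uint32_t j) {
    std::uint32_t v = i & j;
    int parity = 0;
    while (v != 0) {
        parity ^= 1;
        v &= v - 1;
    }
    return parity == 0 ? 1 : -1;
}

// Sketch[j] = sum_i Data[i] * Random(i, j), for j = 0..k-1.
std::vector<double> sketch_of(const std::vector<double>& data, std::size_t k) {
    std::vector<double> sketch(k, 0.0);
    for (std::size_t i = 0; i < data.size(); ++i) {
        for (std::size_t j = 0; j < k; ++j) {
            sketch[j] += data[i] * random_sign(static_cast<std::uint32_t>(i),
                                               static_cast<std::uint32_t>(j));
        }
    }
    return sketch;
}

// Estimates Data[q] from the data sketch alone.
double point_query(const std::vector<double>& data_sketch, std::uint32_t q) {
    double dot = 0.0;
    for (std::size_t j = 0; j < data_sketch.size(); ++j) {
        // QuerySketch[j] for a unit vector at q is just Random(q, j).
        dot += data_sketch[j] * random_sign(q, static_cast<std::uint32_t>(j));
    }
    return dot / static_cast<double>(data_sketch.size());
}
```

With this particular stand-in and k equal to N (a power of two), the answer happens to be exact because the sign vectors are orthogonal; with k << N it is only an approximation.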

Range Query Processing
Range query: specifies the low and high positions between which the query is to be processed. Even multiple ranges can be specified.
Range query algorithm:
- Prepare b[i] = {1 for low <= i <= high, 0 otherwise}.
- Generate QuerySketch[j] += b[i] * Random (i, j);
- Result = (DataSketch * QuerySketch) / k, where k is the sketch size.
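A range sum follows the same pattern, with the query sketch accumulated over every position in the range. As before, `random_sign` is a hypothetical stand-in for Random(i, j), indices are 0-based, and the function names are chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Random(i, j): +1 or -1 from the parity of i & j.
int random_sign(std::uint32_t i, std::uint32_t j) {
    std::uint32_t v = i & j;
    int parity = 0;
    while (v != 0) {
        parity ^= 1;
        v &= v - 1;
    }
    return parity == 0 ? 1 : -1;
}

// Sketch[j] = sum_i Data[i] * Random(i, j), for j = 0..k-1.
std::vector<double> sketch_of(const std::vector<double>& data, std::size_t k) {
    std::vector<double> sketch(k, 0.0);
    for (std::size_t i = 0; i < data.size(); ++i) {
        for (std::size_t j = 0; j < k; ++j) {
            sketch[j] += data[i] * random_sign(static_cast<std::uint32_t>(i),
                                               static_cast<std::uint32_t>(j));
        }
    }
    return sketch;
}

// Estimates Data[low] + ... + Data[high] (inclusive) from the data sketch.
double range_query(const std::vector<double>& data_sketch,
                   std::uint32_t low, std::uint32_t high) {
    double dot = 0.0;
    for (std::size_t j = 0; j < data_sketch.size(); ++j) {
        // QuerySketch[j] = sum of Random(i, j) over the range, since b[i] = 1 there.
        double query_entry = 0.0;
        for (std::uint32_t i = low; i <= high; ++i) {
            query_entry += random_sign(i, static_cast<std::uint32_t>(j));
        }
        dot += data_sketch[j] * query_entry;
    }
    return dot / static_cast<double>(data_sketch.size());
}
```

Multiple ranges could be handled by summing the results of several calls, because the estimate is linear in b.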

AQP Test
- Time complexity analysis
- Query processing accuracy with data size
- Query processing accuracy with sketch size

Time Complexity
The following times were found to be linear in the size of the data:
- Sketching time
- Query processing time

Time Complexity (Sketching)

Time Complexity (Query)

Accuracy versus Data Size
Accuracy of the query against the data size, measured as PSNR (dB) versus data size.
- Data size is increased by powers of 2.
- Sketch size is assumed to be log N.
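PSNR is commonly defined as 10·log10(peak² / MSE), where MSE is the mean squared error between exact and approximate answers. The transcript does not say which peak value was used in the tests, so the sketch below takes it as a parameter; the function name `psnr_db` is likewise an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Peak signal-to-noise ratio in dB between exact and approximate answers.
// `peak` is the largest possible signal value (an assumed parameter).
// Both vectors must be non-empty and of equal length.
double psnr_db(const std::vector<double>& exact,
               const std::vector<double>& approx, double peak) {
    double mse = 0.0;
    for (std::size_t i = 0; i < exact.size(); ++i) {
        double err = exact[i] - approx[i];
        mse += err * err;
    }
    mse /= static_cast<double>(exact.size());
    if (mse == 0.0) {
        // Identical answers: the ratio is unbounded.
        return std::numeric_limits<double>::infinity();
    }
    return 10.0 * std::log10(peak * peak / mse);
}
```

Higher values mean more accurate answers; each additional 10 dB corresponds to a tenfold drop in mean squared error.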

PSNR (dB) versus Data Size

Accuracy versus Sketch Size
Accuracy of the query against the sketch size, measured as PSNR (dB) versus sketch size.
- Data size is assumed to be constant =
- Sketch size is varied.

PSNR (dB) versus Sketch Size

Conclusions
- Space complexity reduction: a prohibitively large data stream is summarized in sub-linear space.
- Time complexity reduction: a one-pass data stream algorithm.
- Scalability to multiple dimensions.

Applications and Future Work
- Data mining of streams.
- Multimedia and databases: trying it with video coding might be fun, or a disaster.
- Graph theory problems: MST, matching, etc. need to be solved in the streaming model.
- Computational geometry.
- Earth observation data streams or weather data streams.
- Any problem that can be modeled as a data stream.

References
- S. Acharya, P. B. Gibbons, V. Poosala and S. Ramaswamy, “Join Synopses for Approximate Query Answering”, In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data.
- J. M. Hellerstein, P. J. Haas and H. J. Wang, “Online Aggregation”, In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data.
- Y. E. Ioannidis and V. Poosala, “Histogram-Based Approximation of Set-Valued Query Answers”, In Proceedings of the 25th International Conference on Very Large Databases.
- K. Chakrabarti, M. Garofalakis, R. Rastogi and K. Shim, “Approximate Query Processing Using Wavelets”, In Proceedings of the 26th Conference on Very Large Databases, Egypt.
- F. Olken, “Random Sampling in Databases”, PhD Thesis, University of California at Berkeley.
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, “One-Pass Wavelet Decompositions of Data Streams”, IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 3, May/June.
- A. Ta-Shma, D. Zuckerman and S. Safra, “Extractors from Reed-Muller Codes”, In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science, 2001.

Questions & Answers
Thanks to the following for their sincere help in this project: Dr. Asim Karim, Dr. Sarmad Abbasi, Dr. Asim Loan, Dr. Sohaib A. Khan and all my friends, especially Laeeq Aslam and Aimal Tariq Rextin.