One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss Presented by James Chan.

One-Pass Wavelet Decompositions of Data Streams
TKDE May 2002
Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss
Presented by James Chan
CSCI 599 Multidimensional Databases, Fall 2003

Outline of Talk
Introduction
Background
Proposed Algorithm
Experiments
End Notes

Streaming Applications
Telephone Call Duration
Call Detail Record (CDR)
IP Traffic Flow
Bank ATM Transactions
Mission-Critical Tasks:
– Fraud
– Security
– Performance Monitoring

Data Stream Model
Data stream problem:
– One pass: no backtracking
– Unbounded data: algorithms must use little memory
– Continuous: must run in real time
[Figure: data streams enter a stream processing engine, which keeps a synopsis in memory and returns an (approximate) answer]

Data Stream Strategies
Many stream algorithms produce approximate answers and have:
– Deterministic bounds: answers are within ±ε
– Probabilistic bounds: answers are within ±ε with high success probability (1-δ)

Data Stream Strategies
Windows: elements expire a time t after they arrive
Samples: approximate the entire domain with a sample
Histograms: partition element domain values into buckets (Equi-depth, V-Opt)
Wavelets: Haar; construction and maintenance are difficult for a large domain
Sketch techniques: estimate the L2 norm of a signal

Proposed Stream Model

Background: Cash Register vs. Aggregate
Cash register: each incoming stream item is a domain value i, and it increments or decrements the range value a(i) of that domain value
Aggregate: each incoming stream item carries the range value itself, and it sets a(i) for that domain value
Note: examples in this paper assume
– each cash register element counts as +1 unit
– no duplicate elements in aggregate models
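
The two stream formats can be illustrated with a minimal sketch. The function names and the sample streams are hypothetical and chosen only for this illustration; the code follows the slide's assumptions that each cash register element counts as +1 and that an aggregate stream has no duplicate indices.

```python
from collections import defaultdict


def signal_from_cash_register(stream):
    """Each item is a domain index i; every arrival adds +1 to a(i)."""
    a = defaultdict(int)
    for i in stream:
        a[i] += 1
    return dict(a)


def signal_from_aggregate(stream):
    """Each item is a pair (i, a(i)); every index appears at most once."""
    a = {}
    for i, value in stream:
        if i in a:
            raise ValueError(f"duplicate index {i} in aggregate stream")
        a[i] = value
    return a


# Hypothetical streams that describe the same underlying signal.
cash_register_stream = [2, 0, 2, 3, 2, 0]     # unordered arrivals of indices
aggregate_stream = [(0, 2), (2, 3), (3, 1)]   # ordered (index, value) pairs

same_signal = (signal_from_cash_register(cash_register_stream)
               == signal_from_aggregate(aggregate_stream))
```

Both streams describe the same signal; only the arrival format differs, which is what makes the unordered cash register case harder to summarize in small space.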

Background: Cash Register vs. Aggregate
Columns: Cash Register (domain) | Aggregate (range)
Ordered: Easiest, e.g. time series
Unordered: General, challenging, e.g. network volume
Contiguous: Same as aggregate unordered | n/a

Background: Wavelet Basics
Wavelet transforms capture trends in a signal
A typical transform of a length-n signal involves log n passes
Each pass over m values creates m/2 averages and m/2 differences; the process is repeated on the averages
Output: the coefficients of the wavelet basis vectors, one overall average and n-1 difference coefficients
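
The averaging and differencing passes can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: it uses the un-normalized convention average = (x + y)/2 and difference = (x - y)/2, requires a power-of-two length, and the function name `haar_decompose` is made up for this example.

```python
def haar_decompose(signal):
    """Un-normalized Haar transform by repeated pairwise averaging and differencing.

    Returns [overall average, coarsest difference, ..., finest differences].
    Assumes len(signal) is a power of two.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(signal)
    details = []
    while len(averages) > 1:          # one loop iteration per pass: log2(n) passes
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Finer levels are computed first, so prepend to keep coarse-to-fine order.
        details = [(x - y) / 2 for x, y in pairs] + details
        averages = [(x + y) / 2 for x, y in pairs]
    return averages + details


coefficients = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
# coefficients == [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

For the length-8 input there are three passes, and the output holds one average followed by seven difference coefficients.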

Background: Haar Wavelet Notation
[Notation table, formulas not preserved. Labels: High pass filter, Low pass filter, Input: signal a, Basis, Coefficients, Scaling Factor, Psi Vectors (un-normalized)]

Background: Haar Wavelet Example

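Background: Small B Representation
Most signals in nature have a small-B representation
Keep only the largest B wavelet coefficients to estimate the energy of the signal
Additional coefficients do little to reduce the sum squared error
Energy: ||a||^2 = sum over i of a(i)^2
SSE: ||a - R||^2 = sum over i of (a(i) - R(i))^2, where R is the B-term representation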
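
A small worked example may help here. It is a sketch under assumptions: it uses an orthonormal Haar transform (pairs scaled by 1/sqrt(2)) so that the energy of the signal equals the energy of its coefficients, and the helper names `haar_orthonormal`, `haar_inverse`, `top_b` and `energy` are invented for this illustration. With that normalization, the SSE of the B-term representation equals the energy of the dropped coefficients.

```python
import math


def haar_orthonormal(signal):
    """Orthonormal Haar transform; assumes a power-of-two length."""
    coeffs = list(signal)
    details = []
    while len(coeffs) > 1:
        pairs = list(zip(coeffs[0::2], coeffs[1::2]))
        details = [(x - y) / math.sqrt(2) for x, y in pairs] + details
        coeffs = [(x + y) / math.sqrt(2) for x, y in pairs]
    return coeffs + details


def haar_inverse(coeffs):
    """Invert haar_orthonormal, rebuilding one level at a time."""
    values = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        pos += len(values)
        rebuilt = []
        for s, d in zip(values, details):
            rebuilt += [(s + d) / math.sqrt(2), (s - d) / math.sqrt(2)]
        values = rebuilt
    return values


def top_b(coeffs, b):
    """Keep the b largest-magnitude coefficients and zero out the rest."""
    keep = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)[:b]
    return [c if k in keep else 0.0 for k, c in enumerate(coeffs)]


def energy(v):
    return sum(x * x for x in v)


signal = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_orthonormal(signal)
approx = haar_inverse(top_b(coeffs, 3))                 # B-term representation, B = 3
sse = energy([x - y for x, y in zip(signal, approx)])
dropped = energy(coeffs) - energy(top_b(coeffs, 3))     # energy of discarded coefficients
```

Because the dropped energy is exactly the SSE, keeping the largest coefficients first is the best choice for a fixed B under this error measure.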

Background: Storage
Highest B wavelet coefficients
log N straddling coefficients, one per level of the wavelet tree
[Figure label: Original Signal]

Background: Bounding Theorems
Theorem 1: In the ordered aggregate model, given O(B + log N) storage (B is the number of coefficients kept), the time to process a new data item is O(B + log N)
Theorem 2: Any algorithm that calculates the 2nd largest wavelet coefficient of the signal in the unordered cash register or unordered aggregate model uses at least N/polylog(N) space
This holds even if:
– You only care about the coefficient's existence, not its value
– You only calculate it up to a factor of 2

Proposed Algorithm: Overview
Avoid keeping anything of size proportional to the domain size N in memory
Estimate wavelet coefficients using sketches, which are of size log(N)
The sketch is maintained in memory and updated as data entries stream in

What’s a Sketch?
Distortion parameter: ε (epsilon)
Failure probability: δ (delta)
Failure threshold: η (eta)
Original signal: a
Random vector of {-1,+1}s: r
Seed for r: s
Atomic sketch: dot product of a and r
Sketch: O(log(N/δ)/ε^2) atomic sketches
We use the same j to index the atomic sketch, seed, and random vector, so there are j atomic sketches in a sketch

Updating a Sketch
Cash register:
– On arrival of index i, add r_j(i) to each of the j atomic sketches
Aggregate:
– On arrival of (i, a(i)), add a(i)·r_j(i) to each of the j atomic sketches
Use a generator that takes a seed of size log(N) to compute r_j(i)
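
Since an atomic sketch is the dot product of the signal a with a random ±1 vector r, a change of a(i) by some amount changes that atomic sketch by the same amount times r(i). The sketch below illustrates this linearity. It is not the paper's implementation: the class name `Sketch` and its parameters are hypothetical, and for simplicity it stores every random vector explicitly using Python's `random` module, which costs memory proportional to N. The paper avoids that by regenerating r_j(i) from a short seed.

```python
import random


class Sketch:
    """A list of atomic sketches <a, r_j>, maintained under point updates."""

    def __init__(self, domain_size, num_atomic, seed=0):
        rng = random.Random(seed)
        # Assumption for illustration only: r_j is stored in full.
        # The paper computes r_j(i) on demand from a seed of size log(N).
        self.r = [[rng.choice((-1, 1)) for _ in range(domain_size)]
                  for _ in range(num_atomic)]
        self.atomic = [0] * num_atomic

    def update(self, i, delta=1):
        """Apply a(i) += delta to every atomic sketch."""
        for j, r_j in enumerate(self.r):
            self.atomic[j] += delta * r_j[i]


# Cash register stream: each arriving index adds +1 to a(i).
cash = Sketch(domain_size=8, num_atomic=5, seed=1)
for i in [2, 0, 2, 3, 2, 0]:
    cash.update(i)

# Aggregate stream: each arriving pair carries the full value a(i).
agg = Sketch(domain_size=8, num_atomic=5, seed=1)
for i, value in [(0, 2), (2, 3), (3, 1)]:
    agg.update(i, value)
```

Both streams describe the signal [2, 0, 3, 1, 0, 0, 0, 0], so the two sketches end up identical and each atomic sketch equals the dot product of that signal with its random vector.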

Reed Muller Generator
Pseudo-random generator meeting these requirements:
– Variables are 4-wise independent: the expected value of the product of any 4 distinct r is 0
– Requires O(log N) space for seeding
– Performs computation in polylog(N) time
{0} {d} {c} … {d,c,b,a}

Estimation of Inner Product
X = median of O(log(1/δ)) means
Each mean is taken over O(1/ε^2) independent atomic-sketch estimates

Boosting Accuracy and Confidence
Improve accuracy to ε by averaging over more independent copies for each mean: O(1/ε^2) copies per mean
Improve confidence to 1-δ by increasing the number of means the median is taken over: O(log(1/δ)) means
X = median of (means)
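
The median-of-means idea can be sketched as below. This is an illustration under assumptions, not the paper's estimator as implemented: the function name and parameters are hypothetical, each atomic estimate is taken to be the product of the two atomic sketches <a, r> and <b, r> for a shared random ±1 vector r, and the vectors come from Python's `random` module rather than a seeded 4-wise independent generator.

```python
import random
import statistics


def estimate_inner_product(a, b, num_means, copies_per_mean, seed=0):
    """Median over num_means averages, each over copies_per_mean atomic estimates."""
    rng = random.Random(seed)
    n = len(a)
    means = []
    for _ in range(num_means):
        products = []
        for _ in range(copies_per_mean):
            r = [rng.choice((-1, 1)) for _ in range(n)]
            sketch_a = sum(x * s for x, s in zip(a, r))   # atomic sketch of a
            sketch_b = sum(y * s for y, s in zip(b, r))   # atomic sketch of b
            products.append(sketch_a * sketch_b)          # expectation is <a, b>
        means.append(sum(products) / len(products))       # averaging tightens accuracy
    return statistics.median(means)                       # median raises confidence


a = [3, 1, 4, 1, 5, 9, 2, 6]
b = [2, 7, 1, 8, 2, 8, 1, 8]
exact = sum(x * y for x, y in zip(a, b))
estimate = estimate_inner_product(a, b, num_means=7, copies_per_mean=400)
```

Passing a unit vector as `b` turns the same routine into an estimate of a single signal value. Averaging shrinks the variance of each mean, while the median discards the occasional mean that lands far from the true value.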

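Using the Sketches
We can approximate the inner product <a, ψ> of the signal with each wavelet basis vector ψ to maintain the top B coefficients
Note that a point query is <a, e_i>, where e_i is a vector with a 1 at index i and 0s everywhere else
Atomic sketches are kept in memory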

Maintaining Top B Coefficients
At most log N + 1 coefficient updates per arriving item
May need to approximate straddling coefficients to aggregate with already existing or near variables
Compare updates with the top B and update the top B if necessary
[Figure legend: updated, unaffected]
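
The log N + 1 bound can be checked on a small example: a change to one signal value touches the overall average and one difference coefficient per level of the wavelet tree. The code is a sketch under assumptions. It uses the un-normalized convention average = (x + y)/2 and difference = (x - y)/2, a power-of-two domain size, and the coefficient layout [average, coarsest difference, ..., finest differences]; the function names are made up for this illustration.

```python
def haar_decompose(signal):
    """Un-normalized Haar transform; assumes a power-of-two length."""
    averages, details = list(signal), []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(x - y) / 2 for x, y in pairs] + details
        averages = [(x + y) / 2 for x, y in pairs]
    return averages + details


def affected_coefficients(i, n, delta=1.0):
    """Return {coefficient position: change} caused by a(i) += delta."""
    changes = {0: delta / n}                 # the overall average always changes
    block = 2                                # support size at the finest level
    while block <= n:
        k = i // block                       # block at this level that contains i
        sign = 1 if (i % block) < block // 2 else -1   # left half adds, right half subtracts
        changes[n // block + k] = sign * delta / block
        block *= 2
    return changes


signal = [2, 2, 0, 2, 3, 5, 4, 4]
before = haar_decompose(signal)
signal[5] += 4
after = haar_decompose(signal)
changes = affected_coefficients(5, n=8, delta=4.0)
```

For N = 8 the update touches exactly four coefficients, log N + 1, and every other coefficient is unchanged.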

Algorithm Space and Time
Their algorithm uses polylog(N) space and polylog(N) per-item time to maintain B terms (by approximation)

Experiments
Data: one week of AT&T call detail (unordered cash register model)
Modes:
– Batch: query only between intervals
– Online: query anytime
Direct Point: estimate <a, e_i> from the sketch (e_i is the zero vector except for a 1 at i)
Direct Wavelets: estimate all supporting coefficients and use wavelet reconstruction to calculate point a(i)
Top B: reconstruct the point from the top B coefficients (maintained by sketch)

Top B – Day 0

Top B - 1 Week (fixed-set)
Value updates only, no replacement

Sketch Size on Accuracy

Heavy Hitters
Points that contribute significantly to the energy of the signal
Direct point estimates are very accurate for heavy hitters but poor for non-heavy hitters
Adaptive greedy pursuit: removing the first heavy hitter from the signal improves the accuracy of the estimate for the next biggest heavy hitter
However, an error is introduced with each subtraction of a heavy hitter
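
Adaptive greedy pursuit can be sketched on top of point estimates. The code below is a hedged illustration, not the authors' procedure: all names are hypothetical, the random ±1 vectors are stored in full and drawn from Python's `random` module, and the test signal with three heavy hitters is invented. It relies only on linearity: subtracting an estimated value at index i from the signal subtracts that value times r(i) from every atomic sketch.

```python
import random
import statistics


def make_vectors(n, groups, per_group, seed=0):
    """groups x per_group random ±1 vectors of length n (stored in full for simplicity)."""
    rng = random.Random(seed)
    return [[[rng.choice((-1, 1)) for _ in range(n)] for _ in range(per_group)]
            for _ in range(groups)]


def build_sketch(a, vectors):
    """Atomic sketches <a, r>, grouped the same way as the vectors."""
    return [[sum(x * s for x, s in zip(a, r)) for r in group] for group in vectors]


def point_estimate(sketch, vectors, i):
    """Median over groups of the mean of <a, r> * r(i), an estimate of a(i)."""
    means = [sum(s * r[i] for s, r in zip(group_s, group_r)) / len(group_r)
             for group_s, group_r in zip(sketch, vectors)]
    return statistics.median(means)


def greedy_pursuit(sketch, vectors, n, rounds):
    """Repeatedly pick the largest estimated point and remove it from the sketch."""
    sketch = [list(group) for group in sketch]          # work on a copy
    found = {}
    for _ in range(rounds):
        estimates = [point_estimate(sketch, vectors, i) for i in range(n)]
        i = max(range(n), key=lambda k: abs(estimates[k]))
        found[i] = found.get(i, 0.0) + estimates[i]
        # Subtract the estimated heavy hitter; any estimation error stays in the sketch.
        for group_s, group_r in zip(sketch, vectors):
            for j, r in enumerate(group_r):
                group_s[j] -= estimates[i] * r[i]
    return found


n = 32
a = [0] * n
a[5], a[17], a[28] = 100, 60, 30                        # invented heavy hitters
vectors = make_vectors(n, groups=5, per_group=200)
found = greedy_pursuit(build_sketch(a, vectors), vectors, n, rounds=3)
```

Each removal lowers the residual energy, so later estimates are less noisy, but the error of each subtracted estimate remains in the sketch, which is the trade-off the slide describes.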

Processing Heavy Hitters Adaptive Greedy Pursuit

End Notes
First provable guarantees for Haar wavelets over data streams
Can estimate Haar coefficients c_i = <a, ψ_i>
Top B is updated in:
This paper is superseded by "Fast, Small-space algorithms for approximate histogram maintenance", STOC 2002
– Discusses how to select top B and find heavy hitters