One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss Presented by James Chan.


1 One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss Presented by James Chan CSCI 599 Multidimensional Databases Fall 2003

2 Outline of Talk Introduction Background Proposed Algorithm Experiments End Notes

3 Streaming Applications Telephone Call Duration Call Detail Record (CDR) IP Traffic Flow Bank ATM Transactions Mission Critical Task: –Fraud –Security –Performance Monitoring

4 Data Stream Model Data Stream Problem: One Pass – no backtracking. Unbounded Data – algorithms require small memory usage. Continuous – need to run in real time. [Diagram: Data Streams → Stream Processing Engine (Synopsis in Memory) → (Approximate) Answer]

5 Data Stream Strategies Many stream algorithms produce approximate answers and have: –Deterministic Bounds: answers are within ±ε –Probabilistic Bounds: answers have high success probability (1-δ) within ±ε

6 Data Stream Strategies Windows: New elements expire after time t Samples: Approximate entire domain with a sample Histograms: Partitioning element domain values into buckets (Equi-depth, V-Opt) Wavelets: Haar, Construction and maintenance (difficult for large domain) Sketch Techniques: estimate of L2 norm of a signal

7 Proposed Stream Model

8 Background: Cash Register vs. Aggregate Cash Register: each incoming element is a domain value i, and its arrival increments (or decrements) the range value a(i) at that domain point. Aggregate: each incoming element carries the range value a(i) itself and sets the value at domain point i. Note: Examples in this paper assume –each cash register element contributes +1 unit –no duplicate elements in the aggregate model

9 Background: Cash Register vs. Aggregate Columns: Cash Register (domain), Aggregate (range). Ordered: easiest, e.g. time series. Unordered: general, challenging, e.g. network volume. Contiguous: same as aggregate unordered; n/a.

10 Background: Wavelet Basics Wavelet transforms capture trends in a signal Typical transform involves log n passes Each pass creates two sets of n/2 averages and differences. Process repeated on averages Output: Wavelet Basis vectors – one average and n-1 coefficients
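The passes described on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the un-normalized convention in which each pass replaces a pair (x, y) by the average (x + y) / 2 and the half-difference (x - y) / 2, and assuming the input length is a power of two. The function name `haar_decompose` is hypothetical. The input is the eight-value signal shown on the storage slide.

```python
def haar_decompose(signal):
    """Un-normalized Haar decomposition: one overall average plus n-1 detail coefficients."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(signal)
    details = []
    while len(averages) > 1:                      # log2(n) passes in total
        pairs = list(zip(averages[0::2], averages[1::2]))
        level = [(x - y) / 2 for x, y in pairs]   # differences of this pass
        averages = [(x + y) / 2 for x, y in pairs]
        details = level + details                 # coarser levels go in front
    return averages + details


print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

The first entry is the overall average, and the remaining seven entries are the detail coefficients ordered from the coarsest level to the finest.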

11 Background: Haar Wavelet Notation [Equations not recovered. Labels on the slide: high pass filter, low pass filter, input signal a, basis, coefficients, scaling factor, psi vectors (un-normalized)]

12 Background: Haar Wavelet Example

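13 Background: Small B Representation Most signals in nature have a small B representation. Only keep the largest B wavelet coefficients to estimate the energy of the signal. Additional coefficients contribute little further reduction in the sum squared error (SSE). Energy: ||a||^2 = Σ_i a(i)^2. SSE: ||a - R||^2 = Σ_i (a(i) - R(i))^2, where R is the B-term representation of a.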
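A short sketch of the energy and SSE bookkeeping behind a B-term representation. It assumes an orthonormal Haar transform (sums and differences scaled by 1/sqrt(2)), so that the energy of the signal equals the sum of squared coefficients and the SSE equals the energy of the dropped coefficients. The function names are hypothetical and the input length is assumed to be a power of two.

```python
import math


def haar_orthonormal(signal):
    """Orthonormal Haar transform of a power-of-two-length signal."""
    coeffs = list(signal)
    out = []
    while len(coeffs) > 1:
        pairs = list(zip(coeffs[0::2], coeffs[1::2]))
        out = [(x - y) / math.sqrt(2) for x, y in pairs] + out
        coeffs = [(x + y) / math.sqrt(2) for x, y in pairs]
    return coeffs + out


def top_b_error(signal, b):
    """Return (energy, sse) when only the b largest-magnitude coefficients are kept."""
    coeffs = haar_orthonormal(signal)
    energy = sum(c * c for c in coeffs)          # equals sum of a(i)^2
    kept = sorted(coeffs, key=abs, reverse=True)[:b]
    sse = energy - sum(c * c for c in kept)      # energy of the dropped coefficients
    return energy, sse


for b in range(0, 9, 2):
    print(b, top_b_error([2, 2, 0, 2, 3, 5, 4, 4], b))
```

For this signal most of the energy sits in the first two coefficients, which is the sense in which a signal has a small B representation.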

14 Background: Storage Highest B wavelet coefficients. Log N straddling coefficients, one per level of the wavelet tree. [Figure: wavelet tree over the original signal 2 2 0 2 3 5 4 4, with coefficients including 2.75, -1.25, 0.5 and 0, and edges labeled + and -]

15 Background: Bounding Theorems Theorem 1: In the ordered aggregate model, given O(B+logN) storage (B is the number of wavelet coefficients retained), the time to process each new data item is O(B+logN). Theorem 2: Any algorithm that calculates the 2nd largest wavelet coefficient of the signal in the unordered cash register or unordered aggregate model uses at least N/polylog(N) space. This holds even if: –You only care about the coefficient's existence, not its value –You only calculate up to a factor of 2

16 Proposed Algorithm: Overview Avoid keeping anything of domain size N in memory. Estimate wavelet coefficients using sketches, which are of size log(N). The sketch is maintained in memory and is updated as data entries stream in.

17 What’s a Sketch? Distortion Parameter: ε (epsilon). Failure Probability: δ (delta). Failure Threshold: η (eta). Original Signal: a. Random vector of {-1,+1}s: r. Seed for r: s. Atomic Sketch: dot product of a and r. Sketch: O(log(N/δ)/ε^2) atomic sketches. We use the same j to index the atomic sketch, seed, and random vector, so there are j atomic sketches in a sketch

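18 Updating a Sketch Cash Register: when item i arrives –Add r_i^j to the corresponding j-th atomic sketch, for each of the j atomic sketches. Aggregate: when the pair (i, a(i)) arrives –Add a(i)·r_i^j to the corresponding j-th atomic sketch, for each of the j atomic sketches. Use a generator that takes in the seed, which is of size log(N), to compute r_i^j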
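A minimal sketch of the two update rules, assuming that a cash register item adds one unit at its index and that an aggregate item carries the full value at its index. The class `Sketch` and the helper `sign` are hypothetical names. The helper hashes (seed, i) to ±1 and is only a stand-in for the seeded generator on the next slide: it is not the Reed Muller construction and carries no proven 4-wise independence.

```python
import hashlib


def sign(seed, i):
    """Stand-in (assumption) for the seeded generator: maps (seed, i) to -1 or +1."""
    digest = hashlib.blake2b(f"{seed}:{i}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


class Sketch:
    def __init__(self, num_atomic):
        self.seeds = list(range(num_atomic))   # seed s_j for atomic sketch j
        self.atomic = [0.0] * num_atomic       # running value of <a, r^j>

    def update_cash_register(self, i):
        # Item i arrives: a(i) grows by one unit, so <a, r^j> grows by r^j_i.
        for j, s in enumerate(self.seeds):
            self.atomic[j] += sign(s, i)

    def update_aggregate(self, i, value):
        # The pair (i, a(i)) arrives once: <a, r^j> grows by a(i) * r^j_i.
        for j, s in enumerate(self.seeds):
            self.atomic[j] += value * sign(s, i)


cash = Sketch(8)
for item in [3, 3, 5]:
    cash.update_cash_register(item)

agg = Sketch(8)
agg.update_aggregate(3, 2)
agg.update_aggregate(5, 1)

print(cash.atomic == agg.atomic)  # both streams describe the same signal
```

Only the seeds and the running sums are stored, so memory grows with the number of atomic sketches and not with the domain size N.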

19 Reed Muller Generator Pseudo random generator meeting these requirements: –Variables are 4-wise independent: the expected value of the product of any 4 distinct r is 0 –Requires O(log N) space for seeding –Performs computation in polylog(N) time [Diagram: subsets {0}, {d}, {c}, …, {d,c,b,a}]

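20 Estimation of Inner Product X = median of O(log(1/δ)) means, where each mean is taken over O(1/ε^2) independent atomic-sketch products [Diagram: groups of atomic-sketch products feeding the means, and the means feeding the median]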

21 Boosting Accuracy and Confidence Improve accuracy to ε by averaging over more independent copies of the basic estimate for each average: O(1/ε^2) copies per mean. Improve confidence by increasing the number of averages to take the median over: O(log(1/δ)) means. X = median of the means.
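The median-of-means step can be sketched as follows. The basic estimate is assumed here to be the product of the j-th atomic sketch of a and the j-th atomic sketch of b, whose expectation is the inner product of a and b. All function names are hypothetical, signals are passed as dictionaries from index to value, and `sign` is a hash-based stand-in for the seeded generator, not the Reed Muller construction.

```python
import hashlib
from statistics import mean, median


def sign(seed, i):
    """Stand-in (assumption) for the seeded +/-1 generator."""
    digest = hashlib.blake2b(f"{seed}:{i}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


def atomic_sketches(signal, num_atomic):
    """<signal, r^j> for j = 0 .. num_atomic-1; signal maps index -> value."""
    return [sum(v * sign(j, i) for i, v in signal.items()) for j in range(num_atomic)]


def median_of_means(values, num_groups):
    """Split values into num_groups equal groups, average each, take the median."""
    size = len(values) // num_groups   # requires len(values) >= num_groups
    means = [mean(values[g * size:(g + 1) * size]) for g in range(num_groups)]
    return median(means)


def estimate_inner_product(sketch_a, sketch_b, num_groups):
    products = [x * y for x, y in zip(sketch_a, sketch_b)]
    return median_of_means(products, num_groups)


a = {0: 2.0, 1: 2.0, 3: 2.0, 4: 3.0, 5: 5.0, 6: 4.0, 7: 4.0}
sk = atomic_sketches(a, 200)
print(estimate_inner_product(sk, sk, 5))   # an estimate of the energy of a
```

Larger groups tighten each mean (accuracy), and more groups make the median more robust (confidence).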

22 Using the sketches We can approximate <a, ψ_j> to maintain the top B coefficients. Note a point query is <a, e_i>, where e_i is a vector with a 1 at index i and 0s everywhere else. Atomic Sketches in memory
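To make the two inner products concrete, here is a small exact computation (no sketching) on the eight-value signal from the storage slide. It assumes un-normalized Haar vectors that are +1 on the first half of their support and -1 on the second half, with the coefficient obtained by dividing the inner product by the support width. The helper names are hypothetical. In the streaming setting these inner products would be estimated from the sketch instead.

```python
def haar_psi(n, width, k):
    """Un-normalized Haar vector of length n supported on block k of the given width."""
    vec = [0] * n
    start = k * width
    half = width // 2
    for t in range(half):
        vec[start + t] = 1           # first half of the support
        vec[start + half + t] = -1   # second half of the support
    return vec


def unit_vector(n, i):
    vec = [0] * n
    vec[i] = 1
    return vec


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


a = [2, 2, 0, 2, 3, 5, 4, 4]
print(dot(a, haar_psi(8, 8, 0)) / 8)   # coarsest detail coefficient: -1.25
print(dot(a, haar_psi(8, 4, 0)) / 4)   # detail over the first half: 0.5
print(dot(a, unit_vector(8, 5)))       # point query a(5): 5
```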

23 Maintaining Top B Coefficients Each new item causes at most log N + 1 coefficient updates. May need to approximate straddling coefficients to aggregate with already existing or near variables. Compare updates with top B and update top B if necessary. [Diagram labels: updated, unaffected]
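The log N + 1 bound follows from the tree structure: a change at index i only touches the overall average and the one detail coefficient per level whose support contains i. A minimal sketch, assuming N is a power of two and coefficients are numbered with 0 for the average, 1 for the coarsest detail, then level by level from left to right (the function name is hypothetical):

```python
def affected_coefficients(i, n):
    """Indices of the coefficients whose support contains position i."""
    indices = [0]        # the overall average always changes
    block = n            # support width at the current level
    first = 1            # index of the first detail coefficient at this level
    while block >= 2:
        indices.append(first + i // block)
        first *= 2
        block //= 2
    return indices


print(affected_coefficients(5, 8))   # [0, 1, 3, 6]
```

For N = 8 that is 4 = log N + 1 coefficients; the other four are unaffected by an update at index 5.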

24 Algorithm Space and Time Their algorithm uses polylog(N) space and per item time to maintain B terms (by approximation)

25 Experiments Data: one week of AT&T call detail (unordered cash register model). Modes –Batch: query only between intervals –Online: query anytime. Direct Point: estimate <a, e_i> from the sketch (e_i is the zero vector except for a 1 at index i). Direct Wavelets: estimate all supporting coefficients and use wavelet reconstruction to calculate point a(i). Top B: reconstruction of the point is done with the Top B coefficients (maintained by sketch)

26 Top B – Day 0

27 Top B - 1 Week (fixed-set) Value updates only. no replacement

28 Sketch Size on Accuracy

29 Heavy Hitters Points that contribute significantly to the energy of the signal. Direct point estimates are very accurate for heavy hitters but poor for non heavy hitters. Adaptive Greedy Pursuit: by removing the first heavy hitter from the signal, you improve the accuracy of calculating the next biggest heavy hitter. However, an error is introduced with each subtraction of a heavy hitter.
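The subtraction step relies on the sketch being linear: removing an estimated heavy hitter from the signal amounts to subtracting its contribution from every atomic sketch. The following is an illustrative sketch only, not the procedure from the paper. It assumes a plain mean over atomic sketches as the point estimate, scans all n indices for clarity (which takes time linear in the domain size), and uses a hash-based stand-in `sign` for the seeded generator. All names are hypothetical.

```python
import hashlib
from statistics import mean


def sign(seed, i):
    """Stand-in (assumption) for the seeded +/-1 generator."""
    digest = hashlib.blake2b(f"{seed}:{i}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


def sketch_of(signal, num_atomic):
    return [sum(v * sign(j, i) for i, v in enumerate(signal)) for j in range(num_atomic)]


def point_estimate(sketch, i):
    # <a, e_i> is estimated by averaging (atomic sketch j) * r^j_i over j.
    return mean(x * sign(j, i) for j, x in enumerate(sketch))


def greedy_pursuit(sketch, n, rounds):
    residual = list(sketch)
    picked = []
    for _ in range(rounds):
        estimates = [point_estimate(residual, i) for i in range(n)]
        best = max(range(n), key=lambda i: abs(estimates[i]))
        value = estimates[best]
        picked.append((best, value))
        # Linearity: removing value * e_best from the signal is a subtraction here.
        residual = [x - value * sign(j, best) for j, x in enumerate(residual)]
    return picked


signal = [1.0] * 16
signal[3] = 50.0
signal[9] = 30.0
print(greedy_pursuit(sketch_of(signal, 64), 16, 2))
```

Each picked value is itself an estimate, so the subtraction carries its error into the residual, which is the drawback noted on the slide.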

30 Processing Heavy Hitters Adaptive Greedy Pursuit

31 End Notes First provable guarantees for Haar wavelets over data streams. Can estimate Haar coefficients c_i = <a, ψ_i>. Top B is updated in polylog(N) time per item. This paper is superseded by "Fast, Small-space algorithms for approximate histogram maintenance" STOC 2002 –Discusses how to select top B and find heavy hitters

