Fast Approximate Wavelet Tracking on Streams Graham Cormode Minos Garofalakis Dimitris Sacharidis


outline
introduction
– motivation
– problem formulation
background
– wavelet synopses
– the AMS sketch
the GCS algorithm
– our approach
– the Group Count Sketch
– finding L2-heavy items
– sketching the wavelet domain
experimental results
conclusions

motivation
numerous emerging data management applications need to continuously generate, process and analyze massive amounts of data
– e.g. continuous event monitoring applications: network-event tracking in ISPs, transaction-log monitoring in large web-server farms
the data streaming paradigm
– large volumes (~Terabytes/day) of monitoring data arrive at high rates and need to be processed on-line
analysis in data streaming scenarios relies on building and maintaining approximate synopses in real time and in one pass over the streaming data
– synopses require small space to summarize key features of the streaming data
– they provide approximate query answers with quality guarantees

problem formulation
our focus: maintain a wavelet synopsis over data streams
algorithmic requirements:
– small memory footprint (sublinear in data size)
– fast per stream-item processing time (sublinear in required memory)
– fast query time (sublinear in data size)
– quality guarantees on query answers
stream processing model:
– assume a data vector a of size N
– stream items are of the form (i, ±u), denoting a net change of ±u in the a[i] entry: a[i] := a[i] ± u
– important: items are seen only once, in the fixed order of arrival, and do not come ordered in i
– interpretation: u insertions/deletions of the i-th entry (we also allow entries to take negative values)
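
The update semantics of the stream model can be illustrated with a minimal Python sketch. The function name `apply_stream` and the sample stream are hypothetical, chosen only for illustration. The sketch materializes the vector explicitly, which is exactly what a streaming algorithm cannot afford; it only pins down what a sequence of (i, ±u) items means.

```python
from collections import defaultdict

def apply_stream(updates):
    """Apply (i, delta) items, where delta is +u or -u, and return the implied vector.

    Illustration only: storing the vector takes linear space, which the
    streaming setting rules out.
    """
    a = defaultdict(int)
    for i, delta in updates:
        a[i] += delta          # a[i] := a[i] +/- u
    return dict(a)

# Items arrive in arbitrary index order; entries may become negative.
stream = [(3, +2), (0, +1), (3, -5), (7, +4)]
vector = apply_stream(stream)
```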

wavelet synopses
the (Haar) wavelet decomposition hierarchically decomposes a data vector
– for every pair of consecutive values, compute the average and the semi-difference (a.k.a. detail) values (coefficients)
– iteratively repeat on the lower-resolution data consisting of only the averages
– the final decomposition is the overall average plus all details
to obtain the optimal (in the sum-squared-error sense) wavelet synopsis, keep only the coefficients with the highest absolute normalized values
– implicitly set other coefficients to zero
easily extendable to multiple dimensions
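
A small offline sketch of the decomposition and of top-B thresholding, in Python. The names `haar_decompose` and `top_b_synopsis` are hypothetical. Two assumptions are made that the slides do not spell out: the semi-difference is taken as (left - right)/2, and a coefficient's normalized value is its absolute value times the square root of its support size.

```python
import math

def haar_decompose(data):
    """Return [overall average, details from coarse to fine]; len(data) must be a power of two."""
    n = len(data)
    assert n > 0 and n & (n - 1) == 0
    current = [float(v) for v in data]
    details = []
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        level_details = [(x - y) / 2 for x, y in pairs]   # semi-differences
        current = [(x + y) / 2 for x, y in pairs]         # lower-resolution averages
        details = level_details + details                 # coarser levels go first
    return current + details

def top_b_synopsis(coeffs, b):
    """Keep the b coefficients with the largest normalized magnitude; the rest are implicitly zero."""
    n = len(coeffs)

    def normalized(k):
        # Coefficient k covers n entries (k = 0 or 1) or n / 2^level entries.
        support = n if k == 0 else n >> (k.bit_length() - 1)
        return abs(coeffs[k]) * math.sqrt(support)

    keep = sorted(range(n), key=normalized, reverse=True)[:b]
    return {k: coeffs[k] for k in sorted(keep)}

coeffs = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
synopsis = top_b_synopsis(coeffs, 2)
```

With this input the two retained coefficients are the overall average and the coarsest detail.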

the AMS sketch (1/2)
the AMS sketch is a powerful data stream synopsis structure serving as the building block in a variety of applications
– e.g. estimating (multi-way) join size, constructing histograms and wavelet synopses, finding frequent items and quantiles
it consists of O(1/ε²) × O(log(1/δ)) atomic sketches
an atomic AMS sketch X of a is a randomized linear projection
– X = <a, ξ> = Σ_i a[i]·ξ(i), where ξ denotes a random vector of four-wise independent random variables {±1}
– the random variables can be generated from a seed of just O(log N) bits, using standard pseudo-random hash functions
X is updated as stream updates (i, ±u) arrive: X := X ± u·ξ(i)

the AMS sketch (2/2)
the AMS sketch estimates the L2 norm (energy) of a
– let Z be the O(log(1/δ))-wise median of O(1/ε²)-wise means of the squares of independent atomic AMS sketches
– then Z estimates ||a||² within ±ε||a||² (w.h.p. ≥ 1-δ)
– it can also estimate inner products
an improvement: the fast AMS sketch
– introducing a level of hashing reduces update time by a factor of O(1/ε²) while providing the same guarantees and requiring the same space
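
The median-of-means estimator over atomic sketches can be sketched in Python as follows. This is the basic (not the fast) AMS sketch, and several details are assumptions made for illustration: the class name `AMSSketch`, the parameters `means` and `medians` standing in for the O(1/ε²) and O(log(1/δ)) factors, and the four-wise independent ±1 variables realized as the low bit of a random degree-3 polynomial modulo a Mersenne prime (the low bit is very slightly biased, which is ignored here).

```python
import random
import statistics

P = (1 << 61) - 1  # Mersenne prime used as the hash field (an assumption)

class AMSSketch:
    def __init__(self, means, medians, seed=0):
        rng = random.Random(seed)
        self.means, self.medians = means, medians
        # One random cubic polynomial per atomic sketch: a four-wise independent family.
        self.coeffs = [[rng.randrange(P) for _ in range(4)]
                       for _ in range(means * medians)]
        self.x = [0] * (means * medians)   # the atomic sketches X

    def _xi(self, k, i):
        c0, c1, c2, c3 = self.coeffs[k]
        h = (((c3 * i + c2) * i + c1) * i + c0) % P
        return 1 if h & 1 else -1

    def update(self, i, delta):
        # X := X +/- u * xi(i); note the cost is linear in the sketch size.
        for k in range(len(self.x)):
            self.x[k] += delta * self._xi(k, i)

    def estimate_energy(self):
        # Median over `medians` groups of the mean of `means` squared atomic sketches.
        squares = [v * v for v in self.x]
        groups = [squares[g * self.means:(g + 1) * self.means]
                  for g in range(self.medians)]
        return statistics.median(sum(g) / len(g) for g in groups)

sketch = AMSSketch(means=16, medians=5)
sketch.update(5, 3)
single_item_energy = sketch.estimate_energy()   # one nonzero entry: X = +/-3 everywhere
sketch.update(5, -3)
empty_energy = sketch.estimate_energy()         # linearity: the deletion cancels exactly
```

The `update` loop touches every atomic sketch, which is the per-item cost that the fast AMS sketch reduces with an extra level of hashing.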

our approach (1/3)
two shortcomings of the existing approach [GKMS] (using AMS sketches):
1. updating the sketch requires O(|sketch|) updates per streaming item
2. querying for the largest coefficients requires superlinear Ω(N log N) time (even when using range-summable random variables)
– blows up in the multi-dimensional case
can we fix it?
– use the fast-AMS sketch to speed up update time (not enough)
– we introduce the GCS algorithm, which satisfies all algorithmic requirements and makes summarizing large multi-dimensional streams feasible

streaming requirements   GKMS   fast-GKMS   GCS
small space              ✓      ✓           ✓
fast update time         ×      ✓           ✓
fast query time          ×      ×           ✓

our approach (2/3)
the GCS algorithm relies on two ideas:
(1) sketch the wavelet domain
(2) quickly identify large coefficients
(1) is easy to accomplish: translate updates in the original domain to updates in the wavelet domain
– only polylogarithmically more updates are required, even in the multi-dimensional case
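
For the one-dimensional case, the translation of a single update into the wavelet domain can be sketched in Python. The function name `wavelet_updates` is hypothetical, and the sketch assumes unnormalized Haar coefficients with semi-differences taken as (left - right)/2, indexed with the overall average at position 0 and details from coarse to fine. One update to a[i] touches only the overall average and one detail per level, log2(N) + 1 coefficients in total.

```python
def wavelet_updates(i, delta, n):
    """Translate the update a[i] += delta into (coefficient index, change) pairs."""
    assert n > 0 and n & (n - 1) == 0 and 0 <= i < n
    updates = [(0, delta / n)]       # the overall average shifts by delta / n
    support = n                      # support size of details at the current level
    first = 1                        # index of the first detail at the current level
    while support > 1:
        k = first + i // support     # the one detail whose support contains i
        # Left half of the support raises the detail, right half lowers it.
        sign = 1 if (i % support) < support // 2 else -1
        updates.append((k, sign * delta / support))
        first *= 2
        support //= 2
    return updates

changes = wavelet_updates(3, 8, 8)
```

Each resulting pair can then be fed to a sketch of the coefficient vector instead of a sketch of the original data.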

our approach (3/3)
for (2) we would like to perform a binary-search-like procedure
– enforce a hierarchical grouping on coefficients
– prune groups of coefficients that are not L2-heavy, as they cannot contain L2-heavy coefficients
– only the remaining groups need to be examined more closely
– iteratively keep pruning until you reach singleton groups
but how do we estimate the L2 norm (energy) of groups of coefficients?
– this is a difficult task, requiring a novel technical result
– more difficult than finding frequent items!
enter the group count sketch
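
The control flow of the binary-search-like pruning can be sketched in Python. To keep the example short, group energies are computed exactly from a stored vector; this is a stand-in for the sketch-based group estimate and is not how a streaming algorithm would obtain them. The function name `find_heavy` and the dyadic (degree-2) grouping are assumptions for illustration.

```python
def find_heavy(values, threshold):
    """Return the indices i with values[i]**2 >= threshold, pruning light groups."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0

    def energy(lo, hi):
        # Exact group energy; a streaming algorithm would query a sketch here.
        return sum(v * v for v in values[lo:hi])

    heavy, groups = [], [(0, n)]
    while groups:
        lo, hi = groups.pop()
        if energy(lo, hi) < threshold:
            continue                  # a light group cannot contain a heavy item
        if hi - lo == 1:
            heavy.append(lo)          # singleton group that survived pruning
        else:
            mid = (lo + hi) // 2
            groups += [(lo, mid), (mid, hi)]
    return sorted(heavy)

heavy_items = find_heavy([1, 0, 9, 0, 0, 2, 0, -8], threshold=50)
```

Pruning is safe because a group's energy is the sum of its members' energies, so a group below the threshold has no member at or above it.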

group count sketch (1/2)
the group count sketch (GCS) consists of b buckets, each having c sub-buckets, repeated t times
– this gives a total of t·b·c counters, s[1][1][1] through s[t][b][c]
goal: estimate the L2 norm of all k groups forming a partition of the domain of a
update the sketch per stream element (i, ±u); repeat t times:
– get the item's group -> hash into a bucket -> hash into a sub-bucket -> update the counter by {±1}·u
notation (for repetition m):
– id identifies the group of an item
– h_m hashes groups into buckets
– f_m hashes items into sub-buckets
– ξ_m are four-wise independent random variables {±1}

group count sketch (2/2)
estimate the L2 norm (energy) of group g:
– return the median over the t instances of Σ_j (s[m][h_m(g)][j])², summing over all j in [c]
– estimates are (w.h.p. ≥ 1-δ) within additive error ε·||a||²
analysis results:
– t = O(log(1/δ)), b = O(1/ε), c = O(1/ε²)
– space: O(1/ε³ · log(1/δ)) counters
– update cost: O(log(1/δ))
– query cost: O(1/ε² · log(1/δ))
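
The update and query procedures of the group count sketch can be put together in a minimal Python sketch. The class name `GroupCountSketch`, the `group_of` callback (playing the role of id), and the hash construction are assumptions: the slides only state that the ±1 variables are four-wise independent, so the pairwise-independent polynomial hashes used here for h_m and f_m are a choice made for this illustration.

```python
import random
import statistics

P = (1 << 61) - 1  # Mersenne prime used as the hash field (an assumption)

def _poly_hash(coeffs, x):
    """Evaluate a random polynomial at x modulo P (Horner's rule)."""
    h = 0
    for c in coeffs:
        h = (h * x + c) % P
    return h

class GroupCountSketch:
    def __init__(self, t, b, c, group_of, seed=0):
        rng = random.Random(seed)

        def draw(degree_plus_one):
            return [[rng.randrange(1, P) for _ in range(degree_plus_one)]
                    for _ in range(t)]

        self.t, self.b, self.c, self.group_of = t, b, c, group_of
        self.h = draw(2)     # h_m: group -> bucket
        self.f = draw(2)     # f_m: item -> sub-bucket
        self.xi = draw(4)    # xi_m: item -> {+1, -1}, four-wise independent
        self.s = [[[0] * c for _ in range(b)] for _ in range(t)]  # t*b*c counters

    def _bucket(self, m, g):
        return _poly_hash(self.h[m], g) % self.b

    def update(self, i, delta):
        g = self.group_of(i)
        for m in range(self.t):          # one counter per repetition
            sub = _poly_hash(self.f[m], i) % self.c
            sign = 1 if _poly_hash(self.xi[m], i) & 1 else -1
            self.s[m][self._bucket(m, g)][sub] += sign * delta

    def estimate_group_energy(self, g):
        # Median over repetitions of the sum of squared sub-bucket counters.
        return statistics.median(
            sum(v * v for v in self.s[m][self._bucket(m, g)])
            for m in range(self.t)
        )

gcs = GroupCountSketch(t=5, b=8, c=16, group_of=lambda i: i // 4)
gcs.update(10, 3)                              # item 10 belongs to group 2
one_item = gcs.estimate_group_energy(2)
gcs.update(10, -3)
after_delete = gcs.estimate_group_energy(2)
```

An update touches one counter per repetition, while a query reads the c sub-buckets of one bucket per repetition, matching the stated update and query costs.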

finding L2-heavy items
keep one GCS per level of the hierarchy
– space and update time complexities increase (roughly) by a factor of log N
query: find all items with L2 greater than φ||a||²
– query time increases by 1/φ·log N (1/φ L2-heavy items per level)
– w.h.p. we get all items with L2 greater than (φ+ε)||a||²
– w.h.p. we get no items with L2 less than (φ-ε)||a||²

sketching the wavelet domain
the GCS algorithm:
– translate updates into the wavelet domain
– maintain log_r N group count sketches
– find L2-heavy coefficients with energy above φ||a||²
note: changing the degree (r) of the search tree allows for a query-update time trade-off
but what should the threshold φ be?
– assume the data satisfies the small-B property: there is a B-term synopsis with energy at least η||a||²
– setting φ = εη/B, we obtain a synopsis (with no more than B coefficients) with energy at least (1-ε)η||a||²
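
A short derivation suggests why the threshold φ = εη/B suffices. It is a simplified sketch under an idealizing assumption that is not in the slides: every coefficient with energy at least φ||a||² is found exactly, ignoring the ε slack of the sketch estimates. Here S* denotes the B-term synopsis promised by the small-B property and F the set of coefficients found (both symbols are introduced only for this argument). Each coefficient of S* missing from F has energy below φ||a||², and at most B coefficients can be missing:

```latex
\sum_{c \in S^* \cap F} c^2
  \;\ge\; \eta\,\|a\|^2 - |S^* \setminus F| \cdot \phi\,\|a\|^2
  \;\ge\; \eta\,\|a\|^2 - B \cdot \frac{\varepsilon\eta}{B}\,\|a\|^2
  \;=\; (1-\varepsilon)\,\eta\,\|a\|^2 .
```

Keeping the B largest coefficients of F retains at least as much energy as S* ∩ F, since that set has at most B members.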

experiments
[plots: update and query time vs domain size]
– all methods are given the same space
– GCS-r is GCS with a search tree of degree 2^r

experiments
[plots: two-dimensional update and query time for both wavelet decomposition forms; panels: standard form, non-standard form]

experiments
[plots: multi-dimensional update and query time for both wavelet decomposition forms; S: standard, NS: non-standard]

conclusions
the GCS algorithm allows for efficient tracking of wavelet synopses over multi-dimensional data streams
the Group Count Sketch satisfies all streaming requirements:
– small polylog space
– fast polylog update time
– fast polylog query time
– approximate answers with quality guarantees
future research directions:
– other error metrics
– histograms

thank you!

experiments
[plots: update and query time vs sketch size]
– GCS-r is GCS with a search tree of degree 2^r