Processing Data-Stream Joins Using Skimmed Sketches Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.

Presentation transcript:

Processing Data-Stream Joins Using Skimmed Sketches Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work with Sumit Ganguly and Rajeev Rastogi (Bell Labs)

2 Talk Outline
• Introduction & Basic Stream Computation Model
• Basic Sketching for Binary Joins
• The Problems with Basic Sketching
• Our Solution
  – Sketch Skimming
  – Hash Sketches
• Experimental Study
• Conclusions

3 Data-Stream Management
• Traditional DBMS – data stored in finite, persistent data sets
• Data Streams – distributed, continuous, unbounded, rapid, time varying, noisy, ...
• Data-Stream Management – variety of modern applications
  – Network monitoring and traffic engineering
  – Telecom call-detail records
  – Network security
  – Financial applications
  – Sensor networks
  – Manufacturing processes
  – Web logs and clickstreams
  – Massive data sets

4 Data-Stream Processing Model
• Approximate answers often suffice, e.g., trend analysis, anomaly detection
• Requirements for stream synopses
  – Single Pass: Each record is examined at most once, in (fixed) arrival order
  – Small Space: Log or polylog in data stream size
  – Real-time: Per-record processing time (to maintain synopses) must be low
  – Delete-Proof: Can handle record deletions as well as insertions
[Figure: continuous data streams R and S (gigabytes) feed a stream processing engine that maintains in-memory stream synopses (kilobytes) and answers a query AGG(R ⋈ S) with an approximate answer with error guarantees, e.g., “within 2% of exact answer with high probability”.]

5 Synopses for Relational Streams
• Conventional data summaries fall short
  – Quantiles and 1-d histograms [MRL98,99], [GK01], [GKMS02]
      • Cannot capture attribute correlations
      • Little support for approximation guarantees
  – Samples (e.g., using Reservoir Sampling)
      • Perform poorly for joins [AGMS99] or distinct values [CCMN00]
      • Cannot handle deletion of records
  – Multi-d histograms/wavelets
      • Construction requires multiple passes over the data
• Different approach: Pseudo-random sketch synopses
  – Only logarithmic space
  – Probabilistic guarantees on the quality of the approximate answer
  – Support insertion as well as deletion of records

6 Linear-Projection (aka AMS) Sketch Synopses
• Goal: Build small-space summary for distribution vector f(i) (i = 1, ..., M) seen as a stream of i-values
  – Data stream: 3, 1, 2, 4, 2, 3, 5, ... (for the seven values shown: f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1)
• Basic Construct: Randomized linear projection of f(), i.e., project onto the inner/dot product of the f-vector with a random vector:
  $\langle f, \xi \rangle = \sum_i f(i)\,\xi_i$, where $\xi$ = vector of random values from an appropriate distribution
  – Simple to compute over the stream: Add $\xi_i$ whenever the i-th value is seen
  – Generate $\xi_i$'s in small (log M) space using pseudo-random generators
  – Tunable probabilistic guarantees on approximation error
  – Delete-Proof: Just subtract $\xi_i$ to delete an i-th value occurrence
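The update rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `AtomicSketch` and its methods are hypothetical, and the ±1 values come from a seeded general-purpose PRNG rather than the limited-independence generator the talk has in mind.

```python
import random


class AtomicSketch:
    """One linear-projection counter X = sum_i f(i) * xi_i (hypothetical helper)."""

    def __init__(self, seed):
        self.seed = seed
        self.x = 0

    def xi(self, i):
        # Assumption: a PRNG seeded with (seed, i) stands in for the random
        # vector xi; it returns the same +1/-1 every time value i is seen.
        return 1 if random.Random(f"{self.seed}:{i}").random() < 0.5 else -1

    def insert(self, i):
        self.x += self.xi(i)   # add xi_i when the i-th value arrives

    def delete(self, i):
        self.x -= self.xi(i)   # subtract xi_i to undo one occurrence


sketch = AtomicSketch(seed=7)
for value in [3, 1, 2, 4, 2, 3, 5]:
    sketch.insert(value)
# sketch.x now equals xi_1 + 2*xi_2 + 2*xi_3 + xi_4 + xi_5 for this seed.
```

Because the counter is a linear function of f, deleting a value simply reverses the corresponding insertion, which is what makes the synopsis delete-proof.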

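7 Binary-Join COUNT Query
• Problem: Compute answer for the query $\mathrm{COUNT}(R \bowtie_A S)$
• Example: $\mathrm{COUNT}(R \bowtie_A S) = \sum_i f_R(i)\, f_S(i)$, which equals 10 for the example streams R.A and S.A shown on the slide (the stream values are not preserved in this transcript)
• Exact solution: too expensive, requires O(N) space!
  – M = sizeof(domain(A))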

8 Basic AMS Sketching Technique [AMS96]
• Key Intuition: Use randomized linear projections of f() to define random variable X such that
  – X is easily computed over the stream (in small space)
  – $E[X] = \mathrm{COUNT}(R \bowtie_A S)$
  – Var[X] is small, which gives probabilistic error guarantees (e.g., actual answer is 10 ± 1 with probability 0.9)
• Basic Idea:
  – Define a family of 4-wise independent {-1, +1} random variables $\{\xi_i : i = 1, \ldots, M\}$
  – $\Pr[\xi_i = +1] = \Pr[\xi_i = -1] = 1/2$
      • Expected value of each $\xi_i$: $E[\xi_i] = 0$
  – Variables $\{\xi_i\}$ are 4-wise independent
      • Expected value of product of 4 distinct $\xi_i$ = 0
  – Variables $\{\xi_i\}$ can be generated using pseudo-random generator using only O(log M) space (for seeding)!
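One common way to obtain such a family is to evaluate a random degree-3 polynomial over a prime field and keep one bit of the result. The sketch below assumes that construction; the talk does not say which generator it uses, and the names `FourWiseSign` and `P` are hypothetical. Taking the low bit of a value modulo an odd prime is very slightly biased, so this is an approximation of an exactly uniform ±1 variable.

```python
import random

P = 2**61 - 1  # a Mersenne prime, assumed larger than the domain size M


class FourWiseSign:
    """Maps each domain value i to +1 or -1 via a random cubic polynomial mod P."""

    def __init__(self, rng):
        # The four coefficients are the whole "seed": O(log P) bits each.
        self.coeffs = [rng.randrange(P) for _ in range(4)]

    def __call__(self, i):
        acc = 0
        for c in reversed(self.coeffs):   # Horner evaluation of the polynomial
            acc = (acc * i + c) % P
        return 1 if acc & 1 else -1       # low bit decides the sign


xi = FourWiseSign(random.Random(1))
signs = [xi(i) for i in range(1, 11)]     # ten values, each +1 or -1
```

Only the four coefficients need to be stored, so each sign can be recomputed on demand instead of materializing a length-M vector.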

9 AMS Sketch Construction
• Compute random variables: $X_R = \sum_i f_R(i)\,\xi_i$ and $X_S = \sum_i f_S(i)\,\xi_i$
  – Simply add $\xi_i$ to $X_R$ ($X_S$) whenever the i-th value is observed in R.A (S.A)
• Define $X = X_R X_S$ to be estimate of COUNT query
• $E[X] = \mathrm{COUNT}(R \bowtie_A S)$, $\mathrm{Var}[X] \le 2 \cdot SJ(R) \cdot SJ(S)$
  – $SJ(R) = \sum_i f_R(i)^2$ is the self-join size of R

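10 Summary of Binary-Join AMS Sketching
• Step 1: Compute random variables: $X_R = \sum_i f_R(i)\,\xi_i$ and $X_S = \sum_i f_S(i)\,\xi_i$
• Step 2: Define $X = X_R X_S$
• Steps 3 & 4: Average independent copies of X; Return median of averages
  [Figure: independent copies x of X are averaged in groups to give averages y; the median is taken over $2\log(1/\delta)$ such averages.]
• Main Theorem (AGMS99): Sketching approximates COUNT to within a relative error of $\epsilon$ with probability $\ge 1 - \delta$ using space for $O\!\left(\frac{SJ(R)\cdot SJ(S)\cdot \log(1/\delta)}{\epsilon^{2}\cdot \mathrm{COUNT}^{2}}\right)$ independent copies of X
  – Remember: O(log M) space for “seeding” the construction of each X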
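The four steps can be put together as follows. This is a minimal sketch, assuming a polynomial-hash sign family and hypothetical names (`JoinSketch`, `make_sign`, `s1` for copies per average, `s2` for the number of averages); it is not the authors' implementation.

```python
import random
from statistics import mean, median

P = 2**61 - 1  # prime modulus for the assumed polynomial hash


def make_sign(rng):
    """Return a function i -> +1/-1 built from a random cubic polynomial."""
    a = [rng.randrange(P) for _ in range(4)]

    def xi(i):
        h = (((a[3] * i + a[2]) * i + a[1]) * i + a[0]) % P
        return 1 if h & 1 else -1

    return xi


class JoinSketch:
    """s1 * s2 atomic sketches per stream; both streams share the xi families."""

    def __init__(self, s1, s2, seed=0):
        rng = random.Random(seed)
        self.s1, self.s2 = s1, s2
        self.signs = [make_sign(rng) for _ in range(s1 * s2)]
        self.xr = [0] * (s1 * s2)
        self.xs = [0] * (s1 * s2)

    def update_r(self, i, delta=1):
        # Every atomic sketch is touched on every arrival (delta=-1 deletes).
        for k, xi in enumerate(self.signs):
            self.xr[k] += delta * xi(i)

    def update_s(self, i, delta=1):
        for k, xi in enumerate(self.signs):
            self.xs[k] += delta * xi(i)

    def estimate(self):
        prods = [a * b for a, b in zip(self.xr, self.xs)]      # X = X_R * X_S
        avgs = [mean(prods[g * self.s1:(g + 1) * self.s1])     # step 3
                for g in range(self.s2)]
        return median(avgs)                                    # step 4


sk = JoinSketch(s1=4, s2=3, seed=1)
for _ in range(6):
    sk.update_r(7)
for _ in range(5):
    sk.update_s(7)
print(sk.estimate())  # both streams hold only the value 7, so every copy is 6 * 5
```

Note that `update_r` and `update_s` loop over all `s1 * s2` counters, so the per-element cost grows with the synopsis size.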

11 Problems with Basic Sketching
• Accurate estimates only for large joins (wrt self-join product)
  – Lower bound [AGMS99]: Any technique for estimating a join of size J requires at least $\Omega(N^2/J)$ space
      • N is the number of stream tuples
  – BUT the worst-case space requirement of basic sketching is $O(N^4/J^2)$
      • Each self-join is $O(N^2)$ in the worst case
      • Quite far from the AGMS lower bound!
• Another important problem: Sketch-update time
  – Time per stream element is proportional to total synopsis size
      • Must update every atomic sketch on each arrival
  – Problematic for rapid-rate data streams!

12 Our Solution: Skimmed Sketches
• Solves both problems of basic sketching for data-stream joins
• First streaming method to
  – Match the AGMS lower bound for join-size estimation
  – Guarantee small, logarithmic-time updates per stream element
• Extends naturally to other aggregates, multi-joins, multiple queries, etc.
  – Essentially gives the same guarantees as basic sketching using only the square root of the synopsis space and log-time updates!
• Two key technical ideas
  – Sketch skimming
  – Hash sketches

13 Sketch Skimming
• Remember: Variance is proportional to product of self-join sizes
• Key Idea: Skim large (“dense”) frequencies away from the sketches built for R and S (with high probability)
  – i is “dense” in R iff $f_R(i) \ge T$ (appropriately-defined threshold T)
  – Use extracted frequencies directly to estimate the “dense-dense” sub-join
  – Use left-over “skimmed” sketches for the other sub-joins
  – Residual frequencies left in the skimmed sketches are small (“sparse”)
      • Small self-join sizes => Improved accuracy/space!
• Discover dense frequencies efficiently using dyadic intervals
  – “Binary search” over log M dyadic levels
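The dyadic “binary search” can be illustrated with exact per-interval counts standing in for the sketch-based estimates used in the talk. That substitution is an assumption made for clarity, as are the 0-based domain 0 .. 2**log_m - 1 and the names `dyadic_counts` and `find_dense`.

```python
from collections import Counter


def dyadic_counts(stream, log_m):
    """levels[l][j] = number of stream values v with v >> l == j."""
    levels = [Counter() for _ in range(log_m + 1)]
    for v in stream:
        for l in range(log_m + 1):
            levels[l][v >> l] += 1
    return levels


def find_dense(levels, log_m, threshold):
    """Descend from the whole domain, keeping only intervals with count >= threshold."""
    candidates = [0]                      # the single interval at the top level
    for l in range(log_m, 0, -1):
        children = []
        for j in candidates:
            if levels[l][j] >= threshold:
                children.extend((2 * j, 2 * j + 1))   # split into two halves
        candidates = children
    return sorted(j for j in candidates if levels[0][j] >= threshold)


stream = [5] * 50 + [12] * 30 + list(range(16)) * 2
levels = dyadic_counts(stream, log_m=4)
print(find_dense(levels, log_m=4, threshold=20))  # [5, 12]
```

An interval whose total count is below the threshold cannot contain a dense value when all frequencies are non-negative, so whole subtrees are pruned and only a few intervals per level are examined.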

14 Sketch Skimming (contd.)
• Find large frequencies (using variant of [CCF02]) and skim them from the sketches
• Estimate “dense-dense” directly from the extracted dense frequencies
• Estimate “dense-sparse” combinations from the extracted dense frequencies of one stream and the skimmed sketch of the other
• Estimate “sparse-sparse” from the skimmed sketches
  – Self-join sizes for residual vectors are much smaller!
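The decomposition behind skimming can be checked on exact frequency vectors. The example below is illustrative only: it splits known frequencies with a hypothetical threshold, whereas the actual method extracts the dense frequencies from the sketches and estimates the sparse sub-joins from the skimmed sketches. The helper names `skim`, `dot` and `self_join` are assumptions.

```python
def skim(freq, threshold):
    """Split a frequency map into dense (>= threshold) and sparse parts."""
    dense = {i: c for i, c in freq.items() if c >= threshold}
    sparse = {i: c for i, c in freq.items() if c < threshold}
    return dense, sparse


def dot(f, g):
    """Join size of two frequency maps: sum_i f(i) * g(i)."""
    return sum(c * g.get(i, 0) for i, c in f.items())


def self_join(f):
    return dot(f, f)


f_r = {1: 100, 2: 3, 3: 2, 4: 1}
f_s = {1: 80, 2: 1, 3: 4, 5: 2}
dense_r, sparse_r = skim(f_r, threshold=10)
dense_s, sparse_s = skim(f_s, threshold=10)

parts = (dot(dense_r, dense_s) + dot(dense_r, sparse_s)
         + dot(sparse_r, dense_s) + dot(sparse_r, sparse_s))
print(parts, dot(f_r, f_s))                   # 8011 8011
print(self_join(f_r), self_join(sparse_r))    # 10014 14
```

The four sub-joins add up to the full join, and the residual self-join size drops from 10014 to 14 here, which is why sketches of the residual vectors need far less space for the same accuracy.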

15 Hash Sketches
• Key Idea: Organize atomic sketches for each stream in hash tables, with one sketch per bucket (one random family/table)
  – Each element only updates the sketch for the bucket it hashes into
• For join-size estimation: Join corresponding buckets for each table pair in the two streams and add across the table; Take median across tables
  – Similar accuracy guarantees with only logarithmic update cost
[Figure: a stream element e is hashed by h1(e), h2(e), h3(e), h4(e) into one bucket of each of four hash tables.]
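A hash sketch along these lines might look as follows. This is a sketch under assumptions: polynomial hashing is used for both the bucket choice and the sign, and the names `HashSketch`, `poly_hash` and `estimate_join` are hypothetical. The two streams must be sketched with the same seed so that corresponding buckets line up.

```python
import random
from statistics import median

P = 2**61 - 1  # prime modulus for the assumed polynomial hashes


def poly_hash(rng, degree):
    """Random polynomial of the given degree over the integers mod P."""
    a = [rng.randrange(P) for _ in range(degree + 1)]

    def h(i):
        acc = 0
        for c in reversed(a):
            acc = (acc * i + c) % P
        return acc

    return h


class HashSketch:
    def __init__(self, tables, buckets, seed=0):
        rng = random.Random(seed)
        self.buckets = buckets
        self.bucket_hash = [poly_hash(rng, 1) for _ in range(tables)]
        self.sign_hash = [poly_hash(rng, 3) for _ in range(tables)]
        self.counts = [[0] * buckets for _ in range(tables)]

    def update(self, i, delta=1):
        # One bucket per table is touched, independent of the bucket count.
        for t, row in enumerate(self.counts):
            b = self.bucket_hash[t](i) % self.buckets
            sign = 1 if self.sign_hash[t](i) & 1 else -1
            row[b] += delta * sign


def estimate_join(r, s):
    """Join corresponding buckets, add within each table, take the median."""
    per_table = [sum(x * y for x, y in zip(row_r, row_s))
                 for row_r, row_s in zip(r.counts, s.counts)]
    return median(per_table)


r = HashSketch(tables=5, buckets=8, seed=3)
s = HashSketch(tables=5, buckets=8, seed=3)
for _ in range(4):
    r.update(42)
for _ in range(7):
    s.update(42)
print(estimate_join(r, s))  # only the value 42 occurs, so every table gives 4 * 7
```

The update cost depends on the number of tables only, while the number of buckets per table controls accuracy.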

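16 Main Result
• Our Skimmed-Sketches method approximates COUNT to within a relative error of $\epsilon$ with probability $\ge 1 - \delta$, using logarithmic time per stream element and space on the order of $N^2/\mathrm{COUNT}$ (ignoring logarithmic factors and the dependence on $\epsilon$ and $\delta$)
• Matches the lower bound of [AGMS99] to within log and constant factors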

17 Experimental Study
• Compare our skimmed-sketches technique against the basic AGMS method for stream joins
  – Basic metric = estimation accuracy
  – Modified relative error
      • Treat over/under-estimation symmetrically
• Joins between Zipfian and right-shifted Zipfian
  – Domain size = 256K, number of stream tuples = 4M
  – Qualitatively similar results for Census data

18 Synthetic Data, z=1.0

19 Synthetic Data, z=1.5

20 Conclusions
• Introduced the Skimmed-Sketches technique for stream joins, the first streaming method to
  – Match the AGMS space lower bound for join estimation
  – Offer guaranteed log-time updates for the synopsis
  – Handle insertions as well as deletions
• Two key technical ideas: Sketch Skimming and Hash Sketches
• Experimental results verify its superiority over basic sketching for join-size estimation
  – Accuracy improvements from factor of 5 up to orders of magnitude

21 Thank you!

22 Census Data