Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,

Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs, Lucent Technologies

2 Data Streaming Assumptions
Stream: a sequence of insertion and deletion operations.
Look once: each operation is seen once by the stream processor.
Storage is limited compared to the stream size.
Streaming sub-models: insert only; sliding window; insert and delete.
Applications:
– Network management, network anomaly detection.
– Database statistics maintenance, etc.

3 Problem Definition
Data streams A, B, C, etc. are viewed as sets of elements.
Given a set expression over these streams, estimate the cardinality of the set expression.
A basic problem.
Approach: randomized approximation algorithm.

4 Previous Work
Flajolet and Martin ’84: estimates the cardinality of a union of streams.
Minwise Independent Permutations (MIP), Broder et al. ’98, Cohen ’97, Indyk ’99, and the Distinct Sampling technique, Gibbons ’01: estimate set expression cardinality.
The above results are easily extended to sliding windows.
No scheme exists when streams contain both insertion and deletion operations.

5 2-level sketches
N = domain size; a is an arbitrary stream element.
First level: a hash function h maps a to a level b, one of log N levels.
Second level: each level holds a [2] by [log N] array of counters, indexed by bit value (0 or 1) and bit position (1 to log N).
Update: let a hash to level b. At level b, SecondLevel[bit value][bit position] is incremented for an insertion and decremented for a deletion, once for each bit position of a, using the value of a's bit at that position.
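The update rule on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `TwoLevelSketch`, the SHA-256 based first-level hash, the 0-based level numbering, and the choice of the level as the number of trailing zero bits of the hash are all assumptions made for the example.

```python
import hashlib


class TwoLevelSketch:
    """Illustrative 2-level sketch: counters[level][bit position][bit value]."""

    def __init__(self, domain_bits, seed=0):
        self.domain_bits = domain_bits  # log N
        self.seed = seed                # parallel sketches share the same seed
        self.counters = [[[0, 0] for _ in range(domain_bits)]
                         for _ in range(domain_bits + 1)]

    def level_of(self, a):
        # Assumed first-level hash: the level is the number of trailing zero
        # bits of a hash of a, capped at domain_bits.
        digest = hashlib.sha256(f"{self.seed}:{a}".encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        level = 0
        while level < self.domain_bits and (h >> level) & 1 == 0:
            level += 1
        return level

    def update(self, a, delta):
        """delta is +1 for an insertion and -1 for a deletion."""
        b = self.level_of(a)
        for j in range(self.domain_bits):
            bit = (a >> j) & 1
            self.counters[b][j][bit] += delta

    def insert(self, a):
        self.update(a, +1)

    def delete(self, a):
        self.update(a, -1)
```

Because every update only adds or subtracts counters, a deletion exactly cancels the matching insertion, which is what lets the structure handle update streams.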

6 Updates to Second Level Singleton Levels
Size of set = m. Assume m is well-estimated. Let […].
1. Probability that level l is singleton = […].
2. Is level l singleton? Answered easily from the second-level array.
3. Assume level l is singleton. The singleton element is easily identified.
Probability that a given element of the set is the one in the singleton level is 1/m.
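One way to answer steps 2 and 3 from the second-level array is sketched below. It assumes a level is stored as a list of `[zero_count, one_count]` pairs, one per bit position, and that net counts are never negative; the function names `level_counters` and `classify_level` are hypothetical helpers for this example.

```python
def level_counters(elements, bits):
    """Build the second-level counters that a list of elements would produce."""
    level = [[0, 0] for _ in range(bits)]
    for a in elements:
        for j in range(bits):
            level[j][(a >> j) & 1] += 1
    return level


def classify_level(level):
    """Return ('empty', None), ('singleton', element) or ('multiple', None)."""
    if all(c0 == 0 and c1 == 0 for c0, c1 in level):
        return ("empty", None)
    element = 0
    for j, (c0, c1) in enumerate(level):
        if c0 != 0 and c1 != 0:
            # Two distinct elements disagree at bit position j.
            return ("multiple", None)
        if c1 != 0:
            element |= 1 << j  # the only occupant has a 1 at position j
    return ("singleton", element)
```

A level counts as singleton when every bit position has exactly one non-zero counter; the occupant is then read off bit by bit, even if it was inserted more than once.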

7 Distinct Sample
A singleton level gives an elementary distinct sample.
Suppose there are […] 2-level sketches. Then the number of singleton levels is at least […] with probability at least […].
This extends Gibbons’ Distinct Sampling and min-wise permutations to update streams.

8 Singleton Level in Union of Streams
Streams A, B. Keep one parallel 2-level sketch pair for A and B (same hash function).
Is level l singleton for A ∪ B? Yes, if one of the following holds:
1. Level l is singleton for A and empty for B, OR
2. Level l is empty for A and singleton for B, OR
3. Level l is singleton for both A and B and the occupants are identical.
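The three cases above can be written as a small check. This sketch assumes each stream's level has already been classified as a `(kind, element)` pair with `kind` in `'empty'`, `'singleton'`, `'multiple'`; that representation and the name `union_singleton` are assumptions for illustration.

```python
def union_singleton(state_a, state_b):
    """Return the single occupant of level l in A ∪ B, or None if there is none.

    Each state is a (kind, element) pair for the same level of a parallel
    sketch pair, i.e. both sketches use the same hash function.
    """
    kind_a, elem_a = state_a
    kind_b, elem_b = state_b
    if kind_a == "singleton" and kind_b == "empty":
        return elem_a                      # case 1
    if kind_a == "empty" and kind_b == "singleton":
        return elem_b                      # case 2
    if kind_a == "singleton" and kind_b == "singleton" and elem_a == elem_b:
        return elem_a                      # case 3: identical occupants
    return None
```

Callers should compare the result with `None` rather than test truthiness, since element 0 is a valid occupant.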

9 Set Difference Condition
Streams A, B. Goal: estimate |A-B|.
Keep a parallel 2-level sketch for A and B (i.e., same hash function h).
Assume level l is singleton for A ∪ B. The probability that level l is singleton for A and empty for B is |A-B| / |A ∪ B|.
A-B Condition: level l is singleton for A and empty for B.

10 Estimating |A-B|
Keep independent parallel 2-level sketch pairs for A and B.
Let m be the estimated size of A ∪ B and l = ceil(log m).
Estimate for |A-B|: at level l,
1. X = number of sketch pairs in which level l is singleton for A ∪ B.
2. D = number of sketch pairs satisfying the A-B Condition.
3. Estimate = m*D / X.
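The counting steps above can be sketched as a short function. It assumes the level-l state of each sketch has already been classified as a `(kind, element)` pair; the function name `estimate_difference` and the choice to return `None` when no pair is singleton for the union are assumptions of this example, not part of the slides.

```python
def estimate_difference(m, pairs):
    """Estimate |A-B| as m * D / X from classified level-l states.

    m     : estimate of |A ∪ B|
    pairs : list of (state_a, state_b), one per parallel sketch pair, where
            each state is (kind, element) with kind in
            {'empty', 'singleton', 'multiple'}
    """
    x = 0  # pairs whose level l is singleton for A ∪ B
    d = 0  # pairs satisfying the A-B Condition
    for (kind_a, elem_a), (kind_b, elem_b) in pairs:
        a_only = kind_a == "singleton" and kind_b == "empty"
        b_only = kind_a == "empty" and kind_b == "singleton"
        same = (kind_a == "singleton" and kind_b == "singleton"
                and elem_a == elem_b)
        if a_only or b_only or same:
            x += 1
        if a_only:
            d += 1
    if x == 0:
        return None  # no usable sketch pair at this level
    return m * d / x
```

For example, with m = 30 and three union-singleton pairs of which one satisfies the A-B Condition, the estimate is 30 * 1 / 3 = 10.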

11 Estimation Guarantees
The estimate lies within relative error […] with probability at least […] if […].
Lower bound […] using communication complexity arguments, where op = […] or […].

12 Set Expression Condition
Set expression W is composed of set names X1, X2, …, Xr and the operators union, intersection and difference.
Keep a parallel sketch array for X1, X2, …, Xr.
Transform set expression W into a boolean sketch expression E(W) over the parallel sketches.
A similar transformation applies to MIPs, and an analogous one to Distinct Sampling.

13 Set Expressions to Sketch Expressions
Given set expression W, create boolean expression E(W) over parallel sketches recursively:
1. Replace set name X by IsSingleton(sketch(X), l).
2. Replace X ∩ Y by E(X) AND E(Y).
3. Replace X - Y by E(X) AND (NOT E(Y)).
4. Replace X ∪ Y by E(X) OR E(Y).
5. Add final conjunct IsSingleton(sketch(X1), sketch(X2), …, sketch(Xr), l).
Suppose level l is singleton for the union. Then the probability that E(W) is satisfied by a parallel sketch = |W|/m.
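The recursive rules above can be sketched as an evaluator over one parallel sketch. The nested-tuple encoding of expressions (`("set", name)`, `("inter", X, Y)`, `("diff", X, Y)`, `("union", X, Y)`), the `states` dictionary, and the function names are assumptions chosen for this illustration.

```python
def union_is_singleton(states):
    """Final conjunct: level l holds exactly one distinct element overall."""
    if any(kind == "multiple" for kind, _ in states.values()):
        return False
    occupants = {elem for kind, elem in states.values() if kind == "singleton"}
    return len(occupants) == 1


def satisfies(expr, states):
    """Evaluate E(W) for one parallel sketch.

    expr   : nested tuples, e.g. ("inter", ("diff", ("set", "A"), ("set", "B")),
             ("set", "C")) for (A - B) ∩ C
    states : dict mapping each set name to its (kind, element) state at level l
    """
    def e(node):
        op = node[0]
        if op == "set":                      # rule 1
            return states[node[1]][0] == "singleton"
        left, right = e(node[1]), e(node[2])
        if op == "inter":                    # rule 2
            return left and right
        if op == "diff":                     # rule 3
            return left and not right
        if op == "union":                    # rule 4
            return left or right
        raise ValueError(f"unknown operator: {op}")

    return union_is_singleton(states) and e(expr)  # rule 5
```

Once the union is known to be singleton, every non-empty set at level l holds the same element, so plain boolean logic on "is this set's level singleton" mirrors membership of that element in W.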

14 Estimating Set Expression Size
Given set expression W, create sketch expression E(W).
Estimate m = union size of sets in W. Let l = ceil(log m).
Keep n parallel sketches for each set in W. At level l,
1. X = number of parallel sketches in which level l is singleton for the union.
2. D = number of parallel sketches satisfying E(W).
3. Estimate = m*D / X.
The estimate lies within relative error […] with probability at least […] if […].
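Why m*D / X is a sensible estimate follows from the probability stated on the previous slide. The short derivation below is an informal reading of the slides rather than the authors' proof; the numbers in the last line are hypothetical, and the base-2 logarithm is an assumption.

```latex
\Pr\bigl[\,E(W)\ \text{holds} \;\big|\; \text{level } l \text{ is singleton for } X_1 \cup \dots \cup X_r \,\bigr] = \frac{|W|}{m}
\quad\Longrightarrow\quad
\frac{D}{X} \approx \frac{|W|}{m}
\quad\Longrightarrow\quad
m \cdot \frac{D}{X} \approx |W| .

% Hypothetical numbers: m = 1000,\ l = \lceil \log_2 1000 \rceil = 10,\ X = 120,\ D = 30
% \Rightarrow\ \text{Estimate} = 1000 \cdot 30 / 120 = 250 .
```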

15 Conclusions
Basic tool for estimating the cardinality of COUNT DISTINCT single-clause SQL queries over update databases involving:
– Simple predicates.
– Single- and multi-dimensional range predicates, distinct histograms, etc.
– Set expression cardinality estimation.
Extends naturally to the sliding window stream model.

16 Thank You !