Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs, Lucent Technologies
2 Data Streaming Assumptions Stream: sequence of insertion and deletion operations. Look Once: Each operation seen once by stream processor. Storage is limited compared to stream size. Streaming Sub-Models: Insert only. Sliding Window. Insert and Delete. Applications –Network Management, network anomaly detection. –Database Statistics Maintenance, etc.
3 Problem Definition Data Streams A,B,C,…, etc. viewed as sets of elements. Given a set expression, e.g., Estimate Cardinality of Set Expression. A basic problem. Randomized approximation algorithm.
4 Previous Work Flajolet and Martin ’84. Estimates cardinality of union of streams. Minwise Independent Permutations (MIP), Broder et.al. ’98, Cohen ’97, Indyk ’99. Distinct Sampling Technique, Gibbons ’01. Estimate set expression cardinality. Above results easily extended to sliding windows. No scheme when streams contain both insertion and deletion operations.
5 b 2-level sketches A second level array [2] by [log N] of counters per level. Let hashes to level b. At level b, SecondLevel[ ][ ] is incremented for insertion and decremented for deletion. log N log N-1 level levels 1 2 bit positions 1 to log N bit value 1 bit value 0 N = Domain Size, a an arbitrary stream element. b a hash fn h
6 Updates to Second Level Singleton Levels Size of set = m. Assume m is well-estimated. Let 1.Probability level l is singleton = 2.Is level l singleton? Answered easily from second level array. 3.Assume level l is singleton. Singleton element is easily identified. Probability an element of set is in the singleton level is 1/m.
7 Distinct Sample A singleton level gives an elementary distinct sample. Suppose there are 2-level sketches. Then, number of singleton levels is at least with probability at least. Extends Gibbons’ Distinct Sampling / Min wise permutations to update streams.
8 Singleton Level in Union of Streams Streams A,B. Keep one parallel (same hash function) 2-level sketch pair for A and B. Is level l singleton for A U B? 1.Level l is singleton for A and empty for B, OR 2.Level l is empty for A and singleton for B, OR 3.Level l is singleton for both A and B and the occupants are identical.
9 Set Difference Condition Streams A,B. Goal: estimate |A-B|. Keep a parallel 2-level sketch for A and B (i.e., same hash function h). Assume level l is singleton. Probability that level l is singleton for A and empty for B is A-B Condition: Level l is singleton for A and empty for B.
10 Estimating |A-B| Keep independent parallel 2-level sketch pairs for A and B. Let. Estimate for |A-B|. At level l, 1.X= Count number of singleton sketch pairs for A U B. 2.D= Count number of sketch pairs satisfying A-B Condition. 3.Estimate = m*D / X.
11 Estimation Guarantees Estimate lies within relative error with probability at least if Lower bound using communication complexity arguments, where op = or.
12 Set Expression Condition Set expression W composed out of set names X1,X2,…,Xr and operators, union, intersection and difference. Parallel sketch array for X1, X2, …, Xr. Transform set expression W into a boolean sketch expression E(W) over parallel sketches. Similar transformation for MIPs, analogous for Distinct Sampling.
13 Set Expressions to Sketch Expressions Given set expression W. Create boolean expression E(W) over parallel sketches recursively. 1.Replace Set name X by IsSingleton(sketch(X),l). 2.Replace X Y by E(X) AND E(Y). 3.Replace X-Y by E(X) AND (NOT E(Y)). 4.Replace X Y by E(X) OR E(Y). 5.Add final conjunct IsSingleton(sketch(X1),sketch(X2),…,sketch(Xr),l). Suppose level l is singleton for the union. Then, Probability E(W) is satisfied by a parallel sketch = |W|/m
14 Estimating Set Expression Size Given set expression W. Create sketch expression E(W). Estimate m = union size of sets in W. Let l = ceil(log m). Keep n parallel sketches for each set in W. At level l, 1. X= Count number of singleton parallel sketches for the union. 2. D= Count number of parallel sketches satisfying E(W). 3. Estimate = m*D / X. Estimate lies within relative error with probability at least if
15 Conclusions Basic tool for estimating cardinality of COUNT DISTINCT single clause SQL queries over update databases involving Simple predicates. Single and Multi-dimensional Range predicates, distinct histograms etc. Set expression cardinality estimation. Extends naturally to sliding window stream model.
16 Thank You !