Range-Efficient Counting of Distinct Elements

1 Range-Efficient Counting of Distinct Elements
Srikanta Tirthapura Iowa State University (joint work with Phillip Gibbons, Aduri Pavan)

2 Range-Efficient F0
Stream: [100,200], [0,10], [60,120], [5,25] F0: |[0,25] U [60,200]| = 167 11/21/2018 IIT Kanpur Streams Workshop

3 Range-Efficient F0
Input Stream: Sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 ≤ li ≤ ri ≤ n, and li, ri are integers
Output: Return |[l1,r1] U [l2,r2] U … U [lm,rm]|, i.e. the number of distinct elements in the union (F0)
Constraints: Single pass through the data; Small Workspace; Fast Processing Time
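As a point of reference for the problem statement, here is a minimal exact baseline in Python. It is not a streaming algorithm: it stores and sorts every range, which is exactly what the single-pass, small-workspace constraints rule out. The function name `exact_range_f0` is made up for this sketch.

```python
def exact_range_f0(ranges):
    """Exact size of the union of closed integer ranges [l, r] (offline baseline)."""
    total = 0
    cur_l = cur_r = None
    for l, r in sorted(ranges):
        if cur_r is None or l > cur_r + 1:
            # New block: close the previous merged block, if any.
            if cur_r is not None:
                total += cur_r - cur_l + 1
            cur_l, cur_r = l, r
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_r = max(cur_r, r)
    if cur_r is not None:
        total += cur_r - cur_l + 1
    return total


# The stream from the earlier example slide: [0,25] U [60,200] has 26 + 141 points.
print(exact_range_f0([(100, 200), (0, 10), (60, 120), (5, 25)]))  # 167
```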

4 Reductions to Range-Efficient F0
Problems that reduce to Range-Efficient F0: Duplicate Insensitive Sum; Max-Dominance Norm; Counting Triangles in Graphs

5 Duplicate-Insensitive Sum
Problem: Sum of all distinct elements in a stream of integers
Input Stream: Sequence of integers S = a1, a2, …, an
Output: Σ ai, summed over the distinct ai in S
Example: S = 4, 5, 15, 4, 100, 4, 16, 15 Distinct Elements = 4, 5, 15, 100, 16 Sum = 140

6 Reduction from Dup-Insensitive Sum to F0
Stream S from U = [0,m-1] is mapped to an alternate stream S’ from U’ = [0,m²-1]:
S: 4 -> S’: [4m, 4m+1, …, 4m+3]
S: 5 -> S’: [5m, …, 5m+4]
S: 15 -> S’: [15m, …, 15m+14]
S: 4 -> S’: [4m, …, 4m+3]
S: 100 -> S’: [100m, …, 100m+99]
Duplicate-Insensitive Sum of S = Number of Distinct Elements of S’
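The reduction can be checked on the example stream with a short Python sketch. Two assumptions here are not in the slides: an explicit set stands in for a range-efficient F0 algorithm (so this is exact and not small-space), and the name `dup_insensitive_sum_via_f0` is hypothetical.

```python
def dup_insensitive_sum_via_f0(stream, m):
    """Map each value a in [0, m-1] to the range [a*m, a*m + a - 1], then count distinct points."""
    covered = set()  # stand-in for a range-efficient F0 sketch
    for a in stream:
        if not 0 <= a < m:
            raise ValueError("stream values must lie in [0, m-1]")
        # The range holds exactly a integers (none when a == 0). Ranges for
        # different values cannot overlap because a*m + a - 1 < (a + 1)*m.
        covered.update(range(a * m, a * m + a))
    return len(covered)


stream = [4, 5, 15, 4, 100, 4, 16, 15]
print(dup_insensitive_sum_via_f0(stream, m=101))  # 140
print(sum(set(stream)))                           # 140, computed directly
```

Repeated values regenerate the same range, so they add nothing to the distinct count; that is what makes the sum duplicate-insensitive.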

7 Max Dominance Norm
Given k streams of m integers each (the elements of the streams arrive in an arbitrary order), where 1 ≤ a_{i,j} ≤ n:
a_{1,1} a_{1,2} … a_{1,m}
a_{2,1} a_{2,2} … a_{2,m}
…
a_{k,1} a_{k,2} … a_{k,m}
Return Σ_{j=1}^{m} max_{1 ≤ i ≤ k} a_{i,j}

8 Reduction From Max Dominance Norm
Input stream I, output stream O: F0 of Output Stream = Dominance Norm of Input Stream
Assign ranges to the m positions: [1,n], [n+1,2n], …, [(m-1)n+1, mn]
When element a_{i,j} is received, generate the range [(j-1)n+1, (j-1)n+a_{i,j}]
Observation: F0 of the resulting stream of ranges is the dominance norm of the input stream, because the ranges generated for position j are nested prefixes of its block and their union has max_i a_{i,j} elements
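A small Python sketch of this reduction follows. It assumes that position j owns the block [(j-1)n+1, jn] and that a value a at position j maps to the first a integers of that block. As before, an explicit set stands in for a range-efficient F0 algorithm, and `max_dominance_via_f0` is a name invented for the example.

```python
def max_dominance_via_f0(items, n):
    """items: iterable of (i, j, value), 1 <= value <= n, j >= 1, arriving in any order."""
    covered = set()  # stand-in for a range-efficient F0 sketch
    for _i, j, value in items:
        start = (j - 1) * n + 1
        # Range [start, start + value - 1]: a prefix of position j's block.
        covered.update(range(start, start + value))
    return len(covered)


streams = [[3, 1, 4],
           [2, 5, 1]]  # k = 2 streams, m = 3 positions
items = [(i, j, v)
         for i, row in enumerate(streams, start=1)
         for j, v in enumerate(row, start=1)]
print(max_dominance_via_f0(items, n=5))  # 12 = 3 + 5 + 4
```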

9 Talk Outline
Range Efficient F0
Reductions Among Data Stream Problems
Algorithm for Range Efficient F0 (building on distinct sampling)
Update Streams
Open Questions

10 Counting Distinct Elements (F0)
Example: How many different users accessed my website today? Stream = 1,1,2,3,4,1, F0 = 4
Numerous applications in databases and networking
Prior Work: Flajolet-Martin (1985); Alon, Matias and Szegedy (1996); Gibbons and Tirthapura (2001); Bar-Yossef et al. (2002) (currently most space-efficient); Indyk-Woodruff (2003) (Lower Bounds)

11 Range-Efficient F0 (Pavan and Tirthapura)
Distinct Sampling Algorithm for F0 + Range Sampling for 2-way Independent Hash Functions

12 Sampling Based Algorithm for F0 (Gibbons and Tirthapura 2001)
D = Distinct Elements In Stream
S0 = U = {1,2,3,…,n}
S1 = {2,4,7,…}: each element of S0 is kept with p = 1/2; the sample at this level is D ∩ S1
S2 = {4,7,11,…}: each element of S1 is kept with p = 1/2; the sample at this level is D ∩ S2
S0, S1, S2, … are stored implicitly using hash functions

13 Distinct Sampling
Sample = {}, p = 1 Target Workspace = 4 numbers

14 Distinct Sampling
Stream: 5 Sample = {5}, p = 1 Target Workspace = 4 numbers

15 Distinct Sampling
Stream: 5 3 Sample = {5,3}, p = 1 Target Workspace = 4 numbers

16 Distinct Sampling
Stream: 5 3 7 Sample = {5,3,7}, p = 1 Target Workspace = 4 numbers

17 Distinct Sampling
Stream: 5 3 7 Sample = {5,3,7}, p = 1 Target Workspace = 4 numbers

18 Distinct Sampling
Stream: 5 3 7 6 Sample = {5,3,7,6}, p = 1 Target Workspace = 4 numbers

19 Distinct Sampling
Stream: 5 3 7 6 8 Sample = {5,3,7,6,8}, p = 1 Overflow: Sample = Sample ∩ S1 Sample = {3,6,8}, p = ½ Target Workspace = 4 numbers

20 Distinct Sampling
Stream: 5 3 7 6 8 9 Sample = {3,6,8,9}, p = ½ Target Workspace = 4 numbers

21 Distinct Sampling
Stream: 5 3 7 6 8 9 Same Decision for both Sample = {3,6,8,9}, p = ½ Target Workspace = 4 numbers

22 Distinct Sampling
Stream: 5 3 7 6 8 9 2 Sample = {3,6,8,9,2}, p = ½ Overflow: Sample = Sample ∩ S2 Sample = {6,9}, p = ¼ Target Workspace = 4 numbers

23 Distinct Sampling
Stream: 5 3 7 6 8 9 2 Finally, Sample = {6,9}, p = ¼ Estimate of F0 = (Sample Size)(4) = 8

24 Counting Distinct Elements
Finally, return a sample of distinct elements of the stream of a “large enough” size
If target workspace = O((1/ε²) log(1/δ)) integers, then the estimate of F0 is an (ε, δ)-approximation
Hash functions need only be pairwise independent and can be stored in small space
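The sampling procedure walked through above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the class name `DistinctSampler`, the choice of the Mersenne prime 2^61 - 1, and the rule "x is in S_i iff h(x) < P >> i" are all choices made for this example. The hash is of the pairwise-independent form h(x) = (ax + b) mod P.

```python
import random


class DistinctSampler:
    """Hash-based distinct sampling with a fixed sample capacity (illustrative sketch)."""

    P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any stream element

    def __init__(self, capacity, seed=0):
        rng = random.Random(seed)
        self.a = rng.randrange(1, self.P)
        self.b = rng.randrange(self.P)
        self.capacity = capacity
        self.level = 0        # current sampling probability is 2**-level
        self.sample = set()

    def _in_level(self, x, level):
        # x belongs to S_level iff h(x) lands in the lowest 2**-level fraction of [0, P-1].
        # Thresholds shrink with the level, so S_0 contains S_1 contains S_2, and so on.
        return (self.a * x + self.b) % self.P < (self.P >> level)

    def add(self, x):
        if self._in_level(x, self.level):
            self.sample.add(x)  # a duplicate hashes identically, so it changes nothing
        while len(self.sample) > self.capacity:
            self.level += 1     # overflow: halve the sampling probability
            self.sample = {y for y in self.sample if self._in_level(y, self.level)}

    def estimate(self):
        return len(self.sample) * (1 << self.level)


sampler = DistinctSampler(capacity=4)
for x in [5, 3, 7, 3, 6, 8, 9, 2]:
    sampler.add(x)
print(sampler.sample, sampler.level, sampler.estimate())
```

Because membership in each level is decided by the hash of the element alone, every copy of an element receives the same decision, which is why the estimate depends only on the set of distinct elements.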

25 Sampling Using Independent Coin Tosses and Distinct Sampling Using Hash Functions
[Figure: sampling using independent coin tosses, shown beside distinct sampling using a hash function]

26 Adaptive Sampling for Range-Efficient F0
Naïve Approach: Given range [x,y], successively insert {x, x+1, …, y} into the F0 sampling algorithm
Problem: Time per range is very large
Range-Sampling: Given stream element [p,q], how to sample all elements in [p,q] quickly? At sampling level i, quickly compute |[p,q] ∩ Si|

27 Hash Functions, and S0,S1,S2…
h(x) = (ax+b) mod p, where p is prime and a, b are random in [0,p-1]
If h(x) ∈ [0,vi], then x ∈ Si
[Figure: h maps [1,n] into [0,p-1], with thresholds v1, v2, v3 marked]

28 Range Sampling
f(x) = (ax+b) mod p
Compute |{x ∈ [x1,x2] : f(x) ∈ [0,v]}|
[Figure: the interval [x1,x2] inside [1,n], mapped by f into [0,p-1], with threshold v marked]

29 Arithmetic Progression
f(x) = (ax+b) mod p
The values f(x1), f(x1+1), …, f(x2) form an arithmetic progression modulo p with Common Difference = a

30 Low and High Revolutions
Each revolution around the modulo p circle, the number of hits on [0,v] is floor(v/a) (low revolution) or floor(v/a)+1 (high revolution)
Task: Count the number of low and high revolutions

31 Starting Points of Revolutions
Can find r = (v - v mod a) such that: if the starting point of a revolution is in [0,r], then it is a high revolution; else it is a low revolution
Task: Count the number of revolutions with starting point in [0,r]

32 Recursive Algorithm
Observation: Starting points form an arithmetic progression with difference (-p mod a)
[Figure: the modulo p circle beside the modulo a circle, with r marked]

33 Recursive Algorithm
Focus on the common difference
Two reductions are possible from common difference a: to common difference a - (p mod a), or to common difference (p mod a)
Since the two new differences sum to a, at least one of them is at most a/2

34 Range Sampling
Theorem: There is an algorithm for sampling range [x,y] using 2-way independent hash functions with time complexity O(log (y-x)) and space complexity O(log (y-x) + log m)
Plug back into distinct sampling to get the range-efficient F0 algorithm
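To make the counting task concrete, the sketch below computes |{x ∈ [x1,x2] : (ax+b) mod p ∈ [0,v]}| without enumerating the range. Note that it swaps in a different technique from the revolution-counting recursion in the slides: it uses the identity [y mod p ≤ v] = floor(y/p) - floor((y-v-1)/p) and evaluates the two resulting floor sums with a standard Euclid-style recursion, so the number of loop iterations is logarithmic in p. The function names are made up for this example, and the brute-force version is included only for checking.

```python
def floor_sum(n, m, a, b):
    """Sum of floor((a*i + b) / m) for i = 0..n-1, given n >= 0, m >= 1, a >= 0, b >= 0."""
    total = 0
    while True:
        if a >= m:
            total += (n - 1) * n // 2 * (a // m)
            a %= m
        if b >= m:
            total += n * (b // m)
            b %= m
        y_max = a * n + b
        if y_max < m:
            return total
        # Swap the roles of slope and modulus, as in the Euclidean algorithm.
        n, b = divmod(y_max, m)
        m, a = a, m


def count_hits(a, b, p, x1, x2, v):
    """Count x in [x1, x2] with (a*x + b) mod p in [0, v], assuming 0 <= v < p."""
    n = x2 - x1 + 1
    if n <= 0:
        return 0
    a %= p
    c = (a * x1 + b) % p  # f(x1); then f(x1 + i) = (a*i + c) mod p
    # The second sum is shifted up by p to keep its numerator non-negative,
    # which is compensated by the "+ n" term.
    return floor_sum(n, p, a, c) - floor_sum(n, p, a, c + p - v - 1) + n


def count_hits_naive(a, b, p, x1, x2, v):
    """Brute-force reference: time proportional to the length of the range."""
    return sum(1 for x in range(x1, x2 + 1) if (a * x + b) % p <= v)


print(count_hits(2, 0, 5, 0, 4, 1))  # 2: f takes the values 0, 2, 4, 1, 3
```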

35 Results
Input Stream: Sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 ≤ li ≤ ri < n, and li, ri are integers
Output: |[l1,r1] U [l2,r2] U … U [lm,rm]|
Results: Randomized (ε,δ)-Approximation Algorithm for Range-efficient F0 of a data stream
Processing Time (n is the size of the universe): Amortized processing time per interval: O(log(1/δ) log(n/ε)); time to answer a query for F0 is a constant
WorkSpace: O((1/ε²) log(1/δ) log n)
Pavan, Tirthapura, SICOMP (to appear)

36 Prior Work
Bar-Yossef, Kumar, Sivakumar 2002: First studied range-efficient F0; algorithms with higher space complexity
Cormode, Muthukrishnan 2003: Max-dominance Norm
Nath, Gibbons, Seshan, Anderson 2004: Duplicate-insensitive Sum assuming ideal hash functions

37 Comparison
Range-Efficient F0
Time: Bar-Yossef et al.: O((log⁵ n)(1/ε⁵)(log 1/δ)); Pavan and Tirthapura: O((log n + log 1/ε)(log 1/δ))
Space: Bar-Yossef et al.: O((1/ε³)(log n)(log 1/δ)); Pavan and Tirthapura: O((1/ε²)(log n)(log 1/δ))
Max-Dominance Norm
Time: Cormode, Muthukrishnan: O((1/ε⁴)(log n)(log m)(log 1/δ)); Pavan and Tirthapura: O((log n + log 1/ε)(log 1/δ))
Space: Cormode, Muthukrishnan: O((1/ε²)(log n + (1/ε)(log m)(log log m))(log 1/δ)); Pavan and Tirthapura: O((1/ε²)(log m + log n)(log 1/δ))

38 Other Applications of Distinct Sampling
Sample of distinct elements of the stream of any desired target size
Approximate median of all distinct elements in stream (duplicate insensitive median)
Distinct Frequent elements (“heavy hitters” in network monitoring)

39 Update Streams
Insertions and deletions of elements into the streams: (11, +1), (7, +3), (4, +2), (7, -2), (11, -1), …
Distinct Elements Problem: How many elements have a positive cumulative weight?
Assume a “sanity constraint”: no element has weight less than 0
The sampling algorithm described so far fails, since it can only decrease the sampling probability as the stream becomes larger
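For reference, the quantity being asked for can be computed exactly with one counter per element. The Python sketch below does that; it is only a definition-level baseline using space linear in the number of elements, not a small-space algorithm, and the function name `distinct_positive` is hypothetical.

```python
from collections import defaultdict


def distinct_positive(updates):
    """Number of elements whose cumulative weight is positive (exact, linear space)."""
    weight = defaultdict(int)
    for x, delta in updates:
        weight[x] += delta
        if weight[x] < 0:
            raise ValueError("sanity constraint violated: negative weight")
    return sum(1 for w in weight.values() if w > 0)


# Final weights: 11 -> 0, 7 -> 1, 4 -> 2, so two elements remain.
print(distinct_positive([(11, +1), (7, +3), (4, +2), (7, -2), (11, -1)]))  # 2
```

Element 11 shows why deletions are awkward for the earlier sampler: it must stop counting once its weight returns to 0, and if many elements are deleted the sampling probability would need to rise again.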

40 Distinct Sampling on Update Streams (three independent approaches)
Sumit Ganguly, Minos N. Garofalakis, Rajeev Rastogi: Processing Set Expressions over Continuous Update Streams. SIGMOD 2003, followed up by Ganguly, 2005 and Ganguly, Majumder 2006
Graham Cormode, S. Muthukrishnan, Irina Rozenbaum: Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. VLDB 2005
Gereon Frahling, Piotr Indyk, Christian Sohler: Sampling in dynamic data streams and applications. SoCG 2005

41 Distinct Elements on Update Streams
Use of K-Set Structure in storing samples: Ganguly, Garofalakis, Rastogi 2003; Ganguly 2005; Ganguly, Majumder 2006

42 K-Set Structure
Small space data structure for multi-set S (size Õ(K))
Operations: Insert (x,v) into S; Delete (x,v’) from S; Membership Query (is x in S?); what is the number of distinct elements in S?
If |S| ≤ K, then queries are answered correctly
[Figure: axis with K marked and regions labeled Active, Silent, Active]

43 Counting Distinct Elements on Update Streams
Sample the stream at different probabilities: 1, ½, ¼, …
Store each of (D ∩ S0, D ∩ S1, D ∩ S2, …) in a k-set structure for an appropriate value of k
When queried, use the highest probability sample that hasn’t overflowed yet

44 Distributed Streams
Alice (Workspace = $$) observes Stream A and sends Sketch(A) to the Referee
Bob (Workspace = $$) observes Stream B and sends Sketch(B) to the Referee
Referee: Compute Dup-Ins-Sum(A,B)

45 Summary
Distinct Sampling supports: Range-Efficiency (range-sampling); Update Streams (k-set structure); Sliding Windows (multiple samples)

46 Open Questions
Can we efficiently handle higher-dimensional ranges? Klee’s measure problem in the streaming model

47 Open Questions
Range-Efficient F0 under update streams
Duplicate-insensitive Fk (k ≥ 2), range-efficient Fk

