Range-Efficient Counting of Distinct Elements Srikanta Tirthapura Iowa State University (joint work with Phillip Gibbons, Aduri Pavan)
Range-Efficient F0
  Stream: [100,200], [0,10], [60,120], [5,25]
  F0: |[0,25] ∪ [60,200]| = 167
  (Figure: the four ranges drawn on a number line with endpoints 0, 5, 10, 25, 60, 100, 120, 200.)
11/21/2018 IIT Kanpur Streams Workshop
Range-Efficient F0
  Input Stream: sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 <= li <= ri <= n, and li, ri are integers
  Output: return |[l1,r1] ∪ [l2,r2] ∪ … ∪ [lm,rm]|, i.e. the number of distinct elements in the union (F0)
  Constraints: single pass through the data; small workspace; fast processing time
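For reference, the quantity being approximated can be computed exactly by sorting and merging the ranges. The sketch below is only an offline baseline (the function name `exact_range_f0` is made up for illustration): it stores every range, so it violates the single-pass and small-workspace constraints that the talk is about.

```python
def exact_range_f0(ranges):
    """Exact size of the union of closed integer ranges [l, r].

    Offline baseline only: it keeps and sorts all ranges, so it is
    neither single-pass nor small-space.
    """
    total = 0
    cur_l = cur_r = None
    for l, r in sorted(ranges):
        if cur_r is None or l > cur_r + 1:
            # Disjoint from the current merged run: close it, start a new one.
            if cur_r is not None:
                total += cur_r - cur_l + 1
            cur_l, cur_r = l, r
        else:
            # Overlapping or adjacent: extend the current run.
            cur_r = max(cur_r, r)
    if cur_r is not None:
        total += cur_r - cur_l + 1
    return total


# The example stream from the earlier slide.
print(exact_range_f0([(100, 200), (0, 10), (60, 120), (5, 25)]))  # 167
```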
Reductions to Range-Efficient F0
  Each of the following problems reduces to Range-Efficient F0:
  Duplicate-Insensitive Sum
  Max-Dominance Norm
  Counting Triangles in Graphs
Duplicate-Insensitive Sum
  Problem: Sum of all distinct elements in a stream of integers
  Input Stream: sequence of integers S = a1, a2, …, an
  Output: $\sum_{\text{distinct } a_i \in S} a_i$
  Example: S = 4, 5, 15, 4, 100, 4, 16, 15
    Distinct Elements = 4, 5, 15, 100, 16
    Sum = 140
Reduction from Dup-Insensitive Sum to F0
  Stream S from U = [0, m-1]; alternate stream S' from U' = [0, m^2-1]
  Each element a of S becomes the range [a·m, a·m + a - 1] in S', which contains exactly a integers
    S      S'
    4      [4m, 4m+1, …, 4m+3]
    5      [5m, …, 5m+4]
    15     [15m, …, 15m+14]
    4      [4m, …, 4m+3]
    100    [100m, …, 100m+99]
  Duplicate-Insensitive Sum of S = Number of Distinct Elements of S'
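The reduction can be checked on the example stream with a short sketch. The helper names are hypothetical, and `distinct_count` is a brute-force stand-in for a range-efficient F0 algorithm (it enumerates every integer of every range, which is exactly what a real algorithm must avoid).

```python
def to_ranges(stream, m):
    """Map each integer a in [0, m-1] to the range [a*m, a*m + a - 1].

    The range holds exactly a integers; a = 0 maps to an empty range
    and is skipped.
    """
    return [(a * m, a * m + a - 1) for a in stream if a > 0]


def distinct_count(ranges):
    """Brute-force F0 of a stream of ranges (stand-in, not range-efficient)."""
    seen = set()
    for l, r in ranges:
        seen.update(range(l, r + 1))
    return len(seen)


def dup_insensitive_sum(stream, m):
    """Duplicate-insensitive sum computed through the reduction to F0."""
    return distinct_count(to_ranges(stream, m))


S = [4, 5, 15, 4, 100, 4, 16, 15]
print(dup_insensitive_sum(S, m=101))  # 140, the sum of {4, 5, 15, 100, 16}
```

Repeated elements map to identical ranges, so they add nothing to the union, and ranges of different elements never overlap because each stays inside its own block of m integers.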
Max Dominance Norm
  Given k streams of m integers each (the elements of the streams arrive in an arbitrary order), where 1 ≤ ai,j ≤ n:
    a1,1 a1,2 … a1,m
    a2,1 a2,2 … a2,m
    …
    ak,1 ak,2 … ak,m
  Return $\sum_{j=1}^{m} \max_{1 \le i \le k} a_{i,j}$
Reduction From Max Dominance Norm
  Input stream I, output stream O: F0 of Output Stream = Dominance Norm of Input Stream
  Assign ranges to the m positions: [1,n], [n+1,2n], …, [(m-1)n+1, mn]
  When element ai,j is received, generate the range [(j-1)n+1, (j-1)n+ai,j], which has exactly ai,j integers and lies inside the range of position j
  Observation: F0 of the resulting stream of ranges is the dominance norm of the input stream
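A small sketch of this reduction follows. It assumes the range generated for ai,j is [(j-1)n+1, (j-1)n+ai,j], so that the range has exactly ai,j integers and stays within the block of n integers reserved for position j. All names are hypothetical, and the union is counted by brute force rather than by a range-efficient algorithm.

```python
def dominance_ranges(items, n):
    """Turn (i, j, value) triples, with 1 <= value <= n and 1-based j,
    into ranges inside the block reserved for position j."""
    for _i, j, value in items:
        base = (j - 1) * n
        yield (base + 1, base + value)


def distinct_count(ranges):
    """Brute-force F0 of a stream of ranges (stand-in, not range-efficient)."""
    seen = set()
    for l, r in ranges:
        seen.update(range(l, r + 1))
    return len(seen)


def max_dominance_norm(items, n):
    return distinct_count(dominance_ranges(items, n))


# k = 2 streams of m = 3 integers, n = 10; arrival order is arbitrary.
items = [(1, 1, 3), (2, 2, 1), (1, 2, 7), (2, 1, 5), (1, 3, 2), (2, 3, 2)]
print(max_dominance_norm(items, n=10))  # 14 = max(3,5) + max(7,1) + max(2,2)
```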
Talk Outline
  Range Efficient F0
  Reductions Among Data Stream Problems
  Algorithm for Range Efficient F0 (building on distinct sampling)
  Update Streams
  Open Questions
Counting Distinct Elements (F0)
  Example: How many different users accessed my website today?
    Stream = 1,1,2,3,4,1,2
    F0 = 4
  Numerous applications in databases and networking
  Prior Work
    Flajolet-Martin (1985)
    Alon, Matias and Szegedy (1996)
    Gibbons and Tirthapura (2001)
    Bar-Yossef et al. (2002) (currently most space-efficient)
    Indyk-Woodruff (2003) (Lower Bounds)
Range-Efficient F0 (Pavan and Tirthapura)
  Range-Efficient F0 = Distinct Sampling Algorithm for F0 + Range Sampling for 2-way Independent Hash Functions
Sampling Based Algorithm for F0 (Gibbons and Tirthapura 2001)
  U = {1,2,3,…,n}; D = Distinct Elements In Stream
  S0 = U
  S1: each element of S0 kept with p=1/2, e.g. S1 = {2,4,7,…}; the algorithm keeps D ∩ S1
  S2: each element of S1 kept with p=1/2, e.g. S2 = {4,7,11,…}; the algorithm keeps D ∩ S2
  S0, S1, S2, … are stored implicitly using hash functions
Distinct Sampling (example run, Target Workspace = 4 numbers)
  Start: Sample = {}, p = 1
  5 arrives: Sample = {5}, p = 1
  3 arrives: Sample = {5,3}, p = 1
  7 arrives: Sample = {5,3,7}, p = 1
  6 arrives: Sample = {5,3,7,6}, p = 1
  8 arrives: Sample = {5,3,7,6,8}, p = 1. Overflow: Sample = Sample ∩ S1, giving Sample = {3,6,8}, p = ½
  9 arrives: Sample = {3,6,8,9}, p = ½
  Same decision for both (annotation on the stream figure): Sample = {3,6,8,9}, p = ½
  2 arrives: Sample = {3,6,8,9,2}, p = ½. Overflow: Sample = Sample ∩ S2, giving Sample = {6,9}, p = ¼
  Finally, Sample = {6,9}, p = ¼; Estimate of F0 = (Sample Size)(4) = 8
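The run above can be sketched in code as follows. This is a minimal illustration and not the authors' implementation: the class name, the choice of prime, and the rule "x is in Si iff h(x) < floor(p / 2^i)" are assumptions made here to get nested sets S0 ⊇ S1 ⊇ S2 … with sampling probability about 1/2^i. The multiplier a is drawn from [1, p-1] to avoid the degenerate case a = 0.

```python
import random


class DistinctSampler:
    """Distinct sampling with a hash h(x) = (a*x + b) mod p (illustrative sketch)."""

    def __init__(self, capacity, prime=2_147_483_647, seed=None):
        rng = random.Random(seed)
        self.capacity = capacity          # target workspace, in elements
        self.p = prime                    # assumed larger than every stream element
        self.a = rng.randrange(1, prime)
        self.b = rng.randrange(prime)
        self.level = 0                    # current sampling probability is 1 / 2**level
        self.sample = set()

    def _in_level(self, x, level):
        # Assumed threshold rule: x is in S_level iff h(x) < floor(p / 2**level).
        return (self.a * x + self.b) % self.p < (self.p >> level)

    def add(self, x):
        if not self._in_level(x, self.level):
            return
        self.sample.add(x)                # a repeated element changes nothing
        while len(self.sample) > self.capacity:
            # Overflow: move to the next level and keep Sample ∩ S_level.
            self.level += 1
            self.sample = {y for y in self.sample if self._in_level(y, self.level)}

    def estimate(self):
        return len(self.sample) * (1 << self.level)


s = DistinctSampler(capacity=100, seed=1)
for x in [5, 3, 7, 7, 6, 8, 9, 9, 2]:
    s.add(x)
print(s.estimate())  # 7: nothing overflowed, so the count is exact
```

Because membership in each Si depends only on the hash value, every occurrence of the same element receives the same decision, which is what makes the sample a sample of distinct elements.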
Counting Distinct Elements
  Finally, return a sample of distinct elements of the stream of a “large enough” size
  If target workspace = $O((1/\epsilon^2)\log(1/\delta))$ integers, then estimate of F0 is an $(\epsilon, \delta)$-approximation
  Hash functions need only be pairwise independent and can be stored in small space
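For readers new to the notation, the usual meaning of an (ε, δ)-approximation is spelled out below; the symbol $\hat{F}_0$ for the returned estimate is introduced here for illustration.

```latex
% \hat{F}_0 is the estimate returned by the algorithm, F_0 the true count.
% The relative error is at most \epsilon with probability at least 1 - \delta:
\Pr\left[\, \lvert \hat{F}_0 - F_0 \rvert \le \epsilon F_0 \,\right] \ge 1 - \delta,
\qquad 0 < \epsilon < 1,\ 0 < \delta < 1 .
```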
Sampling Using Independent Coin Tosses vs. Distinct Sampling Using Hash Functions
  (Figure: the two sampling schemes shown side by side; in the second, the decisions come from a hash function.)
Adaptive Sampling for Range-Efficient F0
  Naïve Approach: Given range [x,y], successively insert {x, x+1, …, y} into F0 sampling algorithm
  Problem: Time per range very large
  Range-Sampling: Given stream element [p,q], how to sample all elements in [p,q] quickly?
  At sampling level i, quickly compute |[p,q] ∩ Si|
Hash Functions, and S0, S1, S2, …
  h(x) = (ax+b) mod p, with p prime and a, b random in [0,p-1]
  (Figure: h maps [1,n] into [0,p-1]; thresholds v1, v2, v3 are marked on [0,p-1].)
  If h(x) ∈ [0,vi], then x ∈ Si
Range Sampling
  f(x) = (ax+b) mod p, mapping [1,n] into [0,p-1]
  Compute |{x ∈ [x1,x2] : f(x) ∈ [0,v]}|
Arithmetic Progression
  f(x) = (ax+b) mod p
  f(x1), f(x1+1), …, f(x2) form an arithmetic progression modulo p with Common Difference = a
  (Figure: the points f(x1), f(x1+1), … on [0,p-1], with the target interval [0,v] marked.)
Low and High Revolutions
  A revolution is one pass of the progression f(x1), f(x1+1), … across [0,p-1] before it wraps around modulo p
  Each revolution, number of hits on [0,v] is floor(v/a) (low rev) or floor(v/a)+1 (high rev)
  Task: Count number of low, high revolutions
Starting Points of Revolutions
  Let r = v mod a. If the starting point of a revolution is in [0,r], then it is a high revolution; else it is a low revolution
  Task: Count the number of revolutions with starting point in [0,r]
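The low/high classification can be checked by brute force. The sketch below takes the threshold to be r = v mod a (an assumption that the code itself verifies) and considers a revolution whose starting point s lies in [0, a-1], as happens after a wrap-around. Function names are hypothetical.

```python
def hits_in_revolution(s, a, v):
    """Number of points s, s+a, s+2a, ... that fall in [0, v], for 0 <= s < a."""
    return len(range(s, v + 1, a))


def classification_holds(a, v):
    """Check: start in [0, v mod a] gives floor(v/a)+1 hits, otherwise floor(v/a)."""
    r = v % a
    for s in range(a):
        expected = v // a + 1 if s <= r else v // a
        if hits_in_revolution(s, a, v) != expected:
            return False
    return True


print(classification_holds(a=7, v=30))  # True; here r = 2 and floor(v/a) = 4
```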
Recursive Algorithm
  (Figure: the modulo p circle next to the modulo a circle, with the point r marked on the modulo a circle.)
  Observation: Starting points form an arithmetic progression on the modulo a circle with difference (-p mod a)
Recursive Algorithm
  Focus on common difference
  Two reductions possible from Common Difference a:
    Common Difference a - (p mod a)
    Common Difference (p mod a)
  The two sum to a, so at least one of the two common differences is at most a/2
Range Sampling
  Theorem: There is an algorithm for sampling range [x,y] using 2-way independent hash functions with
    Time complexity O(log (y-x))
    Space complexity O(log (y-x) + log m)
  Plug back into distinct sampling to get range-efficient F0 algorithm
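To illustrate that the range-sampling count can be obtained without enumerating the range, here is a sketch that uses a different, well-known technique: the Euclid-style "floor sum" recursion. It is not the revolution-counting algorithm of the talk and does not claim the bounds of the theorem; it only shows the same kind of halving recursion in runnable form. It relies on the identity that (y mod p) ≤ v exactly when floor(y/p) - floor((y-v-1)/p) = 1, for 0 ≤ v < p. Function names are made up for illustration.

```python
def floor_sum(n, m, a, b):
    """Sum of floor((a*i + b) / m) for i = 0..n-1, with n, a, b >= 0 and m > 0."""
    ans = 0
    while True:
        if a >= m:
            ans += (n - 1) * n // 2 * (a // m)
            a %= m
        if b >= m:
            ans += n * (b // m)
            b %= m
        y_max = a * n + b
        if y_max < m:
            return ans
        # Swap the roles of slope and modulus, as in the Euclidean algorithm.
        n, b = divmod(y_max, m)
        m, a = a, m


def range_hits(x1, x2, a, b, p, v):
    """|{x in [x1, x2] : (a*x + b) mod p in [0, v]}|, for 0 <= v < p and a, b, x1 >= 0."""
    n = x2 - x1 + 1
    c = a * x1 + b
    # Indicator of a hit at y is floor(y/p) - floor((y + p - v - 1)/p) + 1.
    return floor_sum(n, p, a, c) - floor_sum(n, p, a, c + p - v - 1) + n


# f(x) = (3x + 1) mod 7 on x = 0..6 takes the values 1,4,0,3,6,2,5; three are <= 2.
print(range_hits(0, 6, a=3, b=1, p=7, v=2))  # 3
```

Each loop iteration of `floor_sum` replaces the pair (modulus, slope) by (slope, modulus mod slope), so the number of iterations is logarithmic in p regardless of the length of the range.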
Results
  Input Stream: sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 <= li <= ri < n, and li, ri are integers
  Output: |[l1,r1] ∪ [l2,r2] ∪ … ∪ [lm,rm]|
  Results: Randomized $(\epsilon, \delta)$-approximation algorithm for range-efficient F0 of a data stream
  Processing time (n is the size of the universe):
    Amortized processing time per interval: $O(\log(1/\delta)\,\log(n/\epsilon))$
    Time to answer a query for F0 is a constant
  Workspace: $O((1/\epsilon^2)\,\log(1/\delta)\,\log n)$
  Pavan, Tirthapura, SICOMP (to appear)
Prior Work
  Bar-Yossef, Kumar, Sivakumar 2002: first studied range-efficient F0; algorithms with higher space complexity
  Cormode, Muthukrishnan 2003: Max-dominance Norm
  Nath, Gibbons, Seshan, Anderson 2004: Duplicate-insensitive Sum assuming ideal hash functions
Comparison
  Range-Efficient F0
    Time
      Bar-Yossef et al.: $O((\log^5 n)(1/\epsilon^5)(\log 1/\delta))$
      Pavan and Tirthapura: $O((\log n + \log 1/\epsilon)(\log 1/\delta))$
    Space
      Bar-Yossef et al.: $O((1/\epsilon^3)(\log n)(\log 1/\delta))$
      Pavan and Tirthapura: $O((1/\epsilon^2)(\log n)(\log 1/\delta))$
  Max-Dominance Norm
    Time
      Cormode, Muthukrishnan: $O((1/\epsilon^4)(\log n)(\log m)(\log 1/\delta))$
      Pavan and Tirthapura: $O((\log n + \log 1/\epsilon)(\log 1/\delta))$
    Space
      Cormode, Muthukrishnan: $O((1/\epsilon^2)(\log n + (1/\epsilon)(\log m)(\log\log m))(\log 1/\delta))$
      Pavan and Tirthapura: $O((1/\epsilon^2)(\log m + \log n)(\log 1/\delta))$
Other Applications of Distinct Sampling
  Sample of distinct elements of the stream of any desired target size
  Approximate median of all distinct elements in stream (duplicate insensitive median)
  Distinct frequent elements (“heavy hitters” in network monitoring)
Update Streams
  Insertions and deletions of elements into the streams: (11, +1), (7, +3), (4, +2), (7, -2), (11, -1), …
  Distinct Elements Problem: How many elements have a positive cumulative weight?
  Assume a “sanity constraint”: no element has weight less than 0
  The sampling algorithm described so far fails, since it can only decrease the sampling probability as the stream becomes larger; after deletions, elements discarded at an earlier overflow cannot be recovered
Distinct Sampling on Update Streams (three independent approaches)
  Sumit Ganguly, Minos N. Garofalakis, Rajeev Rastogi: Processing Set Expressions over Continuous Update Streams. SIGMOD 2003; followed up by Ganguly 2005 and Ganguly, Majumder 2006
  Graham Cormode, S. Muthukrishnan, Irina Rozenbaum: Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. VLDB 2005
  Gereon Frahling, Piotr Indyk, Christian Sohler: Sampling in dynamic data streams and applications. SoCG 2005
Distinct Elements on Update Streams
  Use of K-Set Structure in storing samples
    Ganguly, Garofalakis, Rastogi 2003
    Ganguly 2005
    Ganguly, Majumder 2006
K-Set Structure
  Small space data structure for multi-set S (size Õ(K))
  Operations
    Insert (x,v) into S
    Delete (x,v’) from S
    Membership Query (is x in S?)
    What is the number of distinct elements in S?
  If |S| ≤ K, then queries answered correctly
  (Figure: size of S over time against the threshold K, with phases labeled Active, Silent, Active.)
Counting Distinct Elements on Update Streams
  Sample stream at different probabilities: 1, ½, ¼, …
  Store each of (D ∩ S0, D ∩ S1, D ∩ S2, …) in a k-set structure for an appropriate value of k
  When queried, use the highest probability sample that hasn’t overflowed yet
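The level-by-level scheme can be sketched as below. Important caveat: each level uses an exact dictionary of weights as a stand-in for a k-set structure, so this sketch does not have the small-space guarantee; it only illustrates how levels are maintained under insertions and deletions and how a query picks a level. The class name, the prime, and the threshold rule for Si are assumptions made for illustration.

```python
import random
from collections import defaultdict


class UpdateStreamF0:
    """Distinct count over an update stream, one weight table per sampling level."""

    def __init__(self, k, levels=32, prime=2_147_483_647, seed=None):
        rng = random.Random(seed)
        self.k = k
        self.p = prime
        self.a = rng.randrange(1, prime)
        self.b = rng.randrange(prime)
        # Exact tables standing in for k-set structures (not small-space).
        self.tables = [defaultdict(int) for _ in range(levels)]

    def update(self, x, w):
        """Apply (x, w); w may be negative, assuming weights never drop below 0."""
        h = (self.a * x + self.b) % self.p
        for i, table in enumerate(self.tables):
            if h >= (self.p >> i):
                break                     # levels are nested: x is in no deeper S_i
            table[x] += w
            if table[x] == 0:
                del table[x]              # weight back to zero: x is no longer present

    def estimate(self):
        """Use the highest-probability level holding at most k distinct elements."""
        for i, table in enumerate(self.tables):
            if len(table) <= self.k:
                return len(table) * (1 << i)
        return None


f0 = UpdateStreamF0(k=4, seed=7)
for x, w in [(11, +1), (7, +3), (4, +2), (7, -2), (11, -1)]:
    f0.update(x, w)
print(f0.estimate())  # 2: only 7 and 4 still have positive weight
```

Unlike the insert-only sampler, a level that was over its limit can become usable again after deletions, which is the behavior the k-set structure is meant to provide in small space.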
Distributed Streams
  Alice (Workspace = $$) observes Stream A: 11 54 21 11 2 45 21 1 … and sends Sketch(A) to the Referee
  Bob (Workspace = $$) observes Stream B: 1 5 21 2 54 21 35 … and sends Sketch(B) to the Referee
  Referee: Compute Dup-Ins-Sum(A,B)
Summary
  Distinct Sampling, extended to:
    Range-Efficiency (range-sampling)
    Update Streams (k-set structure)
    Sliding Windows (multiple samples)
Open Questions
  Can we efficiently handle higher-dimensional ranges? (Klee’s measure problem in the streaming model)
  Range-Efficient F0 under update streams
  Duplicate-insensitive Fk (k ≥ 2), range-efficient Fk