1
Range-Efficient Counting of Distinct Elements
Srikanta Tirthapura Iowa State University (joint work with Phillip Gibbons, Aduri Pavan)
2
IIT Kanpur Streams Workshop
Range-Efficient F0
Stream: [100,200], [0,10], [60,120], [5,25]
F0: |[0,25] ∪ [60,200]| = 26 + 141 = 167
[Figure: number line marking the endpoints 5, 10, 25, 60, 100, 120, 200]
11/21/2018 IIT Kanpur Streams Workshop
3
Range-Efficient F0
Input Stream: sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 <= li <= ri <= n, and li, ri are integers
Output: |[l1,r1] ∪ [l2,r2] ∪ … ∪ [lm,rm]|, i.e. the number of distinct elements in the union (F0)
Constraints:
  Single pass through the data
  Small workspace
  Fast processing time
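As a point of reference for the problem definition, the sketch below computes the exact answer offline by sorting and merging the ranges. It is not a streaming algorithm (it stores every range and makes no attempt to use small workspace), and the function name `range_f0_exact` is made up for illustration.

```python
def range_f0_exact(ranges):
    """Exact size of the union of closed integer ranges (l, r)."""
    total = 0
    cur_l = cur_r = None
    for l, r in sorted(ranges):
        if cur_r is None or l > cur_r + 1:
            # Gap before this range: close the current merged block.
            if cur_r is not None:
                total += cur_r - cur_l + 1
            cur_l, cur_r = l, r
        else:
            # Overlapping or adjacent: extend the current block.
            cur_r = max(cur_r, r)
    if cur_r is not None:
        total += cur_r - cur_l + 1
    return total


# The example stream from the earlier slide.
print(range_f0_exact([(100, 200), (0, 10), (60, 120), (5, 25)]))
```

A streaming algorithm has to approximate this quantity without keeping the ranges around, which is what the rest of the talk addresses.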
4
Reductions to Range-Efficient F0
Problems that reduce to Range-Efficient F0:
  Duplicate-Insensitive Sum
  Max-Dominance Norm
  Counting Triangles in Graphs
5
Duplicate-Insensitive Sum
Problem: Sum of all distinct elements in a stream of integers
Input Stream: sequence of integers S = a1, a2, …, an
Output: $\sum_{\text{distinct } a_i \in S} a_i$
Example: S = 4, 5, 15, 4, 100, 4, 16, 15
Distinct Elements = 4, 5, 15, 100, 16
Sum = 140
6
Reduction from Dup-Insensitive Sum to F0
Stream S from U = [0, m-1]; alternate stream S' from U' = [0, m^2-1]
Each element a of S becomes the range [a·m, a·m + a - 1] in S':
S   | S'
4   | [4m, 4m+1, …, 4m+3]
5   | [5m, …, 5m+4]
15  | [15m, …, 15m+14]
4   | [4m, …, 4m+3]
100 | [100m, …, 100m+99]
Duplicate-Insensitive Sum of S = Number of Distinct Elements of S'
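The reduction can be checked on the example stream with a short sketch. It maps each value a to the range [a·m, a·m + a - 1] and counts the union exactly; the helper names `union_size` and `dup_insensitive_sum_via_f0` are illustrative, and the exact offline count stands in for a real range-efficient F0 estimator.

```python
def union_size(ranges):
    """Exact size of the union of closed integer ranges (offline stand-in for F0)."""
    total, cur_l, cur_r = 0, None, None
    for l, r in sorted(ranges):
        if cur_r is None or l > cur_r + 1:
            if cur_r is not None:
                total += cur_r - cur_l + 1
            cur_l, cur_r = l, r
        else:
            cur_r = max(cur_r, r)
    if cur_r is not None:
        total += cur_r - cur_l + 1
    return total


def dup_insensitive_sum_via_f0(stream, m):
    """Sum of distinct values in `stream` (each in [0, m-1]) via range F0."""
    # Value a owns the block starting at a*m and contributes a elements of it;
    # a == 0 contributes an empty range and is skipped.
    ranges = [(a * m, a * m + a - 1) for a in stream if a > 0]
    return union_size(ranges)


S = [4, 5, 15, 4, 100, 4, 16, 15]
print(dup_insensitive_sum_via_f0(S, m=101), sum(set(S)))
```

Because a < m, blocks for different values never overlap, so repeated values add nothing and distinct values add exactly their own magnitude.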
7
Max Dominance Norm
Given k streams of m integers each (the elements of the streams arrive in an arbitrary order), where 1 ≤ ai,j ≤ n:
a1,1 a1,2 … a1,m
a2,1 a2,2 … a2,m
…
ak,1 ak,2 … ak,m
Return $\sum_{j=1}^{m} \max_{1 \le i \le k} a_{i,j}$
8
Reduction From Max Dominance Norm
Input stream I, output stream O: F0 of output stream = dominance norm of input stream
Assign ranges to the m positions: [1,n], [n+1,2n], …, [(m-1)n+1, mn]
When element ai,j is received, generate the range [(j-1)n+1, (j-1)n+ai,j]
Observation: F0 of the resulting stream of ranges is the dominance norm of the input stream
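A small sketch of this reduction, under the assumption that position j owns the block [(j-1)n+1, jn] and that a value a at position j covers the first a elements of that block. The names `union_size` and `max_dominance_norm_via_f0` are illustrative, and the exact offline union count stands in for a streaming F0 estimator.

```python
def union_size(ranges):
    """Exact size of the union of closed integer ranges (offline stand-in for F0)."""
    total, cur_l, cur_r = 0, None, None
    for l, r in sorted(ranges):
        if cur_r is None or l > cur_r + 1:
            if cur_r is not None:
                total += cur_r - cur_l + 1
            cur_l, cur_r = l, r
        else:
            cur_r = max(cur_r, r)
    if cur_r is not None:
        total += cur_r - cur_l + 1
    return total


def max_dominance_norm_via_f0(items, n):
    """items: iterable of (j, value), 1-based position j, 1 <= value <= n."""
    ranges = [((j - 1) * n + 1, (j - 1) * n + value) for j, value in items]
    return union_size(ranges)


# Two streams of three values each, delivered in arbitrary order.
streams = [[3, 7, 2], [5, 1, 2]]
items = [(j + 1, row[j]) for row in streams for j in range(3)]
items.reverse()
print(max_dominance_norm_via_f0(items, n=10))
```

Within one block the ranges are nested prefixes, so their union has the size of the largest value seen at that position.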
9
Talk Outline
Range-Efficient F0
Reductions among data stream problems
Algorithm for Range-Efficient F0 (building on distinct sampling)
Update Streams
Open Questions
10
Counting Distinct Elements (F0)
Example: How many different users accessed my website today?
Stream = 1, 1, 2, 3, 4, 1; F0 = 4
Numerous applications in databases and networking
Prior Work:
  Flajolet-Martin (1985)
  Alon, Matias and Szegedy (1996)
  Gibbons and Tirthapura (2001)
  Bar-Yossef et al. (2002) (currently most space-efficient)
  Indyk-Woodruff (2003) (lower bounds)
11
Range-Efficient F0 (Pavan and Tirthapura)
Range Sampling for 2-way Independent Hash Functions + Distinct Sampling Algorithm for F0
12
Sampling Based Algorithm for F0 (Gibbons and Tirthapura 2001)
D = distinct elements in stream
S0 = U = {1, 2, 3, …, n}
S1 = sample of S0 with p = 1/2, e.g. {2, 4, 7, …}; the algorithm tracks D ∩ S1
S2 = sample of S1 with p = 1/2, e.g. {4, 7, 11, …}; the algorithm tracks D ∩ S2
S0, S1, S2, … are stored implicitly using hash functions
13
Distinct Sampling
Sample = {}, p = 1
Target Workspace = 4 numbers
14
Distinct Sampling
Stream: 5
Sample = {5}, p = 1
Target Workspace = 4 numbers
15
Distinct Sampling
Stream: 5 3
Sample = {5,3}, p = 1
Target Workspace = 4 numbers
16
Distinct Sampling
Stream: 5 3 7
Sample = {5,3,7}, p = 1
Target Workspace = 4 numbers
17
Distinct Sampling
Stream: 5 3 7
Sample = {5,3,7}, p = 1
Target Workspace = 4 numbers
18
Distinct Sampling
Stream: 5 3 7 6
Sample = {5,3,7,6}, p = 1
Target Workspace = 4 numbers
19
Distinct Sampling
Stream: 5 3 7 6 8
Sample = {5,3,7,6,8}, p = 1
Overflow: Sample = Sample ∩ S1
Sample = {3,6,8}, p = ½
Target Workspace = 4 numbers
20
Distinct Sampling
Stream: 5 3 7 6 8 9
Sample = {3,6,8,9}, p = ½
Target Workspace = 4 numbers
21
Distinct Sampling
Stream: 5 3 7 6 8 9
Same decision for both
Sample = {3,6,8,9}, p = ½
Target Workspace = 4 numbers
22
Distinct Sampling
Stream: 5 3 7 6 8 9 2
Sample = {3,6,8,9,2}, p = ½
Overflow: Sample = Sample ∩ S2
Sample = {6,9}, p = ¼
Target Workspace = 4 numbers
23
Distinct Sampling
Stream: 5 3 7 6 8 9 2
Finally, Sample = {6,9}, p = ¼
Estimate of F0 = (Sample Size)(1/p) = (2)(4) = 8
24
Counting Distinct Elements
Finally, return a sample of distinct elements of the stream of a “large enough” size
If target workspace = O((1/ε²) log(1/δ)) integers, then the estimate of F0 is an (ε, δ)-approximation
Hash functions need only be pairwise independent and can be stored in small space
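The distinct sampling procedure walked through above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `DistinctSampler`, the choice of the Mersenne prime 2^61 - 1, and the threshold rule "x is in level i when h(x) < p / 2^i" are assumptions made for the example, and a is drawn from [1, p-1] to avoid the degenerate constant hash.

```python
import random

P = (1 << 61) - 1  # assumed prime modulus, larger than the universe


class DistinctSampler:
    def __init__(self, capacity, seed=None):
        rng = random.Random(seed)
        self.a = rng.randrange(1, P)   # pairwise-independent hash h(x) = (a*x + b) mod P
        self.b = rng.randrange(P)
        self.capacity = capacity       # target workspace, in numbers
        self.level = 0                 # current sampling probability is 2**-level
        self.sample = set()

    def _in_level(self, x, level):
        # Levels are nested: the threshold shrinks by half at each level.
        return (self.a * x + self.b) % P < (P >> level)

    def insert(self, x):
        if self._in_level(x, self.level):
            self.sample.add(x)         # duplicates get the same decision every time
            while len(self.sample) > self.capacity:
                # Overflow: keep only the elements that survive the next level.
                self.level += 1
                self.sample = {y for y in self.sample
                               if self._in_level(y, self.level)}

    def estimate(self):
        return len(self.sample) * (1 << self.level)


sampler = DistinctSampler(capacity=4, seed=7)
for x in [5, 3, 7, 5, 6, 8, 9, 9, 2]:
    sampler.insert(x)
print(sampler.sample, sampler.level, sampler.estimate())
```

Because membership in a level depends only on the hash of the element, a repeated element is either always kept or always dropped, which is what makes the sample a sample of distinct elements.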
25
[Figure: sampling using independent coin tosses, compared with distinct sampling using a hash function]
26
Adaptive Sampling for Range-Efficient F0
Naïve Approach: Given range [x,y], successively insert {x, x+1, …, y} into the F0 sampling algorithm
Problem: Time per range is very large
Range-Sampling: Given stream element [p,q], how can we sample all elements in [p,q] quickly?
At sampling level i, quickly compute |[p,q] ∩ Si|
27
Hash Functions, and S0,S1,S2…
h(x) = (ax + b) mod p, where p is prime and a, b are random in [0, p-1]
If h(x) ∈ [0, vi], then x ∈ Si
[Figure: the domain [1, n] mapped into [0, p-1], with thresholds v1, v2, v3 marked]
28
Range Sampling
f(x) = (ax + b) mod p
Compute |{x ∈ [x1,x2] : f(x) ∈ [0,v]}|
[Figure: the range [x1, x2] within [1, n] mapped into [0, p-1], with threshold v marked]
29
Arithmetic Progression
f(x) = (ax + b) mod p
f(x1), f(x1+1), … form an arithmetic progression modulo p with common difference a
[Figure: f(x1), f(x1+1), … plotted on [0, p-1], with threshold v marked]
30
Low and High Revolutions
Each revolution, the number of hits on [0,v] is:
  floor(v/a) (low revolution)
  floor(v/a) + 1 (high revolution)
Task: Count the number of low and high revolutions
31
Starting Points of Revolutions
Can find r = v mod a such that:
  If the starting point is in [0,r], then high revolution
  Else low revolution
Task: Count the number of revolutions with starting point in [0,r]
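The low/high classification can be sanity-checked by brute force. The sketch below assumes a revolution starts at some point s in [0, a-1] and then steps by a, and compares the direct count of points landing in [0, v] with the prediction "floor(v/a) + 1 when s ≤ v mod a, otherwise floor(v/a)". Both function names are made up for this example.

```python
def hits_in_revolution(start, a, v):
    """Count the points start, start + a, start + 2a, ... that fall in [0, v]."""
    return len(range(start, v + 1, a))


def predicted_hits(start, a, v):
    """Low revolution: floor(v/a) hits. High revolution: one more."""
    r = v % a
    return v // a + (1 if start <= r else 0)


a, v = 7, 25
for s in range(a):
    print(s, hits_in_revolution(s, a, v), predicted_hits(s, a, v))
```

Since the prediction depends only on whether the starting point lies in [0, r], counting hits over many revolutions reduces to counting starting points in that interval.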
32
Recursive Algorithm
Observation: Starting points form an arithmetic progression on the modulo a circle, with common difference (-p) mod a
[Figure: the modulo p circle (labels p-1, a, r) next to the modulo a circle (labels a-1, r)]
33
Recursive Algorithm
Focus on the common difference
Two reductions are possible, from common difference a to either:
  common difference a - (p mod a), or
  common difference (p mod a)
At least one of the two common differences is at most a/2
34
Range Sampling
Theorem: There is an algorithm for sampling range [x,y] using 2-way independent hash functions with
  Time complexity O(log (y-x))
  Space complexity O(log (y-x) + log m)
Plug back into distinct sampling to get a range-efficient F0 algorithm
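To make the range-sampling primitive concrete, here is a sketch that counts |{x ∈ [x1,x2] : (ax + b) mod p ∈ [0,v]}| without enumerating the range. Note that it uses the standard Euclidean-style "floor sum" recursion rather than the revolution-counting algorithm from the talk, so its step count is roughly logarithmic in p instead of matching the O(log (y-x)) bound in the theorem. The names `floor_sum` and `count_hits` are illustrative.

```python
def floor_sum(n, m, a, b):
    """Return sum(floor((a*i + b) / m) for i in range(n)); needs a, b >= 0, m > 0."""
    total = 0
    while True:
        if a >= m:
            total += (n - 1) * n // 2 * (a // m)
            a %= m
        if b >= m:
            total += n * (b // m)
            b %= m
        y_max = a * n + b
        if y_max < m:
            return total
        # Swap the roles of slope and modulus, as in the Euclidean algorithm.
        n, b = divmod(y_max, m)
        m, a = a, m


def count_hits(x1, x2, a, b, p, v):
    """Count x in [x1, x2] with (a*x + b) % p in [0, v], for 0 <= v <= p - 1."""
    n = x2 - x1 + 1
    if n <= 0:
        return 0
    a %= p
    start = (a * x1 + b) % p
    # A value misses [0, v] exactly when adding p - 1 - v carries it past
    # the next multiple of p.
    misses = floor_sum(n, p, a, start + p - 1 - v) - floor_sum(n, p, a, start)
    return n - misses


p, a, b, v = 101, 7, 3, 25
fast = count_hits(10, 60, a, b, p, v)
slow = sum(1 for x in range(10, 61) if (a * x + b) % p <= v)
print(fast, slow)
```

Plugged into a distinct sampler, a routine like this answers "how many elements of this range belong to the current sampling level" in one call per range.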
35
Input Stream: sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 <= li <= ri < n, and li, ri are integers
Output: |[l1,r1] ∪ [l2,r2] ∪ … ∪ [lm,rm]|
Results: Randomized (ε, δ)-approximation algorithm for range-efficient F0 of a data stream
Processing time (n is the size of the universe):
  Amortized processing time per interval: O(log(1/δ) log(n/ε))
  Time to answer a query for F0 is a constant
Workspace: O((1/ε²) log(1/δ) log n)
Pavan, Tirthapura, SICOMP (to appear)
36
Prior Work
Bar-Yossef, Kumar, Sivakumar 2002
  First studied range-efficient F0
  Algorithms with higher space complexity
Cormode, Muthukrishnan 2003
  Max-dominance norm
Nath, Gibbons, Seshan, Anderson 2004
  Duplicate-insensitive sum, assuming ideal hash functions
37
Comparison
Range-Efficient F0
      | Bar-Yossef et al. | Pavan and Tirthapura
Time  | O((log^5 n)(1/ε^5)(log 1/δ)) | O((log n + log 1/ε)(log 1/δ))
Space | O((1/ε^3)(log n)(log 1/δ)) | O((1/ε^2)(log n)(log 1/δ))
Max-Dominance Norm
      | Cormode, Muthukrishnan | Pavan and Tirthapura
Time  | O((1/ε^4)(log n)(log m)(log 1/δ)) | O((log n + log 1/ε)(log 1/δ))
Space | O((1/ε^2)(log n + (1/ε)(log m)(log log m))(log 1/δ)) | O((1/ε^2)(log m + log n)(log 1/δ))
38
Other Applications of Distinct Sampling
Sample of distinct elements of the stream of any desired target size
Approximate median of all distinct elements in the stream (duplicate-insensitive median)
Distinct frequent elements (“heavy hitters” in network monitoring)
39
Update Streams
Insertions and deletions of elements into the streams: (11, +1), (7, +3), (4, +2), (7, -2), (11, -1), …
Distinct Elements Problem: How many elements have a positive cumulative weight?
Assume a “sanity constraint”: no element has weight less than 0
The sampling algorithm described so far fails, since it can only decrease the sampling probability as the stream becomes larger
40
Distinct Sampling on Update Streams (three independent approaches)
Sumit Ganguly, Minos N. Garofalakis, Rajeev Rastogi: Processing Set Expressions over Continuous Update Streams. SIGMOD 2003; followed up by Ganguly 2005 and Ganguly, Majumder 2006
Graham Cormode, S. Muthukrishnan, Irina Rozenbaum: Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. VLDB 2005
Gereon Frahling, Piotr Indyk, Christian Sohler: Sampling in Dynamic Data Streams and Applications. SoCG 2005
41
Distinct Elements on Update Streams
Use of K-Set Structure in storing samples
  Ganguly, Garofalakis, Rastogi 2003
  Ganguly 2005
  Ganguly, Majumder 2006
42
K-Set Structure
Small-space data structure for multi-set S (size Õ(K))
Operations:
  Insert (x,v) into S
  Delete (x,v’) from S
  Membership query (is x in S?)
  What is the number of distinct elements in S?
If |S| ≤ K, then queries are answered correctly
[Figure: size of S over time relative to the threshold K, with phases labeled Active, Silent, Active]
43
Counting Distinct Elements on Update Streams
Sample the stream at different probabilities: 1, ½, ¼, …
Store each of (D ∩ S0, D ∩ S1, D ∩ S2, …) in a k-set structure for an appropriate value of k
When queried, use the highest probability sample that hasn’t overflowed
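The query logic can be illustrated with a sketch in which an exact dictionary stands in for the k-set structure. That stand-in is an assumption made purely for illustration: it is not small-space, and it only mimics the interface by refusing to answer when more than k elements are present. The class names, the number of levels, and the hash parameters are likewise hypothetical.

```python
import random

P = (1 << 61) - 1  # assumed prime modulus for the hash function


class KSetStandIn:
    """Exact dictionary mimicking a k-set structure's interface (not its space bound)."""

    def __init__(self, k):
        self.k = k
        self.weights = {}

    def update(self, x, delta):
        w = self.weights.get(x, 0) + delta
        if w == 0:
            self.weights.pop(x, None)
        else:
            self.weights[x] = w

    def distinct(self):
        # Answer only while at most k elements are present, else stay silent.
        return len(self.weights) if len(self.weights) <= self.k else None


class UpdateStreamF0:
    def __init__(self, k, levels=32, seed=None):
        rng = random.Random(seed)
        self.a = rng.randrange(1, P)
        self.b = rng.randrange(P)
        self.structs = [KSetStandIn(k) for _ in range(levels)]

    def update(self, x, delta):
        h = (self.a * x + self.b) % P
        for i, s in enumerate(self.structs):
            if h >= (P >> i):
                break              # levels are nested, so deeper ones miss too
            s.update(x, delta)

    def estimate(self):
        # Use the highest-probability level that is not overflowed.
        for i, s in enumerate(self.structs):
            d = s.distinct()
            if d is not None:
                return d * (1 << i)
        return None


f = UpdateStreamF0(k=100, seed=1)
for x, d in [(11, +1), (7, +3), (4, +2), (7, -2), (11, -1)]:
    f.update(x, d)
print(f.estimate())
```

Keeping every level up to date is what lets the estimate move back to a higher sampling probability after deletions, which the single-sample algorithm cannot do.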
44
Distributed Streams
Alice observes Stream A with Workspace = $$ and sends Sketch(A) to the Referee
Bob observes Stream B with Workspace = $$ and sends Sketch(B) to the Referee
The Referee computes Dup-Ins-Sum(A,B)
45
Summary
Distinct Sampling
  Range-Efficiency (range-sampling)
  Update Streams (k-set structure)
  Sliding Windows (multiple samples)
46
Open Questions
Can we efficiently handle higher-dimensional ranges?
Klee’s measure problem in the streaming model
47
Open Questions
Range-efficient F0 under update streams
Duplicate-insensitive Fk (k ≥ 2), range-efficient Fk