Range-Efficient Counting of Distinct Elements

Range-Efficient Counting of Distinct Elements Srikanta Tirthapura Iowa State University (joint work with Phillip Gibbons, Aduri Pavan)

Range-Efficient F0. Stream: [100,200], [0,10], [60,120], [5,25]. F0: |[0,25] ∪ [60,200]| = 167. (Figure: number line with the endpoints 5, 10, 25, 60, 100, 120, 200 marked.) 11/21/2018 IIT Kanpur Streams Workshop

Range-Efficient F0. Input Stream: a sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 <= li <= ri <= n and li, ri are integers. Output: |[l1,r1] ∪ [l2,r2] ∪ … ∪ [lm,rm]|, i.e. the number of distinct elements in the union (F0). Constraints: a single pass through the data, small workspace, fast processing time.
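As a point of reference for the problem statement, the quantity being approximated can be computed exactly offline by sorting and merging the ranges. This is only a baseline sketch: it stores every range, so it does not meet the single-pass, small-workspace constraints. The function name `union_size` is hypothetical.

```python
def union_size(ranges):
    """Exact size of the union of closed integer ranges (offline baseline)."""
    total = 0
    cur_lo = cur_hi = None
    for lo, hi in sorted(ranges):
        if cur_hi is None or lo > cur_hi + 1:
            # The current merged block is finished; count its integers.
            if cur_hi is not None:
                total += cur_hi - cur_lo + 1
            cur_lo, cur_hi = lo, hi
        else:
            # Overlapping or adjacent range: extend the current block.
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo + 1
    return total


print(union_size([(100, 200), (0, 10), (60, 120), (5, 25)]))  # 167
```

On the example stream from the first slide, the merged blocks are [0,25] and [60,200], which contain 26 + 141 = 167 integers.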

Reductions to Range-Efficient F0: Duplicate-Insensitive Sum, Max-Dominance Norm, and Counting Triangles in Graphs each reduce to Range-Efficient F0.

Duplicate-Insensitive Sum. Problem: sum of all distinct elements in a stream of integers. Input Stream: a sequence of integers S = a1, a2, …, an. Output: Σ ai, taken over the distinct ai in S. Example: S = 4, 5, 15, 4, 100, 4, 16, 15. Distinct elements = 4, 5, 15, 100, 16. Sum = 140.

Reduction from Dup-Insensitive Sum to F0. The stream S from U = [0, m-1] is mapped to an alternate stream S' from U' = [0, m^2-1]:
4 -> [4m, 4m+1, …, 4m+3]
5 -> [5m, …, 5m+4]
15 -> [15m, …, 15m+14]
4 -> [4m, …, 4m+3]
100 -> [100m, …, 100m+99]
Duplicate-insensitive sum of S = number of distinct elements of S'.
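The reduction can be checked on the example stream with a few lines of Python. This is a minimal sketch: each integer a becomes the closed range [a·m, a·m + a − 1], which holds exactly a integers, and the distinct count is then done by brute force rather than by a streaming algorithm. The names `to_ranges` and `distinct_count` and the choice m = 101 are assumptions for illustration.

```python
def to_ranges(stream, m):
    """Map each integer a in [0, m-1] to the closed range [a*m, a*m + a - 1]."""
    # a = 0 contributes nothing to the sum, so it produces no range.
    return [(a * m, a * m + a - 1) for a in stream if a > 0]


def distinct_count(ranges):
    """Brute-force F0 of a stream of closed integer ranges."""
    covered = set()
    for lo, hi in ranges:
        covered.update(range(lo, hi + 1))
    return len(covered)


S = [4, 5, 15, 4, 100, 4, 16, 15]
m = 101  # any m larger than every stream element works
print(distinct_count(to_ranges(S, m)))  # 140
print(sum(set(S)))                      # 140
```

Ranges for different values never overlap because a·m + a − 1 < (a + 1)·m whenever a < m, and repeated values produce identical ranges, which is why F0 of S' equals the duplicate-insensitive sum of S.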

Max Dominance Norm. Given k streams of m integers each (the elements of the streams arrive in an arbitrary order), where 1 ≤ ai,j ≤ n:
a1,1 a1,2 … a1,m
a2,1 a2,2 … a2,m
…
ak,1 ak,2 … ak,m
Return Σ_{j=1..m} max_{1 ≤ i ≤ k} ai,j.

Reduction from Max Dominance Norm. Input stream I, output stream O: F0 of the output stream = dominance norm of the input stream. Assign ranges to the m positions: [1,n], [n+1,2n], …, [(m-1)n+1, mn]. When element ai,j is received, generate the range [(j-1)n+1, (j-1)n+ai,j], which covers the first ai,j integers of position j's block. Observation: F0 of the resulting stream of ranges is the dominance norm of the input stream.
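A small sketch of this reduction, under the assumption that position j owns the block of n integers [(j−1)n+1, jn] and that a value a covers the first a integers of its block. With that reading, the union inside block j has size max over i of ai,j, and blocks for different positions are disjoint. The function names and the sample data are hypothetical, and the distinct count is done by brute force.

```python
def dominance_ranges(streams, n):
    """Turn k streams of m values (each in [1, n]) into closed integer ranges."""
    ranges = []
    for row in streams:
        for j, value in enumerate(row):  # j is 0-based here
            ranges.append((j * n + 1, j * n + value))
    return ranges


def distinct_count(ranges):
    """Brute-force F0 of a stream of closed integer ranges."""
    covered = set()
    for lo, hi in ranges:
        covered.update(range(lo, hi + 1))
    return len(covered)


streams = [[3, 1, 4],
           [2, 5, 1]]
n = 5
print(distinct_count(dominance_ranges(streams, n)))   # 12
print(sum(max(col) for col in zip(*streams)))         # 3 + 5 + 4 = 12
```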

Talk Outline: Range-Efficient F0; Reductions Among Data Stream Problems; Algorithm for Range-Efficient F0 (building on distinct sampling); Update Streams; Open Questions.

Counting Distinct Elements (F0). Example: How many different users accessed my website today? Stream = 1,1,2,3,4,1,2; F0 = 4. Numerous applications in databases and networking. Prior Work: Flajolet-Martin (1985); Alon, Matias and Szegedy (1996); Gibbons and Tirthapura (2001); Bar-Yossef et al. (2002) (currently most space-efficient); Indyk-Woodruff (2003) (lower bounds).

Range-Efficient F0 (Pavan and Tirthapura) = Distinct Sampling Algorithm for F0 + Range Sampling for 2-way Independent Hash Functions.

Sampling Based Algorithm for F0 (Gibbons and Tirthapura 2001). D = distinct elements in the stream. S0 = U = {1,2,3,…,n}. S1 = {2,4,7,…} is a sample of S0 with p = 1/2; S2 = {4,7,11,…} is a sample of S1 with p = 1/2. The algorithm works with D ∩ S1, D ∩ S2, …. S0, S1, S2, … are stored implicitly using hash functions.

Distinct Sampling. Sample = {}, p = 1. Target workspace = 4 numbers.

Distinct Sampling. Stream so far: 5. Sample = {5}, p = 1. Target workspace = 4 numbers.

Distinct Sampling. Stream so far: 5 3. Sample = {5,3}, p = 1. Target workspace = 4 numbers.

Distinct Sampling. Stream so far: 5 3 7. Sample = {5,3,7}, p = 1. Target workspace = 4 numbers.

Distinct Sampling. Stream so far: 5 3 7 6. Sample = {5,3,7,6}, p = 1. Target workspace = 4 numbers.

Distinct Sampling. Stream so far: 5 3 7 6 8. Sample = {5,3,7,6,8}, p = 1. Overflow: Sample = Sample ∩ S1, giving Sample = {3,6,8}, p = ½. Target workspace = 4 numbers.

Distinct Sampling. Stream so far: 5 3 7 6 8 9. Sample = {3,6,8,9}, p = ½. Target workspace = 4 numbers.

Distinct Sampling. Stream so far: 5 3 7 6 8 9. Same decision for both. Sample = {3,6,8,9}, p = ½. Target workspace = 4 numbers.

Distinct Sampling. Stream so far: 5 3 7 6 8 9 2. Sample = {3,6,8,9,2}, p = ½. Overflow: Sample = Sample ∩ S2, giving Sample = {6,9}, p = ¼. Target workspace = 4 numbers.

Distinct Sampling. Stream: 5 3 7 6 8 9 2. Finally, Sample = {6,9}, p = ¼. Estimate of F0 = (Sample Size)(4) = 8.

Counting Distinct Elements. Finally, return a sample of distinct elements of the stream of a “large enough” size. If the target workspace is O((1/ε²) log(1/δ)) integers, then the estimate of F0 is an (ε, δ)-approximation. Hash functions need only be pairwise independent and can be stored in small space.
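The walkthrough above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the class name `DistinctSampler`, the Mersenne prime 2^61 − 1, and the rule "x is in Si when h(x) < p >> i" are assumptions chosen so that each level keeps roughly half of the previous one. A single pairwise-independent hash h(x) = (ax + b) mod p decides membership in every level, so repeated occurrences of an element always get the same decision.

```python
import random


class DistinctSampler:
    """Keep D ∩ S_level, raising the level whenever the sample overflows."""

    def __init__(self, capacity, seed=0, prime=(1 << 61) - 1):
        rng = random.Random(seed)
        self.p = prime
        self.a = rng.randrange(1, prime)   # hash h(x) = (a*x + b) mod p
        self.b = rng.randrange(prime)
        self.capacity = capacity           # target workspace, in elements
        self.level = 0                     # current sampling probability is 2**-level
        self.sample = set()

    def _in_level(self, x, level):
        # S_0 is everything; each further level halves the accepted hash range.
        return (self.a * x + self.b) % self.p < (self.p >> level)

    def add(self, x):
        if not self._in_level(x, self.level):
            return
        self.sample.add(x)
        while len(self.sample) > self.capacity:
            # Overflow: move to the next level and keep only surviving elements.
            self.level += 1
            self.sample = {y for y in self.sample if self._in_level(y, self.level)}

    def estimate(self):
        return len(self.sample) * (1 << self.level)


sampler = DistinctSampler(capacity=16)
for item in [1, 1, 2, 3, 4, 1, 2]:
    sampler.add(item)
print(sampler.estimate())  # 4, exact because the sample never overflowed
```

When the number of distinct elements exceeds the capacity, the estimate becomes approximate, and its quality depends on the capacity in the way the bound above describes.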

Sampling Using Independent Coin Tosses versus Distinct Sampling Using Hash Functions. (Figure: hash function; diagram omitted.)

Adaptive Sampling for Range-Efficient F0. Naïve approach: given range [x,y], successively insert {x, x+1, …, y} into the F0 sampling algorithm. Problem: the time per range is very large. Range-Sampling: given stream element [p,q], how can we sample all elements in [p,q] quickly? At sampling level i, quickly compute |[p,q] ∩ Si|.

Hash Functions, and S0, S1, S2, …. h(x) = (ax+b) mod p, where p is prime and a, b are random in [0,p-1]. If h(x) ∈ [0,vi], then x ∈ Si. (Figure: h maps [1,n] into [0,p-1], with thresholds v1, v2, v3 marked.)

Range Sampling. With f(x) = (ax+b) mod p mapping [1,n] into [0,p-1], compute |{x ∈ [x1,x2] : f(x) ∈ [0,v]}|.

Arithmetic Progression. With f(x) = (ax+b) mod p, the values f(x1), f(x1+1), …, f(x2) form an arithmetic progression on the modulo-p circle with common difference a.

Low and High Revolutions. In each revolution around the modulo-p circle, the number of hits on [0,v] is floor(v/a) (low revolution) or floor(v/a) + 1 (high revolution). Task: count the number of low and high revolutions.

Starting Points of Revolutions. The starting point of a revolution is its first point, which lies in [0,a-1]. Can find r = v mod a (that is, r = v - floor(v/a)·a) such that: if the starting point is in [0,r], then the revolution is high; else it is low. Task: count the number of revolutions with starting point in [0,r].
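The low/high distinction can be checked by brute force on small numbers. In this sketch a revolution is the run of points s, s+a, s+2a, … below p, with its starting point s in [0, a−1]; the threshold used on s is v mod a, which is this sketch's reading of the rule, and the function names and the values p = 101, a = 7, v = 30 are arbitrary choices for illustration.

```python
def hits_in_revolution(s, a, p, v):
    """Count points s, s+a, s+2a, ... (all below p) that land in [0, v]."""
    return sum(1 for y in range(s, p, a) if y <= v)


def predicted_hits(s, a, v):
    """floor(v/a) + 1 for a high revolution, floor(v/a) for a low one."""
    return v // a + (1 if s <= v % a else 0)


p, a, v = 101, 7, 30
for s in range(a):
    print(s, hits_in_revolution(s, a, p, v), predicted_hits(s, a, v))
```

For these values v mod a = 2, so starting points 0, 1, 2 give 5 hits and starting points 3 to 6 give 4 hits.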

Recursive Algorithm. Observation: the starting points of the revolutions form an arithmetic progression with difference (-p mod a). Counting starting points in [0,r] is therefore the same problem on the smaller modulo-a circle [0,a-1] instead of the modulo-p circle [0,p-1].

Recursive Algorithm. Focus on the common difference. Two reductions are possible from common difference a: to common difference a - (p mod a), or to common difference (p mod a). At least one of the two common differences is at most a/2, since they sum to a.
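The claim that starting points form an arithmetic progression modulo a can be verified directly. This sketch walks the progression with difference a around the modulo-p circle, records where each revolution starts, and compares against the closed form (s0 − k·(p mod a)) mod a. The function name and the sample values are hypothetical.

```python
def starting_points(s0, a, p, count):
    """Starting points of the first `count` revolutions, beginning at s0 in [0, a-1]."""
    points = []
    y = s0
    for _ in range(count):
        points.append(y)
        while y < p:      # walk to the end of this revolution
            y += a
        y -= p            # wrap around; the result is again in [0, a-1]
    return points


p, a, s0 = 101, 7, 0
observed = starting_points(s0, a, p, 6)
predicted = [(s0 - k * (p % a)) % a for k in range(6)]
print(observed)   # [0, 4, 1, 5, 2, 6]
print(predicted)  # [0, 4, 1, 5, 2, 6]
```

Here p mod a = 3, so each starting point is the previous one minus 3 modulo 7; equivalently, it advances by a − (p mod a) = 4. Whichever of the two is smaller can serve as the common difference of the next, smaller problem.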

Range Sampling. Theorem: There is an algorithm for sampling range [x,y] using 2-way independent hash functions with time complexity O(log (y-x)) and space complexity O(log (y-x) + log m). Plug this back into distinct sampling to get a range-efficient F0 algorithm.
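For readers who want something runnable, the count |{x ∈ [x1,x2] : (ax+b) mod p ∈ [0,v]}| can also be obtained with a different, well-known technique: a Euclid-style "floor sum" recurrence. This is not the revolution-counting recursion from the talk, and its step count is logarithmic in the modulus and multiplier rather than in the range length; it is shown only as a sketch of how such counts can be computed without enumerating the range. It relies on the identity that floor(y/p) − floor((y − v − 1)/p) equals 1 exactly when y mod p ≤ v. The function names are hypothetical.

```python
def floor_sum(n, m, a, b):
    """Return sum(floor((a*i + b) / m) for i in range(n)); needs n >= 0, m >= 1, a, b >= 0."""
    total = 0
    while True:
        if a >= m:
            total += (n - 1) * n // 2 * (a // m)
            a %= m
        if b >= m:
            total += n * (b // m)
            b %= m
        y_max = a * n + b
        if y_max < m:
            return total
        # Swap the roles of multiplier and modulus, as in Euclid's algorithm.
        n, b, m, a = y_max // m, y_max % m, a, m


def count_hits(x1, x2, a, b, p, v):
    """Count x in [x1, x2] with (a*x + b) % p in [0, v], for 0 <= v < p."""
    n = x2 - x1 + 1
    c = (a * x1 + b) % p  # hash of the left endpoint
    # Shift the second sum by +p to keep its offset non-negative, then undo with +n.
    return floor_sum(n, p, a, c) + n - floor_sum(n, p, a, c + p - v - 1)


print(count_hits(0, 6, 3, 1, 7, 2))  # 3: the hashes 1, 0 and 2 fall in [0, 2]
```

Plugged into a distinct-sampling routine, a count like this tells how many elements of an incoming range fall in the current sampling level without touching each element.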

Input Stream: a sequence of ranges [l1,r1], [l2,r2], …, [lm,rm], where for each i, 0 <= li <= ri < n and li, ri are integers. Output: |[l1,r1] ∪ [l2,r2] ∪ … ∪ [lm,rm]|. Results: a randomized (ε,δ)-approximation algorithm for range-efficient F0 of a data stream. Processing time (n is the size of the universe): the amortized processing time per interval is O(log(1/δ) log(n/ε)); the time to answer a query for F0 is a constant. Workspace: O((1/ε²) log(1/δ) log n). Pavan, Tirthapura, SICOMP (to appear).

Prior Work. Bar-Yossef, Kumar, Sivakumar 2002: first studied range-efficient F0; algorithms with higher space complexity. Cormode, Muthukrishnan 2003: max-dominance norm. Nath, Gibbons, Seshan, Anderson 2004: duplicate-insensitive sum, assuming ideal hash functions.

Comparison
Range-Efficient F0 | Bar-Yossef et al. | Pavan and Tirthapura
Time | O((1/ε⁵)(log⁵ n)(log 1/δ)) | O((log n + log 1/ε)(log 1/δ))
Space | O((1/ε³)(log n)(log 1/δ)) | O((1/ε²)(log n)(log 1/δ))
Max-Dominance Norm | Cormode, Muthukrishnan | Pavan and Tirthapura
Time | O((1/ε⁴)(log n)(log m)(log 1/δ)) | O((log n + log 1/ε)(log 1/δ))
Space | O((1/ε²)(log n + (1/ε)(log m)(log log m))(log 1/δ)) | O((1/ε²)(log m + log n)(log 1/δ))

Other Applications of Distinct Sampling. A sample of distinct elements of the stream of any desired target size. Approximate median of all distinct elements in the stream (duplicate-insensitive median). Distinct frequent elements (“heavy hitters” in network monitoring).

Update Streams. Insertions and deletions of elements into the streams: (11, +1), (7, +3), (4, +2), (7, -2), (11, -1), … Distinct Elements Problem: how many elements have a positive cumulative weight? Assume a “sanity constraint”: no element has weight less than 0. The sampling algorithm described so far fails, since it can only decrease the sampling probability as the stream becomes larger.

Distinct Sampling on Update Streams (three independent approaches). Sumit Ganguly, Minos N. Garofalakis, Rajeev Rastogi: Processing Set Expressions over Continuous Update Streams. SIGMOD 2003; followed up by Ganguly, 2005 and Ganguly, Majumder 2006. Graham Cormode, S. Muthukrishnan, Irina Rozenbaum: Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. VLDB 2005. Gereon Frahling, Piotr Indyk, Christian Sohler: Sampling in dynamic data streams and applications. SoCG 2005.

Distinct Elements on Update Streams. Use of the K-Set structure in storing samples: Ganguly, Garofalakis, Rastogi 2003; Ganguly 2005; Ganguly, Majumder 2006.

K-Set Structure. A small-space data structure for a multi-set S (size Õ(K)). Operations: Insert (x,v) into S; Delete (x,v') from S; Membership query (is x in S?); What is the number of distinct elements in S? If |S| ≤ K, then queries are answered correctly. (Figure: states labelled Active, Silent, Active around the threshold K.)

Counting Distinct Elements on Update Streams. Sample the stream at different probabilities: 1, ½, ¼, …. Store each of (D ∩ S0, D ∩ S1, D ∩ S2, …) in a k-set structure for an appropriate value of k. When queried, use the highest-probability sample that hasn’t overflowed yet.
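The query logic can be illustrated with a deliberately simplified sketch. A real k-set structure uses only about K space per level; here each level is a plain dictionary of cumulative weights, so the code is not space-efficient and stands in for the k-set structure purely to show how levels are updated and which level answers a query. The class name, the hash parameters, and the default of 32 levels are assumptions.

```python
import random
from collections import defaultdict


class UpdateStreamDistinctCounter:
    """One weight table per sampling level; dictionaries stand in for k-set structures."""

    def __init__(self, k, levels=32, seed=0, prime=(1 << 61) - 1):
        rng = random.Random(seed)
        self.p = prime
        self.a = rng.randrange(1, prime)   # hash h(x) = (a*x + b) mod p
        self.b = rng.randrange(prime)
        self.k = k
        self.weights = [defaultdict(int) for _ in range(levels)]

    def update(self, x, delta):
        h = (self.a * x + self.b) % self.p
        for i, table in enumerate(self.weights):
            if h >= (self.p >> i):
                break                      # x is not in S_i or any deeper level
            table[x] += delta
            if table[x] == 0:
                del table[x]               # weight returned to zero: no longer present

    def estimate(self):
        # Use the highest-probability level that currently holds at most k elements.
        for i, table in enumerate(self.weights):
            if len(table) <= self.k:
                return len(table) << i
        raise RuntimeError("every level holds more than k elements")


counter = UpdateStreamDistinctCounter(k=8)
for x, delta in [(11, +1), (7, +3), (4, +2), (7, -2), (11, -1)]:
    counter.update(x, delta)
print(counter.estimate())  # 2: only 7 and 4 still have positive weight
```

Because deletions shrink a level's table again, a level that was too full earlier can become usable later, which is the behaviour the insert-only sampler lacks.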

Distributed Streams. Alice (workspace = $$) observes Stream A = 11, 54, 21, 11, 2, 45, 21, 1, … and sends Sketch(A) to the Referee. Bob (workspace = $$) observes Stream B = 1, 5, 21, 2, 54, 21, 35, … and sends Sketch(B) to the Referee. The Referee computes Dup-Ins-Sum(A,B).

Summary. Distinct Sampling extends to: Range-Efficiency (range-sampling); Update Streams (k-set structure); Sliding Windows (multiple samples).

Open Questions. Can we efficiently handle higher-dimensional ranges? Klee’s measure problem in the streaming model.

Open Questions. Range-efficient F0 under update streams. Duplicate-insensitive Fk (k ≥ 2); range-efficient Fk.