Sampling for Windows on Data Streams by Vladimir Braverman
Data Stream. Sequence of elements D = p_1, p_2, …, p_N, where each p_i is drawn from [m]. Objective: calculate a function f(D). Restrictions: single pass, sub-linear memory, fast processing time (per element). [Figure: the stream p_1, p_2, …, p_N along the time axis.]
Motivation. Today’s applications: huge amounts of data are whizzing by. Objective: mining the data, computing statistics, etc. Restrictions: expensive overhead is not allowed. Useful for many applications: networking, databases, etc.
Data Stream. Intensive theoretical research. Streaming systems: Stream (Stanford), StreamMill (UCLA), Aurora (Brown), GigaScope (Rutgers), Nile (Purdue), Niagara (Wisconsin), Telegraph (Berkeley), etc.
Data Stream. The model allows insertions only. What about deletions? Turnstile model. Sliding windows. [Figure: the stream p_1, p_2, …, p_N along the time axis.]
Sliding Windows. A sliding window (SW) contains the n most recent elements, which are “active”. Older elements are “expired”. [Figure: stream p_1, …, p_8 with n = 5, expired and active elements marked.]
Sequence-based Windows. The window holds the n most recent elements (n = 5 in the figure; in general n is “huge”). These elements are “active”; older elements are “expired”. [Figure: stream p_1, …, p_N along the time axis with n = 5, expired and active elements marked.]
Timestamp-based windows. [Figure: elements p_1, …, p_13 placed along the time axis.]
What is known on sliding windows:
[BDM 02] Random sampling
[DGIM 02] Sum, count, average, L_p for 0 < p ≤ 2, weakly additive functions
[DM 02] Rarity, similarity
[GT 02] Distributed sum, count
[FKZ 02], [CS 04] Diameter
[BDMO 03] Variance, k-medians
[GDDLM 03] Frequent elements
[AM 04] Counts, quantiles
[AGHLRS 04] LIS
[LT 06] Frequent items
[LT 06] Count
[ZG 06] Variance
[CCM 07] Entropy
Random Sampling
Fundamental approximation method. Pick a subset S of D. Use f(S) to approximate f(D). [Figure: the stream p_1, …, p_N with the sampled elements marked.]
Types of k-sampling. With replacement: samples x_1, …, x_k are independent. Without replacement: repetitions are forbidden, i.e., x_i ≠ x_j for i ≠ j.
Properties of Random Sampling. General, simple, first-to-try method. Stores elements rather than an aggregate. Allows changing f a posteriori. Can be used for multiple statistics. Provides effective solutions with worst-case guarantees. The only known solution for many problems.
Some Known Methods for Data Streams: Reservoir Sampling [V 85], Concise Sampling [GM 98], Inverse Sampling [CMR 05], Weighted Sampling [CMN 99], Biased Sampling [A 06], Priority Sampling [ADLT 05], Dynamic Sampling [FIS 05], Chain Sampling [BDM 02].
Streaming Sampling. Easy if N is fixed: pick a random index I from {1, 2, …, N} and output p_I. But N is not known in advance. Naïve methods: store the whole stream (linear memory), or “guess” the final value of N (not really uniform).
Reservoir Sampling (Vitter 85). Maintains k uniform samples without replacement using Θ(k) space. Outputs a sample for every prefix. Intuition: the probability of picking p_i decreases as N grows, so the probabilities can be adjusted dynamically.
Reservoir Sampling (Vitter 85). Reservoir (array) of k elements, initially empty. Algorithm: insert the first k elements into the reservoir. For i > k, pick p_i with probability k/i. If p_i is chosen, pick one of the samples in the reservoir uniformly at random and replace it with p_i.
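The reservoir algorithm can be sketched in a few lines of Python. This is a minimal illustration, not Vitter's optimized variant; the function name `reservoir_sample` and the optional `rng` parameter are assumptions introduced here. The i-th element (i > k) is accepted with probability k/i, which keeps every prefix sample uniform.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform k-sample without replacement from an iterable.

    Hypothetical helper for illustration; uses O(k) memory and one pass.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k elements fill the reservoir.
            reservoir.append(item)
        elif rng.random() < k / i:
            # Accept p_i with probability k/i and evict a uniformly chosen slot.
            reservoir[rng.randrange(k)] = item
    return reservoir
```

If the stream has fewer than k elements, the sketch simply returns all of them.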
Sampling on Sliding Windows: Problem Definition. Maintain uniform random sampling on sliding windows. Output a sample for every window. Use provably optimal memory.
Sampling for Sliding Windows. Can we use previous methods? No: samples expire. [Figure: stream p_1, …, p_8 along the time axis with window size n = 5.]
Naïve Approach. Store the whole window: linear memory, but f(W) can then be computed directly.
Periodic Sampling. Pick a sample p_i from the first window. When p_i expires, take the new element. Continue… [Figure: stream p_1, …, p_8 along the time axis with window size n = 5.]
Periodic Sampling: problems. Vulnerability to malicious behavior: given one sample, it is possible to predict all future samples. Poor representation of periodic data if the period “agrees” with the sample. Unacceptable for applications.
Sampling on Sliding Windows: Problem Definition. Maintain uniform random sampling on sliding windows. Use provably optimal memory. Samples on distinct windows are independent.
Chain and Priority Methods (Babcock, Datar, Motwani, SODA). Maintain uniform random sampling on sliding windows. Chain Sampling: sequence-based windows, with replacement; uses optimal memory in expectation and O(k log n) w.h.p.; samples on distinct windows are weakly dependent. Priority Sampling: timestamp-based windows, with replacement; uses optimal memory in expectation and w.h.p.; samples on distinct windows are independent.
S3 Algorithms. Maintain uniform random sampling on sliding windows. Support all cases. Provably optimal. Samples on distinct windows are independent.
Window Sampling Taxonomy. Windows are either sequence-based or timestamp-based; sampling is either with replacement or without replacement.
Sampling With Replacement on Sequence-Based Windows
Sampling | Memory | Dependency
Naïve | O(n) | No
Periodic | O(k) | Yes
Chain (BDM 02) | O(k) in expectation | Weak
S3 (our result) | O(k) | No
Sampling Without Replacement on Sequence-Based Windows
Sampling | Memory | Dependency
Naïve | O(n) | No
Periodic | O(k) | Yes
S3 | O(k) | No
Sampling With Replacement on Time-Based Windows
Sampling | Memory | Dependency
Naïve | O(n) | No
Priority (BDM 02) | O(k log n) w.h.p. | No
S3 | O(k log n) | No
Sampling Without Replacement on Time-Based Windows
Sampling | Memory | Dependency
Naïve | O(n) | No
S3 | O(k log n) | No
Window Sampling S3: Recap
Sampling | Sequence-based | Timestamp-based
With Replacement | O(k) | O(k log n)
Without Replacement | O(k) | O(k log n)
Concepts. Prior algorithms: a replacement policy for expired samples. S3 algorithms: divide the stream into buckets, keep sample(s) for each bucket, and apply a combination rule.
Sampling With Replacement for Sequence-Based Windows
Notations. [Figure: the stream p_1, …, p_N, p_{N+1}, … along the time axis, divided into buckets B_1, B_2, …, B_{N/n}, B_{N/n+1}; legend: active element, expired element, future element, bucket.]
The Algorithm (for one sample). Divide D into buckets of size n. Maintain a random sample for each bucket (reservoir algorithm). Combine the samples of the buckets that have active elements; there are at most two such buckets. [Figure: buckets B_1, B_2, …, B_{N/n}, B_{N/n+1} with their samples R_1, R_2, …, R_{N/n}, R_{N/n+1} along the time axis.]
[Figure: the two buckets B_{N/n} and B_{N/n+1} that contain active elements, their samples R_1 and R_2, and the output sample X; panels: Case 1, Case 2.]
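The bucket scheme for a single sample can be sketched as follows. The combination rule used here is an assumption consistent with the slides: output the older bucket's sample if it is still active, otherwise the reservoir sample of the newest (partial) bucket. The class name `WindowSampler` and its methods are hypothetical. If j elements of the newest bucket have arrived, an active element of the older bucket is output w.p. 1/n, and a new element w.p. (j/n)(1/j) = 1/n.

```python
import random


class WindowSampler:
    """One uniform sample from the last n stream elements (illustrative sketch)."""

    def __init__(self, n, rng=None):
        self.n = n
        self.rng = rng or random.Random()
        self.count = 0        # number of elements seen so far
        self.prev = None      # (index, value): sample of the last complete bucket
        self.cur = None       # (index, value): reservoir sample of the current bucket
        self.cur_size = 0     # elements seen in the current bucket

    def add(self, value):
        if self.cur_size == self.n:
            # The current bucket is complete: it becomes the older bucket.
            self.prev, self.cur, self.cur_size = self.cur, None, 0
        self.count += 1
        self.cur_size += 1
        # Size-1 reservoir: keep the new element w.p. 1/cur_size.
        if self.rng.randrange(self.cur_size) == 0:
            self.cur = (self.count, value)

    def sample(self):
        if self.count == 0:
            raise ValueError("no elements seen yet")
        # The window covers indices count-n+1 .. count.
        if self.prev is not None and self.prev[0] > self.count - self.n:
            return self.prev[1]   # older bucket's sample is still active
        return self.cur[1]        # otherwise fall back to the newest bucket
```

Only two (index, value) pairs are stored per sample, so k independent copies give k samples with replacement in O(k) memory.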
Sampling Without Replacement for Sequence-Based Windows
The Algorithm. Divide D into buckets of size n. Maintain k random samples for each bucket. Combine the samples of the buckets that have active elements. [Figure: buckets B_1, B_2, …, B_{N/n}, B_{N/n+1} along the time axis, each with k = 2 samples R_{i,1}, R_{i,2}.]
[Figure: buckets B_{N/n} and B_{N/n+1} with samples R_1 = {R_{1,1}, R_{1,2}} and R_2 = {R_{2,1}, R_{2,2}}; in the example the output X is formed from R_{1,1} and R_{2,2}.]
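A sketch of the k-sample variant, under the assumption that the combination rule keeps the still-active members of the older bucket's k-sample and fills the remaining slots with a uniformly chosen subset of the newest bucket's reservoir. The class name `WindowKSampler` is hypothetical. The number of expired members of the older sample follows the same hypergeometric law as the number of new elements in a uniform k-subset of the window, which is why the fill-in step preserves uniformity.

```python
import random


class WindowKSampler:
    """Uniform k-sample without replacement from the last n elements (sketch)."""

    def __init__(self, n, k, rng=None):
        if not 1 <= k <= n:
            raise ValueError("need 1 <= k <= n")
        self.n, self.k = n, k
        self.rng = rng or random.Random()
        self.count = 0      # number of elements seen so far
        self.prev = []      # k-sample of the last complete bucket, as (index, value)
        self.cur = []       # reservoir k-sample of the current bucket
        self.cur_size = 0   # elements seen in the current bucket

    def add(self, value):
        if self.cur_size == self.n:
            # The current bucket is complete: it becomes the older bucket.
            self.prev, self.cur, self.cur_size = self.cur, [], 0
        self.count += 1
        self.cur_size += 1
        entry = (self.count, value)
        if self.cur_size <= self.k:
            self.cur.append(entry)
        elif self.rng.random() < self.k / self.cur_size:
            self.cur[self.rng.randrange(self.k)] = entry

    def sample(self):
        oldest_active = self.count - self.n + 1
        # Keep the members of the older sample that have not expired.
        active = [e for e in self.prev if e[0] >= oldest_active]
        # Fill the remaining slots from the newest bucket's reservoir.
        need = min(self.k, self.count) - len(active)
        fill = self.rng.sample(self.cur, need)
        return [value for _, value in active + fill]
```

Before k elements have arrived, the sketch returns every element seen so far.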
Sampling With Replacement for Timestamp-Based Windows
Timestamp-based window. n is unknown and can change arbitrarily. Does our concept work? How to divide the stream into buckets? How to combine samples?
The main idea, revised. What if we can maintain buckets A and B as before, with a sample from each? Let a = |A|, b = |B|, c = |A ∩ W|. If the sample from A has expired, X = sample from B. If the sample from A is active, X = sample from A with probability a/n; otherwise X = sample from B. [Figure: buckets A and B over the elements p_{N-16}, …, p_{N+3}, with n = 13, a = |A| = 5, b = |B| = 10, c = |A ∩ W| = 3.]
Correctness. [Figure: buckets A and B with n = 13, a = |A| = 5, b = |B| = 10, c = |A ∩ W| = 3.]
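The correctness of the combination rule can be checked with exact arithmetic. The sketch below assumes the sample from A is uniform on A, the sample from B is uniform on B, the two are independent of each other and of the a/n coin, and n = b + c. The function name `combined_distribution` is hypothetical.

```python
from fractions import Fraction


def combined_distribution(a, b, c):
    """Exact distribution of the output X over the window (illustrative check).

    a = |A|, b = |B|, c = |A ∩ W|; the window size is n = b + c.
    """
    n = b + c
    if not 0 <= c <= a <= n:
        raise ValueError("need 0 <= c <= a <= n")
    dist = {}
    # An active element of A is A's sample w.p. 1/a and is then kept w.p. a/n.
    for i in range(c):
        dist[("A", i)] = Fraction(1, a) * Fraction(a, n)
    # X falls back to B if A's sample expired, or is active but rejected.
    p_expired = Fraction(a - c, a)
    p_rejected = Fraction(c, a) * (1 - Fraction(a, n))
    for j in range(b):
        dist[("B", j)] = (p_expired + p_rejected) * Fraction(1, b)
    return dist
```

For the example on the slide, `combined_distribution(5, 10, 3)` assigns probability 1/13 to each of the 13 active elements.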
Conclusions. The combination rule works if: 1. a ≤ n. 2. It is possible to generate events w.p. a/n. [Figure: buckets A and B with n = 13, a = |A| = 5, b = |B| = 10, c = |A ∩ W| = 3.]
The First Problem. How to maintain A, B at any moment? |A| is less than n.
The solution: ζ-decomposition. A list of buckets B_1, …, B_s that contain all active elements. Two samples from each bucket. B_1 may contain expired elements as well. Ensure that |A| ≤ |B| and s = O(log n). [Figure: the buckets B_1, B_2, B_3, B_4, …, B_{s-1}, B_s and the definition of A and B.]
ζ-decomposition : implementation Similar idea to smooth histograms Slightly different structure
The Second Problem. Assuming a ≤ b ≤ n, how to generate events w.p. a/n? a and b are known, c is unknown, and n = b + c. [Figure: buckets A and B with n = 13, a = |A| = 5, b = |B| = 10, c = |A ∩ W| = 3.]
Approach. Generate a “biased” sample Y on A such that Y expires w.p. b/n. Use Y to obtain probability a/n. The details are in the paper.
Lemma 1. Given a random sample from A, it is possible to construct a random variable Y on A such that Y is expired w.p. b/n. [Figure: buckets A and B with n = 13, a = |A| = 5, b = |B| = 10, c = |A ∩ W| = 3.]
Generate a random vector V on D = A × {0,1}^a, V = (Q, H_1, …, H_a), of independent random variables: Q ~ U(A), and H_i = 1 w.p. ab/((b+i)(b+i+1)). Define a set of subspaces of D: A_i = {p_{N-b-i}} × {0,1}^{i-1} × {1} × {0,1}^{a-i}.
Lemma 2. Given Y from Lemma 1, it is possible to construct a 0-1 random variable Z such that P(Z=1) = a/n. Proof sketch: generate an event T that happens w.p. a/b. This is possible since a ≤ b and a, b are known.
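One way to complete the sketch, under the assumption (not spelled out on the slide) that Z = 1 exactly when T occurs and Y is expired, with T generated independently of Y, and using the property that Y expires w.p. b/n:

```latex
P(Z = 1) = P(T)\cdot P(Y \text{ is expired})
         = \frac{a}{b}\cdot\frac{b}{n}
         = \frac{a}{n}.
```

Neither factor requires knowing c: T depends only on the known values a and b, and the expiry of Y is observed directly.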
Sampling Without Replacement for Timestamp-Based Windows
Main idea. Implement a k-sample without replacement using k independent samples. What can we do if the same point is sampled more than once? Approach: sample from different domains.
Cascading lemma. H_i^j denotes a j-sample (without replacement) from {1, …, i}. Given H_i^j and H_{i+1}^1, we can construct H_{i+1}^{j+1}.
Cascading Lemma (Illustration). [Figure: the independent single samples H_{n-k+1}^1, H_{n-k+2}^1, …, H_{n-1}^1, H_n^1 are cascaded into H_{n-k+2}^2, H_{n-k+3}^3, …, H_{n-1}^{k-1}, H_n^k.]
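The cascade can be sketched with a well-known construction (the step used in Floyd's sampling algorithm); whether the paper uses exactly this step is an assumption. Given the current j-sample from {1, …, i} and a single uniform sample x from {1, …, i+1}, add i+1 if x is already present, otherwise add x. The names `cascade` and `subset_counts` are hypothetical; `subset_counts` enumerates every combination of single samples to confirm that all k-subsets arise equally often.

```python
from collections import Counter
from itertools import product


def cascade(single_samples, n):
    """Build a k-sample without replacement from {1..n} out of k single samples.

    single_samples[t] must lie in {1, ..., n-k+1+t}; if each is uniform on its
    domain and they are independent, the result is a uniform k-subset.
    """
    k = len(single_samples)
    chosen = set()
    for t, x in enumerate(single_samples):
        top = n - k + 1 + t          # current domain is {1, ..., top}
        if not 1 <= x <= top:
            raise ValueError("sample outside its domain")
        # Collision: take the newest domain element instead.
        chosen.add(top if x in chosen else x)
    return chosen


def subset_counts(n, k):
    """Count how often each k-subset arises over all inputs (exhaustive check)."""
    domains = [range(1, n - k + 2 + t) for t in range(k)]
    return Counter(frozenset(cascade(list(xs), n)) for xs in product(*domains))
```

For n = 5 and k = 3 there are 3·4·5 = 60 input combinations and 10 possible subsets, so each subset should appear 6 times.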
Conclusions. Random sampling on sliding windows: optimally solved. Gives worst-case solutions for many problems.
Thank you!