3/13/2012 Data Streams: Lecture 16
CS 410/510 Data Streams
Lecture 16: Data-Stream Sampling: Basic Techniques and Results
Kristin Tufte, David Maier
Data Stream Sampling
- Sampling provides a synopsis of a data stream
- A sample can serve as input for:
  - Answering queries
  - “Statistical inference about the contents of the stream”
  - A “variety of analytical procedures”
- Focus on: obtaining a sample from the window (sample size ≪ window size)
Windows
- Stationary window
  - Endpoints of the window are fixed (think relation)
- Sliding window
  - Endpoints of the window move
  - What we’ve been talking about
  - More complex than a stationary window because elements must be removed from the sample when they expire from the window
Simple Random Sampling (SRS)
- What is a “representative” sample?
- SRS for a sample of k elements from a window with n elements
  - Every possible sample (of size k) is equally likely, that is, has probability $1/\binom{n}{k}$
  - Every element is equally likely to be in the sample
- Stratified sampling
  - Divide the window into disjoint segments (strata)
  - SRS over each stratum
  - Advantageous when stream elements close together in the stream have similar values
Bernoulli Sampling
- Includes each element in the sample with probability q
- The sample size is not fixed; it is binomially distributed
- Probability that the sample contains k elements is $\binom{n}{k} q^k (1-q)^{n-k}$
- Expected sample size is nq
Binomial Distribution - Example
- Expected sample size = 20 × 0.5 = 10
- [Figure: binomial distribution (n = 20, q = 0.5); x-axis: sample size, y-axis: probability]
Binomial Distribution - Example
- Expected sample size = 20 × 1/3 ≈ 6.67
- [Figure: binomial distribution (n = 20, q = 1/3); x-axis: sample size, y-axis: probability]
Bernoulli Sampling - Implementation
- Naïve: each element is inserted with probability q (ignored with probability 1-q)
- Use a sequence of pseudorandom numbers $(U_1, U_2, U_3, \ldots)$, $U_i \in [0,1]$
- Element $e_i$ is included if $U_i \le q$
- Example, q = 0.2:
  - $U_1 = 0.5$, $U_2 = 0.1$, $U_3 = 0.9$, $U_4 = 0.8$, $U_5 = 0.2$, $U_6 = 0.3$, $U_7 = 0.0$
  - Sample: $e_2$, $e_5$, $e_7$
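A minimal Python sketch of the naïve scheme, using the inclusion rule $U_i \le q$ from the slide. The function names are illustrative assumptions; the second function takes the pseudorandom draws explicitly so the q = 0.2 example can be replayed.

```python
import random

def bernoulli_sample(stream, q, rng=None):
    """Keep each element independently with probability q."""
    rng = rng or random.Random()
    sample = []
    for element in stream:
        # Draw U_i and include e_i when U_i <= q.
        if rng.random() <= q:
            sample.append(element)
    return sample

def bernoulli_sample_from_draws(stream, draws, q):
    """Same rule, with the pseudorandom numbers U_i supplied by the caller."""
    return [e for e, u in zip(stream, draws) if u <= q]

stream = ["e1", "e2", "e3", "e4", "e5", "e6", "e7"]
draws = [0.5, 0.1, 0.9, 0.8, 0.2, 0.3, 0.0]
print(bernoulli_sample_from_draws(stream, draws, q=0.2))  # ['e2', 'e5', 'e7']
```

This draws one random number per stream element, which is the cost the skip-based variant tries to avoid.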
Bernoulli Sampling – Efficient Implementation
- Calculate the number of elements to be skipped after an insertion ($\Delta_i$)
- $\Pr\{\Delta_i = j\} = q(1-q)^j$
  - Skip zero elements: must get $U_i \le q$ (probability q)
  - Skip one element: must get $U_i > q$, $U_{i+1} \le q$ (probability $(1-q)q$)
  - Skip two elements: must get $U_i > q$, $U_{i+1} > q$, $U_{i+2} \le q$ (probability $(1-q)^2 q$)
- $\Delta_i$ has a geometric distribution
Geometric Distribution - Example
- [Figure: geometric distribution, q = 0.2; x-axis: number of skips ($\Delta_i$), y-axis: probability]
Bernoulli Sampling - Algorithm
- [Algorithm figure not preserved in this transcript]
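One possible skip-based implementation, sketched in Python. It is not the listing from the original slide. It assumes the skip count is generated by inverting the geometric distribution, $\Delta = \lfloor \ln U / \ln(1-q) \rfloor$, which gives $\Pr\{\Delta = j\} = q(1-q)^j$; the function names are illustrative.

```python
import math
import random

def geometric_skip(q, rng):
    """Draw a skip count with Pr{skip = j} = q * (1 - q) ** j."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - q))

def bernoulli_sample_with_skips(stream, q, rng=None):
    """Bernoulli sampling using roughly one random draw per sampled element."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    rng = rng or random.Random()
    sample = []
    skip = geometric_skip(q, rng)  # elements to pass over before the first insertion
    for element in stream:
        if skip == 0:
            sample.append(element)
            skip = geometric_skip(q, rng)  # skips after this insertion
        else:
            skip -= 1
    return sample
```

In a real stream processor the skipped elements would not be examined at all; the loop here only walks over them to keep the sketch self-contained.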
Bernoulli Sampling
- Straightforward, SRS, easy to implement
- But… the sample size is not fixed!
- Look at algorithms with deterministic sample size
  - Reservoir Sampling
  - Stratified Sampling
  - Biased Sampling Schemes
Reservoir Sampling
- Produces an SRS of size k from a window of length n (k is specified)
- Initialize a “reservoir” using the first k elements
- For every following element, insert with probability $p_i$ (ignore with probability $1-p_i$)
  - $p_i = k/i$ for $i > k$ ($p_i = 1$ for $i \le k$)
  - $p_i$ changes as i increases
- Remove one element, chosen uniformly at random, from the reservoir before each insertion
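Reservoir Sampling
- Sample size 3 (k = 3), stream $e_1, \ldots, e_8$
- Recall: $p_i = 1$ for $i \le k$, $p_i = k/i$ for $i > k$
- $p_1 = 1$, $p_2 = 1$, $p_3 = 1$, $p_4 = 3/4$, $p_5 = 3/5$, $p_6 = 3/6$, $p_7 = 3/7$, $p_8 = 3/8$
- $U_4 = 0.5$, $U_5 = 0.1$, $U_6 = 0.9$, $U_7 = 0.8$, $U_8 = 0.2$
- Reservoir starts as $e_1, e_2, e_3$; $e_4$, $e_5$, and $e_8$ are inserted ($U_i \le p_i$), $e_6$ and $e_7$ are ignored
- Final reservoir sample: $e_4, e_5, e_8$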
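A minimal Python sketch of reservoir sampling with $p_i = k/i$. It assumes the evicted element is chosen uniformly at random from the reservoir; the function name is illustrative.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return a simple random sample of size k from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, element in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(element)      # p_i = 1 for i <= k
        elif rng.random() < k / i:         # insert with probability p_i = k/i
            victim = rng.randrange(k)      # evict a uniformly chosen slot
            reservoir[victim] = element
    return reservoir

print(reservoir_sample(["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8"], k=3))
```

Memory stays at k elements no matter how long the stream is, and the sample size is fixed once k elements have arrived.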
Reservoir Sampling - SRS
- Why set $p_i = k/i$?
- Want $S_j$ to be an SRS from $U_j = \{e_1, e_2, \ldots, e_j\}$
  - $S_j$ is the sample from $U_j$
- Recall SRS means every sample of size k is equally likely
- Intuition: the probability that $e_i$ is included in an SRS from $U_i$ is k/i
  - k is the sample size, i is the “window” size
  - $k/i = (\#\text{samples containing } e_i)/(\#\text{samples of size } k) = \binom{i-1}{k-1} / \binom{i}{k}$
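The counting argument can be checked numerically. The short Python sketch below (the function name is illustrative) computes the ratio of binomial coefficients exactly and compares it with k/i.

```python
from fractions import Fraction
from math import comb

def srs_inclusion_probability(i, k):
    """Fraction of size-k samples of {e_1, ..., e_i} that contain e_i."""
    return Fraction(comb(i - 1, k - 1), comb(i, k))

print(srs_inclusion_probability(8, 3))  # 3/8
```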
Reservoir Sampling - Observations
- Insertion probability ($p_i = k/i$ for $i > k$) decreases as i increases
- Also, opportunities for an element in the sample to be removed from the sample decrease as i increases
- These trends offset each other
- Probability of being in the final sample is the same for all elements in the window
Other Sampling Schemes
- Stratified Sampling
  - Divide window into strata, SRS in each stratum
- Deterministic & Semi-Deterministic Schemes
  - e.g., sample every 10th element
- Biased Sampling Schemes
  - Bias the sample towards recently received elements
  - Biased Reservoir Sampling
  - Biased Sampling by Halving
Stratified Sampling
Stratified Sampling
- When elements close to each other in the window have similar values, algorithms such as reservoir sampling can have bad luck
- Alternative: divide the window into strata and do SRS in each stratum
- If you know there is a correlation between data values (e.g., timestamp) and position in the stream, you may wish to use stratified sampling
Deterministic & Semi-Deterministic Schemes
- Produce a sample of size k by inserting every (n/k)th element into the sample
- Simple, but not random
  - Can’t make statistical conclusions about the window from the sample
  - Bad if data is periodic
- Can be good if data exhibits a trend
  - Ensures sampled elements are spread throughout the window
- Example: n = 18, k = 6 over $e_1, \ldots, e_{18}$ (every 3rd element is sampled)
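A small Python sketch of the deterministic scheme for the n = 18, k = 6 example. The slide does not say which offset is used, so taking the last element of each group of n/k is an assumption, as is the function name.

```python
def systematic_sample(window, k):
    """Pick every (n // k)-th element of the window, k elements in total."""
    n = len(window)
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= len(window)")
    step = n // k
    # Assumption: take the last element of each group of `step` elements.
    return window[step - 1::step][:k]

window = [f"e{i}" for i in range(1, 19)]
print(systematic_sample(window, 6))  # ['e3', 'e6', 'e9', 'e12', 'e15', 'e18']
```

If the data had a period of 3, every sampled element would fall at the same phase, which is the periodicity problem noted above.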
Biased Reservoir Sampling
- Recall: in reservoir sampling, the probability of inclusion decreases as we get further into the window ($p_i = k/i$)
- What if $p_i$ were constant? ($p_i = p$)
- Alternative: $p_i$ decreases more slowly than k/i
- Will favor recently arrived elements
  - Recently arrived elements are more likely to be in the sample than long-ago-arrived elements
Biased Reservoir Sampling
- For reservoir sampling, the probability that $e_i$ is included in sample S: $\Pr\{e_i \in S\} = p_i \prod_{j=\max(i,k)+1}^{n} \frac{k - p_j}{k}$
- If $p_i$ is fixed, that is, $p_i = p \in (0,1)$: $\Pr\{e_i \in S\} = p \left(\frac{k-p}{k}\right)^{n - \max(i,k)}$
- Probability that $e_i$ is in the final sample increases geometrically as i increases
Biased Reservoir Sampling
- Probability that $e_i$ is included in the final sample, p = 0.2, k = 10, n = 40: $0.2 \cdot (0.98)^{40 - \max(i, 10)}$
- [Figure: inclusion probability by element; x-axis: element index (i), y-axis: probability]
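The values behind the plot can be recomputed from the fixed-p formula $\Pr\{e_i \in S\} = p\,((k-p)/k)^{n-\max(i,k)}$. The Python sketch below does so for p = 0.2, k = 10, n = 40; the function name is an illustrative assumption.

```python
def biased_inclusion_probability(i, p, k, n):
    """Pr{e_i in final sample} when every element is inserted with fixed probability p."""
    return p * ((k - p) / k) ** (n - max(i, k))

p, k, n = 0.2, 10, 40
for i in (1, 10, 20, 30, 40):
    print(i, round(biased_inclusion_probability(i, p, k, n), 4))
```

The probability is flat for i ≤ k, then grows by a factor of k/(k-p) per position until it reaches p for the most recent element.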
Biased Sampling by Halving
- Break the window into strata ($\Lambda_i$); sample of size 2k
- Step 1: S = unbiased SRS samples of size k from $\Lambda_1$ and $\Lambda_2$ (e.g., use reservoir sampling)
- Step 2: Sub-sample S to produce a sample of size k, then insert an SRS of size k from $\Lambda_3$ into S
- [Figure: strata $\Lambda_1, \Lambda_2, \Lambda_3, \Lambda_4$, each contributing k samples]
Sampling from Sliding Windows
- Harder than sampling from a stationary window
  - Must remove elements from the sample as they expire from the window
  - Difficult to maintain a sample of a fixed size
- Window types:
  - Sequence-based windows: contain the n most recent elements (row-based windows)
  - Timestamp-based windows: contain all elements that arrived within the past t time units (time-based windows)
- Unbiased sampling from within a window
Sequence-based Windows
- $W_j$ is a window of length n, $j \ge 1$
  - $W_j = \{e_j, e_{j+1}, \ldots, e_{j+n-1}\}$
- Want an SRS $S_j$ of k elements from $W_j$
- Tradeoff between the amount of memory required and the degree of dependence between the $S_j$’s
Complete Resampling
- Window size = 5, sample size = 2
- Maintain the full window ($W_j$)
- Each time the window changes, use reservoir sampling to create $S_j$ from $W_j$
- Very expensive: memory and CPU are O(n) (n = window size)
- Example over stream $e_1, \ldots, e_{15}$: $S_1 = \{e_2, e_4\}$ from $W_1$, $S_2 = \{e_3, e_5\}$ from $W_2$
Passive Algorithm
- Window size = 5, sample size = 2
- When an element in the sample expires, insert the newly arrived element into the sample
- $S_j$ is an SRS from $W_j$
- The $S_j$’s are highly correlated
  - If $S_1$ is a bad sample, $S_2$ will be also…
- Memory is O(k), k = sample size
- Example over stream $e_1, \ldots, e_{15}$: $S_1 = \{e_2, e_4\}$ ($W_1$), $S_2 = \{e_2, e_4\}$ ($W_2$), $S_3 = \{e_7, e_4\}$ ($W_3$)
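A Python sketch of the passive algorithm for sequence-based windows. It assumes the initial sample is drawn by reservoir sampling over the first n elements; the generator name and the (position, element) bookkeeping are illustrative choices.

```python
import random

def passive_sampler(stream, n, k, rng=None):
    """Yield a size-k sample after each arrival once the first window is full."""
    rng = rng or random.Random()
    sample = []  # list of (position, element) pairs
    for i, element in enumerate(stream, start=1):
        if i <= n:
            # Reservoir sampling over the first window.
            if i <= k:
                sample.append((i, element))
            elif rng.random() < k / i:
                sample[rng.randrange(k)] = (i, element)
        else:
            expired = i - n  # position that just left the window
            for slot, (pos, _) in enumerate(sample):
                if pos == expired:
                    sample[slot] = (i, element)  # newcomer replaces it
        if i >= n:
            yield [e for _, e in sample]

for s in passive_sampler([f"e{i}" for i in range(1, 11)], n=5, k=2):
    print(s)
```

Because a replacement always sits exactly n positions after the element it replaces, the sampled positions repeat with period n, which is the source of the strong correlation between successive samples.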
Chain Sampling (Babcock, et al.)
- Improved independence properties compared to the passive algorithm
- Expected memory usage: O(k)
- Basic algorithm maintains a sample of size 1
- Get a sample of size k by running k chain-samplers
Chain Sampling - Issue
- Behaves as a reservoir sampler for the first n elements
- Insert additional elements into the sample with probability 1/n
- Example over stream $e_1, \ldots, e_5$ with windows $W_1, W_2, W_3$: $p_1 = 1$, $p_2 = 1/2$, $p_3 = 1/3$, $p_4 = 1/3$; the sample is $e_1$, then $e_2$
- Now, what do we do?
Chain Sampling - Solution
- When $e_i$ is selected for inclusion in the sample, select K from $\{i+1, i+2, \ldots, i+n\}$; $e_K$ will replace $e_i$ if $e_i$ expires while part of sample S
- Know $e_K$ will be in the window when $e_i$ expires
- Example: $p_2 = 1/2$, $p_3 = 1/3$, $p_4 = 1/3$; the sample is $e_1$, then $e_2$
  - When $e_2$ is selected, choose $K \in \{3, 4, 5\}$: K = 5, so $e_5$ is the replacement for $e_2$
  - When $e_5$ arrives, choose $K \in \{6, 7, 8\}$: K = 7, so $e_7$ is the replacement for $e_5$
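A Python sketch of a single chain sampler (sample of size 1) over a sequence-based window of length n. It assumes the selection probability is 1/min(i, n) and that a newly selected element discards the existing chain; the class and attribute names are illustrative, not taken from the paper.

```python
import random
from collections import deque

class ChainSampler:
    """Maintains one sampled element from the n most recent stream elements."""

    def __init__(self, n, rng=None):
        self.n = n
        self.rng = rng or random.Random()
        self.chain = deque()     # (index, element); the head is the current sample
        self.next_index = None   # index K of the pre-selected replacement
        self.i = 0               # number of elements seen so far

    def add(self, element):
        self.i += 1
        i, n = self.i, self.n
        # If the head has expired, its stored successor becomes the sample.
        if self.chain and self.chain[0][0] <= i - n:
            self.chain.popleft()
        if self.rng.random() < 1.0 / min(i, n):
            # e_i is selected: start a new chain and pick K from {i+1, ..., i+n}.
            self.chain.clear()
            self.chain.append((i, element))
            self.next_index = self.rng.randint(i + 1, i + n)
        elif i == self.next_index:
            # e_i is the replacement chosen earlier; pick its own replacement.
            self.chain.append((i, element))
            self.next_index = self.rng.randint(i + 1, i + n)

    def sample(self):
        return self.chain[0][1] if self.chain else None

sampler = ChainSampler(n=3)
for i in range(1, 9):
    sampler.add(f"e{i}")
    print(i, sampler.sample())
```

Only the chain of pre-selected replacements is stored, not the whole window, which is where the memory saving comes from.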
Chain Sampling - Summary
- Expected memory consumption is O(k)
- Chain sampling produces an SRS with replacement for each sliding window
- If we use k chain-samplers to get a sample of size k, we may get duplicates in that sample
- Can oversample (use sample size k + α), then sub-sample to get a sample of size k
Stratified Sampling
- Divide the window into strata and do SRS in each stratum
Stratified Sampling – Sliding Window
- Window size = 12 (n), stratum size = 4 (m), stratum sample size = 2 (k)
- $W_j$ overlaps between 3 and 4 strata ($l$ to $l+1$ strata)
  - $l$ = win_size/stratum_size = n/m (= 3)
- Paper says the sample size is between $k(l-1)$ and $k \cdot l$; think it should be between $k(l-1)$ and $k(l+1)$
- Example over stream $e_1, \ldots, e_{16}$ with windows $W_1, W_2, W_3$: $ss_1 = \{e_1, e_2\}$, $ss_2 = \{e_6, e_7\}$, $ss_3 = \{e_9, e_{11}\}$, $ss_4 = \{e_{14}, e_{16}\}$
Timestamp-Based Windows
- Number of elements in the window changes over time
- Multiple elements in the sample can expire at once
- Chain sampling relies on insertion probability = 1/n (n is window size)
- Stratified sampling: wouldn’t be able to bound the sample size
Priority Sampling (Babcock, et al.)
- A priority sampler maintains an SRS of size 1; use k priority samplers to get an SRS of size k
- Assign a random, uniformly distributed priority in (0,1) to each element
- Current sample is the element in the window with the highest priority
- Keep elements for which there is no other element with both higher priority and higher (later) timestamp
Priority Sampling - Example
- Keep elements for which there is no element with higher priority and higher (later) timestamp
- [Figure: stream $e_1, \ldots, e_{15}$ with windows $W_1, W_2, W_3$ and a priority per element; legend: element in sample, element stored in memory, element in window but not stored]
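A Python sketch of one priority sampler over a timestamp-based window. It assumes timestamps arrive in non-decreasing order and that the window covers the last `window_length` time units; the class name, the deque representation, and the optional `priority` argument (handy for replaying an example) are illustrative assumptions.

```python
import random
from collections import deque

class PrioritySampler:
    """Maintains one sampled element from a timestamp-based sliding window."""

    def __init__(self, window_length, rng=None):
        self.window_length = window_length
        self.rng = rng or random.Random()
        # (timestamp, priority, element); priorities decrease from head to tail.
        self.kept = deque()

    def add(self, timestamp, element, priority=None):
        if priority is None:
            priority = self.rng.random()
        # Earlier elements with lower priority can never become the sample again.
        while self.kept and self.kept[-1][1] < priority:
            self.kept.pop()
        self.kept.append((timestamp, priority, element))
        self._expire(timestamp)

    def _expire(self, now):
        while self.kept and self.kept[0][0] <= now - self.window_length:
            self.kept.popleft()

    def sample(self, now):
        """Element with the highest priority among those still in the window."""
        self._expire(now)
        return self.kept[0][2] if self.kept else None
```

The stored elements are exactly those with no later, higher-priority element, so when the current sample expires the next stored element is already the highest-priority element left in the window.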
Inference From a Sample
- What do we do with these samples?
- SRS samples can be used to estimate “population sums”
- If each element $e_i$ is a sales transaction and $v(e_i)$ is the dollar value of the transaction:
  - $\sum_{e_i \in W} v(e_i)$ = total sales of transactions in W
- Count: $h(e_i) = 1$ if $v(e_i) > 1000$ dollars, 0 otherwise
  - $\sum_{e_i \in W} h(e_i)$ = number of transactions in the window for more than 1000 dollars
- Can also do average
SRS Sampling
- To estimate a population sum from an SRS of size k, use the expansion estimator: $\hat{\Theta} = (n/k) \sum_{e_i \in S} h(e_i)$
- To estimate the average, use the sample average: $\hat{\alpha} = \hat{\Theta}/n = (1/k) \sum_{e_i \in S} h(e_i)$
- Also works for stratified sampling
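The two estimators can be sketched in a few lines of Python. The sample values and the window size n = 1000 below are made-up illustration data, and the function names are assumptions.

```python
def expansion_estimate(sample, n, h):
    """Estimate the sum of h(e) over a window of n elements from an SRS."""
    k = len(sample)
    return (n / k) * sum(h(e) for e in sample)

def average_estimate(sample, h):
    """Estimate the window average of h(e) by the sample average."""
    return sum(h(e) for e in sample) / len(sample)

# Hypothetical sample of k = 4 transaction values from a window of n = 1000.
sample = [1200, 300, 5000, 50]
total_sales = expansion_estimate(sample, 1000, lambda v: v)
big_orders = expansion_estimate(sample, 1000, lambda v: 1 if v > 1000 else 0)
print(total_sales, big_orders, average_estimate(sample, lambda v: v))
# 1637500.0 500.0 1637.5
```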
Estimating Different Results
- SRS sampling is good for estimating population sums, statistics
- But, use different algorithms for different results
  - Heavy Hitters algorithm: find elements (values) that occur commonly in the stream
  - Min-Hash computation: set resemblance
Heavy Hitters
- Goal: find all stream elements that occur in at least a fraction s of all transactions
- For example, find sourceIPs that occur in at least 1% of network flows
  - sourceIPs from which we are getting a lot of traffic
Heavy Hitters
- Divide the window into buckets of width w
- Current bucket id $b_{current} = \lceil N/w \rceil$, where N is the current stream length
- Data structure D: entries (e, f, Δ)
  - e: element
  - f: estimated frequency
  - Δ: maximum possible error in f
- If we are looking for common sourceIPs in a network stream, D: (sourceIP, f, Δ)
Heavy Hitters
- Data structure D: entries (e, f, Δ)
- New element e:
  - Check if e exists in D
  - If so, f = f + 1
  - If not, new entry (e, 1, $b_{current} - 1$)
- At a bucket boundary (when $b_{current}$ changes)
  - Delete all entries (e, f, Δ) with $f + \Delta \le b_{current}$
  - If there is only one instance of e in the bucket, the entry for e is deleted
  - Deleting items that occur once per bucket
- For threshold s, output items with $f \ge (s - \varepsilon)N$ ($w = \lceil 1/\varepsilon \rceil$, N is the stream size)
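A Python sketch of the bucket-based counting scheme described above. It assumes $w = \lceil 1/\varepsilon \rceil$ and $b_{current} = \lceil N/w \rceil$; the class and method names are illustrative.

```python
import math

class LossyCounter:
    """Approximate frequency counts with entries e -> [f, delta]."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width w
        self.n = 0                           # stream length N so far
        self.entries = {}                    # e -> [f, delta]

    def add(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.width)
        if e in self.entries:
            self.entries[e][0] += 1
        else:
            self.entries[e] = [1, b_current - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries with f + delta <= b_current.
            self.entries = {x: fd for x, fd in self.entries.items()
                            if fd[0] + fd[1] > b_current}

    def heavy_hitters(self, s):
        """Elements whose estimated frequency is at least (s - epsilon) * N."""
        threshold = (s - self.epsilon) * self.n
        return {e for e, (f, _) in self.entries.items() if f >= threshold}

counter = LossyCounter(epsilon=0.1)
for i in range(1000):
    # Hypothetical stream: one frequent source IP mixed with one-off addresses.
    counter.add("10.0.0.1" if i % 2 == 0 else f"192.168.0.{i}")
print(counter.heavy_hitters(s=0.3))  # {'10.0.0.1'}
```

Elements seen only once per bucket are pruned at each boundary, so the table stays small while frequent elements keep accumulating counts.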
Min-Hash
- Resemblance, ρ, of two sets A, B: $\rho(A,B) = |A \cap B| / |A \cup B|$
- A min-hash signature is a representation of a set from which one can estimate the resemblance of two sets
- Let $h_1, h_2, \ldots, h_n$ be hash functions
- $s_i(A) = \min\{h_i(a) \mid a \in A\}$ (minimum hash value of $h_i$ over A)
- Signature of A: $S(A) = (s_1(A), s_2(A), \ldots, s_n(A))$
Min-Hash
- Resemblance estimator: $\hat{\rho}(A,B) = \frac{1}{n} \sum_{i=1}^{n} I(s_i(A), s_i(B))$
  - $I(x,y) = 1$ if $x = y$, 0 otherwise
  - Count the number of times the min-hash values are equal, then divide by n
- Recall: $\rho(A,B) = |A \cap B| / |A \cup B|$, with hash functions $h_1, h_2, \ldots, h_n$, $s_i(A) = \min\{h_i(a) \mid a \in A\}$, and $S(A) = (s_1(A), s_2(A), \ldots, s_n(A))$
- Can substitute the N minimum values of one hash function for the minimum values of N hash functions
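A Python sketch of min-hash signatures and the resemblance estimate as the fraction of matching signature positions. Building the hash family $h_i$ from salted BLAKE2b digests is an assumption made for this illustration, as are the function names; any family of well-mixed hash functions could be substituted.

```python
import hashlib

def h(i, x):
    """i-th hash function: a 64-bit value from BLAKE2b salted with i."""
    digest = hashlib.blake2b(repr(x).encode(), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def signature(items, n):
    """S(A) = (s_1(A), ..., s_n(A)) with s_i(A) = min over a in A of h_i(a)."""
    return [min(h(i, a) for a in items) for i in range(n)]

def estimate_resemblance(sig_a, sig_b):
    """Fraction of positions where the two min-hash values agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(0, 100))
b = set(range(50, 150))   # true resemblance is 50 / 150 = 1/3
sa, sb = signature(a, 400), signature(b, 400)
print(estimate_resemblance(sa, sb))
```

Signatures have a fixed length n regardless of set size, so two large sets can be compared without keeping either set in memory.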