CS 410/510 Data Streams
Lecture 16: Data-Stream Sampling: Basic Techniques and Results
Kristin Tufte, David Maier (3/13/2012)

Data Stream Sampling
- Sampling provides a synopsis of a data stream
- The sample can serve as input for answering queries, "statistical inference about the contents of the stream", and a "variety of analytical procedures"
- Focus: obtaining a sample from the window (sample size « window size)

Windows
- Stationary window: endpoints of the window are fixed (think relation)
- Sliding window: endpoints of the window move (what we've been talking about); more complex than a stationary window because elements must be removed from the sample when they expire from the window

Simple Random Sampling (SRS)
- What is a "representative" sample?
- SRS for a sample of k elements from a window with n elements:
  - Every possible sample of size k is equally likely, that is, has probability $1/\binom{n}{k}$
  - Every element is equally likely to be in the sample
- Stratified Sampling:
  - Divide the window into disjoint segments (strata)
  - SRS over each stratum
  - Advantageous when stream elements close together in the stream have similar values

Bernoulli Sampling
- Includes each element in the sample with probability q
- The sample size is not fixed; it is binomially distributed: the probability that the sample contains k elements is $\binom{n}{k} q^k (1-q)^{n-k}$
- Expected sample size is $nq$

Binomial Distribution - Example
[Chart: binomial distribution of the sample size for n = 20, q = 0.5 (probability vs. sample size); expected sample size = 20 × 0.5 = 10]

Binomial Distribution - Example
[Chart: binomial distribution of the sample size for n = 20, q = 1/3 (probability vs. sample size); expected sample size = 20 × 1/3 ≈ 6.7]

Bernoulli Sampling - Implementation
- Naïve approach:
  - Elements are inserted with probability q (ignored with probability 1-q)
  - Use a sequence of pseudorandom numbers $U_1, U_2, U_3, \ldots$ with $U_i \in [0,1]$
  - Element $e_i$ is included if $U_i \le q$
- Example (q = 0.2): with $U_1 = 0.5$, $U_2 = 0.1$, $U_3 = 0.9$, $U_4 = 0.8$, $U_5 = 0.2$, $U_6 = 0.3$, $U_7 = 0.0$, the sample is $\{e_2, e_5, e_7\}$
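A minimal sketch of the naïve scheme in Python (function and variable names are illustrative, not from the slides):

```python
import random

def bernoulli_sample(stream, q):
    """Naive Bernoulli sampling: one coin flip per element."""
    sample = []
    for e in stream:
        if random.random() <= q:   # U_i <= q
            sample.append(e)
    return sample

# Roughly 20% of the stream is retained on average
print(bernoulli_sample(range(1, 101), q=0.2))
```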

Bernoulli Sampling - Efficient Implementation
- Calculate the number of elements to be skipped after an insertion ($\Delta_i$): $\Pr\{\Delta_i = j\} = q(1-q)^j$
  - To skip zero elements, you must get $U_i \le q$ (probability $q$)
  - To skip one element: $U_i > q$, $U_{i+1} \le q$ (probability $(1-q)q$)
  - To skip two elements: $U_i > q$, $U_{i+1} > q$, $U_{i+2} \le q$ (probability $(1-q)^2 q$)
- $\Delta_i$ has a geometric distribution

Geometric Distribution - Example
[Chart: geometric distribution of the number of skips $\Delta_i$ for q = 0.2 (probability vs. number of skips)]

3/13/2012 Data Streams: Lecture Bernoulli Sampling - Algorithm
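The algorithm figure on this slide did not survive transcription; below is a sketch of skip-based Bernoulli sampling as described on the previous slide, using inversion to draw the geometric skip count (illustrative names):

```python
import math
import random

def skip_count(q):
    """Draw a skip count with Pr{skip = j} = q * (1-q)**j (geometric)."""
    u = 1.0 - random.random()   # uniform in (0, 1]
    return int(math.log(u) / math.log(1.0 - q))

def bernoulli_sample_skips(stream, q):
    """Skip-based Bernoulli sampling: one random draw per insertion
    instead of one per element."""
    sample = []
    skip = skip_count(q)        # elements to pass over before the first insert
    for e in stream:
        if skip == 0:
            sample.append(e)
            skip = skip_count(q)
        else:
            skip -= 1
    return sample
```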

Bernoulli Sampling
- Straightforward, produces an SRS, easy to implement
- But... the sample size is not fixed!
- Next, look at algorithms with a deterministic sample size:
  - Reservoir Sampling
  - Stratified Sampling
  - Biased Sampling Schemes

Reservoir Sampling
- Produces an SRS of size k from a window of length n (k is specified)
- Initialize a "reservoir" using the first k elements
- For every following element, insert with probability $p_i$ (ignore with probability $1-p_i$)
- $p_i = k/i$ for $i > k$ ($p_i = 1$ for $i \le k$); $p_i$ changes as i increases
- Remove one element (chosen at random) from the reservoir before each insertion

Reservoir Sampling - Example
- Sample size 3 (k = 3); recall $p_i = 1$ for $i \le k$, $p_i = k/i$ for $i > k$
[Diagram: $e_1, e_2, e_3$ enter the reservoir with $p_1 = p_2 = p_3 = 1$; later arrivals are accepted with $p_4 = 3/4$, $p_5 = 3/5$, $p_6 = 3/6$, $p_7 = 3/7$, $p_8 = 3/8$, each accepted element replacing a random reservoir element]
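A sketch of the reservoir algorithm in Python (illustrative names):

```python
import random

def reservoir_sample(stream, k):
    """Maintain an SRS of size k: keep the first k elements, then accept
    element i with probability k/i, overwriting a random reservoir slot."""
    reservoir = []
    for i, e in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(e)           # p_i = 1 for i <= k
        elif random.random() < k / i:     # p_i = k/i for i > k
            reservoir[random.randrange(k)] = e
    return reservoir

print(reservoir_sample(range(1, 9), k=3))
```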

Reservoir Sampling - SRS
- Why set $p_i = k/i$? We want $S_j$ to be an SRS from $U_j = \{e_1, e_2, \ldots, e_j\}$ ($S_j$ is the sample from $U_j$)
- Recall SRS means every sample of size k is equally likely
- Intuition: the probability that $e_i$ is included in an SRS from $U_i$ is $k/i$ (k is the sample size, i is the "window" size):
  $k/i = \frac{\#\text{samples containing } e_i}{\#\text{samples of size } k} = \binom{i-1}{k-1} \Big/ \binom{i}{k}$

Reservoir Sampling - Observations
- The insertion probability ($p_i = k/i$ for $i > k$) decreases as i increases
- Also, opportunities for an element in the sample to be removed from the sample decrease as i increases
- These trends offset each other: the probability of being in the final sample is the same for all elements in the window

Other Sampling Schemes
- Stratified Sampling: divide the window into strata, SRS in each stratum
- Deterministic & semi-deterministic schemes: e.g., sample every 10th element
- Biased sampling schemes: bias the sample towards recently-received elements
  - Biased Reservoir Sampling
  - Biased Sampling by Halving

3/13/2012 Data Streams: Lecture Stratified Sampling

Stratified Sampling
- When elements close to each other in the window have similar values, algorithms such as reservoir sampling can have bad luck
- Alternative: divide the window into strata and do SRS in each stratum
- If you know there is a correlation between data values (e.g., timestamps) and position in the stream, you may wish to use stratified sampling

Deterministic & Semi-deterministic Schemes
- Produce a sample of size k by inserting every (n/k)-th element into the sample
- Simple, but not random: can't make statistical conclusions about the window from the sample
- Bad if the data is periodic
- Can be good if the data exhibits a trend: ensures sampled elements are spread throughout the window
[Diagram: n = 18, k = 6, so every 3rd element of $e_1, \ldots, e_{18}$ is sampled]
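A one-function sketch of the deterministic scheme (illustrative):

```python
def systematic_sample(window, k):
    """Deterministic sampling: take every (n/k)-th element of the window.
    Simple and evenly spread, but not random."""
    step = len(window) // k
    return window[step - 1::step][:k]

# n = 18, k = 6 -> every 3rd element
print(systematic_sample(list(range(1, 19)), k=6))   # [3, 6, 9, 12, 15, 18]
```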

Biased Reservoir Sampling
- Recall: in reservoir sampling, the probability of inclusion decreases as we get further into the window ($p_i = k/i$)
- What if $p_i$ were constant? ($p_i = p$)
- Alternative: let $p_i$ decrease more slowly than $k/i$
- Either way favors recently-arrived elements: recently-arrived elements are more likely to be in the sample than long-ago-arrived elements

Biased Reservoir Sampling
- For reservoir sampling, the probability that $e_i$ is included in sample S is:
  $\Pr\{e_i \in S\} = p_i \prod_{j=\max(i,k)+1}^{n} \frac{k - p_j}{k}$
- If $p_i$ is fixed, that is, $p_i = p \in (0,1)$:
  $\Pr\{e_i \in S\} = p \left(\frac{k-p}{k}\right)^{n - \max(i,k)}$
- So the probability that $e_i$ is in the final sample increases geometrically as i increases
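A sketch of the fixed-p variant in Python (illustrative; with constant acceptance probability p, each accepted arrival evicts a random slot, so an older element survives each later insertion with probability (k-p)/k, matching the formula above):

```python
import random

def biased_reservoir_sample(stream, k, p):
    """Reservoir sampling with a constant insertion probability p;
    the sample is biased toward recently-arrived elements."""
    reservoir = []
    for e in stream:
        if len(reservoir) < k:
            reservoir.append(e)
        elif random.random() < p:
            reservoir[random.randrange(k)] = e
    return reservoir
```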

Biased Reservoir Sampling
[Chart: probability that $e_i$ is included in the final sample vs. element index i, for p = 0.2, k = 10, n = 40; the curve follows $p\,((k-p)/k)^{n-\max(i,10)}$]

Biased Sampling by Halving
- Break the window into strata ($\Lambda_1, \Lambda_2, \Lambda_3, \Lambda_4, \ldots$); maintain a sample of size 2k
- Step 1: S = unbiased SRS samples of size k from $\Lambda_1$ and $\Lambda_2$ (e.g., use reservoir sampling)
- Step 2: sub-sample S to produce a sample of size k, then insert an SRS of size k from $\Lambda_3$ into S
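A rough sketch of one halving step (illustrative, assuming each stratum can be sampled as it closes):

```python
import random

def halving_step(S, next_stratum, k):
    """One step of biased sampling by halving: sub-sample the current
    size-2k sample down to k, then add an SRS of size k from the
    newly-completed stratum."""
    kept = random.sample(S, k)                 # halve the old sample
    fresh = random.sample(next_stratum, k)     # SRS from the new stratum
    return kept + fresh                        # size 2k again, biased to recent data
```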

Sampling from Sliding Windows
- Harder than sampling from a stationary window:
  - Must remove elements from the sample as the elements expire from the window
  - Difficult to maintain a sample of a fixed size
- Window types:
  - Sequence-based windows: contain the n most recent elements (row-based windows)
  - Timestamp-based windows: contain all elements that arrived within the past t time units (time-based windows)
- Goal: unbiased sampling from within a window

Sequence-based Windows
- $W_j$ is a window of length n, $j \ge 1$: $W_j = \{e_j, e_{j+1}, \ldots, e_{j+n-1}\}$
- Want an SRS $S_j$ of k elements from $W_j$
- Tradeoff between the amount of memory required and the degree of dependence between the $S_j$'s

Complete Resampling
- Window size = 5, sample size = 2
- Maintain the full window ($W_j$); each time the window changes, use reservoir sampling to create $S_j$ from $W_j$
- Very expensive: memory and CPU are O(n) (n = window size)
[Diagram: $W_1 = \{e_1, \ldots, e_5\}$, $W_2 = \{e_2, \ldots, e_6\}$, with $S_1 = \{e_2, e_4\}$ and $S_2 = \{e_3, e_5\}$]

Passive Algorithm
- Window size = 5, sample size = 2
- When an element in the sample expires, insert the newly-arrived element into the sample
- $S_j$ is an SRS from $W_j$
- The $S_j$'s are highly correlated: if $S_1$ is a bad sample, $S_2$ will be also...
- Memory is O(k), k = sample size
[Diagram: $S_1 = \{e_2, e_4\}$, $S_2 = \{e_2, e_4\}$, then $S_3 = \{e_7, e_4\}$ once $e_2$ expires and $e_7$ arrives]
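A sketch of the passive scheme in Python (illustrative; the first window is buffered only to seed the initial SRS):

```python
import random

def passive_sample(stream, n, k):
    """Passive sliding-window sampling over a sequence-based window of
    size n: seed with an SRS of the first window, then replace a sampled
    element with the new arrival exactly when that element expires."""
    sampled = {}    # stream position -> element, sampled elements only
    first = []
    for i, e in enumerate(stream):
        if i < n:
            first.append(e)
            if i == n - 1:                        # first window complete
                for pos in random.sample(range(n), k):
                    sampled[pos] = first[pos]     # initial SRS
        elif (i - n) in sampled:                  # position i-n just expired
            del sampled[i - n]
            sampled[i] = e                        # new arrival replaces it
    return list(sampled.values())
```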

Chain Sampling (Babcock, et al.)
- Improved independence properties compared to the passive algorithm
- Expected memory usage: O(k)
- The basic algorithm maintains a sample of size 1; get a sample of size k by running k chain-samplers

Chain Sampling - Issue
- Behaves as a reservoir sampler for the first n elements
- Inserts additional elements into the sample with probability 1/n
[Diagram: a size-1 sampler with $p_1 = 1$, $p_2 = 1/2$, $p_3 = 1/3$, $p_4 = 1/3$; when the sampled element reaches the end of the window and expires... now, what do we do?]

Chain Sampling - Solution
- When $e_i$ is selected for inclusion in the sample, select K from $\{i+1, i+2, \ldots, i+n\}$; $e_K$ will replace $e_i$ if $e_i$ expires while part of sample S
- We know $e_K$ will be in the window when $e_i$ expires
[Diagram: $e_2$ is sampled and K is chosen from {3, 4, 5}, say K = 5; when $e_2$ expires, $e_5$ replaces it and a new K is chosen from {6, 7, 8}, say K = 7]
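A sketch of a single chain-sampler (sample size 1) over a sequence-based window of size n; run k independent copies for a sample of size k (illustrative, assuming 1-based stream positions):

```python
import random

def chain_sampler(stream, n):
    """One chain-sampler: reservoir-like for the first n elements, then
    insertion probability 1/n. Each sampled element has a designated
    replacement position; captured replacements form a chain so expiry
    is handled without rescanning the window. Yields the sample after
    each arrival."""
    chain = []        # [(position, element), ...]; chain[0] is the sample
    next_pos = None   # future position whose element must be captured
    for i, e in enumerate(stream, start=1):
        if i == next_pos:                        # capture a designated replacement
            chain.append((i, e))
            next_pos = random.randint(i + 1, i + n)
        if random.random() < 1.0 / min(i, n):    # e_i becomes the new sample
            chain = [(i, e)]
            next_pos = random.randint(i + 1, i + n)
        if chain[0][0] <= i - n:                 # current sample expired:
            chain.pop(0)                         # its replacement takes over
        yield chain[0][1]
```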

Chain Sampling - Summary
- Expected memory consumption: O(k)
- Chain sampling produces an SRS with replacement for each sliding window
  - If we use k chain-samplers to get a sample of size k, we may get duplicates in that sample
- Can over-sample (use sample size k + α), then sub-sample to get a sample of size k

Stratified Sampling
- Divide the window into strata and do SRS in each stratum

Stratified Sampling - Sliding Window
- Window size = 12 (n), stratum size = 4 (m), stratum sample size = 2 (k)
- $W_j$ overlaps between 3 and 4 strata (l and l+1 strata), where l = win_size/stratum_size = n/m (= 3 here)
- The paper says the sample size is between k(l-1) and k∙l; [speaker's note] it should be between k(l-1) and k(l+1)
[Diagram: per-stratum samples such as $ss_1 = \{e_1, e_2\}$, $ss_2 = \{e_6, e_7\}$, $ss_3 = \{e_9, e_{11}\}$, and later $\{e_{14}, e_{16}\}$ as windows $W_1, W_2, W_3$ slide across the strata]

Timestamp-Based Windows
- The number of elements in the window changes over time, so:
  - Multiple elements in the sample may expire at once
  - Chain sampling relies on insertion probability = 1/n (n is the window size)
  - Stratified sampling wouldn't be able to bound the sample size

Priority Sampling (Babcock, et al.)
- A priority sampler maintains an SRS of size 1; use k priority samplers to get an SRS of size k
- Assign a random, uniformly-distributed priority in (0,1) to each element
- The current sample is the element in the window with the highest priority
- Keep elements for which there is no other element with both higher priority and a higher (later) timestamp

Priority Sampling - Example
- Keep elements for which there is no element with both higher priority and a higher (later) timestamp
[Diagram: elements $e_1, \ldots, e_{15}$ with random priorities, over windows $W_1$, $W_2$, $W_3$; legend marks the element in the sample, elements stored in memory, and elements in the window but not stored]
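A sketch of one priority sampler for a timestamp-based window (illustrative; the monotonic deque keeps exactly the elements with no later, higher-priority element):

```python
import random
from collections import deque

class PrioritySampler:
    """One priority sampler (SRS of size 1) over a timestamp-based
    window of length t_window."""

    def __init__(self, t_window):
        self.t_window = t_window
        self.stored = deque()   # (timestamp, priority, element), priorities decreasing

    def insert(self, timestamp, element):
        p = random.random()     # uniform (0,1) priority
        # Drop stored elements dominated by the new one (earlier AND lower priority)
        while self.stored and self.stored[-1][1] <= p:
            self.stored.pop()
        self.stored.append((timestamp, p, element))

    def sample(self, now):
        # Expire elements that have fallen out of the window
        while self.stored and self.stored[0][0] <= now - self.t_window:
            self.stored.popleft()
        return self.stored[0][2] if self.stored else None
```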

Inference From a Sample
- What do we do with these samples?
- SRS samples can be used to estimate "population sums"
- If each element $e_i$ is a sales transaction and $v(e_i)$ is the dollar value of the transaction:
  - Sum: $\sum_{e_i \in W} v(e_i)$ = total sales of the transactions in W
  - Count: let $h(e_i) = 1$ if $v(e_i) > \$1000$ (0 otherwise); then $\sum_{e_i \in W} h(e_i)$ = number of transactions in the window for more than $1000
- Can also estimate averages

SRS Sampling
- To estimate a population sum from an SRS of size k, use the expansion estimator:
  $\hat{\Theta} = \frac{n}{k} \sum_{e_i \in S} h(e_i)$
- To estimate an average, use the sample average:
  $\hat{\alpha} = \hat{\Theta}/n = \frac{1}{k} \sum_{e_i \in S} h(e_i)$
- Also works for stratified sampling
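The two estimators in Python (a minimal sketch; h is any per-element function, such as the threshold indicator above, and the sample values are made up for illustration):

```python
def expansion_estimate(sample, n, h):
    """Estimate a population sum over a window of n elements from an
    SRS of size k: (n/k) * sum of h over the sample."""
    return (n / len(sample)) * sum(h(e) for e in sample)

def average_estimate(sample, h):
    """Estimate a population average by the sample average of h."""
    return sum(h(e) for e in sample) / len(sample)

# Estimated count of >$1000 transactions in a 10,000-element window,
# from a sample of 100 transaction values
sample = [1200, 40, 999, 3500, 80] * 20
print(expansion_estimate(sample, 10_000, lambda v: 1 if v > 1000 else 0))
```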

Estimating Different Results
- SRS sampling is good for estimating population sums and statistics
- But use different algorithms for different results:
  - Heavy Hitters algorithm: find elements (values) that occur commonly in the stream
  - Min-hash computation: set resemblance

Heavy Hitters
- Goal: find all stream elements that occur in at least a fraction s of all transactions
- For example, find sourceIPs that occur in at least 1% of network flows (sourceIPs from which we are getting a lot of traffic)

Heavy Hitters
- Divide the stream into buckets of width w; the current bucket id is $b_{current} = \lceil N/w \rceil$, where N is the current stream length
- Data structure D holds entries (e, f, Δ):
  - e: element
  - f: estimated frequency
  - Δ: maximum possible error in f
- If we are looking for common sourceIPs in a network stream, D holds (sourceIP, f, Δ) entries

Heavy Hitters
- For each new element e:
  - Check whether e exists in D
  - If so, f = f + 1
  - If not, add a new entry (e, 1, $b_{current} - 1$)
- At each bucket boundary (when $b_{current}$ changes):
  - Delete all entries (e, f, Δ) with f + Δ ≤ $b_{current}$
  - So an element with only one occurrence in the bucket has its entry deleted: we delete items that occur at most once per bucket
- For threshold s, output items with f ≥ (s - ε)N, where w = ⌈1/ε⌉ and N is the stream size
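A sketch of this scheme in Python (it follows the Lossy Counting algorithm of Manku and Motwani; names are illustrative):

```python
import math

def lossy_counting(stream, s, eps):
    """Approximate heavy hitters: report items occurring in at least a
    fraction s of the stream, with frequency undercounted by at most eps*N."""
    w = math.ceil(1 / eps)          # bucket width
    D = {}                          # element -> (f, delta)
    n = 0
    for e in stream:
        n += 1
        b_current = math.ceil(n / w)
        if e in D:
            f, delta = D[e]
            D[e] = (f + 1, delta)
        else:
            D[e] = (1, b_current - 1)
        if n % w == 0:              # bucket boundary: prune infrequent entries
            for key in [x for x, (f, d) in D.items() if f + d <= b_current]:
                del D[key]
    return [e for e, (f, d) in D.items() if f >= (s - eps) * n]
```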

Min-Hash
- Resemblance of two sets A and B: $\rho(A,B) = |A \cap B| \,/\, |A \cup B|$
- A min-hash signature is a representation of a set from which one can estimate the resemblance of two sets
- Let $h_1, h_2, \ldots, h_n$ be hash functions
- $s_i(A) = \min(h_i(a) \mid a \in A)$ (the minimum hash value of $h_i$ over A)
- Signature of A: $S(A) = (s_1(A), s_2(A), \ldots, s_n(A))$

Min-Hash
- Resemblance estimator (counts how often the min hash values agree):
  $\hat{\rho}(A,B) = \frac{1}{n} \sum_{i=1}^{n} I(s_i(A), s_i(B))$, where I(x,y) = 1 if x = y, 0 otherwise
- Recall: $\rho(A,B) = |A \cap B| / |A \cup B|$; $h_1, \ldots, h_n$ are hash functions; $s_i(A) = \min(h_i(a) \mid a \in A)$; $S(A) = (s_1(A), \ldots, s_n(A))$
- Can substitute the N minimum values of one hash function for the minimum values of N hash functions
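A small sketch of min-hash signatures and the resemblance estimator (the salted hash functions are illustrative, not production-grade):

```python
import random

def make_hash_functions(n, seed=42):
    """n salted hash functions h_i(x) = hash((salt_i, x))."""
    rng = random.Random(seed)
    return [lambda x, s=rng.getrandbits(64): hash((s, x)) for _ in range(n)]

def signature(A, hash_fns):
    """Min-hash signature: the minimum of each hash function over A."""
    return [min(h(a) for a in A) for h in hash_fns]

def estimate_resemblance(sig_a, sig_b):
    """Fraction of signature positions where the minima agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Two sets with true resemblance |{c,d}| / |{a,...,f}| = 2/6
hs = make_hash_functions(200)
A, B = {"a", "b", "c", "d"}, {"c", "d", "e", "f"}
print(estimate_resemblance(signature(A, hs), signature(B, hs)))
```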