Rainer Gemulla, Wolfgang Lehner and Peter J. Haas, VLDB 2006. A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets. 2008/8/27

1 Rainer Gemulla, Wolfgang Lehner and Peter J. Haas, VLDB 2006. A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets. 2008/8/27

2 Outline: 1. Introduction 2. Bernoulli Sampling 3. Reservoir Sampling 4. Random Pairing 5. Conclusion

3 Introduction. Random sampling is used in approximate query answering, data mining, data stream processing, query optimization, and data integration. For example, it is often infeasible to process or store an entire data stream, so random sampling is an appealing approach to building synopses of large data streams.

4 Uniform Sampling. Under uniform sampling, all samples of the same size are equally likely. Many statistical procedures assume uniformity, and uniformity also provides flexibility. Example: a data set (also called the population) and its possible samples of size 2 (figure not preserved in this transcript).
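The lost example can be approximated with a small enumeration. This is only an illustrative sketch: the three-item population below is hypothetical and is not the data set shown on the original slide.

```python
from itertools import combinations

# Hypothetical population; the slide's actual data set is not preserved.
population = ["a", "b", "c"]

# All possible samples of size 2 (order does not matter).
samples = list(combinations(population, 2))

# Under uniform sampling, each of these samples is equally likely.
prob = 1 / len(samples)

for s in samples:
    print(s, prob)
```

With three items there are three samples of size 2, so a uniform scheme must produce each of them with probability 1/3.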

5 Bernoulli Sampling. Each inserted item is included in the sample with probability q and excluded with probability 1-q. For a dataset R, the size of the sample S follows the binomial distribution BINOM(|R|, q), so that $P(|S| = k) = \binom{|R|}{k} q^k (1-q)^{|R|-k}$ for k = 0, ..., |R|. The main disadvantage is the uncontrollable variability of the sample size.
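A minimal sketch of Bernoulli sampling over a sequence of insertions follows. The function name `bernoulli_sample` and the optional `rng` parameter are assumptions made for illustration; they do not come from the slides.

```python
import random

def bernoulli_sample(items, q, rng=None):
    """Include each item independently with probability q (hypothetical helper)."""
    rng = rng or random.Random()
    # rng.random() is in [0, 1), so q = 0 keeps nothing and q = 1 keeps everything.
    return [x for x in items if rng.random() < q]

# The sample size varies from run to run, which is the drawback noted above.
sizes = [len(bernoulli_sample(range(1000), 0.1)) for _ in range(5)]
print(sizes)
```

Running the last two lines repeatedly shows sample sizes scattered around q|R| = 100 rather than a fixed size.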

6 Reservoir Sampling. Reservoir sampling maintains a random sample of fixed size M and is a building block for many sophisticated sampling schemes. It is a single-scan algorithm: add the first M elements to the sample; afterwards, for each arriving element, flip a coin and either a) ignore the element (reject) or b) replace a random element in the sample with it (accept). The accept probability of the i-th element (i > M) is M/i.
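The single-scan procedure can be sketched as below. This assumes the standard reservoir-sampling accept probability M/i for the i-th element; the function name `reservoir_sample` and the `rng` parameter are illustrative choices, not part of the slides.

```python
import random

def reservoir_sample(stream, M, rng=None):
    """Maintain a uniform sample of size M over a single scan (sketch)."""
    rng = rng or random.Random()
    sample = []
    for i, x in enumerate(stream, start=1):
        if i <= M:
            # The first M elements always enter the sample.
            sample.append(x)
        elif rng.random() < M / i:
            # Accept: overwrite a uniformly chosen element of the sample.
            sample[rng.randrange(M)] = x
        # Otherwise reject: the element is ignored.
    return sample

print(reservoir_sample(range(100), 2))
```

If the stream holds fewer than M elements, the sketch simply returns all of them.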

7 Reservoir Sampling (Example). Example with sample size M = 2 (figure not preserved in this transcript).

8 Problems with Reservoir Sampling. Reservoir sampling lacks support for deletions, which are needed for stable data sets (data sets whose size stays roughly constant as items are inserted and deleted).

9 An Incorrect Approach. Idea: use arriving insertions to refill the sample after deletions. The resulting sample is not uniform!

10 Random Pairing. Random pairing compensates deletions with arriving insertions and corrects the inclusion probabilities. General idea (on insertion): if there are no uncompensated deletions, proceed as in reservoir sampling; otherwise, randomly select an uncompensated deletion (the partner) and compensate it. Was the partner in the sample? If yes, add the arriving element to the sample; if no, ignore the arriving element.

11 Random Pairing (Cont.). The RP algorithm maintains two counters: c1 records the number of uncompensated deletions in which the deleted item was in the sample; c2 records the number of uncompensated deletions in which the deleted item was not in the sample. Their sum d = c1 + c2 is the total number of uncompensated deletions.
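The insertion and deletion steps with the two counters can be sketched as follows. Several details are assumptions made for this sketch rather than statements from the slides: the class name `RandomPairingSample`, the requirement that items are distinct and that only items currently in the dataset are deleted, and the use of the dataset size |R| in the reservoir step's accept probability M/|R|. Choosing a partner uniformly among the d uncompensated deletions means the partner was in the sample with probability c1/d, which is what the code draws on.

```python
import random

class RandomPairingSample:
    """Sketch of random pairing with counters c1 and c2 (assumes distinct items)."""

    def __init__(self, M, rng=None):
        self.M = M
        self.rng = rng or random.Random()
        self.sample = []
        self.size = 0  # current dataset size |R| (assumed to be tracked here)
        self.c1 = 0    # uncompensated deletions of items that were in the sample
        self.c2 = 0    # uncompensated deletions of items that were not

    def insert(self, item):
        self.size += 1
        d = self.c1 + self.c2
        if d == 0:
            # No uncompensated deletions: ordinary reservoir step.
            if len(self.sample) < self.M:
                self.sample.append(item)
            elif self.rng.random() < self.M / self.size:
                self.sample[self.rng.randrange(self.M)] = item
        elif self.rng.random() < self.c1 / d:
            # Partner was in the sample: the arriving item takes its place.
            self.sample.append(item)
            self.c1 -= 1
        else:
            # Partner was not in the sample: ignore the arriving item.
            self.c2 -= 1

    def delete(self, item):
        # Caller must only delete items currently in the dataset.
        self.size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1

rp = RandomPairingSample(M=3)
for x in range(10):
    rp.insert(x)
rp.delete(rp.sample[0])
rp.insert(10)
print(rp.sample, rp.c1, rp.c2)
```

A list is used for the sample so that a random victim can be picked by index; deletion is then linear in M, which is acceptable for a sketch but would be replaced by a hash-based structure in practice.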

12 Random Pairing Example (figure not preserved in this transcript).

13 Conclusion. Reservoir sampling lacks support for deletions. Random pairing uses arriving insertions to compensate for deletions. Can these sampling schemes be applied to sliding windows? It may be difficult, because the number of items in the window is unknown in advance.

