A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets
Rainer Gemulla, Wolfgang Lehner, and Peter J. Haas (VLDB 2006)
2008/8/27



Outline
1. Introduction
2. Bernoulli Sampling
3. Reservoir Sampling
4. Random Pairing
5. Conclusion

Introduction
Random sampling is used in approximate query answering, data mining, data stream processing, query optimization, and data integration. For example, it is often infeasible to process or store an entire data stream; random sampling is then an appealing approach to building synopses of large data streams.

Uniform Sampling
Under uniform sampling, all samples of the same size are equally likely. Many statistical procedures assume uniformity, and uniform samples are flexible to work with.
Example (figure not preserved): a data set (also called the population) and its possible samples of a fixed size.
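The definition of uniformity can be made concrete by enumerating a sample space. The sketch below is only an illustration: the function name `uniform_sample_space`, the population, and the sample size are hypothetical choices, not the values from the lost slide figure.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def uniform_sample_space(population, k):
    """Map every size-k subset of the population to its probability.

    Under uniform sampling, each of the C(n, k) subsets is equally likely.
    """
    p = Fraction(1, comb(len(population), k))
    return {subset: p for subset in combinations(population, k)}


# Hypothetical population and sample size, chosen only for illustration.
space = uniform_sample_space(["a", "b", "c"], 2)
for subset, prob in space.items():
    print(subset, prob)
```

Any sampling scheme that assigns different probabilities to two subsets of the same size is, by this definition, not uniform.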

Bernoulli Sampling
Each inserted item is included in the sample with probability q and excluded with probability 1-q. For a dataset R, the sample size follows the binomial distribution BINOM(|R|, q), so that a sample S has size k with probability $\Pr(|S| = k) = \binom{|R|}{k} q^k (1-q)^{|R|-k}$ for $0 \le k \le |R|$. The main disadvantage is the uncontrollable variability of the sample size.
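A minimal sketch of Bernoulli sampling over a sequence of insertions, assuming the items arrive as a Python iterable. The function name `bernoulli_sample` and the injectable `rng` parameter are illustrative assumptions, not part of the original slides.

```python
import random


def bernoulli_sample(items, q, rng=None):
    """Keep each item independently with probability q."""
    if rng is None:
        rng = random.Random()
    # rng.random() lies in [0, 1), so q = 1 keeps everything and q = 0 keeps nothing.
    return [item for item in items if rng.random() < q]


# The sample size varies from run to run; only its distribution is fixed.
for seed in range(3):
    print(len(bernoulli_sample(range(1000), 0.1, random.Random(seed))))
```

Because the size is binomial, its mean is q|R| and its variance is q(1-q)|R|, which is the variability the slide refers to.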

Reservoir Sampling
Reservoir sampling maintains a random sample of fixed size M and is a building block for many sophisticated sampling schemes. It is a single-scan algorithm: add the first M elements to the sample; afterwards, flip a coin for each arriving element and either (a) ignore the element (reject) or (b) replace a random element in the sample with it (accept). The accept probability of the i-th element (i > M) is M/i.
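The single-scan procedure can be sketched as follows. This is a minimal illustration that uses the standard acceptance probability M/i for the i-th element; the function name `reservoir_sample` and the `rng` parameter are assumptions made for the example.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of at most m items from one pass over stream."""
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            # The first m elements always enter the sample.
            sample.append(item)
        elif rng.random() < m / i:
            # Accept: overwrite a uniformly chosen slot in the sample.
            sample[rng.randrange(m)] = item
        # Otherwise reject: the element is ignored.
    return sample


print(reservoir_sample(range(1, 101), 5, random.Random(0)))
```

The sample size never exceeds m, and the stream length does not need to be known in advance.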

Reservoir Sampling (Example)
[Figure not preserved: worked example of reservoir sampling; the value of the sample size M is missing from the transcript.]

Problems with Reservoir Sampling
Reservoir sampling lacks support for deletions, which are needed to maintain samples of stable data sets.

An Incorrect Approach
Idea: use arriving insertions to refill the sample. The resulting sample is not uniform!
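A small exact calculation shows why refilling is not uniform. The scenario is hypothetical and chosen for brevity: sample size M = 1, the operations insert a, insert b, delete a, insert c. It also assumes one particular reading of the refill idea, namely that an insertion goes straight into the sample when there is free space and otherwise triggers an ordinary reservoir step on the current dataset size.

```python
from fractions import Fraction


def naive_refill_distribution():
    """Exact distribution of the final one-element sample under naive refilling."""
    half = Fraction(1, 2)
    dist = {}
    # After inserting a and b with M = 1, the reservoir holds a or b, each w.p. 1/2.
    for held, p in (("a", half), ("b", half)):
        if held == "a":
            # Deleting a empties the sample, so c is used to refill it.
            dist["c"] = dist.get("c", 0) + p
        else:
            # The sample still holds b; c is accepted with probability M/|R| = 1/2.
            dist["c"] = dist.get("c", 0) + p * half
            dist["b"] = dist.get("b", 0) + p * half
    return dist


print(naive_refill_distribution())
```

The final dataset is {b, c}, so a uniform sample of size 1 would hold each item with probability 1/2. The naive scheme instead holds c with probability 3/4 and b with probability 1/4.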

Random Pairing
Random pairing compensates deletions with arriving insertions and corrects the inclusion probabilities.
General idea (insertion): if there are no uncompensated deletions, proceed as in reservoir sampling. Otherwise, randomly select an uncompensated deletion (the partner) and compensate it. If the partner was in the sample, add the arriving element to the sample; if it was not, ignore the arriving element.

Random Pairing (Cont.)
The RP algorithm maintains two counters: $c_1$ records the number of uncompensated deletions in which the deleted item was in the sample, and $c_2$ records the number of uncompensated deletions in which the deleted item was not in the sample. Their sum $d = c_1 + c_2$ is the total number of uncompensated deletions.
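The insertion rule and the two counters can be combined into a short sketch. It assumes that items are distinct, that only items currently in the dataset are deleted, and that picking a random partner is equivalent to adding the new item with probability c1/d. The class name `RandomPairingSample` and its attribute names are hypothetical.

```python
import random


class RandomPairingSample:
    """Bounded-size sample under insertions and deletions (illustrative sketch)."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity   # M, the upper bound on the sample size
        self.sample = []
        self.dataset_size = 0      # |R|, the current number of items in the dataset
        self.c1 = 0                # uncompensated deletions that hit the sample
        self.c2 = 0                # uncompensated deletions that missed the sample
        self.rng = rng if rng is not None else random.Random()

    def insert(self, item):
        self.dataset_size += 1
        d = self.c1 + self.c2
        if d == 0:
            # No uncompensated deletions: ordinary reservoir step.
            if len(self.sample) < self.capacity:
                self.sample.append(item)
            elif self.rng.random() < self.capacity / self.dataset_size:
                self.sample[self.rng.randrange(self.capacity)] = item
        elif self.rng.random() < self.c1 / d:
            # The partner was in the sample: the new item takes its place.
            self.sample.append(item)
            self.c1 -= 1
        else:
            # The partner was not in the sample: ignore the new item.
            self.c2 -= 1

    def delete(self, item):
        self.dataset_size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```

A short usage example: with `rp = RandomPairingSample(2)`, calling `rp.insert(x)` for a few items, then `rp.delete(x)` for one of them, and then `rp.insert(y)` lets the new item y compensate the earlier deletion instead of running a reservoir step.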

Random Pairing Example
[Figure not preserved.]

Conclusion
Reservoir sampling lacks support for deletions. Random pairing uses arriving insertions to compensate for deletions. Can these sampling schemes be applied to sliding windows? It may be difficult, because the number of items in the window is unknown in advance.