A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets Rainer Gemulla (University of Technology Dresden) Wolfgang Lehner (University.

Presentation transcript:

A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets Rainer Gemulla (University of Technology Dresden) Wolfgang Lehner (University of Technology Dresden) Peter J. Haas (IBM Almaden Research Center) Faculty of Computer Science, Institute System Architecture, Database Technology Group

Rainer Gemulla, Wolfgang Lehner, Peter J. Haas: A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets (VLDB 2006)

Slide 2: Outline
1. Introduction
2. Deletions
3. Resizing
4. Experiments
5. Summary

Slide 3: Random Sampling
Database applications
– huge data sets
– complex algorithms (space & time)
Requirements
– performance, performance, performance
Random sampling
– approximate query answering
– data mining
– data stream processing
– query optimization
– data integration
Example: Turnover in Europe (TPC-H)
Sample | Estimate | Time
1% | 8.46 Mil. ± 0.15 Mil. | 4s
10% | 8.51 Mil. ± 0.05 Mil. | 52s
100% | 8.54 Mil. | 200s

Slide 4: The Problem Space
Setting
– arbitrary data sets
– samples of the data
– evolving data
Scope of this talk
– maintenance of random samples
Can we minimize or even avoid access to base data?

Slide 5: Types of Data Sets
Data sets
– variation of data set size
– influence on sampling
Types
– Stable. Goal: stable sample
– Growing. Goal: controlled growing sample
– Shrinking. Uninteresting

Slide 6: Uniform Sampling
Uniform sampling
– all samples of the same size are equally likely
– many statistical procedures assume uniformity
– flexibility
Example
– a data set (also called population)
– possible samples of size 2
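
The example on this slide was a figure that did not survive in the transcript. As a rough stand-in, the sketch below enumerates all samples of size 2 for a hypothetical three-element population (`"a"`, `"b"`, `"c"` are made up for illustration) and assigns each the same probability, which is what uniformity means here.

```python
from fractions import Fraction
from itertools import combinations

def uniform_sample_space(population, size):
    """Map every possible sample of the given size to its probability
    under uniform sampling (all samples equally likely)."""
    samples = list(combinations(population, size))
    p = Fraction(1, len(samples))
    return {s: p for s in samples}

# Hypothetical population; the slide's own data set is not in the transcript.
space = uniform_sample_space(["a", "b", "c"], 2)
for sample, prob in space.items():
    print(sample, prob)

# Inclusion probability of a single element: sum over samples containing it.
p_a = sum(p for s, p in space.items() if "a" in s)
```

Each of the three possible samples has probability 1/3, and each element appears in a sample with probability 2/3 (sample size divided by population size).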

Slide 7: Reservoir Sampling
Reservoir sampling
– computes a uniform sample of M elements
– building block for many sophisticated sampling schemes
– single-scan algorithm
  1. add the first M elements
  2. afterwards, flip a coin for each arriving element
     a) ignore the element (reject)
     b) replace a random element in the sample (accept)
– accept probability of the ith element: M/i
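
The single-scan procedure above can be sketched as follows. This is a minimal illustration of classic reservoir sampling, not code from the talk; the function name and the `rng` parameter are assumptions added so the example can be seeded.

```python
import random

def reservoir_sample(stream, M, rng=random):
    """Return a uniform sample of up to M elements from a single pass over stream."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= M:
            # The first M elements always enter the sample.
            sample.append(item)
        elif rng.random() < M / i:
            # Accept the i-th element with probability M/i and
            # let it replace a randomly chosen sample element.
            sample[rng.randrange(M)] = item
        # Otherwise the element is rejected (ignored).
    return sample

picked = reservoir_sample(range(1000), 10, random.Random(0))
```

The sample never holds more than M elements, and the stream length does not need to be known in advance.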

Slide 8: Reservoir Sampling (Example)
Example
– sample size M = 2

Slide 9: Problems with Reservoir Sampling
Problems with reservoir sampling
– lacks support for deletions (stable data sets)
– cannot efficiently enlarge sample (growing data sets)

Slide 10: Outline
1. Introduction
2. Deletions
3. Resizing
4. Experiments
5. Summary

Slide 11: Naïve/Prior Approaches
Algorithm | Technique | Comments
Naïve | use insertions to immediately refill the sample | not uniform
Backing sample | let sample size decrease, but occasionally recompute | expensive, unstable
CAR(WOR) | immediately sample from base data to refill the sample | stable but expensive
Bernoulli sampling with purging | “coin flip” sampling with deletions, purge if too large | inexpensive but unstable
Passive sampling | developed for data streams (sliding windows only) | special case of our RP algorithm
Distinct-value sampling | tailored for multiset populations | expensive, low space efficiency in our setting
RS with deletions | conduct deletions, continue with smaller sample | unstable

Slide 12: Random Pairing
Random pairing
– compensates deletions with arriving insertions
– corrects inclusion probabilities
General idea (insertion)
– no uncompensated deletions → reservoir sampling
– otherwise, randomly select an uncompensated deletion (partner) and compensate it: Was it in the sample?
  – yes → add arriving element to sample
  – no → ignore arriving element

Slide 13: Random Pairing
Example

Slide 14: Random Pairing
Details of the algorithm
– keeping history of deleted items is expensive, but:
– maintenance of two counters suffices
– correctness proof is in the paper
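
A minimal sketch of how the two-counter idea might look in code, pieced together from the slides' description of random pairing. The class and attribute names (`RandomPairingSample`, `c1`, `c2`, `dataset_size`) are assumptions for illustration, and the sketch assumes distinct items and that only items currently in the data set are deleted. The authoritative algorithm and its correctness proof are in the paper.

```python
import random

class RandomPairingSample:
    """Bounded-size sample maintained under insertions and deletions."""

    def __init__(self, M, rng=None):
        self.M = M                    # upper bound on the sample size
        self.rng = rng or random.Random()
        self.sample = []
        self.dataset_size = 0
        self.c1 = 0  # uncompensated deletions that were in the sample
        self.c2 = 0  # uncompensated deletions that were not in the sample

    def insert(self, item):
        self.dataset_size += 1
        if self.c1 + self.c2 == 0:
            # No uncompensated deletions: ordinary reservoir sampling step.
            if len(self.sample) < self.M:
                self.sample.append(item)
            elif self.rng.random() < self.M / self.dataset_size:
                self.sample[self.rng.randrange(self.M)] = item
        elif self.rng.random() < self.c1 / (self.c1 + self.c2):
            # The randomly chosen partner was in the sample: take its place.
            self.sample.append(item)
            self.c1 -= 1
        else:
            # The partner was not in the sample: ignore the arriving item.
            self.c2 -= 1

    def delete(self, item):
        self.dataset_size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```

Choosing a partner uniformly among the uncompensated deletions is equivalent to picking "was in the sample" with probability c1 / (c1 + c2), which is why the deleted items themselves do not need to be stored.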

Slide 15: Outline
1. Introduction
2. Deletions
3. Resizing
4. Experiments
5. Summary

Slide 16: Growing Data Sets
The problem
– growing data set
[Figure panels: Data set (growing data set); Random pairing (stable sample, sampling fraction decreases)]

Slide 17: A Negative Result
Negative result
– There is no resizing algorithm which can enlarge a bounded-size sample without ever accessing base data.
Example
– data set
– samples of size 2
– new data set
– samples of size 3
Not uniform!
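
The slide's example figure is missing from the transcript, so the following is a hedged reconstruction of the argument with hypothetical data: a data set {1, 2, 3} with samples of size 2, followed by the insertion of element 4 and a target sample size of 3. An algorithm without base data access only sees its current sample and the new element, so the sketch enumerates which size-3 samples it could produce at all.

```python
from itertools import combinations

# Hypothetical data; the slide's own data set is not in the transcript.
old_data = [1, 2, 3]
new_item = 4

# Without base data access, a new sample can only be built from
# the current sample of size 2 plus the arriving element.
reachable = set()
for current in combinations(old_data, 2):
    candidates = sorted(set(current) | {new_item})
    reachable.update(combinations(candidates, 3))

# A uniform scheme must give every size-3 subset the same nonzero probability.
all_samples = set(combinations(old_data + [new_item], 3))
unreachable = all_samples - reachable
print(sorted(unreachable))
```

The sample (1, 2, 3) can never be produced, although uniformity requires it to occur with probability 1/4. This is the sense in which the enlarged sample is not uniform.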

Slide 18: Resizing
Goal
– efficiently increase sample size
– stay within an upper bound at all times
General idea
1. convert sample to Bernoulli sample
2. continue Bernoulli sampling until new sample size is reached
3. convert back to reservoir sample
Optimally balance cost
– cost of base data accesses (in step 1)
– time to reach new sample size (in step 2)

Slide 19: Resizing
Bernoulli sampling
– uniform sampling scheme
– each tuple is added to the sample with probability q
– sample size follows binomial distribution → no effective upper bound
Phase 1: Conversion to a Bernoulli sample
– given q, randomly determine sample size
– reuse reservoir sample to create Bernoulli sample
  – subsample
  – sample additional tuples (base data access)
– choice of q
  – small → fewer base data accesses
  – large → more base data accesses
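
Phase 1 might be sketched as below. This is an illustration of the idea on the slide rather than the paper's procedure: the function name, its signature, and the in-memory `base_data` list are assumptions, items are assumed distinct and hashable, and a real system would not scan the whole base data the way this toy version does.

```python
import random

def to_bernoulli_sample(sample, base_data, q, rng=random):
    """Turn a uniform sample of base_data into a Bernoulli(q) sample.

    Returns the new sample and the number of tuples fetched from base data.
    """
    # Given q, randomly determine the sample size: Binomial(len(base_data), q).
    target = sum(rng.random() < q for _ in range(len(base_data)))
    if target <= len(sample):
        # Small q: a uniform subsample suffices, no base data access.
        return rng.sample(sample, target), 0
    # Large q: keep the sample and draw the missing tuples from base data.
    in_sample = set(sample)
    rest = [t for t in base_data if t not in in_sample]
    extra = rng.sample(rest, target - len(sample))
    return list(sample) + extra, len(extra)

base = list(range(1000))
current = random.Random(0).sample(base, 100)
small_q, small_accesses = to_bernoulli_sample(current, base, 0.05, random.Random(1))
large_q, large_accesses = to_bernoulli_sample(current, base, 0.2, random.Random(2))
```

With q = 0.05 the expected size (about 50) is below the current 100 tuples, so the sample is only thinned out. With q = 0.2 the expected size is about 200, so roughly 100 tuples have to be read from base data.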

Slide 20: Resizing
Phase 2: Run Bernoulli sampling
– accept new tuples with probability q
– conduct deletions
– stop as soon as new sample size is reached
Phase 3: Revert to reservoir sampling
– switchover is trivial
Choosing q
– determines cost of Phase 1 and Phase 2
– goal: minimize total cost
  – base data access expensive → small q
  – base data access cheap → large q
– details in paper
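
To make the trade-off in choosing q concrete, here is a deliberately simplified cost model. It is an assumption for illustration only and not the cost function from the paper: it works with expected values, ignores deletions during Phase 2, and uses hypothetical parameter names (`N` for the data set size, `M` and `M_new` for the old and new sample sizes, `access_cost` per base data access, `arrival_cost` per waited-for insertion).

```python
def resize_cost(q, N, M, M_new, access_cost, arrival_cost):
    """Expected cost of resizing under a simplified, illustrative model."""
    # Phase 1: a Bernoulli(q) sample has about q*N tuples; tuples beyond
    # the current M must be fetched from base data.
    base_accesses = max(0.0, q * N - M)
    # Phase 2: each arriving tuple is accepted with probability q, so about
    # (missing tuples) / q arrivals are needed to reach the new size.
    size_after_phase1 = min(q * N, M_new)
    arrivals = (M_new - size_after_phase1) / q
    return access_cost * base_accesses + arrival_cost * arrivals

def best_q(N, M, M_new, access_cost, arrival_cost, steps=1000):
    """Grid search for q between 'no base access' and 'immediate resize'."""
    lo, hi = M / N, M_new / N
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda q: resize_cost(q, N, M, M_new,
                                               access_cost, arrival_cost))

# Hypothetical setting: 1M tuples, enlarge the sample from 100k to 200k.
expensive = best_q(1_000_000, 100_000, 200_000, access_cost=1000, arrival_cost=1)
moderate = best_q(1_000_000, 100_000, 200_000, access_cost=10, arrival_cost=1)
cheap = best_q(1_000_000, 100_000, 200_000, access_cost=0.1, arrival_cost=1)
```

Under these assumptions, expensive base data access pushes q down to M/N (no base data access at all), cheap access pushes q up to M_new/N (the sample is enlarged immediately), and moderate costs give a q in between.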

Slide 21: Resizing
Example
– resize by 30% if sampling fraction drops below 9%
– dependent on costs of accessing base data
  – low costs: immediate resizing
  – moderate costs: combined solution
  – high costs: degenerates to Bernoulli sampling

Slide 22: Outline
1. Introduction
2. Deletions
3. Resizing
4. Experiments
5. Summary

Slide 23: Total Cost
Total cost
– stable dataset, 10M operations
– sample size 100k, data access 10 times more expensive than sample access
[Figure groups: base data access; no base data access]

Slide 24: Sample Size
Sample size
– stable dataset, size 1M
– sample size 100k
[Figure groups: base data access; no base data access]

Slide 25: Outline
1. Introduction
2. Deletions
3. Resizing
4. Experiments
5. Summary

Slide 26: Summary
Reservoir sampling
– lacks support for deletions
– complete recomputation to enlarge the sample
Random pairing
– uses arriving insertions to compensate for deletions
Resizing
– base data access cannot be avoided
– minimizes total cost
Future work
– better q for resizing
– combine with existing techniques [4,8,17] to enhance flexibility, scalability

Slide 27: Thank you! Questions?

Slide 28: Backup: Bounded-Size Sampling
Why sampling?
– performance, performance, performance
How much to sample?
– influencing factors
  1. storage consumption
  2. response time
  3. accuracy
– choosing the sample size / sampling fraction
  1. largest sample that meets storage requirements
  2. largest sample that meets response time requirements
  3. smallest sample that meets accuracy requirements

Slide 29: Backup: Bounded-Size Sampling
Example
– random pairing vs. Bernoulli sampling (BS)
– average estimation
[Figure panels: Data set; Sample size (BS violates 1, 2); Standard error (BS violates 3)]

Slide 30: Backup: Distinct-Value Sampling
Distinct-value sampling (optimistic setting for DV)
– DV-scheme knows avg. dataset size in advance
– assume no storage for counters & hash functions
[Figure panels: Sample size (RP has better memory utilization); Execution time (RP is significantly faster)]

Slide 31: Backup: RS With Deletions
Reservoir sampling with deletions
– conduct deletions, continue with smaller sample size

Slide 32: Backup: Backing Sample
Evaluation
– data set consists of 1 million elements (on average)
– 100k sample, clustered insertions/deletions
[Figure panels: Data set (stable); Reservoir sampling (sample is empty eventually); Backing sample (expensive, unstable)]

Slide 33: Backup: An Incorrect Approach
Idea
– use arriving insertions to refill the sample
Not uniform!
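
Why the refill idea breaks uniformity can be checked by exact enumeration on a tiny case. The scenario below is hypothetical and not taken from the slide: start with a uniform sample of size 2 from {1, 2, 3}, delete element 1, then insert element 4 using the naive rule (refill the sample if it has room, otherwise do a normal reservoir step).

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

M = 2
dist = defaultdict(Fraction)  # final sample -> exact probability

initial = list(combinations([1, 2, 3], M))  # uniform: each has probability 1/3
for s in initial:
    p = Fraction(1, len(initial))
    s = tuple(x for x in s if x != 1)        # delete element 1
    if len(s) < M:
        # Naive rule: the sample has room, so the insertion always refills it.
        dist[tuple(sorted(s + (4,)))] += p
    else:
        # Sample is full: reservoir step on the new data set {2, 3, 4}.
        accept = Fraction(M, 3)
        dist[s] += p * (1 - accept)
        for victim in s:
            kept = tuple(x for x in s if x != victim)
            dist[tuple(sorted(kept + (4,)))] += p * accept / M

for sample, prob in sorted(dist.items()):
    print(sample, prob)
```

A uniform sample of {2, 3, 4} would give each of the three possible samples probability 1/3. The naive rule instead yields 4/9 for (2, 4) and (3, 4) but only 1/9 for (2, 3), so the newly inserted element is over-represented.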

Slide 34: Backup: Random Pairing
Evaluation
– data set consists of 1 million elements (on average)
– 100k sample, clustered insertions/deletions
[Figure panels: Data set (stable); Reservoir sampling (sample gets empty eventually); Random pairing (no base data access!)]

Slide 35: Backup: Average Sample Size
Average sample size
– stable dataset, 10M operations
– sample size 100k

Slide 36: Backup: Average Sample Size With Clustered Insertions/Deletions
Average sample size with clustered insertions/deletions
– stable dataset, size 10M, ~8M operations
– sample size 100k

Slide 37: Backup: Cost
Cost
– stable dataset, 10M operations
– sample size 100k

Slide 38: Backup: Cost With Clustered Insertions/Deletions
Cost with clustered insertions/deletions
– stable dataset, size 10M, ~8M operations
– sample size 100k

Slide 39: Backup: Resizing (Value of q)
Resizing
– enlarge sample from 100k to 200k
– base data access 10ms, arrival rate 1ms