Faculty of Computer Science, Institute of System Architecture, Database Technology Group
A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets
Rainer Gemulla (University of Technology Dresden)
Wolfgang Lehner (University of Technology Dresden)
Peter J. Haas (IBM Almaden Research Center)
Outline
- Introduction
- Deletions
- Resizing
- Experiments
- Summary
Random Sampling
- Database applications: huge data sets, complex algorithms (space & time)
- Requirements: performance, performance, performance
- Random sampling supports: approximate query answering, data mining, data stream processing, query optimization, data integration
Example: Turnover in Europe (TPC-H)

  Sample   Estimate    95% error bound   Time
  1%       8.46 Mil.   0.15 Mil.         4s
  10%      8.51 Mil.   0.05 Mil.         52s
  100%     8.54 Mil.   (exact)           200s

Setup: TPC-H scale 1 (6M tuples in fact table), Zipf 1.5, normalized join synopsis, 95% confidence.
The Problem Space
Setting:
- arbitrary data sets
- samples of the data
- evolving data
Scope of this talk: maintenance of random samples. Can we minimize or even avoid access to the base data?
Types of Data Sets
Data sets differ in how their size varies over time, which influences sampling:
- Stable: goal is a stable sample
- Growing: goal is a controlled, growing sample
- Shrinking: uninteresting
Setup of the running example: initial size 1M; at each step a random decision between 30 insertions and 30 deletions.
Uniform Sampling
- all samples of the same size are equally likely
- many statistical procedures assume uniformity
- flexibility
Example: a data set (also called the population) and its possible samples of size 2 (decimals omitted in the figure for brevity).
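The uniformity requirement can be made concrete on a toy population (the values below are illustrative, not the slide's figure): a uniform size-2 scheme must assign equal probability to every 2-element subset.

```python
from itertools import combinations

# Illustrative 4-element population (the slide's actual values are in a figure)
population = [10, 20, 30, 40]

# Enumerate all possible samples of size 2
samples = list(combinations(population, 2))

# There are C(4, 2) = 6 size-2 subsets; a uniform sampling scheme
# must produce each of them with probability 1/6
print(len(samples))  # → 6
```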
Reservoir Sampling
- computes a uniform sample of M elements
- building block for many sophisticated sampling schemes
- single-scan algorithm:
  - add the first M elements to the sample
  - afterwards, flip a coin for each arriving element: either ignore it (reject) or replace a random element in the sample (accept)
  - the accept probability of the i-th element is M/i
- can be used to maintain a sample (insertions only)
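The single-scan algorithm above fits in a few lines; here is a minimal Python sketch (function name and signature are mine, not from the paper):

```python
import random

def reservoir_sample(stream, M, rng=random):
    """Maintain a uniform sample of M elements over a stream (insertions only)."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if len(sample) < M:
            sample.append(item)                  # first M elements always enter
        elif rng.random() < M / i:               # accept i-th element w.p. M/i
            sample[rng.randrange(M)] = item      # replace a random sample slot
    return sample
```

When the stream is shorter than M, the whole stream is the sample; otherwise each element ends up in the sample with probability M/n.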
Reservoir Sampling (Example)
sample size M = 2
Problems with Reservoir Sampling
- lacks support for deletions (stable data sets)
- cannot efficiently enlarge the sample (growing data sets)
Naïve/Prior Approaches
Algorithm: technique (comments)
- RS with deletions: conduct deletions, continue with a smaller sample (unstable)
- Naïve: use insertions to immediately refill the sample (not uniform)
- Backing sample: let the sample size decrease, but occasionally recompute (expensive, unstable)
- CAR(WOR): immediately sample from the base data to refill the sample (stable but expensive)
- Bernoulli sampling with purging: "coin flip" sampling with deletions, purge if too large (inexpensive but unstable)
- Passive sampling: sample size decreases due to deletions (developed for data streams, sliding windows only; special case of our RP algorithm)
- Distinct-value sampling: tailored for multiset populations (expensive, low space efficiency in our setting)
Random Pairing
- compensates deletions with subsequently arriving insertions
- corrects the inclusion probabilities
General idea (on insertion):
- if there are no uncompensated deletions: plain reservoir sampling
- otherwise, randomly select an uncompensated deletion (the partner) and compensate it. Was the partner in the sample?
  - yes: add the arriving element to the sample
  - no: ignore the arriving element (the net effect is like an update)
Random Pairing Example
Random Pairing
Details of the algorithm:
- keeping a history of deleted items would be expensive, but maintaining two counters suffices
- the correctness proof is in the paper
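A sketch of Random Pairing using the two counters (class and attribute names are mine; the paper's bookkeeping is equivalent): c1 counts uncompensated deletions whose victim was in the sample, c2 those whose victim was not. An arriving insertion then takes a partner's sample slot with probability c1/(c1+c2).

```python
import random

class RandomPairing:
    """Sketch of Random Pairing; c1/c2 are hypothetical counter names."""

    def __init__(self, M, rng=random):
        self.M, self.rng = M, rng
        self.sample = []
        self.n = 0    # current dataset size
        self.c1 = 0   # uncompensated deletions that were in the sample
        self.c2 = 0   # uncompensated deletions that were not

    def insert(self, item):
        self.n += 1
        if self.c1 + self.c2 == 0:
            # no uncompensated deletions: plain reservoir step
            if len(self.sample) < self.M:
                self.sample.append(item)
            elif self.rng.random() < self.M / self.n:
                self.sample[self.rng.randrange(self.M)] = item
        else:
            # pair the insertion with a random uncompensated deletion
            if self.rng.random() < self.c1 / (self.c1 + self.c2):
                self.sample.append(item)   # partner was in the sample
                self.c1 -= 1
            else:
                self.c2 -= 1               # partner was not: ignore item

    def delete(self, item):
        self.n -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```

Note that no deleted item is ever stored: only the two counters and the sample itself are kept.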
Growing Data Sets
The problem: on a growing data set, Random Pairing keeps the sample size stable, so the sampling fraction decreases.
Setup: initial size 1M; at each step a random decision between 30 insertions and 30 deletions.
A Negative Result
There is no resizing algorithm that can enlarge a bounded-size sample without ever accessing the base data.
Example: from a data set's samples of size 2, one cannot construct samples of size 3 of the new, larger data set. Not uniform!
Resizing
Goal:
- efficiently increase the sample size
- stay within an upper bound at all times (the bound avoids unpleasant memory overflows)
General idea:
1. convert the sample to a Bernoulli sample
2. continue Bernoulli sampling until the new sample size is reached
3. convert back to a reservoir sample
Optimally balance the cost of base data accesses (in step 1) against the time to reach the new sample size (in step 2).
Resizing
Bernoulli sampling:
- a uniform sampling scheme; each tuple is added to the sample with probability q, independently of the other tuples
- the sample size follows a binomial distribution; there is no effective upper bound
Phase 1: conversion to a Bernoulli sample
- given q, randomly determine the sample size
- reuse the reservoir sample to create the Bernoulli sample: either subsample it, or sample additional tuples (base data access)
- choice of q: a small q means fewer base data accesses, a large q means more; q is a parameter
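Phase 1 might look like the following sketch (the naive binomial draw and the `fetch_from_base` callback are my assumptions; the paper's exact subsampling procedure is more careful):

```python
import random

def to_bernoulli(sample, dataset_size, q, fetch_from_base, rng=random):
    """Phase 1 sketch: turn a reservoir sample into a Bernoulli(q) sample.

    fetch_from_base(k) is a hypothetical callback that draws k additional
    uniform tuples from the base data (the expensive step)."""
    # A Bernoulli(q) sample of the dataset has Binomial(dataset_size, q) size;
    # drawn naively here as a sum of coin flips
    target = sum(rng.random() < q for _ in range(dataset_size))
    if target <= len(sample):
        # small q: a subsample of the reservoir suffices, no base access
        return rng.sample(sample, target)
    # large q: top up with (target - len(sample)) tuples from the base data
    return sample + fetch_from_base(target - len(sample))
```

The branch structure mirrors the cost trade-off on the slide: only when the drawn size exceeds the current sample does the base data get touched.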
Resizing
Phase 2: run Bernoulli sampling
- accept new tuples with probability q
- conduct deletions
- stop as soon as the new sample size is reached
Phase 3: revert to reservoir sampling (the switchover is trivial)
Choosing q:
- q determines the cost of Phase 1 and Phase 2; goal: minimize total cost
- base data access expensive: small q; base data access cheap: large q
- details in the paper
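Phase 2 as a sketch, assuming a hypothetical stream of (op, item) events (function and parameter names are mine):

```python
import random

def run_phase2(sample, target_size, ops, q, rng=random):
    """Phase 2 sketch: continue Bernoulli(q) sampling over arriving
    operations until the sample reaches target_size.

    ops yields ("insert" | "delete", item) events."""
    for op, item in ops:
        if op == "insert" and rng.random() < q:
            sample.append(item)                # accept new tuple w.p. q
        elif op == "delete" and item in sample:
            sample.remove(item)                # conduct deletions
        if len(sample) >= target_size:
            break   # Phase 3: switch back to reservoir sampling (trivial)
    return sample
```

A larger q reaches `target_size` after fewer arrivals, which is exactly the Phase 2 side of the cost trade-off.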
Resizing Example
- resize by 30% if the sampling fraction drops below 9%
- behavior depends on the cost of accessing the base data:
  - low cost: immediate resizing
  - moderate cost: combined solution
  - high cost: degenerates to Bernoulli sampling
(Same experimental settings as for the other schemes.)
Total Cost
- stable dataset, 10M operations; sample size 100k
- a base data access is 10 times more expensive than a sample access
- we counted the number of accesses to sample & base data (note the logarithmic scale)
- schemes split into those that access the base data and those that never do
Sample Size
- stable dataset of size 1M; sample size 100k
- schemes split into those that access the base data and those that never do
Summary
Reservoir Sampling:
- lacks support for deletions
- requires complete recomputation to enlarge the sample
Random Pairing:
- uses arriving insertions to compensate for deletions
Resizing:
- base data access cannot be avoided
- minimizes the total cost
Future work:
- a better q for resizing
- combine with existing techniques [4,8,17] to enhance flexibility and scalability ([4,8]: disk-based sampling; [17]: warehousing of sample data)
Thank you! Questions?
Backup: Bounded-Size Sampling
Why sampling? Performance, performance, performance.
How much to sample? Influencing factors: storage consumption, response time, accuracy. Choosing the sample size / sampling fraction:
1. the largest sample that meets the storage requirements
2. the largest sample that meets the response time requirements
3. the smallest sample that meets the accuracy requirements
Backup: Bounded-Size Sampling
Example: random pairing vs. Bernoulli sampling, average estimation. Plots: data set size, sample size, standard error. Chart annotations: BS violates requirements 1 and 2; BS violates requirement 3.
Backup: Distinct-Value Sampling
Distinct-value sampling (optimistic setting for DV):
- the DV scheme knows the average dataset size in advance
- assume no storage is needed for counters & hash functions
Plots of sample size vs. execution time (log scale, 10ms to 1000s): RP has better memory utilization; RP is significantly faster.
Backup: RS With Deletions
Reservoir sampling with deletions: conduct deletions and continue with a smaller sample size.
Backup: Backing Sample
Evaluation: data set of 1 million elements (on average); 100k sample; clustered insertions/deletions; at each step, 30 tuples are inserted/deleted at random; 1M steps.
- Data set: stable
- Reservoir sampling: the sample eventually becomes empty
- Backing sample: expensive, unstable
Backup: An Incorrect Approach
Idea: use arriving insertions to refill the sample. Not uniform!
Backup: Random Pairing
Evaluation: data set of 1 million elements (on average); 100k sample; clustered insertions/deletions; at each step, 30 tuples are inserted/deleted at random; 1M steps.
- Data set: stable
- Reservoir sampling: the sample eventually becomes empty
- Random pairing: stable sample, no base data access!
Backup: Average Sample Size
stable dataset, 10M operations; sample size 100k
Backup: Average Sample Size With Clustered Insertions/Deletions
stable dataset of size 10M, ~8M operations; sample size 100k; cluster sizes up to 2^20 ≈ 1M
Backup: Cost
stable dataset, 10M operations; sample size 100k
Backup: Cost With Clustered Insertions/Deletions
stable dataset of size 10M, ~8M operations; sample size 100k; cluster sizes up to 2^20 ≈ 1M
Backup: Resizing (Value of q)
enlarge the sample from 100k to 200k; base data access costs 10ms, one tuple arrives per 1ms