Faculty of Computer Science, Institute of System Architecture, Database Technology Group. A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets.


Faculty of Computer Science, Institute of System Architecture, Database Technology Group. A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets (VLDB 2006). Rainer Gemulla (University of Technology Dresden), Wolfgang Lehner (University of Technology Dresden), Peter J. Haas (IBM Almaden Research Center)

Outline: Introduction, Deletions, Resizing, Experiments, Summary.

Random Sampling. Database applications involve huge data sets and complex algorithms (space & time); the requirements are performance, performance, performance. Uses of random sampling: approximate query answering, data mining, data stream processing, query optimization, data integration. Example: turnover in Europe (TPC-H): 1% sample: 8.46 Mil. ± 0.15 Mil. in 4s; 10% sample: 8.51 Mil. ± 0.05 Mil. in 52s; 100%: 8.54 Mil. in 200s. (TPC-H scale 1, 6M tuples in the fact table, Zipf 1.5, normalized join synopsis, 95% confidence.)

The Problem Space. Setting: arbitrary data sets, samples of the data, evolving data. Scope of this talk: maintenance of random samples. Can we minimize, or even avoid, access to the base data?

Types of Data Sets. Data sets can be classified by the variation of the data set size and its influence on sampling: stable (goal: a stable sample), growing (goal: a controlled growing sample), and shrinking (uninteresting). Example: initial size 1M, random decision between 30 inserts/deletions at each step.

Uniform Sampling. In uniform sampling, all samples of the same size are equally likely; many statistical procedures assume uniformity, and uniform samples offer flexibility. Example: a data set (also called a population) and its possible samples of size 2 (decimals omitted for brevity).

Reservoir Sampling. Reservoir sampling computes a uniform sample of M elements and is a building block for many sophisticated sampling schemes. It is a single-scan algorithm: add the first M elements; afterwards, flip a coin for each arriving element and either ignore it (reject) or let it replace a random element in the sample (accept). The accept probability of the i-th element is M/i. Reservoir sampling can be used to maintain a sample under insertions only.
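The single-scan algorithm above can be sketched in a few lines of Python (a minimal illustration; the function name and the toy stream are ours, not from the talk):

```python
import random

def reservoir_sample(stream, M, rng=random.Random(42)):
    """Single scan over `stream`; keeps a uniform sample of M elements."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= M:
            sample.append(item)                  # accept the first M elements
        elif rng.random() < M / i:               # accept i-th element w.p. M/i
            sample[rng.randrange(M)] = item      # replace a random victim
    return sample

print(reservoir_sample(range(1000), 10))
```

Every element ends up in the sample with the same probability M/n, where n is the stream length, without knowing n in advance.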

Reservoir Sampling (Example): sample size M = 2.

Problems with Reservoir Sampling: it lacks support for deletions (stable data sets) and cannot efficiently enlarge the sample (growing data sets).

Outline: Introduction, Deletions, Resizing, Experiments, Summary.

Naïve/Prior Approaches (algorithm: technique; comments):
RS with deletions: conduct deletions, continue with a smaller sample; unstable.
Naïve: use insertions to immediately refill the sample; not uniform.
Backing sample: let the sample size decrease, but occasionally recompute; expensive, unstable.
CAR(WOR): immediately sample from the base data to refill the sample; stable but expensive.
Bernoulli sampling with purging: "coin flip" sampling with deletions, purge if too large; inexpensive but unstable.
Passive sampling: sample size decreases due to deletions; developed for data streams (sliding windows only); a special case of our RP algorithm.
Distinct-value sampling: tailored for multiset populations; expensive, low space efficiency in our setting.

Random Pairing. Random pairing compensates deletions with arriving insertions, correcting the inclusion probabilities. General idea (insertion): if there are no uncompensated deletions, proceed as in reservoir sampling; otherwise, randomly select an uncompensated deletion (the "partner") and compensate it: if the partner was in the sample, add the arriving element to the sample (like an update); if not, ignore the arriving element.

Random Pairing Example.

Random Pairing, details of the algorithm: keeping a history of deleted items would be expensive, but maintaining two counters suffices. The correctness proof is in the paper.
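The two-counter idea can be sketched as follows. This is our own illustrative Python, not the paper's exact pseudocode: `c_in` counts uncompensated deletions that hit the sample and `c_out` those that missed it, and items are assumed distinct:

```python
import random

class RandomPairing:
    """Bounded-size sample (at most M items) under inserts and deletes."""

    def __init__(self, M, rng=None):
        self.M, self.rng = M, rng or random.Random(7)
        self.sample, self.n = [], 0        # n = current dataset size
        self.c_in = self.c_out = 0         # uncompensated deletions

    def insert(self, item):
        self.n += 1
        if self.c_in + self.c_out == 0:    # nothing to compensate:
            if len(self.sample) < self.M:  # plain reservoir step
                self.sample.append(item)
            elif self.rng.random() < self.M / self.n:
                self.sample[self.rng.randrange(self.M)] = item
        else:                              # pair with a random uncompensated
            frac = self.c_in / (self.c_in + self.c_out)
            if self.rng.random() < frac:   # partner was in the sample:
                self.c_in -= 1
                self.sample.append(item)   # new item takes its place
            else:                          # partner was outside: skip item
                self.c_out -= 1

    def delete(self, item):
        self.n -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c_in += 1
        else:
            self.c_out += 1
```

Note that no deleted item is ever stored: once all deletions are compensated by later insertions, the sample is back at its full size M.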

Outline: Introduction, Deletions, Resizing, Experiments, Summary.

Growing Data Sets. The problem: on a growing data set (initial size 1M, random decision between 30 inserts/deletions), random pairing keeps a stable sample, so the sampling fraction decreases.

A Negative Result. There is no resizing algorithm that can enlarge a bounded-size sample without ever accessing the base data. Example: for a data set with samples of size 2, any scheme that builds samples of size 3 from the size-2 samples alone is not uniform.

Resizing. Goal: efficiently increase the sample size while staying within an upper bound at all times (the upper bound avoids an unpleasant memory overflow). General idea: 1. convert the sample to a Bernoulli sample; 2. continue Bernoulli sampling until the new sample size is reached; 3. convert back to a reservoir sample. Optimally balance the cost of base data accesses (in step 1) against the time to reach the new sample size (in step 2).

Resizing, Phase 1. Bernoulli sampling is a uniform sampling scheme: each tuple is added to the sample with probability q, independently of other tuples. The sample size follows a binomial distribution, so there is no effective upper bound. Phase 1 converts the reservoir sample to a Bernoulli sample: given q, randomly determine the Bernoulli sample size, then reuse the reservoir sample to create it, either by subsampling or by sampling additional tuples (base data access). Choice of the parameter q: a small q means fewer base data accesses, a large q means more.

Resizing, Phases 2 and 3. Phase 2: run Bernoulli sampling: accept new tuples with probability q, conduct deletions, and stop as soon as the new sample size is reached. Phase 3: revert to reservoir sampling; the switchover is trivial. Choosing q determines the cost of Phase 1 and Phase 2; the goal is to minimize the total cost: if base data access is expensive, choose a small q; if it is cheap, choose a large q (details in the paper).
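The phases can be sketched as a toy Python function. This is our own simplified illustration under strong assumptions (an in-memory dataset list, a stream of arriving insertions only, no concurrent deletions); the paper's cost-optimal choice of q is not shown:

```python
import random

def resize_sample(sample, M_new, q, dataset, stream, rng=random.Random(3)):
    """Grow a reservoir sample to M_new via an intermediate Bernoulli phase."""
    n = len(dataset)
    # Phase 1: draw the Bernoulli sample size (Binomial(n, q) via n coin flips)
    size = sum(rng.random() < q for _ in range(n))
    if size <= len(sample):
        bern = rng.sample(sample, size)       # subsample: no base data access
    else:
        in_sample = set(sample)
        extra = [t for t in dataset if t not in in_sample]
        # fetch missing tuples from the base data (the expensive part)
        bern = sample + rng.sample(extra, size - len(sample))
    # Phase 2: accept arriving insertions with probability q until full
    for item in stream:
        if len(bern) >= M_new:
            break
        if rng.random() < q:
            bern.append(item)
    # Phase 3: the result is simply reused as a reservoir sample of size M_new
    return bern[:M_new]
```

A larger q makes Phase 1 fetch more tuples from the base data but lets Phase 2 finish sooner; a smaller q trades the other way, which is exactly the balance the talk optimizes.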

Resizing Example: resize by 30% if the sampling fraction drops below 9%; the behavior depends on the cost of accessing the base data (same experimental settings as in the other schemes). Low, moderate, and high costs are shown: with low costs, resizing is immediate; with high costs, the combined solution degenerates to Bernoulli sampling.

Outline: Introduction, Deletions, Resizing, Experiments, Summary.

Total Cost: stable dataset, 10M operations; sample size 100k; base data access 10 times more expensive than sample access. We counted the number of accesses to sample & base data (note the logarithmic scale). Schemes with and without base data access are shown.

Sample Size: stable dataset of size 1M; sample size 100k. Schemes with and without base data access are shown.

Outline: Introduction, Deletions, Resizing, Experiments, Summary.

Summary. Reservoir sampling lacks support for deletions and requires complete recomputation to enlarge the sample. Random pairing uses arriving insertions to compensate for deletions. Resizing: base data access cannot be avoided, but the total cost is minimized. Future work: a better q for resizing; combining with existing techniques [4,8,17] to enhance flexibility and scalability (4,8 = disk-based sampling; 17 = warehousing of sample data).

Thank you! Questions?

Backup: Bounded-Size Sampling. Why sampling? Performance, performance, performance. How much to sample? Influencing factors: storage consumption, response time, accuracy. Choosing the sample size / sampling fraction: the largest sample that meets the storage requirements, the largest sample that meets the response time requirements, and the smallest sample that meets the accuracy requirements.

Backup: Bounded-Size Sampling Example. Random pairing vs. Bernoulli sampling, average estimation; plots: data set size, sample size, standard error. Bernoulli sampling violates requirements 1 and 2; Bernoulli sampling violates requirement 3.

Backup: Distinct-Value Sampling (an optimistic setting for DV): the DV scheme knows the average dataset size in advance; assume no storage for counters & hash functions. Plots: sample size vs. execution time (log scale, 10ms to 1000s). RP has better memory utilization; RP is significantly faster.

Backup: RS With Deletions. Reservoir sampling with deletions: conduct deletions, continue with a smaller sample size.

Backup: Backing Sample Evaluation. The data set consists of 1 million elements (on average); 100k sample, clustered insertions/deletions; at each step, 30 tuples are inserted/deleted at random; 1M steps. The data set is stable; the reservoir sample eventually becomes empty; the backing sample is expensive and unstable.

Backup: An Incorrect Approach. Idea: use arriving insertions to refill the sample. Not uniform!

Backup: Random Pairing Evaluation. The data set consists of 1 million elements (on average); 100k sample, clustered insertions/deletions; at each step, 30 tuples are inserted/deleted at random; 1M steps. The data set is stable; the reservoir sample eventually becomes empty; random pairing requires no base data access!

Backup: Average Sample Size. Stable dataset, 10M operations; sample size 100k.

Backup: Average Sample Size With Clustered Insertions/Deletions. Stable dataset of size 10M, ~8M operations; sample size 100k; cluster sizes from 1 to 2^20 = 1M.

Backup: Cost. Stable dataset, 10M operations; sample size 100k.

Backup: Cost With Clustered Insertions/Deletions. Stable dataset of size 10M, ~8M operations; sample size 100k; cluster sizes from 1 to 2^20 = 1M.

Backup: Resizing (Value of q). Enlarge the sample from 100k to 200k; base data access 10ms, arrival rate 1ms; population sizes from 200k to 3M.