Deferred Maintenance of Disk-Based Random Samples
Rainer Gemulla (University of Technology Dresden)
Wolfgang Lehner (University of Technology Dresden)
Faculty of Computer Science, Institute System Architecture, Database Technology Group
Slide 2: Outline
1. Introduction
2. Logging Schemes
3. Refresh Algorithms
4. Performance
5. Summary & Outlook
Slide 3: Random Sampling
Analytical databases
– huge data sets
– complex algorithms
Requirements
– Performance, performance, performance!
Random sampling
– approximate query answering
– data mining
– data stream processing
– query optimization
– data integration
Turnover in Europe (TPC-H)
Sample   Estimate    +/-         Time
1%       8.46 Mil.   0.15 Mil.   4s
10%      8.51 Mil.   0.05 Mil.   52s
100%     8.54 Mil.               200s
Slide 4: Offline Sampling
Precomputed samples
– pros
  – avoid access to base data
  – used multiple times
  – arbitrary base data
  – versatile
– cons
  – maintenance!!!
Disk-based samples
– many, large samples stored on disk
– crash safe
– typically space-restricted
– challenges
  – sequential access is faster
  – blocking of data
Slide 5: Basics: Reservoir Sampling
Sampling with space constraints
– maintain a sample (reservoir) of M tuples
  – add the first M tuples
  – afterwards, roll a die
    a) ignore the tuple (reject)
    b) replace a random tuple in the sample (accept)
– accept probability controls sampling scheme
– building block for many sophisticated sampling schemes
Example
– dataset with 50 tuples (M=5)
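The reject/accept step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `reservoir_sample` is hypothetical, and the accept probability M/t (the classic choice for a uniform sample) is an assumption, since the slide only says that the accept probability controls the sampling scheme.

```python
import random

def reservoir_sample(stream, M, rng=None):
    """Keep a uniform random sample of at most M tuples from an iterable."""
    if rng is None:
        rng = random.Random()
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= M:
            sample.append(item)               # the first M tuples fill the reservoir
        elif rng.random() < M / t:            # assumed accept probability M/t
            sample[rng.randrange(M)] = item   # accept: replace a random tuple
        # otherwise the tuple is rejected and nothing is written
    return sample

# Example from the slide: a dataset with 50 tuples and M = 5.
print(reservoir_sample(range(50), M=5, rng=random.Random(42)))
```

Every accept overwrites a randomly chosen slot, which is the source of the random I/O once the reservoir lives on disk.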
Slide 6: Evolution of the Sample
Random I/O!!!
Slide 8: Full Logging
Full log
– track all changes
– log is written sequentially
– log contains more information than needed
Slide 9: Candidate Logging
Candidate log
– track only changes which affect the sample
– log is written sequentially
– smaller logs
How to implement Candidate Refresh?
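To make the difference between the two logging schemes concrete, here is a small sketch. It assumes insert-only operations, a reservoir that is already full with M tuples, and the accept probability M/t of plain reservoir sampling. The name `log_inserts` and the in-memory lists that stand in for the sequentially written log files are hypothetical.

```python
import random

def log_inserts(inserts, M, n_seen, rng, full_log, candidate_log):
    """Append inserts to both logs; return the updated dataset size."""
    for item in inserts:
        n_seen += 1
        full_log.append(item)              # full logging: track every change
        if rng.random() < M / n_seen:      # would reservoir sampling accept it?
            candidate_log.append(item)     # candidate logging: accepted tuples only
    return n_seen

full_log, candidate_log = [], []
n = log_inserts(range(6, 51), M=5, n_seen=5, rng=random.Random(1),
                full_log=full_log, candidate_log=candidate_log)
print(n, len(full_log), len(candidate_log))
```

In this sketch the sample itself is never touched while logging. The candidate log defers only the write into the sample, because the accept decision is already made at logging time.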
Slide 11: Naive Refresh
Naive refresh
– scan log file sequentially
– write each element of the log to a random position in the sample
– No improvement at all!
  – random access to sample
  – some elements are written more than once
Slide 12: Avoiding Multiple Writes
Observation
– each candidate can be overwritten by subsequent candidates only
– last candidate is never overwritten
Approach
– scan log in reverse order
– write a tuple only if its position in the sample has not been written before
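The observation can be checked with a short sketch that compares a forward (naive) refresh with a reverse scan. The function names are hypothetical, and the target positions are fixed up front so that both variants see the same random choices. Note that this intermediate version keeps a set of written positions in memory and still writes to random positions.

```python
import random

def naive_refresh(sample, log, positions):
    """Forward scan: every candidate is written, later ones overwrite earlier ones."""
    writes = 0
    for item, pos in zip(log, positions):
        sample[pos] = item
        writes += 1
    return writes

def reverse_refresh(sample, log, positions):
    """Reverse scan: write a candidate only if its position is still unwritten."""
    written = set()
    writes = 0
    for item, pos in zip(reversed(log), reversed(positions)):
        if pos not in written:
            sample[pos] = item
            written.add(pos)
            writes += 1
    return writes

rng = random.Random(3)
log = list(range(100, 111))                      # 11 candidates
positions = [rng.randrange(5) for _ in log]      # one target slot per candidate (M = 5)
a, b = ["old"] * 5, ["old"] * 5
print(naive_refresh(a, log, positions), reverse_refresh(b, log, positions), a == b)
```

Both variants end with the same sample, but the reverse scan writes each position at most once.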
Slide 13: Avoiding Multiple Writes
Probability of overwrites
In general
– k tuples written to sample (k = 0…5)
– probability of overwrite: $p_k = (M-k)/M$
– number of skipped tuples: $P(X_k = x) = (1-p_k)^x \, p_k$ ($k > 0$)
– $X_5 = \infty$ (because $p_5 = 0$)
– here: $X_1 = 0$, $X_2 = 1$, $X_3 = 1$, $X_4 = 6$
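As a worked illustration that is not on the slide: the distribution of the skip counts is geometric, so the expected number of skipped candidates follows from the standard mean of a geometric distribution. The values for M = 5 are given only for orientation.

```latex
\begin{align*}
P(X_k = x) &= (1 - p_k)^x \, p_k, \qquad p_k = \frac{M-k}{M} \\
E[X_k] &= \frac{1 - p_k}{p_k} = \frac{k}{M-k} \\
M = 5:\quad E[X_1] &= \tfrac{1}{4},\quad E[X_2] = \tfrac{2}{3},\quad E[X_3] = \tfrac{3}{2},\quad E[X_4] = 4
\end{align*}
```

The skips grow quickly as the sample fills with new tuples, which is why long stretches of the log can be passed over without being written.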
Slide 14: Nomem Refresh
Nomem Refresh (Phase 1)
– dry run: generate $X_4, \dots, X_1$ in advance
– reset pseudo-random number generator and generate same sequence again
– start at: |C|-X
indexes of log file are generated
Slide 15: Nomem Refresh
Naive update of sample
– read the log tuples at the generated indexes
– write each tuple to a random (free) position in the sample
– drawbacks
  – free positions have to be maintained
  – random access to the sample
Slide 16: Nomem Refresh
Nomem Refresh (Phase 2)
– general idea: order of the tuples in sample is unimportant
– algorithm
  – (re-)generate next position in the log (6, 8, 10, 11)
  – generate next position in the sample (1, 2, 3, 5)
  – read from log, write to sample
Slide 17: Nomem Refresh
Properties
– log file is read sequentially
– sample is written sequentially
– no overwrites
– no memory consumption
– works on full logs as well (DBMS!)
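The two phases can be put together in a compact sketch. It is one possible reading of the slides, not the authors' implementation, and it rests on several assumptions: the names `skip` and `nomem_refresh` are hypothetical, Python lists stand in for the log and sample files, geometric skips are drawn by inversion, sample positions are chosen with sequential selection sampling, and a second generator seeded with `seed + 1` keeps the replayed skip sequence undisturbed.

```python
import math
import random

def skip(k, M, rng):
    """Draw X_k, geometric with success probability p_k = (M - k) / M."""
    u = 1.0 - rng.random()                        # uniform in (0, 1]
    return int(math.log(u) / math.log(k / M))     # inversion; needs 0 < k < M

def nomem_refresh(sample, log, seed):
    """Refresh `sample` in place from the candidate `log`; return the number of writes."""
    M, n = len(sample), len(log)
    if M == 0 or n == 0:
        return 0
    # Phase 1, dry run: total distance spanned by the skips X_{M-1}, ..., X_1.
    rng = random.Random(seed)
    remaining = sum(skip(k, M, rng) + 1 for k in range(M - 1, 0, -1))
    # Reset the generator and replay, dropping skips that reach before the log start.
    rng = random.Random(seed)
    k = M - 1
    while k > 0 and remaining > n - 1:
        remaining -= skip(k, M, rng) + 1
        k -= 1
    to_write = k + 1                  # the last candidate is always written
    log_idx = n - 1 - remaining       # first log index to read (0-based)
    # Phase 2: read the log forward, write to increasing sample positions.
    slot_rng = random.Random(seed + 1)
    needed = to_write
    for slot in range(M):
        if needed == 0:
            break
        if slot_rng.random() * (M - slot) < needed:   # selection sampling
            sample[slot] = log[log_idx]
            needed -= 1
            if k > 0:
                log_idx += skip(k, M, rng) + 1        # continue the replayed sequence
                k -= 1
    return to_write

sample = ["old0", "old1", "old2", "old3", "old4"]
print(nomem_refresh(sample, list(range(100, 111)), seed=7), sample)
```

Apart from a few counters, the sketch stores neither the skip values nor the set of written positions. Both the log reads and the sample writes proceed in increasing index order.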
Slide 19: Experiments
Number of operations & execution time
– sample size: 1 million tuples
– refresh period: 1 million operations
Slide 20: Experiments
Refresh period & execution time
– sample size: 1 million tuples
– number of operations: 100 million
Slide 22: Summary & Outlook
Logging schemes
– full logs: often found in database systems
– candidate logs: reduce log file size
Nomem Refresh
– fast incremental refresh
– sequential disk access only
– no memory consumption
– works with full and candidate logs
Future work
– more detailed discussion of updates & deletions
Slide 23
Thank you! Questions?
Slide 24: Extensions
– nomem refresh for full logs (DBMS!)
  – dry run: compute candidates, count their number
  – reset random number generator
  – add skips of Nomem Refresh and Reservoir Sampling
– deletions and updates
  – store deletions and updates separately
  – process delete and update log first
  – run Nomem Refresh on the insert log
  – requires disjoint logs
Slide 25: Experiments
Comparison with the Geometric File
– sample size: 1 million tuples
– number of operations: 100 million
Slide 26: Experiments
Computational overhead
– sample size: 1 million tuples