1
Efficient Snapshot Differential Algorithms for Data Warehousing
Wilburt Juan Labio, Hector Garcia-Molina
2
Purpose
– detect modifications from the information source
– extract modifications from the information source
– the information source is not sophisticated (e.g., a legacy system)
[Figure: modifications flow from the local DB (information source) to the data warehouse]
3
Problem Outline
– a file contains distinct records {R_1, R_2, …, R_n}, where R_i is ⟨K_i, B_i⟩ (key K_i, body B_i)
– given two snapshots F_1 and F_2, produce the modifications as a file F_out
– possible modifications generated: insert, delete, update (see the sketch below)
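A minimal sketch of the problem setup in Python, assuming the record layout above; the Record and Modification classes are illustrative names, not from the paper.

from typing import NamedTuple

class Record(NamedTuple):
    K: str   # key that identifies the record
    B: str   # the remaining fields of the record

class Modification(NamedTuple):
    op: str          # "insert", "delete", or "update"
    record: Record

# two tiny snapshots and the F_out a snapshot differential should produce for them
F1 = [Record("k1", "a"), Record("k2", "b")]
F2 = [Record("k2", "b'"), Record("k3", "c")]
F_out = [Modification("delete", Record("k1", "a")),
         Modification("update", Record("k2", "b'")),
         Modification("insert", Record("k3", "c"))]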
4
Difficulties
– the physical location of a record may differ between snapshots
– wasted messages:
  – useless delete-insert pairs introduce waste
    – delete then insert the same record: do nothing
    – delete then insert a record with the same K (different B): update
  – useless insert-delete pairs introduce a correctness problem
    – insert then delete the same record: do nothing
    – insert then delete a record with the same K: update
5
Example: with physical movement
[Figure: snapshot F_{t-1} holds ⟨K_i,B_i⟩, ⟨K_{i+1},B_{i+1}⟩, …, ⟨K_{i+6},B_{i+6}⟩ in order; snapshot F_t holds K_i, K_{i+3}, K_{i+2}, K_{i+4} (with new value B'_{i+4}), K_{i+5}, K_j, K_{i+6}, so records have moved physically]
Modifications made: delete ⟨K_{i+1},B_{i+1}⟩, update K_{i+4} to B'_{i+4}, insert ⟨K_j,B_j⟩
6
Example: wasted messages
[Figure: snapshot F_{t-1} holds ⟨K_i,B_i⟩ through ⟨K_{i+7},B_{i+7}⟩; snapshot F_t reorders the surviving records, drops K_{i+1} and K_{i+7}, updates K_{i+4} and K_{i+5}, and inserts ⟨K_j,B_j⟩. A record such as ⟨K_i,B_i⟩ that merely moved to another position can be reported as a useless insert-delete pair or a useless delete-insert pair instead of no change]
7
Related Solutions
– maintain a log of modifications
– add a timestamp to the base table
– joins
8
Proposed Solutions
– alter the extraction application: code is worn
– parse the system log: need DBA privilege to get the log
– snapshot differential
[Figure: File_{t-1} and File_t are fed to a differ, which sends F_out to the data warehouse]
9
Algorithm Compromises
– related to joins, but cost less
– allow some useless delete-insert pairs
– change all insert-delete pairs to delete-insert pairs
  – batch and send all deletes first (see the sketch below)
– may miss a few modifications
– save the file for the next snapshot differential
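A minimal sketch of the batching compromise, assuming the modifications arrive as (op, K, B) tuples; batch_deletes_first is an illustrative name, not the paper's code.

from typing import List, Tuple

def batch_deletes_first(mods: List[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    # reordering the batch so every delete precedes the inserts and updates
    # turns any insert-delete pair on the same key into a delete-insert pair,
    # which the warehouse can apply safely
    deletes = [m for m in mods if m[0] == "delete"]
    others = [m for m in mods if m[0] != "delete"]
    return deletes + others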
10
Sort Merge Join I
– part I: sort the two input files
  – save the sorted file from the previous snapshot
  – use multi-way merge sort for F_2
    – creates runs, which are sequences of blocks with sorted records
    – merges runs until 1 run remains
    – 4 * |F_2| IO operations, assuming |F_2|^(1/2) < |M|
– part II: the merge takes |F_1| + |F_2| IO operations (see the sketch below)
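The part II merge amounts to one synchronized pass over two key-sorted files. A minimal sketch, assuming in-memory iterators of (K, B) tuples stand in for the block-at-a-time file reads; merge_diff and the tuple layout are illustrative, not the paper's code.

from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (K, B), sorted by K; keys are distinct within a snapshot

def merge_diff(f1_sorted: Iterator[Record],
               f2_sorted: Iterator[Record]) -> List[Tuple[str, str, str]]:
    out: List[Tuple[str, str, str]] = []
    r1 = next(f1_sorted, None)
    r2 = next(f2_sorted, None)
    while r1 is not None or r2 is not None:
        if r1 is None or (r2 is not None and r1[0] > r2[0]):
            out.append(("insert", r2[0], r2[1]))      # key only in the new snapshot
            r2 = next(f2_sorted, None)
        elif r2 is None or r1[0] < r2[0]:
            out.append(("delete", r1[0], r1[1]))      # key only in the old snapshot
            r1 = next(f1_sorted, None)
        else:                                          # same key in both snapshots
            if r1[1] != r2[1]:
                out.append(("update", r2[0], r2[1]))
            r1 = next(f1_sorted, None)
            r2 = next(f2_sorted, None)
    return out

In the real algorithm the same loop reads one block at a time from each file, which is where the |F_1| + |F_2| IO figure comes from.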
11
Sort Merge Join II
– reduce IO operations by reusing F_1 from the previous differential
– part I: produce sorted runs for F_2
  – sort F_2 into runs F_runs
    – creates runs, which are sequences of blocks with sorted records
    – 2 * |F_2| IO operations, assuming |F_2|^(1/2) < |M|
– part II: create the sorted F_2 while merging the files
  – the merge takes |F_1| + 2 * |F_2| IO operations
  – read into memory 1 block from each run in F_runs
  – select the record with the smallest K value (see the sketch below)
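A sketch of the run handling, assuming the records fit in memory as (K, B) tuples; make_runs, merge_runs, and run_size are illustrative. In the actual part II, the diff against the saved sorted F_1 is produced during this same merge pass, and the merged output is saved as the sorted file for the next differential.

import heapq
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (K, B); keys are distinct within a snapshot

def make_runs(f2: Iterable[Record], run_size: int = 4) -> List[List[Record]]:
    # part I: cut F_2 into memory-sized chunks and sort each chunk into a run
    records = list(f2)
    return [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]

def merge_runs(runs: List[List[Record]]) -> List[Record]:
    # part II's selection step: keep the head record of every run and
    # repeatedly take the one with the smallest K (heapq.merge maintains
    # exactly one candidate per run, mirroring "one block from each run")
    return list(heapq.merge(*runs, key=lambda rec: rec[0]))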
12
Ex. Expected Number of Good Days
– let n = 32, # records in F = 1,789,570
– P(collision) = 2^-n = E
– P(no error) = (1 - E)^records(F)
– N(good days) = 1/(1 - P(no error)) ≈ 2,430 snapshot comparisons (see the check below)
– if the file size increases, then increase n
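A quick check of the arithmetic in Python, assuming the figures on the slide; the variable names are illustrative.

n = 32                      # compressed size in bits
records_in_F = 1_789_570    # records in the snapshot F
E = 2.0 ** -n               # per-record collision probability
p_no_error = (1.0 - E) ** records_in_F
good_days = 1.0 / (1.0 - p_no_error)
# prints a figure in the low thousands, the order of magnitude quoted on the slide
print(f"expected good snapshot comparisons: {good_days:,.0f}")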
13
Extending Ad Hoc Join Algorithms
– |F|: # of blocks in a file; |M|: # of blocks in memory
– Sort Merge Join I: |F_1| + 5 * |F_2| IO
– Sort Merge Join II: |F_1| + 4 * |F_2| IO
– Partitioned Hash Join: |F_1| + 3 * |F_2| IO
14
Compression Technique
– reduce record size => reduce IO
– lossy compression:
  – higher compression
  – different uncompressed values may be mapped to the same compressed value
– compress an object of b bits into n bits, b > n (see the sketch below)
  – 2^b / 2^n values map to each compressed value
  – P(collision) = ((2^b / 2^n) - 1) / 2^b ≈ 2^-n = E
  – P(no error) = (1 - E)^records(F)
  – N(good days) = (1 - P(no error)) * Σ_{i≥1} i * P(no error)^(i-1) = 1/(1 - P(no error))
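A minimal sketch of such a lossy compressor, assuming the B field is available as bytes; compress_b and the choice of SHA-256 are illustrative, since any well-mixed n-bit code with collision probability about 2^-n would do.

import hashlib

def compress_b(b: bytes, n_bits: int = 32) -> int:
    # lossily compress the B field into an n-bit value; distinct B values
    # collide with probability about 2**-n_bits
    digest = hashlib.sha256(b).digest()
    return int.from_bytes(digest, "big") & ((1 << n_bits) - 1)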
15
Outer Join with Compression
– |f_1| + 3*|F_2| + |f_2| IO (lowercase f denotes a compressed file)
– sort F_2 into runs: f_2runs
– r_1 = f_1.pop()
– r_2 = f_2runs.pop(); f_2sort.put(r_2.K, compress(r_2.B))
– while((r_1 != null) ∨ (r_2 != null))
  – if((r_1 == null) ∨ (r_1.K > r_2.K))  /* insert */
    – F_out.put(insert, r_2.K, r_2.B)
    – r_2 = f_2runs.pop(); f_2sort.put(r_2.K, compress(r_2.B))
  – else if((r_2 == null) ∨ (r_1.K < r_2.K))  /* delete */
    – F_out.put(delete, r_1.K)
    – r_1 = f_1.pop()
  – else if(r_1.K == r_2.K)
    – if(r_1.b != compress(r_2.B))  /* update */
      – F_out.put(update, r_2.K, r_2.B)
    – r_1 = f_1.pop()
    – r_2 = f_2runs.pop(); f_2sort.put(r_2.K, compress(r_2.B))
16
Outer Join with Compression
– |f_1| + |F_2| + 3*|f_2sort| + U + I IO (U, I: blocks fetched for updates and inserts)
– compress F_2 while creating the sorted runs f_2run; each run record keeps a pointer p to the full record
– r_1 = f_1.pop()
– r_2 = f_2run.pop()  /* p -> pointer to the full record */
– f_2sort.put(r_2.K, r_2.b, r_2.p)  /* b = compressed B */
– while((r_1 != null) ∨ (r_2 != null))
  – if((r_1 == null) ∨ (r_1.K > r_2.K))  /* insert */
    – F_out.put(insert, r_2.K, getTuple(r_2.p).B)
    – r_2 = f_2run.pop(); f_2sort.put(r_2.K, r_2.b, r_2.p)
  – else if((r_2 == null) ∨ (r_1.K < r_2.K))  /* delete */
    – F_out.put(delete, r_1.K)
    – r_1 = f_1.pop()
  – else if(r_1.K == r_2.K)
    – if(r_1.b != r_2.b)  /* update */
      – F_out.put(update, r_2.K, getTuple(r_2.p).B)
    – r_1 = f_1.pop()
    – r_2 = f_2run.pop(); f_2sort.put(r_2.K, r_2.b, r_2.p)
17
Partitioned Hash Outer Join
– compression: |f_1| + 3*|F_2| + |f_2sort| IO
– compression with record pointers, as in the previous variant (full B fetched for inserts and updates): |f_1| + |F_2| + 2*|f_2sort| + I + U IO
(see the sketch below)
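A record-level sketch of a partitioned hash outer join, assuming both snapshots fit as in-memory lists of (K, B) tuples that stand in for the on-disk partition files; partitioned_hash_diff and n_parts are illustrative names, and compression is omitted for brevity.

from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (K, B)

def partitioned_hash_diff(f1: List[Record], f2: List[Record],
                          n_parts: int = 4) -> List[Tuple[str, str, str]]:
    # partition both snapshots on hash(K), so matching keys land in the same partition
    parts1: List[List[Record]] = [[] for _ in range(n_parts)]
    parts2: List[List[Record]] = [[] for _ in range(n_parts)]
    for k, b in f1:
        parts1[hash(k) % n_parts].append((k, b))
    for k, b in f2:
        parts2[hash(k) % n_parts].append((k, b))

    out: List[Tuple[str, str, str]] = []
    for p1, p2 in zip(parts1, parts2):
        table: Dict[str, str] = dict(p1)          # build phase on the old partition
        for k, b in p2:                           # probe with the new partition
            old_b = table.pop(k, None)
            if old_b is None:
                out.append(("insert", k, b))
            elif old_b != b:
                out.append(("update", k, b))
        for k, b in table.items():                # unmatched old records were deleted
            out.append(("delete", k, b))
    return out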
18
Window Algorithm
– reads each snapshot only once
– assumes records do not move much
– divides memory into four parts:
  – input buffers 1 and 2
  – aging buffers 1 and 2
– |f_1| + |F_2| IO
– distance between snapshots:
  – sum of the absolute values of the distances for matching records
  – normalized by the maximum distance for the snapshots
(a record-level sketch follows the buffer-layout figure below)
19
[Figure: memory layout of the Window Algorithm — blocks are transferred from DISK into input buffers 1 and 2; unmatched records move into aging buffers 1 and 2, which are organized as hashed buckets with head and tail pointers so the oldest entries age out first]
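A record-level sketch of the window idea, assuming the snapshots are streams of (K, B) tuples; window_diff, the window parameter, and the use of ordered dictionaries in place of the block-based hashed aging buckets are simplifications, not the paper's data structures. Records that do not match right away wait in a bounded aging buffer; old records that age out become deletes and new ones become inserts, which is also why a record that moves farther than the window can end up reported as a useless delete-insert pair.

import itertools
from collections import OrderedDict
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (K, B); keys are distinct within a snapshot

def window_diff(f1: Iterable[Record], f2: Iterable[Record],
                window: int = 1024) -> List[Tuple[str, str, str]]:
    aging1: "OrderedDict[str, str]" = OrderedDict()  # unmatched records from F_1 (old)
    aging2: "OrderedDict[str, str]" = OrderedDict()  # unmatched records from F_2 (new)
    out: List[Tuple[str, str, str]] = []

    def emit_update_if_changed(k: str, old_b: str, new_b: str) -> None:
        if old_b != new_b:
            out.append(("update", k, new_b))

    # read the two snapshots in lockstep, one record from each per step
    for r1, r2 in itertools.zip_longest(f1, f2):
        if r1 is not None:                       # record arrives from the old snapshot
            k1, b1 = r1
            if k1 in aging2:                     # matches a waiting new record
                emit_update_if_changed(k1, b1, aging2.pop(k1))
            else:
                aging1[k1] = b1
        if r2 is not None:                       # record arrives from the new snapshot
            k2, b2 = r2
            if k2 in aging1:                     # matches a waiting old record
                emit_update_if_changed(k2, aging1.pop(k2), b2)
            else:
                aging2[k2] = b2
        while len(aging1) > window:              # aged-out old record => delete
            k, b = aging1.popitem(last=False)
            out.append(("delete", k, b))
        while len(aging2) > window:              # aged-out new record => insert
            k, b = aging2.popitem(last=False)
            out.append(("insert", k, b))

    for k, b in aging1.items():                  # leftovers after both scans finish
        out.append(("delete", k, b))
    for k, b in aging2.items():
        out.append(("insert", k, b))
    return out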