Efficient Snapshot Differential Algorithms for Data Warehousing
Wilburt Juan Labio, Hector Garcia-Molina
Purpose
– detect modifications at the information source
– extract the modifications from the information source
– the information source is not sophisticated (e.g., a legacy system)
[figure: modifications flow from the information source's local DB to the data warehouse]
Problem Outline
– a file contains distinct records {R_1, R_2, …, R_n}, where R_i = <K_i, B_i> (key K_i, body B_i)
– given two snapshots F_1 and F_2, produce the file of modifications F_out
– possible modifications generated: insert, delete, update
Difficulties
– the physical location of a record may differ between snapshots
– wasted messages:
  – useless delete-insert pairs introduce waste
    – delete then insert the same record: should do nothing
    – delete then insert a record with the same K but a new B: should be an update
  – useless insert-delete pairs introduce a correctness problem
    – insert then delete the same record: should do nothing
    – insert then delete a record with the same K: should be an update
Example: with physical movement
F_{t-1}: <K_i,B_i> <K_{i+1},B_{i+1}> <K_{i+2},B_{i+2}> <K_{i+3},B_{i+3}> <K_{i+4},B_{i+4}> <K_{i+5},B_{i+5}> <K_{i+6},B_{i+6}>
F_t: <K_i,B_i> <K_{i+3},B_{i+3}> <K_{i+2},B_{i+2}> <K_{i+4},B'_{i+4}> <K_{i+5},B_{i+5}> <K_j,B_j> <K_{i+6},B_{i+6}>
Modifications made: delete <K_{i+1},B_{i+1}>; update <K_{i+4},B'_{i+4}>; insert <K_j,B_j>
Example: wasted messages
F_{t-1}: <K_i,B_i> <K_{i+1},B_{i+1}> <K_{i+2},B_{i+2}> <K_{i+3},B_{i+3}> <K_{i+4},B_{i+4}> <K_{i+5},B_{i+5}> <K_{i+6},B_{i+6}> <K_{i+7},B_{i+7}>
F_t: <K_{i+3},B_{i+3}> <K_{i+2},B_{i+2}> <K_{i+4},B'_{i+4}> <K_{i+6},B_{i+6}> <K_j,B_j> <K_{i+5},B'_{i+5}> <K_i,B_i>
Record <K_i,B_i> merely moved to the end of the snapshot, yet a naive algorithm may report it as a useless insert-delete pair or a useless delete-insert pair.
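The two examples can be reproduced in miniature. Below is a toy Python sketch (record values and function names are illustrative, not from the paper) contrasting a position-based differ, which emits wasted messages when records merely move, with a key-based differ:

```python
def diff_by_position(f1, f2):
    """Naive differ: reports a delete-insert pair wherever the
    records at the same physical position differ."""
    out = []
    for r1, r2 in zip(f1, f2):
        if r1 != r2:
            out.append(("delete", r1))
            out.append(("insert", r2))
    out += [("delete", r) for r in f1[len(f2):]]
    out += [("insert", r) for r in f2[len(f1):]]
    return out

def diff_by_key(f1, f2):
    """Key-based differ: matches records on K, so pure movement
    generates no modifications."""
    d1, d2 = dict(f1), dict(f2)
    out = [("delete", (k, b)) for k, b in d1.items() if k not in d2]
    out += [("insert", (k, b)) for k, b in d2.items() if k not in d1]
    out += [("update", (k, d2[k])) for k in d1 if k in d2 and d1[k] != d2[k]]
    return out

f1 = [("k1", "a"), ("k2", "b"), ("k3", "c")]
f2 = [("k2", "b"), ("k3", "c"), ("k1", "a")]   # k1 merely moved
print(len(diff_by_position(f1, f2)))  # 6 wasted messages
print(len(diff_by_key(f1, f2)))       # 0 modifications
```

The key-based version is what the join-style algorithms below compute; the position-based one shows why physical movement alone must not generate modifications.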
Related Solutions
– maintain a log of modifications
– add a timestamp to the base table
– joins
Proposed Solutions
– alter the extraction application? the code is worn (legacy, risky to change)
– parse the system log? need DBA privilege to get the log
– snapshot differential: "differ" the saved snapshot File_{t-1} against the current File_t to produce F_out for the data warehouse
[figure: File_{t-1} and File_t feed a differ, whose output F_out is sent to the data warehouse]
Algorithm Compromises
– related to joins, but cost less
– allow some useless delete-insert pairs
– change all insert-delete pairs to delete-insert pairs
  – batch and send all deletes first
– may miss a few modifications
– save a file for the next snapshot differential
Sort Merge Join I
– part I: sort the two input files
  – save the sorted file from the previous snapshot
  – use multi-way merge sort for F_2
    – creates runs: sequences of blocks with sorted records
    – merge runs until 1 run remains
    – 4 * |F_2| IO operations, assuming |F_2|^(1/2) < |M|
– part II: merge takes |F_1| + |F_2| IO operations
Sort Merge Join II
– reduce IO operations: reuse the sorted F_1 from the previous differential
– part I: produce sorted runs for F_2
  – sort F_2 into runs F_{2runs} (runs are sequences of blocks with sorted records)
  – 2 * |F_2| IO operations, assuming |F_2|^(1/2) < |M|
– part II: create the sorted F_2 while merging the files
  – merge takes |F_1| + 2 * |F_2| IO operations
  – read into memory 1 block from each run in F_{2runs}
  – select the record with the smallest K value
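Part II above, merging the runs of F_2 against the sorted F_1 while emitting F_out, can be sketched in Python. This is a minimal in-memory sketch under assumed (K, B) tuple records; the real algorithm reads one block per run and also writes the sorted F_2 back to disk:

```python
import heapq

def snapshot_differential(f1_sorted, f2_runs):
    """Merge sorted F1 against the sorted runs of F2, emitting
    (op, key, body) modifications in one pass."""
    f2_sorted = heapq.merge(*f2_runs)   # "select record with smallest K value"
    out = []
    it1, it2 = iter(f1_sorted), iter(f2_sorted)
    r1, r2 = next(it1, None), next(it2, None)
    while r1 is not None or r2 is not None:
        if r1 is None or (r2 is not None and r1[0] > r2[0]):
            out.append(("insert", r2[0], r2[1]))   # key only in F2
            r2 = next(it2, None)
        elif r2 is None or r1[0] < r2[0]:
            out.append(("delete", r1[0], r1[1]))   # key only in F1
            r1 = next(it1, None)
        else:                                      # same key in both snapshots
            if r1[1] != r2[1]:
                out.append(("update", r2[0], r2[1]))
            r1, r2 = next(it1, None), next(it2, None)
    return out

f1 = [("k1", "a"), ("k2", "b"), ("k4", "d")]
runs = [[("k1", "a"), ("k3", "c")], [("k2", "B")]]   # individually sorted runs of F2
print(snapshot_differential(f1, runs))
```

`heapq.merge` plays the role of picking the smallest K across the run buffers. A full implementation would additionally batch the deletes and send them first, as described under Algorithm Compromises.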
Ex. Expected Number of Good Days
– let n = 32, # records in F = 1,789,570
– P(collision) = E = 2^-n
– P(no error) = (1 - E)^records(F)
– N(good days) = 1/(1 - P(no error)) = 2,430 snapshot comparisons
– if the file size increases, then increase n
Extending ad hoc join Algorithms
– |F|: # of blocks in file F; |M|: # of blocks in memory
– Sort Merge Join I: |F_1| + 5 * |F_2| IO
– Sort Merge Join II: |F_1| + 4 * |F_2| IO
– Partitioned Hash Join: |F_1| + 3 * |F_2| IO
Compression Technique
– reduce record size => reduce IO
– lossy compression:
  – higher compression
  – different uncompressed values may be mapped to the same compressed value
– compress an object of b bits into n bits, b > n
  – 2^b / 2^n values map to each compressed value
  – P(collision) = ((2^b / 2^n) - 1) / 2^b ≈ 2^-n = E
  – P(no error) = (1 - E)^records(F)
  – N(good days) = (1 - P(no error)) * Σ_{i>=1} i * P(no error)^(i-1) = 1/(1 - P(no error))
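These formulas can be checked numerically for the example figures (n = 32 bits, 1,789,570 records); a quick Python check, where the exact day count depends on rounding:

```python
n = 32                      # compressed size in bits
records = 1_789_570         # records in F

E = 2.0 ** -n                           # P(collision) for one record
p_no_error = (1.0 - E) ** records       # no collision across the whole file
n_good_days = 1.0 / (1.0 - p_no_error)  # mean of a geometric distribution

print(round(n_good_days))               # on the order of 2,400 comparisons
```

The closed form 1/(1 - P(no error)) is the expectation of a geometric random variable: the differential stays error-free for N(good days) snapshot comparisons on average.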
Outer Join with Compression I
– |f_1| + 3*|F_2| + |f_2| IO
– sort F_2 into runs: f_2runs
– pseudocode (missing branches filled in; f_1 holds the compressed, sorted records <K, b> saved from the previous differential, and f_2sort is saved for the next one):
r_1 = f_1.pop()
r_2 = f_2runs.pop()
f_2sort.put(r_2.K, compress(r_2.B))
while ((r_1 != null) V (r_2 != null))
  if ((r_1 == null) V ((r_2 != null) Λ (r_1.K > r_2.K)))   /* insert */
    F_out.put(insert, r_2.K, r_2.B)
    r_2 = f_2runs.pop()
    if (r_2 != null) f_2sort.put(r_2.K, compress(r_2.B))
  else if ((r_2 == null) V (r_1.K < r_2.K))                /* delete */
    F_out.put(delete, r_1.K)
    r_1 = f_1.pop()
  else if (r_1.K == r_2.K)
    if (r_1.b != compress(r_2.B))                          /* update */
      F_out.put(update, r_2.K, r_2.B)
    r_1 = f_1.pop()
    r_2 = f_2runs.pop()
    if (r_2 != null) f_2sort.put(r_2.K, compress(r_2.B))
Outer Join with Compression II
– |f_1| + |F_2| + 3*|f_2sort| + U + I IO
– compress F_2 during creation of the sorted runs into f_2run; each run record keeps a pointer p to the full tuple in F_2
r_1 = f_1.pop()
r_2 = f_2run.pop()               /* r_2.p points to the full record */
f_2sort.put(r_2.K, r_2.b, r_2.p) /* r_2.b is the compressed B */
while ((r_1 != null) V (r_2 != null))
  if ((r_1 == null) V ((r_2 != null) Λ (r_1.K > r_2.K)))   /* insert */
    F_out.put(insert, r_2.K, getTuple(r_2.p).B)  /* fetch full B: the I IOs */
    r_2 = f_2run.pop()
    if (r_2 != null) f_2sort.put(r_2.K, r_2.b, r_2.p)
  else if ((r_2 == null) V (r_1.K < r_2.K))                /* delete */
    F_out.put(delete, r_1.K)
    r_1 = f_1.pop()
  else if (r_1.K == r_2.K)
    if (r_1.b != r_2.b)                                    /* update */
      F_out.put(update, r_2.K, getTuple(r_2.p).B)  /* fetch full B: the U IOs */
    r_1 = f_1.pop()
    r_2 = f_2run.pop()
    if (r_2 != null) f_2sort.put(r_2.K, r_2.b, r_2.p)
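A runnable rendering of the outer-join-with-compression idea, using CRC-32 as the lossy n = 32-bit compressor (the choice of compressor and all names here are assumptions; the pseudocode above is the authoritative shape):

```python
import zlib

def compress(body):
    """Lossy 32-bit compression of the body (any n-bit lossy
    compressor would do; CRC-32 is just a convenient stand-in)."""
    return zlib.crc32(body.encode())

def outer_join_diff(f1_compressed, f2_sorted):
    """f1_compressed: sorted [(K, b)] saved from the previous differential.
    f2_sorted: sorted [(K, B)] for the current snapshot.
    Returns (F_out, f2_compressed); f2_compressed is saved for next time."""
    f_out, f2_compressed = [], []
    i = j = 0
    while i < len(f1_compressed) or j < len(f2_sorted):
        r1 = f1_compressed[i] if i < len(f1_compressed) else None
        r2 = f2_sorted[j] if j < len(f2_sorted) else None
        if r1 is None or (r2 is not None and r1[0] > r2[0]):   # insert
            f_out.append(("insert", r2[0], r2[1]))
            f2_compressed.append((r2[0], compress(r2[1])))
            j += 1
        elif r2 is None or r1[0] < r2[0]:                      # delete
            f_out.append(("delete", r1[0]))
            i += 1
        else:                                                  # same key
            b2 = compress(r2[1])
            if r1[1] != b2:                                    # update
                f_out.append(("update", r2[0], r2[1]))
            f2_compressed.append((r2[0], b2))
            i += 1
            j += 1
    return f_out, f2_compressed

f1 = [("k1", compress("a")), ("k2", compress("b"))]
f2 = [("k2", "b2"), ("k3", "c")]
mods, f2c = outer_join_diff(f1, f2)
print(mods)   # delete k1, update k2, insert k3
```

Because the compression is lossy, an update whose new body happens to collide with the old compressed value would be missed; that is exactly the P(collision) analyzed on the compression slides.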
Partitioned Hash Outer Join
– compression I: |f_1| + 3*|F_2| + |f_2sort| IO
– compression II (pointers): |f_1| + |F_2| + 2*|f_2sort| + I + U IO
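The partitioned-hash variants replace sorting with hashing on K: both snapshots are split into partitions small enough to diff in memory, and each partition pair is differenced independently. A simplified in-memory Python sketch (partition count and names are illustrative; real versions write the partitions to disk):

```python
def partitioned_hash_diff(f1, f2, n_parts=4):
    """Partition both snapshots on hash(K), then diff each
    partition pair with an in-memory hash table."""
    parts1 = [dict() for _ in range(n_parts)]
    parts2 = [dict() for _ in range(n_parts)]
    for k, b in f1:
        parts1[hash(k) % n_parts][k] = b
    for k, b in f2:
        parts2[hash(k) % n_parts][k] = b   # same key lands in the same partition
    out = []
    for p1, p2 in zip(parts1, parts2):
        out += [("delete", k) for k in p1 if k not in p2]
        out += [("update", k, p2[k]) for k in p1 if k in p2 and p1[k] != p2[k]]
        out += [("insert", k, p2[k]) for k in p2 if k not in p1]
    return out

f1 = [("k1", "a"), ("k2", "b")]
f2 = [("k2", "b"), ("k3", "c")]
print(sorted(partitioned_hash_diff(f1, f2)))
```

Since the same hash function partitions both files, matching keys always fall in the same partition pair, so the partitions can be processed one at a time.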
Window Algorithm
– reads the snapshots only once
– assumes records do not move much
– divide memory into 4 parts:
  – input buffers 1 and 2
  – aging buffers 1 and 2
– |f_1| + |F_2| IO
– distance between snapshots:
  – sum of absolute values of distances for matching records
  – normalized by the maximum distance for the snapshots
[figure: memory divided into input buffers 1 and 2 and aging buffers 1 and 2; blocks are transferred from disk into the input buffers, and unmatched records are hashed into buckets in the aging buffers, each kept as a FIFO queue with head and tail pointers]
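A much-simplified, record-level Python sketch of the window idea (the real algorithm works block-at-a-time, hashes unmatched records into buckets, and ages whole blocks; the names and aging policy here are illustrative assumptions):

```python
from collections import OrderedDict

def window_diff(f1, f2, window=4):
    """Read both snapshots once in parallel; hold unmatched records in
    bounded 'aging buffers'. Records that age out (or remain at the end)
    become deletes/inserts; matches within the window cost nothing or
    become updates."""
    aging1, aging2 = OrderedDict(), OrderedDict()   # unmatched from F1 / F2
    out = []
    for t in range(max(len(f1), len(f2))):
        if t < len(f1):
            k, b = f1[t]
            if k in aging2:                  # matches an earlier F2 record
                b2 = aging2.pop(k)
                if b != b2:
                    out.append(("update", k, b2))
            else:
                aging1[k] = b
        if t < len(f2):
            k, b = f2[t]
            if k in aging1:                  # matches an earlier F1 record
                b1 = aging1.pop(k)
                if b1 != b:
                    out.append(("update", k, b))
            else:
                aging2[k] = b
        while len(aging1) > window:          # age out the oldest records
            old_k, _ = aging1.popitem(last=False)
            out.append(("delete", old_k))
        while len(aging2) > window:
            old_k, old_b = aging2.popitem(last=False)
            out.append(("insert", old_k, old_b))
    out += [("delete", k) for k in aging1]
    out += [("insert", k, b) for k, b in aging2.items()]
    return out

f1 = [("k1", "a"), ("k2", "b"), ("k3", "c")]
f2 = [("k2", "b"), ("k3", "c"), ("k4", "d")]
print(window_diff(f1, f2, window=2))
```

A record that moves farther than the window ages out of one buffer before its partner arrives and is reported as a delete-insert pair even though it was not modified; that is the compromise accepted under Algorithm Compromises in exchange for reading each snapshot only once.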