Efficient Snapshot Differential Algorithms for Data Warehousing. Wilburt Juan Labio, Hector Garcia-Molina.

Purpose
- detect modifications from the information source
- extract modifications from the information source
- the information source is not sophisticated (e.g., a legacy system)
- Diagram labels: Data Warehouse, Local DB, modifications

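Problem Outline
- a file contains distinct records {R_1, R_2, ..., R_n}, where each R_i is a <K, B> pair (K is the key, B is the rest of the record)
- given two snapshots F_1 and F_2
- produce the modifications, written to F_out
- possible modifications generated: insert, delete, update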

Difficulties
- the physical location of a record may differ between snapshots
- wasted messages:
  - useless delete-insert pairs introduce waste
    - delete then insert of the same record: should do nothing
    - delete then insert of a record with the same K: should be an update
  - useless insert-delete pairs introduce a correctness problem
    - insert then delete of the same record: should do nothing
    - insert then delete of a record with the same K: should be an update

Example: with physical movement
F_{t-1}: (K_i, B_i), (K_{i+1}, B_{i+1}), (K_{i+2}, B_{i+2}), (K_{i+3}, B_{i+3}), (K_{i+4}, B_{i+4}), (K_{i+5}, B_{i+5}), (K_{i+6}, B_{i+6})
F_t: (K_i, B_i), (K_{i+3}, B_{i+3}), (K_{i+2}, B_{i+2}), (K_{i+4}, B'_{i+4}), (K_{i+5}, B_{i+5}), (K_j, B_j), (K_{i+6}, B_{i+6})
Modifications made:
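As a minimal sketch of what a differential should report for an example like this one, the snippet below compares two snapshots by key rather than by position. It assumes both snapshots fit in memory, which the slides do not assume, and the string values `k0`, `b0`, and so on are hypothetical stand-ins for the K and B values in the example.

```python
def naive_diff(old, new):
    """Compare two snapshots given as lists of (K, B) pairs, ignoring position."""
    old_by_key = dict(old)
    new_by_key = dict(new)
    mods = []
    for key, body in old_by_key.items():
        if key not in new_by_key:
            mods.append(("delete", key, None))
        elif new_by_key[key] != body:
            mods.append(("update", key, new_by_key[key]))
    for key, body in new_by_key.items():
        if key not in old_by_key:
            mods.append(("insert", key, body))
    return mods


# Hypothetical stand-ins for the records in the example above.
f_prev = [("k0", "b0"), ("k1", "b1"), ("k2", "b2"), ("k3", "b3"),
          ("k4", "b4"), ("k5", "b5"), ("k6", "b6")]
f_curr = [("k0", "b0"), ("k3", "b3"), ("k2", "b2"), ("k4", "b4'"),
          ("k5", "b5"), ("kj", "bj"), ("k6", "b6")]

for message in naive_diff(f_prev, f_curr):
    print(message)
```

Records that only changed position (`k2` and `k3` here) produce no message, because the lookup is by key.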

Example: wasted messages
F_{t-1}: (K_i, B_i), (K_{i+1}, B_{i+1}), (K_{i+2}, B_{i+2}), (K_{i+3}, B_{i+3}), (K_{i+4}, B_{i+4}), (K_{i+5}, B_{i+5}), (K_{i+6}, B_{i+6}), (K_{i+7}, B_{i+7})
F_t: (K_{i+3}, B_{i+3}), (K_{i+2}, B_{i+2}), (K_{i+4}, B'_{i+4}), (K_{i+6}, B_{i+6}), (K_j, B_j), (K_{i+5}, B'_{i+5}), (K_i, B_i)
useless insert-delete or:
useless delete-insert or:

Related Solutions
- maintain a log of modifications
- add a timestamp to the base table
- joins

Proposed Solutions
- alter the extraction application; the code is worn
- parse the system log; needs DBA privilege to get the log
- snapshot differential
- Diagram labels: File_{t-1}, out, differ, data warehouse

Algorithm Compromises
- related to joins, but cost less
- allow some useless delete-insert pairs
- change all insert-delete pairs to delete-insert pairs
  - batch and send all deletes first
- may miss a few modifications
- save the file for the next snapshot differential
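The effect of batching deletes first can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: messages are modeled as hypothetical `(op, K, B)` tuples and the warehouse copy as a plain dictionary.

```python
def deletes_first(messages):
    """Stable reorder: every delete goes ahead of the inserts and updates."""
    deletes = [m for m in messages if m[0] == "delete"]
    others = [m for m in messages if m[0] != "delete"]
    return deletes + others


def apply_messages(warehouse, messages):
    """Apply (op, K, B) messages to a dict standing in for the warehouse copy."""
    result = dict(warehouse)
    for op, key, body in messages:
        if op == "delete":
            result.pop(key, None)
        else:  # insert or update
            result[key] = body
    return result


# A useless insert-delete pair for a record that did not actually change.
warehouse = {"k1": "b1"}
pair = [("insert", "k1", "b1"), ("delete", "k1", None)]

print(apply_messages(warehouse, pair))                 # record is lost
print(apply_messages(warehouse, deletes_first(pair)))  # record survives
```

Applied in the original order, the pair removes a record that still exists in the source; once the delete is moved ahead, the pair is merely wasteful.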

Sort Merge Join I
- part I: sort the two input files
  - save the sorted file from the previous snapshot
  - use multi-way merge sort for F_2
    - creates runs, which are sequences of blocks with sorted records
    - merge runs until 1 run remains
    - 4 * |F_2| IO operations, assuming |F_2|^{1/2} < |M|
- part II: merge
  - takes |F_1| + |F_2| IO operations
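The merge in part II can be sketched as an outer join over two key-sorted inputs. The sketch below keeps everything in Python lists, so it shows only the comparison logic and none of the block IO; the function name and the `(op, K, B)` message format are assumptions made for illustration.

```python
def merge_diff(sorted_old, sorted_new):
    """Outer-join two key-sorted lists of (K, B) pairs into modification messages."""
    out = []
    i = j = 0
    while i < len(sorted_old) or j < len(sorted_new):
        r1 = sorted_old[i] if i < len(sorted_old) else None
        r2 = sorted_new[j] if j < len(sorted_new) else None
        if r1 is None or (r2 is not None and r1[0] > r2[0]):
            # Key exists only in the new snapshot.
            out.append(("insert", r2[0], r2[1]))
            j += 1
        elif r2 is None or r1[0] < r2[0]:
            # Key exists only in the old snapshot.
            out.append(("delete", r1[0], None))
            i += 1
        else:
            # Same key in both snapshots: report only if the body changed.
            if r1[1] != r2[1]:
                out.append(("update", r2[0], r2[1]))
            i += 1
            j += 1
    return out


old = sorted([(4, "d"), (1, "a"), (2, "b")])
new = sorted([(3, "c"), (4, "D"), (1, "a")])
print(merge_diff(old, new))
```

Each input is scanned once, which matches the |F_1| + |F_2| cost quoted for the merge phase.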

Sort Merge Join II
- reduce IO operations
- reuse F_1 from the previous differential
- part I: produce sorted runs for F_2
  - sort F_2 into runs F_runs
    - creates runs, which are sequences of blocks with sorted records
    - 2 * |F_2| IO operations, assuming |F_2|^{1/2} < |M|
- part II: create sorted F_2 while merging files
  - merge takes |F_1| + 2 * |F_2| IO operations
    - read into memory 1 block from each run in F_runs
    - select the record with the smallest K value
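Run creation and the "smallest K first" merge can be sketched with the standard library. In-memory lists stand in for disk blocks here, and `run_size` is a hypothetical stand-in for the amount of memory |M|; `heapq.merge` keeps one head element per run, which mirrors reading one block from each run.

```python
import heapq


def make_runs(records, run_size):
    """Split records into runs of at most run_size and sort each run by key."""
    return [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]


def merged(runs):
    """Lazily yield records in key order, holding one head record per run."""
    return heapq.merge(*runs)


f2 = [(5, "e"), (1, "a"), (4, "d"), (2, "b"), (3, "c")]
runs = make_runs(f2, run_size=2)
print(runs)
print(list(merged(runs)))
```

Because the merged stream is produced lazily, it can be compared against the sorted F_1 record by record while the sorted F_2 is being written out for the next differential.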

Ex. Expected Number of Good Days
- let n = 32, number of records in F = 1,789,570
- P(collision) = 2^{-n} = E
- P(no error) = (1 - E)^{records(F)}
- N(good days) = 1/(1 - P(no error)) = 2,430 snapshot comparisons
- if the file size increases, then increase n
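The formula can be evaluated directly. The sketch below is a plain transcription of the expressions on this slide, with `log1p` and `expm1` used only to avoid rounding problems for very small E. With the slide's inputs it gives roughly 2,400 comparisons, which is close to, though not exactly, the 2,430 quoted above; the transcript does not show the intermediate values, so the source of the difference is not clear.

```python
import math


def expected_good_days(n_bits, num_records):
    """N(good days) = 1 / (1 - P(no error)), with P(no error) = (1 - E)^records."""
    e = 2.0 ** -n_bits                                 # E = 2^-n
    log_p_no_error = num_records * math.log1p(-e)      # log of (1 - E)^records
    p_error = -math.expm1(log_p_no_error)              # 1 - P(no error)
    return 1.0 / p_error


print(round(expected_good_days(32, 1_789_570)))
print(round(expected_good_days(40, 1_789_570)))
```

Raising n from 32 to 40 bits multiplies the expected number of error-free comparisons by about 256, which is the reason for increasing n as the file grows.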

Extending ad hoc join Algorithms
- |F|: number of blocks in the file
- |M|: number of blocks in memory
- Sort Merge Join I: |F_1| + 5 * |F_2| IO
- Sort Merge Join II: |F_1| + 4 * |F_2| IO
- Partitioned Hash Join: |F_1| + 3 * |F_2| IO
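For a quick comparison, the three cost formulas can be written as functions of the block counts. The breakdown inside the two sort-merge functions follows the per-phase costs given on the Sort Merge Join slides; the block count used in the example is hypothetical.

```python
def sort_merge_join_1(f1_blocks, f2_blocks):
    """Sort F_2 in full (4 * |F_2|), then merge (|F_1| + |F_2|)."""
    return 4 * f2_blocks + (f1_blocks + f2_blocks)


def sort_merge_join_2(f1_blocks, f2_blocks):
    """Create runs (2 * |F_2|), then merge while writing sorted F_2 (|F_1| + 2 * |F_2|)."""
    return 2 * f2_blocks + (f1_blocks + 2 * f2_blocks)


def partitioned_hash_join(f1_blocks, f2_blocks):
    """Cost as listed on the slide: |F_1| + 3 * |F_2|."""
    return f1_blocks + 3 * f2_blocks


blocks = 10_000  # hypothetical snapshot size in blocks, same for F_1 and F_2
for cost in (sort_merge_join_1, sort_merge_join_2, partitioned_hash_join):
    print(cost.__name__, cost(blocks, blocks))
```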

Compression Technique
- reduce record size => reduce IO
- lossy compression:
  - higher compression
  - different uncompressed values may be mapped into the same compressed value
- compress an object of b bits into n bits, b > n
- 2^b / 2^n values are mapped to each compressed value
- P(collision) = ((2^b / 2^n) - 1) / 2^b ≈ 2^{-n} = E
- P(no error) = (1 - E)^{records(F)}
- N(good days) = (1 - P(no error)) * Σ_{i >= 1} i * P(no error)^{i-1} = 1/(1 - P(no error))
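One way to obtain an n-bit lossy signature is to truncate a hash of the record body. The slides do not say which compression function is used, so the choice of BLAKE2b below is purely an assumption for illustration, as is the function name `compress`.

```python
import hashlib


def compress(body: bytes, n_bits: int = 32) -> int:
    """Map a record body of any length to an n-bit integer signature (lossy)."""
    n_bytes = (n_bits + 7) // 8
    digest = hashlib.blake2b(body, digest_size=n_bytes).digest()
    value = int.from_bytes(digest, "big")
    # Drop the surplus low bits so exactly n_bits remain.
    return value >> (n_bytes * 8 - n_bits)


bodies = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon"]
print([compress(b, 32) for b in bodies])

# With only 2 bits there are 4 possible signatures, so 5 bodies must collide.
print(len({compress(b, 2) for b in bodies}))
```

Two bodies with the same signature are treated as equal, so a changed record whose new body collides with the old one would go unreported; that is the error whose probability E the slide estimates.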

Outer Join with Compression
- |f_1| + 3 * |F_2| + |f_2| IO
- sort F_2 into runs: f_2runs

r_1 = f_1.pop()
r_2 = f_2runs.pop()
f_2sort.put(r_2.K, compress(r_2.B))
while ((r_1 != null) ∨ (r_2 != null))
    if ((r_1 == null) ∨ (r_1.K > r_2.K))        /* insert */
        F_out.put(insert, r_2.K, r_2.B)
        r_2 = f_2runs.pop()
        f_2sort.put(r_2.K, compress(r_2.B))
    else if ((r_2 == null) ∨ (r_1.K < r_2.K))   /* delete */
    else if (r_1.K == r_2.K)
        if (r_1.b != compress(r_2.B))           /* update */
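A runnable sketch of this outer join is given below. Several parts are assumptions rather than the slide's content: the bodies of the delete and update branches are elided on the slide and are filled in here in the obvious way, the null checks are made explicit, CRC-32 stands in for the unspecified `compress` function, and Python lists stand in for the files f_1, f_2runs (already merged into key order), f_2sort and F_out.

```python
import zlib


def compress(body: str) -> int:
    """Hypothetical 32-bit lossy signature of B (CRC-32 as a stand-in)."""
    return zlib.crc32(body.encode("utf-8"))


def compressed_diff(f1, f2_sorted):
    """f1: key-sorted (K, b) pairs from the previous run, where b = compress(B).
    f2_sorted: key-sorted (K, B) pairs of the current snapshot.
    Returns (f_out, f2sort); f2sort serves as f1 for the next comparison."""
    f_out, f2sort = [], []
    i = j = 0
    while i < len(f1) or j < len(f2_sorted):
        r1 = f1[i] if i < len(f1) else None
        r2 = f2_sorted[j] if j < len(f2_sorted) else None
        if r1 is None or (r2 is not None and r1[0] > r2[0]):    # insert
            f_out.append(("insert", r2[0], r2[1]))
            f2sort.append((r2[0], compress(r2[1])))
            j += 1
        elif r2 is None or r1[0] < r2[0]:                       # delete
            f_out.append(("delete", r1[0], None))
            i += 1
        else:                                                   # same key
            b2 = compress(r2[1])
            if r1[1] != b2:                                     # update
                f_out.append(("update", r2[0], r2[1]))
            f2sort.append((r2[0], b2))
            i += 1
            j += 1
    return f_out, f2sort


day1 = [(1, "a"), (2, "b"), (3, "c")]
day2 = [(1, "a"), (3, "C"), (4, "d")]
out1, f1 = compressed_diff([], day1)   # first run: everything is an insert
out2, f1 = compressed_diff(f1, day2)
print(out2)
```

Only the compressed file is kept between runs, so the previous snapshot costs |f_1| rather than |F_1| to read.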

Outer Join with Compression
- |f_1| + |F_2| + 3 * |f_2sort| + U + I IO
- compress F_2 during creation of sorted runs into f_2run

r_1 = f_1.pop()
r_2 = f_2run.pop()                              /* p -> record */
f_2sort.put(r_2.K, r_2.b, r_2.p)                /* b compressed B */
while ((r_1 != null) ∨ (r_2 != null))
    if ((r_1 == null) ∨ (r_1.K > r_2.K))        /* insert */
        F_out.put(insert, r_2.K, getTuple(r_2.p).B)
        r_2 = f_2run.pop()
        f_2sort.put(r_2.K, r_2.b, r_2.p)        /* what about p */
    else if ((r_2 == null) ∨ (r_1.K < r_2.K))   /* delete */
    else if (r_1.K == r_2.K)
        if (r_1.b != r_2.b)                     /* update */

Partitioned Hash Outer Join
- compression
  - |f_1| + 3 * |F_2| + |f_2sort| IO
- compression
  - |f_1| + |F_2| + 2 * |f_2sort| + I + U IO

Window Algorithm
- reads the snapshots only once
- assumes records do not move much
- divide memory into four parts:
  - input buffers 1 and 2
  - aging buffers 1 and 2
- |f_1| + |F_2| IO
- distance between snapshots
  - sum of the absolute values of the distances, for matching records
  - normalize by the maximum distance for the snapshots
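The single-pass idea can be sketched as follows. This is a simplified illustration under assumptions, not the algorithm as published: the input buffers are modeled as list slices, each aging buffer is an `OrderedDict` that evicts its oldest record when it exceeds a hypothetical `aging_capacity`, and records evicted unmatched are reported as deletes (from snapshot 1) or inserts (from snapshot 2).

```python
from collections import OrderedDict


def window_diff(f1, f2, block_size=2, aging_capacity=4):
    """Single pass over two snapshots of (K, B) pairs with bounded aging buffers."""
    aging1, aging2 = OrderedDict(), OrderedDict()   # K -> B, oldest first
    out = []

    def age(buffer, key, body, op):
        # Park an unmatched record; if the buffer overflows, give up on the oldest.
        buffer[key] = body
        if len(buffer) > aging_capacity:
            old_key, old_body = buffer.popitem(last=False)
            out.append((op, old_key, old_body if op == "insert" else None))

    n_records = max(len(f1), len(f2))
    for start in range(0, n_records, block_size):
        for key, body in f1[start:start + block_size]:      # input buffer 1
            if key in aging2:
                new_body = aging2.pop(key)
                if new_body != body:
                    out.append(("update", key, new_body))
            else:
                age(aging1, key, body, "delete")
        for key, body in f2[start:start + block_size]:      # input buffer 2
            if key in aging1:
                old_body = aging1.pop(key)
                if old_body != body:
                    out.append(("update", key, body))
            else:
                age(aging2, key, body, "insert")
    # Whatever is still unmatched at the end is a real delete or insert.
    out.extend(("delete", k, None) for k in aging1)
    out.extend(("insert", k, b) for k, b in aging2.items())
    return out


old = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
moved = [(4, "d"), (3, "c"), (2, "b"), (1, "a")]
print(window_diff(old, moved, block_size=1, aging_capacity=4))  # no messages
print(window_diff(old, moved, block_size=1, aging_capacity=1))  # useless pairs
```

With a large enough aging buffer the reordered snapshot yields no messages; with a capacity of 1, records that moved too far are evicted before they find their match and come out as useless pairs, which is why the ordering of deletes and inserts matters.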

Diagram labels: Memory (Input Buffer 1, Input Buffer 2, Aging Buffer 1, Aging Buffer 2; Buckets; Head and Tail for each aging buffer), Disk, transfer of blocks.