
Presentation transcript:

1 Something completely different… Platon would like to host (in Århus) a BIT meeting in the autumn – SOA or MDM – Fine by me, but what do you all say? Platon invites everyone to – 7-8 June – Special price for BIT members: 2995 kr. – Registration via Jørgen Davidsen,

Lineage Tracing in Data Warehouses Torben Bach Pedersen Based on work by Yingwei Cui and Jennifer Widom Stanford University Database Group

3 Motivation: Data Warehousing. Sources 1-3: Students, Enrollments, Courses. Data Warehouse view "Lucrative Fields": Databases $8800K; Theory $320K; Networks $800K. Wow?! Databases $8800K

4 Lineage Tracer. Oh, I see... Sources 1-3: Courses, Enrollments, Students. Data Warehouse view "Lucrative Fields": Database 1800; Theory $320K; Networks $800K. Traced tuple: Databases $8800K. Enrollments: CS145 Ted; CS154 Joe; CS244 Bob; CS145 Ann; CS245 Jane; … Students: Bob MS $1K; Jane Web $5K; Ann BS $1K; Joe BS $1K; Ted Web $5K; … Courses: CS145 Databases; CS154 Theory; CS244 Networks; CS245 Databases

5 The Data Lineage Problem Data warehouses integrate data from multiple sources for analysis and mining Data lineage: given data item o in the warehouse, which data items in the sources were used to derive o? Sometimes called “drill-through” in industry – “Drill-through” often limited

6 Challenges Warehouse of relational views over relational sources – What is a good formal definition for lineage? – How do we trace data lineage for arbitrary views? – How do we make it efficient? Warehouse defined by graph of data transformations – No fixed, well-defined relational operators – Large transformation sequences and graphs

7 Outline of Talk Part 1: Lineage tracing for relational views Part 2: Lineage tracing for general data transformations

8 Part 1: Lineage Tracing for Relational Views Declarative definition of data lineage Lineage tracing algorithms Using auxiliary views for efficient lineage tracing Experimental results (small sample)

9 Views We Consider Relational algebra Arbitrary use of aggregation Set semantics Also in thesis – Set operators – Bag semantics (diagram: view V over base relations R, S, T)

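10 Simple Lineage Example $V = \alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S))$, i.e. select Y, sum(Z) from R natural join S where X>Z group by Y. R(X, Y) = {(3, a), (8, b)}; S(Y, Z) = {(a, 2), (b, 0), (b, 9), (b, 6)}. After the join and the selection: {(3, a, 2), (8, b, 0), (8, b, 6)}. V(Y, sum) = {(a, 2), (b, 6)}. Highlighted lineage of (b, 6): (8, b) in R and (b, 0), (b, 6) in S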
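The view on this slide can be recomputed in a few lines. The sketch below assumes the slide's tables read as R(X, Y) = {(3, a), (8, b)} and S(Y, Z) = {(a, 2), (b, 0), (b, 9), (b, 6)}; that reading is inferred from the results shown on the slide and is an assumption. The helper names are illustrative only.

```python
# Base relations under set semantics (assumed reading of the slide's tables).
R = {(3, "a"), (8, "b")}                      # schema (X, Y)
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}  # schema (Y, Z)

def natural_join(r, s):
    """Join R(X, Y) with S(Y, Z) on the shared attribute Y."""
    return {(x, y, z) for (x, y) in r for (y2, z) in s if y == y2}

def select_x_gt_z(rows):
    """Selection X > Z over rows with schema (X, Y, Z)."""
    return {(x, y, z) for (x, y, z) in rows if x > z}

def sum_z_by_y(rows):
    """Aggregation: group by Y and sum Z."""
    totals = {}
    for _, y, z in rows:
        totals[y] = totals.get(y, 0) + z
    return set(totals.items())

V = sum_z_by_y(select_x_gt_z(natural_join(R, S)))
print(sorted(V))  # [('a', 2), ('b', 6)]
```

The tuple (8, b, 9) is produced by the join but removed by the selection, which is why it plays no part in the lineage of (b, 6).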

11 Lineage for Relational Operators Unary relational operators (definition took a long time) Lineage of t according to op is the maximal subset $R^* \subseteq R$ such that (1) $op(R^*) = \{t\}$ - output of $R^*$ through op is t (2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$ - op used on $t^*$ is nonempty

12 Lineage for Relational Operators Example 1 – the two conditions ensure that only tuples contributing to t are included in lineage. Operator: $\sigma_{X>Z}$ over R(X, Y, Z); highlighted tuple: (8, b, 6). Lineage of t according to op is the maximal subset $R^* \subseteq R$ such that (1) $op(R^*) = \{t\}$ (2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$

13 Lineage for Relational Operators Example 2 – the ”maximal” requirement ensures that the (8, b, 0) tuple is included in the lineage of (b, 6). Operator: $\alpha_{Y,\,\mathrm{sum}(Z)}$ over R(X, Y, Z), giving (a, 2) and (b, 6); highlighted lineage of (b, 6): (8, b, 0) and (8, b, 6). Lineage of t according to op is the maximal subset $R^* \subseteq R$ such that (1) $op(R^*) = \{t\}$ (2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$
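The definition on these slides can be checked directly by brute force. The sketch below is not a tracing algorithm from the talk; it simply enumerates subsets from largest to smallest and returns the first one that satisfies conditions (1) and (2). The input rows are an assumed reading of the slide's example, and the exponential search is only workable on toy data.

```python
from itertools import combinations

def lineage(op, relation, t):
    """Largest R* within `relation` with op(R*) == {t} and op({t*}) non-empty
    for every t* in R*. Relies on the maximal subset being unique."""
    rows = sorted(relation)
    for size in range(len(rows), 0, -1):          # try largest subsets first
        for subset in combinations(rows, size):
            if op(set(subset)) == {t} and all(op({row}) for row in subset):
                return set(subset)
    return set()

def select_x_gt_z(rows):
    return {(x, y, z) for (x, y, z) in rows if x > z}

def sum_z_by_y(rows):
    totals = {}
    for _, y, z in rows:
        totals[y] = totals.get(y, 0) + z
    return set(totals.items())

# Assumed example data with schema (X, Y, Z).
joined = {(3, "a", 2), (8, "b", 0), (8, "b", 9), (8, "b", 6)}
selected = select_x_gt_z(joined)

# Example 1: condition (2) rejects (8, b, 9), which the selection drops.
lineage_select = lineage(select_x_gt_z, joined, (8, "b", 6))
# Example 2: maximality pulls in (8, b, 0) although it adds 0 to the sum.
lineage_sum = lineage(sum_z_by_y, selected, ("b", 6))
```

Without condition (2), the pair {(8, b, 6), (8, b, 9)} would also pass condition (1) for the selection; without maximality, {(8, b, 6)} alone would be an acceptable lineage for the sum.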

14 Lineage for Relational Operators N-ary relational operators ( , , ) – lineage unique. Lineage of t according to op is the maximal subsets $R_i^* \subseteq R_i$ for i = 1..n such that (1) $op(R_1^*, \ldots, R_n^*) = \{t\}$ (2) $\forall t_i^* \in R_i^*: op(R_1, \ldots, \{t_i^*\}, \ldots, R_n) \neq \emptyset$

15 Lineage for Relational Views Lineage of a tuple set is the union of the lineage of each tuple in the set Lineage for views is defined recursively => naive, but inefficient, algorithm (need to recompute/store all intermediate results) Lineage of t is $\langle R_1^*, R_2^* \rangle$ (diagram: V = op1(U), U = op2(R1, R2); t in V is traced to U* and then to R1*, R2*)

16 Lineage Tracing Convert view into segmented normal form (SPJ+agg): each segment is an aggregation over a select-project-join of E_1 … E_n Generate one tracing query for each segment Apply tracing queries recursively – # non-top aggregations + 1 Proof: lineage result is unaffected by normalization and segment-level tracing

17 Tracing Query for One Segment $V = \alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S))$ with V(Y, sum) = {(a, 2), (b, 6)}. Tracing (b, 6): $TQ = \mathrm{Split}_{R,S}(\sigma_{Y=b}(\sigma_{X>Z}(R \bowtie S)))$, giving $\langle R^*=\{(8,b)\},\ S^*=\{(b,0),(b,6)\} \rangle$. Split = ”unjoin” – project over R+S schemas
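A tracing query for this segment can be sketched as follows: re-run the join and the selection, keep only rows that fall into the traced group, then "unjoin" by projecting onto each source schema. The tables are an assumed reading of the slide, and `trace_segment` is an illustrative name rather than anything defined in the talk.

```python
# Assumed reading of the slide's base tables.
R = {(3, "a"), (8, "b")}                      # schema (X, Y)
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}  # schema (Y, Z)

def trace_segment(r, s, t):
    """Lineage of view tuple t = (Y, sum) for the view
    'select Y, sum(Z) from R natural join S where X > Z group by Y'."""
    y_out, _ = t
    joined = {(x, y, z) for (x, y) in r for (y2, z) in s if y == y2}
    contributing = {(x, y, z) for (x, y, z) in joined if x > z and y == y_out}
    r_star = {(x, y) for (x, y, _) in contributing}   # Split: project onto R's schema
    s_star = {(y, z) for (_, y, z) in contributing}   # Split: project onto S's schema
    return r_star, s_star

r_star, s_star = trace_segment(R, S, ("b", 6))
print(sorted(r_star), sorted(s_star))  # [(8, 'b')] [('b', 0), ('b', 6)]
```

The result matches the slide: R* = {(8, b)} and S* = {(b, 0), (b, 6)}.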

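18 Recursive Tracing Procedure $V = \alpha_{W,\,\mathrm{avg}(\mathrm{sum})}(\alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S)) \bowtie T)$, with intermediate view U(Y, sum) = {(a, 2), (b, 6)} and V(W, avg) = {(p, 4), (q, 6)}. Tracing (q, 6): $TQ_1 = \mathrm{Split}_{U,T}(\sigma_{W=q}(U \bowtie T))$ yields (b, 6) in U and (b, q) in T; $TQ_2 = \mathrm{Split}_{R,S}(\sigma_{Y=b}(\sigma_{X>Z}(R \bowtie S)))$ yields (8, b) in R and (b, 0), (b, 6) in S. Result: $\langle R^*=\{(8,b)\},\ S^*=\{(b,0),(b,6)\},\ T^*=\{(b,q)\} \rangle$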
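The recursion can be sketched as two chained tracing queries: the top segment traces a view tuple to the intermediate view U and to T, and each traced U tuple is then traced through the bottom segment to R and S. The table contents, in particular T = {(a, p), (b, p), (b, q)}, are assumptions chosen to reproduce the averages shown on the slide, and all function names are illustrative.

```python
R = {(3, "a"), (8, "b")}                      # schema (X, Y), assumed
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}  # schema (Y, Z), assumed
T = {("a", "p"), ("b", "p"), ("b", "q")}      # schema (Y, W), assumed

def join_select(r, s):
    """Rows of R joined with S on Y that satisfy X > Z."""
    return {(x, y, z) for (x, y) in r for (y2, z) in s if y == y2 and x > z}

def view_u(r, s):
    """Intermediate view U(Y, sum): sum of Z per Y."""
    totals = {}
    for _, y, z in join_select(r, s):
        totals[y] = totals.get(y, 0) + z
    return set(totals.items())

def trace_top(u, t_rel, out):
    """Top segment: U and T tuples behind the view tuple out = (W, avg)."""
    w_out, _ = out
    hits = {(y, total, w) for (y, total) in u for (y2, w) in t_rel
            if y == y2 and w == w_out}
    return {(y, total) for (y, total, _) in hits}, {(y, w) for (y, _, w) in hits}

def trace_bottom(r, s, u_tuple):
    """Bottom segment: R and S tuples behind one U tuple."""
    y_out, _ = u_tuple
    hits = {row for row in join_select(r, s) if row[1] == y_out}
    return {(x, y) for (x, y, _) in hits}, {(y, z) for (_, y, z) in hits}

def trace(r, s, t_rel, out):
    u_star, t_star = trace_top(view_u(r, s), t_rel, out)
    r_star, s_star = set(), set()
    for u_tuple in u_star:                 # recurse into the lower segment
        r_part, s_part = trace_bottom(r, s, u_tuple)
        r_star |= r_part
        s_star |= s_part
    return r_star, s_star, t_star

result = trace(R, S, T, ("q", 6))
```

Note that the sketch recomputes U at tracing time, which is the cost that the stored intermediate results discussed later in the talk are meant to avoid.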

19 Making It Efficient Source accesses are usually expensive or impossible Need some intermediate results for lineage tracing Store auxiliary views at the warehouse – Reduce or eliminate source accesses – Reduce recomputation of intermediate results

20 Aux View Example

21 Aux View Example

22 Auxiliary Views There are many possible auxiliary views For single-segment views – Identified 10 possible auxiliary view schemes – Studied performance tradeoffs For arbitrary views – Hard optimization problem – Exhaustive and heuristic algorithms – Performance study (diagram: view segment over R_1 … R_n)

23 Single Segment Schemes Store nothing (NO) Store Base Tables (BT) Store Lineage Views (LV) Store Split Lineage Tables (SLT) Store Partial Base Tables (PBT) Store Base Table Projections (BP) Store Lineage View Projections (LP) Self-maintainable variations: LV-S, SLT-S, PBT-S

24 Auxiliary Views: Performance Tradeoffs + Always improve lineage tracing – Must be maintained when sources change + Can also help with maintenance of original user views

25 Auxiliary View Schemes for Single-Segment Views Parameters: - 3-way SPJ view - sources: 10MB each - disk: 1Mbps - network: 50kbps operations - q/u ratio = 4 Measurements: - tracing time - maintenance time

26 Auxiliary View Selection Algorithms for Arbitrary Views

27 Part 2: Transformation Graphs Lineage definition Tracing algorithms Combining transformations for lineage tracing Experimental results (tiny sample) (diagram: Sources 1-3 feed the Data Warehouse through transformations T1 … T6)

28 Transformation Example. Transformations T1 … T7, with step labels: selection, “join”, split, pivot, projection, selection, projection. Order (id, cust, date, prod-list): 1 A 2/8/99 1(10),2(10); 2 C 4/5/99 2(5),3(10); 3 D 6/1/99 1(20),2(10); 4 B 8/6/99 1(10),3(5); 5 D 10/8/99 1(5),3(10); 6 B 12/1/99 2(10),3(10). Product (id, name, price, valid): 1 imac /1/98-; 2 vaio /1/98-9/1/99; 2 vaio /2/99-; 3 palm 500 2/1/98-7/1/98; 3 palm 400 7/2/98-9/1/99; 3 palm 300 9/2/99-. SalesJump (name, avg3, Q4): palm 2K 6K. Highlighted lineage of the palm tuple: Product rows 3 palm 400 7/2/98-9/1/99 and 3 palm 300 9/2/99-; Order rows 2 C 4/5/99 2(5),3(10); 4 B 8/6/99 1(10),3(5); 5 D 10/8/99 1(5),3(10); 6 B 12/1/99 2(10),3(10)

29 Lineage for General Transformations A transformation T can be an arbitrary program: select … from … where …; main(int argc, char** argv) {…}; sed “s/string1/string2/g” …; ? – One extreme: relational operators – Another extreme: we know nothing about T – Middle ground: based on transformation properties

30 Transformation Properties Transformation classes Additional properties – Transformation subclasses – Schema information – Provided inverse or tracing procedure

31 i  I  I: T(I) =  T({i}) dispatcher T*(o) = {i | o  T({i})} Transformation Classes Produces 0 or more output items per input item Applying T on complete set is the same as on each input item separately

32 Dispatcher Example T1 splits Order into O1. Order (id, cust, date, prod-list): 1 A 2/8/99 1(10),2(10); 2 C 4/5/99 2(5),3(10); 3 D 6/1/99 1(20),2(10); 4 B 8/6/99 1(10),3(5); 5 D 10/8/99 1(5),3(10); 6 B 12/1/99 2(10),3(10). O1 (id, cust, date, pid, quant): 1 A 2/8/ A 2/8/ : : : 5 D 10/8/ D 10/8/ B 12/1/ B 12/1/ Highlighted: Order row 5 D 10/8/99 1(5),3(10) and the two O1 rows derived from it. A non-relational operator, but a typical dispatcher
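Dispatcher tracing follows directly from the definition T*(o) = {i | o ∈ T({i})}: apply T to each input item on its own and keep the items whose output contains o. The sketch below models T1 as a function that splits the prod-list column; the parsing of entries such as `1(5)` into (pid, quant) is an assumption about the slide's notation, and the function names are illustrative.

```python
import re

ORDER = {
    (1, "A", "2/8/99", "1(10),2(10)"),
    (2, "C", "4/5/99", "2(5),3(10)"),
    (3, "D", "6/1/99", "1(20),2(10)"),
    (4, "B", "8/6/99", "1(10),3(5)"),
    (5, "D", "10/8/99", "1(5),3(10)"),
    (6, "B", "12/1/99", "2(10),3(10)"),
}

def split_order(orders):
    """T1: emit one (id, cust, date, pid, quant) row per prod-list entry."""
    out = set()
    for oid, cust, date, prod_list in orders:
        for pid, quant in re.findall(r"(\d+)\((\d+)\)", prod_list):
            out.add((oid, cust, date, int(pid), int(quant)))
    return out

def trace_dispatcher(transform, inputs, o):
    """T*(o) = {i | o in T({i})}: one call of T per input item."""
    return {i for i in inputs if o in transform({i})}

o = (5, "D", "10/8/99", 3, 10)
print(trace_dispatcher(split_order, ORDER, o))
# {(5, 'D', '10/8/99', '1(5),3(10)')}
```

The loop makes one call to the transformation per input item, which corresponds to the O(|I|) cost listed for dispatchers later in the talk.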

33 i  I  I: T(I) =  T({i}) dispatcher  I and T(I)={o 1 …o n }:  unique partition I 1..I n of I s.t. T(I k ) = {o k } aggregator T*(o k ) = I k T*(o) = {i | o  T({i})} Transformation Classes

34 Aggregator Example T4 computes quarterly sales per product by ”pivoting” O3 into O4. O3 (oid, name, date, price, quant): 1 imac 2/8/; vaio 2/8/; vaio 4/5/; imac 6/1/; vaio 6/1/; imac 8/6/; palm 8/6/; imac 10/8/; palm 10/8/; vaio 12/1/; palm 12/1/. O4 (name, Q1, Q2, Q3, Q4): imac 12K 24K 12K 6K; vaio 24K 12K 24K 18K; palm 0K 4K 2K 6K. Highlighted: O4 tuple palm 0K 4K 2K 6K and the O3 palm rows dated 4/5/, 8/6/, 10/8/, 12/1/. Again, a non-relational operator, but a typical aggregator

35 i  I  I: T(I) =  T({i}) dispatcher  I and T(I)={o 1 …o n }:  unique partition I 1..I n of I s.t. T(I k ) = {o k } aggregatorblack-box All others T*(o k ) = I k T*(o) = I T*(o) = {i | o  T({i})} Transformation Classes

36 Transformation Classes Most transformations are dispatchers, aggregators, or their compositions A transformation can be both dispatcher and aggregator – Proof: Lineage definitions are then equivalent Transformations can be relational operators – Lineage definitions same as relational definitions

37 Transformation Properties Transformation classes Additional properties – Transformation subclasses – Schema information – Provided inverse or tracing procedure

38 Transformation Subclasses Permit more efficient lineage tracing Filter is a special dispatcher – Each input data item produces itself or nothing Context-free aggregator – Whether two input data items are in the same partition is independent of other items Key-preserving aggregator – Any subset of an input partition always produces the same output key

39 Tracing Example: Aggregators Consider T(I) = {o 1 …o n } Tracing the lineage of o for aggregator – Partition input I into I 1 …I n such that T(I k ) = {o k } – Return I k such that T(I k ) = {o} Tracing the lineage of o for context-free aggregator – Partition input I into I 1 …I n such that |T(I k )| = 1 – Return I k such that T(I k ) = {o} – 2^n versus n^2 running time !
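For a context-free aggregator, whether two items share a partition does not depend on the rest of the input, so partitions can be built by pairwise tests instead of by enumerating subsets. The sketch below is one way to do that, written in the spirit of the slide rather than as the talk's actual procedure; the per-name total is a hypothetical aggregator and the rows are made-up data.

```python
def total_by_name(rows):
    """Hypothetical aggregator: total amount per product name."""
    totals = {}
    for _, name, amount in rows:
        totals[name] = totals.get(name, 0) + amount
    return set(totals.items())

def trace_context_free(transform, inputs, o):
    """Group items by pairwise tests |T({a, b})| == 1, then return the
    group that produces o. Uses at most O(|I|^2) calls of T."""
    partitions = []
    for item in inputs:
        for part in partitions:
            # Context-free: one representative decides membership.
            if len(transform({part[0], item})) == 1:
                part.append(item)
                break
        else:
            partitions.append([item])
    for part in partitions:
        if transform(set(part)) == {o}:
            return set(part)
    return set()

rows = {(1, "imac", 2), (2, "vaio", 3), (3, "imac", 4), (4, "palm", 5)}
lineage_imac = trace_context_free(total_by_name, rows, ("imac", 6))
```

A general aggregator gives no such shortcut: the partition has to be found by trying subsets of the input, which is where the 2^n figure on the slide comes from.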

40 Schema Information Input schema $A=(A_1 \ldots A_n)$ and key $A_{key}$ Output schema $B=(B_1 \ldots B_n)$ and key $B_{key}$ Schema mappings: $f(A) \to B$ and $A \leftarrow g(B)$ Transformations with special schema mappings – Forward key-map: $f(A) \to B_{key}$ – Backward key-map: $A_{key} \leftarrow g(B)$ – Backward total-map: $A \leftarrow g(B)$ – More efficient tracing for these

41 Tracing Example: Forward Key-Maps T4 maps O3 to O4. O4 (name, Q1, Q2, Q3, Q4): imac 12K 24K 12K 6K; vaio 24K 12K 24K 18K; palm 0K 4K 2K 6K. O3 (oid, name, date, price, quant): 1 imac 2/8/; vaio 2/8/; vaio 4/5/; imac 6/1/; vaio 6/1/; imac 8/6/; palm 8/6/; imac 10/8/; palm 10/8/; vaio 12/1/; palm 12/1/. Highlighted: the O3 palm rows dated 4/5/, 8/6/, 12/1/, 10/8/. ”name” is carried over as key - trace of ”palm” is easy: the O3 tuples with name = ’palm’
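With a forward key-map, tracing needs no call to the transformation at all: the lineage of an output item is the set of input items whose mapped attributes equal the output key. The sketch below uses hypothetical O3-style rows (oid, name, date), since the slide's price and quantity values did not survive, and an illustrative function name.

```python
def trace_forward_key_map(inputs, forward_key, output_key):
    """Lineage via a forward key-map f: items i with f(i) equal to the
    key of the traced output. One scan of the input, zero calls of T."""
    return {i for i in inputs if forward_key(i) == output_key}

# Hypothetical rows with schema (oid, name, date).
O3 = {
    (1, "imac", "2/8/99"),
    (1, "vaio", "2/8/99"),
    (2, "vaio", "4/5/99"),
    (2, "palm", "4/5/99"),
    (4, "palm", "8/6/99"),
}

# "name" is carried over as the output key, so f simply picks the name.
palm_lineage = trace_forward_key_map(O3, lambda row: row[1], "palm")
print(sorted(palm_lineage))  # [(2, 'palm', '4/5/99'), (4, 'palm', '8/6/99')]
```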

42 Other Properties Transformation author provides Tracing Procedure Provided Transformation Inverse $T^{-1}$ – If T is an aggregator, then o’s lineage is $T^{-1}(\{o\})$ – Not always true for dispatchers or black-boxes

43 Tracing Procedures. Columns: Property, Procedure, # T Calls, # Accesses. dispatcher: TraceDS, O(|I|); aggregator: TraceAG, O(2^|I|); black-box: return I, 0, O(|I|); filter: return o, 0, 0; context-free aggr.: TraceCF, O(|I|^2); key-preserving aggr.: TraceKP, O(|I|); forward key-map: TraceFM, 0, O(|I|); backward key-map: TraceBM, 0, O(|I|); backward total-map: TraceTM, 0, 0; provided tracing-proc.: provided, ?, ?

44 Property Hierarchy ANY provided tracing-proc. or inverse black-box aggregator dispatcher context-free aggr. key-preserving aggr. filter forward key-map backward key-map total-map

45 Summary of Our Approach for One Transformation Properties are provided with transformations – Specified by the transformation author – Declared in prepackaged transformations – Derived using recent techniques [Clio01, RB01] The best property of a transformation is selected based on the hierarchy The tracing procedure using the best property is called at tracing time Indexing techniques

46 Transformation Sequences Naive algorithm traces backwards one transformation at a time – Need all intermediate results – Poor performance for long sequences (diagram: I → T1 → T2 → T3 → … → Tn → O)
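The naive algorithm can be sketched as a forward pass that materialises every intermediate result, followed by a backward pass that traces one transformation at a time. The sketch assumes each step comes with its own tracer; here both steps are hypothetical dispatchers over strings, and none of the names come from the talk.

```python
def trace_dispatcher(transform, inputs, o):
    """Dispatcher tracer: inputs whose individual output contains o."""
    return {i for i in inputs if o in transform({i})}

def trace_sequence(steps, source, o):
    """steps: list of (transform, tracer) pairs applied in order."""
    # Forward pass: keep all intermediate results.
    intermediates = [source]
    for transform, _ in steps:
        intermediates.append(transform(intermediates[-1]))
    # Backward pass: trace step n against its input, then step n-1, ...
    targets = {o}
    for (transform, tracer), inputs in zip(reversed(steps),
                                           reversed(intermediates[:-1])):
        targets = set().union(*(tracer(transform, inputs, t) for t in targets))
    return targets

def split_words(lines):
    return {word for line in lines for word in line.split()}

def upper_case(words):
    return {word.upper() for word in words}

steps = [(split_words, trace_dispatcher), (upper_case, trace_dispatcher)]
source = {"x y", "y z"}
lineage_of_y = trace_sequence(steps, source, "Y")
print(sorted(lineage_of_y))  # ['x y', 'y z']
```

Every intermediate set is held in memory and every step is traced separately, which illustrates why long sequences perform poorly under this scheme.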

47 Transformation Sequences Combine transformations and trace as one – Reduces number of intermediate results – By combining judiciously: reduces tracing cost, doesn’t lose accuracy (diagram: I → T1 → T2 → T3 → … → Tn → O becomes I → T’ → Tn → O)