2
Lineage Tracing in Data Warehouses Yingwei Cui Stanford University Database Group
3
2 Motivation: Data Warehousing [Figure: Source 1, Source 2, and Source 3 (Students, Enrollments, Courses) feed a Data Warehouse holding the view Lucrative Fields: Databases $8800K, Theory $320K, Networks $800K. The analyst looks at "Databases $8800K" and reacts: Wow?!]
4
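3 [Figure: a Lineage Tracer connects the Data Warehouse to Source 1, Source 2, and Source 3 (Courses, Enrollments, Students). The warehouse view Lucrative Fields holds Databases $8800K, Theory $320K, Networks $800K, and the tuple "Databases $8800K" is traced. Enrollments: (CS145, Ted), (CS154, Joe), (CS244, Bob), (CS145, Ann), (CS245, Jane), … Students: (Bob, MS, $1K), (Jane, Web, $5K), (Ann, BS, $1K), (Joe, BS, $1K), (Ted, Web, $5K), … Courses: (CS145, Databases), (CS154, Theory), (CS244, Networks), (CS245, Databases). Analyst: Oh, I see...]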
5
4 The Data Lineage Problem Data warehouses integrate data from multiple sources for analysis and mining Data lineage: given data item o in the warehouse, which data items in the sources were used to derive o? Sometimes called “drill-through” in industry
6
5 Challenges Warehouse of relational views over relational sources – What is a good formal definition for lineage? – How do we trace data lineage for arbitrary views? – How do we make it efficient? Warehouse defined by graph of data transformations – No fixed, well-defined relational operators – Large transformation sequences and graphs
7
6 Contributions Thesis contributions – Basics of lineage tracing for relational views [TODS’00] – Lineage tracing system prototype [ICDE’00 demo] – Performance study and optimizations [ICDE’00, DMDW’00] – Lineage tracing for general data transformations [VLDB’01] – View update for deletions using data lineage [TechReport’01] Other contributions (joint with others) – Data warehousing performance issue [VLDB’00] – Data management for wireless networks [Infocom’98, Globecom’97]
8
7 Outline of Talk Part 1: Lineage tracing for relational views Part 2: Lineage tracing for general data transformations Part 3: View update for deletions using data lineage (time permitting)
9
8 Part 1: Lineage Tracing for Relational Views Declarative definition of data lineage Lineage tracing algorithms Using auxiliary views for efficient lineage tracing Experimental results (small sample)
10
9 Views We Consider Relational algebra Arbitrary use of aggregation Set semantics Also in thesis – Set operators – Bag semantics [Figure: view V defined over base tables R, S, T]
11
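10 Simple Lineage Example $V = \alpha_{Y,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S))$ with R(X,Y) = {(3,a), (8,b)} and S(Y,Z) = {(a,2), (b,0), (b,9), (b,6)}. Intermediate results: $T = R \bowtie S$ = {(3,a,2), (8,b,0), (8,b,9), (8,b,6)} and $U = \sigma_{X>Z}(T)$ = {(3,a,2), (8,b,0), (8,b,6)}. Result: V(Y,sum) = {(a,2), (b,6)}. Lineage of (b,6): {(8,b,0), (8,b,6)} in U and T, {(8,b)} in R, {(b,0), (b,6)} in S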
12
11 Lineage for Relational Operators Unary relational operators: for $op(R)$ and an output tuple $t$, the lineage of $t$ according to $op$ is the maximal subset $R^* \subseteq R$ such that (1) $op(R^*) = \{t\}$ and (2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$
13
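12 Lineage for Relational Operators Example 1: $\sigma_{X>Z}$ over R(X,Y,Z) = {(3,a,2), (8,b,0), (8,b,9), (8,b,6)} gives {(3,a,2), (8,b,0), (8,b,6)}. Lineage of $t$ according to $op$ is the maximal subset $R^* \subseteq R$ such that (1) $op(R^*) = \{t\}$ and (2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$. The lineage of t = (8,b,6) is {(8,b,6)}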
14
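13 Lineage for Relational Operators Example 2: $\alpha_{Y,\mathrm{sum}(Z)}$ over R(X,Y,Z) = {(3,a,2), (8,b,0), (8,b,6)} gives {(a,2), (b,6)}. Lineage of $t$ according to $op$ is the maximal subset $R^* \subseteq R$ such that (1) $op(R^*) = \{t\}$ and (2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$. The lineage of t = (b,6) is {(8,b,0), (8,b,6)}: because the subset must be maximal, (8,b,0) is included even though it adds 0 to the sum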
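The two examples can be checked by brute force. The following is a minimal sketch, not the tracing algorithm used in the talk: it enumerates subsets from largest to smallest and returns the first one satisfying conditions (1) and (2). The function names `select_x_gt_z`, `sum_z_by_y`, and `lineage` are illustrative assumptions, and the search is exponential in the input size.

```python
from itertools import combinations

def select_x_gt_z(rows):
    """sigma_{X>Z}: keep rows (x, y, z) with x > z."""
    return {r for r in rows if r[0] > r[2]}

def sum_z_by_y(rows):
    """alpha_{Y,sum(Z)}: group rows (x, y, z) by y and sum z."""
    totals = {}
    for _, y, z in rows:
        totals[y] = totals.get(y, 0) + z
    return set(totals.items())

def lineage(op, relation, t):
    """Largest subset R* with op(R*) == {t} and op({t*}) non-empty for every t* in R*."""
    rows = sorted(relation)
    for size in range(len(rows), 0, -1):          # try the largest subsets first
        for subset in combinations(rows, size):
            if op(set(subset)) == {t} and all(op({r}) for r in subset):
                return set(subset)
    return set()

# Data from the slides: T is the join result, U the selection result.
T = {(3, "a", 2), (8, "b", 0), (8, "b", 9), (8, "b", 6)}
U = select_x_gt_z(T)

selection_lineage = lineage(select_x_gt_z, T, (8, "b", 6))
aggregation_lineage = lineage(sum_z_by_y, U, ("b", 6))
```

Condition (2) is what keeps (8,b,9) out of the selection lineage: the pair {(8,b,9), (8,b,6)} satisfies condition (1), but (8,b,9) alone produces nothing.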
15
14 Lineage for Relational Operators N-ary relational operators (e.g., $\bowtie$): the lineage of $t$ according to $op$ is the maximal subsets $R_i^* \subseteq R_i$ for $i = 1..n$ such that (1) $op(R_1^*, \ldots, R_n^*) = \{t\}$ and (2) $\forall t_i^* \in R_i^*: op(R_1, \ldots, \{t_i^*\}, \ldots, R_n) \neq \emptyset$ [Figure: op over R1 and R2, with the lineage subsets R1* and R2* marked]
16
15 Lineage for Relational Views Lineage of a tuple set is the union of the lineage of each tuple in the set Lineage for views is defined recursively [Figure: V is computed by op 1 from U, and U by op 2 from R1 and R2; tracing t in V gives U*, and tracing U* gives R1* and R2*] Lineage of t is $R_1^*$, $R_2^*$
17
16 Lineage Tracing Convert view into a segmented normal form ($E_1 \ldots E_n$) Generate one tracing query for each segment Apply tracing queries recursively – # non-top + 1 Lineage result is unaffected by normalization and segment-level tracing
18
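17 Tracing Query for One Segment $V = \alpha_{Y,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S))$ with R(X,Y) = {(3,a), (8,b)}, S(Y,Z) = {(a,2), (b,0), (b,9), (b,6)}, and V(Y,sum) = {(a,2), (b,6)}. To trace t = (b,6): $TQ = \mathrm{Split}_{R,S}(\sigma_{Y=b}(\sigma_{X>Z}(R \bowtie S)))$. The selected join tuples are (8,b,0) and (8,b,6), and Split returns R* = {(8,b)}, S* = {(b,0), (b,6)}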
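The tracing query for this segment can be sketched as follows. This is an illustrative in-memory version under the assumption that relations are Python sets of tuples; `join`, `view`, and `tracing_query` are hypothetical helper names, and Split is implemented as a projection of the selected join tuples back onto the schemas of R and S.

```python
# Base tables from the slide.
R = {(3, "a"), (8, "b")}                        # R(X, Y)
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}    # S(Y, Z)

def join(r, s):
    """Natural join of R(X, Y) and S(Y, Z) on Y."""
    return {(x, y, z) for (x, y) in r for (y2, z) in s if y == y2}

def view(r, s):
    """V = alpha_{Y,sum(Z)}(sigma_{X>Z}(R join S))."""
    totals = {}
    for x, y, z in join(r, s):
        if x > z:
            totals[y] = totals.get(y, 0) + z
    return set(totals.items())

def tracing_query(r, s, t):
    """TQ = Split_{R,S}(sigma_{Y=t.Y}(sigma_{X>Z}(R join S)))."""
    y_t, _ = t
    matches = {row for row in join(r, s) if row[0] > row[2] and row[1] == y_t}
    r_star = {(x, y) for x, y, _ in matches}    # project back onto R's schema
    s_star = {(y, z) for _, y, z in matches}    # project back onto S's schema
    return r_star, s_star

r_star, s_star = tracing_query(R, S, ("b", 6))
```

The query selects on the group-by value of the traced tuple rather than on its aggregate, which is why (b,0) appears in S* even though it does not change the sum.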
19
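18 Recursive Tracing Procedure $V = \alpha_{W,\mathrm{avg}(\mathrm{sum})}(\alpha_{Y,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S)) \bowtie T)$ with R(X,Y) = {(3,a), (8,b)}, S(Y,Z) = {(a,2), (b,0), (b,9), (b,6)}, T(Y,W) = {(a,p), (b,p), (b,q)}, intermediate U(Y,sum) = {(a,2), (b,6)}, and V(W,avg) = {(p,4), (q,6)}. To trace (q,6): step 1, $TQ = \mathrm{Split}_{U,T}(\sigma_{W=q}(U \bowtie T))$ gives U* = {(b,6)}, T* = {(b,q)}; step 2, $TQ = \mathrm{Split}_{R,S}(\sigma_{Y=b}(\sigma_{X>Z}(R \bowtie S)))$ gives R* = {(8,b)}, S* = {(b,0), (b,6)}. Result: R* = {(8,b)}, S* = {(b,0), (b,6)}, T* = {(b,q)}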
20
19 Making It Efficient Source accesses are usually expensive or impossible Need some intermediate results for lineage tracing Store auxiliary views at the warehouse – Reduce or eliminate source accesses – Reduce recomputation of intermediate results
21
20 Auxiliary Views There are many possible auxiliary views For single-segment views – Identified 10 possible auxiliary view schemes – Studied performance tradeoffs For arbitrary views – Hard optimization problem – Exhaustive and heuristic algorithms – Performance study [Figure: view over base tables $R_1 \ldots R_n$]
22
21 Auxiliary Views: Performance Tradeoffs + Always improve lineage tracing – Must be maintained when sources change + Can also help with maintenance of original user views
23
22 Auxiliary View Schemes for Single-Segment Views Parameters: - 3-way SPJ view - sources: 10MB each - disk: 1Mbps - network: 50kbps - 1000 operations - q/u ratio = 4 Measurements: - tracing time - maintenance time
24
23 Auxiliary View Selection Algorithms for Arbitrary Views
25
24 Part 2: Transformation Graphs Lineage definition Tracing algorithms Combining transformations for lineage tracing Experimental results (tiny sample) [Figure: Source 1, Source 2, and Source 3 feed the Data Warehouse through a graph of transformations T1 … T6]
26
25 Transformation Example [Figure: transformations T1 … T7 turn the sources Order and Product into the warehouse table SalesJump; the transformation labels are selection, “join”, split, pivot, projection, selection, projection. Order(id, cust, date, prod-list) = {(1, A, 2/8/99, 1(10),2(10)), (2, C, 4/5/99, 2(5),3(10)), (3, D, 6/1/99, 1(20),2(10)), (4, B, 8/6/99, 1(10),3(5)), (5, D, 10/8/99, 1(5),3(10)), (6, B, 12/1/99, 2(10),3(10))}. Product(id, name, price, valid) = {(1, imac, 1200, 10/1/98-), (2, vaio, 2400, 6/1/98-9/1/99), (2, vaio, 1800, 9/2/99-), (3, palm, 500, 2/1/98-7/1/98), (3, palm, 400, 7/2/98-9/1/99), (3, palm, 300, 9/2/99-)}. Traced item: palm (name, avg3, Q4) = (palm, 2K, 6K). Its lineage: Product rows (3, palm, 400, 7/2/98-9/1/99) and (3, palm, 300, 9/2/99-); Order rows 2, 4, 5, and 6]
27
26 Lineage for General Transformations A transformation T can be an arbitrary program, e.g. select … from … where …, main(int argc, char** argv) {…}, or sed “s/string1/string2/g” … What is its lineage? – One extreme: relational operators – Another extreme: we know nothing about T – Middle ground: based on transformation properties
28
27 Transformation Properties Transformation classes Additional properties – Transformation subclasses – Schema information – Provided inverse or tracing procedure
29
28 Transformation Classes dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$; lineage $T^*(o) = \{i \in I \mid o \in T(\{i\})\}$
30
29 Dispatcher Example [Figure: T1 splits each Order row into one O1 row per product-list entry. Order(id, cust, date, prod-list) = {(1, A, 2/8/99, 1(10),2(10)), (2, C, 4/5/99, 2(5),3(10)), (3, D, 6/1/99, 1(20),2(10)), (4, B, 8/6/99, 1(10),3(5)), (5, D, 10/8/99, 1(5),3(10)), (6, B, 12/1/99, 2(10),3(10))}. O1(id, cust, date, pid, quant) = {(1, A, 2/8/99, 1, 10), (1, A, 2/8/99, 2, 10), …, (5, D, 10/8/99, 1, 5), (5, D, 10/8/99, 3, 10), (6, B, 12/1/99, 2, 10), (6, B, 12/1/99, 3, 10)}. The lineage of (5, D, 10/8/99, 1, 5) and of (5, D, 10/8/99, 3, 10) is the Order row (5, D, 10/8/99, 1(5),3(10))]
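Dispatcher tracing follows directly from $T^*(o) = \{i \mid o \in T(\{i\})\}$: apply T to each input item on its own and keep the items whose output contains o. The sketch below assumes that T1 parses prod-list entries of the form pid(quant); the helper names `split_order`, `apply_dispatcher`, and `trace_dispatcher` are illustrative, not names from the talk.

```python
import re

ORDERS = {
    (1, "A", "2/8/99", "1(10),2(10)"),
    (2, "C", "4/5/99", "2(5),3(10)"),
    (3, "D", "6/1/99", "1(20),2(10)"),
    (4, "B", "8/6/99", "1(10),3(5)"),
    (5, "D", "10/8/99", "1(5),3(10)"),
    (6, "B", "12/1/99", "2(10),3(10)"),
}

def split_order(order):
    """Assumed behavior of T1: one (id, cust, date, pid, quant) row per prod-list entry."""
    oid, cust, date, prod_list = order
    return {(oid, cust, date, int(pid), int(quant))
            for pid, quant in re.findall(r"(\d+)\((\d+)\)", prod_list)}

def apply_dispatcher(transform, inputs):
    """A dispatcher's output is the union of its per-item outputs."""
    out = set()
    for item in inputs:
        out |= transform(item)
    return out

def trace_dispatcher(transform, inputs, o):
    """T*(o) = {i | o in T({i})}: one call to T per input item."""
    return {item for item in inputs if o in transform(item)}

O1 = apply_dispatcher(split_order, ORDERS)
traced = trace_dispatcher(split_order, ORDERS, (5, "D", "10/8/99", 3, 10))
```

The loop makes one call to T per input item, which is the linear cost listed for dispatchers in the tracing-procedure table.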
31
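30 Transformation Classes dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$; lineage $T^*(o) = \{i \in I \mid o \in T(\{i\})\}$ aggregator: $\forall I$ and $T(I) = \{o_1 \ldots o_n\}$: there is a unique partition $I_1..I_n$ of $I$ s.t. $T(I_k) = \{o_k\}$; lineage $T^*(o_k) = I_k$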
32
31 Aggregator Example [Figure: T4 pivots O3 into O4. O3(oid, name, date, price, quant) = {(1, imac, 2/8/99, 1200, 10), (1, vaio, 2/8/99, 2400, 10), (2, vaio, 4/5/99, 2400, 5), (2, palm, 4/5/99, 400, 10), (3, imac, 6/1/99, 1200, 20), (3, vaio, 6/1/99, 2400, 10), (4, imac, 8/6/99, 1200, 10), (4, palm, 8/6/99, 400, 5), (5, imac, 10/8/99, 1200, 5), (5, palm, 10/8/99, 300, 10), (6, vaio, 12/1/99, 1800, 10), (6, palm, 12/1/99, 300, 10)}. O4(name, Q1, Q2, Q3, Q4) = {(imac, 12K, 24K, 12K, 6K), (vaio, 24K, 12K, 24K, 18K), (palm, 0K, 4K, 2K, 6K)}. The lineage of (palm, 0K, 4K, 2K, 6K) is {(2, palm, 4/5/99, 400, 10), (4, palm, 8/6/99, 400, 5), (5, palm, 10/8/99, 300, 10), (6, palm, 12/1/99, 300, 10)}]
33
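32 Transformation Classes dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$; lineage $T^*(o) = \{i \in I \mid o \in T(\{i\})\}$ aggregator: $\forall I$ and $T(I) = \{o_1 \ldots o_n\}$: there is a unique partition $I_1..I_n$ of $I$ s.t. $T(I_k) = \{o_k\}$; lineage $T^*(o_k) = I_k$ black-box: all others; lineage $T^*(o) = I$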
34
33 Transformation Classes Most transformations are dispatchers, aggregators, or their compositions A transformation can be both dispatcher and aggregator – Lineage definitions are equivalent Transformations can be relational operators – Lineage definitions same as relational definitions
35
34 Transformation Properties Transformation classes Additional properties – Transformation subclasses – Schema information – Provided inverse or tracing procedure
36
35 Transformation Subclasses Permit more efficient lineage tracing Filter is a special dispatcher – Each input data item produces itself or nothing Context-free aggregator – Whether two input data items are in the same partition is independent of other items Key-preserving aggregator – Any subset of an input partition always produces the same output key
37
36 Tracing Example: Aggregators Consider $T(I) = \{o_1 \ldots o_n\}$ Tracing the lineage of o for aggregator – Partition input $I$ into $I_1 \ldots I_n$ such that $T(I_k) = \{o_k\}$ – Return $I_k$ such that $T(I_k) = \{o\}$ Tracing the lineage of o for context-free aggregator – Partition input $I$ into $I_1 \ldots I_n$ such that $|T(I_k)| = 1$ – Return $I_k$ such that $T(I_k) = \{o\}$
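The two procedures can be sketched as below. This is one possible reading of the slide, not the talk's TraceAG and TraceCF code: the general version searches subsets of I exhaustively, while the context-free version builds the partition with pairwise calls to T. The `pivot` function is an assumed stand-in for T4 (quarter taken from the month of an m/d/yy date), and the data is limited to the imac and palm rows of O3.

```python
from itertools import combinations

O3 = {
    (1, "imac", "2/8/99", 1200, 10), (3, "imac", "6/1/99", 1200, 20),
    (4, "imac", "8/6/99", 1200, 10), (5, "imac", "10/8/99", 1200, 5),
    (2, "palm", "4/5/99", 400, 10), (4, "palm", "8/6/99", 400, 5),
    (5, "palm", "10/8/99", 300, 10), (6, "palm", "12/1/99", 300, 10),
}

def pivot(rows):
    """Assumed T4: per product name, total price * quant for each quarter."""
    table = {}
    for _oid, name, date, price, quant in rows:
        quarter = (int(date.split("/")[0]) - 1) // 3
        table.setdefault(name, [0, 0, 0, 0])[quarter] += price * quant
    return {(name, *sales) for name, sales in table.items()}

def trace_aggregator(transform, inputs, o):
    """General aggregator: find I_k with T(I_k) = {o} and T(I - I_k) = T(I) - {o}."""
    items = set(inputs)
    full_output = transform(items)
    for size in range(1, len(items) + 1):
        for subset in combinations(sorted(items), size):
            part = set(subset)
            if transform(part) == {o} and transform(items - part) == full_output - {o}:
                return part
    return set()

def trace_context_free(transform, inputs, o):
    """Context-free aggregator: two items share a partition iff |T({a, b})| = 1."""
    partitions = []
    for item in inputs:
        for part in partitions:
            if len(transform({part[0], item})) == 1:   # compare with a representative
                part.append(item)
                break
        else:
            partitions.append([item])
    for part in partitions:
        if transform(set(part)) == {o}:
            return set(part)
    return set()

palm = ("palm", 0, 4000, 2000, 6000)
general = trace_aggregator(pivot, O3, palm)
context_free = trace_context_free(pivot, O3, palm)
```

Both calls return the four palm rows. The subset search is exponential in |I|, while the pairwise version needs at most a quadratic number of calls to T.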
38
37 Schema Information Input schema $A = (A_1 \ldots A_n)$ and key $A_{key}$ Output schema $B = (B_1 \ldots B_n)$ and key $B_{key}$ Schema mappings: $f(A) \rightarrow B$ and $A \leftarrow g(B)$ Transformations with special schema mappings – Forward key-map: $f(A) \rightarrow B_{key}$ – Backward key-map: $A_{key} \leftarrow g(B)$ – Backward total-map: $A \leftarrow g(B)$
39
38 Tracing Example: Forward Key-Maps [Figure: T4 maps O3(oid, name, date, price, quant) to O4(name, Q1, Q2, Q3, Q4) = {(imac, 12K, 24K, 12K, 6K), (vaio, 24K, 12K, 24K, 18K), (palm, 0K, 4K, 2K, 6K)}; the output key name is determined by the input attribute name. Tracing (palm, 0K, 4K, 2K, 6K) selects the O3 rows with name = palm: (2, palm, 4/5/99, 400, 10), (4, palm, 8/6/99, 400, 5), (5, palm, 10/8/99, 300, 10), (6, palm, 12/1/99, 300, 10)]
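With a forward key-map, tracing needs no calls to T at all: the lineage of o is the set of input items whose mapped value f(i) equals o's key. The sketch below assumes f extracts the name column; `trace_forward_keymap` and `build_functional_index` are hypothetical names, and the index variant is an illustration of precomputing f over the input.

```python
from collections import defaultdict

O3 = {
    (1, "imac", "2/8/99", 1200, 10), (1, "vaio", "2/8/99", 2400, 10),
    (2, "vaio", "4/5/99", 2400, 5), (2, "palm", "4/5/99", 400, 10),
    (4, "palm", "8/6/99", 400, 5), (5, "palm", "10/8/99", 300, 10),
    (6, "palm", "12/1/99", 300, 10),
}

def f(row):
    """Assumed forward key-map: the name column of O3 determines the key of O4."""
    return row[1]

def trace_forward_keymap(key_map, inputs, output_key):
    """Lineage of o = {i | f(i) = o.key}: one scan of the input, no calls to T."""
    return {item for item in inputs if key_map(item) == output_key}

def build_functional_index(key_map, inputs):
    """Precompute f over the input so later traces become dictionary lookups."""
    index = defaultdict(set)
    for item in inputs:
        index[key_map(item)].add(item)
    return index

scanned = trace_forward_keymap(f, O3, "palm")
indexed = build_functional_index(f, O3)["palm"]
```

The scan reads each input item once per trace, while the index pays that cost once at load time.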
40
39 Other Properties Provided Tracing Procedure Provided Transformation Inverse $T^{-1}$ – If T is an aggregator, then o’s lineage is $T^{-1}(\{o\})$ – Not always true for dispatchers or black-boxes
41
40 Tracing Procedures (columns: Property; Procedure; # T Calls; # Accesses)
dispatcher; TraceDS; O(|I|)
aggregator; TraceAG; O(2^|I|)
black-box; return I; 0; O(|I|)
filter; return o; 0; 0
context-free aggr.; TraceCF; O(|I|^2)
key-preserving aggr.; TraceKP; O(|I|)
forward key-map; TraceFM; 0; O(|I|)
backward key-map; TraceBM; 0; O(|I|)
backward total-map; TraceTM; 0; 0
provided tracing-proc.; provided; ?; ?
42
41 Property Hierarchy [Figure: hierarchy over the properties ANY, provided tracing-proc. or inverse, black-box, aggregator, dispatcher, context-free aggr., key-preserving aggr., filter, forward key-map, backward key-map, total-map]
43
42 Summary of Our Approach for One Transformation Properties are provided with transformations – Specified by the transformation author – Declared in prepackaged transformations – Derived using recent techniques [Clio01, RB01] The best property of a transformation is selected based on the hierarchy The tracing procedure using the best property is called at tracing time Indexing techniques
44
43 Transformation Sequences Naive algorithm traces backwards one transformation at a time – Need all intermediate results – Poor performance for long sequences [Figure: sequence T1, T2, T3, …, Tn from input I to output O]
45
44 Transformation Sequences Combine transformations and trace as one – Reduces number of intermediate results – By combining judiciously: reduces tracing cost, doesn’t lose accuracy [Figure: the sequence T1, T2, T3, …, Tn from I to O is replaced by a combined transformation T’ followed by Tn]
46
45 Overall Approach Algorithm for deriving properties of $T = T_1 \circ T_2$ from properties of $T_1$ and $T_2$ Coarse-grained cost metric for a tracing sequence based on transformation properties Greedy algorithm
47
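46 Example of Greedy Algorithm [Figure: the sequence T4, T5, T6, T7 annotated with properties and cost levels fkmap(2), btmap(1), filter(1), bkmap(2); candidate combinations are labeled blkbox(5), bkmap(2), fkmap(2), and btmap(1); the steps shown produce the combined transformations T4’ and T6’]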
48
47 Multiple-Input Example [Figure: T3, a dispatcher, combines O1 and O2 into O3. O1(id, cust, date, pid, quant) = {(1, A, 2/8/99, 1, 10), (1, A, 2/8/99, 2, 10), …, (5, D, 10/8/99, 1, 5), (5, D, 10/8/99, 3, 10), (6, B, 12/1/99, 2, 10), (6, B, 12/1/99, 3, 10)}. O2(id, name, price, valid) = {(1, imac, 1200, 10/1/98-), (2, vaio, 2400, 6/1/98-9/1/99), (2, vaio, 1800, 9/2/99-), (3, palm, 400, 7/2/98-9/1/99), (3, palm, 300, 9/2/99-)}. O3(oid, name, date, price, quant) = {(1, imac, 2/8/99, 1200, 10), (1, vaio, 2/8/99, 2400, 10), …, (5, imac, 10/8/99, 1200, 5), (5, palm, 10/8/99, 300, 10), (6, vaio, 12/1/99, 1800, 10), (6, palm, 12/1/99, 300, 10)}. The lineage of (5, palm, 10/8/99, 300, 10) is (5, D, 10/8/99, 3, 10) in O1 and (3, palm, 300, 9/2/99-) in O2]
49
48 Transformation Graphs Definition time – Specify properties of each transformation in graph [Figure: transformation graph from inputs I1 and I2 to output O]
50
49 Transformation Graphs Definition time – Specify properties of each transformation in graph – Consider each path as a transformation sequence – Combine transformations in each sequence [Figure: transformation graph from inputs I1 and I2 to output O]
51
50 Transformation Graphs Definition time Load time – Save intermediate results and build indices as desired Tracing time – Trace lineage through each sequence – Combine results [Figure: transformation graph from inputs I1 and I2 to output O]
52
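51 Example Revisited [Figure: the transformation graph from Order and Product through T1 … T7 to SalesJump, shown twice: first with the property labels bkmap, dispatcher, fkmap, filter, btmap, filter, dispatcher on the individual transformations, then after combining transformations, with the labels bkmap, fkmap, bkmap, dispatcher]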
53
52 Experimental Results Transformation graph based on a complex TPC-D query (Q12)
54
53 Part 3: View Update Using Data Lineage View update: translating updates on views to updates on base tables Obvious connection to lineage in case of view deletions Fresh approach with improved results
55
54 View Update Translations: Valid and Exact [Figure: view tuple t in V and the base tables R1, R2, …, Rn]
58
57 Our Algorithm Uses lineage to: – Find an exact translation whenever one exists (in linear time for many cases) – Find a “good” translation when no exact translation exists Fully automatic Previous approaches – Don’t always find an exact translation – Often require user input – Consider restricted classes of views
59
58 Related Work Schema-level lineage tracing (annotation-based) [BB99, HQGW93, RS98] Drill-down or drill-through on data cubes [Gray95] “Weak inverse” for transformations [WS97] Warehouse load resumption [LGMW00] Data cleaning [GFSS+01] View update [DB82, Mas84, Kel85]
60
59 Conclusions Data lineage problem in two scenarios – Warehouse defined by relational views – Warehouse defined by general data transformations For both scenarios, we provide: – Formal lineage definition – Lineage tracing algorithms – Optimization techniques – System prototype and performance study Use lineage for the view update problem
61
60 Some Open Problems Lineage of “missing” view or base tuples Deriving transformation properties Combining with annotation-based approach View update – Translation ambiguity – Base table constraints – Multiple interacting views
62
61
63
62 Lineage Applications On-line analytical processing (OLAP) Scientific databases Sensory and monitoring systems Data cleaning Warehouse resumption Data security View update
64
63 Lineage Tracing Convert view definition into a segmented normal form Generate one tracing query for each ASPJ segment Apply tracing queries top-down through view definition Lineage result is unaffected by normalization [Figure: view definition tree over R, S, T with intermediate views V and W, before and after normalization]
65
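64 Tracing Example $V = \alpha_{X,\mathrm{avg}(Z)}(\sigma_{K1<K2}(R \bowtie S))$ with R(K1,X,Y) = {(1,a,p), (2,b,q), (3,c,r)} and S(K2,X,Z) = {(1,b,2), (2,a,4), (3,b,3), (4,d,8), (5,b,9)}, so V(X,avg) = {(a,4), (b,6)}. To trace t = (b,6): $TQ = \mathrm{Split}_{R,S}(\sigma_{X=b}(\sigma_{K1<K2}(R \bowtie S)))$ returns R* = {(2,b,q)}, S* = {(3,b,3), (5,b,9)}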
66
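65 Split Lineage Tables (SLT) [Figure: for the view V(X,avg) = {(a,4), (b,6)} over R(K1,X,Y) and S(K2,X,Z), the auxiliary tables R’ and S’ store the output of Split, i.e. the tuples of R and S that appear in the lineage of some view tuple; for (b,6) these are (2,b,q) in R’ and (3,b,3), (5,b,9) in S’]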
67
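66 Base Table Projections (BP) [Figure: for the view V(X,avg) = {(a,4), (b,6)} over R(K1,X,Y) and S(K2,X,Z), the auxiliary views are projections of the base tables: R’(K1,X) = {(1,a), (2,b), (3,c)} and S’(K2,X) = {(1,b), (2,a), (3,b), (4,d), (5,b)}; tracing (b,6) selects (2,b) in R’ and (3,b), (5,b) in S’]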
68
67 Context-Free Aggregator Example [Figure: T4 pivots O3(oid, name, date, price, quant) into O4(name, Q1, Q2, Q3, Q4) = {(imac, 12K, 24K, 12K, 6K), (vaio, 24K, 12K, 24K, 18K), (palm, 0K, 4K, 2K, 6K)}. The O3 rows are partitioned by name into an imac group, a vaio group, and a palm group. The partition producing (palm, 0K, 4K, 2K, 6K) is {(2, palm, 4/5/99, 400, 10), (4, palm, 8/6/99, 400, 5), (5, palm, 10/8/99, 300, 10), (6, palm, 12/1/99, 300, 10)}]
69
68 Tracing Example 1 Tracing procedure for context-free aggregators – Partition input $I$ into $I_1 \ldots I_n$ such that $|T(I_k)| = 1$; – Return $I_k$ s.t. $T(I_k) = \{o\}$;
70
69 Lineage Equivalence Lineages of equivalent SPJ views are equivalent; this does not hold for ASPJ views [Figure: $\alpha_{Y,\mathrm{sum}(Z)}$ over U(X,Y,Z) = {(3,a,2), (8,b,0), (8,b,6)} gives {(a,2), (b,6)}; the lineage of (b,6) is {(8,b,0), (8,b,6)}]
71
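70 Lineage Equivalence Lineages of equivalent SPJ views are equivalent; this does not hold for ASPJ views [Figure: the same aggregation $\alpha_{Y,\mathrm{sum}(Z)}$ over U(X,Y,Z) = {(3,a,2), (8,b,0), (8,b,6)}, now with an added selection (labeled “B=0” on the slide); the result is still {(a,2), (b,6)}, but the lineage of (b,6) is {(8,b,6)} only]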
72
71 Non-Context-Free Example
73
72 Non-Context-Free Example
74
73 Indices Help! Conventional index – On input key $A_{key}$ for a backward key-map with $A_{key} \leftarrow g(B)$ Functional index – On $f(A)$ for a forward key-map with $f(A) \rightarrow B_{key}$ – On $T(A)$ for a dispatcher Lineage index – Mapping the key of each output data item o to the keys of input data items in o’s lineage
75
74 Experimental Results Tracing through an “SP” transformation over TPC-D table PartSupp
76
75 Tracing Through Sequences Tracing cost estimation – Divide properties into 5 groups – T’s cost level depends on the group of its best property – Associate a sequence with N[1..5] where N[k] records the number of transformations with cost level k Greedy algorithm – Pick a combination that results in the lowest N
77
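76 Lineage Annotation (Appendix) [Figure: input items 1, 2, 3, 4 pass through T1 and then T2; data items are annotated with sets of input identifiers, {1}, {1,2}, {2,4}, {4} after T1 and {1,2}, {1,2,4}, {4} after T2; T1* and T2* mark the corresponding tracing steps]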
78
77 Multiple Inputs and Outputs Define properties for each input and output Trace lineage for each input/output pair using single-input single-output tracing procedures [Figure: transformation T with inputs $I_1, I_2, \ldots, I_m$ and outputs $O_1, O_2, \ldots, O_n$]
79
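78 View Update Deletions on SPJ view $\rightarrow$ deletions on base database View tuple deletion request $-t$ and base tuple deletion $\Delta D$ $\Delta D$ is a translation for $-t$ if $\{t\} \subseteq \Delta V = V(D) - V(D - \Delta D)$ Side-effect $\Delta E = \Delta V - \{t\}$; $\Delta D$ is exact if $\Delta E = \emptyset$ [Figure: database D with view V; the view update $\Delta V$ yields V’, and the question is which base update $\Delta D$ yields D’]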
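These definitions translate into a few set operations. The sketch below is illustrative only: the two-table database, the view $\pi_{A,C}(R_1 \bowtie R_2)$, and all function names are assumptions made for the example, not part of the talk.

```python
# Hypothetical database: R1(A, B) and R2(B, C), with V = pi_{A,C}(R1 join R2).
DB = {
    "R1": {(1, "x"), (2, "x"), (2, "y")},
    "R2": {("x", 10), ("y", 20)},
}

def view(db):
    """V(D): join R1 and R2 on B and project onto (A, C)."""
    return {(a, c) for (a, b) in db["R1"] for (b2, c) in db["R2"] if b == b2}

def delete(db, delta):
    """D - delta_D: remove the listed base tuples from each table."""
    return {name: rows - delta.get(name, set()) for name, rows in db.items()}

def view_delta(db, delta):
    """delta_V = V(D) - V(D - delta_D)."""
    return view(db) - view(delete(db, delta))

def is_translation(db, delta, t):
    """delta_D is a translation for -t if t is among the deleted view tuples."""
    return t in view_delta(db, delta)

def side_effect(db, delta, t):
    """delta_E = delta_V - {t}: view tuples deleted in addition to t."""
    return view_delta(db, delta) - {t}

def is_exact(db, delta, t):
    return is_translation(db, delta, t) and not side_effect(db, delta, t)

t = (2, 10)
via_r1 = {"R1": {(2, "x")}}      # removes only (2, 10) from the view
via_r2 = {"R2": {("x", 10)}}     # also removes (1, 10)
unrelated = {"R1": {(2, "y")}}   # leaves (2, 10) in the view
```

Here `via_r1` is exact, `via_r2` is a translation with side-effect {(1, 10)}, and `unrelated` is not a translation at all.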
80
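79 Relationships to Data Lineage For an SPJ view $V = \pi_A(\sigma_C(R_1 \bowtie \ldots \bowtie R_n))$: $t_i \in R_i$ belongs to t’s lineage $R_i^*$ iff $\{t\} \subseteq \pi_A(\sigma_C(R_1 \bowtie \ldots \bowtie \{t_i\} \bowtie \ldots \bowtie R_n))$ $t_i$ belongs to t’s exclusive lineage $R_i^{**}$ iff $\{t\} = \pi_A(\sigma_C(R_1 \bowtie \ldots \bowtie \{t_i\} \bowtie \ldots \bowtie R_n))$ Intuition: $t_i$ contributes only to t [Figure: view tuple t over base tables R1, R2, …, Rn]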
81
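80 The Problem View update for deletions [Figure: database D with view V, and updated database D’ with updated view V’; the base update leading from D to D’ is unknown (?). View tuple t is defined by $\pi_A \sigma_C$ over R1, R2, …, Rn]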
82
81 Relationships to Data Lineage Deleting a lineage branch $R_i^*$ of t is always a translation for $-t$ [Figure: view tuple t defined by $\pi_A \sigma_C$ over R1, R2, …, Rn]
83
82 Relationships to Data Lineage Deleting a lineage branch $R_i^*$ of t is always a translation for $-t$ Deleting any subset of t’s exclusive lineage $D^{**}$ never causes a side-effect [Figure: view tuple t defined by $\pi_A \sigma_C$ over R1, R2, …, Rn]
84
83 Relationships to Data Lineage Deleting a lineage branch $R_i^*$ of t is always a translation for $-t$ Deleting any subset of t’s exclusive lineage $D^{**}$ never causes a side-effect If $-t$ has an exact translation $\Delta D$, it must also have an exact translation within t’s lineage [Figure: view tuple t defined by $\pi_A \sigma_C$ over R1, R2, …, Rn]
85
84 Translating View Tuple Deletions
DELETE(t, V, D)
  compute lineage $D^*$ and exclusive lineage $D^{**}$;
  IF $D^{**}$ is a translation THEN RETURN;
  IF $\exists i$ s.t. $R_i^*$ causes no side-effect THEN RETURN;
  FOR each subset $\Delta D$ of $D^*$ DO
    IF $\Delta D$ is not a translation THEN prune all subsets of $\Delta D$;
    ELSE IF $\Delta D$ causes a side-effect THEN prune all supersets of $\Delta D$;
    ELSE RETURN;
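A simplified, runnable reading of DELETE is sketched below. It keeps the three stages (exclusive lineage, single lineage branch, subset search) but omits the pruning rules and returns None when no exact translation exists; the example view and all names are hypothetical. Lineage is computed tuple by tuple from the SPJ characterization: $t_i$ is in $R_i^*$ iff t is in the view computed with $R_i$ replaced by $\{t_i\}$.

```python
from itertools import combinations

# Hypothetical database: R1(A, B) and R2(B, C), with V = pi_{A,C}(R1 join R2).
DB = {"R1": {(1, "x"), (2, "x"), (2, "y")}, "R2": {("x", 10), ("y", 20)}}

def view(db):
    return {(a, c) for (a, b) in db["R1"] for (b2, c) in db["R2"] if b == b2}

def delete(db, delta):
    return {name: rows - delta.get(name, set()) for name, rows in db.items()}

def is_translation(db, delta, t):
    return t not in view(delete(db, delta))

def has_side_effect(db, delta, t):
    return bool(view(db) - view(delete(db, delta)) - {t})

def lineage(db, t, exclusive=False):
    """Per-table lineage R_i* of t, or exclusive lineage R_i** if exclusive=True."""
    result = {}
    for name, rows in db.items():
        result[name] = set()
        for row in rows:
            out = view({**db, name: {row}})     # replace R_i by the single tuple
            if (out == {t}) if exclusive else (t in out):
                result[name].add(row)
    return result

def translate_deletion(db, t):
    """Return an exact translation for -t as {table: tuples to delete}, or None."""
    d_star = lineage(db, t)
    d_star2 = lineage(db, t, exclusive=True)
    if is_translation(db, d_star2, t):          # exclusive lineage has no side-effect
        return d_star2
    for name, rows in d_star.items():           # a branch is always a translation
        if not has_side_effect(db, {name: rows}, t):
            return {name: rows}
    candidates = [(name, row) for name, rows in d_star.items() for row in rows]
    for size in range(1, len(candidates) + 1):  # exhaustive search, no pruning
        for subset in combinations(candidates, size):
            delta = {}
            for name, row in subset:
                delta.setdefault(name, set()).add(row)
            if is_translation(db, delta, t) and not has_side_effect(db, delta, t):
                return delta
    return None

t = (2, 10)
translation = translate_deletion(DB, t)
```

For this data the exclusive lineage {(2, x)} in R1 is already a translation, so the first stage returns and the subset search is never reached.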
86
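85 Detailed Computations Is $\Delta D$ a translation for $-t$? If $t \notin \pi_A(\sigma_C((R_1^* - \Delta R_1) \bowtie \ldots \bowtie (R_n^* - \Delta R_n)))$ then $\Delta D$ is a translation Does $\Delta D$ cause a side-effect? $\Delta E \subseteq \bigcup_{i=1..n} \pi_A(\sigma_C(R_1 \bowtie \ldots \bowtie \Delta R_i \bowtie \ldots \bowtie R_n)) - \{t\}$; if $\Delta E \subseteq \pi_A(\sigma_C((R_1 - \Delta R_1) \bowtie \ldots \bowtie (R_n - \Delta R_n)))$ then $\Delta D$ is exact Further pruning by sizes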