Published by Douglas Young. Modified over 9 years ago.
1
Data Provenance in ETL Scenarios
Panos Vassiliadis, University of Ioannina
(joint work with Alkis Simitsis, IBM Almaden Research Center; Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)
2
PrOPr 2007
Outline
- Introduction
- Conceptual Level
- Logical Level
- Physical Level
- Provenance & ETL
4
Data Warehouse Environment
5
Extract-Transform-Load (ETL)
6
ETL: importance
- ETL and data cleaning tools account for:
  - 30% of effort and expenses in the budget of the DW
  - 55% of the total costs of DW runtime
  - 80% of the development time in a DW project
- ETL market: a multi-million market
  - IBM paid $1.1 billion for Ascential
- ETL tools in the market:
  - software packages
  - in-house development
- No standard, no common model
  - most vendors implement a core set of operators and provide a GUI to create a data flow
7
Fundamental research question
- Now: ETL designers work directly at the physical level (typically via libraries of physical-level templates)
- Challenge: can we design ETL flows as declaratively as possible?
- Detail independence:
  - no care for the algorithmic choices
  - no care about the order of the transformations
  - (hopefully) no care for the details of the inter-attribute mappings
8
Now: Physical scenario
[Figure labels: Involved data stores + Physical templates; Physical scenario; Engine; DW]
9
Vision
[Figure labels: Schema mappings; Conceptual to logical mapper; Conceptual to logical mapping; Logical templates; Logical scenario; Optimizer; Physical templates; Physical scenario; Engine; ETL tool; Involved data stores; DW]
10
Detail independence
Automate (as much as possible):
- Conceptual: the details of the inter-attribute mappings
- Logical: the order of the transformations
- Physical: the algorithmic choices
[Figure labels: Schema mappings; Conceptual to logical mapper; Conceptual to logical mapping; Logical templates; Logical scenario; Optimizer; Physical templates; Physical scenario; Engine; ETL tool; DW]
11
Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL
12
Conceptual Model: first attempts
13
Conceptual Model: The Data Mapping Diagram
- Extension of UML to handle inter-attribute mappings
14
Conceptual Model: The Data Mapping Diagram
- Aggregating computes the quarterly sales for each product.
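As a rough illustration of what "quarterly sales for each product" means at the data level, here is a minimal sketch. The row layout (product, sale date, amount), the function name `quarterly_sales` and the sample values are assumptions made for this example; the slide itself only shows the mapping in a diagram.

```python
from collections import defaultdict
from datetime import date


def quarterly_sales(rows):
    """Sum sale amounts per (product, year, quarter).

    `rows` is an iterable of (product, sale_date, amount) tuples;
    this layout is hypothetical and chosen only for illustration.
    """
    totals = defaultdict(float)
    for product, sale_date, amount in rows:
        quarter = (sale_date.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
        totals[(product, sale_date.year, quarter)] += amount
    return dict(totals)


# Made-up sample data.
sales = [
    ("p1", date(2007, 1, 15), 10.0),
    ("p1", date(2007, 3, 2), 5.0),
    ("p1", date(2007, 4, 1), 7.0),
    ("p2", date(2007, 2, 20), 3.0),
]
result = quarterly_sales(sales)
```

The two first-quarter rows for `p1` collapse into one total, which is the kind of many-to-one mapping the diagram has to express.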
15
Conceptual Model: Skoutas’ annotations
Application vocabulary:
- V_C = {product, store}
- V_Pproduct = {pid, pName, quantity, price, type, storage}
- V_Pstore = {sid, sName, city, street}
- V_Fpid = {source_pid, dw_pid}
- V_Fsid = {source_sid, dw_sid}
- V_Fprice = {dollars, euros}
- V_Ttype = {software, hardware}
- V_Tcity = {paris, rome, athens}
[Figures: Datastore mappings; Datastore annotation]
16
Conceptual Model: Skoutas’ annotations
[Figures: The class hierarchy; Definition for class DS1_Products]
17
Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL
18
Logical Model
[Figure: example ETL workflow spanning Sources, DSA and DW. Recoverable labels: data stores S1.PARTS, S2.PARTS, DS.PSNEW1, DS.PSOLD1, DS.PSNEW2, DS.PSOLD2, DS.PS1, DS.PS2, DW.PARTS, V1, V2; activities FTP1, FTP2, DIFF1 (DS.PSNEW1.PKEY, DS.PSOLD1.PKEY), DIFF2 (DS.PSNEW2.PKEY, DS.PSOLD2.PKEY), NotNULL, AddAttr2 (SOURCE), AddDate (COSTDATE=SYSDATE), A2EDate, $2€, SK1 (DS.PS2.PKEY, LOOKUP_PS.SKEY, SOURCE), SK2 (DS.PS1.PKEY, LOOKUP_PS.SKEY, SOURCE), U, PK (PKEY, DATE), γ (QTY, COST), Aggregate1 (PKEY, DAY; MIN(COST)), Aggregate2 (PKEY, MONTH; AVG(COST)), TIME (DW.PARTSUPP.DATE, DAY); several "Log rejected" outputs.]
19
Logical Model
Main question: what information should we put inside a metadata repository to be able to answer questions like:
- what is the architecture of my DW back stage?
- which attributes/tables are involved in the population of an attribute?
- what part of the scenario is affected if we delete an attribute?
20
Architecture Graph
[Figure: the same example workflow and labels as on the Logical Model slide (Sources, DSA, DW), presented as an Architecture Graph.]
21
Architecture Graph
Example 2
23
Optimization
Execution order: which is the proper execution order?
24
Optimization
Execution order: order equivalence? SK, f1, f2 or SK, f2, f1 or ...?
25
Logical Optimization
- Can we push selection early enough?
- Can we aggregate before $2€ takes place?
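Both questions come down to whether two orderings of the same activities produce the same result. The sketch below checks this for one favourable case: a not-null filter on cost versus a row-by-row dollars-to-euros conversion, and a MIN aggregate taken before versus after that conversion. The conversion rate, the row layout and the function names are assumptions made for this illustration and do not come from the slides.

```python
RATE = 0.75  # hypothetical dollars-to-euros rate, for illustration only


def not_null(rows):
    """Selection: keep rows whose cost is present."""
    return [r for r in rows if r["cost"] is not None]


def to_euros(rows):
    """Row-by-row conversion; missing costs stay missing."""
    return [
        {**r, "cost": None if r["cost"] is None else r["cost"] * RATE}
        for r in rows
    ]


rows = [
    {"pkey": 1, "cost": 10.0},
    {"pkey": 2, "cost": None},
    {"pkey": 3, "cost": 4.0},
]

# Selection pushed before the conversion vs. applied after it.
filter_first = to_euros(not_null(rows))
convert_first = not_null(to_euros(rows))

# MIN taken before the conversion vs. after it.
min_then_convert = min(r["cost"] for r in not_null(rows)) * RATE
convert_then_min = min(r["cost"] for r in filter_first)
```

The swap is safe here because the filter tests only whether a cost is present, which the conversion does not change, and because multiplying by a positive rate preserves ordering. A selection on a euro threshold, or a conversion whose rate varies per row, would need more care before being reordered.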
26
Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL
27
Logical to Physical
“identify the best possible physical implementation for a given logical ETL workflow”
[Figure labels: Schema mappings; Conceptual to logical mapper; Conceptual to logical mapping; Logical templates; Logical scenario; Optimizer; Physical templates; Physical scenario; Engine; ETL tool; DW]
28
Problem formulation
- Given a logical-level ETL workflow G_L
- Compute a physical-level ETL workflow G_P
- Such that:
  - the semantics of the workflow do not change
  - all constraints are met
  - the cost is minimal
29
Solution
We model the problem of finding the physical implementation of an ETL process as a state-space search problem.
- States. A state is a graph G_P that represents a physical-level ETL workflow. The initial state G_P^0 is produced by randomly assigning physical implementations to logical activities, subject to preconditions and constraints.
- Transitions. Given a state G_P, a new state G_P' is generated by replacing the implementation of a physical activity a_P of G_P with another valid implementation for the same activity.
- Extension: introduction of a sorter activity (at the physical level) as a new node in the graph.
- Sorter introduction: intentionally introduce sorters to reduce execution and resumption costs.
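The states and transitions described above can be sketched as a small search over implementation assignments. This is a simplified illustration, not the authors' algorithm: the activity names, the implementation names and their costs are hypothetical, costs are assumed to be additive and independent per activity, and preconditions, constraints and sorter introduction are left out. The search strategy shown is plain greedy hill climbing.

```python
import random

# Hypothetical catalogue: logical activity -> {physical implementation: cost}.
IMPLEMENTATIONS = {
    "join": {"nested_loops": 900.0, "sort_merge": 400.0, "hash": 250.0},
    "aggregate": {"sort_based": 300.0, "hash_based": 180.0},
    "filter": {"scan": 50.0},
}


def cost(state):
    """Total cost of a state (assumed additive for this sketch)."""
    return sum(IMPLEMENTATIONS[a][impl] for a, impl in state.items())


def neighbours(state):
    """Transitions: swap the implementation of exactly one activity."""
    for activity, current in state.items():
        for impl in IMPLEMENTATIONS[activity]:
            if impl != current:
                yield {**state, activity: impl}


def search(seed=0):
    """Start from a random assignment and move to the cheapest neighbour
    until no single swap lowers the cost."""
    rng = random.Random(seed)
    state = {a: rng.choice(sorted(impls)) for a, impls in IMPLEMENTATIONS.items()}
    while True:
        best = min(neighbours(state), key=cost, default=state)
        if cost(best) >= cost(state):
            return state
        state = best


plan = search()
```

With additive, independent costs the greedy walk always ends at the cheapest assignment. Once sorters and order-dependent costs are introduced, the choices interact and a local search of this kind can stop at a local minimum.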
30
Sorters: impact
- We intentionally introduce orderings (via appropriate physical-level sorter activities) to obtain physical plans of lower cost.
- Semantics: unaffected
- Price to pay: the cost of sorting the stream of processed data
- Gain:
  - it is possible to employ order-aware algorithms that significantly reduce processing cost
  - it is possible to amortize the cost over activities that utilize common useful orderings
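To make "order-aware algorithm" and "amortize the cost" concrete, the sketch below sorts a stream once by key and then feeds the ordered rows to two consumers that each need only a single pass. The (pkey, cost) row layout and the function name are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def min_cost_per_key_sorted(rows):
    """Order-aware aggregation: one pass over rows already ordered by pkey,
    holding only the current group at any time."""
    for pkey, group in groupby(rows, key=itemgetter(0)):
        yield pkey, min(cost for _, cost in group)


# Made-up (pkey, cost) rows arriving in no particular order.
rows = [(2, 7.0), (1, 5.0), (2, 3.0), (1, 9.0)]

# The sorter: its cost is paid once ...
ordered = sorted(rows, key=itemgetter(0))

# ... and shared by every consumer that can exploit the same ordering.
minima = list(min_cost_per_key_sorted(ordered))
counts = [(k, sum(1 for _ in g)) for k, g in groupby(ordered, key=itemgetter(0))]
```

`itertools.groupby` only merges adjacent equal keys, so it is correct here precisely because the input is ordered. That dependence on input order is what the sorter pays for.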
31
Sorter gains
- Cost(G) = 100,000 + 10,000 + 3*[5,000*log_2(5,000) + 5,000] = 309,316
- If sorter S_A,B is added to V: Cost(G') = 100,000 + 10,000 + 2*5,000 + [5,000*log_2(5,000) + 5,000] = 247,877
- Without order: cost(σ_i) = n; cost_SO(γ) = n*log_2(n) + n
- With appropriate order: cost(σ_i) = sel_i * n; cost_SO(γ) = n
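The arithmetic on this slide can be checked with a few lines. The sketch below encodes only the sort-based aggregation cost given on the slide and evaluates both totals exactly as printed; the function and variable names are illustrative. The first total reproduces 309,316. The second expression, as printed, evaluates to roughly 186,439 rather than 247,877. The gap of about 61,439 equals 5,000*log_2(5,000), which would be consistent with the cost of one sort of 5,000 rows, although the slide does not say so.

```python
from math import log2


def cost_aggregation(n, ordered=False):
    """cost_SO(gamma): n*log2(n) + n without order, n with an appropriate order."""
    return n if ordered else n * log2(n) + n


n = 5_000

# Cost(G): three sort-based operations over n rows, plus the fixed terms.
cost_g = 100_000 + 10_000 + 3 * cost_aggregation(n)

# Cost(G') evaluated exactly as the formula is printed on the slide.
cost_g_prime_as_printed = 100_000 + 10_000 + 2 * n + cost_aggregation(n)

# Difference between the printed total and the printed formula's value.
gap = 247_877 - cost_g_prime_as_printed
```

Either way the comparison points in the same direction: the plan with the sorter is cheaper than the plan without it.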
32
Interesting orders
A asc; A desc; {A, B, [A,B]}
33
Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL
34
A principled architecture for ETL
[Figure labels: Schema mappings; Conceptual to logical mapper; Conceptual to logical mapping; Logical templates; Logical scenario; Optimizer; Physical templates; Physical scenario; Engine; ETL tool; DW; annotated WHY, WHAT, HOW]
35
Logical Model: Questions revisited
What information should we put inside a metadata repository to be able to answer questions like:
- what is the architecture of my DW back stage?
  - it is described as the Architecture Graph
- which attributes/tables are involved in the population of an attribute?
- what part of the scenario is affected if we delete an attribute?
  - follow the appropriate path in the Architecture Graph
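"Follow the appropriate path" can be read as graph reachability: forward from an attribute to see what is affected if it is deleted, and backward from an attribute to see what contributes to populating it. The sketch below assumes a toy set of provider edges. The attribute names borrow store names from the example workflow, but the edges themselves are hypothetical and do not reproduce the actual Architecture Graph.

```python
from collections import deque

# Hypothetical provider edges: attribute -> attributes it feeds.
EDGES = {
    "S1.PARTS.COST": ["DS.PS1.COST"],
    "DS.PS1.COST": ["DW.PARTS.COST"],
    "S1.PARTS.PKEY": ["DS.PS1.PKEY"],
    "DS.PS1.PKEY": ["DW.PARTS.PKEY"],
    "LOOKUP_PS.SKEY": ["DW.PARTS.PKEY"],
}


def reachable(start, edges):
    """Breadth-first search: every node reachable from `start`, excluding it."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(edges):
    """Flip every edge so the same search walks from consumers to providers."""
    rev = {}
    for src, dsts in edges.items():
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev


# What is affected if S1.PARTS.COST is deleted?
affected = reachable("S1.PARTS.COST", EDGES)

# Which attributes are involved in populating DW.PARTS.PKEY?
providers = reachable("DW.PARTS.PKEY", reverse(EDGES))
```

The same traversal answers both questions; only the edge direction changes.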
36
Fundamental questions on provenance & ETL
- Why do we have a certain record in the DW?
  - Because there is a process (described by the Architecture Graph at the logical level, plus the conceptual model) that produces this kind of tuple.
- Where did this record in my DW come from?
  - Hard! If there is a way to derive an “inverse” workflow that links the DW tuples to their sources, you can answer it.
  - Not always possible: transformations are not necessarily invertible, and a DW is supposed to progressively summarize data.
  - See Widom’s work on record lineage.
37
Fundamental questions on provenance & ETL
- How are updates to the sources managed? (The update takes place at the source; the DW and data marts must be updated.)
  - Done, although in a tedious way: mainly log sniffing, plus “diff” comparison of extracted snapshots.
- When errors are discovered during the ETL process, how are they handled? (The update takes place at the data staging area; the sources must be updated.)
  - Too hard to “back-fuse” data into the sources, for both political and workload reasons. Currently, this is not automated.
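The “diff” comparison of extracted snapshots can be sketched as a key-based comparison of the previous and the current extract. The dictionary layout (primary key mapped to the rest of the row), the function name and the sample rows are assumptions made for this illustration.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key.

    Returns (inserted, deleted, updated), each a dict from key to row;
    `updated` holds the new version of rows whose content changed.
    """
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, deleted, updated


# Made-up snapshots: pkey -> (name, cost).
old = {1: ("bolt", 5.0), 2: ("nut", 1.0), 3: ("gear", 9.0)}
new = {1: ("bolt", 5.5), 3: ("gear", 9.0), 4: ("cog", 2.0)}

inserted, deleted, updated = diff_snapshots(old, new)
```

Only the three change sets need to be propagated to the warehouse. Unchanged rows, such as key 3 here, are dropped from further processing.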
38
Fundamental questions on provenance & ETL
- What happens if there are updates to the schema of the involved data sources?
  - Currently this is not automated, although automating the task is part of the detail independence vision.
- What happens if we must update the workflow structure and semantics?
  - Nothing is versioned; still, there are not really any user requests for this to be supported.
- What is the equivalent of citations in ETL?
  - ... nothing really ...
39
Thank you!