Data Provenance in ETL Scenarios Panos Vassiliadis University of Ioannina (joint work with Alkis Simitsis, IBM Almaden Research Center, Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)

PrOPr 2007, slide 2: Outline
- Introduction
- Conceptual Level
- Logical Level
- Physical Level
- Provenance & ETL

Slide 4: Data Warehouse Environment

Slide 5: Extract-Transform-Load (ETL)

Slide 6: ETL: importance
- ETL and data cleaning tools account for:
  - 30% of effort and expenses in the budget of the DW
  - 55% of the total costs of DW runtime
  - 80% of the development time in a DW project
- ETL market: a multi-million market
  - IBM paid $1.1 billion for Ascential
- ETL tools in the market
  - software packages
  - in-house development
- No standard, no common model
  - most vendors implement a core set of operators and provide a GUI to create a data flow

Slide 7: Fundamental research question
- Now: ETL designers work directly at the physical level (typically via libraries of physical-level templates)
- Challenge: can we design ETL flows as declaratively as possible?
- Detail independence:
  - no care for the algorithmic choices
  - no care about the order of the transformations
  - (hopefully) no care for the details of the inter-attribute mappings

Slide 8: Now: Physical scenario
[Figure labels: involved data stores + DW, physical templates, physical scenario, engine.]

Slide 9: Vision
[Figure labels: involved data stores + DW, schema mappings, conceptual to logical mapper, conceptual to logical mapping, logical templates, logical scenario, optimizer, physical templates, physical scenario, engine, ETL tool.]

Slide 10: Detail independence
[Figure labels: DW, schema mappings, conceptual to logical mapper, conceptual to logical mapping, logical templates, logical scenario, optimizer, physical templates, physical scenario, engine, ETL tool.]
Automate (as much as possible):
- Conceptual: the details of the inter-attribute mappings
- Logical: the order of the transformations
- Physical: the algorithmic choices

Slide 12: Conceptual Model: first attempts

Slide 13: Conceptual Model: The Data Mapping Diagram
Extension of UML to handle inter-attribute mappings

Slide 14: Conceptual Model: The Data Mapping Diagram
Aggregating computes the quarterly sales for each product.

Slide 15: Conceptual Model: Skoutas’ annotations
- Application vocabulary
  - V_C = {product, store}
  - V_Pproduct = {pid, pName, quantity, price, type, storage}
  - V_Pstore = {sid, sName, city, street}
  - V_Fpid = {source_pid, dw_pid}
  - V_Fsid = {source_sid, dw_sid}
  - V_Fprice = {dollars, euros}
  - V_Ttype = {software, hardware}
  - V_Tcity = {paris, rome, athens}
- Datastore mappings
- Datastore annotation

Slide 16: Conceptual Model: Skoutas’ annotations
- The class hierarchy
- Definition for class DS1_Products

Slide 18: Logical Model
[Figure: example ETL scenario spanning Sources, DSA and DW. Sources: S1.PARTS, S2.PARTS. DSA: FTP1, FTP2; DS.PSNEW1, DS.PSOLD1, DIFF1 (DS.PSNEW1.PKEY, DS.PSOLD1.PKEY), DS.PS1; DS.PSNEW2, DS.PSOLD2, DIFF2 (DS.PSNEW2.PKEY, DS.PSOLD2.PKEY), DS.PS2; activities NotNULL, AddDate (COSTDATE=SYSDATE), AddAttr2 (SOURCE), SK1 and SK2 (PKEY, LOOKUP_PS.SKEY, SOURCE), $2€, A2EDate, γ (QTY, COST), PK (PKEY, DATE) and U, several with "Log rejected" outputs. DW: DW.PARTS, Aggregate1 (PKEY, DAY; MIN(COST)), Aggregate2 (PKEY, MONTH; AVG(COST)), V1, V2, TIME (DW.PARTSUPP.DATE, DAY).]

Slide 19: Logical Model
- Main question: what information should we put inside a metadata repository to be able to answer questions like:
  - what is the architecture of my DW back stage?
  - which attributes/tables are involved in the population of an attribute?
  - what part of the scenario is affected if we delete an attribute?

Slide 20: Architecture Graph
[Figure: the example ETL scenario of the Logical Model slide, from sources S1.PARTS and S2.PARTS through the DSA activities (FTP, DIFF, NotNULL, AddDate, AddAttr2, SK, $2€, A2EDate, γ, PK, U, with "Log rejected" outputs) to DW.PARTS, Aggregate1, Aggregate2, V1, V2 and TIME.]

Slide 21: Architecture Graph: Example 2

Slide 22: Architecture Graph: Example 2

Slide 23: Optimization
- Execution order: which is the proper execution order?

Slide 24: Optimization
- Execution order: order equivalence? SK, f1, f2 or SK, f2, f1 or ...?

Slide 25: Logical Optimization
- Can we push selection early enough?
- Can we aggregate before $2€ takes place?
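A minimal sketch of why such reorderings can be safe. Everything concrete here is an assumption for illustration and not taken from the slides: a fixed exchange rate `RATE`, a selection on `qty` (an attribute the $2€ conversion does not touch), and a MIN aggregate per key. Under those assumptions the three plans below produce the same result, while the later plans convert fewer values.

```python
from fractions import Fraction

# Hypothetical fixed dollar-to-euro rate; exact arithmetic avoids float noise.
RATE = Fraction(9, 10)

# Hypothetical source rows: (pkey, qty, cost_in_dollars)
rows = [(1, 5, 100), (1, 0, 80), (2, 7, 50), (2, 3, 60), (3, 0, 10)]

def dollars_to_euros(rs):
    """The $2€ activity: converts the cost attribute only."""
    return [(k, q, c * RATE) for k, q, c in rs]

def select_qty_positive(rs):
    """A selection on qty, an attribute that $2€ does not touch."""
    return [r for r in rs if r[1] > 0]

def min_cost_per_key(rs):
    """Aggregate: MIN(cost) grouped by pkey."""
    out = {}
    for k, _, c in rs:
        out[k] = c if k not in out else min(out[k], c)
    return out

# Plan A: convert, then select, then aggregate.
plan_a = min_cost_per_key(select_qty_positive(dollars_to_euros(rows)))

# Plan B: select first, so the conversion sees fewer rows.
plan_b = min_cost_per_key(dollars_to_euros(select_qty_positive(rows)))

# Plan C: select, aggregate in dollars, then convert one value per group.
# Valid here because MIN commutes with multiplication by a positive constant.
plan_c = {k: c * RATE
          for k, c in min_cost_per_key(select_qty_positive(rows)).items()}

rows_converted_a = len(rows)                        # plan A converts every row
rows_converted_b = len(select_qty_positive(rows))   # plan B converts survivors only
```

The equivalence depends on the assumptions: a selection whose predicate reads the converted cost cannot be moved in front of the conversion without rewriting the predicate, and a rate that varies per row breaks plan C.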

Slide 27: Logical to Physical
“identify the best possible physical implementation for a given logical ETL workflow”
[Figure labels: DW, schema mappings, conceptual to logical mapper, conceptual to logical mapping, logical templates, logical scenario, optimizer, physical templates, physical scenario, engine, ETL tool.]

Slide 28: Problem formulation
- Given a logical-level ETL workflow G_L
- Compute a physical-level ETL workflow G_P
- Such that
  - the semantics of the workflow do not change
  - all constraints are met
  - the cost is minimal

Slide 29: Solution
- We model the problem of finding the physical implementation of an ETL process as a state-space search problem.
- States. A state is a graph G_P that represents a physical-level ETL workflow. The initial state G_P^0 is produced by randomly assigning physical implementations to logical activities, subject to preconditions and constraints.
- Transitions. Given a state G_P, a new state G_P' is generated by replacing the implementation of a physical activity a_P of G_P with another valid implementation for the same activity.
  - Extension: introduction of a sorter activity (at the physical level) as a new node in the graph.
- Sorter introduction: intentionally introduce sorters to reduce execution and resumption costs.
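The state-space formulation above can be sketched in a few lines. This is only an illustration under loud assumptions: the activity names, the candidate implementations and their costs in `IMPLEMENTATIONS` are hypothetical, the cost of a state is a plain sum, preconditions and constraints are ignored, the sorter extension is not modeled, and the search is an exhaustive traversal rather than whatever strategy the authors use.

```python
import random
from collections import deque

# Hypothetical logical workflow: activity -> {implementation: cost}.
# Names and costs are illustrative only, not taken from the slides.
IMPLEMENTATIONS = {
    "join":      {"nested_loops": 900, "sort_merge": 400, "hash": 300},
    "aggregate": {"sort_based": 250, "hash_based": 200},
    "dedup":     {"sort_based": 180, "hash_based": 220},
}
ACTIVITIES = sorted(IMPLEMENTATIONS)  # fixed order: aggregate, dedup, join

def cost(state):
    """Cost of a state: sum of per-activity costs (a deliberately simple model)."""
    return sum(IMPLEMENTATIONS[a][impl] for a, impl in zip(ACTIVITIES, state))

def initial_state(rng):
    """G_P^0: a random implementation for every logical activity."""
    return tuple(rng.choice(sorted(IMPLEMENTATIONS[a])) for a in ACTIVITIES)

def transitions(state):
    """All states reachable by replacing one activity's implementation."""
    for i, a in enumerate(ACTIVITIES):
        for impl in IMPLEMENTATIONS[a]:
            if impl != state[i]:
                yield state[:i] + (impl,) + state[i + 1:]

def search(start):
    """Exhaustively explore the state space and return the cheapest state."""
    best, seen, queue = start, {start}, deque([start])
    while queue:
        current = queue.popleft()
        if cost(current) < cost(best):
            best = current
        for nxt in transitions(current):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return best, len(seen)

best_state, explored = search(initial_state(random.Random(0)))
best_plan = dict(zip(ACTIVITIES, best_state))
```

With three activities the whole space has 3 * 2 * 2 = 12 states, so exhaustive search is trivial; for realistic workflows the space grows multiplicatively with each activity, which is why the choice of search strategy matters.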

Slide 30: Sorters: impact
- We intentionally introduce orderings (via appropriate physical-level sorter activities) to obtain physical plans of lower cost.
- Semantics: unaffected
- Price to pay: cost of sorting the stream of processed data
- Gain:
  - it is possible to employ order-aware algorithms that significantly reduce processing cost
  - it is possible to amortize the cost over activities that utilize common useful orderings

Slide 31: Sorter gains
- Cost(G) = 100.000 + 10.000 + 3*[5.000*log_2(5.000) + 5.000] = 309.316
- If sorter S_{A,B} is added to V: Cost(G') = 100.000 + 10.000 + 2*5.000 + [5.000*log_2(5.000) + 5.000] = 247.877
- Without order: cost(σ_i) = n, cost_SO(γ) = n*log_2(n) + n
- With appropriate order: cost(σ_i) = sel_i * n, cost_SO(γ) = n
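The arithmetic on this slide can be checked directly. The sketch below evaluates the two expressions as they appear in the transcript (dots are thousands separators). Cost(G) reproduces the stated 309.316. The Cost(G') expression as transcribed evaluates to about 186.439, not 247.877; the stated total does match 100.000 + 10.000 + 10.000*log_2(10.000) + 5.000, which would correspond to charging a sorter over 10.000 tuples, but that reading is an assumption and is not confirmed by the transcript.

```python
from math import log2

n = 5_000  # tuples reaching each aggregation, as on the slide

# Sort-based aggregation without a useful input order: n*log2(n) + n
agg_unsorted = n * log2(n) + n

# Cost(G) as printed: 100.000 + 10.000 + 3*[5.000*log2(5.000) + 5.000]
cost_g = 100_000 + 10_000 + 3 * agg_unsorted

# Cost(G') exactly as printed in the transcript
cost_g_prime_as_printed = 100_000 + 10_000 + 2 * n + agg_unsorted

# Assumed alternative reading (NOT in the transcript): one sorter over
# 10.000 tuples, then a single linear pass of 5.000 tuples.
cost_g_prime_alt = 100_000 + 10_000 + 10_000 * log2(10_000) + n

print(round(cost_g))                   # 309316
print(round(cost_g_prime_as_printed))  # 186439
print(round(cost_g_prime_alt))         # 247877
```

Either way the plan with the sorter is cheaper than the original 309.316, which is the point the slide makes.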

Slide 32: Interesting orders
[Figure labels: A asc; A desc; {A, B, [A,B]}]

Slide 34: A principled architecture for ETL
[Figure labels: DW, schema mappings, conceptual to logical mapper, conceptual to logical mapping, logical templates, logical scenario, optimizer, physical templates, physical scenario, engine, ETL tool; annotated with WHY, WHAT and HOW.]

Slide 35: Logical Model: Questions revisited
What information should we put inside a metadata repository to be able to answer questions like:
- what is the architecture of my DW back stage?
  - it is described as the Architecture Graph
- which attributes/tables are involved in the population of an attribute?
- what part of the scenario is affected if we delete an attribute?
  - follow the appropriate path in the Architecture Graph
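Following a path in the Architecture Graph can be sketched as an ordinary graph traversal. The sketch assumes the graph is stored as directed provider edges between attributes; the node names loosely echo the example scenario, but the edges in `EDGES` and the helper `reachable` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict, deque

# Hypothetical provider edges (source attribute -> target attribute).
# Names echo the example scenario; the edges themselves are illustrative.
EDGES = [
    ("S1.PARTS.PKEY", "DS.PS1.PKEY"),
    ("S1.PARTS.COST", "DS.PS1.COST"),
    ("DS.PS1.PKEY", "SK1.SKEY"),
    ("LOOKUP_PS.SKEY", "SK1.SKEY"),
    ("SK1.SKEY", "DW.PARTS.PKEY"),
    ("DS.PS1.COST", "DW.PARTS.COST"),
]

downstream = defaultdict(set)  # attribute -> attributes it populates
upstream = defaultdict(set)    # attribute -> attributes that populate it
for src, dst in EDGES:
    downstream[src].add(dst)
    upstream[dst].add(src)

def reachable(start, adjacency):
    """Every node reachable from start by following adjacency edges."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Which attributes are involved in the population of DW.PARTS.PKEY?
populating = reachable("DW.PARTS.PKEY", upstream)

# What part of the scenario is affected if S1.PARTS.COST is deleted?
affected = reachable("S1.PARTS.COST", downstream)
```

The same traversal answers both questions; only the direction of the edges changes.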

Slide 36: Fundamental questions on provenance & ETL
- Why do we have a certain record in the DW?
  - Because there is a process (described by the Architecture Graph at the logical level + the conceptual model) that produces this kind of tuple.
- Where did this record come from in my DW?
  - Hard! If there is a way to derive an “inverse” workflow that links the DW tuples to their sources, you can answer it.
  - Not always possible: transformations are not invertible, and a DW is supposed to progressively summarize data…
  - Widom’s work on record lineage…
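The difficulty with summarizing transformations can be made concrete. An average cannot be inverted to recover its inputs, but if the load step records which source rows fell into each group, the "where did it come from" question can still be answered for that tuple. This is a sketch under assumptions that are not in the slides: the row ids, the sample data and the eager recording of lineage in `aggregate_with_lineage` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical source rows: (row_id, pkey, month, cost)
source = [
    ("r1", 10, "2007-01", 4.0),
    ("r2", 10, "2007-01", 6.0),
    ("r3", 10, "2007-02", 8.0),
    ("r4", 20, "2007-01", 5.0),
]

def aggregate_with_lineage(rows):
    """AVG(cost) per (pkey, month), plus the source row ids behind each group."""
    groups = defaultdict(list)
    for row_id, pkey, month, cost in rows:
        groups[(pkey, month)].append((row_id, cost))
    result, lineage = {}, {}
    for key, members in groups.items():
        result[key] = sum(c for _, c in members) / len(members)
        lineage[key] = [rid for rid, _ in members]
    return result, lineage

dw, lineage = aggregate_with_lineage(source)

# The DW tuple (10, "2007-01") holds 5.0; the value alone cannot tell us it
# came from 4.0 and 6.0, but the recorded lineage names the contributing rows.
avg_jan = dw[(10, "2007-01")]
rows_behind_jan = lineage[(10, "2007-01")]
```

Recording lineage during the load trades extra storage for the ability to trace tuples later; whether that trade is acceptable depends on the workload.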

Slide 37: Fundamental questions on provenance & ETL
- How are updates to the sources managed? (update takes place at the source, DW + data marts must be updated)
  - Done, although in a tedious way: mainly log sniffing; also “diff” comparison of extracted snapshots.
- When errors are discovered during the ETL process, how are they handled? (update takes place at the data staging area, sources must be updated)
  - Too hard to “back-fuse” data into the sources, for both political and workload reasons. Currently, this is not automated.
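The "diff" comparison of extracted snapshots can be sketched as a key-based comparison of the previous and the current extract. The function `diff_snapshots` and the sample data are hypothetical; the sketch assumes each snapshot fits in memory as a dictionary keyed by primary key, which real extracts often do not.

```python
def diff_snapshots(old, new):
    """Compare two snapshots, each a dict mapping primary key -> row."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: (old[k], new[k])
               for k in old.keys() & new.keys()
               if old[k] != new[k]}
    return inserted, deleted, updated

# Hypothetical snapshots of a source table: pkey -> (name, cost)
old_snapshot = {1: ("bolt", 10.0), 2: ("nut", 2.5), 3: ("gear", 7.0)}
new_snapshot = {1: ("bolt", 10.0), 2: ("nut", 3.0), 4: ("cog", 9.0)}

inserted, deleted, updated = diff_snapshots(old_snapshot, new_snapshot)
```

Only the three change sets need to be propagated to the DW; unchanged rows such as key 1 are skipped.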

Slide 38: Fundamental questions on provenance & ETL
- What happens if there are updates to the schema of the involved data sources?
  - Currently this is not automated, although automating the task is part of the detail independence vision.
- What happens if we must update the workflow structure and semantics?
  - Nothing is versioned back; still, there are not really any user requests for this to be supported.
- What is the equivalent of citations in ETL?
  - … nothing really …

Slide 39: Thank you!
