Data Provenance in ETL Scenarios
Panos Vassiliadis, University of Ioannina
(joint work with Alkis Simitsis, IBM Almaden Research Center; Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)
Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL
Data Warehouse Environment
Extract-Transform-Load (ETL)
ETL: importance
ETL and data cleaning tools account for:
- 30% of the effort and expenses in the budget of the DW
- 55% of the total costs of DW runtime
- 80% of the development time in a DW project
ETL market: a multi-million market (IBM paid $1.1 billion for Ascential).
ETL tools in the market: software packages and in-house development. No standard, no common model; most vendors implement a core set of operators and provide a GUI to create a data flow.
Fundamental research question
Now: ETL designers work directly at the physical level (typically via libraries of physical-level templates).
Challenge: can we design ETL flows as declaratively as possible?
Detail independence:
- no concern for the algorithmic choices
- no concern for the order of the transformations
- (hopefully) no concern for the details of the inter-attribute mappings
Now:
[Figure: current practice: a physical scenario is hand-crafted from physical templates over the involved data stores and the DW, and executed directly by the engine]
Vision:
[Figure: an end-to-end ETL tool: schema mappings over the involved data stores and the DW feed a conceptual-to-logical mapper; the resulting logical scenario, built from logical templates, is optimized into a physical scenario built from physical templates and executed by the engine]
Detail independence: automate (as much as possible)
- Conceptual level: the details of the inter-attribute mappings
- Logical level: the order of the transformations
- Physical level: the algorithmic choices
[Figure: the same ETL tool architecture, indicating the level at which each kind of detail is automated]
Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL
Conceptual Model: first attempts
Conceptual Model: The Data Mapping Diagram
An extension of UML to handle inter-attribute mappings.
Conceptual Model: The Data Mapping Diagram
The aggregation computes the quarterly sales for each product.
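As a concrete reading of that mapping (not from the slides; the column names and the revenue = quantity × price interpretation of "sales" are my assumptions), a minimal Python sketch of the aggregation:

```python
from collections import defaultdict
from datetime import date

# Hypothetical source rows: (pid, sale_date, quantity, price).
sales = [
    ("p1", date(2006, 1, 15), 10, 2.5),
    ("p1", date(2006, 2, 20), 4, 2.5),
    ("p2", date(2006, 5, 3), 7, 9.0),
]

# Group by (product, quarter) and sum revenue = quantity * price.
quarterly = defaultdict(float)
for pid, d, qty, price in sales:
    quarter = (d.year, (d.month - 1) // 3 + 1)
    quarterly[(pid, quarter)] += qty * price

for (pid, (year, q)), total in sorted(quarterly.items()):
    print(f"{pid} {year}Q{q}: {total}")
```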
Conceptual Model: Skoutas' annotations
Application vocabulary:
V_C = {product, store}
V_P(product) = {pid, pName, quantity, price, type, storage}
V_P(store) = {sid, sName, city, street}
V_F(pid) = {source_pid, dw_pid}
V_F(sid) = {source_sid, dw_sid}
V_F(price) = {dollars, euros}
V_T(type) = {software, hardware}
V_T(city) = {paris, rome, athens}
Datastore mappings; datastore annotation.
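A possible encoding (my own illustration, not Skoutas' actual ontology formalism) of this vocabulary and of one datastore annotation; the structure simply mirrors the sets above:

```python
# Application vocabulary: classes, their properties, the representation
# formats of some properties, and enumerated value types.
vocabulary = {
    "classes": ["product", "store"],
    "properties": {
        "product": ["pid", "pName", "quantity", "price", "type", "storage"],
        "store": ["sid", "sName", "city", "street"],
    },
    "formats": {
        "pid": ["source_pid", "dw_pid"],
        "sid": ["source_sid", "dw_sid"],
        "price": ["dollars", "euros"],
    },
    "types": {
        "type": ["software", "hardware"],
        "city": ["paris", "rome", "athens"],
    },
}

# Hypothetical annotation of a source datastore: each attribute mapped
# to a vocabulary property plus the format it uses.
ds1_products_annotation = {
    "pid": ("product.pid", "source_pid"),
    "price": ("product.price", "dollars"),
}

print(ds1_products_annotation["price"])  # ('product.price', 'dollars')
```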
Conceptual Model: Skoutas' annotations
[Figure: the class hierarchy, and the definition of class DS1_Products]
Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL
Logical Model
[Figure: an example ETL scenario spanning the Sources, a data staging area (DSA), and the DW. Parts data from S1.PARTS and S2.PARTS arrive via FTP_1/FTP_2 into DS.PSNEW1/DS.PSNEW2; DIFF_1/DIFF_2 isolate new rows against DS.PSOLD1/DS.PSOLD2 on PKEY; surrogate-key activities SK_1/SK_2 use LOOKUP_PS.SKEY; further activities add a SOURCE attribute (AddAttr2), add COSTDATE = SYSDATE (AddDate), convert $ to €, convert American to European dates (A2EDate), and check NotNull, logging rejected rows; after a union U and a primary-key check PK on (PKEY, DATE), the data populate DW.PARTS; aggregations Aggregate_1 (PKEY, DAY, MIN(COST)) and Aggregate_2 (PKEY, MONTH, AVG(COST)) derive views V1 and V2]
Logical Model
Main question: what information should we put inside a metadata repository to be able to answer questions like:
- what is the architecture of my DW back stage?
- which attributes/tables are involved in the population of an attribute?
- what part of the scenario is affected if we delete an attribute?
Architecture Graph
[Figure: the example scenario of the previous slide, redrawn as an Architecture Graph]
Architecture Graph: Example 2
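To make the repository idea concrete, here is a minimal sketch (mine, not the paper's model) of an Architecture Graph as a directed graph of provider edges; the last two questions of the "Logical Model" slide reduce to reachability. Node names are illustrative:

```python
from collections import defaultdict

# Provider edges: for each node, the set of nodes that populate it
# (attribute -> activity, activity -> attribute, and so on).
providers = defaultdict(set)

def add_edge(provider, consumer):
    providers[consumer].add(provider)

# Illustrative fragment: S1.PARTS.PKEY feeds the surrogate-key activity
# SK_1, which populates DS.PS1.PKEY, which populates DW.PARTS.PKEY.
add_edge("S1.PARTS.PKEY", "SK_1")
add_edge("SK_1", "DS.PS1.PKEY")
add_edge("DS.PS1.PKEY", "DW.PARTS.PKEY")

def population_sources(node):
    """All attributes/activities involved in populating `node`
    (transitive closure over provider edges)."""
    seen, stack = set(), [node]
    while stack:
        for p in providers[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

print(population_sources("DW.PARTS.PKEY"))
# -> {'DS.PS1.PKEY', 'SK_1', 'S1.PARTS.PKEY'}
```

The deletion-impact question is the same traversal run in the opposite direction, along consumer edges.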
Optimization
Execution order… which is the proper execution order?
Optimization
Execution order… order equivalence? SK, f_1, f_2 or SK, f_2, f_1 or …?
Logical Optimization
Can we push a selection early enough? Can we aggregate before the $2€ conversion takes place?
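As a sketch of the machinery behind such questions (my own simplification; activities are treated as black boxes with declared attribute sets), a selection can be pushed in front of a neighboring activity when the attributes it reads are not produced or changed by that activity:

```python
# Each activity declares the attributes it reads and those it
# generates or modifies. Names and fields are illustrative.
class Activity:
    def __init__(self, name, reads, writes):
        self.name, self.reads, self.writes = name, set(reads), set(writes)

dollars2euros = Activity("$2€", reads={"COST"}, writes={"COST"})
not_null_qty = Activity("NotNull(QTY)", reads={"QTY"}, writes=set())

def can_push_before(selection, activity):
    """A selection may be swapped in front of an activity only if the
    activity does not write any attribute the selection reads."""
    return selection.reads.isdisjoint(activity.writes)

print(can_push_before(not_null_qty, dollars2euros))  # True: QTY untouched by $2€
```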
Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL
Logical to Physical
"identify the best possible physical implementation for a given logical ETL workflow"
[Figure: the ETL tool architecture, highlighting the optimizer step that turns the logical scenario into the physical scenario]
Problem formulation
Given a logical-level ETL workflow G^L, compute a physical-level ETL workflow G^P such that:
- the semantics of the workflow do not change
- all constraints are met
- the cost is minimal
Solution
We model the problem of finding the physical implementation of an ETL process as a state-space search problem.
States. A state is a graph G^P that represents a physical-level ETL workflow. The initial state G_0^P is produced by a random assignment of physical implementations to logical activities w.r.t. preconditions and constraints.
Transitions. Given a state G^P, a new state G^P' is generated by replacing the implementation of a physical activity a^P of G^P with another valid implementation for the same activity.
Extension: introduction of a sorter activity (at the physical level) as a new node in the graph. Sorters are intentionally introduced to reduce execution & resumption costs.
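A toy version of this state space (mine; exhaustive enumeration rather than the heuristic search an optimizer would use), with hypothetical activities, implementations, and costs:

```python
from itertools import product

# Hypothetical logical activities and the physical implementations
# available for each, with illustrative per-activity costs.
implementations = {
    "SK_1": {"hash_lookup": 120, "merge_lookup": 90},
    "NotNull": {"scan": 10},
    "Aggregate_1": {"hash_group": 200, "sort_group": 140},
}

def cost(state):
    # Sum of per-activity costs; a real model would also charge
    # sorters and credit orderings shared between activities.
    return sum(implementations[a][impl] for a, impl in state.items())

# Exhaustive search over all states (fine for tiny workflows).
activities = list(implementations)
best = min(
    (dict(zip(activities, choice))
     for choice in product(*(implementations[a] for a in activities))),
    key=cost,
)
print(best, cost(best))
```

Each transition of the search corresponds to changing one activity's chosen implementation; sorter introduction would add nodes to the graph rather than relabel existing ones.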
Sorters: impact
We intentionally introduce orderings (via appropriate physical-level sorter activities) to obtain physical plans of lower cost.
Semantics: unaffected.
Price to pay: the cost of sorting the stream of processed data.
Gain: order-aware algorithms that significantly reduce processing cost become applicable, and the sorting cost can be amortized over activities that utilize common useful orderings.
Sorter gains
Without order: cost(σ_i) = n; cost_SO(γ) = n·log₂(n) + n
With an appropriate order: cost(σ_i) = sel_i · n; cost_SO(γ) = n
Example (n = 5,000): Cost(G) = … · [5,000·log₂(5,000) + 5,000] = …
If sorter S_{A,B} is added to V: Cost(G') = … · 5,000 + [5,000·log₂(5,000) + 5,000] = …
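To get a feel for the magnitudes, a small arithmetic check (my own, not from the slides) of the cost model above for n = 5,000, including the amortization claim of the previous slide:

```python
import math

n = 5_000
sort_cost = n * math.log2(n) + n  # cost_SO of one sort-based aggregation

print(f"one sort-based aggregation, no order:   {sort_cost:,.0f}")
print(f"same aggregation with appropriate order: {n:,}")
# Two aggregations sharing one upstream sorter pay the sort once:
print(f"sorter + two order-aware aggregations:   {sort_cost + 2 * n:,.0f}")
print(f"two independent sort-based aggregations: {2 * sort_cost:,.0f}")
```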
Interesting orders: A asc, A desc, {A, B}, [A, B]
Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL
A principled architecture for ETL
[Figure: the ETL tool architecture once more, annotated with WHY, WHAT, and HOW]
Logical Model: questions revisited
What information should we put inside a metadata repository to be able to answer questions like:
- what is the architecture of my DW back stage? It is described by the Architecture Graph.
- which attributes/tables are involved in the population of an attribute?
- what part of the scenario is affected if we delete an attribute? For both, follow the appropriate path in the Architecture Graph.
Fundamental questions on provenance & ETL
Why do we have a certain record in the DW? Because there is a process (described by the Architecture Graph at the logical level, plus the conceptual model) that produces this kind of tuple.
Where did this record in my DW come from? Hard! If there is a way to derive an "inverse" workflow that links DW tuples to their sources, you can answer it. Not always possible: transformations are not invertible, and a DW is supposed to progressively summarize data. See Widom's work on record lineage.
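A tiny illustration (mine) of why invertibility is the crux of the "where from" question: a currency conversion can be undone, while an aggregation collapses many inputs into one output:

```python
RATE = 0.85  # hypothetical $ -> EUR rate

# Invertible activity: the source value is recoverable from the output.
def dollars_to_euros(cost):
    return cost * RATE

def euros_to_dollars(cost):
    return cost / RATE  # exact inverse (up to rounding)

# Non-invertible activity: the individual tuples are discarded.
def avg_cost(costs):
    return sum(costs) / len(costs)

print(euros_to_dollars(dollars_to_euros(100.0)))  # 100.0: lineage recoverable
print(avg_cost([10, 20, 30]))  # 20.0: many inputs map here; lineage is lost
```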
Fundamental questions on provenance & ETL
How are updates to the sources managed? (The update takes place at the source; the DW and data marts must be updated.) Done, although in a tedious way: mainly log sniffing; also "diff" comparison of extracted snapshots.
When errors are discovered during the ETL process, how are they handled? (The update takes place at the data staging area; the sources must be updated.) Too hard to "back-fuse" data into the sources, for both political and workload reasons. Currently, this is not automated.
Fundamental questions on provenance & ETL
What happens if there are updates to the schema of the involved data sources? Currently not automated, although automating this task is part of the detail-independence vision.
What happens if we must update the workflow structure and semantics? Nothing is versioned; still, there are not really any user requests for this to be supported.
What is the equivalent of citations in ETL? … nothing, really …
PrOPr Thank you!