Data Provenance in ETL Scenarios Panos Vassiliadis University of Ioannina (joint work with Alkis Simitsis, IBM Almaden Research Center, Timos Sellis and Dimitrios Skoutas, NTUA & ICCS)

PrOPr Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL

PrOPr Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL

PrOPr Data Warehouse Environment

PrOPr Extract-Transform-Load (ETL)

PrOPr ETL: importance
- ETL and data cleaning tools account for 30% of the effort and expenses in the DW budget, 55% of the total DW runtime costs, and 80% of the development time in a DW project.
- The ETL market is a multi-million-dollar market; IBM paid $1.1 billion for Ascential.
- ETL tools in the market: software packages and in-house development.
- No standard, no common model: most vendors implement a core set of operators and provide a GUI to create a data flow.

PrOPr Fundamental research question
- Now: ETL designers work directly at the physical level (typically via libraries of physical-level templates).
- Challenge: can we design ETL flows as declaratively as possible?
- Detail independence: no care for the algorithmic choices, no care about the order of the transformations, and (hopefully) no care for the details of the inter-attribute mappings.

PrOPr Now: the physical scenario [figure: an engine executes a physical scenario, built from a library of physical templates, over the involved data stores and the DW]

PrOPr Vision [figure: an ETL tool in which schema mappings feed a conceptual-to-logical mapper, a logical scenario built from logical templates is produced, an optimizer turns it into a physical scenario built from physical templates, and an engine executes it over the involved data stores and the DW, replacing today's hand-crafted physical scenario]

PrOPr Detail independence [the same ETL-tool architecture]. Automate (as much as possible): at the conceptual level, the details of the inter-attribute mappings; at the logical level, the order of the transformations; at the physical level, the algorithmic choices.

PrOPr Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL

PrOPr Conceptual Model: first attempts

PrOPr Conceptual Model: The Data Mapping Diagram Extension of UML to handle inter-attribute mappings

PrOPr Conceptual Model: The Data Mapping Diagram. The aggregation node computes the quarterly sales for each product.
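As an illustration of what such an aggregation mapping computes, here is a minimal sketch in Python/pandas; the column names (product_id, sale_date, amount) and the sample data are assumptions for illustration, since the deck specifies the mapping only at the conceptual level.

```python
import pandas as pd

# Hypothetical source data; column names are illustrative, not from the deck.
sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2],
    "sale_date":  pd.to_datetime(["2007-01-15", "2007-02-20", "2007-01-10", "2007-04-05"]),
    "amount":     [100.0, 150.0, 80.0, 60.0],
})

# Quarterly sales per product, as the aggregation node of the diagram describes.
quarterly = (
    sales.assign(quarter=sales["sale_date"].dt.to_period("Q"))
         .groupby(["product_id", "quarter"], as_index=False)["amount"]
         .sum()
)
print(quarterly)
```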

PrOPr Conceptual Model: Skoutas' annotations
- Application vocabulary:
  V_C = {product, store}
  V_P,product = {pid, pName, quantity, price, type, storage}
  V_P,store = {sid, sName, city, street}
  V_F,pid = {source_pid, dw_pid}
  V_F,sid = {source_sid, dw_sid}
  V_F,price = {dollars, euros}
  V_T,type = {software, hardware}
  V_T,city = {paris, rome, athens}
- Datastore mappings
- Datastore annotation
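Purely as an illustration, the vocabulary above could be mirrored in plain data structures as follows. This is a sketch only: the original approach expresses these annotations in an ontology rather than in code, and the grouping of V_F and V_T as "formats" and "terms" is our reading of the slide, not a quotation; the final DS1_Products mapping is hypothetical.

```python
# V_C: concepts; V_P: properties per concept; V_F and V_T: per-property sets,
# read here as alternative formats and admissible terms. The variable names
# mirror the slide's vocabulary symbols.
V_C = {"product", "store"}

V_P = {
    "product": {"pid", "pName", "quantity", "price", "type", "storage"},
    "store":   {"sid", "sName", "city", "street"},
}

V_F = {
    "pid":   {"source_pid", "dw_pid"},
    "sid":   {"source_sid", "dw_sid"},
    "price": {"dollars", "euros"},
}

V_T = {
    "type": {"software", "hardware"},
    "city": {"paris", "rome", "athens"},
}

# A datastore annotation could then map a concrete recordset's attributes onto
# this vocabulary, e.g. (hypothetical):
DS1_Products = {"pid": ("pid", "source_pid"), "cost": ("price", "dollars")}
```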

PrOPr Conceptual Model: Skoutas' annotations
- The class hierarchy
- Definition for class DS1_Products

PrOPr Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL

PrOPr Logical Model [figure: the reference logical-level scenario. Sources S1.PARTS and S2.PARTS are transferred via FTP1/FTP2 into the data staging area (DS.PSNEW1, DS.PSNEW2); DIFF1/DIFF2 compare them against the previous snapshots (DS.PSOLD1, DS.PSOLD2) on PKEY; the two flows then pass through surrogate-key assignment (SK1, SK2 over PKEY via LOOKUP_PS.SKEY), a NotNull check, American-to-European date conversion (A2EDate), dollar-to-euro conversion ($2€), addition of COSTDATE=SYSDATE (AddDate) and of a SOURCE attribute (AddAttr2), and are unioned (U) into DW.PARTS; Aggregate1 (PKEY, DAY, MIN(COST)) and Aggregate2 (PKEY, MONTH, AVG(COST)) populate views V1 and V2; rejected rows are logged at every check.]

PrOPr Logical Model
Main question: what information should we put inside a metadata repository to be able to answer questions like:
- What is the architecture of my DW back stage?
- Which attributes/tables are involved in the population of an attribute?
- What part of the scenario is affected if we delete an attribute?
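To make the last two questions concrete, here is a minimal sketch of how provider relationships could be stored and queried; the adjacency-list encoding and the attribute names are assumptions for illustration, not the repository schema of the actual work. Backward reachability answers "which attributes are involved in populating X", and forward reachability answers "what is affected if X is deleted".

```python
# Provider edges of a tiny, hypothetical Architecture Graph:
# an edge (a -> b) in `providers` means node b provides data to node a.
providers = {
    "DW.PARTS.COST": ["$2E.COST_OUT"],
    "$2E.COST_OUT":  ["$2E.COST_IN"],
    "$2E.COST_IN":   ["DS.PS1.COST"],
    "DS.PS1.COST":   ["S1.PARTS.COST"],
}

# Invert the edges to get consumer (downstream) relationships.
consumers = {}
for node, srcs in providers.items():
    for src in srcs:
        consumers.setdefault(src, []).append(node)

def reachable(start, edges):
    """All nodes reachable from `start` by following `edges` (depth-first)."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        for m in edges.get(n, []):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

# Which attributes participate in populating DW.PARTS.COST?
print(reachable("DW.PARTS.COST", providers))
# What is affected if S1.PARTS.COST is deleted at the source?
print(reachable("S1.PARTS.COST", consumers))
```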

PrOPr Architecture Graph [figure: the same reference scenario drawn as an Architecture Graph, with recordsets, activities, and their attributes as nodes and provider relationships as edges; the labels (FTP1/2, DIFF1/2, SK1/2, NotNull, A2EDate, $2€, AddDate, AddAttr2 SOURCE, U, the aggregations, and the rejected-row logs) are those of the logical scenario above.]

PrOPr Architecture Graph Example 2

PrOPr Architecture Graph Example 2

PrOPr Optimization. Execution order: which is the proper execution order?

PrOPr Optimization. Execution order and order equivalence: SK, f1, f2 or SK, f2, f1 or ...?

PrOPr Logical Optimization: Can we push a selection early enough? Can we aggregate before the $2€ conversion takes place?
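A minimal sketch of why such a reordering can be safe, assuming a toy row layout and an invented exchange rate: a selection that reads only attributes the conversion does not touch produces the same result whether it runs before or after the conversion, and running it first reduces the rows the conversion must process.

```python
# Two logical activities from the running scenario, sketched as row transforms
# (the row layout and the exchange rate are illustrative assumptions).
RATE = 0.9  # hypothetical dollars -> euros rate

def dollars_to_euros(rows):          # the "$2E" activity
    for r in rows:
        yield {**r, "COST": r["COST"] * RATE}

def select_recent(rows):             # a selection that reads only DATE
    for r in rows:
        if r["DATE"] >= "2007-01-01":
            yield r

rows = [{"PKEY": 1, "DATE": "2006-12-30", "COST": 10.0},
        {"PKEY": 2, "DATE": "2007-03-01", "COST": 20.0}]

# The selection reads only DATE, which $2E does not modify, so the two orders
# are equivalent; filtering first feeds fewer rows to the conversion.
late  = list(select_recent(dollars_to_euros(rows)))
early = list(dollars_to_euros(select_recent(rows)))
assert late == early
```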

PrOPr Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL

PrOPr Logical to Physical [the same ETL-tool architecture, focusing on the optimizer]: "identify the best possible physical implementation for a given logical ETL workflow".

PrOPr Problem formulation
- Given a logical-level ETL workflow G_L,
- compute a physical-level ETL workflow G_P,
- such that the semantics of the workflow do not change, all constraints are met, and the cost is minimal.

PrOPr Solution
- We model the problem of finding the physical implementation of an ETL process as a state-space search problem.
- States: a state is a graph G_P that represents a physical-level ETL workflow. The initial state G_P^0 is produced after a random assignment of physical implementations to logical activities w.r.t. preconditions and constraints.
- Transitions: given a state G_P, a new state G_P' is generated by replacing the implementation of a physical activity a_P of G_P with another valid implementation for the same activity. Extension: introduction of a sorter activity (at the physical level) as a new node in the graph.
- Sorter introduction: intentionally introduce sorters to reduce execution and resumption costs.
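As a rough illustration of the search space involved (not the paper's algorithm or cost model), the sketch below enumerates assignments of hypothetical physical implementations to a few logical activities under an invented additive cost and keeps the cheapest; the approach described above instead explores states via the replacement transitions rather than exhaustively.

```python
from itertools import product

# Candidate physical implementations per logical activity, with toy per-row
# costs. Activity names echo the running scenario; the costs are invented.
IMPLS = {
    "SK1":     {"nested_loops_lookup": 9.0, "merge_lookup": 4.0},
    "NotNull": {"scan_check": 1.0},
    "$2E":     {"scan_convert": 1.0},
    "Agg1":    {"hash_group": 6.0, "sort_group": 3.0},
}
N_ROWS = 5_000

def cost(assignment):
    """Toy additive cost of one physical-level state (one implementation per activity)."""
    return sum(IMPLS[act][impl] * N_ROWS for act, impl in assignment.items())

def best_state():
    activities = list(IMPLS)
    best, best_cost = None, float("inf")
    for choice in product(*(IMPLS[a] for a in activities)):
        state = dict(zip(activities, choice))
        c = cost(state)
        if c < best_cost:
            best, best_cost = state, c
    return best, best_cost

print(best_state())
```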

PrOPr Sorters: impact
- We intentionally introduce orderings (via appropriate physical-level sorter activities) towards obtaining physical plans of lower cost.
- Semantics: unaffected.
- Price to pay: the cost of sorting the stream of processed data.
- Gain: order-aware algorithms can significantly reduce processing cost, and the sorting cost can be amortized over activities that utilize common useful orderings.

PrOPr Sorter gains
- Without order: cost(σ_i) = n; cost_SO(γ) = n*log2(n) + n.
- With an appropriate order: cost(σ_i) = sel_i * n; cost_SO(γ) = n.
- Example with n = 5,000: Cost(G) = … * [5,000*log2(5,000) + 5,000] = …; if sorter S_{A,B} is added to V, Cost(G') = … * 5,000 + [5,000*log2(5,000) + 5,000] = …
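A small worked computation of these cost formulas. The selectivity, the cost attributed to the sorter itself, and the per-activity composition are assumptions; the script simply evaluates the two formulas for n = 5,000 rather than reproducing the slide's full numeric example.

```python
import math

n = 5_000
sel = 0.2  # hypothetical selectivity of a selection activity

# Without a useful order on the input stream:
cost_sel_unordered = n                         # cost(sigma_i) = n
cost_grp_unordered = n * math.log2(n) + n      # cost_SO(gamma) = n*log2(n) + n

# With an appropriate order (e.g., provided by an upstream sorter S_{A,B}):
cost_sel_ordered = sel * n                     # cost(sigma_i) = sel_i * n
cost_grp_ordered = n                           # cost_SO(gamma) = n

sorter_cost = n * math.log2(n) + n             # assumed cost of the sorter itself

print("selection:", cost_sel_unordered, "->", cost_sel_ordered)
print("group-by: ", round(cost_grp_unordered), "->", cost_grp_ordered)
print("sorter cost to amortize:", round(sorter_cost))
```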

PrOPr Interesting orders: A asc, A desc, {A, B}, [A, B].

PrOPr Outline: Introduction, Conceptual Level, Logical Level, Physical Level, Provenance & ETL

PrOPr A principled architecture for ETL [the same ETL-tool architecture, annotated with WHY, WHAT, and HOW at its successive levels].

PrOPr Logical Model: Questions revisited
What information should we put inside a metadata repository to be able to answer questions like:
- What is the architecture of my DW back stage? It is described as the Architecture Graph.
- Which attributes/tables are involved in the population of an attribute? What part of the scenario is affected if we delete an attribute? Follow the appropriate paths in the Architecture Graph.

PrOPr Fundamental questions on provenance & ETL
- Why do we have a certain record in the DW? Because there is a process (described by the Architecture Graph at the logical level, plus the conceptual model) that produces tuples of this kind.
- Where did this record come from in my DW? Hard! If there is a way to derive an "inverse" workflow that links the DW tuples to their sources, you can answer it. This is not always possible: transformations are not invertible, and a DW is supposed to progressively summarize data. See Widom's work on record lineage.
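For the cases where it is possible, here is a minimal sketch of such an "inverse" walk, under the assumption (ours, not the deck's) that each activity logs, per output row, the keys of the input rows it consumed; lineage then reduces to following these logs backwards.

```python
# Per-activity lineage log: output key -> keys of the input rows that produced it.
# Activity names follow the running scenario; the row keys are invented.
lineage = {
    "Aggregate1": {("P1", "2007-03-01"): ["r17", "r42"]},
    "SK1":        {"r17": ["s1_rowA"], "r42": ["s1_rowB"]},
}
# Order in which activities are traversed, from the DW back towards the source.
backward_order = ["Aggregate1", "SK1"]

def trace(output_keys):
    """Follow the per-activity logs from DW-side keys back to source row ids."""
    current = list(output_keys)
    for activity in backward_order:
        step = lineage[activity]
        current = [src for k in current for src in step.get(k, [])]
    return current

# Which source rows contributed to the aggregated DW tuple (P1, 2007-03-01)?
print(trace([("P1", "2007-03-01")]))
```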

PrOPr Fundamental questions on provenance & ETL
- How are updates to the sources managed? (The update takes place at the source; the DW and data marts must be updated.) Done, although in a tedious way: mainly log sniffing, plus "diff" comparison of extracted snapshots.
- When errors are discovered during the ETL process, how are they handled? (The update takes place at the data staging area; the sources must be updated.) Too hard to "back-fuse" data into the sources, for both political and workload reasons. Currently, this is not automated.
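A minimal sketch of the "diff of extracted snapshots" technique, mirroring the DIFF activities of the running scenario; the snapshot layout and the key column are assumptions for illustration.

```python
# Two successive extraction snapshots, keyed by PKEY (layout is illustrative).
old_snap = {1: {"PKEY": 1, "COST": 10.0}, 2: {"PKEY": 2, "COST": 20.0}}
new_snap = {1: {"PKEY": 1, "COST": 10.0}, 2: {"PKEY": 2, "COST": 25.0},
            3: {"PKEY": 3, "COST": 7.0}}

# Classify the delta to be propagated towards the DW.
inserted = [new_snap[k] for k in new_snap.keys() - old_snap.keys()]
deleted  = [old_snap[k] for k in old_snap.keys() - new_snap.keys()]
updated  = [new_snap[k] for k in new_snap.keys() & old_snap.keys()
            if new_snap[k] != old_snap[k]]

print("insert:", inserted)
print("delete:", deleted)
print("update:", updated)
```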

PrOPr Fundamental questions on provenance & ETL
- What happens if there are updates to the schema of the involved data sources? Currently this is not automated, although automating the task is part of the detail-independence vision.
- What happens if we must update the workflow structure and semantics? Nothing is versioned back; still, there are not really any user requests for this to be supported.
- What is the equivalent of citations in ETL? ... nothing, really ...

PrOPr Thank you!