MAIME: A Maintenance Manager for ETL Processes


Dariuš Butkevičius, Philipp D. Freiberger, Frederik M. Halberg, Jacob B. Hansen, Søren Jensen, Michael Tarp, "Harry" Xuegang Huang, Christian Thomsen

Motivation
- A Data Warehouse (DW) contains data from a number of External Data Sources (EDSs).
- To populate a DW, an Extract-Transform-Load (ETL) process is used.
- It is well known that constructing the ETL process is very time-consuming.

Motivation
- Maintaining ETL processes after deployment, however, also takes much time.
- Real examples:
  - A pension and insurance company applies weekly changes to its software systems. The BI team then has to update the ETL processes.
  - A facility management company has more than 10,000 ETL processes to execute daily. When there is a change in the source systems, the BI team has to find and fix the broken ones.
  - The ETL team at an online gaming-engine vendor has to deal with daily changes in the format of data from web services.
- Maintenance of ETL processes requires manual work and is time-consuming and error-prone.

MAIME
- To remedy these problems, we propose the tool MAIME, which can detect schema changes in EDSs and (semi-)automatically repair the affected ETL processes.
- MAIME works with SQL Server Integration Services (SSIS) and SQL Server.
  - Among the top three most used tools (Gartner).
  - SSIS offers an API which makes it possible to change ETL processes programmatically.
- The current prototype supports Aggregate, Conditional Split, Data Conversion, Derived Column, Lookup, Sort, and Union All as well as OLE DB Source and OLE DB Destination.

Overview of MAIME

Overview of MAIME
- The Change Manager captures metadata from the EDSs.
  - The current snapshot is compared to the previous snapshot, and a list of changes is produced.
- The Maintenance Manager loads the SSIS Data Flow tasks and creates a graph model as an abstraction.
  - The graph model makes it easy to represent dependencies between columns.
  - Based on the identified changes in the EDSs, the graph model is updated.
  - When a change is made in the graph model, corresponding changes are applied to the SSIS Data Flow.
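
The snapshot comparison done by the Change Manager can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the snapshot layout (`{table: {column: data_type}}`), the `Change` record, the change-kind names, and the `diff_snapshots` function are hypothetical and are not MAIME's actual data structures. A purely name-based comparison like this one reports a rename as a deletion plus an addition; the slides do not say how MAIME recognizes renames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    kind: str    # assumed labels: "Addition", "Deletion", "TypeChange"
    table: str
    column: str


def diff_snapshots(previous, current):
    """Compare two {table: {column: data_type}} snapshots and list the changes."""
    changes = []
    # Only tables present in both snapshots are compared in this sketch.
    for table in sorted(previous.keys() & current.keys()):
        old_cols, new_cols = previous[table], current[table]
        for col in sorted(old_cols.keys() - new_cols.keys()):
            changes.append(Change("Deletion", table, col))
        for col in sorted(new_cols.keys() - old_cols.keys()):
            changes.append(Change("Addition", table, col))
        for col in sorted(old_cols.keys() & new_cols.keys()):
            if old_cols[col] != new_cols[col]:
                changes.append(Change("TypeChange", table, col))
    return changes


# Hypothetical snapshots of a Sale table before and after a column is dropped.
previous = {"Sale": {"ID": "int", "TotalAmount": "money"}}
current = {"Sale": {"ID": "int"}}
changes = diff_snapshots(previous, current)
```

The resulting list of changes is what the Maintenance Manager would then apply to the graph model.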

The Graph Model
- An acyclic property graph G = (V, E), where a vertex v ∈ V represents a transformation and an edge (v1, v2, columns) represents that columns are transferred from v1 to v2.
  - The transferred columns are "put on" the edges. This is advantageous for transformations with multiple outgoing edges, where each edge can transfer a different set of columns.
- Our vertices have multiple properties.
  - A property is a key-value pair. We use the notation v.property.
  - The specific properties depend on the represented transformation type, but all vertices have name, type, and dependencies, except OLE DB Destination, which has no dependencies.
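
To make the structure concrete, here is a small Python sketch of a property graph with columns stored on the edges. The class names (`Vertex`, `Graph`), the method names, the vertex names, and the example columns are all hypothetical; only the three common properties (name, type, dependencies) and the edge shape (v1, v2, columns) come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    type: str
    # Maps an output column to the set of input columns it depends on.
    dependencies: dict = field(default_factory=dict)


@dataclass
class Graph:
    vertices: dict = field(default_factory=dict)  # name -> Vertex
    edges: list = field(default_factory=list)     # (source, target, columns)

    def add_vertex(self, vertex):
        self.vertices[vertex.name] = vertex

    def add_edge(self, source, target, columns):
        # The transferred columns are stored on the edge itself.
        self.edges.append((source, target, frozenset(columns)))

    def outgoing(self, name):
        return [edge for edge in self.edges if edge[0] == name]


# Hypothetical flow: one transformation with two outgoing edges that carry
# different column sets.
g = Graph()
g.add_vertex(Vertex("Lookup", "Lookup"))
g.add_vertex(Vertex("Matched", "OLE DB Destination"))
g.add_vertex(Vertex("Unmatched", "OLE DB Destination"))
g.add_edge("Lookup", "Matched", {"ID", "Age", "TotalAmount"})
g.add_edge("Lookup", "Unmatched", {"ID", "Age"})
```

Keeping the columns on the edges, rather than on the vertex, is what lets the two outputs differ without any extra bookkeeping.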

The Graph Model – dependencies
- dependencies shows how columns depend on each other.
  - If an Aggregate transformation computes c' as the average of c, then c' depends on c.
- Formally, dependencies is a mapping from an output column o to a set of input columns {c1, …, cn}.
  - We say that o is dependent on {c1, …, cn} and denote this o → {c1, …, cn}.
- We also have trivial dependencies, where c depends on c.

Examples – dependencies
- Aggregate: For each output column o computed as AGG(i), o depends on i.
- Derived Column: Each derived column o depends on the set of columns used in the expression defining o, in addition to the trivial dependencies.
- Lookup: Each output column o depends on the set of input columns used in the lookup (i.e., the equi-join), in addition to the trivial dependencies.
- Conditional Split: Only trivial dependencies.
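
The rules above can be written down as small Python functions that build the dependencies mapping for a few transformation types. This is only a sketch: the function names and the input formats are invented for illustration, and the claim that AmountTimes10 is derived from TotalAmount is an assumption based on the later example.

```python
def aggregate_dependencies(aggregations):
    """aggregations: {output column: the input column it aggregates}."""
    return {out: {inp} for out, inp in aggregations.items()}


def derived_column_dependencies(input_columns, derivations):
    """derivations: {derived column: set of columns used in its expression}."""
    deps = {col: {col} for col in input_columns}   # trivial dependencies
    for out, used in derivations.items():
        deps[out] = set(used)
    return deps


def lookup_dependencies(input_columns, join_columns, looked_up_columns):
    """Each looked-up column depends on the input columns used in the equi-join."""
    deps = {col: {col} for col in input_columns}   # trivial dependencies
    for out in looked_up_columns:
        deps[out] = set(join_columns)
    return deps


def conditional_split_dependencies(input_columns):
    """Only trivial dependencies."""
    return {col: {col} for col in input_columns}


# Hypothetical columns; AmountTimes10 is assumed to be computed from TotalAmount.
derived = derived_column_dependencies(
    ["ID", "Age", "TotalAmount"], {"AmountTimes10": {"TotalAmount"}}
)
looked_up = lookup_dependencies(["ID", "Age"], ["ID"], ["TotalAmount"])
averaged = aggregate_dependencies({"AvgAge": "Age"})
```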

Other Specific Properties

Policies
- For a change type in the EDS and a vertex type, a policy defines what to do.
  - For example, p(Deletion, Aggregate) = Propagate.
- Propagate means that vertices of the given type are repaired if a change of the given type renders them invalid.
- Block means that a vertex of the given type (or any of its descendants) will not be repaired.
  - Alternatively, it can optionally mean "Don't repair anything if the flow contains a vertex of the given type and the given change type occurred".
- Prompt means "Ask the user".
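
A policy function p(change type, vertex type) is naturally a lookup table. The Python sketch below shows one possible encoding. Only the entry p(Deletion, Aggregate) = Propagate comes from the slide; the other entries, the `lookup_policy` name, and the choice of Prompt as a fallback are assumptions made for illustration.

```python
from enum import Enum


class Policy(Enum):
    PROPAGATE = "Propagate"
    BLOCK = "Block"
    PROMPT = "Prompt"


# (change type, vertex type) -> policy.
policies = {
    ("Deletion", "Aggregate"): Policy.PROPAGATE,    # example given on the slide
    ("Deletion", "Derived Column"): Policy.PROPAGATE,  # assumed entry
    ("Rename", "Lookup"): Policy.BLOCK,                # assumed entry
}


def lookup_policy(change_type, vertex_type, default=Policy.PROMPT):
    """Return the configured policy; falling back to Prompt is an assumption."""
    return policies.get((change_type, vertex_type), default)


chosen = lookup_policy("Deletion", "Aggregate")
```

Falling back to Prompt keeps unconfigured combinations from being repaired silently, which seems the cautious choice for a sketch like this.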

Policies

Example
[Figure: an SSIS data flow with annotated transformations: "Extracts all from Person", "Looks up TotalAmount", "Computes AmountTimes10"]

Example
- Now assume the following changes:
  - Age is renamed to RenamedAge in the Person table.
  - TotalAmount is deleted from the Sale table.
- MAIME will traverse the graph to detect problems and apply fixes (i.e., propagate changes).
  - Renames are easily applied everywhere.
  - For deletions, dependencies are updated for each vertex.
  - From the dependencies, MAIME sees that AmountTimes10 in Derived Column depends on something that no longer exists, so the derivation is removed (but the transformation stays).

Example
- It is also detected that one of the edges from the Conditional Split can no longer be taken.
  - The edge is removed.
  - Its destination is also removed since it no longer has any incoming edges.
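
The deletion handling walked through in this example can be approximated in a few lines of Python. This is a simplified sketch, not MAIME's algorithm: it assumes column names are unique across the flow, that vertices are supplied in topological order, and that each Conditional Split edge records which columns its condition uses. The function name, the data layout, the vertex names, and the split conditions are all hypothetical.

```python
def propagate_deletion(order, dependencies, edges, conditions, deleted):
    """Propagate deleted columns through a data flow.

    order:        vertex names in topological order
    dependencies: {vertex: {output column: set of input columns}}, updated in place
    edges:        list of (source, target) pairs
    conditions:   {(source, target): columns used by that edge's split condition}
    deleted:      columns deleted from the external data source
    """
    missing = set(deleted)
    sources = {v for v in order if all(target != v for _, target in edges)}

    # 1. Drop every output column that is gone or depends on a missing column.
    #    The vertex itself stays, even if it loses a derivation.
    for v in order:
        deps = dependencies[v]
        broken = [out for out, ins in deps.items() if out in missing or ins & missing]
        for out in broken:
            del deps[out]
            missing.add(out)

    # 2. Remove edges whose condition refers to a missing column.
    edges = [e for e in edges if not (conditions.get(e, set()) & missing)]

    # 3. Repeatedly remove non-source vertices that lost all incoming edges.
    alive = list(order)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if v not in sources and all(target != v for _, target in edges):
                alive.remove(v)
                edges = [e for e in edges if e[0] != v]
                changed = True
    return alive, edges


# Hypothetical flow loosely modelled on the example.
order = ["Source", "Lookup", "Derived Column", "Conditional Split", "Dest A", "Dest B"]
dependencies = {
    "Source": {"ID": {"ID"}, "Age": {"Age"}},
    "Lookup": {"ID": {"ID"}, "Age": {"Age"}, "TotalAmount": {"ID"}},
    "Derived Column": {"ID": {"ID"}, "Age": {"Age"},
                       "TotalAmount": {"TotalAmount"},
                       "AmountTimes10": {"TotalAmount"}},
    "Conditional Split": {"ID": {"ID"}, "Age": {"Age"},
                          "TotalAmount": {"TotalAmount"},
                          "AmountTimes10": {"AmountTimes10"}},
    "Dest A": {},
    "Dest B": {},
}
edges = [("Source", "Lookup"), ("Lookup", "Derived Column"),
         ("Derived Column", "Conditional Split"),
         ("Conditional Split", "Dest A"), ("Conditional Split", "Dest B")]
conditions = {("Conditional Split", "Dest A"): {"AmountTimes10"},
              ("Conditional Split", "Dest B"): {"Age"}}

alive, remaining = propagate_deletion(order, dependencies, edges, conditions, {"TotalAmount"})
```

In this run the Derived Column vertex survives without its AmountTimes10 derivation, the edge to "Dest A" is dropped because its condition used AmountTimes10, and "Dest A" is then removed for lack of incoming edges.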

Result

Comparison to Manual Approach
                  Manual                                     MAIME
                  1st attempt   2nd attempt   3rd attempt
Time (seconds)    187           159           59             4
Keystrokes        23            15            12             (missing)
Mouse clicks      88            85            38             (missing)

Conclusion
- Maintenance of ETL processes after deployment is time-consuming.
- We presented MAIME, which detects schema changes and then identifies the affected places in the ETL processes.
- The ETL processes can be repaired automatically, sometimes by removing transformations and edges.
- Positive feedback from BI consultancy companies.
- In the future, the destination database could be modified, e.g., when a column has been added to the source or has changed its type.

Related Work
- Hecataeus by G. Papastefanatos, P. Vassiliadis, A. Simitsis, and Y. Vassiliou
  - Abstracts ETL processes as SQL queries, represented by graphs with subgraphs.
  - Detects evolution events and proposes changes to the ETL processes based on policies: Propagate (readjust graph), Block (keep old semantics), Prompt.
  - Policies can be specified for each vertex/edge.
- E-ETL by A. Wojciechowski
  - Models ETL processes through SQL queries.
  - Policies: Propagate, Block, Prompt.
  - Different ways to handle changes: Standard Rules, Defined Rules, Alternative Scenarios.