MAIME: A Maintenance Manager for ETL Processes
Dariuš Butkevičius, Philipp D. Freiberger, Frederik M. Halberg, Jacob B. Hansen, Søren Jensen, Michael Tarp, "Harry" Xuegang Huang, Christian Thomsen
Motivation
A Data Warehouse (DW) contains data from a number of External Data Sources (EDSs).
To populate a DW, an Extract-Transform-Load (ETL) process is used.
It is well known that it is very time-consuming to construct the ETL process.
Motivation
Maintaining ETL processes after deployment, however, also takes much time.
Real examples:
  A pension and insurance company applies weekly changes to its software systems. The BI team then has to update the ETL processes.
  A facility management company has more than 10,000 ETL processes to execute daily. When there is a change in the source systems, the BI team has to find and fix the broken ones.
  The ETL team at an online gaming-engine vendor has to deal with daily changes in the format of data from web services.
Maintenance of ETL processes requires manual work and is time-consuming and error-prone.
MAIME
To remedy these problems, we propose the tool MAIME, which can detect schema changes in EDSs and (semi-)automatically repair the affected ETL processes.
MAIME works with SQL Server Integration Services (SSIS) and SQL Server.
  SSIS is among the top-3 most used tools (Gartner).
  SSIS offers an API which makes it possible to change ETL processes programmatically.
The current prototype supports Aggregate, Conditional Split, Data Conversion, Derived Column, Lookup, Sort, and Union All as well as OLE DB Source and OLE DB Destination.
Overview of MAIME
Overview of MAIME
The Change Manager captures metadata from the EDSs.
  The current snapshot is compared to the previous snapshot and a list of changes is produced.
The Maintenance Manager loads the SSIS Data Flow tasks and creates a graph model as an abstraction.
  The graph model makes it easy to represent dependencies between columns.
  Based on the identified changes in the EDSs, the graph model is updated.
  When we make a change in the graph model, corresponding changes are applied to the SSIS Data Flow.
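The snapshot comparison done by the Change Manager can be illustrated with a minimal Python sketch. It assumes, hypothetically, that a snapshot maps each table name to its columns keyed by a stable column id, so that a rename can be told apart from a deletion followed by an addition. The slides do not say how MAIME stores snapshots; the names `Change` and `diff_snapshots`, the `ID` columns, and the data types are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    kind: str                        # "Addition", "Deletion", "Rename" or "DataTypeChange"
    table: str
    column: str                      # name in the previous snapshot (current name for additions)
    new_value: Optional[str] = None  # new name or new data type, when applicable


def diff_snapshots(previous, current):
    """Compare two snapshots of the form {table: {column_id: (name, data_type)}}."""
    changes = []
    for table in sorted(previous.keys() & current.keys()):
        old_cols, new_cols = previous[table], current[table]
        for col_id in sorted(old_cols.keys() - new_cols.keys()):
            changes.append(Change("Deletion", table, old_cols[col_id][0]))
        for col_id in sorted(new_cols.keys() - old_cols.keys()):
            changes.append(Change("Addition", table, new_cols[col_id][0]))
        for col_id in sorted(old_cols.keys() & new_cols.keys()):
            old_name, old_type = old_cols[col_id]
            new_name, new_type = new_cols[col_id]
            if old_name != new_name:
                changes.append(Change("Rename", table, old_name, new_name))
            if old_type != new_type:
                changes.append(Change("DataTypeChange", table, old_name, new_type))
    return changes


# Hypothetical snapshots; column ids, ID columns and data types are made up.
previous = {
    "Person": {1: ("ID", "int"), 2: ("Age", "int")},
    "Sale": {1: ("ID", "int"), 2: ("TotalAmount", "money")},
}
current = {
    "Person": {1: ("ID", "int"), 2: ("RenamedAge", "int")},
    "Sale": {1: ("ID", "int")},
}
changes = diff_snapshots(previous, current)
```

Here `changes` holds one rename for `Person.Age` and one deletion for `Sale.TotalAmount`; a list of this kind is what the Maintenance Manager would consume.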
The Graph Model
An acyclic property graph G = (V, E), where a vertex v ∈ V represents a transformation and an edge (v1, v2, columns) represents that columns are transferred from v1 to v2.
  The transferred columns are "put on" the edges. This is advantageous for transformations with multiple outgoing edges, where each edge can transfer a different set of columns.
Our vertices have multiple properties.
  A property is a key-value pair. We use the notation v.property.
  The specific properties depend on the represented transformation type, but all have name, type, and dependencies, except OLE DB Destination, which has no dependencies.
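As a rough illustration of this model, the following Python sketch stores the transferred columns on the edges and keeps name, type, and dependencies as vertex properties. The class names, method names, and the example vertices are assumptions made for this sketch and are not taken from MAIME itself.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str                                         # v.name
    type: str                                         # v.type, e.g. "Lookup"
    dependencies: dict = field(default_factory=dict)  # output column -> set of input columns


@dataclass
class Graph:
    vertices: dict = field(default_factory=dict)  # vertex name -> Vertex
    edges: list = field(default_factory=list)     # (v1 name, v2 name, frozenset of columns)

    def add_vertex(self, vertex):
        self.vertices[vertex.name] = vertex

    def add_edge(self, v1, v2, columns):
        # The transferred columns are stored on the edge itself.
        self.edges.append((v1, v2, frozenset(columns)))

    def out_edges(self, name):
        return [e for e in self.edges if e[0] == name]

    def is_acyclic(self):
        # Kahn's algorithm: repeatedly remove vertices without incoming edges.
        indegree = {name: 0 for name in self.vertices}
        for _, v2, _ in self.edges:
            indegree[v2] += 1
        ready = [name for name, degree in indegree.items() if degree == 0]
        visited = 0
        while ready:
            name = ready.pop()
            visited += 1
            for _, v2, _ in self.out_edges(name):
                indegree[v2] -= 1
                if indegree[v2] == 0:
                    ready.append(v2)
        return visited == len(self.vertices)


# Hypothetical flow: a Lookup whose two outgoing edges carry different columns.
# Dependencies of the source are omitted here for brevity.
g = Graph()
g.add_vertex(Vertex("Source", "OLE DB Source"))
g.add_vertex(Vertex("Lookup", "Lookup", {"ID": {"ID"}, "TotalAmount": {"ID"}}))
g.add_vertex(Vertex("Matched", "OLE DB Destination"))
g.add_vertex(Vertex("Unmatched", "OLE DB Destination"))
g.add_edge("Source", "Lookup", {"ID"})
g.add_edge("Lookup", "Matched", {"ID", "TotalAmount"})
g.add_edge("Lookup", "Unmatched", {"ID"})
```

Keeping the columns on the edges lets the two outputs of the same vertex differ without duplicating the vertex.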
The Graph Model – dependencies
dependencies shows how columns depend on each other.
  If an Aggregate transformation computes c’ as the average of c, we have that c’ depends on c.
Formally, dependencies is a mapping from an output column o to a set of input columns {c1, …, cn}.
  We say that o is dependent on {c1, …, cn} and denote this o → {c1, …, cn}.
We also have trivial dependencies where c depends on c.
Examples – dependencies
Aggregate: For each output column o computed as AGG(i), o depends on i.
Derived Column: Each derived column o depends on the set of columns used in the expression defining o. Trivial dependencies apply in addition.
Lookup: Each output column o depends on the set of input columns used in the lookup (i.e., the equi-join). Trivial dependencies apply in addition.
Conditional Split: Only trivial dependencies.
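The four rules above can be written down as small Python helpers that build the dependencies mapping (output column to set of input columns). This is a sketch under assumptions: the function names and signatures are hypothetical, the columns used in a derived expression are passed in explicitly rather than parsed from an SSIS expression, and the example column names other than TotalAmount and AmountTimes10 are made up.

```python
def trivial(columns):
    """Trivial dependencies: every column c depends on c."""
    return {c: {c} for c in columns}


def aggregate_deps(aggregations):
    """aggregations maps each output o to the input i it aggregates, o = AGG(i)."""
    return {o: {i} for o, i in aggregations.items()}


def derived_column_deps(input_columns, derivations):
    """derivations maps each derived column to the columns used in its expression."""
    deps = trivial(input_columns)
    deps.update({o: set(used) for o, used in derivations.items()})
    return deps


def lookup_deps(input_columns, join_columns, looked_up_columns):
    """Each looked-up column depends on the input columns used in the equi-join."""
    deps = trivial(input_columns)
    deps.update({o: set(join_columns) for o in looked_up_columns})
    return deps


def conditional_split_deps(input_columns):
    """A Conditional Split only passes columns through."""
    return trivial(input_columns)


# Hypothetical examples.
agg = aggregate_deps({"AvgAge": "Age"})
derived = derived_column_deps(["ID", "TotalAmount"], {"AmountTimes10": {"TotalAmount"}})
lookup = lookup_deps(["ID"], join_columns=["ID"], looked_up_columns=["TotalAmount"])
split = conditional_split_deps(["ID", "Age"])
```

For instance, `derived` maps AmountTimes10 to {TotalAmount} while ID and TotalAmount keep their trivial dependencies.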
Other Specific Properties
Policies
For a change type in the EDS and a vertex type, a policy defines what to do.
  For example, p(Deletion, Aggregate) = Propagate.
Propagate means: repair vertices of the given type if a change of the given type renders them invalid.
Block means that a vertex of the given type (or any of its descendants) will not be repaired.
  Instead, it can optionally mean "Don't repair anything if the flow contains a vertex of the given type and the given change type occurred".
Prompt means "Ask the user".
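One way to picture the policy function p is as a lookup table keyed by (change type, vertex type). The Python sketch below is only illustrative: apart from p(Deletion, Aggregate) = Propagate, the table entries are invented, and falling back to Prompt for unlisted combinations is an assumption of this sketch, not something the slides state. The helper `decide` models the first reading of Block, where a blocked vertex also blocks its descendants.

```python
from enum import Enum


class Policy(Enum):
    PROPAGATE = "Propagate"
    BLOCK = "Block"
    PROMPT = "Prompt"


POLICIES = {
    ("Deletion", "Aggregate"): Policy.PROPAGATE,     # example from the slide
    ("Deletion", "Lookup"): Policy.BLOCK,            # hypothetical entry
    ("Rename", "Derived Column"): Policy.PROPAGATE,  # hypothetical entry
}


def p(change_type, vertex_type, default=Policy.PROMPT):
    """Look up the policy; the Prompt default is an assumption of this sketch."""
    return POLICIES.get((change_type, vertex_type), default)


def decide(change_type, path_vertex_types):
    """Policy for the last vertex on a path from the source.

    path_vertex_types lists the vertex types from the source down to the
    vertex in question. If the vertex or any ancestor is blocked, the
    vertex is not repaired.
    """
    policies = [p(change_type, t) for t in path_vertex_types]
    if Policy.BLOCK in policies:
        return Policy.BLOCK
    return policies[-1]
```

With this table, `decide("Deletion", ["Lookup", "Aggregate"])` yields Block because the Aggregate is a descendant of a blocked Lookup, whereas `decide("Deletion", ["Aggregate"])` yields Propagate.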
Policies
Example
[Figure: example SSIS Data Flow with the annotations "Extracts all from Person", "Looks up TotalAmount", and "Computes AmountTimes10".]
Example
Now assume the following changes:
  Age is renamed to RenamedAge in the Person table.
  TotalAmount is deleted from the Sale table.
MAIME will traverse the graph to detect problems and apply fixes (i.e., propagate changes).
  Renames are easily applied everywhere.
  For deletions, dependencies are updated for each vertex.
  From the dependencies, MAIME sees that AmountTimes10 in Derived Column depends on something that does not exist anymore.
  The derivation is removed (but the transformation stays).
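The propagation of a rename and a deletion through the dependencies can be sketched in Python as follows. This is a simplified illustration, not MAIME's implementation: it walks a linear chain of vertices instead of a full graph, it treats an output as invalid as soon as any column it depends on is gone, and it assumes that AmountTimes10 is derived from TotalAmount. The function names and the `ID` column are hypothetical.

```python
def apply_rename(dependencies, old, new):
    """Rename a column wherever it occurs, as output or as input."""
    return {
        (new if out == old else out): {new if c == old else c for c in cols}
        for out, cols in dependencies.items()
    }


def apply_deletion(dependencies, deleted):
    """Drop every output column that depends on a deleted column.

    Returns the surviving dependencies and the set of removed outputs,
    which successors must in turn treat as deleted.
    """
    kept, removed = {}, set()
    for out, cols in dependencies.items():
        if cols & deleted:
            removed.add(out)
        else:
            kept[out] = cols
    return kept, removed


def propagate_deletion(chain, deleted):
    """Walk a linear chain of (name, dependencies) pairs in flow order."""
    repaired = []
    for name, dependencies in chain:
        kept, removed = apply_deletion(dependencies, deleted)
        repaired.append((name, kept))  # the vertex stays, only columns go
        deleted = removed
    return repaired


# Hypothetical chain after TotalAmount has disappeared upstream.
chain = [
    ("Derived Column", {"ID": {"ID"},
                        "TotalAmount": {"TotalAmount"},
                        "AmountTimes10": {"TotalAmount"}}),
    ("Conditional Split", {"ID": {"ID"},
                           "TotalAmount": {"TotalAmount"},
                           "AmountTimes10": {"AmountTimes10"}}),
]
repaired = propagate_deletion(chain, {"TotalAmount"})
renamed = apply_rename({"ID": {"ID"}, "Age": {"Age"}}, "Age", "RenamedAge")
```

In `repaired`, both vertices are still present but only the ID column survives, which mirrors the slide: the derivation is removed while the transformation stays.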
Example
It is also detected that one of the edges from the Conditional Split can no longer be taken.
  The edge is removed.
  Its destination is also removed since it no longer has any incoming edges.
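Removing the dead edge and then the destination it leaves behind can be sketched as a small pruning routine in Python. The function name, the vertex names, and the choice to repeat the pruning until no orphaned vertex remains are assumptions of this sketch; the slides only describe the single destination being removed.

```python
def remove_edge_and_prune(vertices, edges, sources, dead_edge):
    """Remove dead_edge, then drop non-source vertices without incoming edges.

    vertices: set of vertex names
    edges:    list of (from, to) pairs
    sources:  vertices that legitimately have no incoming edges
    """
    vertices = set(vertices)
    edges = [e for e in edges if e != dead_edge]
    while True:
        targets = {dst for _, dst in edges}
        orphans = {v for v in vertices if v not in targets and v not in sources}
        if not orphans:
            return vertices, edges
        vertices -= orphans
        # Edges leaving a removed vertex disappear with it.
        edges = [(src, dst) for src, dst in edges if src not in orphans]


# Hypothetical flow with two destinations behind the Conditional Split.
vertices = {"OLE DB Source", "Conditional Split", "Destination 1", "Destination 2"}
edges = [
    ("OLE DB Source", "Conditional Split"),
    ("Conditional Split", "Destination 1"),
    ("Conditional Split", "Destination 2"),
]
pruned_vertices, pruned_edges = remove_edge_and_prune(
    vertices, edges, sources={"OLE DB Source"},
    dead_edge=("Conditional Split", "Destination 2"),
)
```

After the call, Destination 2 is gone while the source, the Conditional Split, and Destination 1 remain connected.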
Result
Comparison to Manual Approach
                 Manual                                     MAIME
                 1st attempt   2nd attempt   3rd attempt
Time (seconds)   187           159           59             4
Keystrokes       23            15            12
Mouse clicks     88            85            38
Conclusion
Maintenance of ETL processes after deployment is time-consuming.
We presented MAIME, which detects schema changes and then identifies affected places in the ETL processes.
The ETL processes can be repaired automatically, sometimes by removing transformations and edges.
Positive feedback from BI consultancy companies.
In the future, the destination database could be modified, e.g., when a column has been added to the source or has changed its type.
Related Work
Hecataeus by G. Papastefanatos, P. Vassiliadis, A. Simitsis, and Y. Vassiliou
  Abstracts ETL processes as SQL queries, represented by graphs with subgraphs.
  Detects evolution events and proposes changes to the ETL processes based on policies.
    Propagate (readjust graph), Block (keep old semantics), Prompt.
  Policies can be specified for each vertex/edge.
E-ETL by A. Wojciechowski
  Models ETL processes through SQL queries.
  Policies: Propagate, Block, Prompt.
  Different ways to handle changes: Standard Rules, Defined Rules, Alternative Scenarios.