Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates
Alkis Simitsis (1), Panos Vassiliadis (2), Manolis Terrovitis (1), Spiros Skiadopoulos (1)
(1) National Technical University of Athens
(2) University of Ioannina
DaWaK'05, Copenhagen, August Outline Background & Motivation Database updates and composite transformations Zooming in and out the Architecture Graph Conclusions and Future Work
Extract-Transform-Load (ETL)
Motivation
[Figure: example ETL scenario from Sources through the DSA to the DW. Two sources (S1_PARTSUPP, S2_PARTSUPP) are fetched by FTP, differenced against previous snapshots (DIFF1, DIFF2), passed through activities including Add_SPK, SK, $2€, A2EDate, AddDate, NotNULL and CheckQTY (rejected rows are logged), united (U) into DW.PARTSUPP, and aggregated by Aggregate1 (MIN(COST) per PKEY, DAY) and Aggregate2 (AVG(COST) per PKEY, MONTH) into V1 and V2.]
Background
- Traditional workflow modeling has treated workflows as graphs with control-flow semantics.
- We take advantage of the data-centric, script-based nature of ETL activities to model their internals as a graph too, which we call the Architecture Graph.
- Our previous efforts handled simple cleanings and transformations [DMDW’02] and introduced templates to simplify the definition of scenarios [CAiSE’03].
Background [DMDW’02, CAiSE’03]
[Figure: Architecture Graph fragment showing recordset R with attributes A1-A4, the IN and OUT schemata of the activities NotNull_A1 and SK_A2, the label POPULATED_FIELD, and a legend.]
Black-box model: no semantics in the graph
Why is it important?
- What part of the scenario is affected if we delete an attribute?
- Which attributes/tables are involved in the population of an attribute?
  - Straightforward to follow the data propagation chain
- How “good” is my design of the ETL scenario?
  - Detection of inconsistencies
  - Detection of important, vulnerable or useless attributes
  - Well-defined quality measurement theory based on graphs
Graph-based modeling of data-centric workflows is important!
[Figure: graph of the AddSPK1 and SK1 activities with their IN and OUT schemata (PKEY, SUPPKEY, SKEY, SOURCE) and the LOOKUP_PS recordset, marking a vulnerable point in the scenario; accompanying table of transitive In, Out and Deg values for SKEY (808) against the averages per attribute and per entity.]
Contribution
- We incorporate update semantics in our graph-based modeling of ETL activities.
- We also cover complex transformations such as negation, aggregation and self-joins.
- We introduce a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction.
- Long version and ER’05: details for adding internal semantics
Updates and Transformations
In this paper, we add graph modeling techniques for several kinds of activities:
- Update (INS, UPD, DEL) activities
- Aggregates
- Rules employing negation and aliases
- Functions
We use LDL++ as the language that declaratively describes the semantics.
Updates to the database
An update expression is of the form
    head <- query part, update part
with the following semantics:
1. the query part is evaluated against the database to find the tuples that satisfy it;
2. the predicate of the update part is updated as specified in the rule.
    raise1(Name, Sal, NewSal) <-
        employee(Name, Sal), Sal = 1100,    (a)
        NewSal = Sal * 1.1,                 (b)
        - employee(Name, Sal),              (c)
        + employee(Name, NewSal).           (d)
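As a rough illustration of these two-step semantics, the sketch below evaluates raise1 over an in-memory set of tuples in Python. The set-of-tuples representation, the function name `raise1` and the sample data are assumptions made for illustration; they are not part of LDL++ or of the original slides.

```python
def raise1(employee):
    """Apply the raise1 rule to a set of (name, sal) tuples.

    Step 1 (query part): find the tuples with Sal = 1100 and compute NewSal.
    Step 2 (update part): delete the old tuple (-) and insert the new one (+).
    Returns the head tuples (Name, Sal, NewSal) and the updated relation.
    """
    # Query part: evaluated against the state before any update is applied.
    head = [(name, sal, round(sal * 1.1, 2))
            for (name, sal) in sorted(employee) if sal == 1100]

    updated = set(employee)  # work on a copy; the input is left untouched
    for name, sal, new_sal in head:
        updated.discard((name, sal))   # (c)  - employee(Name, Sal)
        updated.add((name, new_sal))   # (d)  + employee(Name, NewSal)
    return head, updated


# Hypothetical sample data.
employee = {("ann", 1100), ("bob", 900)}
head, employee_after = raise1(employee)
```

Only the tuple that satisfies the query part is replaced; the rest of the relation is carried over unchanged.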
Updates to the database
- A side-effect rule is treated as an activity, with the corresponding node.
- The output schema of the activity is derived from the structure of the predicate of the head of the rule.
- For every predicate with a + or – in the body of the rule, a respective provider edge from the output schema of the side-effect activity is assumed.
- For every predicate that appears in the rule without a + or – tag, we assume the respective input schema. Provider edges from this predicate towards these schemata are added as usual.
- The same applies for the attributes of the input and output schemata of the side-effect activity.
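The construction above can be sketched at the schema level as follows. This is a minimal Python sketch under stated assumptions: the function name `side_effect_rule_graph`, the `.OUT` / `.INn` naming of schema nodes, and the (whole, part) direction of part-of pairs are all hypothetical choices, and attribute-level nodes are left out for brevity.

```python
def side_effect_rule_graph(activity, body):
    """Build schema-level nodes and edges for one side-effect rule.

    body is a list of (tag, predicate) pairs, where tag is "+", "-" or None.
    Returns (nodes, part_of, provider); edges are sets of (source, target).
    """
    out_schema = f"{activity}.OUT"          # derived from the head predicate
    nodes = {activity, out_schema}
    part_of = {(activity, out_schema)}      # assumption: (whole, part) pairs
    provider = set()
    n_inputs = 0
    for tag, predicate in body:
        nodes.add(predicate)
        if tag in ("+", "-"):
            # Tagged predicate: the activity's output schema provides to it.
            provider.add((out_schema, predicate))
        else:
            # Untagged predicate: it populates a fresh input schema.
            n_inputs += 1
            in_schema = f"{activity}.IN{n_inputs}"
            nodes.add(in_schema)
            part_of.add((activity, in_schema))
            provider.add((predicate, in_schema))
    return nodes, part_of, provider


# The body of raise1: one untagged and two tagged occurrences of employee.
nodes, part_of, provider = side_effect_rule_graph(
    "raise1", [(None, "employee"), ("-", "employee"), ("+", "employee")])
```

Because edges are kept in a set, the deletion and the insertion on employee collapse into a single provider edge here; a fuller model might label the edge with the kind of update.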
Aggregation
Aggregation in LDL:
1. grouping of values to a bag, and
2. application of an aggregate function over the values of the bag.
    R16: aggregate1.a_in(skey,suppkey,date,qty,cost) <-
             dw.partsupp(skey,suppkey,date,qty,cost).
    R17: temp(skey,day, ) <-
             aggregate1.a_in(skey,suppkey,date,qty,cost).
    R18: aggregate1.a_out(skey,day,min_cost) <-
             temp(skey,day,all_costs),
             aggr(min,all_costs,min_cost).
    R19: v1(skey,day,min_cost) <-
             aggregate1.a_out(skey,day,min_cost).
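The two steps can be mirrored in a short Python sketch. The helper names `group_to_bags` and `apply_aggregate` and the sample rows are hypothetical, and the sketch assumes the date field already has day granularity, since the rules do not show how day is derived from date.

```python
from collections import defaultdict


def group_to_bags(rows):
    """Step 1: group the cost values into a bag (list) per (skey, day)."""
    bags = defaultdict(list)
    for skey, suppkey, day, qty, cost in rows:
        bags[(skey, day)].append(cost)
    return dict(bags)


def apply_aggregate(bags, func):
    """Step 2: apply an aggregate function over the values of each bag."""
    return {key: func(costs) for key, costs in bags.items()}


# Hypothetical contents of dw.partsupp(skey, suppkey, date, qty, cost).
partsupp = [
    (10, 1, "2005-08-22", 5, 4.0),
    (10, 2, "2005-08-22", 7, 3.5),
    (11, 1, "2005-08-22", 2, 9.0),
]
v1 = apply_aggregate(group_to_bags(partsupp), min)
```

Passing a different function (for example `max`) changes only the second step, which matches the idea of selecting the aggregate through a constant.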
Aggregation
- Relations which create a set from the values of a field employ a pair of regulator edges through an intermediate node ‘<>’.
- Provider relations for attributes used as groupers are tagged with ‘g’.
- One of the attributes of the aggr function node consumes data from a constant that indicates which aggregate function should be used (e.g., avg, min, max).
Zooming in and out
There is a principled way of zooming in and out at various levels of abstraction:
- The attribute level
- The schema level
- The activity level
Zooming in and out
For each node x of the architecture graph G(V,E) representing a schema:
1. for each provider edge (x_a, y) or (y, x_a), involving an attribute x_a of x and an entity y external to x, introduce the respective provider edge between x and y (unless it already exists);
2. remove the provider edges (x_a, y) and (y, x_a) of the previous step;
3. remove the nodes of the attributes of x and the respective part-of edges.
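The three steps can be sketched in Python as below. The graph encoding (sets of node names, part-of pairs written as (schema, attribute), provider pairs written as (source, target)) and the function name `zoom_out` are assumptions made for illustration. Edges between two attributes of the same schema are simply dropped, which is also an assumption, since the steps only speak of entities external to x.

```python
def zoom_out(nodes, part_of, provider, schemata):
    """Zoom out from the attribute level to the schema level."""
    nodes, part_of, provider = set(nodes), set(part_of), set(provider)
    for x in schemata:
        attrs = {a for (s, a) in part_of if s == x}
        inside = attrs | {x}
        lifted = set()
        for src, dst in provider:
            if src in inside and dst in inside:
                continue                    # assumption: internal edges vanish
            if src in attrs:
                lifted.add((x, dst))        # steps 1-2: (x_a, y) becomes (x, y)
            elif dst in attrs:
                lifted.add((src, x))        # steps 1-2: (y, x_a) becomes (y, x)
            else:
                lifted.add((src, dst))      # edge does not involve x's attributes
        provider = lifted                   # the set removes duplicate edges
        nodes -= attrs                      # step 3: drop attribute nodes
        part_of = {(s, a) for (s, a) in part_of if s != x}
    return nodes, part_of, provider


# Hypothetical example: recordset R feeds the input schema IN of an activity.
nodes = {"R", "R.A1", "R.A2", "IN", "IN.A1", "IN.A2"}
part_of = {("R", "R.A1"), ("R", "R.A2"), ("IN", "IN.A1"), ("IN", "IN.A2")}
provider = {("R.A1", "IN.A1"), ("R.A2", "IN.A2")}
gs_nodes, gs_part_of, gs_provider = zoom_out(nodes, part_of, provider, ["R", "IN"])
```

Processing the schemata one at a time means an edge between attributes of two different schemata is lifted in two passes, first on its source side and then on its target side.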
Zooming in and out
[Figure: the graphs Gs and Ga.]
Zooming in and out
[Figure: the graphs Gs and Ga.]
Measure     Definition                     Ga Gs
Size        Size(G)                        725
Length      Max provider path              32
Complexity  0.5*ext. edges + int. edges    836
Cohesion    F_IN+F_OUT F (IN+OUT)          10.75
Summary
- We have incorporated update semantics in our graph-based modeling of ETL activities.
- We also cover complex transformations such as negation, aggregation and self-joins.
- We have introduced a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction.
- Long version and ER’05: internal semantics for activities and quality measures for the design of ETL activities
On-going/Future Work
This work is part of the Arktos II project.
Future work includes research in:
- What-if analysis of ETL scenarios
- Measures for the quality of the design of ETL scenarios
Thank you!
Backup Slides
Vision – big picture
The major research goal is a metadata repository that incorporates metadata on:
- the static part of an information system, i.e., tables, constraints, query forms, etc.
- the dynamic part, i.e., data-centric software modules
We invest in a graph-based modeling approach, based on the flexibility of graphs as modeling tools.
Preliminaries
- Data types
- Constants
- Attributes
- RecordSets
- Function types
- Functions
[Figure: example nodes R, $2€, 1, PKEY, my$2€, Integer.]
Relationships
- Instance-Of Relationships
- Part-Of Relationships
- Regulator Relationships
- Provider Relationships
- Derived Provider Relationships
Activities
- Name
- Input Schemata
- Output Schema
- Rejections Schema
- Parameter List
- Output/Rejection Operational Semantics
[Figure: an activity with Input 1, Input 2, Parameters, Output and Rejected Rows.]
Importance Metrics
- Dependency: the in-degree of the node with respect to the provider edges
- Responsibility: the out-degree of the node with respect to the provider edges
- Degree: dependency + responsibility
- Local vs. Transitive
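A minimal Python sketch of these metrics follows. The local variants count provider edges directly, as defined above. The transitive variants are an assumption: the slides do not spell them out, so the sketch counts the distinct nodes reachable along provider paths and assumes an acyclic graph. All function names and the sample edges are hypothetical.

```python
def local_metrics(provider, node):
    """Return (dependency, responsibility, degree) from direct provider edges."""
    dependency = sum(1 for (_, dst) in provider if dst == node)      # in-degree
    responsibility = sum(1 for (src, _) in provider if src == node)  # out-degree
    return dependency, responsibility, dependency + responsibility


def reachable(provider, node, forward=True):
    """Nodes reachable from node along (forward) or against provider edges."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for src, dst in provider:
            here, nxt = (src, dst) if forward else (dst, src)
            if here == current and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def transitive_metrics(provider, node):
    """Assumed transitive variant: count all upstream and downstream nodes."""
    dependency = len(reachable(provider, node, forward=False))
    responsibility = len(reachable(provider, node, forward=True))
    return dependency, responsibility, dependency + responsibility


# Hypothetical provider edges: a feeds b, and b feeds both c and d.
provider = {("a", "b"), ("b", "c"), ("b", "d")}
local_b = local_metrics(provider, "b")
transitive_a = transitive_metrics(provider, "a")
```

In this example node a has a local responsibility of 1 but reaches three nodes transitively, which is the kind of difference the local versus transitive distinction is meant to expose.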
Functions
Functions are treated as any other predicate in LDL, with the following special characteristics:
- The function involves a list of parameters, the last of which is the return value of the function.
- All function parameters referenced in the body of the rule, either as homonyms of attributes of other predicates or through equalities with such attributes, are linked with these attributes through equality regulator relationships.
- The return value is possibly connected to the output through a provider relationship (or to some other predicate of the body through a regulator relationship).
Aliases & Negation
- Alias relationships. An alias relationship is introduced whenever the same predicate appears more than once in the same rule (e.g., in the case of a self-join). All the nodes representing these occurrences of the same predicate are connected through alias relationships to denote their semantic interrelationship. Note that, because intra-activity programs do not directly interact with external recordsets or activities, this practically involves the rare case of internal intermediate rules.
- Negation. When a predicate appears negated in a rule body, the respective part-of edge between the rule and the literal’s node is tagged with ‘⌐’. Note that negated predicates can appear only in the rule body.
Activity semantics in LDL
    R06: a_in1(pkey,suppkey,date,qty,cost) <-
             ps(pkey,suppkey,date,qty,cost).
    R07: a_in2(pkey,source,skey) <-
             lookUp(l_pkey,source,l_skey),
             pkey=l_pkey, skey=l_skey, source=1.
    R08: a_out(pkey,suppkey,date,qty,cost,skey) <-
             a_in1(pkey,suppkey,date,qty,cost),
             a_in2(pkey,source,skey).
    R09: D2E.a_in(skey,suppkey,date,qty,cost) <-
             sk.a_out(pkey,suppkey,date,qty,cost,skey).
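Read operationally, rules R06 to R09 describe a surrogate-key lookup: rows of ps are joined with the lookup entries of source 1 on pkey, and the production key is then replaced by skey. The Python sketch below illustrates that reading; the function names, the list-of-tuples representation, the sample data, and the assumption of at most one skey per (pkey, source) pair are all hypothetical.

```python
def surrogate_key(ps, lookup, source=1):
    """Join ps rows with lookup entries of one source and append skey."""
    # R07: keep only the lookup entries of the given source, keyed by pkey.
    skey_of = {l_pkey: l_skey
               for (l_pkey, src, l_skey) in lookup if src == source}
    # R08: join on pkey; rows without a matching lookup entry are dropped.
    return [(pkey, suppkey, date, qty, cost, skey_of[pkey])
            for (pkey, suppkey, date, qty, cost) in ps if pkey in skey_of]


def to_next_activity(a_out):
    """R09: pass (skey, suppkey, date, qty, cost) on, dropping pkey."""
    return [(skey, suppkey, date, qty, cost)
            for (pkey, suppkey, date, qty, cost, skey) in a_out]


# Hypothetical sample data.
ps = [(1, 1, "d1", 5, 2.0), (2, 1, "d1", 3, 4.0)]
lookup = [(1, 1, 100), (1, 2, 200), (3, 1, 300)]
a_out = surrogate_key(ps, lookup)
next_in = to_next_activity(a_out)
```

The row with pkey 2 has no lookup entry for source 1 and therefore does not reach the output, which is the behaviour of a plain join.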