Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates Alkis Simitsis 1, Panos Vassiliadis 2, Manolis Terrovitis 1, Spiros.

1 Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates Alkis Simitsis 1, Panos Vassiliadis 2, Manolis Terrovitis 1, Spiros Skiadopoulos 1 (1) National Technical University of Athens {asimi,mter,spiros}@dbnet.ece.ntua.gr (2) University of Ioannina pvassil@cs.uoi.gr

2 DaWaK'05, Copenhagen, August 2005. Outline: Background & Motivation; Database updates and composite transformations; Zooming in and out the Architecture Graph; Conclusions and Future Work

3 Outline: Background & Motivation; Database updates and composite transformations; Zooming in and out the Architecture Graph; Conclusions and Future Work

4 Extract-Transform-Load (ETL)

5 Motivation [Figure: example ETL scenario spanning Sources, DSA, and DW. Recoverable labels: sources S1_PARTSUPP and S2_PARTSUPP; transfers FTP1 and FTP2; recordsets DS.PS_NEW1, DS.PS_OLD1, DS.PS1, DS.PS_NEW2, DS.PS_OLD2, DS.PS2; activities DIFF1 and DIFF2 (on DS.PS_NEW.PKEY, DS.PS_OLD.PKEY), Add_SPK1 (SUPPKEY=1), Add_SPK2 (SUPPKEY=2), SK1 and SK2 (DS.PS.PKEY, LOOKUP_PS.SKEY, SUPPKEY), $2€, A2EDate, AddDate, CheckQTY (QTY>0), NotNULL, and a union U, with rejected rows logged; warehouse table DW.PARTSUPP; Aggregate1 (PKEY, DAY, MIN(COST)) and Aggregate2 (PKEY, MONTH, AVG(COST)); views V1 and V2; TIME (DW.PARTSUPP.DATE, DAY).]

6 Background. Traditional workflow modeling has treated workflows as graphs with control-flow semantics. We take advantage of the data-centric, script-based nature of ETL activities to model their internals as a graph, too, which we call the Architecture Graph. Our previous efforts handled simple cleanings and transformations [DMDW’02] and introduced templates to simplify the definition of scenarios [CAiSE’03].

7 Background [DMDW’02, CAiSE’03]. Black-box model: no semantics in the graph. [Figure: architecture graph fragment with a legend. Recoverable labels: recordset R, attributes A1 to A4, schemata IN and OUT, IN.A1, POPULATED_FIELD, activities NotNull_A1 and SK_A2.]

8 Why is it important? What part of the scenario is affected if we delete an attribute? Which attributes/tables are involved in the population of an attribute? It is straightforward to follow the data propagation chain. How “good” is my design of the ETL scenario? Detection of inconsistencies; detection of important, vulnerable or useless attributes; well-defined quality measurement theory based on graphs.

9 Graph-based modeling of data-centric workflows is important! [Figure: graph fragment with the annotation "Vulnerable point in the scenario". Recoverable labels: PKEY, SUPPKEY, SKEY, SOURCE, AddSPK1, SK1 (IN, OUT), LOOKUP_PS.] Transitive measures (In, Out, Deg): SKEY: 8, 0, 8. Avg/attribute: 2.83, 5.67. Avg/entity: 5.67, 11.33.

10 Contribution. We incorporate update semantics in our graph-based modeling of ETL activities. We also cover complex transformations like negation, aggregation and self-joins. We introduce a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction. Long version and ER’05: details for adding internal semantics.

11 Outline: Background & Motivation; Database updates and composite transformations; Zooming in and out the Architecture Graph; Conclusions and Future Work

12 Updates and Transformations. In this paper, we consider adding graph modeling techniques for several kinds of activities: update (INS, UPD, DEL) activities; aggregates; rules employing negation and aliases; functions. We use LDL++ as the language that declaratively describes the semantics.

13 Updates to the database. An update expression is of the form head <- query part, update part, with the following semantics: 1. the query part is evaluated against the database to retrieve the tuples that satisfy it; 2. the predicate of the update part is updated as specified in the rule. Example:
raise1(Name, Sal, NewSal) <-
    employee(Name, Sal), Sal = 1100,    (a)
    NewSal = Sal * 1.1,                 (b)
    - employee(Name, Sal),              (c)
    + employee(Name, NewSal).           (d)
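The two-step semantics of the raise1 rule can be mimicked in a few lines of Python. This is only an illustrative sketch: representing employee as a set of (name, sal) tuples, the Python function name, and the rounding of the new salary are assumptions made here, not part of LDL++ or of the presented approach.

```python
def raise1(employee):
    """Apply the raise1 rule to a set of (name, sal) tuples.

    Returns the updated relation and the tuples produced for the rule head.
    """
    # Step 1, query part (a)-(b): select matching tuples and compute NewSal.
    # Rounding is an assumption made here to avoid floating-point noise.
    head = [(name, sal, round(sal * 1.1, 2))
            for (name, sal) in employee if sal == 1100]
    # Step 2, update part: delete the old tuple (c), insert the new one (d).
    updated = set(employee)
    for name, sal, new_sal in head:
        updated.discard((name, sal))
        updated.add((name, new_sal))
    return updated, head


employee = {("ann", 1100), ("bob", 900)}
updated, head = raise1(employee)
```

Only the tuple for "ann" satisfies the query part, so only that tuple is replaced; "bob" is left untouched.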

14 Updates to the database

15 Updates to the database. A side-effect rule is treated as an activity, with the corresponding node. The output schema of the activity is derived from the structure of the predicate of the head of the rule. For every predicate with a + or – in the body of the rule, a respective provider edge from the output schema of the side-effect activity is assumed. For every predicate that appears in the rule without a + or – tag, we assume the respective input schema. Provider edges from this predicate towards these schemata are added as usual. The same applies for the attributes of the input and output schemata of the side-effect activity.
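A rough sketch of how these edge-construction rules could be applied to a side-effect rule is given below. The schema naming scheme (`<activity>.out`, `<activity>.in1`, ...), the tuple encoding of the rule body, and the decision to label each edge with its + or - tag are all assumptions made for illustration; attribute-level edges are omitted.

```python
def side_effect_edges(activity, body):
    """Derive schema-level provider edges for a side-effect rule.

    body is a list of (tag, predicate) pairs, where tag is "+", "-",
    or None for an untagged (queried) predicate.
    Returns a list of (source, target, tag) provider edges.
    """
    out_schema = f"{activity}.out"  # hypothetical naming scheme
    edges = []
    n_inputs = 0
    for tag, predicate in body:
        if tag in ("+", "-"):
            # Updated predicate: edge from the activity's output schema.
            edges.append((out_schema, predicate, tag))
        else:
            # Queried predicate: it feeds a fresh input schema.
            n_inputs += 1
            edges.append((predicate, f"{activity}.in{n_inputs}", None))
    return edges


raise1_edges = side_effect_edges(
    "raise1",
    [(None, "employee"), ("-", "employee"), ("+", "employee")],
)
```

For raise1 this yields one edge from employee into the activity's input schema and two tagged edges from its output schema back to employee.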

16 Aggregation. Aggregation in LDL involves two steps: 1. grouping of values into a bag and 2. application of an aggregate function over the values of the bag.
R16: aggregate1.a_in(skey,suppkey,date,qty,cost) <- dw.partsupp(skey,suppkey,date,qty,cost).
R17: temp(skey,day, ) <- aggregate1.a_in(skey,suppkey,date,qty,cost).
R18: aggregate1.a_out(skey,day,min_cost) <- temp(skey,day,all_costs), aggr(min,all_costs,min_cost).
R19: v1(skey,day,min_cost) <- aggregate1.a_out(skey,day,min_cost).
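The two steps (grouping into a bag, then applying an aggregate function) can be sketched in Python as follows. The row layout (skey, day, cost), the sample data, and the helper names group_to_bags and aggr are assumptions for illustration; they only loosely mirror rules R17 and R18.

```python
from collections import defaultdict


def group_to_bags(rows):
    """Step 1: group the cost values into one bag per (skey, day)."""
    bags = defaultdict(list)
    for skey, day, cost in rows:
        bags[(skey, day)].append(cost)
    return dict(bags)


def aggr(func, bags):
    """Step 2: apply the aggregate function to every bag."""
    return {key: func(values) for key, values in bags.items()}


rows = [
    (1, "2005-08-01", 10.0),
    (1, "2005-08-01", 7.5),
    (2, "2005-08-01", 3.0),
]
temp = group_to_bags(rows)   # plays the role of the temp predicate
a_out = aggr(min, temp)      # minimum cost per (skey, day)
```

Passing the aggregate function as an argument mirrors the idea that the function to apply (min, avg, max) is itself a parameter of the aggregation.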

17 Aggregation

18 Aggregation. Relations which create a set from the values of a field employ a pair of regulator edges through an intermediate node ‘<>’. Provider relations for attributes used as groupers are tagged with ‘g’. One of the attributes of the aggr function node consumes data from a constant that indicates which aggregate function should be used (e.g., avg, min, max).

19 Outline: Background & Motivation; Database updates and composite transformations; Zooming in and out the Architecture Graph; Conclusions and Future Work

20 Zooming in and out. There is a principled way of zooming in and out at various levels of abstraction: the attribute level; the schema level; the activity level.

21 Zooming in and out. For each node x of the architecture graph G(V,E) representing a schema: 1. for each provider edge (x_a, y) or (y, x_a) involving an attribute x_a of x and an entity y external to x, introduce the respective provider edge between x and y (unless it already exists); 2. remove the provider edges (x_a, y) and (y, x_a) of the previous step; 3. remove the nodes of the attributes of x and the respective part-of edges.
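The three steps above can be sketched in Python as follows. The graph encoding (a set of node names, a dict mapping each attribute to its owning schema, and a set of provider edges) and the sample node names are assumptions for illustration, not the representation used by the authors.

```python
def zoom_out_schema(x, nodes, part_of, provider):
    """Collapse the attributes of schema x into the node x itself."""
    attrs = {a for a, owner in part_of.items() if owner == x}
    new_provider = set()
    for src, dst in provider:
        if src in attrs and dst not in attrs and dst != x:
            # Steps 1-2: replace (x_a, y) by (x, y); the set drops duplicates.
            new_provider.add((x, dst))
        elif dst in attrs and src not in attrs and src != x:
            # Steps 1-2: replace (y, x_a) by (y, x).
            new_provider.add((src, x))
        elif src in attrs or dst in attrs:
            # Edge internal to x: it vanishes with the attribute nodes.
            continue
        else:
            new_provider.add((src, dst))
    # Step 3: remove the attribute nodes and their part-of edges.
    new_nodes = nodes - attrs
    new_part_of = {a: o for a, o in part_of.items() if o != x}
    return new_nodes, new_part_of, new_provider


def zoom_out(nodes, part_of, provider):
    """Apply the schema-level zoom-out to every schema in the graph."""
    for x in sorted(set(part_of.values())):
        nodes, part_of, provider = zoom_out_schema(x, nodes, part_of, provider)
    return nodes, provider


nodes = {"R", "R.A1", "R.A2", "S", "S.B1"}
part_of = {"R.A1": "R", "R.A2": "R", "S.B1": "S"}
provider = {("R.A1", "S.B1"), ("R.A2", "S.B1")}
schema_nodes, schema_edges = zoom_out(nodes, part_of, provider)
```

In this toy graph, the two attribute-level edges collapse into a single schema-level provider edge from R to S.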

22 Zooming in and out [Figure: graphs Gs and Ga]

23 Zooming in and out [Figure: graphs Gs and Ga] Measures (columns: Measure | Definition | Ga Gs): Size | Size(G) | 725. Length | Max provider path | 32. Complexity | 0.5*ext. edges + int. edges | 836. Cohesion | F_IN+F_OUT F (IN+OUT) | 10.75.

24 Outline: Background & Motivation; Database updates and composite transformations; Zooming in and out the Architecture Graph; Conclusions and Future Work

25 Summary. We have incorporated update semantics in our graph-based modeling of ETL activities. We also cover complex transformations like negation, aggregation and self-joins. We have introduced a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction. Long version and ER’05: internal semantics for activities and quality measures for the design of ETL activities. http://www.cs.uoi.gr/~pvassil/publications/2005_ER_AG/ETL_blueprints_long.pdf

26 On-going/Future Work. This work is part of the Arktos II project. Future work includes research in: what-if analysis of ETL scenarios; measures for the quality of the design of ETL scenarios. http://www.cs.uoi.gr/~pvassil/projects/arktos_II

27 Thank you! Arktos II: http://www.cs.uoi.gr/~pvassil/projects/arktos_II

28 Backup Slides

29 Vision: big picture. The major research goal is to have a metadata repository that incorporates metadata on (a) the static part of an information system, i.e., tables, constraints, query forms, etc., and (b) the dynamic part, i.e., data-centric software modules. We invest in a graph-based modeling approach, based on the flexibility of graphs as modeling tools.

30 Preliminaries: data types; constants; attributes; recordsets; function types; functions. [Figure: example nodes labeled Integer, 1, PKEY, R, $2€, my$2€]

31 Relationships: instance-of relationships; part-of relationships; regulator relationships; provider relationships; derived provider relationships.

32 Activities: name; input schemata; output schema; rejections schema; parameter list; output/rejection operational semantics. [Figure: an activity with Input 1, Input 2, Parameters, Output, and Rejected Rows]

33 Importance Metrics. Dependency: the in-degree of the node with respect to the provider edges. Responsibility: the out-degree of the node with respect to the provider edges. Degree: dependency + responsibility. Local vs. transitive.
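A small Python sketch of these metrics is given below. The local variants follow the definitions above directly. The slide does not spell out the transitive variants, so the version here (counting the nodes reachable along provider paths in each direction) is an assumption made purely for illustration, as are the function names and the sample graph.

```python
def local_metrics(node, provider):
    """Return (dependency, responsibility, degree) over provider edges."""
    dependency = sum(1 for _, dst in provider if dst == node)
    responsibility = sum(1 for src, _ in provider if src == node)
    return dependency, responsibility, dependency + responsibility


def reachable(node, provider, forward=True):
    """Nodes reachable from node along provider edges (or against them)."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for src, dst in provider:
            here, nxt = (src, dst) if forward else (dst, src)
            if here == current and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(node)  # do not count the node itself if it lies on a cycle
    return seen


def transitive_metrics(node, provider):
    """Assumed transitive variant: count all upstream/downstream nodes."""
    dependency = len(reachable(node, provider, forward=False))
    responsibility = len(reachable(node, provider, forward=True))
    return dependency, responsibility, dependency + responsibility


# A simple provider chain a -> b -> c -> d.
chain = {("a", "b"), ("b", "c"), ("c", "d")}
local_c = local_metrics("c", chain)
transitive_c = transitive_metrics("c", chain)
```

On the chain, node c has one direct provider and one direct consumer, while transitively it depends on both a and b.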

34 Functions. Functions are treated as any other predicate in LDL, with the following special characteristics: The function involves a list of parameters, the last of which is the return value of the function. All function parameters referenced in the body of the rule, either as homonyms with attributes of other predicates or through equalities with such attributes, are linked through equality regulator relationships with these attributes. The return value is possibly connected to the output through a provider relationship (or with some other predicate of the body, through a regulator relationship).

35 Aliases & Negation. Alias relationships: an alias relationship is introduced whenever the same predicate appears more than once in the same rule (e.g., in the case of a self-join). All the nodes representing these occurrences of the same predicate are connected through alias relationships to denote their semantic interrelationship. Note that because intra-activity programs do not directly interact with external recordsets or activities, this practically involves the rare case of internal intermediate rules. Negation: when a predicate appears negated in a rule body, the respective part-of edge between the rule and the literal’s node is tagged with ‘⌐’. Note that negated predicates can appear only in the rule body.

36 Activity semantics in LDL
R06: a_in1(pkey,suppkey,date,qty,cost) <- ps(pkey,suppkey,date,qty,cost).
R07: a_in2(pkey,source,skey) <- lookUp(l_pkey,source,l_skey), pkey=l_pkey, skey=l_skey, source=1.
R08: a_out(pkey,suppkey,date,qty,cost,skey) <- a_in1(pkey,suppkey,date,qty,cost), a_in2(pkey,source,skey).
R09: D2E.a_in(skey,suppkey,date,qty,cost) <- sk.a_out(pkey,suppkey,date,qty,cost,skey).
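Read procedurally, these rules describe a surrogate-key assignment: rows from ps are joined with lookup entries for source 1 on pkey, and the matching skey is appended. A minimal Python sketch of that reading follows; the function name, the tuple layouts, and the sample data are assumptions for illustration only.

```python
def surrogate_key(ps, look_up, source=1):
    """Join ps rows with lookup entries of one source and append skey."""
    # Like R07: keep only the lookup entries of the given source.
    skey_of = {pkey: skey for (pkey, src, skey) in look_up if src == source}
    # Like R08: join both inputs on pkey; rows without a match are dropped.
    return [
        (pkey, suppkey, date, qty, cost, skey_of[pkey])
        for (pkey, suppkey, date, qty, cost) in ps
        if pkey in skey_of
    ]


ps = [(10, 1, "2005-08-01", 5, 9.9), (11, 1, "2005-08-01", 2, 4.0)]
look_up = [(10, 1, 100), (10, 2, 200), (11, 2, 201)]
a_out = surrogate_key(ps, look_up)
```

Only pkey 10 has a lookup entry for source 1, so only that row appears in the output, extended with surrogate key 100.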

37 Activities

38 Zooming in and out

