Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates
Alkis Simitsis (1), Panos Vassiliadis (2), Manolis Terrovitis (1), Spiros Skiadopoulos (1)
(1) National Technical University of Athens (2) University of Ioannina

DaWaK'05, Copenhagen, August
Outline: Background & Motivation; Database updates and composite transformations; Zooming in and out of the Architecture Graph; Conclusions and Future Work

Extract-Transform-Load (ETL)

Motivation
[Figure: example ETL scenario spanning Sources, DSA and DW. Nodes shown include the source recordsets S1_PARTSUPP and S2_PARTSUPP; FTP1 and FTP2; DS.PS_NEW1, DS.PS_OLD1, DS.PS_NEW2, DS.PS_OLD2, DS.PS1 and DS.PS2; the activities DIFF1, DIFF2, Add_SPK1 (SUPPKEY=1), Add_SPK2 (SUPPKEY=2), SK1, SK2, $2€, A2EDate, AddDate (COSTDATE=SYSDATE), NotNULL and CheckQTY (QTY>0), several of which log rejected rows; a union (U); LOOKUP_PS; DW.PARTSUPP; TIME; Aggregate1 (MIN(COST) by PKEY, DAY) and Aggregate2 (AVG(COST) by PKEY, MONTH); and V1 and V2.]

Background
Traditional workflow modeling has treated workflows as graphs with control-flow semantics.
We take advantage of the data-centric, script-based nature of ETL activities to model their internals as a graph too, which we call the Architecture Graph.
Our previous efforts handled simple cleanings and transformations [DMDW’02] and introduced templates to simplify the definition of scenarios [CAiSE’03].

Background [DMDW’02, CAiSE’03]
Black-box model: no semantics in the graph.
[Figure: Architecture Graph fragment with recordset R (attributes A1 to A4), the activities NotNull_A1 and SK_A2 with their IN and OUT schemata, the parameter POPULATED_FIELD, and a legend.]

Why is it important?
What part of the scenario is affected if we delete an attribute? Which attributes/tables are involved in the population of an attribute? With a graph, it is straightforward to follow the data propagation chain.
How “good” is my design of the ETL scenario? The graph supports the detection of inconsistencies, the detection of important, vulnerable or useless attributes, and a well-defined quality measurement theory based on graphs.
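
The first two questions above amount to reachability over provider edges. The following is a minimal Python sketch, assuming the graph is given as a plain list of (source, target) provider edges; the function names `affected_by` and `involved_in` and the node names in the usage note are hypothetical and do not come from the slides.

```python
from collections import deque


def affected_by(provider_edges, start):
    """Return every node reachable from `start` by following provider edges.

    Assumption: provider_edges is an iterable of (source, target) pairs.
    """
    successors = {}
    for src, dst in provider_edges:
        successors.setdefault(src, set()).add(dst)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def involved_in(provider_edges, target):
    """Return every node that directly or transitively populates `target`."""
    reversed_edges = [(dst, src) for src, dst in provider_edges]
    return affected_by(reversed_edges, target)
```

For example, with the edges `[("a", "b"), ("b", "c"), ("x", "c")]`, deleting `a` affects `b` and `c`, while `a`, `b` and `x` are all involved in populating `c`.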

Graph-based modeling of data-centric workflows is important!
[Figure: the activities AddSPK1 and SK1 with their IN and OUT schemata (attributes PKEY, SUPPKEY, SKEY) and the recordset LOOKUP_PS (PKEY, SOURCE, SKEY). SKEY is marked as a vulnerable point in the scenario, next to a table of transitive in-, out- and total degree for SKEY compared with the averages per attribute and per entity.]

Contribution
We incorporate update semantics in our graph-based modeling of ETL activities.
We also cover complex transformations like negation, aggregation and self-joins.
We introduce a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction.
Long version and ER’05: details for adding internal semantics.

Updates and Transformations
In this paper, we add graph modeling techniques for several kinds of activities: update (INS, UPD, DEL) activities; aggregates; rules employing negation and aliases; and functions.
We use LDL++ as the language that declaratively describes the semantics.

Updates to the database
An update expression is of the form
    head <- query part, update part
with the following semantics: (1) the query part is evaluated against the database to retrieve the tuples that satisfy it; (2) the predicate of the update part is updated as specified in the rule.
    raise1(Name, Sal, NewSal) <-
        employee(Name, Sal), Sal = 1100,    (a)
        NewSal = Sal * 1.1,                 (b)
        - employee(Name, Sal),              (c)
        + employee(Name, NewSal).           (d)
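
To make the two-phase reading of the raise1 rule concrete, here is a rough Python sketch. It assumes the employee relation is a set of (name, salary) tuples; the function name `raise1` mirrors the rule head, but the set-based representation and the rounding of the new salary are illustrative assumptions, not part of LDL++.

```python
def raise1(employee):
    """Evaluate the raise1 rule over a set of (name, sal) tuples.

    Returns the head tuples (name, sal, new_sal) and the updated relation.
    Assumption: rounding to 2 decimals only keeps float results readable.
    """
    # (a) query part: select the tuples that satisfy the condition
    matched = sorted((name, sal) for name, sal in employee if sal == 1100)

    # (b) compute the new value for each selected tuple
    head = [(name, sal, round(sal * 1.1, 2)) for name, sal in matched]

    # (c) and (d) update part: delete the old tuple, insert the new one
    updated = set(employee)
    for name, sal, new_sal in head:
        updated.discard((name, sal))
        updated.add((name, new_sal))
    return head, updated
```

Calling `raise1({("ann", 1100), ("bob", 900)})` leaves bob untouched and replaces ann's tuple with one carrying the raised salary.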

Updates to the database

Updates to the database
A side-effect rule is treated as an activity, with a corresponding node.
The output schema of the activity is derived from the structure of the predicate in the head of the rule.
For every predicate tagged with + or – in the body of the rule, a provider edge from the output schema of the side-effect activity to that predicate is assumed.
For every predicate that appears in the rule without a + or – tag, we assume a respective input schema. Provider edges from this predicate towards these schemata are added as usual.
The same applies to the attributes of the input and output schemata of the side-effect activity.
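
The mapping described above can be sketched at the schema level as follows. This is only an illustration under stated assumptions: the function name, the `.OUT` and `.INn` naming scheme, and the choice to carry the +/– tag as a third element of each edge are hypothetical; attribute-level edges are omitted.

```python
def side_effect_rule_to_edges(activity, body):
    """Derive schema-level nodes and provider edges for one side-effect rule.

    body: list of (predicate, tag) pairs, with tag in {'+', '-', None}.
    Only predicates are passed in; comparisons such as Sal = 1100 are not.
    """
    out_schema = f"{activity}.OUT"      # derived from the head of the rule
    nodes = [activity, out_schema]
    provider_edges = []
    input_count = 0
    for predicate, tag in body:
        if tag in ("+", "-"):
            # updated predicate: provider edge from the output schema
            provider_edges.append((out_schema, predicate, tag))
        else:
            # queried predicate: it feeds a fresh input schema
            input_count += 1
            in_schema = f"{activity}.IN{input_count}"
            nodes.append(in_schema)
            provider_edges.append((predicate, in_schema, None))
    return nodes, provider_edges
```

For the raise1 rule, the body would be passed as `[("employee", None), ("employee", "-"), ("employee", "+")]`, giving one input schema and two tagged edges from the output schema back to employee.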

Aggregation
Aggregation in LDL takes two steps: (1) grouping of values into a bag, and (2) application of an aggregate function over the values of the bag.
R16: aggregate1.a_in(skey,suppkey,date,qty,cost) <- dw.partsupp(skey,suppkey,date,qty,cost).
R17: temp(skey,day,<cost>) <- aggregate1.a_in(skey,suppkey,date,qty,cost).
R18: aggregate1.a_out(skey,day,min_cost) <- temp(skey,day,all_costs), aggr(min,all_costs,min_cost).
R19: v1(skey,day,min_cost) <- aggregate1.a_out(skey,day,min_cost).
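
The two steps can be mirrored in a few lines of Python. This is a sketch under the assumption that input rows are already (skey, day, cost) triples; the function names `group_to_bags` and `aggregate` are hypothetical and only loosely correspond to the temp and aggr predicates.

```python
from collections import defaultdict


def group_to_bags(rows):
    """Step 1: group cost values into one bag per (skey, day) grouper."""
    bags = defaultdict(list)
    for skey, day, cost in rows:
        bags[(skey, day)].append(cost)
    return dict(bags)


def aggregate(bags, fn):
    """Step 2: apply the aggregate function fn to the values of each bag."""
    return {key: fn(values) for key, values in bags.items()}
```

Passing `min` as `fn` corresponds to the min_cost computation; swapping in another function changes only the second step, since the grouping is independent of the aggregate applied.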

Aggregation

Aggregation
Relations that create a set from the values of a field employ a pair of regulator edges through an intermediate node ‘<>’.
Provider relations for attributes used as groupers are tagged with ‘g’.
One of the attributes of the aggr function node consumes data from a constant that indicates which aggregate function should be used (e.g., avg, min, max).

Zooming in and out
There is a principled way of zooming in and out at various levels of abstraction: the attribute level, the schema level, and the activity level.

Zooming in and out
For each node x of the architecture graph G(V,E) representing a schema:
1. for each provider edge (x_a, y) or (y, x_a), involving an attribute x_a of x and an entity y external to x, introduce the respective provider edge between x and y (unless it already exists);
2. remove the provider edges (x_a, y) and (y, x_a) of the previous step;
3. remove the nodes of the attributes of x and the respective part-of edges.
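
The three steps above could be sketched in Python roughly as follows. The representation is an assumption made for illustration: nodes are strings, `part_of` holds only (schema, attribute) pairs, and `provider` holds (source, target) pairs; the function name `zoom_out` is hypothetical.

```python
def zoom_out(nodes, part_of, provider):
    """Lift attribute-level provider edges to the schema level."""
    nodes, part_of, provider = set(nodes), set(part_of), set(provider)
    schemas = {schema for schema, _ in part_of}
    for x in schemas:
        attrs = {attr for schema, attr in part_of if schema == x}
        for src, dst in list(provider):
            if src in attrs and dst not in attrs and dst != x:
                provider.add((x, dst))          # step 1: edge x -> y
                provider.discard((src, dst))    # step 2: drop (x_a, y)
            elif dst in attrs and src not in attrs and src != x:
                provider.add((src, x))          # step 1: edge y -> x
                provider.discard((src, dst))    # step 2: drop (y, x_a)
        # step 3: remove attribute nodes, their part-of edges, and any
        # provider edges that stayed internal to x
        provider = {(s, d) for s, d in provider
                    if s not in attrs and d not in attrs}
        nodes -= attrs
        part_of = {(s, a) for s, a in part_of if s != x}
    return nodes, part_of, provider
```

Because the edge collection is a set, the "unless it already exists" condition of step 1 is handled implicitly. Processing the schemata one at a time means an edge between two attributes is lifted in two passes, first on one end and then on the other.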

Zooming in and out
[Figure: the graphs Gs and Ga.]

Zooming in and out
[Figure: the graphs Gs and Ga.]
Measure       Definition                        Ga Gs
Size          Size(G)                           725
Length        Max provider path                 32
Complexity    0.5*ext. edges + int. edges       836
Cohesion      F_IN+F_OUT F (IN+OUT)             10.75

Summary
We have incorporated update semantics in our graph-based modeling of ETL activities.
We also cover complex transformations like negation, aggregation and self-joins.
We have introduced a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction.
Long version and ER’05: internal semantics for activities and quality measures for the design of ETL activities.

On-going/Future Work
This work is part of the Arktos II project.
Future work includes research in what-if analysis of ETL scenarios and in measures for the quality of the design of ETL scenarios.

Thank you!

Backup Slides

Vision – big picture
The major research goal is a metadata repository that incorporates metadata both on the static part of an information system (i.e., tables, constraints, query forms, etc.) and on the dynamic part (i.e., data-centric software modules).
We invest in a graph-based modeling approach, based on the flexibility of graphs as modeling tools.

Preliminaries
Data types, constants, attributes, recordsets, function types, functions.
[Figure: example nodes Integer, 1, PKEY, R, $2€ and my$2€.]

Relationships
Instance-of relationships, part-of relationships, regulator relationships, provider relationships, derived provider relationships.

Activities
Name; input schemata; output schema; rejections schema; parameter list; output/rejection operational semantics.
[Figure: an activity with Input 1, Input 2, Parameters, Output and Rejected Rows.]

Importance Metrics
Dependency: the in-degree of the node with respect to the provider edges.
Responsibility: the out-degree of the node with respect to the provider edges.
Degree: dependency + responsibility.
Local vs. transitive.
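
As a small illustration of the local variants of these metrics, the sketch below counts in- and out-degree over a list of provider edges. The function name `local_importance` and the node names in the usage note are hypothetical; the transitive variants are not covered, since the slide does not spell out their definition.

```python
from collections import Counter


def local_importance(provider_edges):
    """Compute local dependency, responsibility and degree per node.

    Assumption: provider_edges is an iterable of (source, target) pairs.
    """
    edges = list(provider_edges)
    dependency = Counter(dst for _, dst in edges)       # in-degree
    responsibility = Counter(src for src, _ in edges)   # out-degree
    nodes = set(dependency) | set(responsibility)
    return {
        node: {
            "dependency": dependency[node],
            "responsibility": responsibility[node],
            "degree": dependency[node] + responsibility[node],
        }
        for node in nodes
    }
```

With edges such as `("PS.PKEY", "SK.IN.PKEY")` and `("SK.IN.PKEY", "SK.OUT.PKEY")`, the node `SK.IN.PKEY` gets dependency 1, responsibility 1 and degree 2.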

Functions
Functions are treated as any other predicate in LDL, with the following special characteristics:
The function involves a list of parameters, the last of which is the return value of the function.
All function parameters referenced in the body of the rule, either as homonyms of attributes of other predicates or through equalities with such attributes, are linked with these attributes through equality regulator relationships.
The return value is possibly connected to the output through a provider relationship (or to some other predicate of the body, through a regulator relationship).

Aliases & Negation
Alias relationships. An alias relationship is introduced whenever the same predicate appears more than once in the same rule (e.g., in the case of a self-join). All the nodes representing these occurrences of the same predicate are connected through alias relationships to denote their semantic interrelationship. Note that because intra-activity programs do not directly interact with external recordsets or activities, this practically involves the rare case of internal intermediate rules.
Negation. When a predicate appears negated in a rule body, the respective part-of edge between the rule and the literal’s node is tagged with ‘⌐’. Note that negated predicates can appear only in the rule body.

Activity semantics in LDL
R06: a_in1(pkey,suppkey,date,qty,cost) <- ps(pkey,suppkey,date,qty,cost).
R07: a_in2(pkey,source,skey) <- lookUp(l_pkey,source,l_skey), pkey=l_pkey, skey=l_skey, source=1.
R08: a_out(pkey,suppkey,date,qty,cost,skey) <- a_in1(pkey,suppkey,date,qty,cost), a_in2(pkey,source,skey).
R09: D2E.a_in(skey,suppkey,date,qty,cost) <- sk.a_out(pkey,suppkey,date,qty,cost,skey).
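
Read operationally, rules R06 to R08 describe a join between the ps input and the lookup table, restricted to one source. A minimal Python sketch of that reading follows; the function name `surrogate_key_activity`, the tuple layouts, and the assumption that each (pkey, source) pair maps to a single skey are illustrative and not stated in the slides.

```python
def surrogate_key_activity(ps, lookup, source=1):
    """Attach a surrogate key to each ps row via the lookup table.

    ps:     iterable of (pkey, suppkey, date, qty, cost) tuples  (R06)
    lookup: iterable of (pkey, source, skey) tuples              (R07)
    Assumption: at most one skey exists per (pkey, source) pair.
    """
    # R07: keep only lookup entries for the requested source
    skey_of = {pkey: skey for pkey, src, skey in lookup if src == source}

    # R08: join the two inputs on pkey and emit the extended tuple
    return [
        (pkey, suppkey, date, qty, cost, skey_of[pkey])
        for pkey, suppkey, date, qty, cost in ps
        if pkey in skey_of
    ]
```

Rows of ps whose pkey has no lookup entry for the given source produce no output in this sketch, matching the join semantics of the rule body.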

Activities

Zooming in and out