Data integration and transformation 3. Data Exchange Paolo Atzeni Dipartimento di Informatica e Automazione Università Roma Tre 28/10/2009.



References
Ronald Fagin, Laura M. Haas, Mauricio Hernandez, Renee J. Miller, Lucian Popa, and Yannis Velegrakis. "Clio: Schema Mapping Creation and Data Exchange". In A.T. Borgida et al. (Eds.): Mylopoulos Festschrift, LNCS 5600, Springer-Verlag Berlin Heidelberg, 2009, pp. 198–236.
and other papers cited in it

Data exchange
Given a source and a target schema, find a transformation from the former to the latter

Data exchange, a typical approach (the Clio project)
[Figure: a source schema and a target schema, processed by three stages: schema match, mapping generation, query generation]

Simple example
Source: Employee(Id,Name,DeptId), Dept(Id,DeptName) (with FK from DeptId to Dept.Id)
Target: Emp(Code,EmpName,Dept)
Assume we know that
–Employee.Id corresponds to Code
–Name corresponds to EmpName
–DeptName corresponds to Dept
We would like to obtain a query that populates Emp
  SELECT Employee.Id AS Code, Name AS EmpName, DeptName AS Dept
  FROM Employee JOIN Dept ON DeptId = Dept.Id
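The query above can be tried on a small instance. The following is a minimal sketch using Python's built-in sqlite3 module; the sample rows are hypothetical, while the table and column names follow the slide. The Id column is qualified as Employee.Id because both source tables have a column named Id.

```python
import sqlite3

# In-memory database with the two source tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dept (Id INTEGER PRIMARY KEY, DeptName TEXT);
    CREATE TABLE Employee (
        Id INTEGER PRIMARY KEY,
        Name TEXT,
        DeptId INTEGER REFERENCES Dept(Id)
    );
    -- Hypothetical sample rows, not taken from the slides.
    INSERT INTO Dept VALUES (10, 'Sales'), (20, 'Research');
    INSERT INTO Employee VALUES (1, 'Rossi', 10), (2, 'Bianchi', 20);
""")

# The query that populates the target relation Emp(Code, EmpName, Dept).
emp_rows = conn.execute("""
    SELECT Employee.Id AS Code, Name AS EmpName, DeptName AS Dept
    FROM Employee JOIN Dept ON DeptId = Dept.Id
    ORDER BY Code
""").fetchall()

print(emp_rows)
```

Each employee ends up in one Emp tuple together with the name of the department the foreign key points to.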

Better visualization
Source: Employee(Id, Name, DeptId), Dept(Id, DeptName)
Target: Emp(Code, EmpName, Dept)
We want to obtain
  SELECT Employee.Id AS Code, Name AS EmpName, DeptName AS Dept
  FROM Employee JOIN Dept ON DeptId = Dept.Id
and not
  SELECT Id AS Code, Name AS EmpName, NULL AS Dept
  FROM Employee
  UNION
  SELECT NULL AS Code, NULL AS EmpName, DeptName AS Dept
  FROM Dept
nor
  SELECT Id AS Code, NULL AS EmpName, NULL AS Dept
  FROM Employee
  UNION …
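To see why the union is not what we want, the two candidate queries can be run side by side. This is a minimal sketch with Python's sqlite3 module and one hypothetical employee and department; only the table and column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dept (Id INTEGER PRIMARY KEY, DeptName TEXT);
    CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Name TEXT, DeptId INTEGER);
    -- Hypothetical sample rows.
    INSERT INTO Dept VALUES (10, 'Sales');
    INSERT INTO Employee VALUES (1, 'Rossi', 10);
""")

# Desired query: the join keeps employee and department data together.
joined = conn.execute("""
    SELECT Employee.Id AS Code, Name AS EmpName, DeptName AS Dept
    FROM Employee JOIN Dept ON DeptId = Dept.Id
""").fetchall()

# Undesired query: the union pads each side with NULLs.
padded = conn.execute("""
    SELECT Id AS Code, Name AS EmpName, NULL AS Dept FROM Employee
    UNION
    SELECT NULL AS Code, NULL AS EmpName, DeptName AS Dept FROM Dept
""").fetchall()

print(joined)
print(padded)
```

The join yields one complete tuple, while the union yields two tuples, each with part of the information missing, so the link between the employee and the department is lost.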

The main issue
How do we discover we should use a join and not one or two unions?
Attributes that appear together in a relation
–Id,Name in the source and Code,EmpName in the target
The foreign key

Data exchange, another example
Source: PayRate(Rank, HrRate), Professor(Id, Name, Sal), Student(Name, GPA, Yr), WorksOn(Name, Proj, Hrs, ProjRank), Address(Id, Addr)
Target: Personnel(Id, Name, Sal, Addr)
Foreign keys
–between the two Id
–between ProjRank and Rank
–between the two Name

Data exchange, example
Assume we are given correspondences, which involve functions:
–Usually identity
–PayRate(HrRate)*WorksOn(Hrs) → Personnel(Sal)

Data exchange, example
How do we combine HrRate and Hrs?
–Via a join suggested by foreign keys
Foreign key between ProjRank and Rank suggests a join
Foreign keys over Name and between Yr and Rank suggest another

Heuristic
We have many correspondences
Group correspondences in such a way that each set contains at most one correspondence for each attribute in the target
We are interested in sets where the source attributes are either in the same relation or in relations whose join is meaningful
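The grouping idea can be sketched in a few lines of Python. This is an illustrative simplification, not the Clio algorithm: it treats source relations linked by foreign keys as joinable, groups correspondences by the connected set of relations they use, and then checks that no group has two correspondences for the same target attribute. The list of correspondences is an assumption read off the running example, and the function names are hypothetical.

```python
from collections import defaultdict

# Foreign-key links between source relations (from the example).
fk_edges = [
    ("Professor", "Address"),   # between the two Id
    ("WorksOn", "PayRate"),     # between ProjRank and Rank
    ("WorksOn", "Student"),     # between the two Name
]

# Correspondences as (source relations used, target attribute of Personnel).
# Assumed for illustration; the slides do not list them explicitly.
correspondences = [
    (("Professor",), "Id"),
    (("Professor",), "Name"),
    (("Professor",), "Sal"),
    (("Address",), "Addr"),
    (("Student",), "Name"),
    (("PayRate", "WorksOn"), "Sal"),   # HrRate * Hrs
]


def find(parent, r):
    """Union-find lookup with path halving."""
    while parent[r] != r:
        parent[r] = parent[parent[r]]
        r = parent[r]
    return r


def group_correspondences(fk_edges, correspondences):
    relations = {r for edge in fk_edges for r in edge}
    relations |= {r for rels, _ in correspondences for r in rels}
    parent = {r: r for r in relations}
    # Relations linked by a foreign key are considered joinable.
    for a, b in fk_edges:
        parent[find(parent, a)] = find(parent, b)
    groups = defaultdict(list)
    for rels, target in correspondences:
        roots = {find(parent, r) for r in rels}
        if len(roots) != 1:
            raise ValueError(f"no meaningful join for {rels}")
        groups[roots.pop()].append((rels, target))
    # Each group may hold at most one correspondence per target attribute.
    for members in groups.values():
        targets = [t for _, t in members]
        if len(targets) != len(set(targets)):
            raise ValueError("two correspondences for one target attribute")
    return sorted(sorted(members) for members in groups.values())


groups = group_correspondences(fk_edges, correspondences)
for members in groups:
    print(members)
```

On this input the sketch produces two groups, one over Professor and Address and one over PayRate, Student and WorksOn. A fuller implementation would split a group instead of raising an error when two correspondences compete for the same target attribute.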

Partition the correspondences
… and for each partition the joins are meaningful

The process, example
  SELECT P.Id, P.Name, P.Sal, A.Addr
  FROM Professor P, Address A
  WHERE A.Id = P.Id
  UNION ALL
  SELECT NULL AS Id, S.Name, P.HrRate * W.Hrs, NULL AS Addr
  FROM PayRate P, Student S, WorksOn W
  WHERE W.Name = S.Name AND S.Yr = P.Rank
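The generated query can be run on a toy instance to see what each branch of the union contributes. This is a minimal sketch with Python's sqlite3 module; all rows and the column types are hypothetical, and only the schema and the query follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Professor (Id INTEGER, Name TEXT, Sal REAL);
    CREATE TABLE Address (Id INTEGER, Addr TEXT);
    CREATE TABLE PayRate (Rank TEXT, HrRate REAL);
    CREATE TABLE Student (Name TEXT, GPA REAL, Yr TEXT);
    CREATE TABLE WorksOn (Name TEXT, Proj TEXT, Hrs REAL, ProjRank TEXT);
    -- Hypothetical sample rows.
    INSERT INTO Professor VALUES (1, 'Smith', 5000);
    INSERT INTO Address VALUES (1, 'Rome');
    INSERT INTO PayRate VALUES ('Junior', 10);
    INSERT INTO Student VALUES ('Lee', 3.5, 'Junior');
    INSERT INTO WorksOn VALUES ('Lee', 'Clio', 20, 'Junior');
""")

# One branch per group of correspondences; each branch joins the
# relations of its group and pads the missing target attributes with NULL.
personnel = conn.execute("""
    SELECT P.Id, P.Name, P.Sal, A.Addr
    FROM Professor P, Address A
    WHERE A.Id = P.Id
    UNION ALL
    SELECT NULL AS Id, S.Name, P.HrRate * W.Hrs, NULL AS Addr
    FROM PayRate P, Student S, WorksOn W
    WHERE W.Name = S.Name AND S.Yr = P.Rank
""").fetchall()

for row in personnel:
    print(row)
```

The professor arrives with all four attributes, while the student arrives with a computed salary (hourly rate times hours) and NULL for Id and Addr, which have no correspondence in that group.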

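More complex example (with nesting)
Source schema
–Companies(Name, Address, Year)
–Grants(Gid, Recipient, Amount, Supervisor, Manager)
–Contacts(Cid, Phone)
–Foreign keys f1, f2, f3
Target schema
–Organizations(Code, Year, Fundings), where Fundings is a nested relation with attributes FId, FinId
–Finances(FinId, Budget, Phone)
–Foreign key f4
Nested relation: example instance of Organizations with codes HAL, SM, PH, where the nested Fundings of HAL contains FId 301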

Correspondences (given by a "schema matcher")
[Figure: the source and target schemas of the previous slide, with correspondences v1, v2, v3, v4 drawn between source and target attributes, and foreign keys f1, f2, f3, f4]

Let us formalize correspondences
v1: ∀ n,d,y Companies(n,d,y) → ∃ y',F Organizations(n,y',F)
v2: ∀ g,r,a,s,m Grants(g,r,a,s,m) → ∃ c,y,F,f Organiz…(c,y,F), F(g,f)
v3: ∀ g,r,a,s,m Grants(g,r,a,s,m) → ∃ f,p Finances(f,a,p)
v4: ∀ c,e,p Contacts(c,e,p) → ∃ f,b Finances(f,b,p)

Correspondences alone are not enough
v1: ∀ n,d,y Companies(n,d,y) → ∃ y',F Organizations(n,y',F)
v2: ∀ g,r,a,s,m Grants(g,r,a,s,m) → ∃ c,y,F,f Organiz…(c,y,F), F(g,f)
v3: ∀ g,r,a,s,m Grants(g,r,a,s,m) → ∃ f,p Finances(f,a,p)
v4: ∀ c,e,p Contacts(c,e,p) → ∃ f,b Finances(f,b,p)
Taken one at a time, the correspondences do not say which grants belong to which organization: in v2 the organization code c is existentially quantified and is not related to the recipient r.
Source instance
  Companies
  Name  Address  Year
  HAL   NY       1920
  SM    Seattle  1984
  PH    SF       1957
  Grants
  GId  Rec.t  Amt
  301  HAL    30
  302  HAL    40
  303  PH     30
Target instance
  Organizations with codes HAL, SM, PH (nested Fundings with columns FId, FinId)

More complex mappings are needed, representing associations
∀ n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) → ∃ y',F,f Organizations(n,y',F), F(g,f)
v3: ∀ g,r,a,s,m Grants(g,r,a,s,m) → ∃ f,p Finances(f,a,p)
v4: ∀ c,e,p Contacts(c,e,p) → ∃ f,b Finances(f,b,p)
Source instance
  Companies
  Name  Address  Year
  HAL   NY       1920
  SM    Seattle  1984
  PH    SF       1957
  Grants
  GId  Rec.t  Amt
  301  HAL    30
  302  HAL    40
  303  PH     30
Target instance
  Organizations with codes HAL, SM, PH (nested Fundings with columns FId, FinId; FId 301 appears under HAL)
Note: The "association" between companies and grants in the source is suggested by f1 (a foreign key)

Yet more complex
∀ n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) → ∃ y',F,f,p Organizations(n,y',F), F(g,f), Finances(f,a,p)
Notes:
–Three tuples are generated for each pair of related companies and grants
–The mapping specifies that there exists an f, appearing in two places, without saying what its value should be

A final issue
How do we obtain the phone to be put in finances?
–Is it the supervisor's one or the manager's?
–FKs suggest either (or even both)
–Human intervention is needed to choose

Various solutions in nested cases with possibly undesirable features
Source instance
  Companies
  Name  Address  Year
  HAL   NY       1920
  SM    Seattle  1984
  PH    SF       1957
  Grants
  GId  Rec.t  Amt
  301  HAL    30
  302  HAL    40
  303  PH     30
Target instance (all fundings share the same FinId k1)
  Organizations
  Code  Year  Fundings (FId, FinId)
  HAL         (301, k1), (302, k1)
  SM
  PH          (303, k1)
  Finances
  FinId  Budget  Phone
  k1     30
  k1     40
  k1     30

A better solution
Source instance
  Companies
  Name  Address  Year
  HAL   NY       1920
  SM    Seattle  1984
  PH    SF       1957
  Grants
  GId  Rec.t  Amt
  301  HAL    30
  302  HAL    40
  303  PH     30
Target instance (each funding has its own FinId)
  Organizations
  Code  Year  Fundings (FId, FinId)
  HAL         (301, k1), (302, k2)
  SM
  PH          (303, k3)
  Finances
  FinId  Budget  Phone
  k1     30
  k2     40
  k3     30
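One way to obtain a target instance of this second kind is to compute each FinId as a function of the grant it comes from, so that distinct grants get distinct values. The sketch below assumes this approach for illustration; the slides do not say how the values k1, k2, k3 are chosen, and the helper names fin_id and exchange are hypothetical. The data is the instance shown on the slide.

```python
# Source instance from the slide.
companies = [("HAL", "NY", 1920), ("SM", "Seattle", 1984), ("PH", "SF", 1957)]
grants = [(301, "HAL", 30), (302, "HAL", 40), (303, "PH", 30)]


def fin_id(grant_id):
    """Hypothetical invented value: one distinct FinId per grant."""
    return f"FinId({grant_id})"


def exchange(companies, grants):
    # Every company yields an organization (correspondence v1).
    organizations = {name: [] for name, _addr, _year in companies}
    finances = []
    for gid, recipient, amount in grants:
        # Association between companies and grants suggested by f1.
        if recipient in organizations:
            key = fin_id(gid)
            organizations[recipient].append((gid, key))
            finances.append((key, amount, None))  # phone left unknown
    return organizations, finances


organizations, finances = exchange(companies, grants)
print(organizations)
print(finances)
```

Because the same invented value is used in Fundings and in Finances, each funding points to exactly one finance tuple, as in the instance with k1, k2, k3.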

A more verbose notation for mappings
∀ n,d,y,g,a,s,m Companies(n,d,y), Grants(g,n,a,s,m) → ∃ y',F,f,p Organizations(n,y',F), F(g,f), Finances(f,a,p)
  foreach c in companies, g in grants
  where c.name = g.recipient
  exists o in organizations, f in o.fundings, i in finances
  where f.finId = i.finId
  with o.code = c.name and f.fId = g.gId and i.budget = g.amount
–foreach ... where: query on the source
–exists ... where: query on the target
–with: correspondences

The mapping as a source-to-target constraint
  foreach c in companies, g in grants
  where c.name = g.recipient
  exists o in organizations, f in o.fundings, i in finances
  where f.finId = i.finId
  with o.code = c.name and f.fId = g.gId and i.budget = g.amount
Q_S is the foreach part (query on the source), Q_T is the exists part (query on the target)
Q_S ⊆ Q_T
"the result of Q_T (over the target, projected as in the with-clause) must contain the result of Q_S (over the source, projected as in the with-clause)"
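The containment reading of the mapping can be checked directly on small instances. The following Python sketch is an illustration under stated assumptions: relations are encoded as lists of dictionaries, attributes not used by the mapping are omitted, and the function names q_source, q_target and satisfies are hypothetical. The data is the example instance with FinIds k1, k2, k3.

```python
def q_source(companies, grants):
    """foreach c in companies, g in grants where c.name = g.recipient,
    projected on the source expressions of the with-clause."""
    return {
        (c["name"], g["gId"], g["amount"])
        for c in companies
        for g in grants
        if c["name"] == g["recipient"]
    }


def q_target(organizations, finances):
    """exists o in organizations, f in o.fundings, i in finances
    where f.finId = i.finId, projected on the target expressions."""
    return {
        (o["code"], f["fId"], i["budget"])
        for o in organizations
        for f in o["fundings"]
        for i in finances
        if f["finId"] == i["finId"]
    }


def satisfies(source, target):
    # The mapping holds when the source result is contained in the target result.
    return q_source(*source) <= q_target(*target)


companies = [{"name": "HAL"}, {"name": "SM"}, {"name": "PH"}]
grants = [
    {"gId": 301, "recipient": "HAL", "amount": 30},
    {"gId": 302, "recipient": "HAL", "amount": 40},
    {"gId": 303, "recipient": "PH", "amount": 30},
]
organizations = [
    {"code": "HAL", "fundings": [{"fId": 301, "finId": "k1"},
                                 {"fId": 302, "finId": "k2"}]},
    {"code": "SM", "fundings": []},
    {"code": "PH", "fundings": [{"fId": 303, "finId": "k3"}]},
]
finances = [
    {"finId": "k1", "budget": 30},
    {"finId": "k2", "budget": 40},
    {"finId": "k3", "budget": 30},
]

ok = satisfies((companies, grants), (organizations, finances))
# Dropping the last finance tuple leaves grant 303 without a match.
broken = satisfies((companies, grants), (organizations, finances[:2]))
print(ok, broken)
```

Note that the check is a containment, not an equality: a target instance may contain more than the mapping requires and still satisfy it.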

Syntax and restrictions
  foreach x_1 in g_1, ..., x_n in g_n
  where B_1
  exists y_1 in g'_1, ..., y_m in g'_m
  where B_2
  with e_1 = e'_1 and ... and e_k = e'_k
Example
  foreach c in companies, g in grants
  where c.name = g.recipient
  exists o in organizations, f in o.fundings, i in finances
  where f.finId = i.finId
  with o.code = c.name and f.fId = g.gId and i.budget = g.amount
x_i in g_i (generator)
–x_i variable
–g_i set (either the root or a set nested within it)
B_1 conjunction of equalities over the x_i variables
y_i in g'_i, B_2 similar
e_1 = e'_1 … equalities between a source expression and a target expression
Restrictions: See paper, page 210, lines 5+: "The mapping is well formed …"

Schema constraints
Referential integrity is essential in this approach as the basis for the discovery of "associations"
Given the nested model, they need a rather complex definition
So, two steps
–Paths (primary paths and relative paths)
–Nested referential integrity (NRI) constraints

Primary paths
Primary path (given a schema root R, that is a first level element in the schema):
–x_1 in g_1, x_2 in g_2, …, x_n in g_n
where g_1 is an expression on R (just R?) and g_i (for i ≥ 2) is an expression on x_{i-1}
Examples
–c in companies
–o in organizations
–o in organizations, f in o.fundings
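Primary paths can be enumerated mechanically from the nesting structure of a schema. The sketch below is an illustration with assumed conventions: a schema is encoded as a dictionary that maps each set name to the sets nested inside it, variables are named x1, x2, … instead of c, o, f, and the function names are hypothetical.

```python
# Target schema of the running example: Fundings is nested in Organizations.
target_schema = {"organizations": {"fundings": {}}, "finances": {}}


def primary_paths(schema):
    """Return every primary path as a list of (variable, set expression)."""
    paths = []

    def walk(set_expr, nested, prefix, depth):
        var = f"x{depth}"
        path = prefix + [(var, set_expr)]
        paths.append(path)
        # Each nested set is reached through the variable just introduced.
        for child, grandchildren in nested.items():
            walk(f"{var}.{child}", grandchildren, path, depth + 1)

    for root, nested in schema.items():
        walk(root, nested, [], 1)
    return paths


def show(path):
    return ", ".join(f"{var} in {expr}" for var, expr in path)


rendered = [show(p) for p in primary_paths(target_schema)]
for line in rendered:
    print(line)
```

Up to variable names, the first two lines printed correspond to the examples "o in organizations" and "o in organizations, f in o.fundings".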

Relative paths
Primary path (given a schema root R, that is a first level element in the schema):
–x_1 in g_1, x_2 in g_2, …, x_n in g_n
where g_1 is an expression on R (just R?) and g_i (for i ≥ 2) is an expression on x_{i-1}
Relative path with respect to a variable x
–x_1 in g_1, x_2 in g_2, …, x_n in g_n
where g_1 is an expression on x (just x?) and g_i (for i ≥ 2) is an expression on x_{i-1}
Example
–f in o.fundings

Nested referential integrity (NRI) constraints
  foreach P_1 exists P_2 where B
–P_1 is a primary path
–P_2 is either a primary path or a relative path with respect to a variable in P_1
–B is a conjunction of equalities between an expression on a variable of P_1 and an expression on a variable of P_2
Example (foreign key f4 in the target schema Organizations(Code, Year, Fundings(FId, FinId)), Finances(FinId, Budget, Phone))
  foreach o in organizations, f in o.fundings
  exists i in finances
  where f.finId = i.finId
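The example constraint can be checked on an instance by following the primary path and looking for a matching finance tuple. This Python sketch assumes the same dictionary encoding of nested relations used for illustration only; the function name nri_violations is hypothetical and the data is the example instance with FinIds k1, k2, k3.

```python
def nri_violations(organizations, finances):
    """Fundings (reached via o in organizations, f in o.fundings)
    for which no i in finances has i.finId = f.finId."""
    fin_ids = {i["finId"] for i in finances}
    return [
        (o["code"], f["fId"])
        for o in organizations
        for f in o["fundings"]
        if f["finId"] not in fin_ids
    ]


organizations = [
    {"code": "HAL", "fundings": [{"fId": 301, "finId": "k1"},
                                 {"fId": 302, "finId": "k2"}]},
    {"code": "SM", "fundings": []},
    {"code": "PH", "fundings": [{"fId": 303, "finId": "k3"}]},
]
finances = [
    {"finId": "k1", "budget": 30},
    {"finId": "k2", "budget": 40},
    {"finId": "k3", "budget": 30},
]

clean = nri_violations(organizations, finances)
# Removing the k3 tuple leaves funding 303 dangling.
dangling = nri_violations(organizations, finances[:2])
print(clean, dangling)
```

An empty list means the constraint holds; each reported pair identifies an organization and a funding whose FinId has no counterpart in Finances.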