Published by Roderick Reed. Modified over 9 years ago.
1
Data integration and transformation
3. Data Exchange
Paolo Atzeni
Dipartimento di Informatica e Automazione, Università Roma Tre
28/10/2009
2
References
Ronald Fagin, Laura M. Haas, Mauricio Hernandez, Renee J. Miller, Lucian Popa, and Yannis Velegrakis. "Clio: Schema Mapping Creation and Data Exchange". In A.T. Borgida et al. (Eds.): Mylopoulos Festschrift, LNCS 5600, Springer-Verlag Berlin Heidelberg, 2009, pp. 198–236.
And other papers cited in it.
P. Atzeni, ITD - 3 - 28/10/2009
3
Data exchange
Given a source and a target schema, find a transformation from the former to the latter.
4
Data exchange, a typical approach (the Clio project)
Figure: the source schema and the target schema are the input to three steps: schema match, mapping generation, query generation.
5
Simple example
Source: Employee(Id,Name,DeptId), Dept(Id,DeptName) (with FK from DeptId to Dept.Id)
Target: Emp(Code,EmpName,Dept)
Assume we know that
–Employee.Id corresponds to Code
–Name corresponds to EmpName
–DeptName corresponds to Dept
We would like to obtain a query that populates Emp:
    SELECT Employee.Id AS Code, Name AS EmpName, DeptName AS Dept
    FROM Employee JOIN Dept ON DeptId = Dept.Id
6
Better visualization
Figure: Employee (Id, Name, DeptId) and Dept (Id, DeptName) on the source side, Emp (Code, EmpName, Dept) on the target side.
We want to obtain
    SELECT Employee.Id AS Code, Name AS EmpName, DeptName AS Dept
    FROM Employee JOIN Dept ON DeptId = Dept.Id
and not
    SELECT Id AS Code, Name AS EmpName, NULL AS Dept FROM Employee
    UNION
    SELECT NULL AS Code, NULL AS EmpName, DeptName AS Dept FROM Dept
nor
    SELECT Id AS Code, NULL AS EmpName, NULL AS Dept FROM Employee
    UNION …
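The difference between the join and the union can be seen on a tiny instance. The following is a minimal sketch using Python's built-in sqlite3 module; the sample rows are hypothetical and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dept (Id INTEGER PRIMARY KEY, DeptName TEXT);
    CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Name TEXT,
                           DeptId INTEGER REFERENCES Dept(Id));
    -- Hypothetical sample rows, not taken from the slides.
    INSERT INTO Dept VALUES (10, 'Sales'), (20, 'Research');
    INSERT INTO Employee VALUES (1, 'Ann', 10), (2, 'Bob', 20);
""")

# Desired query: one complete Emp tuple per employee.
join_rows = conn.execute("""
    SELECT Employee.Id AS Code, Name AS EmpName, DeptName AS Dept
    FROM Employee JOIN Dept ON DeptId = Dept.Id
    ORDER BY Code
""").fetchall()

# Undesired query: employees and departments are never associated,
# so every resulting tuple is padded with NULLs.
union_rows = conn.execute("""
    SELECT Id AS Code, Name AS EmpName, NULL AS Dept FROM Employee
    UNION
    SELECT NULL AS Code, NULL AS EmpName, DeptName AS Dept FROM Dept
""").fetchall()

print(join_rows)
print(union_rows)
```

The join keeps each employee together with the name of its department, while the union yields one incomplete tuple per source row.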
7
The main issue
How do we discover that we should use a join and not one or two unions?
–Attributes that appear together in a relation: Id, Name in the source and Code, EmpName in the target
–The foreign key
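One way to picture how the two hints combine is a toy query generator: correspondences say which source columns feed which target attributes, and a foreign key supplies the join condition when the columns live in different relations. This is only an illustrative sketch, not Clio's algorithm; the function names and the input encoding are assumptions made for this example.

```python
def relation_of(column):
    """Return the relation part of a 'Relation.Attribute' name."""
    return column.split(".")[0]


def generate_query(correspondences, foreign_keys):
    """Build a SELECT that joins the source relations along foreign keys.

    correspondences: dict target attribute -> 'Relation.Attribute' (hypothetical encoding)
    foreign_keys: list of ('Rel.Attr', 'Rel.Attr') pairs
    """
    # Source relations, in order of first use, without duplicates.
    relations = list(dict.fromkeys(relation_of(s) for s in correspondences.values()))
    select = ", ".join(f"{s} AS {t}" for t, s in correspondences.items())
    joined = [relations[0]]
    from_clause = relations[0]
    for rel in relations[1:]:
        # Look for a foreign key between rel and a relation already in the query.
        link = next(
            ((a, b) for a, b in foreign_keys
             if rel in (relation_of(a), relation_of(b))
             and (relation_of(a) in joined or relation_of(b) in joined)),
            None,
        )
        if link is None:
            raise ValueError(f"no foreign key connects {rel} to the query")
        from_clause += f" JOIN {rel} ON {link[0]} = {link[1]}"
        joined.append(rel)
    return f"SELECT {select} FROM {from_clause}"


query = generate_query(
    {"Code": "Employee.Id", "EmpName": "Employee.Name", "Dept": "Dept.DeptName"},
    [("Employee.DeptId", "Dept.Id")],
)
print(query)
```

Without the foreign key the generator has no meaningful way to combine Employee and Dept, which is exactly the situation in which only the union queries remain.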
8
Data exchange, another example
Source:
    PayRate (Rank, HrRate)
    Professor (Id, Name, Sal)
    Student (Name, GPA, Yr)
    WorksOn (Name, Proj, Hrs, ProjRank)
    Address (Id, Addr)
Target:
    Personnel (Id, Name, Sal, Addr)
Foreign keys
–between the two Id
–between ProjRank and Rank
–between the two Name
9
Data exchange, example
Schemas as on slide 8.
Assume we are given correspondences, which involve functions:
–Usually identity
–PayRate(HrRate)*WorksOn(Hrs) → Personnel(Sal)
10
Data exchange, example
Schemas as on slide 8.
How do we combine HrRate and Hrs?
–Via a join suggested by foreign keys
–The foreign key between ProjRank and Rank suggests a join
–The foreign keys over Name and between Yr and Rank suggest another
11
Heuristic
We have many correspondences.
Group the correspondences in such a way that each set contains at most one correspondence for each attribute in the target.
We are interested in sets where the source attributes are either in the same relation or in relations whose join is meaningful.
12
Partition the correspondences
Schemas as on slide 8.
… and for each partition the joins are meaningful.
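The grouping heuristic can be sketched as follows. As a simplifying assumption, "relations whose join is meaningful" is approximated here by "relations connected through foreign keys"; the real system is more refined. The list of correspondences is read off the query on the next slide, and the encoding and function names are hypothetical.

```python
from collections import defaultdict

# Foreign keys of the source schema, as pairs of relations.
FOREIGN_KEYS = [
    ("Professor", "Address"),  # between the two Id
    ("WorksOn", "PayRate"),    # between ProjRank and Rank
    ("WorksOn", "Student"),    # between the two Name
]

# (source relations involved, target attribute of Personnel)
CORRESPONDENCES = [
    (("Professor",), "Id"),
    (("Professor",), "Name"),
    (("Professor",), "Sal"),
    (("Address",), "Addr"),
    (("Student",), "Name"),
    (("PayRate", "WorksOn"), "Sal"),  # HrRate * Hrs
]


def components(relations, foreign_keys):
    """Map each relation to a representative of its foreign-key component."""
    parent = {r: r for r in relations}

    def find(r):
        while parent[r] != r:
            r = parent[r]
        return r

    for a, b in foreign_keys:
        parent[find(a)] = find(b)
    return {r: find(r) for r in relations}


def partition(correspondences, foreign_keys):
    """Group correspondences whose source relations can be joined via foreign keys."""
    relations = {r for rels, _ in correspondences for r in rels}
    relations |= {r for fk in foreign_keys for r in fk}
    comp = components(relations, foreign_keys)
    groups = defaultdict(list)
    for rels, target in correspondences:
        roots = {comp[r] for r in rels}
        if len(roots) != 1:
            raise ValueError(f"no meaningful join among {rels}")
        groups[roots.pop()].append(target)
    for targets in groups.values():
        # At most one correspondence per target attribute in each set.
        if len(targets) != len(set(targets)):
            raise ValueError("two correspondences for the same target attribute")
    return sorted(sorted(targets) for targets in groups.values())


groups = partition(CORRESPONDENCES, FOREIGN_KEYS)
print(groups)
```

On this input the sketch yields two sets, one fed by Professor and Address and one fed by PayRate, Student and WorksOn, which correspond to the two branches of the union on the next slide.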
13
The process, example
Schemas as on slide 8.
    SELECT P.Id, P.Name, P.Sal, A.Addr
    FROM Professor P, Address A
    WHERE A.Id = P.Id
    UNION ALL
    SELECT NULL AS Id, S.Name, P.HrRate * W.Hrs, NULL AS Addr
    FROM PayRate P, Student S, WorksOn W
    WHERE W.Name = S.Name AND S.Yr = P.Rank
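To see what the generated query produces, it can be run on a small instance. This is a sketch with Python's sqlite3 module; the column types and all sample rows are hypothetical, chosen only so that each branch of the union returns one tuple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PayRate (Rank TEXT, HrRate REAL);
    CREATE TABLE Professor (Id INTEGER, Name TEXT, Sal REAL);
    CREATE TABLE Student (Name TEXT, GPA REAL, Yr TEXT);
    CREATE TABLE WorksOn (Name TEXT, Proj TEXT, Hrs REAL, ProjRank TEXT);
    CREATE TABLE Address (Id INTEGER, Addr TEXT);
    -- Hypothetical sample rows, not taken from the slides.
    INSERT INTO PayRate VALUES ('senior', 20.0);
    INSERT INTO Professor VALUES (1, 'Rossi', 5000.0);
    INSERT INTO Address VALUES (1, 'Rome');
    INSERT INTO Student VALUES ('Bianchi', 3.5, 'senior');
    INSERT INTO WorksOn VALUES ('Bianchi', 'Clio', 10.0, 'senior');
""")

personnel = conn.execute("""
    SELECT P.Id, P.Name, P.Sal, A.Addr
    FROM Professor P, Address A
    WHERE A.Id = P.Id
    UNION ALL
    SELECT NULL AS Id, S.Name, P.HrRate * W.Hrs, NULL AS Addr
    FROM PayRate P, Student S, WorksOn W
    WHERE W.Name = S.Name AND S.Yr = P.Rank
""").fetchall()

for row in personnel:
    print(row)
```

The professor arrives with all four attributes, while the student gets a computed salary but NULL for Id and Addr, since no correspondence in that partition provides them.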
14
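More complex example (with nesting)
Source:
    Companies (Name, Address, Year)
    Grants (GId, Recipient, Amount, Supervisor, Manager)
    Contacts (Cid, Email, Phone)
Target:
    Organizations (Code, Year, Fundings (FId, FinId))
    Finances (FinId, Budget, Phone)
Foreign keys (arrows in the figure): f1, f2, f3 in the source; f4 in the target
Nested relation: Fundings is a set nested in each Organizations tuple. The figure shows an Organizations instance with Code values HAL, SM, PH and Fundings with FId values 301, 302, 303.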
15
Correspondences (given by a "schema matcher")
Schemas and foreign keys f1, f2, f3, f4 as on slide 14.
The figure adds four correspondences between source and target attributes: v1, v2, v3, v4.
16
Let us formalize correspondences
Schemas, foreign keys and correspondences as on slide 15.
    v1: ∀ n,d,y ( Companies(n,d,y) → ∃ y',F Organizations(n,y',F) )
    v2: ∀ g,r,a,s,m ( Grants(g,r,a,s,m) → ∃ c,y,F,f ( Organiz…(c,y,F), F(g,f) ) )
    v3: ∀ g,r,a,s,m ( Grants(g,r,a,s,m) → ∃ f,p Finances(f,a,p) )
    v4: ∀ c,e,p ( Contacts(c,e,p) → ∃ f,b Finances(f,b,p) )
17
Correspondences alone are not enough
Schemas, foreign keys and correspondences as on slide 15.
    v1: ∀ n,d,y ( Companies(n,d,y) → ∃ y',F Organizations(n,y',F) )
    v2: ∀ g,r,a,s,m ( Grants(g,r,a,s,m) → ∃ c,y,F,f ( Organiz…(c,y,F), F(g,f) ) )
    v3: ∀ g,r,a,s,m ( Grants(g,r,a,s,m) → ∃ f,p Finances(f,a,p) )
    v4: ∀ c,e,p ( Contacts(c,e,p) → ∃ f,b Finances(f,b,p) )
Source instance:
    Companies
    Name  Address  Year
    HAL   NY       1920
    SM    Seattle  1984
    PH    SF       1957

    Grants
    GId  Rec.t  Amt
    301  HAL    30
    302  HAL    40
    303  PH     30
Target instance (figure): Organizations, showing Code values HAL, SM, PH and Fundings FId values 301, 302.
18
More complex mappings are needed, representing associations
Schemas, foreign keys and correspondences as on slide 15. Source instance as on slide 17.
    ∀ n,d,y,g,a,s,m ( Companies(n,d,y), Grants(g,n,a,s,m) → ∃ y',F,f ( Organizations(n,y',F), F(g,f) ) )
    v3: ∀ g,r,a,s,m ( Grants(g,r,a,s,m) → ∃ f,p Finances(f,a,p) )
    v4: ∀ c,e,p ( Contacts(c,e,p) → ∃ f,b Finances(f,b,p) )
Target instance:
    Organizations
    Code  Year  Fundings (FId)
    HAL         301, 302
    SM
    PH          303
Note: the "association" between companies and grants in the source is suggested by f1 (a foreign key).
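The effect of the association can be sketched on the sample instance. The code below is a hedged illustration, not the system's generated query: it creates one organization per company (as v1 would) and nests each grant under the organization of its recipient, following f1. The dictionary layout and the function name are assumptions; existentially quantified values (Year, FinId) are simply left as None.

```python
# Source instance from the slides: (Name, Address, Year) and (GId, Recipient, Amount).
companies = [("HAL", "NY", 1920), ("SM", "Seattle", 1984), ("PH", "SF", 1957)]
grants = [(301, "HAL", 30), (302, "HAL", 40), (303, "PH", 30)]


def exchange(companies, grants):
    """Build nested Organizations tuples from Companies joined with Grants."""
    # One organization per company; Code comes from Name.
    orgs = {
        name: {"code": name, "year": None, "fundings": []}
        for name, _address, _year in companies
    }
    for gid, recipient, _amount in grants:
        # Join on f1: Grants.Recipient references Companies.Name.
        if recipient in orgs:
            orgs[recipient]["fundings"].append({"fId": gid, "finId": None})
    return list(orgs.values())


organizations = exchange(companies, grants)
for org in organizations:
    print(org["code"], [f["fId"] for f in org["fundings"]])
```

Each funding now sits inside the organization it belongs to, which the separate correspondences v1 and v2 could not guarantee.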
19
Yet more complex
Schemas, foreign keys and correspondences as on slide 15.
    ∀ n,d,y,g,a,s,m ( Companies(n,d,y), Grants(g,n,a,s,m) → ∃ y',F,f,p ( Organizations(n,y',F), F(g,f), Finances(f,a,p) ) )
Notes:
–Three tuples are generated for each pair of related companies and grants
–The mapping specifies that there exists an f, appearing in two places, without saying what its value should be
20
A final issue
Schemas, foreign keys and correspondences as on slide 15.
How do we obtain the phone to be put in Finances?
–Is it the supervisor's or the manager's?
–FKs suggest either (or even both)
–Human intervention is needed to choose
21
Various solutions in nested cases with possibly undesirable features
Source instance as on slide 17.
Target instance:
    Organizations
    Code  Year  Fundings (FId, FinId)
    HAL         (301, k1), (302, k1)
    SM
    PH          (303, k1)

    Finances
    FinId  Budget  Phone
    k1     30
    k1     40
    k1     30
22
A better solution
Source instance as on slide 17.
Target instance:
    Organizations
    Code  Year  Fundings (FId, FinId)
    HAL         (301, k1), (302, k2)
    SM
    PH          (303, k3)

    Finances
    FinId  Budget  Phone
    k1     30
    k2     40
    k3     30
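One common way to obtain distinct values for the unknown FinId is to generate it with a function of the source values it depends on (a Skolem function). The sketch below is an assumption-laden illustration: the helper `make_skolem` and the k1, k2, … naming are hypothetical, chosen to mirror the labels on the slides.

```python
from itertools import count


def make_skolem(prefix="k"):
    """Return a function that maps each distinct argument tuple to a fresh label."""
    counter = count(1)
    seen = {}

    def skolem(*args):
        if args not in seen:
            seen[args] = f"{prefix}{next(counter)}"
        return seen[args]

    return skolem


grants = [(301, "HAL", 30), (302, "HAL", 40), (303, "PH", 30)]  # GId, Recipient, Amount

# Undesirable solution: the FinId does not depend on the grant,
# so every funding shares the same value.
same_for_all = make_skolem()
finances_bad = [(same_for_all(), amount) for _gid, _rec, amount in grants]

# Better solution: the FinId depends on the grant identifier,
# so each grant gets its own Finances tuple.
per_grant = make_skolem()
finances_good = [(per_grant(gid), amount) for gid, _rec, amount in grants]

print(finances_bad)
print(finances_good)
```

Calling the function again with the same grant identifier returns the same label, which is what lets Fundings and Finances agree on the FinId.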
23
A more verbose notation for mappings
Schemas, foreign keys and correspondences as on slide 15.
    ∀ n,d,y,g,a,s,m ( Companies(n,d,y), Grants(g,n,a,s,m) → ∃ y',F,f,p ( Organizations(n,y',F), F(g,f), Finances(f,a,p) ) )
becomes
    foreach c in companies, g in grants
            where c.name = g.recipient                                      (query on the source)
    exists  o in organizations, f in o.fundings, i in finances
            where f.finId = i.finId                                         (query on the target)
    with    o.code = c.name and f.fId = g.gId and i.budget = g.amount       (correspondences)
24
The mapping as a source-to-target constraint
Schemas, foreign keys and correspondences as on slide 15.
    foreach c in companies, g in grants
            where c.name = g.recipient                                      (QS)
    exists  o in organizations, f in o.fundings, i in finances
            where f.finId = i.finId                                         (QT)
    with    o.code = c.name and f.fId = g.gId and i.budget = g.amount
"The result of QT (over the target, projected as in the with-clause) must contain the result of QS (over the source, projected as in the with-clause)."
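Read as a constraint, the mapping can be checked on a pair of instances by computing both projections and testing containment. The sketch below assumes a dictionary encoding of the instances; the target instance is the "better solution" as reconstructed from the slides, and the function names are hypothetical.

```python
companies = [{"name": "HAL"}, {"name": "SM"}, {"name": "PH"}]
grants = [
    {"gId": 301, "recipient": "HAL", "amount": 30},
    {"gId": 302, "recipient": "HAL", "amount": 40},
    {"gId": 303, "recipient": "PH", "amount": 30},
]
organizations = [
    {"code": "HAL", "fundings": [{"fId": 301, "finId": "k1"},
                                 {"fId": 302, "finId": "k2"}]},
    {"code": "SM", "fundings": []},
    {"code": "PH", "fundings": [{"fId": 303, "finId": "k3"}]},
]
finances = [
    {"finId": "k1", "budget": 30},
    {"finId": "k2", "budget": 40},
    {"finId": "k3", "budget": 30},
]


def q_source(companies, grants):
    """QS projected on the source side of the with-clause."""
    return {
        (c["name"], g["gId"], g["amount"])
        for c in companies for g in grants
        if c["name"] == g["recipient"]
    }


def q_target(organizations, finances):
    """QT projected on the target side of the with-clause."""
    return {
        (o["code"], f["fId"], i["budget"])
        for o in organizations for f in o["fundings"] for i in finances
        if f["finId"] == i["finId"]
    }


def mapping_holds(companies, grants, organizations, finances):
    """The constraint holds when QS is contained in QT."""
    return q_source(companies, grants) <= q_target(organizations, finances)


print(mapping_holds(companies, grants, organizations, finances))
```

Note that containment, not equality, is required: the target may hold additional tuples that do not come from this mapping.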
25
Syntax and restrictions
General form:
    foreach x1 in g1, ..., xn in gn
            where B1
    exists  y1 in g'1, ..., ym in g'm
            where B2
    with    e1 = e'1 and ... and ek = e'k
Example:
    foreach c in companies, g in grants
            where c.name = g.recipient
    exists  o in organizations, f in o.fundings, i in finances
            where f.finId = i.finId
    with    o.code = c.name and f.fId = g.gId and i.budget = g.amount
Where:
–xi in gi (generator): xi is a variable, gi a set (either the root or a set nested within it)
–B1: conjunction of equalities over the xi variables
–yi in g'i, B2: similar
–e1 = e'1 …: equalities between a source expression and a target expression
Restrictions: see paper, page 210, lines 5+: "The mapping is well formed …"
26
Schema constraints
Referential integrity is essential in this approach, as the basis for the discovery of "associations".
Given the nested model, they need a rather complex definition.
So, two steps
–Paths (primary paths and relative paths)
–Nested referential integrity (NRI) constraints
27
Primary paths
Primary path (given a schema root R, that is, a first-level element in the schema):
–x1 in g1, x2 in g2, …, xn in gn, where g1 is an expression on R (just R?) and gi (for i ≥ 2) is an expression on x(i-1)
Examples
–c in companies
–o in organizations
–o in organizations, f in o.fundings
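The primary paths of a nested schema can be enumerated by walking from each root down through its nested sets. The sketch below assumes a hypothetical encoding in which a set type is a dictionary of fields, an atomic field maps to None and a nested set maps to another dictionary; variables are named x1, x2, … rather than c, o, f.

```python
# Target schema from the slides, in the hypothetical encoding described above.
TARGET = {
    "organizations": {
        "code": None,
        "year": None,
        "fundings": {"fId": None, "finId": None},
    },
    "finances": {"finId": None, "budget": None, "phone": None},
}


def primary_paths(schema):
    """List every primary path as a string of generators."""
    paths = []

    def walk(generators, depth, expression, record):
        var = f"x{depth}"
        path = generators + [f"{var} in {expression}"]
        paths.append(", ".join(path))
        for field, nested in record.items():
            if isinstance(nested, dict):
                # The next generator is an expression on the previous variable.
                walk(path, depth + 1, f"{var}.{field}", nested)

    for root, record in schema.items():
        walk([], 1, root, record)
    return paths


for p in primary_paths(TARGET):
    print(p)
```

There is one primary path per set in the schema: two for organizations (the root and its nested fundings) and one for finances.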
28
Relative paths
Primary path (given a schema root R, that is, a first-level element in the schema):
–x1 in g1, x2 in g2, …, xn in gn, where g1 is an expression on R (just R?) and gi (for i ≥ 2) is an expression on x(i-1)
Relative path with respect to a variable x:
–x1 in g1, x2 in g2, …, xn in gn, where g1 is an expression on x (just x?) and gi (for i ≥ 2) is an expression on x(i-1)
Example
–f in o.fundings
29
Nested referential integrity (NRI) constraints
    foreach P1 exists P2 where B
–P1 is a primary path
–P2 is either a primary path or a relative path with respect to a variable in P1
–B is a conjunction of equalities between an expression on a variable of P1 and an expression on a variable of P2
Example (f4, between Fundings.FinId in Organizations and Finances.FinId)
    foreach o in organizations, f in o.fundings
    exists  i in finances
    where   f.finId = i.finId
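An NRI constraint can be checked on a nested instance by iterating over the primary path and looking for a matching tuple. The following sketch checks only the example constraint above, under an assumed dictionary encoding of the instance; the function name is hypothetical.

```python
def satisfies_f4(organizations, finances):
    """foreach o in organizations, f in o.fundings
    exists i in finances where f.finId = i.finId"""
    fin_ids = {i["finId"] for i in finances}
    return all(
        f["finId"] in fin_ids
        for o in organizations
        for f in o["fundings"]
    )


organizations = [
    {"code": "HAL", "fundings": [{"fId": 301, "finId": "k1"},
                                 {"fId": 302, "finId": "k2"}]},
    {"code": "SM", "fundings": []},
]
finances = [{"finId": "k1", "budget": 30}, {"finId": "k2", "budget": 40}]

print(satisfies_f4(organizations, finances))      # both fundings have a match
print(satisfies_f4(organizations, finances[:1]))  # funding 302 is dangling
```

An organization with no fundings, such as SM here, satisfies the constraint trivially, because the foreach part ranges over fundings.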