Presentation is loading. Please wait.

Presentation is loading. Please wait.

Foundations and Applications of Schema Mappings Phokion G. Kolaitis University of California Santa Cruz & IBM Almaden Research Center.

Similar presentations


Presentation on theme: "Foundations and Applications of Schema Mappings Phokion G. Kolaitis University of California Santa Cruz & IBM Almaden Research Center."— Presentation transcript:

1 Foundations and Applications of Schema Mappings Phokion G. Kolaitis University of California Santa Cruz & IBM Almaden Research Center

2 2 Logic and Databases Extensive interaction between logic and databases during the past 40 years. Logic provides both a unifying framework and a set of tools for formalizing and studying data management tasks. The interaction between logic and databases is a prime example of  Logic in Computer Science but also  Logic from Computer Science

3 3 The Relational Data Model E.F. Codd, 1969-1971 Relational Database: Collection (R 1, …, R m ) of finite relations Relational database » Finite structure A = (A, R 1, …, R m ) Relational Query Languages:  Relational Algebra: operations , , £, [, n  Relational Calculus: (safe) first-order logic SQL: The standard commercial database query language based on relational algebra and relational calculus. E.F. Codd

4 4 Logic and Databases The interaction between logic and databases started with the introduction of the relational data model by E.F. Codd in 1969. It continues today across a wide spectrum of topics in database management. This talk is about the role of logic in data interoperability.

5 5 The Data Interoperability Challenge Data may reside  at several different sites  in several different formats (relational, XML, …). Applications need to access and process all these data. Growing market of enterprise data interoperability tools:  $1.44B in 2007; 17% annual rate of growth  15 major vendors in Gartner’s Magic Quadrant Report (source: Gartner, Inc., September 2008)

6 6 Gartner’s Magic Quadrant Report on Data Interoperability Products Microsoft Oracle Informatica IBM (Cognos) SAP – Business Objects Sybase Syncort ETI Pitney Boss Software Open Text SAS Pervasive Software iWay Software Sun Microsystems Tibco Software Challengers Leaders Niche Players Visionaries Ability to execute Completeness of vision

7 7 Theoretical Aspects of Data Interoperability The research community has studied two different, but closely related, facets of data interoperability: Data Integration (aka Data Federation)  Formalized and studied for the past 10-15 years Data Exchange (aka Data Translation)  Formalized and studied for the past 6-7 years “Data exchange is the oldest database problem” Phil Bernstein - 2003

8 8 Data Integration Query heterogeneous data in different sources via a virtual global schema I1I1 Global Schema I2I2 I3I3 Sources query S 1 S2S2 S 3 T Q

9 9 Data Exchange Transform data structured under a source schema into data structured under a different target schema. S T Σ I J Source Schema Target Schema

10 10 Schema Mappings Schema mappings: High-level, declarative assertions that specify the relationship between two database schemas. Schema mappings constitute the essential building blocks in formalizing and studying data interoperability tasks, including data integration and data exchange. Schema mappings make it possible to separate the design of the relationship between schemas from its implementation.  Are easier to generate and manage (semi)-automatically;  Can be compiled into SQL/XSLT scripts automatically.

11 11 Outline Schema Mappings and Schema-Mapping Specification Languages. Schema Mappings and Data Exchange  Algorithmic problems in data exchange.  Solutions and universal solutions. Composing Schema Mappings.

12 12 Acknowledgments Much of the work presented has been carried out in collaboration with  Ron Fagin, IBM Almaden  Renee J. Miller, U. of Toronto  Lucian Popa, IBM Almaden  Wang-Chiew Tan, UC Santa Cruz. Papers in ICDT 2003, PODS 2003-2009, TCS, ACM TODS. The work has been motivated from the Clio Project at IBM Almaden aiming to develop a working system for schema mapping generation and data exchange.

13 13 Schema Mappings Source S Target T Schema Mapping M = (S, T, Σ)  Source schema S, Target schema T  High-level, declarative assertions Σ that specify the relationship between S and T. Question: What is a “good” schema-mapping specification language? Σ

14 14 Schema-Mapping Specification Languages Obvious Idea: Use a logic-based language to specify schema mappings. In particular, use first-order logic. Warning: Unrestricted use of first-order logic as a schema-mapping specification language gives rise to undecidability of basic algorithmic problems about schema mappings.

15 15 Schema-Mapping Specification Languages Let us consider some simple tasks that every schema-mapping specification language should support:  Copy (Nicknaming): Copy each source table to a target table and rename it.  Projection: Form a target table by projecting on one or more columns of a source table.  Column Augmentation: Form a target table by adding one or more columns to a source table.  Decomposition: Decompose a source table into two or more target tables.  Join: Form a target table by joining two or more source tables.  Combinations of the above (e.g., “join + column augmentation + …”)

16 16 Schema-Mapping Specification Languages  Copy (Nicknaming): 8 x 1, …,x n (P(x 1,…,x n ) ! R(x 1,…,x n ))  Projection: 8 x,y,z(P(x,y,z) ! R(x,y))  Column Augmentation: 8 x,y (P(x,y) ! 9 z R(x,y,z))  Decomposition: 8 x,y,z (P(x,y,z) ! R(x,y) Æ T(y,z))  Join: 8 x,y,z(E(x,z) Æ F(z,y) ! R(x,y,z))  Combinations of the above (e.g., “join + column augmentation + …”) 8 x,y,z(E(x,z) Æ F(z,y) ! 9 w (R(x,y) Æ T(x,y,z,w)))

17 17 Schema-Mapping Specification Languages Question: What do all these tasks (copy, projection, column augmentation, decomposition, join) have in common? Answer:  They can be specified using tuple-generating dependencies (tgds).  In fact, they can be specified using a special class of tuple-generating dependencies known as source-to-target tuple generating dependencies (s-t tgds).

18 18 Database Integrity Constraints Dependency Theory: extensive study of integrity constraints in relational databases in the 1970s and 1980s (Codd, Fagin, Beeri, Vardi …) Tuple-generating dependencies (tgds) emerged as an important class of constraints with a balance between high expressive power and good algorithmic properties. Tgds are expressions of the form 8 x (  (x) ! 9 y à (x, y )), where  (x), à (x, y ) are conjunctions of atomic formulas. Special Cases: Inclusion Dependencies Multivalued Dependencies

19 19 Schema-Mapping Specification Language The relationship between source and target is given by source-to-target tuple generating dependencies (s-t tgds) 8 x (  (x)   y  (x, y)), where   (x) is a conjunction of atoms over the source;   (x, y) is a conjunction of atoms over the target.  s-t tgds assert that: some conjunctive query over the source is contained in some other conjunctive query over the target. Example: (dropping the universal quantifiers in the front) (Student (s)  Enrolls(s,c))   t  g (Teaches(t,c)  Grade(s,c,g))

20 20 Schema-Mapping Specification Language Fact: s-t tgds generalize the main specifications used in data integration:  They generalize LAV (local-as-view) specifications: P(x)   y  (x, y), where P is a source schema.  E(x,y) ! 9 z (H(x,z) Æ H(z,y)) (LAV constraint) Note: Copy, projection, and decomposition are LAV s-t tgds.  They generalize GAV (global-as-view) specifications:  (x)  R(x), where R is a target relation (they are equivalent to full tgds:  (x)   (x), where  (x) and  (x) are conjunctions of atoms).  E(x,y) Æ E(y,z) ! F(x,z) (GAV (full) constraint) Note: Copy, projection, and join are GAV s-t tgds.

21 21 Target Dependencies In addition to source-to-target dependencies, we also consider target dependencies:  Target Tgds :  T (x)   y  T (x, y) Dpt (e,d) ! 9 p Proj(e,p) (a target inclusion dependency constraint)  Target Equality Generating Dependencies (egds):  T (x)  (x 1 =x 2 ) Dpt (e, d 1 )  Dpt (e, d 2 )  (d 1 = d 2 ) (a target key constraint)

22 22 Schema Mappings & Data Exchange Source S Target T Data Exchange via the schema mapping M = (S, T, Σ) Given a source instance I, construct a target instance J, so that (I, J) satisfy the specifications Σ of M. Such a J is called a solution for I.  Difficulty:  Usually, there are multiple solutions  Which one is the “best” to materialize? I J Σ

23 23 Over/Underspecification in Data Exchange Fact: A given source instance may have no solutions (overspecification) Fact: A given source instance may have multiple solutions (underspecification) Example: Source relation E(A,B), target relation H(A,B) Σ: E(x,y)   z (H(x,z)  H(z,y)) Source instance I = {E(a,b)} Solutions: Infinitely many solutions exist  J 1 = {H(a,b), H(b,b)} constants:  J 2 = {H(a,a), H(a,b)} a, b, …  J 3 = {H(a,X), H(X,b)} variables (labelled nulls):  J 4 = {H(a,X), H(X,b), H(a,Y), H(Y,b)} X, Y, …  J 5 = {H(a,X), H(X,b), H(Y,Y)}

24 24 Data Exchange & Universal solutions Fagin, K …, Miller, Popa: Identified and studied the concept of a universal solution for schema mappings specified by s-t tgds and “well behaved” target tgds & egds.  A universal solutions is a most general solution.  A universal solution “represents” the entire space of solutions.  A “canonical” universal solution can be generated efficiently using the chase procedure.  Universal solutions can be used to efficiently compute the certain answers of queries posed over the target schema.

25 25 Universal Solutions in Data Exchange Definition (FKMP): A solution is universal if it has homomorphisms to all other solutions (thus, it is a “most general” solution).  Constants: entries in source instances  Variables (labeled nulls): other entries in target instances  Homomorphism h: J 1 → J 2 between target instances: h(c) = c, for constant c If P(a 1,…,a m ) is in J 1,, then P(h(a 1 ),…,h(a m )) is in J 2. Claim: Universal solutions are the preferred solutions in data exchange.

26 26 Universal Solutions in Data Exchange Schema SSchema T I J Σ J1J1 J2J2 J3J3 Universal Solution Solutions h1h1 h2h2 h3h3 Homomorphisms

27 27 Example - continued Source relation S(A,B), target relation T(A,B) Σ : E(x,y)   z (H(x,z)  H(z,y)) Source instance I = {H(a,b)} Solutions: Infinitely many solutions exist  J 1 = {H(a,b), H(b,b)} is not universal  J 2 = {H(a,a), H(a,b)} is not universal  J 3 = {H(a,X), H(X,b)} is universal  J 4 = {H(a,X), H(X,b), H(a,Y), H(Y,b)} is universal  J 5 = {H(a,X), H(X,b), H(Y,Y)} is not universal

28 28 Structural Properties of Universal Solutions Universal solutions are akin to:  most general unifiers in logic programming;  initial models. Uniqueness up to homomorphic equivalence: If J and J’ are universal for I, then they are homomorphically equivalent. Representation of the entire space of solutions: Assume that J is universal for I, and J’ is universal for I’. Then the following are equivalent: 1.I and I’ have the same space of solutions. 2.J and J’ are homomorphically equivalent.

29 29 Algorithmic Problems in Data Exchange Question: What can we say about the complexity of  The existence-of-solutions problem Sol(M) and  The data exchange problem (construct a universal solution) for a fixed schema mapping M = (S, T,  st,  t ) specified by s-t tgds and target tgds and egds? Answer: Depending on the target constraints in  t :  Sol(M) is trivial (solutions always exist) / Universal solutions can be constructed in PTIME (in fact, in LOGSPACE). …  Sol(M) can be in PTIME (in fact, it can be PTIME-complete) / Universal solutions can be constructed in PTIME (if solutions exist) …  Sol(M) can be undecidable / Universal solutions may not exist (even if solutions exist)

30 30 Undecidability in Data Exchange Theorem (K …, Panttaja, Tan): There is a schema mapping M= (S, T,  * st,  * t ) such that:   * st consists of a single s-t tgd;   * t consists of one target egd and two target tgds.  The existence-of-solutions problem Sol(M) is undecidable. Hint of Proof: Reduction from the Embedding Problem for Finite Semigroups Given a finite partial semigroup, can it be embedded to a finite semigroup? (undecidability implied by results of Evans and Gurevich).

31 31 The Embedding Problem & Data Exchange Reducing the Embedding Problem for Semigroups to Sol(M)   st : R(x,y,z) ! R’(x,y,z)   t : R’ is a partial function: R’(x,y,z) Æ R’(x,y,w) ! z = w R’ is associative R’(x,y,u) Æ R’(y,z,v) Æ R’(u,z,w) ! R’(x,u,w) R’ is a total function R’(x,y,z) Æ R’(x’,y’,z’) ! 9 w 1 … 9 w 9 (R’(x,x’,w 1 ) Æ R’(x,y’,w 2 ) Æ R’(x,z’,w 3 ) R’(y,x’,w 4 ) Æ R’(y,y’,w 5 ) Æ R’(x,z’,w 6 ) R’(z,x’,w 7 ) Æ R’(z,y’,w 8 ) Æ R’(z,z’,w 9 ))

32 32 Tractability in Data Exchange Question: Are there broad structural conditions on the target constraints that guarantee tractability? (that is,  The existence of solutions problem is in PTIME and  A universal solution can be constructed in PTIME, if a solution exists.)

33 33 Algorithmic Properties of Universal Solutions Theorem (FKMP): Schema mapping M= (S, T,  st,  t ) such that:   st is a set of source-to-target tgds;   t is the union of a weakly acyclic set of target tgds with a set of target egds. Then: Universal solutions exist if and only if solutions exist. Sol(M) is in PTIME. A canonical universal solution (if a solution exists) can be produced in PTIME using the chase procedure.

34 34 Chase Procedure for Tgds and Egds Given a source instance I, 1. Use the naïve chase to chase I with  st and obtain a target instance J*. 2. Chase J * with the target tgds and the target egds in  t to obtain a target instance J as follows: 2.1. For target tgds introduce new facts in J as dictated by the RHS of the s-t tgd and introduce new values (variables) in J each time existential quantifiers need witnesses. 2.2. For target egds  (x) ! x 1 = x 2 2.2.1. If a variable is equated to a constant, replace the variable by that constant; 2.2.2. If one variable is equated to another variable, replace one variable by the other variable. 2.2.3 If one constant is equated to a different constant, stop and report “failure”.

35 35 Weakly Acyclic Sets of Tgds: Definition Position graph of a set  of tgds:  Nodes: R.A, with R relation symbol, A attribute of R  Edges: for every  (x)   y  (x, y) in  for every x in x occurring in , for every occurrence of x in  in R.A: For every occurrence of x in  in S.B, add an edge R.A S.B In addition, for every existentially quantified y that occurs in  in T.C, add a special edge R.A T.C   is weakly acyclic if the position graph has no cycle containing a special edge. A tgd  is weakly acyclic if so is the singleton set {  }.

36 36 Weakly Acyclic Sets of Tgds: Examples Example 1: { D(e,m) ! M(m), M(m) ! 9 e D(e,m) } is weakly acyclic, but cyclic. D.1 M.1 D.2 Example 2: { E(x,y) ! 9 z E(y,z) } is not weakly acyclic. E.1 E.2

37 37 Complexity of Data Exchange M = (S, T, Σ st, Σ t ) Σ st a set of s-t tgds Existence-of- Solutions Problem Existence-of- Universal Solutions Problem Computing a Universal Solution Σ t = ; No target constraints Trivial PTIME § t : Weakly acyclic set of target tgds + egds PTIME It can be PTIME- complete PTIME Univ. solutions exist if and only if solutions exist PTIME § t : target tgds + egds Undecidable, in general No algorithm exists, in general

38 38 From Theory to Practice Clio Project at IBM Almaden managed by Howard Ho.  Semi-automatic schema-mapping generation tool;  Data exchange system based on schema mappings. Universal solutions used as the semantics of data exchange. Universal solutions are generated via SQL queries extended with Skolem functions (implementation of chase procedure), provided there are no target constraints. Clio technology is now part of IBM Rational® Data Architect.

39 39 Supports nested structures  Nested Relational Model  Nested Constraints Automatic & semi- automatic discovery of attribute correspondence. Interactive derivation of schema mappings. Performs data exchange Some Features of Clio

40 40 Source Schema S “ conforms to” data Data exchange process (or SQL/XQuery/XSLT ) “ conforms to” Schema Mappings in Clio Mapping Generation Schema Mapping Target Schema T

41 41 Managing Schema Mappings Schema mappings can be quite complex. Methods and tools are needed to automate or semi-automate schema-mapping management. Metadata Management Framework – Bernstein 2003 based on generic schema-mapping operators:  Match operator  Merge operator  …  Composition operator  Inverse operator

42 42 Composing Schema Mappings Given M 12 = (S 1, S 2, Σ 12 ) and M 23 = (S 2, S 3, Σ 23 ), derive a schema mapping M 13 = (S 1, S 3, Σ 13 ) that is “equivalent” to the sequential application of M 12 and M 23. M 13 is a composition of M 12 and M 23 M 13 = M 12 ± M 23 Schema S 1 Schema S 2 Schema S 3 M 12 M 23 M 13

43 43 Composing Schema Mappings Given  12 = (S 1, S 2,  12 ) and  23 = (S 2, S 3,  23 ), derive a schema mapping  13 = (S 1, S 3,  13 ) that is “equivalent” to the sequence  12 and  23. Schema S 1 Schema S 2 Schema S 3  12  23  13 What does it mean for  13 to be “equivalent” to the composition of  12 and  23 ?

44 44 Earlier Work Metadata Model Management (Bernstein in CIDR 2003)  Composition is one of the fundamental operators  However, no precise semantics is given Composing Mappings among Data Sources (Madhavan & Halevy in VLDB 2003)  First to propose a semantics for composition  However, their definition is in terms of maintaining the same certain answers relative to a class of queries.  Their notion of composition depends on the class of queries; it may not be unique up to logical equivalence.

45 45 Semantics of Composition Every schema mapping M = (S, T,  ) defines a binary relationship Inst(M) between instances: Inst(M) = { (I,J) | (I,J)   }. In fact, from a semantic point of view, a schema mapping M can be identified with the set Inst(M). Definition: (FKPT) A schema mapping M 13 is a composition of M 12 and M 23 if Inst(M 13 ) = Inst(M 12 )  Inst(M 23 ), that is, (I 1,I 3 )   13 if and only if there exists I 2 such that (I 1,I 2 )   12 and (I 2,I 3 )   23.

46 46 The Composition of Schema Mappings Fact: If both  = (S 1, S 3,  ) and  ’ = (S 1, S 3,  ’) are compositions of  12 and  23, then  are  ’ are logically equivalent. For this reason:  We say that  (or  ’) is the composition of  12 and  23.  We write  12   23 to denote it

47 47 Issues in Composition of Schema Mappings The semantics of composition was the first main issue. The second main issue is the language of the composition.  Is the language of s-t tgds closed under composition? If  12 and  23 are specified by finite sets of s-t tgds, is  12   23 also specified by a finite set of s-t tgds?  If not, what is the “right” language for composing schema mappings?

48 48 Inexpressibility of Composition Theorem:  The language of s-t tgds is not closed under composition.  In fact, there are schema mappings  12 and  23 specified by s-t tgds such that their composition  12 ±  23 is not expressible in least fixed-point logic LFP; hence, it is expressible neither in first-order logic nor in Datalog.

49 49 Lower Bounds for Composition  12 :  x  y (E(x,y)   u  v (C(x,u)  C(y,v)))  x  y (E(x,y)  F(x,y))  23 :  x  y  u  v (C(x,u)  C(y,v)  F(x,y)  D(u,v)) Given graph G=(V, E):  Let I 1 = E  Let I 3 = { (r,g), (g,r), (b,r), (r,b), (g,b), (b,g) } Fact: G is 3-colorable iff  Inst(  12 )  Inst(  23 ) Theorem (Dawar – 1998): 3-Colorability is not expressible in LFP

50 50 Complexity of Composition Definition: The model checking problem for a schema mapping M = (S, T,  ) asks: given a source instanced I and a target instance J, does ²  ? Fact: If a schema mapping M is specified by s-t tgds, then the model checking problem for M is in LOGSPACE. Fact: There are schema mappings  12 and  23 specified by s-t tgds such that the model checking problem for their composition  12 ±  23 is NP-complete.

51 51 Employee Example  12 :  Emp(e)   m Rep(e,m)  23 :  Rep(e,m)  Mgr(e,m)  Rep(e,e)  SelfMgr(e) Theorem: This composition is not definable by any set (finite or infinite) of s-t tgds. Fact: This composition is definable in a well-behaved fragment of second-order logic, called SO tgds, that extends s-t tgds with Skolem functions. Emp e Rep e m Mgr e m SelfMgr e

52 52 Employee Example - revisited  12 :   e ( Emp(e)   m Rep(e,m) )  23 :   e  m( Rep(e,m)  Mgr(e,m) )   e ( Rep(e,e)  SelfMgr(e) ) Fact: The composition is definable by the SO-tgd  13 :   f (  e( Emp(e)  Mgr(e,f(e) )   e( Emp(e)  (e=f(e))  SelfMgr(e) ) )

53 53 Second-Order Tgds Definition: Let S be a source schema and T a target schema. A second-order tuple-generating dependency (SO tgd) is a formula of the form:  f 1 …  f m ( (  x 1 (  1   1 ))  …  (  x n (  n   n )) ), where  Each f i is a function symbol.  Each  i is a conjunction of atoms from S and equalities of terms.  Each  i is a conjunction of atoms from T. Example:  f (  e( Emp(e)  Mgr(e,f(e) )   e( Emp(e)  (e=f(e))  SelfMgr(e) ) )

54 54 Composing SO-Tgds and Data Exchange Theorem (FKPT):  The composition of two SO-tgds is definable by a SO-tgd.  There is an algorithm for composing SO-tgds.  The chase procedure can be extended to SO-tgds; it produces universal solutions in polynomial time.  Every SO tgd is the composition of finitely many finite sets of s-t tgds. Hence, SO tgds are the “right” language for the composition of s-t tgds

55 55 When is the composition FO-definable? Fact:  It is an undecidable problem to tell whether the composition of two schema mappings specified by s-t tgds is first-order definable.  However, there are certain sufficient conditions that guarantee that the composition is first-order definable.  If  12 is specified by GAV (full) s-t tgds and  23 is specified by s-t tgds, then their composition is definable by s-t tgds.  Arocena, Fuxman, Miller: If both  12 and  23 are specified by LAV s-t tgds with distinct variables, then their composition is specified by LAV s-t tgds.

56 56 Synopsis of Schema Mapping Composition s-t tgds are not closed under composition. SO-tgds form a well-behaved fragment of second-order logic.  SO-tgds are closed under composition; they are the “right” language for composing s-t tgds.  SO-tgds are “chasable”: Polynomial-time data exchange with universal solutions. SO-tgds and the composition algorithm have been incorporated in Clio’s Mapping Specification Language (MSL).

57 57 Related Work and Open Problems Related Work: Composing richer schema mappings Nash, Bernstein, Melnik – 2007  Composing Schema Mappings in Open & Closed Worlds Libkin and Sirangelo – 2008  XML Schema Mappings Amano, Libkin, Murlak – 2009 Open Problems:  Composition of schema mappings specified by s-t tgds and target constraints (target tgds and target egds).  Composition of schema mappings specified by richer source-to- target dependencies.

58 58 Inverting Schema Mapping Given M 12, derive M 21 that “undoes” M 12 What is the “right” semantics of the inverse operator?  Several different approaches to inverting schema mappings have been proposed and investigated: “inverse”, “quasi-inverse”, “maximum recovery”, … Schema S 1 Schema S 2 M 12 M 21

59 59 Some Directions of Research  Inverting schema mappings requires further study.  Study of the information loss of a schema mapping.  Detailed study of other schema mapping operators (Diff, Merge, …) remains to be carried out.  Schema-mapping optimization.  Applications of schema-mapping operators to:  Study of schema evolution;  Modeling and analysis of ETL via schema mappings.

60 60 Data Interoperability: The Elephant and the Six Blind Men Data interoperability remains a major challenge: “Information integration is a beast.” (L. Haas – 2007) Schema mappings specified by tgds offer a formalism that covers only some aspects of data interoperability. However, theory and practice can inform each other.


Download ppt "Foundations and Applications of Schema Mappings Phokion G. Kolaitis University of California Santa Cruz & IBM Almaden Research Center."

Similar presentations


Ads by Google