Data Exchange: Semantics and Query Answering

Ronald Fagin, IBM Almaden Research Center
Phokion G. Kolaitis, UC Santa Cruz
Renee J. Miller, University of Toronto
Lucian Popa, IBM Almaden Research Center
IBM Almaden - November 12, 2002 (To appear in ICDT 2003) 2/23/2019
In this talk, I will describe the problem of data exchange and address foundational and algorithmic issues related to the semantics of data exchange and to the query answering problem in the context of data exchange. This is joint work with Ron, Phokion Kolaitis from UC Santa Cruz, and Renee Miller from U of Toronto. The first half of the talk will formally address the precise semantics of data exchange. Before going into that, however, I will first explain, informally, what we mean by data exchange and where this problem appears.

Motivation and Overview
Data exchange problem: how to restructure data from a source schema to a target schema, according to a given specification.
Main motivation for this work: understanding the fundamental issues that underlie data exchange systems such as EXPRESS and Clio.
Main challenge: inherent under-specification.
  The specification (as we shall see) must be simple and intuitive, but there are many ways in which the restructuring can be performed!
  Question: did we make the right choice in the design of Clio?
Our approach (only the relational case, so far):
  Define and study universal solutions.
  Show that this is the “best” way of performing data exchange.
  Study computational aspects.
  Study what happens after data exchange: query answering.
Data exchange is the problem of restructuring and translating data from a source schema into an instance of a target schema, according to a given specification. This is an old but recurrent problem. It appears in many tasks that require data to be transferred between independent applications that do not necessarily agree on the same data format (or schema). An early “data exchange” system was EXPRESS, built more than 25 years ago, whose main goal was the conversion of hierarchical data to relational data. A more recent example of a data exchange system, built at Almaden, is Clio, whose main task is mapping between XML and/or relational schemas. The main motivation for this work is to understand the theoretical issues that underlie such systems. Subsequent query answering is directly related to the way we perform data exchange: depending on which solution we materialize, we may get different answers to a target query.

The Data Exchange Problem
[Figure: source schema S with instance I, related by Σ_st to target schema T with instance J, which must satisfy Σ_t]
Assume a data exchange setting:
  a source schema S and a target schema T with a set Σ_t of target dependencies (see next);
  a set Σ_st of source-to-target dependencies (see next).
The data exchange problem is the following:
  Input: a source instance I.
  Output: a target instance J such that <I, J> ⊨ Σ_st and J ⊨ Σ_t (call such a J a solution for I).
Let’s start with the formal definition of what we mean by the data exchange problem.

Source-to-target Dependencies
For most practical purposes, Σ_st contains source-to-target tuple-generating dependencies (tgds):
  φ_S(x) → ∃y ψ_T(x, y)
  e.g. DeptEmp(did, mgr_name, eid) → ∃M. Dept(did, M, mgr_name) ∧ Emp(eid, did)
  (Move data from the source table DeptEmp into two target tables, Dept and Emp. The existential variable M is an “unspecified” manager id.)
Source table: DeptEmp(did, mgr_name, eid)
Target tables: Dept(did, mgr_id, mgr_name), Emp(eid, did)

Target Dependencies
The second, equally important, part of the specification is the set of target dependencies Σ_t:
  tgds: φ_T(x) → ∃y ψ_T(x, y)
    e.g. Dept(did, mgr_id, mgr_name) → ∃D. Emp(mgr_id, D) (a foreign key constraint in the target)
  equality-generating dependencies (egds): φ_T(x) → (x1 = x2)
    e.g. Emp(e, d1) ∧ Emp(e, d2) → (d1 = d2) (a target key constraint)
Target tables: Dept(did, mgr_id, mgr_name), Emp(eid, did)
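As a minimal sketch of what it means for a target instance to satisfy the two example dependencies above (the foreign key tgd and the key egd), the following check can be used. The function name `satisfies_target_dependencies` and the representation of relations as Python sets of tuples are assumptions made for illustration; nulls are simply treated as ordinary values here.

```python
def satisfies_target_dependencies(dept, emp):
    """Check the two example target dependencies on a target instance.

    dept: set of (did, mgr_id, mgr_name) tuples
    emp:  set of (eid, did) tuples
    """
    # tgd (foreign key): every mgr_id in Dept must occur as an eid in Emp.
    emp_ids = {eid for eid, _did in emp}
    foreign_key_ok = all(mgr_id in emp_ids for _did, mgr_id, _name in dept)

    # egd (key): an employee id may be associated with only one department.
    dept_of = {}
    key_ok = True
    for eid, did in emp:
        if dept_of.setdefault(eid, did) != did:
            key_ok = False

    return foreign_key_ok and key_ok
```

For instance, Dept = {<CS, M0, Mary>} with Emp = {<E003, CS>} violates the foreign key constraint, because M0 does not occur as an employee id in Emp.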

Questions (To be Answered Next)
When more than one solution exists, how do we choose a “best” one?
How do we compute a “best” solution?
Is there always a solution?
Is there always a “best” solution?
How does query answering on the chosen solution behave?

Universal Solutions = “Best” Solutions

Existence of Multiple Solutions
Source relations P, Q, R and target relation T all have attributes (A, B, C).
Source instance: P contains <a0, b’0, c’0>, Q contains <a’’0, b0, c’’0>, R contains <a’’’0, b’’’0, c0>.
Dependencies:
  P(a, b, c) → ∃Y ∃Z. T(a, Y, Z)
  Q(a, b, c) → ∃X ∃U. T(X, b, U)
  R(a, b, c) → ∃V ∃W. T(V, W, c)
Solutions (X0, Y0, Z0, … represent “unknown” values, or “nulls”):
  J: T contains <a0, Y0, Z0>, <X0, b0, U0>, <V0, W0, c0>
  J1: T contains <a0, b0, c0>
  J2: T contains <a0, b0, Z1>, <V1, W1, c0>
Homomorphisms: h1 = {Y0 → b0, Z0 → c0, … } from J to J1, and h2 from J to J2.
There may be many solutions (target instances J, J1, J2, etc.).
However, J seems to be more general:
  there exist homomorphisms h1: J → J1 and h2: J → J2 (see definition next),
  but none from J1 or J2 to J;
  intuitively, J1 and J2 have extra information.

Homomorphisms
As we have seen, the values of a target instance can be either:
  constants (i.e. values coming from the source instance), or
  nulls (unknown values).
Definition. Assume J1 and J2 are such target instances. A homomorphism h: J1 → J2 is a mapping from the values of J1 to the values of J2 such that:
  h(c) = c for every constant c (nulls of J1 can be mapped to any values of J2);
  for every tuple <a1, …, an> in relation T of instance J1, <h(a1), …, h(an)> is a tuple in relation T of instance J2.
Example: for J with T = {<a0, Y0, Z0>, <X0, b0, U0>, <V0, W0, c0>} and J1 with T = {<a0, b0, c0>}, h1 = {Y0 → b0, Z0 → c0, … } is a homomorphism from J to J1.
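The definition can be turned into a small backtracking search, sketched here for instances over a single relation T. This is only an illustration under stated assumptions: the `Null` class, the function name `find_homomorphism`, and the encoding of an instance as a set of tuples are not from the talk, and the search is exponential in the worst case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null (unknown value); every other value is a constant."""
    name: str


def find_homomorphism(j1, j2):
    """Return a dict mapping the nulls of j1 to values of j2, or None.

    j1, j2: sets of tuples over one relation T. Constants must map to
    themselves; each tuple of j1 must land on some tuple of j2.
    """
    tuples = list(j1)

    def extend(i, h):
        if i == len(tuples):
            return h
        for target in j2:
            if len(target) != len(tuples[i]):
                continue
            candidate = dict(h)
            consistent = True
            for a, b in zip(tuples[i], target):
                if isinstance(a, Null):
                    # A null may map anywhere, but always to the same value.
                    if candidate.setdefault(a, b) != b:
                        consistent = False
                        break
                elif a != b:
                    # A constant must be mapped to itself.
                    consistent = False
                    break
            if consistent:
                result = extend(i + 1, candidate)
                if result is not None:
                    return result
        return None

    return extend(0, {})
```

On the instances of the previous slide, the search finds a mapping from J to J1 and from J to J2, and returns None in the opposite directions.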

Universal Solution
Definition. Assume a data exchange setting (S, T, Σ_st, Σ_t). Given source instance I, a universal solution for I is a target instance J such that:
  (1) J is a solution for I;
  (2) for every solution J’ for I, there exists a homomorphism h: J → J’.
For the previous example, J (with T = {<a0, Y0, Z0>, <X0, b0, U0>, <V0, W0, c0>}) is a universal solution; J1 and J2 are not.
Among all solutions, universal solutions are special: they contain no more and no less than the amount of information given by the specification.

We adopt the universal solution as the notion of “best” solution.
Fact: uniqueness up to homomorphic equivalence. If J1 and J2 are universal for I, then there are homomorphisms between J1 and J2 in both directions.
Representation of the space of solutions: if J1 and J2 are universal solutions for I1 and I2 respectively, then Sol(I1) = Sol(I2) iff J1 and J2 are homomorphically equivalent.
We adopt the universal solution as the notion of “best” solution.
Later we will see another justification for universal solutions in terms of query answering.

When do universal solutions exist?
How do we compute a universal solution?

Computing Universal Solutions

Chase
We canonically generate a universal solution by using the chase:
  Given source instance I, start with an empty target instance J.
  Generate tuples in J by applying the dependencies in Σ_st and Σ_t.
Example:
  Σ_st: DeptEmp(did, mgr_name, eid) → ∃M. Dept(did, M, mgr_name) ∧ Emp(eid, did)
  Σ_t: Dept(did, mgr_id, mgr_name) → ∃D. Emp(mgr_id, D)
  I: DeptEmp contains <CS, Mary, E003>
  J: Dept receives <CS, M0, Mary> and Emp receives <E003, CS>, added in a first chase step (M0 is a null);
     Emp receives <M0, D0>, added in a second chase step (D0 is a null).
We next observe that there is a canonical way of computing universal solutions, based on a classical relational procedure called the chase. Here is how the chase works (in the context of data exchange)…

This process is repeatedly applied:
for all the source tuples and for the generated tuples, as long as there are dependencies that are not yet satisfied.
The chase may be infinite (cyclic Σ_t) …
… or it may fail (e.g. target key constraints that cannot be satisfied for the given source data) (details in the paper).
However, if the chase terminates successfully, the resulting target instance is a solution.
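The repeated application of chase steps can be sketched for the DeptEmp example as follows. This is an illustrative sketch, not the general procedure from the paper: the two tgds of the example are hard-coded, egds (and therefore chase failure) are not handled, and the names `Null`, `chase_dept_emp`, and `max_rounds` are assumptions introduced here.

```python
from itertools import count


class Null:
    """A labeled null, i.e. an unknown value invented by a chase step."""

    def __init__(self, label):
        self.label = label

    def __repr__(self):
        return self.label


def chase_dept_emp(dept_emp, max_rounds=100):
    """Chase a DeptEmp source instance with the two example tgds.

    dept_emp: set of (did, mgr_name, eid) tuples.
    Returns the target relations (dept, emp) once no tgd is violated.
    """
    dept, emp = set(), set()
    counters = {"M": count(), "D": count()}

    def fresh(prefix):
        return Null(f"{prefix}{next(counters[prefix])}")

    for _round in range(max_rounds):
        changed = False
        # Source-to-target tgd:
        # DeptEmp(did, mgr_name, eid) -> exists M. Dept(did, M, mgr_name) and Emp(eid, did)
        for did, mgr_name, eid in sorted(dept_emp):
            satisfied = any(
                d == did and n == mgr_name and (eid, did) in emp
                for d, _m, n in dept
            )
            if not satisfied:
                dept.add((did, fresh("M"), mgr_name))
                emp.add((eid, did))
                changed = True
        # Target tgd: Dept(did, mgr_id, mgr_name) -> exists D. Emp(mgr_id, D)
        for _did, mgr_id, _name in list(dept):
            if not any(e is mgr_id for e, _d in emp):
                emp.add((mgr_id, fresh("D")))
                changed = True
        if not changed:
            return dept, emp
    # Guard for the hypothetical case of a non-terminating chase.
    raise RuntimeError("chase did not reach a fixpoint within max_rounds")
```

Running it on the single source tuple <CS, Mary, E003> reproduces the two chase steps of the example: first <CS, M0, Mary> and <E003, CS> are added, then <M0, D0>.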

Canonical Generation of Universal Solutions
Theorem. Assume a data exchange setting (S, T, Σ_st, Σ_t). Given source instance I:
  If the chase is finite and successful, then its result is a universal solution.
  If the chase fails, then there is no solution.
Thus, the chase is a procedure for computing universal solutions, provided that:
  solutions exist, and
  the chase is finite.
We call universal solutions computed by the chase canonical universal solutions.
When can we guarantee that the chase is finite?

Weakly Acyclic Sets of Dependencies
Some cyclic sets of dependencies may cause infinite chases.
  In such cases a universal solution may not exist, and the semantics of data exchange is undefined.
Still, there are cyclic sets of dependencies that behave well and are quite useful.
Weakly acyclic sets of dependencies (defined in the paper):
  cover many practical cases of target constraints;
  allow for restricted cyclicity;
  guarantee that the chase is finite.

Polynomial-Time Chase
Theorem. Let Σ be a weakly acyclic set of dependencies. For every instance K, the chase of K with Σ can be computed in polynomial time.
Corollary. Assume a data exchange setting (S, T, Σ_st, Σ_t) such that Σ_t is a weakly acyclic set of dependencies.
  For every source instance I, the existence of a solution can be checked in polynomial time.
  For every source instance I, if a solution exists, then a universal solution can be produced in polynomial time.

Next: what happens after data exchange?
In particular, how is subsequent query answering affected by our choice of a solution (universal solution)?

Query Answering

Query Evaluation on a Solution
[Figure: source schema S with instance I, related by Σ_st to target schema T with instance J, which must satisfy Σ_t]
Assume a fixed data exchange setting with a source instance I.
Suppose that System 1 chooses a solution J for data exchange. A query q can now be asked against the target. The evaluation of q in System 1 is q(J).
However, System 2, materializing a different solution J’, may give a different evaluation q(J’).
Different choices of J (for the same I) may yield different query evaluations.
Is there a notion of the “right” set of answers to q with respect to I?

Certain Answers
We will use a notion that has been around in the context of data integration and incomplete databases, where queries are asked against a set of possible databases.
Definition. Given I and q, a tuple t is a certain answer if t ∈ q(J) for every solution J.
Notation: certain(q, I) = the set of all certain answers.
Thus, t is certain if it is in the answer of q on every solution. The certain answers provide a well-defined semantics for query answering because they are independent of the choice of a solution.

Can we compute the certain answers based just on our chosen (universal) solution?

Positive Queries
Proposition. Assume a data exchange setting (S, T, Σ_st, Σ_t) and a source instance I.
  Let q be a positive query. If J is a universal solution, then certain(q, I) = q(J)↓.
  Let J be a solution such that for every positive query q we have certain(q, I) = q(J)↓. Then J is a universal solution.
Note: in the above,
  a positive query means a union of SPJ queries;
  q(J)↓ means: evaluate q on J and then throw away the tuples that contain nulls.
Thus, the certain answers of positive queries can be computed by evaluating them on any universal solution. Moreover, this property characterizes universal solutions.
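A minimal sketch of this recipe (evaluate the query on a universal solution, then discard the tuples that contain nulls) is shown below on the target instance obtained from the earlier DeptEmp chase example. The `Null` class, the helper `drop_nulls`, and the two sample conjunctive queries are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null (unknown value) in a target instance."""
    name: str


def drop_nulls(answers):
    """Keep only the answer tuples that consist entirely of constants."""
    return {t for t in answers if not any(isinstance(v, Null) for v in t)}


def employee_managers(dept, emp):
    """q1(eid, mgr_name) :- Emp(eid, did), Dept(did, mgr_id, mgr_name)."""
    return {
        (eid, mgr_name)
        for eid, did in emp
        for d, _mgr_id, mgr_name in dept
        if d == did
    }


def department_manager_ids(dept):
    """q2(did, mgr_id) :- Dept(did, mgr_id, mgr_name)."""
    return {(did, mgr_id) for did, mgr_id, _mgr_name in dept}


# Universal solution produced by the chase in the DeptEmp example.
M0, D0 = Null("M0"), Null("D0")
dept = {("CS", M0, "Mary")}
emp = {("E003", "CS"), (M0, D0)}

certain_q1 = drop_nulls(employee_managers(dept, emp))
certain_q2 = drop_nulls(department_manager_ids(dept))
```

Here `certain_q1` contains the single tuple (E003, Mary), while `certain_q2` is empty: the only answer to q2 mentions the null M0, so no manager id is certain.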

Conjunctive Queries with Inequalities
The situation changes when negation is involved (even in the very simple form of ≠).
Example:
  Source instance: R contains <a0, b0>, S contains <a1, a0>.
  d1: R(a, b) → ∃X. T(a, X)
  d2: S(a, b) → ∃Z. T(Z, b)
  q(u, v) :- ∃x ∃z. T(u, x) ∧ T(z, v) ∧ x ≠ z
  J (universal): T contains <a0, X0>, <Z0, a0>
  J2 (not universal): T contains <a0, a0>
It can be verified that <a0, a0> ∈ q(J), but <a0, a0> ∉ q(J2) (thus, it is not a certain answer).
Hence certain(q, I) ⊂ q(J): the universal solution gives extra answers.
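The verification mentioned in this example can be replayed with a few lines of code. This is a sketch under the assumption that distinct nulls are treated as distinct values during evaluation; the `Null` class and the function name `q` are illustrative choices, not part of the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null; two nulls are equal only if they have the same name."""
    name: str


def q(t_rel):
    """q(u, v) :- T(u, x), T(z, v), x != z, evaluated on a binary relation T."""
    return {(u, v) for u, x in t_rel for z, v in t_rel if x != z}


X0, Z0 = Null("X0"), Null("Z0")
J = {("a0", X0), (Z0, "a0")}   # universal solution
J2 = {("a0", "a0")}            # another solution, not universal

answers_on_J = q(J)
answers_on_J2 = q(J2)
```

Evaluating q on J yields <a0, a0> (using x = X0 and z = Z0), whereas on J2 the only choice is x = z = a0, so q(J2) is empty and <a0, a0> cannot be a certain answer.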

There are examples for which such query q* exists.
For conjunctive queries with inequalities, we have seen that simple query evaluation on a universal solution is not enough for computing the certain answers.
Question: can we find a different SQL query q* that, when evaluated on a universal solution, gives the set of certain answers of q?
There are examples for which such a query q* exists.
However, we show next that the answer is “no”, in general.

Complexity: Two or More Inequalities
Theorem. Computing the certain answers of unions of conjunctive queries with at most two inequalities per term is coNP-hard, even in a restricted data exchange setting (LAV).
[AD98] proved a similar result for the case of conjunctive queries with six or more inequalities.
The coNP-hardness implies that the certain answers cannot be computed by evaluating the query q (or any other SQL query q*) on a polynomial-time generated universal solution (unless P = NP).

Complexity: One Inequality
Theorem. Assume a data exchange setting (S, T, Σ_st, Σ_t) such that Σ_t is a weakly acyclic set of dependencies. Let q be a union of conjunctive queries with at most one inequality per term. Let I be a source instance and let J be an arbitrary universal solution for I. Then there exists a polynomial-time algorithm with input J that computes certain(q, I).
Thus, computing the certain answers for such queries is a tractable problem. Moreover, this computation can take place on any universal solution: the universal solution has all the information needed to compute the certain answers.
We show next that the problem of computing the certain answers, even for this tractable case, cannot be solved by means of SQL query evaluation.

First-Order Inexpressibility
Theorem. There exists a data exchange setting and a Boolean conjunctive query q with one inequality for which there is no first-order query q* over the canonical universal solution J such that certain(q, I) = q*(J).
This is a strong inexpressibility result: it shows that in data exchange we cannot use the notion of certain answers for answering queries with inequalities. (In practice, instead of going for certain answers, we should just use query evaluation on the universal solution.)
The proof uses an original combination of finite model theory techniques and the chase.

Conclusions
Universal solutions are a good candidate for use in data exchange.
  Clio produces such a universal solution (in the relational case).
All universal solutions are equally good for answering positive queries.
  Simple query evaluation has the same semantics as that of the certain answers.
For queries with inequalities, different universal solutions may give different query evaluations, which may in turn differ from the certain answers.
  There is no hope of finding the certain answers by means of SQL query evaluation on a universal solution.

Future Work
Among all universal solutions, is there one that best approximates the certain answers? If yes, can it be computed efficiently?
Extension of the semantics and of query answering for data exchange to the nested (XML) case.

The End

Relationship to Clio
Source-to-target tgds are the same formalism that Clio uses internally (for the relational case):
  Σ_st is generated by Clio in the semantic translation phase [VLDB02] from correspondences (user input).
  Clio then generates, based on Σ_st, a set of queries in the data translation phase [VLDB02].
  These queries compute a solution.
Is Clio’s solution a good one? (Other solutions are also possible.)
  Here we try to understand, formally, the concept of “good” solutions.
Σ_t is more general than the target constraints that Clio can currently handle.
