Managing Inconsistent Data in Data Integration and Data Exchange

Renée J. Miller, University of Toronto
Periklis Andritsos, Ariel Fuxman, Tasos Kementsietsidis, Yannis Velegrakis

Outline
Schema Mapping: reconciling differences in schemas
  Clio: Creating mappings (VLDB00, VLDB02)
    Using semantics of schemas and data
  ToMAS: Managing schema mappings (VLDB03)
    Evolving schemas and semantics
  Using Mappings
    Data Exchange (ICDT03)
    Querying Inconsistent Data (IJCAI/IIWeb03)
Data Mapping: reconciling differences in data
  Hyperion: managing data mappings (SIGMOD03)
    Using networks of P2P data mappings

Speaker notes:
VLDB00 introduced the schema mapping problem: relational schemas with source foreign keys only, and a target with no constraints.
VLDB02 gave an XML solution that considered constraints in the target and a broader class of nested constraints.
Data translation: the novelty is in creating a nested target. Previous work created a flat target, one table at a time; a nested target cannot be built one table at a time.
Understanding mappings (data viewer): there will not be time to discuss this, but there are back-up slides on it at the end of the talk if needed.

11/16/2018 R.J. Miller - U. Toronto

Mapping Independent Data Sources
[Diagram: Source Schemas S, S’ and S’’ and Target Schema T, linked by a Mapping; data “conforms to” each schema; query Q is posed on the target]
Data Integration: answer target queries using data from source(s)
Target data is virtual

Speaker notes:
Compilation: use the semantics embedded in the schema and data to compile the mapping into a correct query.
Issues: the semantics may be incomplete or incorrect, which leads to an incomplete or incorrect query; interactively let the user correct or augment the query.
Emphasize that we are working with data conforming to a schema, not documents or completely unstructured data!
In this talk, focus on the “data exchange” problem of materializing a target (which can then be used for answering queries directly). Note that our solutions can also be used in “query unfolding”, when the target remains virtual and queries on the target are “unfolded” into queries on the source.

Mapping Independent Data Sources
[Diagram: Source Schema S and Target Schema T, linked by a Mapping; data “conforms to” each schema; query Q is posed on the target]
Data Exchange: target queries are answered locally
Target data is materialized

Overview
Goal: interoperability between independent data sources
  Creating Mappings
  Managing Mappings: as sources change
  Using Mappings: to query and exchange data
    Even when data is dirty or inconsistent
Challenges
  Schemas can be arbitrarily different
  Still, data must not lose its meaning
    Use semantics embedded in schemas & data
    Facilitate specification of any additional semantics
  Performed manually: complex user queries, programs, etc.
    Hard to debug, understand, and verify for correctness

Speaker notes:
I will break down our results into two parts: schema mapping (reconciling schematic differences) and data translation (reconciling differences in data content).

Schema Mapping
[Diagram: Source schema S and Target schema T, linked by a Mapping; data “conforms to” each schema; schemas may be XML Schema, DTD, or Relational; query Q is posed on the target]
Wants data from S
Understands T
May not understand S
Automate (to the extent possible) the creation of mappings
Mappings used for (virtual) data integration or (materialized) data exchange

Illustration: Mapping Creation
Support Nested Structures
Element correspondences
  Human friendly
  Automatic discovery
Preserve data meaning
  Discover data associations
  Use constraints & schema
Create New Target Values
Produce Correct Grouping

Speaker notes:
Since grants are logically associated with companies in the source, we preserve this association when mapping data to the target.
Since gid and amount are logically associated in the source (they are in the same relation), we seek a way of preserving that association in the target. If an association is declared in the schema (the FK on aid), we use it; otherwise, we could ask the user to supply such an association in the target.
Basic idea: find all logical associations in the source and try to preserve them in the target.

Creating Correspondences
Graphical User Interface
  DBA interactively specifies
Automatic Discovery
  Attribute (Element) Classifier
  Extensible to Other Schema Matchers (VLDB J. 01 Survey)
Correspondence based on syntactic information
  Within schema or data

Interpreting Correspondences
Target schema (elements in slide order): statDB: Set of Rcd, cityStat: Rcd, orgs: Set of Rcd, org: Rcd, cid, name, fundings: Set of Rcd, funding: Rcd, gid, proj, aid, financials: Set of Rcd, financial: Rcd, date, amount, city
Source schema (elements in slide order): expenseDB: Rcd, companies: Set of Rcd, company: Rcd, cid, name, city, grants: Set of Rcd, grant: Rcd, gid, amount, project

What semantics do we associate to an arrow?
Simple inclusion dependencies
Good enough for one arrow!
  π_{cid}(expenseDB.companies) ⊆ π_{cid}(statDB.cityStat.orgs)
Still works for these two arrows!
  π_{cid,name}(expenseDB.companies) ⊆ π_{cid,name}(statDB.cityStat.orgs)
How about now?
  π_{gid}(expenseDB.grants) ⊆ π_{gid}(statDB.cityStat.orgs.fundings)

Speaker notes:
For gid, this inclusion is right but not sufficient, since it loses the association between grants and companies.

Associations between Elements
[Diagram: the statDB and expenseDB schemas shown on the previous slide]
Associations make use of semantics embedded in nested structure and in constraints
We must recognize that grants are associated to companies
Association (in the source): grants ⋈ companies
Association (in the target): statDB ⋈ orgs ⋈ fundings ⋈ financials

Speaker notes:
Abusing notation here by writing the XML unnest as a join.
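As a rough illustration of why the source association matters, the sketch below joins grants with companies before populating the target, so that each funding stays grouped under its organization. The sample rows, the cid column on grant, and the function names are assumptions made for this example; it is not Clio's algorithm.

```python
from collections import defaultdict

# Hypothetical source instance; the rows are invented for illustration.
companies = [
    {"cid": 1, "name": "Acme", "city": "Toronto"},
    {"cid": 2, "name": "Globex", "city": "Toronto"},
]
# Assumption: each grant carries the cid of the company that received it.
grants = [
    {"cid": 1, "gid": "g1", "amount": 500, "project": "p1"},
    {"cid": 1, "gid": "g2", "amount": 700, "project": "p2"},
    {"cid": 2, "gid": "g3", "amount": 900, "project": "p3"},
]


def source_association(companies, grants):
    """Join grants with companies on cid (the source association)."""
    by_cid = {c["cid"]: c for c in companies}
    return [(by_cid[g["cid"]], g) for g in grants if g["cid"] in by_cid]


def nest_fundings(pairs):
    """Group fundings under their organization, preserving the association."""
    orgs = defaultdict(list)
    for company, grant in pairs:
        orgs[(company["cid"], company["name"])].append(
            {"gid": grant["gid"], "proj": grant["project"]}
        )
    return dict(orgs)


orgs = nest_fundings(source_association(companies, grants))
```

Mapping gid alone through an inclusion dependency would fill fundings with the right identifiers but could not say which org each funding belongs to; the join supplies that grouping.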

Schema Mapping
Enumerate ALL logical associations consistent with schema semantics
  Constraints
  Nesting (schema structure)
  Data
Interpret correspondences (arrows) over pairs of source & target associations

Speaker notes:
The value added is that Clio does this automatically; the user does not have to find all the associations (which might require complex logical inference).

Mappings as Views
[Diagram: Source schema S (Local) with instance I; Target schema T (Global) with instance J, virtual; mapping Σst; target constraints Σt; query q]
Views: Σst have a special form
GAV: Qs(S) ⊆ Ti, where Ti is a relation in T and Qs is a query on S
  Company(C,N,Ct), Grant(C,G,A,P) ⊆ Projects(P,Ct)
  Plain old view: create view projects (p, ct) as (select p,ct from …)
LAV: Si ⊆ Qt(T), where Si is a relation in S and Qt is a query on T
  Company(C,N,City) ⊆ Org(C,N), City(C,Ct)

Speaker notes:
Back-up slide; I only cover this if there are questions on the relationship to answering queries using views.
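A GAV rule such as Company(C,N,Ct), Grant(C,G,A,P) ⊆ Projects(P,Ct) can be read as an ordinary view over the source. The sketch below is one possible rendering using Python's sqlite3 module. The table names, column names, sample rows, and the join on cid (taken from the shared variable C in the rule) are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source relations matching Company(C,N,Ct) and Grant(C,G,A,P).
    CREATE TABLE company (cid INTEGER, name TEXT, city TEXT);
    CREATE TABLE grants (cid INTEGER, gid TEXT, amount INTEGER, project TEXT);
    INSERT INTO company VALUES (1, 'Acme', 'Toronto');
    INSERT INTO grants VALUES (1, 'g1', 500, 'p1');

    -- GAV: the target relation Projects(P, Ct) is defined by a query on the source.
    CREATE VIEW projects AS
        SELECT g.project AS p, c.city AS ct
        FROM company c JOIN grants g ON c.cid = g.cid;
""")

# A query posed on the target is answered by unfolding the view over source data.
rows = conn.execute("SELECT p, ct FROM projects").fetchall()
```

A LAV rule goes the other way (a source relation is described by a query over the target), so it cannot be written as a view definition over the source in this direct manner.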

Clio Mappings
[Diagram: Source schema S (Local) with instance I; Target schema T (Global) with instance J, virtual or materialized; mapping Σst; target constraints Σt; query q]
Clio Schema Mapping: Qs(S) ⊆ Qt(T), and constraints on S and T (Σs, Σt)
More general than views
Generality often required when S, T are fixed
  No design control

Using Mappings and Views
[Diagram: Source schema S with instance I; Target schema T with instance J, virtual; mapping Σst; target constraints Σt; query q]
Data Integration
  The target is not materialized; it is just a querying interface
  Queries are posed on the target schema; data is in the source
  Problem: how to answer the query in the “best” possible way
AKA: Answering queries using views
  GAV/LAV
  (mostly) assumes conjunctive queries
  (mostly) assumes no target constraints: the target is a view
  Uses relational (not nested relational) model

Using Mappings: Data Exchange
[Diagram: Source schema S (Local) with instance I; Target schema T (Global) with instance J, materialized; mapping Σst; target constraints Σt; query q]
Data Exchange
  The target is materialized
  Queries are posed on the target schema; answered using target data
  Problem: what is the “best” instance to exchange?
Mapping: Qs(S) ⊆ Qt(T), and constraints on S and T (Σs, Σt)
  Given an instance of S there may be many instances of T
  Which is the best instance to exchange?
  Grant(C,G,A,P,S) ⊆ Funding(C,G,Aid), Financials(Aid,D,A)

Semantics of Query Answering
Answering queries using views
  Query is answered using source data
  Answer is the set of tuples in the query result on ALL possible target instances: certain answers
Data Exchange
  Query is answered using ONE materialized target
  Can a single target give the same information as the source(s)?
  Is the query result the same in both settings?

Mappings at Data Level
Financial (Employee, Salary)
Human Resources (Employee, Salary)
Target Database (Employee, Salary)
Mapping:
  Financial(e,s) ⊆ Global(e,s)
  HumanRes(e,s) ⊆ Global(e,s)

Data Inconsistency
Financial (Employee, Salary): (John, 1000)
Human Resources (Employee, Salary): (John, 2000), (Mary, 3000)
Target Database (Employee, Salary): (John, 1000), (John, 2000), (Mary, 3000)
Mapping:
  Financial(e,s) ⊆ Global(e,s)
  HumanRes(e,s) ⊆ Global(e,s)
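The inconsistency on this slide can be reproduced in a few lines. The sketch below copies both sources into the target as the two mapping rules require, then checks the target against the key Employee. Treating Employee as a key of the target is an assumption made here (the slide only implies it), and the function names are hypothetical.

```python
from collections import defaultdict

financial = [("John", 1000)]
human_res = [("John", 2000), ("Mary", 3000)]


def exchange(*sources):
    """Copy every source tuple into the target, as both mapping rules require."""
    target = set()
    for src in sources:
        target.update(src)
    return target


def key_violations(target):
    """Return employees with more than one salary (violations of key Employee)."""
    salaries = defaultdict(set)
    for emp, sal in target:
        salaries[emp].add(sal)
    return {e: sorted(s) for e, s in salaries.items() if len(s) > 1}


target = exchange(financial, human_res)
conflicts = key_violations(target)
```

Each source is consistent on its own; the conflict for John only appears once the two are combined in the target.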

Reconciling Inconsistencies (I)
1. Delete all tuples for John
Target Database (Employee, Salary): (Mary, 3000)
The Financial and Human Resources sources and the mapping are unchanged.

Reconciling Inconsistencies (II)
2. Delete the salaries of John
Target Database (Employee, Salary): (John, null), (Mary, 3000)
The Financial and Human Resources sources and the mapping are unchanged.

Reconciling Inconsistencies (III)
3. Delete only one tuple for John
Target Database (Employee, Salary): (John, 1000), (Mary, 3000)
The Financial and Human Resources sources and the mapping are unchanged.
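The three reconciliation options can be compared side by side. The following is a minimal sketch, assuming Employee is the key and using hypothetical function names; which tuple option 3 keeps is arbitrary, and `min` is used here only because it reproduces the slide's choice of 1000.

```python
TARGET = [("John", 1000), ("John", 2000), ("Mary", 3000)]


def salaries_by_employee(rows):
    """Group the distinct salaries recorded for each employee."""
    grouped = {}
    for emp, sal in rows:
        grouped.setdefault(emp, set()).add(sal)
    return grouped


def delete_all(rows):
    """Option 1: drop every tuple of an employee that has conflicting salaries."""
    grouped = salaries_by_employee(rows)
    return sorted({(e, s) for e, s in rows if len(grouped[e]) == 1})


def null_salaries(rows):
    """Option 2: keep the employee but replace conflicting salaries with None."""
    grouped = salaries_by_employee(rows)
    return [
        (e, next(iter(s)) if len(s) == 1 else None)
        for e, s in sorted(grouped.items())
    ]


def keep_one(rows, choose=min):
    """Option 3: keep one salary per employee, picked by `choose`."""
    grouped = salaries_by_employee(rows)
    return [(e, choose(s)) for e, s in sorted(grouped.items())]
```

Option 1 loses the fact that John exists, option 2 loses both candidate salaries, and option 3 commits to one value that may be the wrong one.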

Repairing an integrated database
An integrated inconsistent database (Employee, Salary): (John, 1000), (John, 2000), (Mary, 3000)
Repair 1 (Employee, Salary): (John, 1000), (Mary, 3000)
Repair 2 (Employee, Salary): (John, 2000), (Mary, 3000)

Consistent Query Answers
Intuition:
  Input: query Q
  Get a query result Q(R) for each repair R.
  A tuple is in the consistent answer if it appears in all query results.
Repair 1 (Employee, Salary): (John, 1000), (Mary, 3000)
Repair 2 (Employee, Salary): (John, 2000), (Mary, 3000)

Q(e,s) = Target(e,s): “Get all employees and their salaries”
Repair 1 (Employee, Salary): (John, 1000), (Mary, 3000)
Repair 2 (Employee, Salary): (John, 2000), (Mary, 3000)
ConsistentS(Q,I) = {(Mary,3000)}

Q(e) = ∃s: Global(e,s): “Get all employees”
Repair 1 (Employee, Salary): (John, 1000), (Mary, 3000)
Repair 2 (Employee, Salary): (John, 2000), (Mary, 3000)
Result 1 (Employee): (John), (Mary)
Result 2 (Employee): (John), (Mary)
ConsistentS(Q,I) = {(John),(Mary)}

Q = ∃e Target(e,2000): “Is there an employee who earns $2000?”
Repair 1 (Employee, Salary): (John, 1000), (Mary, 3000); result: FALSE
Repair 2 (Employee, Salary): (John, 2000), (Mary, 3000); result: TRUE
ConsistentS(Q,I) = FALSE
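The three example queries can be checked mechanically. The sketch below builds every repair explicitly and intersects the query results. It assumes repairs keep exactly one salary per employee (key Employee, repair by tuple deletion), and the function names are hypothetical. Enumerating repairs is only feasible for tiny instances; it is used here purely to illustrate the definition.

```python
from itertools import product

TARGET = [("John", 1000), ("John", 2000), ("Mary", 3000)]


def repairs(rows):
    """Yield every repair: keep exactly one salary per employee."""
    groups = {}
    for emp, sal in rows:
        groups.setdefault(emp, set()).add(sal)
    emps = sorted(groups)
    for choice in product(*(sorted(groups[e]) for e in emps)):
        yield set(zip(emps, choice))


def consistent_answers(query, rows):
    """Keep only the tuples that appear in the query result on every repair."""
    results = [query(r) for r in repairs(rows)]
    return set.intersection(*results)


def q_all(r):
    """Q(e,s) = Target(e,s)."""
    return set(r)


def q_employees(r):
    """Q(e) = exists s: Global(e,s)."""
    return {(e,) for e, _ in r}


def q_earns_2000(r):
    """Boolean query: exists e Target(e,2000). The empty tuple encodes TRUE."""
    return {()} if any(s == 2000 for _, s in r) else set()
```

John appears in the consistent answer to the second query even though neither of his salaries appears in the answer to the first, because every repair contains some tuple for John.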

Our work (IJCAI/IIWeb03)
Problem: Retrieving consistent answers is co-NP-complete in general (i.e., we need to explore an exponential number of repairs) [Chomicki and Marcinkowski 2002, Cali et al. 2003]
Goal: Find a class of tractable queries (i.e., the consistent answers can be retrieved in polynomial time without explicitly building all repairs).
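To see why building all repairs does not scale, note that under a key constraint the number of repairs is the product of the number of conflicting values per key. The sketch below counts repairs without building them; the synthetic instance and the function name are assumptions for illustration.

```python
def count_repairs(rows):
    """Number of repairs under a key on the first column: product of group sizes."""
    groups = {}
    for key, value in rows:
        groups.setdefault(key, set()).add(value)
    total = 1
    for values in groups.values():
        total *= len(values)
    return total


# Synthetic instance: 20 employees, each with two conflicting salaries.
rows = [(f"emp{i}", salary) for i in range(20) for salary in (1000, 2000)]
n_repairs = count_repairs(rows)
```

Forty tuples already give over a million repairs, which is why the goal above is to answer queries without materializing them.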

Example: A tractable query
Are there two employees with the same salary?
[Figure: an inconsistent instance over (Employee, Salary) with employees John, Mary and Anna and salaries 1000, 2000 and 3000, shown both as a table and as a graph linking employees to salaries]

Inexpressibility result
Query rewriting
  Input: query Q
  Output: query Q’ s.t. Q’(I) = consistentS(Q,I) for every I.
Appealing approach
  tractable
  reuses existing DBMSs
BUT: so far known to be applicable only to a restricted class of queries ([ABC, PODS 1999])
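For a simple query, the rewriting idea can be illustrated directly in SQL. The sketch below uses Python's sqlite3 module and rewrites Q(e,s) = Target(e,s) so that a tuple is returned only if no other tuple shares its key with a different salary. The table layout, the key on employee, and this particular rewriting are assumptions chosen for illustration; they are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (employee TEXT, salary INTEGER);
    INSERT INTO target VALUES ('John', 1000), ('John', 2000), ('Mary', 3000);
""")

# Rewritten query Q' for Q(e,s) = Target(e,s): drop any tuple that has a
# conflicting tuple, i.e. same employee (the assumed key) but another salary.
REWRITTEN = """
    SELECT t.employee, t.salary
    FROM target t
    WHERE NOT EXISTS (
        SELECT 1 FROM target u
        WHERE u.employee = t.employee AND u.salary <> t.salary
    )
"""

consistent = conn.execute(REWRITTEN).fetchall()
```

The rewritten query runs on the inconsistent table as is, so no repair is ever built; the catch noted on the slide is that such rewritings are only known for a restricted class of queries.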

Can we use query rewriting? NO

Practical Considerations (I)
Conflicts are usually confined to a small portion of the database
[Figure: graph of an instance with employees Robert, Fred, Paul, Peter, John, Mary and Anna and salaries from 1000 to 7000; a second view keeps only the conflicting portion, which involves John, Mary and Anna and the salaries 1000, 2000 and 3000]

Practical Considerations (II)
Reasonable assumption in integration and exchange: constant number of conflicts per key.
Financial (Employee, Salary), with Employee → Salary: (John, 1000)
Human Resources (Employee, Salary), with Employee → Salary: (John, 2000), (Mary, 3000)
Target Database (Employee, Salary): (John, 1000), (John, 2000), (Mary, 3000)

Bibliography
J. Chomicki and J. Marcinkowski. On the Computational Complexity of Consistent Query Answers. CoRR cs.DB/ , 2002.
M. Arenas, L. Bertossi, and J. Chomicki. Consistent Query Answers in Inconsistent Databases. Proc. ACM PODS, 1999.
A. Calì, D. Lembo, and R. Rosati. On the Decidability and Complexity of Query Answering over Inconsistent and Incomplete Databases. Proc. ACM PODS, 2003.

Data Mapping (SIGMOD03)
What if sources are unwilling to share schemas?
  Common in more autonomous P2P settings
How can such sources share data?
  Shared schema mappings are not appropriate
  Need to manage and share data mappings
Hyperion: P2P data sharing

P2P File-Sharing Systems
Currently, P2P querying relies on the use of value searches.
  e.g., retrieve songs for music band “New Order”
However, P2P query mechanisms do not capture the intricacies of values, i.e., that values are often associated to each other.
  e.g., the value “New Order” is an alias for the value “Joy Division”
We propose the use of mapping tables to record such associations
  e.g., a mapping table that records artist aliases
Mapping table (old-name, new-name): (Prince, The Artist), (Puff Daddy, P. Diddy), (Joy Division, New Order)

Speaker notes:
We start by noting that in current file-sharing systems querying relies on the use of value searches. As an example, queries are of the form “retrieve all songs for Artist = ‘New Order’”. In spite of this dependency on values, these query mechanisms do not have any level of sophistication to manage the values or their associations.
To see why this is important, consider what is probably the most common type of value association: aliases. For example, we know that the band “New Order” was initially called “Joy Division”. If the system knew of this particular value association, then while retrieving “New Order” songs it would also retrieve the songs for “Joy Division”.
In this work we propose the use of mapping tables to record this, and other types, of associations. A mapping table is like a relational table; in this case it has two columns, one containing the old artist name and the other the new one.
Mapping tables can be used to express more complex associations, as we show in our next example.
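The alias example can be sketched as a value search that first expands the search value through the mapping table. The song list and the function names below are assumptions added for illustration; the alias rows are the ones on the slide. The expansion follows a single step in either direction and does not chase chains of aliases.

```python
# Mapping table from the slide: (old-name, new-name).
ALIASES = [
    ("Prince", "The Artist"),
    ("Puff Daddy", "P. Diddy"),
    ("Joy Division", "New Order"),
]

# Hypothetical peer content: (title, artist). Sample data for illustration only.
SONGS = [
    ("Blue Monday", "New Order"),
    ("Atmosphere", "Joy Division"),
    ("Kiss", "Prince"),
]


def expand(value, mapping):
    """Return the value plus every name the mapping table associates with it."""
    names = {value}
    for old, new in mapping:
        if old == value:
            names.add(new)
        if new == value:
            names.add(old)
    return names


def search(artist, songs, mapping):
    """Value search that also matches aliases recorded in the mapping table."""
    wanted = expand(artist, mapping)
    return sorted(title for title, a in songs if a in wanted)
```

Without the mapping table, a search for “New Order” would return only the first song; with it, the “Joy Division” entry is found as well.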

A P2P Genome Database System
Peers store information about genes, proteins, etc.
SwissProt (pid, name): (101, Neurofibromin), (102, p75 ICD), (103, Neuromedin), (104, Sialidase 1), (105, G9 Sialidase)
Gene (gid, name): (001, NF1), (002, NID), (003, NGFR), (004, NEU1)
Mapping table (gid, pid): (001, 101), (003, 102), (004, 104), (004, 105)
Characteristics of mapping tables:
  The recorded associations can be 1:1, 1:n or m:n
  They are, in general, non-binary
  They associate values within or across domains

Speaker notes:
The example we present is from biological databases. We chose this example since we found that a number of databases are logically organized as a P2P system. The peers here store information about genes, proteins, diseases and the like. We show here two such sources, one with genes and one with proteins. The mapping table, in the middle, associates values (in this case identifiers) from the two sources. What each association in the table records is the fact that a particular gene produces the indicated protein.
A few things to note about mapping tables. First, the association recorded in the table can be 1:1, 1:n or m:n. In our example, gene 004 is associated to two proteins, which means that it produces either one. Second, mapping tables are in general non-binary. For example, it might be the case that the identifier for genes comprises two attributes. Finally, as we saw through our examples, mapping tables can associate values within a domain (our alias example) or across domains (our example here with genes and proteins).
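The gene and protein tables on this slide can be connected through the mapping table with a simple lookup. The sketch below uses the values shown on the slide; the function name and the choice to return protein names sorted alphabetically are assumptions for illustration, and a real P2P setting would issue these lookups across peers rather than over local dictionaries.

```python
GENES = {"001": "NF1", "002": "NID", "003": "NGFR", "004": "NEU1"}
PROTEINS = {
    "101": "Neurofibromin",
    "102": "p75 ICD",
    "103": "Neuromedin",
    "104": "Sialidase 1",
    "105": "G9 Sialidase",
}
# Mapping table (gid, pid): gene 004 maps to two proteins (a 1:n association).
GENE_TO_PROTEIN = [("001", "101"), ("003", "102"), ("004", "104"), ("004", "105")]


def proteins_for_gene(gene_name):
    """Translate a gene name into protein names via the mapping table."""
    gids = {gid for gid, name in GENES.items() if name == gene_name}
    pids = [pid for gid, pid in GENE_TO_PROTEIN if gid in gids]
    return sorted(PROTEINS[pid] for pid in pids)
```

Gene 002 has no row in the mapping table, so a query for it returns nothing; how such missing rows should be interpreted is one of the semantic questions the next slide raises.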

Contributions
State of the art:
  Mapping tables represent expert knowledge.
  Currently, they are created manually by domain specialists.
Our contributions:
  We automate the creation and maintenance of these tables. More specifically:
  We investigate alternative semantics for mapping tables.
  We motivate why reasoning capabilities are needed to manage them.
  We propose efficient algorithms both for finding inconsistencies in mapping tables and for inferring new mapping tables.

Speaker notes:
Mapping tables represent expert knowledge, and in the domains where they are used, for example in biological databases, they are created manually by domain specialists. In this work our objective is to automate the creation and maintenance of these tables. To this end, we investigate alternative mapping table semantics, we motivate why reasoning capabilities are needed to manage the tables, and we propose algorithms both for finding inconsistencies between tables and for inferring new ones.

Conclusions
Managing Data Inconsistency
Tolerate inconsistency
  Identify inconsistency at query time
  Recognizes that cleaning is not always possible or desirable
Reconciling inconsistency
  Data mappings record reconciliation
  Manage use and combination of data mappings
