Data Integration Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 9, 2008.

Presentation transcript:

Data Integration Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 9, 2008

2 Administrivia Next week: Tuesday – Database & Information Retrieval Day Levine 101

3 Schema Mappings
• We generally use queries as the basis of mappings
• Goals:
  • Compose a query with a set of mappings
  • Intersect the constraints in the query and mappings – only returning data matching the constraints
  • (Possibly) compose chains of mappings
  • (Possibly) invert mappings
• The basic formalism: mappings are conjunctive queries
  • Q(X) :- R1(X1), R2(X2), …, c1(X1)
  • And both queries and the overall set of mappings are unions of conjunctive queries
• Why: tractability!

4 The Job of Mappings
Between different data sources:
• May have different numbers of tables – different decompositions
• Attributes may be broken down differently (“rating” vs. “EbertThumb” and “RoeperThumb”)
• Metadata in one relation may be data in another
• Values may not exactly correspond (“shows” vs. “movies”)
• It may be unclear whether a value is the same (“COPPOLA” vs. “Francis Ford Coppola”)
• May have different, but synonymous terms (ImdbID “123456” ↔ SSN “ ”)
• Might have sub/superclass relationships

5 General Techniques
• Value-value correspondences accomplished using concordance tables
  • Join through a table mapping values to values
  • Imdb_Actor(ID, SAG_actor_name)
• Table-multitable correspondences accomplished using joins (in one direction), projections (in other direction)
  • Key question: what happens if a needed attribute is missing? (e.g., DecentMovie has no genre)
• Super/subclass relationships generally must be captured using selection (in one direction), union (in other direction)
• … And sometimes we just can’t specify the correspondence!

6 Some Examples of Mappings
• Show(ID, Title, Year, Lang, Genre)
• Movie(ID, Title, Year, Genre, Director, Star1, Star2)
• EnglishMovie(Title, Year, Genre, Rating)
• Docu(ID, Title, Year), Participant(ID, Name, Role)
PieceOfArt(I, T, Y, “English”, “G”) :- EnglishMovie(T, Y, G, _), MovieIDFor(I, T, Y)
Movie(I, T, Y, “doc”, D, S1, S2) :- Docu(I, T, Y), Participant(I, D, “Dir”), Participant(I, S1, “Cast1”), Participant(I, S2, “Cast2”)
T1: ImdbID | CastOf
    1234   | Catwoman
T2: Name      | CastOf
    Berry, H. | Monster’s Ball
Need a concordance table from ImdbIDs to actress names

7 Query Answering with Mappings: Reformulation
• Inputs: a query Q, a set of mappings M, and a set of sources S
  • M1(X) :- R1(X1, Y1), R2(X2, Y2), …, c1(X1, Y1), …
  • ∀X M2(X) → ∃Y1, Y2 R1(X1, Y1) ∧ R2(X2, Y2) ∧ … ∧ c1(X1, Y1) ∧ …
• Goal: a set of rewritings Q’, expressed as a union of conjunctive queries over S
  • which typically returns the set of all certain answers – those answers implied by the base data and the constraints expressed in the mappings

8 Kinds of Schema Mappings
• Global As View (GAV):
  • ∀X M(X) ← ∃Y1, Y2 R1(X1, Y1) ∧ R2(X2, Y2) ∧ … ∧ c1(X1, Y1) ∧ …
  • Q(X) :- M_R(X1, Y1), M_S(X2, Y2), …
• Local As View (LAV):
  • ∀X M_R(X) → ∃Y1, Y2 R1(X1, Y1) ∧ R2(X2, Y2) ∧ … ∧ c1(X1, Y1) ∧ …
  • Q(X) :- M_R(X1, Y1), M_S(X2, Y2), …
• Global-Local As View (GLAV), aka Tuple-Generating Dependencies (TGDs):
  • ∀X M_R(X1, Z1), M_S(X2, Z1) → ∃Y1, Y2 R1(W1, Y1) ∧ R2(W2, Y2) ∧ … ∧ c1(X1, Y1) ∧ …, where X1 ∪ X2 = W1 ∪ W2 ∪ …

9 Query Reformulation in Global-As-View
• The most traditional scheme, implemented in most commercial systems
  • Mediated schema is a view over source data
  • Example real-world systems: IBM DB2 / WebSphere Information Integrator; Oracle Fusion
• Reuses query unfolding capabilities from a DBMS:
  • Query over a View over Base data → Query over Base data

10 Query Unfolding: Basic Procedure
• V1(x,y,z) :- R1(x,y,w), R2(w,u), R3(u,z)
• V2(x,y) :- R1(x,u), R2(u,y), R3(y,w), R4(w,z)
• Q(u) :- V1(u,v,w), V2(x,y)
• Substitute the body of V1 into Q, renaming appropriately; repeat for V2
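The substitution step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple-based atom representation, the names `VIEWS` and `unfold`, and the `_n` suffix used for renaming are all hypothetical, every argument is assumed to be a variable, and only one level of views is unfolded.

```python
from itertools import count

# Hypothetical representation: an atom is (predicate, [argument names]).
# Each view maps to (head variables, body atoms), copied from the slide.
VIEWS = {
    "V1": (["x", "y", "z"],
           [("R1", ["x", "y", "w"]), ("R2", ["w", "u"]), ("R3", ["u", "z"])]),
    "V2": (["x", "y"],
           [("R1", ["x", "u"]), ("R2", ["u", "y"]),
            ("R3", ["y", "w"]), ("R4", ["w", "z"])]),
}


def unfold(body, views):
    """Replace each view atom in `body` by that view's body atoms."""
    fresh = count(1)
    result = []
    for pred, args in body:
        if pred not in views:
            result.append((pred, list(args)))
            continue
        head_vars, view_body = views[pred]
        # Head variables of the view take the caller's arguments ...
        subst = dict(zip(head_vars, args))
        n = next(fresh)
        for vpred, vargs in view_body:
            # ... and every other view variable gets a fresh per-use name,
            # so it cannot clash with a variable of the query.
            result.append(
                (vpred, [subst.setdefault(v, f"{v}_{n}") for v in vargs]))
    return result


# Q(u) :- V1(u,v,w), V2(x,y)
unfolded = unfold([("V1", ["u", "v", "w"]), ("V2", ["x", "y"])], VIEWS)
for atom in unfolded:
    print(atom)
```

Note how the view-local `w` and `u` of V1 are renamed to `w_1` and `u_1`; without the renaming they would be confused with the query's own `w` and `u`.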

11 Challenges
• If there are multiple rules for a view, unfolding may generate an exponential number of queries
• Each query might be non-minimal
  • Leads to reasoning about query containment and equivalence
  • If containment holds both ways between Q1, Q2 then they are equivalent
  • We’ll see a containment check later…

12 Global-As-View: Summary
• Very easy to implement – doesn’t require any new logic on the part of a regular DBMS engine
  • For instance, Starburst QGM rewrites would work
• But some drawbacks – primarily that:
  • We don’t have a mechanism to describe when a source contains only a subset of the data in the mediated schema
    • e.g., “All books from this source are of type textbook”
  • The mediated schema often needs to change as we add sources – it is somewhat “brittle” because it’s defined in terms of sources

13 An Alternate Approach: Local-As-View
When you integrate something, you have some conceptual model of the integrated domain
• Define that as a basic frame of reference, everything else as a view over it
• “Local as View” using mappings that are conjunctive queries
May have overlapping/incomplete sources
• Define each source as the subset of a query over the mediated schema – the “open world assumption”
• We can use selection or join predicates to specify that a source contains a range of values:
  ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”

14 The Local-as-View Model
The basic model is the following:
• “Local” sources are views over the mediated schema
• Sources have the data – mediated schema is virtual
• Sources may not have all the data from the domain – “open-world assumption”
The system must use the sources (views) to answer queries over the mediated schema

15 Answering Queries Using Views
Assumption: conjunctive queries, set semantics
• Suppose we have a mediated schema: show(ID, title, year, genre), rating(ID, stars, source)
• A conjunctive query might be: q(t) :- show(i, t, y, g), rating(i, 5, s)
Recall intuitions about this class of queries:
• Adding a conjunct to a query (e.g., y = 1997) removes answers from the result but never adds any
• Any conjunctive query with at least the same constraints & conjuncts will give valid answers

16 Why This Class of Mappings & Queries?
• Abiteboul & Duschka showed the data complexity of answering queries using views with OWA:
views \ queries | CQ    | CQ≠   | PQ    | datalog | FO
CQ              | PTIME | co-NP | PTIME | PTIME   | undec
CQ≠             | PTIME | co-NP | PTIME | PTIME   | undec
PQ              | co-NP | co-NP | co-NP | co-NP   | undec
datalog         | co-NP | undec | co-NP | undec   | undec
FO              | undec | undec | undec | undec   | undec
• Note that the common “inflationary semantics” version of Datalog must terminate in PTIME, even with recursion

17 Query Answering
Suppose we have the query:
q(t) :- show(i, t, y, g), rating(i, 5, s)
and sources:
5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
critics(i,r,s) ⊆ rating(i, r, s)
goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
We want to compose the query with the source mappings – but they’re in the wrong direction!

18 Inverse Rules
We can take every mapping and “invert” it, though sometimes we may have insufficient information:
If 5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
then we can also infer that: show(i, ???, ???, ???) ⊇ 5star(i)
But how to handle the absence of the missing attributes?
• We know that there must be AT LEAST one instance of ??? for each attribute for each show ID
• So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)…

19 But NULLs Lose Information
Suppose we take these rules and ask for:
q(t) :- show(i, t, y, g), rating(i, 5, s)
If we look at the rule:
goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
“By inspection,” q(t) ⊇ goodMovies(t,y)
But if we apply our inversion procedure, we get:
show(i, t, y, g) ⊇ goodMovies(t,y), i = NULL, g = “drama”, y = 1997
rating(i, r, s) ⊇ goodMovies(t,y), i = NULL, r = 5, s = NULL
We need “a special NULL” so we can figure out which IDs and ratings match up

20 The Solution: “Skolem Functions”
Skolem functions:
• Conceptual “perfect” hash functions
• Each function returns a unique, deterministic value for each combination of input values
• Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values)
Skolem functions won’t ever be part of the answer set or the computation – they don’t produce real values
• They’re just a way of logically generating “special NULLs”
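To make the “special NULL” idea concrete, here is a small Python sketch that inverts one LAV view into Datalog-style rules, replacing each variable that the view does not export with a Skolem term over the view's head. It is an illustration only: the function names `is_var` and `invert`, the `f_<view>_<var>` naming scheme, and the string encoding of constants are assumptions, and the comparison predicate y = 1997 is left out.

```python
def is_var(term):
    # Assumption: variables are bare identifiers such as i or t; numbers and
    # quoted strings such as '"drama"' are constants.
    return isinstance(term, str) and term.isidentifier()


def invert(view_name, head_vars, body):
    """Return one inverse rule (as text) per body atom of a LAV view."""
    head = f"{view_name}({', '.join(head_vars)})"
    rules = []
    for pred, args in body:
        terms = []
        for a in args:
            if is_var(a) and a not in head_vars:
                # The view does not export this variable: stand in a Skolem
                # term that depends only on the exported values.
                terms.append(f"f_{view_name}_{a}({', '.join(head_vars)})")
            else:
                terms.append(str(a))
        rules.append(f"{pred}({', '.join(terms)}) :- {head}")
    return rules


# goodMovies(t,y) ⊆ show(i, t, y, "drama"), rating(i, 5, s)
inverse_rules = invert(
    "goodMovies", ["t", "y"],
    [("show", ["i", "t", "y", '"drama"']), ("rating", ["i", 5, "s"])],
)
for rule in inverse_rules:
    print(rule)
```

Both generated rules mention the same term `f_goodMovies_i(t, y)`, which is what lets a later join on the show ID line up the `show` and `rating` facts that came from the same goodMovies tuple; a plain NULL could not do that.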

21 Query Answering Using Inverse Rules
Invert all rules using the procedures described
Take the query and the possible rule expansions and execute them in a Datalog interpreter
• In the previous query, we expand with all combinations of expansions of show and of rating – every possible way of combining and cross-correlating info from different sources
• Then discard unsatisfiable rewritings via unification, i.e., substituting in constants from the query for variables in the view
• Finally, execute the union of all satisfiable rewritings

22 Pros & Cons of Inverse Rules
• Works even with recursive queries, binding patterns, FDs on schemas
• Generally, they take view definitions, split them, and re-join them to produce answers
• Not very efficient
• No treatment of predicates
• Can we do better?

23 The Bucket Algorithm
• Given a query Q with relations and predicates
  • Create a bucket for each subgoal in Q
  • Iterate over each view (source mapping)
    • If the source includes the bucket’s subgoal:
      • Create a mapping between Q’s variables and the view’s variables at the same positions
      • If satisfiable with substitutions, add to bucket
  • Do cross-product of buckets, see if result is contained (exptime, but queries are probably relatively small)
  • For each result, do a containment check to make sure the rewriting is contained within the query

24 Let’s Try a Bucket Example
Query:
q(t) :- show(i, t, y, g), rating(i, 5, s)
Sources:
5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
critics(i,r,s) ⊆ rating(i, r, s)
goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
good98(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1998

25 Populating the Buckets
Buckets (one per query subgoal): show(i,t,y,g) | rating(i,5,s)
Views to place: 5star(i), TVguide(t,y,g,r), movieInfo(i,t,y,g), critics(i,r,s), goodMovies(t,y), good98(t,y)
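The bucket-filling step for this example might be sketched as follows. This is a hedged illustration rather than the algorithm as published: the names `matches` and `fill_buckets` and the data encoding are hypothetical, the predicates y = 1997 and y = 1998 are omitted, and the rule that a query constant may only meet an exported view variable is a simplifying assumption that prunes candidates the later containment check would reject anyway.

```python
def is_var(term):
    # Assumption: bare identifiers are variables; numbers and quoted strings
    # such as '"drama"' are constants.
    return isinstance(term, str) and term.isidentifier()


QUERY_HEAD = ["t"]
QUERY_BODY = [("show", ["i", "t", "y", "g"]), ("rating", ["i", 5, "s"])]

SHOW = ("show", ["i", "t", "y", "g"])
DRAMA = ("show", ["i", "t", "y", '"drama"'])
FIVE = ("rating", ["i", 5, "s"])
VIEWS = {
    "5star": (["i"], [SHOW, FIVE]),
    "TVguide": (["t", "y", "g", "r"],
                [SHOW, ("rating", ["i", "r", '"TVGuide"'])]),
    "movieInfo": (["i", "t", "y", "g"], [SHOW]),
    "critics": (["i", "r", "s"], [("rating", ["i", "r", "s"])]),
    "goodMovies": (["t", "y"], [DRAMA, FIVE]),
    "good98": (["t", "y"], [DRAMA, FIVE]),
}


def matches(subgoal, view_atom, view_head, query_head):
    """Can this view atom stand in for the query subgoal, position by position?"""
    (q_pred, q_args), (v_pred, v_args) = subgoal, view_atom
    if q_pred != v_pred or len(q_args) != len(v_args):
        return False
    for q, v in zip(q_args, v_args):
        if is_var(q):
            # A variable the query returns must also be returned by the view.
            if q in query_head and is_var(v) and v not in view_head:
                return False
        elif is_var(v):
            # Query constant against a view variable: we can only filter on
            # it if the view exports that variable.
            if v not in view_head:
                return False
        elif q != v:
            return False  # two different constants never match
    return True


def fill_buckets(query_head, query_body, views):
    """Map each subgoal index to the names of the views that can cover it."""
    buckets = {}
    for idx, subgoal in enumerate(query_body):
        buckets[idx] = [
            name for name, (head, body) in views.items()
            if any(matches(subgoal, atom, head, query_head) for atom in body)
        ]
    return buckets


buckets = fill_buckets(QUERY_HEAD, QUERY_BODY, VIEWS)
for idx, names in buckets.items():
    print(QUERY_BODY[idx][0], names)
```

Under these assumptions 5star lands only in the rating bucket: it contains a show subgoal, but it does not export the title t that the query returns.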

26 Evaluation
• On the board…

27 Example of Containment Testing
Suppose we have two queries:
q1(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997
q2(t,y) :- show(i, t, y, “drama”), rating(i, 5, s)
Intuitively, q1 must contain the same or fewer answers vs. q2:
• It has all of the same conditions, except one extra conjunction (i.e., it’s more restricted)
• There’s no union or any other way it can add more data
We can say that q2 contains q1 because this holds for any instance of our mediated schema

28 Checking Containment via Canonical Databases
• To test for q1 ⊆ q2:
  • Create a “canonical DB” that contains a tuple for each subgoal in q1
  • Execute q2 over it
  • If q2 returns a tuple that matches the head of q1, then q1 ⊆ q2
  (Containment of conjunctive queries is NP-complete in the size of the query. Testing containment for full first-order logic queries is undecidable!)
• Let’s see this for our example…

29 Example Canonical DB
q1(t) :- show(i, t, 1997, g), rating(i, 5, s)
q2(t,y) :- show(i, t, y, “drama”), rating(i, 5, s)
Canonical DB for q1:
show: (i, t, 1997, g)
rating: (i, 5, s)
Need to get a tuple matching the head of q1 when executing q2 over this database
What if q2 didn’t ask for g = drama?
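The canonical-database test can be tried out with a short Python sketch. Several things here are assumptions made for illustration: the function names, the tuple encoding of atoms, the use of ("frozen", name) pairs as the constants that stand for q1's variables, and the choice to give both queries the head (t) so that their answers are comparable.

```python
def is_var(term):
    # Assumption: bare identifiers are variables; numbers and quoted strings
    # such as '"drama"' are constants.
    return isinstance(term, str) and term.isidentifier()


def freeze(term):
    # Each variable of q1 becomes its own distinct constant.
    return ("frozen", term) if is_var(term) else term


def homomorphisms(body, db, binding=None):
    """Yield every variable assignment that maps all atoms of body into db."""
    binding = binding or {}
    if not body:
        yield binding
        return
    (pred, args), rest = body[0], body[1:]
    for fact_pred, fact_args in db:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        new = dict(binding)
        consistent = True
        for a, f in zip(args, fact_args):
            if is_var(a):
                if new.setdefault(a, f) != f:
                    consistent = False
                    break
            elif a != f:
                consistent = False
                break
        if consistent:
            yield from homomorphisms(rest, db, new)


def contained_in(q1, q2):
    """True if q1 ⊆ q2, for conjunctive queries given as (head, body)."""
    (head1, body1), (head2, body2) = q1, q2
    canonical_db = [(p, [freeze(a) for a in args]) for p, args in body1]
    target = [freeze(a) for a in head1]
    return any(
        [h.get(a, a) if is_var(a) else a for a in head2] == target
        for h in homomorphisms(body2, canonical_db)
    )


RATING = ("rating", ["i", 5, "s"])
q1 = (["t"], [("show", ["i", "t", 1997, "g"]), RATING])
q2 = (["t"], [("show", ["i", "t", "y", '"drama"']), RATING])
q2_any_genre = (["t"], [("show", ["i", "t", "y", "g"]), RATING])

print(contained_in(q1, q2))            # q2 insists on "drama"
print(contained_in(q1, q2_any_genre))  # q2 without the genre condition
```

In this sketch the first test fails because the frozen genre g in the canonical DB is not the constant “drama”, so q2 finds no match; dropping the genre condition from q2 makes the test succeed, which is one way to read the question at the end of the slide.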

30 Buckets, Rev. 2: The MiniCon Algorithm
• A “much smarter” bucket algorithm:
  • In many cases, we don’t need to perform the cross-product of all items in all buckets
  • Eliminates the need for the containment check
• This – and the Chase & Backchase strategy of Tannen et al. – are the two methods most used in virtual data integration today

31 MiniCon Descriptions (MCDs)
• Basically, a modification to the bucket approach
  • “head homomorphism” – defines what variables must be equated
  • Variable-substituted version of the subgoals
  • Mapping of variable names
  • Info about what’s covered
• Property 1:
  • If a variable occurs in the head of a query, then there must be a corresponding variable in the head of the MCD view
  • If a variable participates in a join predicate in the query, then it must be in the head of the view

32 MCD Construction
For each subgoal of the query
  For each subgoal of each view
    Choose the least restrictive head homomorphism to match the subgoal of the query
    If we can find a way of mapping the variables, then add an MCD for each possible “maximal” extension of the mapping that satisfies Property 1

33 MCDs for Our Example
5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
critics(i,r,s) ⊆ rating(i, r, s)
goodMovies(t,y) ⊆ show(i, t, 1997, “drama”), rating(i, 5, s)
good98(t,y) ⊆ show(i, t, 1998, “drama”), rating(i, 5, s)
q(t) :- show(i, t, y, g), rating(i, r, s), r = 5
view               | h.h.                   | mapping                | goals sat.
5star(i)           | i → i                  | i → i                  | 2
TVguide(t,y,g,r)   | t → t, y → y, g → g    | t → t, y → y, g → g, r → r | 1,2
movieInfo(i,t,y,g) | i → i, t → t, y → y, g → g |                    | 1
critics(i,r,s)     | i → i, r → r, s → s    |                        | 2
goodMovies(t,y)    | t → t, y → y           |                        | 1,2
good98(t,y)        | t → t, y → y           |                        | 1,2

34 Combining MCDs
• Now look for ways of combining pairwise disjoint subsets of the goals
  • Greatly reduces the number of candidates!
  • Also proven to be correct without the use of a containment check
• Variations need to be made for:
  • Constants in general (I sneaked those in)
  • “Semi-interval” predicates (x <= c)
• Note that full-blown inequality predicates are co-NP-hard in the size of the data, so they don’t work
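The combination step can be illustrated with the “goals sat.” column of the MCD table from the previous slide: pick MCDs whose covered goals are pairwise disjoint and together cover every query subgoal. The sketch below assumes a hypothetical encoding (a dict from view name to the set of subgoal numbers it covers) and a hypothetical function name `combine`; it ignores the variable mappings that a full MiniCon implementation would also have to merge.

```python
# Goals covered by each MCD, taken from the "goals sat." column
# (1 = the show subgoal, 2 = the rating subgoal).
MCDS = {
    "5star": frozenset({2}),
    "TVguide": frozenset({1, 2}),
    "movieInfo": frozenset({1}),
    "critics": frozenset({2}),
    "goodMovies": frozenset({1, 2}),
    "good98": frozenset({1, 2}),
}


def combine(mcds, goals):
    """Yield lists of MCD names whose covered goals partition `goals`."""
    goals = frozenset(goals)
    names = sorted(mcds)

    def extend(start, chosen, covered):
        if covered == goals:
            yield list(chosen)
            return
        for k in range(start, len(names)):
            covers = mcds[names[k]]
            # Only add an MCD that shares no goal with those already chosen.
            if covers.isdisjoint(covered):
                yield from extend(k + 1, chosen + [names[k]], covered | covers)

    yield from extend(0, [], frozenset())


rewritings = list(combine(MCDS, {1, 2}))
for combo in rewritings:
    print(combo)
```

For this example only five combinations survive, far fewer than the full cross-product of two buckets, which matches the slide's point that disjointness greatly reduces the number of candidates.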

35 MiniCon and LAV Summary
• The state of the art for answering queries using views (AQUV) in the relational world of data integration
  • It’s been extended to support “conjunctive XQuery” as well
  • Scales to large numbers of views, which we need in LAV data integration
• Chase & Backchase by Tannen et al.
  • A procedure that has very close connections to inverse rules
  • Slightly more general in some ways – but:
    • Produces equivalent rewritings, not maximally contained ones
    • Not always polynomial in the size of the data

36 Recall
Next reading assignment:
• DeWitt and Kabra
• Avnur and Hellerstein
• Compare the different approaches
Start thinking about what you’d like to do for a project
• One-page proposal of your project scope, goals, and means of assessing success/failure due next Monday, Feb. 28th
• By now you should have a good idea of what most of the ideas in the handout involve