
Data Integration Techniques Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 30, 2003 Some slide content may be courtesy of Susan Davidson, Dan Suciu, & Raghu Ramakrishnan

2 We Left Off with TSIMMIS
• “The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew
• An instance of a “global-as-view” mediation system
• One of the first systems to support semi-structured data, which predated XML by several years
• This system, like the Information Manifold, focused on querying web sources
• Real-world integration companies (IBM, BEA, Actuate, …) are focusing on the enterprise – more $$$!

3 Queries in TSIMMIS
• Specified in an OQL-style language called Lorel
  • OQL was an object-oriented query language
  • Lorel is a predecessor to XQuery; OEM is a predecessor to XML
• Based on path expressions over OEM structures:
  select book where book.title = “DB2 UDB” and book.author = “Chamberlin”
• This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Restating the query above:
  for $b in document(“mediated-schema”)/book
  where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”
  return $b

4 Query Answering in TSIMMIS
• Basically, it’s view unfolding, i.e., composing a query with a view
  • The query is the one being asked
  • The views are the MSL templates for the wrappers
• Some of the views may actually require parameters, e.g., an author name, before they’ll return answers
  • These are called input bindings
  • Common for web forms (see Amazon, Google, …)
• XQuery functions (XQuery’s version of views) support parameters as well, so we’ll use these to illustrate

5 A Wrapper Definition in MSL, Translated to XQuery
• Wrappers have templates and binding patterns ($X) in MSL:
  B :- B: }> // $$ = “select * from book where author=“ $X //
• This reformats a SQL query over Book(author, year, title)
• In XQuery, this might look like:
  define function GetBook($x AS xsd:string) as book* {
    for $b in sql(“select * from book where author=‘” + $x +”’”)
    return $b $x
  }

6 How to Answer the Query
Given our query:
  for $b in document(“mediated-schema”)/book
  where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”
  return $b
We want to find all wrapper definitions that:
• Either output enough information that we can evaluate all of our conditions over the output
  • They return a book’s title and author, so we can test against these
• Or have already “enforced” the conditions for us!
  • They already do a selection on author = “Chamberlin”, etc.

7 Query Composition with Views
• We find all views that define book with author and title, and we compose the query with each of these
• In our example, we find one wrapper definition that matches:
  define function GetBook($x AS xsd:string) as book* {
    for $b in sql(“select * from book where author=‘” + $x +”’”)
    return $b $x
  }
  for $b in document(“mediated-schema”)/book
  where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”
  return $b

8 Matching View Output to Our Query’s Conditions
• Determine that the query tests for $x = “Chamberlin” by matching the query’s XPath, $b/author/text(), on the function’s output:
  define function GetBook($x AS xsd:string) as book {
    for $b in sql(“select * from book where author=‘” + $x +”’”)
    return $b $x
  }
  let $x := “Chamberlin”
  for $b in GetBook($x)/book
  where $b/title/text() = “DB2 UDB”
  return $b

9 The Final Step: Unfolding
The expression:
  let $x := “Chamberlin”
  for $b in {
    for $b in sql(“select * from book where author=‘” + $x +”’”)
    return $b $x
  }/book
  where $b/title/text() = “DB2 UDB”
  return $b
Can be unnested (“unfolded”) and simplified to:
  for $b in sql(“select * from book where author=‘Chamberlin’”)
  where $b/title/text() = “DB2 UDB”
  return $b

10 What Is the Answer?
Given schema book(author, year, title) and Datalog rules defining an instance:
  book(“Chamberlin”, “1992”, “DB2 UDB”)
  book(“Chamberlin”, “1995”, “DB2/CS”)
  book(“Bernstein”, “1997”, “Transaction Processing”)
• TSIMMIS is an instance of a global-as-view mediator with a semistructured data model
• Can also have GAV mediators using Datalog or SQL, which work on similar principles
• Queries and mappings are unfolded (macro-expanded + simplified)
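To make the question concrete, here is a minimal Python sketch that runs the unfolded query from the previous slide over the three-tuple instance. It assumes an in-memory SQLite table stands in for the wrapped source; `get_book` and `unfolded_query` are hypothetical names introduced only for this illustration and are not part of TSIMMIS.

```python
import sqlite3

# In-memory stand-in for the wrapped source (an assumption for illustration).
# The three rows are the instance listed on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (author TEXT, year TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO book VALUES (?, ?, ?)",
    [
        ("Chamberlin", "1992", "DB2 UDB"),
        ("Chamberlin", "1995", "DB2/CS"),
        ("Bernstein", "1997", "Transaction Processing"),
    ],
)


def get_book(author):
    """Hypothetical counterpart of the GetBook wrapper: author is an input binding."""
    cursor = conn.execute(
        "SELECT author, year, title FROM book WHERE author = ?", (author,)
    )
    return cursor.fetchall()


def unfolded_query():
    """The source enforces the author condition; the mediator filters on title."""
    return [row for row in get_book("Chamberlin") if row[2] == "DB2 UDB"]


answer = unfolded_query()
print(answer)  # [('Chamberlin', '1992', 'DB2 UDB')]
```

Only the 1992 tuple survives: the wrapper returns both Chamberlin books, and the remaining title test is applied at the mediator.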

11 Limitations of Global-As-View
• Some data sources may contain data that falls within certain ranges or has certain known properties
  • “Books by Aho”, “Students at UPenn”, …
  • How do we express these? (Important so we reduce the number of sources we query!)
• Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema!
  • Not good for scalability or flexibility

12 Observations of Levy et al. in Information Manifold Paper
• When you integrate something, you have a conceptual model of the integrated domain
  • Define that as a basic frame of reference – not the data that’s in the sources
• May have overlapping/incomplete sources
  • Define each source as a subset of a query over the mediated schema
  • We can use selection or join predicates to specify that a source contains a range of values:
    ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”

13 The Information Manifold
• Defines the mediated schema independently of the sources!
  • “Local-as-view” instead of “global-as-view”
  • Assumes that we can only see a small subset of all the possible facts – “open-world assumption”
  • Allows us to specify information about data sources
• Focuses on relations (with OO extensions), Datalog
• Guarantees soundness of answers, completeness of “certain answers” – those tuples that must exist
  • Maximal set of tuples in query answer that are logically implied by data at the sources, plus all mappings’ constraints

14 The Local-as-View Model
• Properties:
  • “Local” sources are views over the mediated schema
  • Sources have the data – mediated schema is virtual
  • Sources may not have all the data from the domain – “open-world assumption”
• The system must use the sources (views) to answer queries over the mediated schema
  • “Answering queries using views” …

15 Answering Queries Using Views
• Our assumption for today: conjunctive queries, set semantics
• Suppose we have a mediated schema: author(aID, isbn, year), book(isbn, title, publisher)
• A conjunctive query might be:
  q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
• Recall intuitions about this class of queries:
  • Adding a conjunct to a query removes answers from the result but never adds any
  • Any conjunctive query with at least the same constraints & conjuncts will give valid answers
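The first intuition can be checked on a toy instance. The following Python sketch evaluates the conjunctive query under set semantics; the isbn and publisher values ("i1", "P1", …) are invented for illustration, and the function names are hypothetical.

```python
# Made-up instance of the mediated schema
# author(aID, isbn, year), book(isbn, title, publisher).
author = {
    ("Chamberlin", "i1", "1992"),
    ("Chamberlin", "i2", "1995"),
    ("Bernstein", "i3", "1997"),
}
book = {
    ("i1", "DB2 UDB", "P1"),
    ("i2", "DB2/CS", "P1"),
    ("i3", "Transaction Processing", "P2"),
}


def q_without_selection():
    # q0(a, t, p) :- author(a, i, _), book(i, t, p)
    return {
        (a, t, p)
        for (a, i, _year) in author
        for (j, t, p) in book
        if i == j  # the shared variable i is the join condition
    }


def q():
    # q(a, t, p) :- author(a, i, _), book(i, t, p), t = "DB2 UDB"
    return {(a, t, p) for (a, t, p) in q_without_selection() if t == "DB2 UDB"}


print(q())  # {('Chamberlin', 'DB2 UDB', 'P1')}
# The extra conjunct t = "DB2 UDB" only removes answers.
print(q() <= q_without_selection())  # True
```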

16 Query Answering
• Suppose we have the same query:
  q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
• and sources:
  s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123”
  s2(a,t) ⊆ author(a, i, _), book(i, t, p), t = “DB2 UDB”
  s3(a,t,p) ⊆ author(a, i, _), book(i, t, p), t = “123”
  s4(a,i) ⊆ author(a, i, _), a = “Smith”
  s5(a,i) ⊆ author(a, i, _)
  s6(i,p) ⊆ book(i, t, p)
• We want to compose the query with the source mappings – but they’re in the wrong direction!

17 Inverse Rules
• We can take every mapping and “invert” it, though sometimes we may have insufficient information:
  • If s5(a,i) ⊆ author(a, i, _)
  • then we can also infer that: author(a, i, ???) ⊇ s5(a,i)
• But how to handle the absence of the 3rd attribute?
  • We know that there must be AT LEAST one instance of ??? in author for each (a,i) pair
  • So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)…

18 But NULLs Lose Information
• Suppose we take these rules and ask for:
  q(a,t) :- author(a, i, _), book(i, t, p)
• If we look at the rule:
  s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123”
• Clearly q(a,t) :- s1(a,t)
• But if we apply our inversion procedure, we get:
  author(a, NULL, NULL) ⊇ s1(a,t)
  book(NULL, t, p) ⊇ s1(a,t), t = “123”
• and there’s no way to figure out how to join author and book on NULL!
• We need “a special NULL for each a-t combo” so we can figure out which a’s and t’s go together

19 The Solution: “Skolem Functions”
• Skolem functions:
  • “Perfect” hash functions
  • Each function returns a unique, deterministic value for each combination of input values
  • Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values)
• Skolem functions won’t ever be part of the answer set or the computation
  • They’re just a way of logically generating “special NULLs”

20 Revisiting Our Example
• Query: q(a,t) :- author(a, i, _), book(i, t, p)
• Mapping rule: s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123”
• Inverse rules:
  author(a, f(a,t), NULL) ⊇ s1(a,t)
  book(f(a,t), t, p) ⊇ s1(a,t), t = “123”
• We can now expand the query:
  q(a,t) :- author(a, i, NULL), book(i, t, p), i = f(a,t)
  q(a,t) :- s1(a,t), s1(a,t), t = “123”, i = f(a,t)
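The effect of the Skolem term can be sketched in a few lines of Python. The contents of s1 below are invented for illustration. Skolem terms are modelled as tagged tuples so that f and g can never collide, and the second Skolem function g for the unknown publisher is an assumption that goes beyond the slide, which leaves p unbound.

```python
# Source relation s1(a, t); the rows are made up. The mapping restricts
# s1 to t = "123", so only such tuples may appear.
s1 = {("Smith", "123"), ("Jones", "123")}


def f(a, t):
    """Skolem term for the unknown isbn: unique and deterministic per (a, t)."""
    return ("f", a, t)


def g(a, t):
    """Skolem term for the unknown publisher (an assumption beyond the slide)."""
    return ("g", a, t)


def is_skolem(value):
    return isinstance(value, tuple)


def join(author_rel, book_rel):
    """q(a, t) :- author(a, i, _), book(i, t, p); a NULL isbn never joins, as in SQL."""
    return {
        (a, t)
        for (a, i, _year) in author_rel
        for (j, t, _pub) in book_rel
        if i is not None and i == j
    }


# Inverse rules with Skolem terms.
author = {(a, f(a, t), None) for (a, t) in s1}
book = {(f(a, t), t, g(a, t)) for (a, t) in s1}
answers = join(author, book)

# Inverse rules with plain NULLs, as on the earlier slide.
author_null = {(a, None, None) for (a, t) in s1}
book_null = {(None, t, None) for (a, t) in s1}
answers_null = join(author_null, book_null)

print(answers == s1)   # True: every s1 tuple is recovered
print(answers_null)    # set(): nothing joins on NULL
```

The Skolem terms are used only as join keys; none of them appears in the answer tuples.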

21 Query Answering Using Inverse Rules
• Invert all rules using the procedures described
• Take the query and the possible rule expansions and execute them in a Datalog interpreter
  • In the previous query, we expand with all combinations of expansions of book and of author – every possible way of combining and cross-correlating info from different sources
  • Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent)

22 Levy et al. Alternative Approach: The Bucket Algorithm
• Given a query Q with relations and predicates
• Create a bucket for each subgoal in Q
• Iterate over each view (source mapping)
  • If source includes bucket’s subgoal:
    • Create mapping between Q’s variables and the view’s variables at the same positions
    • If satisfiable with substitutions, add to bucket
• Do cross-product of buckets, see if result is contained in the query (recall we saw an algorithm to do that)
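The bucket-construction phase can be sketched as follows. This is a simplified illustration, not the full algorithm: it covers only the steps up to the cross-product and omits the final containment test. It uses the six sources from the earlier slide and, as an assumption, a variant of the query with head q(a, t) and the selection t = “DB2 UDB”. All Python names are hypothetical.

```python
from itertools import product

# Assumed query: q(a, t) :- author(a, i, _), book(i, t, p), t = "DB2 UDB"
QUERY_HEAD = ("a", "t")
QUERY_BODY = [("author", ("a", "i", "_")), ("book", ("i", "t", "p"))]
QUERY_CONSTS = {"t": "DB2 UDB"}

AUTHOR = ("author", ("a", "i", "_"))
BOOK = ("book", ("i", "t", "p"))

# view name -> (head variables, body subgoals, selection constants)
VIEWS = {
    "s1": (("a", "t"), [AUTHOR, BOOK], {"t": "123"}),
    "s2": (("a", "t"), [AUTHOR, BOOK], {"t": "DB2 UDB"}),
    "s3": (("a", "t", "p"), [AUTHOR, BOOK], {"t": "123"}),
    "s4": (("a", "i"), [AUTHOR], {"a": "Smith"}),
    "s5": (("a", "i"), [AUTHOR], {}),
    "s6": (("i", "p"), [BOOK], {}),
}


def fits(q_args, view_head, v_args, v_consts):
    """Map query variables to view variables position by position."""
    for q_var, v_var in zip(q_args, v_args):
        # A variable the query returns must be exported by the view.
        if q_var in QUERY_HEAD and v_var not in view_head:
            return False
        # Selections on the matched variables must not contradict each other.
        q_const = QUERY_CONSTS.get(q_var)
        v_const = v_consts.get(v_var)
        if q_const is not None and v_const is not None and q_const != v_const:
            return False
    return True


def build_buckets():
    """One bucket per query subgoal, holding the names of usable views."""
    buckets = []
    for pred, q_args in QUERY_BODY:
        bucket = [
            name
            for name, (head, body, consts) in VIEWS.items()
            if any(p == pred and fits(q_args, head, v_args, consts)
                   for p, v_args in body)
        ]
        buckets.append(bucket)
    return buckets


buckets = build_buckets()
candidates = list(product(*buckets))  # each still needs a containment check
print(buckets)          # [['s1', 's2', 's3', 's4', 's5'], ['s2']]
print(len(candidates))  # 5
```

In the book bucket, s1 and s3 are rejected because their selection t = “123” contradicts the query, and s6 is rejected because it does not export the title.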

23 Source Capabilities
• The simplest form is to annotate the attributes of a relation with a binding pattern (b = must be bound, f = may be free):
  • Book^bff(auth, title, pub)
• But many data integration efforts had more sophisticated models
  • Can a data source support joins between its relations?
  • Can a data source be sent a relation that it should join with?
• In the end, we need to perform parts of the query in the mediator, and other parts at the sources
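A binding pattern such as bff can be checked mechanically. The Python sketch below assumes each source call is described by a name, a pattern string and a tuple of variable names, and greedily orders calls so that every bound position is satisfied. The `Authors` source with pattern f is hypothetical, added only so that something can supply author names.

```python
def can_call(pattern, args, bound_vars):
    """True if every position marked 'b' holds an already-bound variable."""
    return all(arg in bound_vars
               for flag, arg in zip(pattern, args) if flag == "b")


def order_calls(calls, initially_bound=()):
    """Greedily order source calls; return None if no executable order exists."""
    bound = set(initially_bound)
    remaining = list(calls)
    plan = []
    while remaining:
        for call in remaining:
            name, pattern, args = call
            if can_call(pattern, args, bound):
                plan.append(name)
                bound.update(args)  # a call binds all of its variables
                remaining.remove(call)
                break
        else:
            return None  # every remaining call needs an unbound input
    return plan


book = ("Book", "bff", ("auth", "title", "pub"))
authors = ("Authors", "f", ("auth",))  # hypothetical source listing author names

print(order_calls([book]))                            # None
print(order_calls([book], initially_bound={"auth"}))  # ['Book']
print(order_calls([book, authors]))                   # ['Authors', 'Book']
```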

24 Contributions of the Info Manifold
• More robust way of defining mediated schemas and sources
  • Mediated schema is clearly defined, less likely to change
  • Sources can be more accurately described
• Relatively efficient algorithms for query reformulation, creating executable plans
• Still requires standardization on a single schema
  • Can be hard to get consensus
• Some other aspects were captured in related papers
  • Overlap between sources; coverage of data at sources
  • Semi-automated creation of mappings
  • Semi-automated construction of wrappers

25 Later Integration Systems Focused on Better Performance
Tukwila/Piazza [Ives+99,Halevy+02] – Washington
• Descendants of the Information Manifold
• Similar capabilities, but with adaptive processing of XML as it is read across streams
Niagara [DeWitt+99] – Wisconsin
• XML querying of web sources
• Giving answers a screenful at a time
TelegraphCQ [Chandrasekaran+03] – Berkeley
• Adaptive, select-project-join queries over infinite streams