Data Integration Techniques. Zachary G. Ives, University of Pennsylvania. CIS 550 – Database & Information Systems, October 30, 2003.

1 Data Integration Techniques
Zachary G. Ives, University of Pennsylvania
CIS 550 – Database & Information Systems
October 30, 2003
Some slide content may be courtesy of Susan Davidson, Dan Suciu, & Raghu Ramakrishnan

2 We Left Off with TSIMMIS
• “The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew
• An instance of a “global-as-view” mediation system
• One of the first systems to support semi-structured data, which predated XML by several years
• This system, like the Information Manifold, focused on querying web sources
  - Real-world integration companies (IBM, BEA, Actuate, …) are focusing on the enterprise – more $$$!

3 Queries in TSIMMIS
• Specified in an OQL-style language called Lorel
  - OQL was an object-oriented query language
  - Lorel is a predecessor to XQuery; OEM is a predecessor to XML
• Based on path expressions over OEM structures:
    select book where book.title = "DB2 UDB" and book.author = "Chamberlin"
• This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Restating the query above:
    for $b in document("mediated-schema")/book
    where $b/title/text() = "DB2 UDB" and $b/author/text() = "Chamberlin"
    return $b

4 Query Answering in TSIMMIS
• Basically, it’s view unfolding, i.e., composing a query with a view
  - The query is the one being asked
  - The views are the MSL templates for the wrappers
• Some of the views may actually require parameters, e.g., an author name, before they’ll return answers
  - These are called input bindings
  - Common for web forms (see Amazon, Google, …)
• XQuery functions (XQuery’s version of views) support parameters as well, so we’ll use these to illustrate

5 A Wrapper Definition in MSL, Translated to XQuery
• Wrappers have templates and binding patterns ($X) in MSL:
    B :- B: }> // $$ = “select * from book where author=“ $X //
• This reformats a SQL query over Book(author, year, title)
• In XQuery, this might look like:
    define function GetBook($x as xsd:string) as book* {
      for $b in sql("select * from book where author='" + $x + "'")
      return <book>{$b}<author>{$x}</author></book>
    }

6 How to Answer the Query
Given our query:
    for $b in document("mediated-schema")/book
    where $b/title/text() = "DB2 UDB" and $b/author/text() = "Chamberlin"
    return $b
We want to find all wrapper definitions that:
• Either output enough information that we can evaluate all of our conditions over the output
  - They return a book’s title and author, so we can test against these
• Or have already “enforced” the conditions for us!
  - They already do a selection on author = "Chamberlin", etc.

7 Query Composition with Views
• We find all views that define book with author and title, and we compose the query with each of these
• In our example, we find one wrapper definition that matches:
    define function GetBook($x as xsd:string) as book* {
      for $b in sql("select * from book where author='" + $x + "'")
      return <book>{$b}<author>{$x}</author></book>
    }
• The query to compose with it:
    for $b in document("mediated-schema")/book
    where $b/title/text() = "DB2 UDB" and $b/author/text() = "Chamberlin"
    return $b

8 Matching View Output to Our Query’s Conditions
• Determine that the query tests for $x = "Chamberlin" by matching the query’s XPath, $b/author/text(), on the function’s output:
    define function GetBook($x as xsd:string) as book* {
      for $b in sql("select * from book where author='" + $x + "'")
      return <book>{$b}<author>{$x}</author></book>
    }
• The author condition becomes the function’s input binding, leaving only the title test in the query:
    let $x := "Chamberlin"
    for $b in GetBook($x)/book
    where $b/title/text() = "DB2 UDB"
    return $b

9 The Final Step: Unfolding
The expression:
    let $x := "Chamberlin"
    for $b in {
      for $b in sql("select * from book where author='" + $x + "'")
      return <book>{$b}<author>{$x}</author></book>
    }/book
    where $b/title/text() = "DB2 UDB"
    return $b
can be unnested (“unfolded”) and simplified to:
    for $b in sql("select * from book where author='Chamberlin'")
    where $b/title/text() = "DB2 UDB"
    return $b

10 What Is the Answer?
Given schema book(author, year, title) and Datalog facts defining an instance:
    book("Chamberlin", "1992", "DB2 UDB")
    book("Chamberlin", "1995", "DB2/CS")
    book("Bernstein", "1997", "Transaction Processing")
• TSIMMIS is an instance of a global-as-view mediator with a semistructured data model
• Can also have GAV mediators using Datalog or SQL, which work on similar principles
• Queries and mappings are unfolded (macro-expanded + simplified)
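To make the answer concrete, here is a minimal Python sketch of the unfolded query run over the instance on this slide. The names `get_book` and `answer_query` are hypothetical stand-ins for the GetBook wrapper and the mediator; they are not part of TSIMMIS.

```python
# Instance of book(author, year, title), copied from the slide.
BOOKS = [
    ("Chamberlin", "1992", "DB2 UDB"),
    ("Chamberlin", "1995", "DB2/CS"),
    ("Bernstein", "1997", "Transaction Processing"),
]


def get_book(author):
    """Hypothetical wrapper standing in for GetBook($x).

    The author is an input binding: the source cannot be called without it.
    """
    return [row for row in BOOKS if row[0] == author]


def answer_query(author, title):
    """Unfolded query: the author test is pushed into the wrapper call,
    and the title test is evaluated by the mediator on the wrapper's output."""
    return [row for row in get_book(author) if row[2] == title]


result = answer_query("Chamberlin", "DB2 UDB")
print(result)  # [('Chamberlin', '1992', 'DB2 UDB')]
```

Only the 1992 tuple satisfies both conditions, so it is the single answer.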

11 Limitations of Global-As-View
• Some data sources may contain data that falls within certain ranges or has certain known properties
  - “Books by Aho”, “Students at UPenn”, …
  - How do we express these? (Important so we reduce the number of sources we query!)
• Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema!
  - Not good for scalability or flexibility

12 Observations of Levy et al. in Information Manifold Paper
• When you integrate something, you have a conceptual model of the integrated domain
  - Define that as a basic frame of reference – not the data that’s in the sources
• May have overlapping/incomplete sources
  - Define each source as the subset of a query over the mediated schema
  - We can use selection or join predicates to specify that a source contains a range of values:
      ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”

13 The Information Manifold
• Defines the mediated schema independently of the sources!
  - “Local-as-view” instead of “global-as-view”
  - Assumes that we can only see a small subset of all the possible facts – “open-world assumption”
  - Allows us to specify information about data sources
• Focuses on relations (with OO extensions), Datalog
• Guarantees soundness of answers, completeness of “certain answers” – those tuples that must exist
  - Maximal set of tuples in query answer that are logically implied by data at the sources, plus all mappings’ constraints

14 The Local-as-View Model
• Properties:
  - “Local” sources are views over the mediated schema
  - Sources have the data – mediated schema is virtual
  - Sources may not have all the data from the domain – “open-world assumption”
• The system must use the sources (views) to answer queries over the mediated schema
  - “Answering queries using views” …

15 Answering Queries Using Views
• Our assumption for today: conjunctive queries, set semantics
• Suppose we have a mediated schema:
    author(aID, isbn, year), book(isbn, title, publisher)
• A conjunctive query might be:
    q(a, t, p) :- author(a, i, _), book(i, t, p), t = "DB2 UDB"
• Recall intuitions about this class of queries:
  - Adding a conjunct to a query removes answers from the result but never adds any
  - Any conjunctive query with at least the same constraints & conjuncts will give valid answers
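The intuition that an extra conjunct can only remove answers can be checked on a toy instance. The sketch below evaluates the conjunctive query under set semantics; the tuples (including the ISBN and publisher values) are made up for illustration.

```python
from itertools import product

# Hypothetical instance of author(aID, isbn, year) and book(isbn, title, publisher).
AUTHOR = [
    ("Chamberlin", "i1", 1992),
    ("Chamberlin", "i2", 1995),
    ("Bernstein", "i3", 1997),
]
BOOK = [
    ("i1", "DB2 UDB", "pub1"),
    ("i2", "DB2/CS", "pub1"),
    ("i3", "Transaction Processing", "pub2"),
]


def q(title=None):
    """q(a, t, p) :- author(a, i, _), book(i, t, p) [, t = title].

    Returns a set, matching the set semantics assumed on the slide.
    """
    return {
        (a, t, p)
        for (a, i, _year), (i2, t, p) in product(AUTHOR, BOOK)
        if i == i2 and (title is None or t == title)
    }


without_selection = q()
with_selection = q("DB2 UDB")

# Adding the conjunct t = "DB2 UDB" can only shrink the answer set.
print(with_selection <= without_selection)  # True
```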

16 Query Answering
• Suppose we have the same query:
    q(a, t, p) :- author(a, i, _), book(i, t, p), t = "DB2 UDB"
• and sources:
    s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = "123"
    s2(a,t) ⊆ author(a, i, _), book(i, t, p), t = "DB2 UDB"
    s3(a,t,p) ⊆ author(a, i, _), book(i, t, p), t = "123"
    s4(a,i) ⊆ author(a, i, _), a = "Smith"
    s5(a,i) ⊆ author(a, i, _)
    s6(i,p) ⊆ book(i, t, p)
• We want to compose the query with the source mappings – but they’re in the wrong direction!

17 Inverse Rules
• We can take every mapping and “invert” it, though sometimes we may have insufficient information:
  - If s5(a,i) ⊆ author(a, i, _)
  - then we can also infer that: author(a, i, ???) ⊇ s5(a,i)
• But how to handle the absence of the 3rd attribute?
  - We know that there must be AT LEAST one instance of ??? in author for each (a,i) pair
  - So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)…

18 But NULLs Lose Information
• Suppose we take these rules and ask for:
    q(a,t) :- author(a, i, _), book(i, t, p)
• If we look at the rule:
    s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = "123"
• Clearly q(a,t) :- s1(a,t)
• But if we apply our inversion procedure, we get:
    author(a, NULL, NULL) ⊇ s1(a,t)
    book(NULL, t, p) ⊇ s1(a,t), t = "123"
• and there’s no way to figure out how to join author and book on NULL!
  - We need “a special NULL for each a-t combo” so we can figure out which a’s and t’s go together

19 The Solution: “Skolem Functions”
• Skolem functions:
  - “Perfect” hash functions
  - Each function returns a unique, deterministic value for each combination of input values
  - Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values)
• Skolem functions won’t ever be part of the answer set or the computation
  - They’re just a way of logically generating “special NULLs”

20 Revisiting Our Example
• Query:
    q(a,t) :- author(a, i, _), book(i, t, p)
• Mapping rule:
    s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = "123"
• Inverse rules:
    author(a, f(a,t), NULL) ⊇ s1(a,t)
    book(f(a,t), t, p) ⊇ s1(a,t), t = "123"
• We can now expand the query:
    q(a,t) :- author(a, i, NULL), book(i, t, p), i = f(a,t)
    q(a,t) :- s1(a,t), s1(a,t), t = "123", i = f(a,t)
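The difference between plain NULLs and Skolem terms can be sketched in a few lines of Python. This is an illustration only: the contents of s1 are invented, the Skolem term is modelled as a tagged tuple, and the unknown publisher is filled with None for simplicity.

```python
# Hypothetical contents of source s1(a, t); its mapping fixes every title to "123".
S1 = [("Smith", "123"), ("Jones", "123")]


def f(a, t):
    """Skolem term for the unknown ISBN: deterministic and unique per (a, t).

    The leading tag "f" keeps it disjoint from any other Skolem function's terms.
    """
    return ("f", a, t)


def q(author_rel, book_rel):
    """q(a, t) :- author(a, i, _), book(i, t, p); a NULL (None) never joins."""
    return {
        (a, t)
        for (a, i, _year) in author_rel
        for (i2, t, _pub) in book_rel
        if i is not None and i == i2
    }


# Inverse rules with plain NULLs: the join attribute is lost.
null_author = {(a, None, None) for a, _t in S1}
null_book = {(None, t, None) for _a, t in S1}

# Inverse rules with the Skolem term f(a, t) in the ISBN position.
skolem_author = {(a, f(a, t), None) for a, t in S1}
skolem_book = {(f(a, t), t, None) for a, t in S1}

null_answers = q(null_author, null_book)
skolem_answers = q(skolem_author, skolem_book)

print(null_answers)            # set()
print(sorted(skolem_answers))  # [('Jones', '123'), ('Smith', '123')]
```

The Skolem terms appear only as join keys; they never reach the answer tuples.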

21 Query Answering Using Inverse Rules
• Invert all rules using the procedures described
• Take the query and the possible rule expansions and execute them in a Datalog interpreter
  - In the previous query, we expand with all combinations of expansions of book and of author – every possible way of combining and cross-correlating info from different sources
  - Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent)

22 Levy et al. Alternative Approach: The Bucket Algorithm
• Given a query Q with relations and predicates
  - Create a bucket for each subgoal in Q
  - Iterate over each view (source mapping)
  - If source includes bucket’s subgoal:
      Create mapping between Q’s vars and the view’s vars at the same position
      If satisfiable with substitutions, add to bucket
  - Do cross-product of buckets, see if result is contained in the query (recall we saw an algorithm to do that)
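The bucket-building step can be sketched as follows. This is a simplified illustration, not the full algorithm: it uses the query and the six sources from the earlier "Query Answering" slide, encodes atoms as (relation, arguments) pairs with a dict of constant constraints, checks only that constants at matching positions agree, and leaves out the final containment test. The encoding and function names are assumptions made for this sketch.

```python
from itertools import product

AUTHOR_ATOM = ("author", ("a", "i", "_"))
BOOK_ATOM = ("book", ("i", "t", "p"))

# q(a, t, p) :- author(a, i, _), book(i, t, p), t = "DB2 UDB"
QUERY = {"body": [AUTHOR_ATOM, BOOK_ATOM], "consts": {"t": "DB2 UDB"}}

# Source descriptions s1..s6; "consts" holds each view's selection predicates.
VIEWS = {
    "s1": {"body": [AUTHOR_ATOM, BOOK_ATOM], "consts": {"t": "123"}},
    "s2": {"body": [AUTHOR_ATOM, BOOK_ATOM], "consts": {"t": "DB2 UDB"}},
    "s3": {"body": [AUTHOR_ATOM, BOOK_ATOM], "consts": {"t": "123"}},
    "s4": {"body": [AUTHOR_ATOM], "consts": {"a": "Smith"}},
    "s5": {"body": [AUTHOR_ATOM], "consts": {}},
    "s6": {"body": [BOOK_ATOM], "consts": {}},
}


def compatible(q_atom, v_atom, q_consts, v_consts):
    """Map query vars to view vars by position; reject conflicting constants."""
    (q_rel, q_args), (v_rel, v_args) = q_atom, v_atom
    if q_rel != v_rel or len(q_args) != len(v_args):
        return False
    for q_var, v_var in zip(q_args, v_args):
        qc, vc = q_consts.get(q_var), v_consts.get(v_var)
        if qc is not None and vc is not None and qc != vc:
            return False  # e.g. t = "DB2 UDB" in the query vs t = "123" in the view
    return True


def build_buckets(query, views):
    """One bucket per query subgoal, listing the views that can supply it."""
    buckets = []
    for q_atom in query["body"]:
        bucket = [
            name
            for name, view in views.items()
            if any(
                compatible(q_atom, v_atom, query["consts"], view["consts"])
                for v_atom in view["body"]
            )
        ]
        buckets.append(bucket)
    return buckets


buckets = build_buckets(QUERY, VIEWS)
# Each element of the cross-product is a candidate rewriting that would still
# need the containment check, which this sketch omits.
candidates = list(product(*buckets))
print(buckets)  # [['s1', 's2', 's3', 's4', 's5'], ['s2', 's6']]
```

Here s1 and s3 are excluded from the book bucket because their title constraint conflicts with the query's, which is the pruning the satisfiability check is meant to provide.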

23 Source Capabilities
• The simplest form is to annotate the attributes of a relation:
    Book^bff(auth, title, pub)
  - b marks an attribute that must be bound in a request; f marks one that may be left free
• But many data integration efforts had more sophisticated models
  - Can a data source support joins between its relations?
  - Can a data source be sent a relation that it should join with?
• In the end, we need to perform parts of the query in the mediator, and other parts at the sources
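A mediator can test whether a request respects such an annotation before sending it to a source. The sketch below reads b as "must be bound" and f as "may be free", the usual convention for binding-pattern adornments; the function name and the use of None for an unbound attribute are assumptions made for illustration.

```python
def satisfies(pattern, args):
    """Return True if every attribute marked 'b' in the pattern has a value.

    pattern: one 'b' or 'f' per attribute, e.g. "bff" for Book(auth, title, pub).
    args: one entry per attribute, with None meaning "not bound".
    """
    if len(pattern) != len(args):
        raise ValueError("pattern and argument list must have the same length")
    return all(arg is not None for flag, arg in zip(pattern, args) if flag == "b")


# Book^bff(auth, title, pub): the author must be supplied.
ok_request = satisfies("bff", ("Chamberlin", None, None))
bad_request = satisfies("bff", (None, "DB2 UDB", None))
print(ok_request, bad_request)  # True False
```

A request that fails the check has to be answered another way, for example by first obtaining author values from a different source and then calling this one once per author.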

24 Contributions of the Info Manifold
• More robust way of defining mediated schemas and sources
  - Mediated schema is clearly defined, less likely to change
  - Sources can be more accurately described
• Relatively efficient algorithms for query reformulation, creating executable plans
• Still requires standardization on a single schema
  - Can be hard to get consensus
• Some other aspects were captured in related papers
  - Overlap between sources; coverage of data at sources
  - Semi-automated creation of mappings
  - Semi-automated construction of wrappers

25 Later Integration Systems Focused on Better Performance
Tukwila/Piazza [Ives+99, Halevy+02] – Washington
• Descendants of the Information Manifold
• Similar capabilities, but with adaptive processing of XML as it is read across streams
Niagara [DeWitt+99] – Wisconsin
• XML querying of web sources
• Giving answers a screenful at a time
TelegraphCQ [Chandrasekaran+03] – Berkeley
• Adaptive, select-project-join queries over infinite streams

