Information Integration Modified from Alon Halevy’s lecture notes.


1 Information Integration Modified from Alon Halevy’s lecture notes

2 What is Data Integration Providing:
– Uniform (same query interface to all sources)
– Access to (queries; eventually updates too)
– Multiple (we want many, but 2 is hard too)
– Autonomous (DBA doesn’t report to you)
– Heterogeneous (data models are different)
– Structured (or at least semi-structured)
– Data Sources (not only databases).

3 Meta-search engines
– Text-based
– No access to hidden web

4 Comparison shoppers (Netbot, Junglee, DealPilot.Com)
– Call every source
– Collate results

5 CORA allows more sophisticated queries, e.g., all papers that cite Rao but are not written by Rao. Neither CORA nor DBLP is complete:
– CORA tends to be more complete for online papers and AI papers.
– DBLP sticks to “published papers”, and is more complete in DB coverage.
Both sources provide Englishified BibTeX citations.

6 The Problem: Data Integration Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet

7 Motivation(s) Enterprise data integration; web-site construction. WWW: –Comparison shopping –Portals integrating data from multiple sources –B2B, electronic marketplaces Science and culture: –Medical genetics: integrating genomic data –Astrophysics: monitoring events in the sky. –Environment: Puget Sound Regional Synthesis Model –Culture: uniform access to all cultural databases produced by countries in Europe.

8 And if that wasn’t enough… Explosion of intranet and extranet information:
– 80% of corporate information is unmanaged.
– By 2004, 30X more enterprise data than in 1999.
The average company:
– maintains 49 distinct enterprise applications
– spends 35% of total IT budget on integration-related efforts
Source: Gartner, 1999

9 Current Solutions Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money. Data warehousing: load all the data periodically into a warehouse.
– 6-18 months lead time
– Separates operational DBMS from decision-support DBMS (not only a solution to data integration).
– Performance is good; data may not be fresh.
– Need to clean and scrub your data.

10 Data Warehouse Architecture Data sources feed data-extraction programs; the extracted data is cleaned/scrubbed and loaded into a relational database (the warehouse); user queries and OLAP / decision support / data cubes / data mining run against the warehouse. Junglee did this, essentially.

11 The Virtual Integration Architecture Leave the data in the sources. When a query comes in:
– Determine the sources relevant to the query.
– Break down the query into sub-queries for the sources.
– Get the answers from the sources, and combine them appropriately.
Data is fresh. Challenge: performance (and theory…)
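The three steps above can be sketched as a toy mediator. This is an illustrative sketch only; the source names, schemas, and data are hypothetical, and a real mediator would reformulate queries rather than ship predicates around.

```python
# A toy virtual-integration mediator: data stays in the sources; at query
# time the mediator selects relevant sources, sends each a sub-query, and
# combines (unions) the answers. All names and data here are made up.

SOURCES = {
    "s1": {"relation": "movie", "rows": [("Hero", 2002), ("Ran", 1985)]},
    "s2": {"relation": "movie", "rows": [("Hero", 2002), ("Ikiru", 1952)]},
    "s3": {"relation": "schedule", "rows": [("Varsity", "Hero")]},
}

def answer(relation, predicate):
    """Route the query to relevant sources, collect, and combine (union)."""
    results = set()
    for src in SOURCES.values():
        if src["relation"] != relation:      # step 1: source selection
            continue
        # step 2+3: run the sub-query at the source, merge the answers
        results |= {row for row in src["rows"] if predicate(row)}
    return results

# Movies released after 1980, freshly combined from s1 and s2:
recent = answer("movie", lambda r: r[1] > 1980)
```

Because nothing is pre-materialized, every answer reflects the sources' current contents, which is exactly the freshness advantage the slide claims.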

12 Class of 16th Feb

13 Virtual Integration Architecture User queries go to a mediator (mediated schema, data source catalog, reformulation engine, optimizer, execution engine), which talks to each data source through a wrapper. Sources can be: relational, hierarchical (IMS), structured files, web sites. Which data model?

14 Research Projects Garlic (IBM), Information Manifold (AT&T) Tsimmis, InfoMaster (Stanford) The Internet Softbot/Razor/Tukwila (UW) Hermes (Maryland) DISCO (INRIA, France) SIMS/Ariadne (USC/ISI) Emerac (ASU)

15 Dimensions to Consider
– How many sources are we accessing?
– How autonomous are they?
– Meta-data about sources?
– Is the data structured?
– Queries or also updates?
– Requirements: accuracy, completeness, performance, handling inconsistencies.
– Closed-world assumption vs. open world?

16 Outline Wrappers Semantic integration and source descriptions: –Modeling source completeness –Modeling source capabilities Query optimization Query execution

17 Wrapper Programs Task: to communicate with the data sources and do format translations. They are built w.r.t. a specific source. They can sit either at the source or at the mediator. Often hard to build (very little science). Can be “intelligent”: perform source-specific optimizations.

18 Example Transform a flat record:
  Introduction to DB  Phil Bernstein  Eric Newcomer  Addison Wesley, 1999
into a structured (XML) record with explicit title, author, publisher, and year fields. The XML standard may make this easy to do.
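The wrapper's format-translation step can be sketched with the standard library. The field names below are assumptions for illustration, not the slide's exact markup.

```python
# Sketch of a wrapper turning a flat, source-specific record into XML.
# Field names (title/author/publisher/year) are assumed, not prescribed.
import xml.etree.ElementTree as ET

record = {"title": "Introduction to DB",
          "authors": ["Phil Bernstein", "Eric Newcomer"],
          "publisher": "Addison Wesley", "year": "1999"}

book = ET.Element("book")
ET.SubElement(book, "title").text = record["title"]
for a in record["authors"]:                      # one element per author
    ET.SubElement(book, "author").text = a
ET.SubElement(book, "publisher").text = record["publisher"]
ET.SubElement(book, "year").text = record["year"]

xml_text = ET.tostring(book, encoding="unicode")
```

A real wrapper would also parse the source's output format (HTML, flat file, proprietary API) on the way in, which is the part with “very little science”.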

19 Data Source Catalog Contains all meta-information about the sources: –Logical source contents (books, new cars). –Source capabilities (can answer SQL queries) –Source completeness (has all books). –Physical properties of source and network. –Statistics about the data (like in an RDBMS) –Source reliability –Mirror sources –Update frequency.

20 RDF Metadata standard
(Slide shows an RDF/XML example; only the literal values survive in this transcript: “birds, butterflies, snakes”, “John Smith”.)

21 More RDF Examples

22 Content Descriptions User queries refer to the mediated schema. Data is stored in the sources in a local schema. Content descriptions provide the semantic mappings between the different schemas. Data integration system uses the descriptions to translate user queries into queries on the sources.

23 Desiderata from Source Descriptions
– Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources.
– Easy addition: make it easy to add new data sources.
– Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.

24 Reformulation Problem Given:
– A query Q posed over the mediated schema
– Descriptions of the data sources
Find:
– A query Q’ over the data source relations, such that:
  – Q’ provides only correct answers to Q, and
  – Q’ provides all possible answers to Q given the sources.

25 Approaches to Specifying Source Descriptions Global-as-view (GAV): express the mediated schema relations as a set of views over the data source relations Local-as-view (LAV): express the source relations as views over the mediated schema. Can be combined…?

26 Example Scenario A mediator for movie databases –Want to provide the information about movies and where they are playing

27 Global-as-View

28 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Create View Movie AS
  select * from S1                               [S1(title, dir, year, genre)]
union
  select * from S2                               [S2(title, dir, year, genre)]
union
  select S3.title, S3.dir, S4.year, S4.genre
  from S3, S4
  where S3.title = S4.title                      [S3(title, dir), S4(title, year, genre)]

29 Global-as-View: Example 2 How do we get the missing values?

30 Global-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Create View Movie AS
  select title, dir, year, NULL from S1          [S1(title, dir, year)]
union
  select title, dir, NULL, genre from S2         [S2(title, dir, genre)]
The missing attributes are padded with null values.

31 Global-as-View: Example 3 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre)
Create View Movie AS
  select NULL, NULL, NULL, genre from S4
Create View Schedule AS
  select cinema, NULL, NULL from S4
But what if we want to find which cinemas are playing comedies?

32 Global-as-View Summary Query reformulation boils down to view unfolding. Very easy conceptually. Can build hierarchies of mediated schemas. You sometimes lose information. Not always natural. Adding sources is hard: need to consider all other sources that are available.
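View unfolding can be sketched concretely: since GAV *defines* the mediated relation as a query over the sources, answering a mediated query just substitutes that definition in place. The data and source schemas below are hypothetical, echoing the earlier movie example.

```python
# GAV reformulation as view unfolding (toy sketch, hypothetical data).
# Movie is defined as a union over sources S1, S2 and a join of S3, S4.

S1 = [("Hero", "Zhang Yimou", 2002, "Drama")]   # (title, dir, year, genre)
S2 = [("Ran", "Kurosawa", 1985, "Drama")]       # (title, dir, year, genre)
S3 = [("Ikiru", "Kurosawa")]                    # (title, dir)
S4 = [("Ikiru", 1952, "Drama")]                 # (title, year, genre)

def movie():
    """Create View Movie AS select * from S1 union S2 union (S3 join S4)."""
    rows = set(S1) | set(S2)
    for t3, d in S3:
        for t4, y, g in S4:
            if t3 == t4:                        # S3.title = S4.title
                rows.add((t3, d, y, g))
    return rows

# A mediated query "directors of 1985 movies" unfolds into this scan:
dirs_1985 = {d for (t, d, y, g) in movie() if y == 1985}
```

Adding a fifth source would mean editing the `movie()` definition itself, which is the "adding sources is hard" point of the summary.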

33 Start of 18/4

34 Local-as-View: example 1 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).

35 Local-as-View: example 1 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Create Source S1 AS select * from Movie                    [S1(title, dir, year, genre)]
Create Source S3 AS select title, dir from Movie           [S3(title, dir)]
Create Source S5 AS select title, dir, year from Movie
                    where year > 1960 AND genre=“Comedy”   [S5(title, dir, year), year > 1960]
Sources are “materialized views” of the virtual schema.

36 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Create Source S1 AS select * from Movie                    [S1(title, dir, year, genre)]
Create Source S3 AS select title, dir from Movie           [S3(title, dir)]
Create Source S5 AS select title, dir, year from Movie
                    where year > 1960 AND genre=“Comedy”   [S5(title, dir, year), year > 1960]
Create Source S4 AS select cinema, genre from Movie m, Schedule s
                    where m.title = s.title                [S4(cinema, genre)]

37 Local-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre)
Create Source S4 AS select cinema, genre from Movie m, Schedule s where m.title = s.title
Now if we want to find which cinemas are playing comedies, there is hope!

38 Local-as-View Summary Very flexible. You have the power of the entire query language to define the contents of the source. Hence, can easily distinguish between contents of closely related sources. Adding sources is easy: they’re independent of each other. Query reformulation: answering queries using views!

39 Reformulation in LAV Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn?
– Note that ONLY the materialized (pre-computed) views may be used.
Many, many papers on this problem:
– Bucket algorithm [Levy, 96]
– Inverse rules algorithm [Duschka, 99]
– Hybrid versions (e.g., MiniCon)
– Survey on the topic: [Halevy, 2000]

40 Reformulation: The Issues Query: find all the years in which Zhang Yimou released movies.
Select year from Movie M where M.dir = yimou
Q(y) :- movie(T,D,Y,G), D=yimou        (not executable: the mediated schema is virtual)
Plan (1): Q(y) :- S1(T,D,Y,G), D=yimou
Plan (2): Q(y) :- S1(T,D,Y,G), D=yimou
          Q(y) :- S5(T,D,Y), D=yimou
Which is the better plan? What are we looking for: equivalence? containment? maximal containment? smallest plan?

41 Maximal Containment A query plan should be sound and complete.
– Sound: the plan is contained in the query (i.e., the tuples returned by the plan are a subset of the tuples satisfying the query).
– Complete: the query is contained in the plan.
– Traditional DBs aim for equivalence (the query contains the plan AND the plan contains the query); here that is impossible.
We want all query tuples that can be extracted given the sources we have:
– Maximal containment (no other sound query plan “contains” this plan).

42 Maximally Contained Rewritings Q’ is a maximally contained rewriting of a query Q using the views V = V1, …, Vn if:
(1) For any database D and extensions v1, …, vn of the views such that vi ⊆ Vi(D) for 1 ≤ i ≤ n, we have Q’(v1, …, vn) ⊆ Q(D).
(2) There is no other query Q1 such that
    – Q’(v1, …, vn) ⊆ Q1(v1, …, vn),
    – Q1(v1, …, vn) ⊆ Q(D), and
    – for at least one database the first containment is strict.

43 The World of Containment Consider Q1(.) :- B1(.) and Q2(.) :- B2(.).
Q1 ⊆ Q2 (“contained in”) if the answer to Q1 is a subset of the answer to Q2.
– Basically holds if B1(x) |= B2(x).
Given a query Q and a query plan Q1:
– Q1 is a sound query plan if Q1 is contained in Q.
– Q1 is a complete query plan if Q is contained in Q1.
– Q1 is a maximally contained query plan if there is no Q2 that is a sound query plan for Q such that Q1 is strictly contained in Q2.

44 Computing Containment Checks Consider Q1(.) :- B1(.) and Q2(.) :- B2(.).
Q1 ⊆ Q2 (“contained in”) if the answer to Q1 is a subset of the answer to Q2.
– Basically holds if B1(x) |= B2(x).
– (But entailment is undecidable in general… boo hoo.)
Containment can be checked through containment mappings if the queries are “conjunctive” (select/project/join queries, without constraints). [ONLY EXPONENTIAL!! Aren’t you relieved?]
m is a containment mapping from Vars(Q2) to Vars(Q1) if:
(1) m maps every subgoal in the body of Q2 to a subgoal in the body of Q1;
(2) m maps the head of Q2 to the head of Q1.
E.g.: Q1(x,y) :- R(x), S(y), T(x,y)
      Q2(u,v) :- R(u), S(v)
Containment mapping: [u/x ; v/y], so Q1 ⊆ Q2.
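The containment-mapping test above is simple enough to implement by brute force. This sketch handles only variables (no constants or comparisons) and searches all mappings, matching the slide's "only exponential" remark; the query encoding is my own.

```python
# Brute-force containment-mapping check for conjunctive queries.
# A query is encoded as (head_vars, [(pred, args), ...]), variables only.
from itertools import product

def containment_mapping(q1, q2):
    """Return a mapping m: Vars(q2) -> Vars(q1) witnessing q1 ⊆ q2, or None."""
    head1, body1 = q1
    head2, body2 = q2
    vars2 = sorted({v for _, args in body2 for v in args} | set(head2))
    vars1 = sorted({v for _, args in body1 for v in args} | set(head1))
    for choice in product(vars1, repeat=len(vars2)):  # exponential search
        m = dict(zip(vars2, choice))
        ok_head = tuple(m[v] for v in head2) == tuple(head1)
        ok_body = all(                    # every Q2 subgoal maps into Q1's body
            any(p2 == p1 and tuple(m[v] for v in a2) == tuple(a1)
                for p1, a1 in body1)
            for p2, a2 in body2)
        if ok_head and ok_body:
            return m
    return None

# The slide's example: Q1(x,y) :- R(x), S(y), T(x,y)   Q2(u,v) :- R(u), S(v)
Q1 = (("x", "y"), [("R", ("x",)), ("S", ("y",)), ("T", ("x", "y"))])
Q2 = (("u", "v"), [("R", ("u",)), ("S", ("v",))])
m = containment_mapping(Q1, Q2)
```

Finding `m == {"u": "x", "v": "y"}` witnesses the mapping [u/x ; v/y] from the slide, i.e., Q1 ⊆ Q2.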

45 Reformulation Algorithms
Bucket algorithm:
– Cartesian product of buckets, followed by a “containment” check.
– Doesn’t quite work for recursive plans…
Inverse rules:
– Plan fragments for mediator relations. E.g., with Q(.) :- V1() & V2(), sources S11, S12, S00 for V1 and S21, S22, S00 for V2:
  V1() :- S11()   V1() :- S12()   V1() :- S00()
  V2() :- S21()   V2() :- S22()   V2() :- S00()
  Q(.) :- V1() & V2()

46 Inverse rules for the movie sources S1(T,D,Y,G), S3(T,D), S5(T,D,Y), S4(C,G):
Movie(T,D,Y,G)    :- S1(T,D,Y,G)
Movie(T,D,?y,?g)  :- S3(T,D)
Movie(T,D,Y,?g)   :- S5(T,D,Y), Y > 1960       [Skolemize: sf_s5-1(T,D,Y)]
Movie(?t,?d,?y,G) :- S4(C,G)                   [Skolemize: sf_s4-1(G,C)]
Schedule(C,?t,?ti) :- S4(C,G)
Query: Q(C,G) :- Movie(T,D,Y,G), Schedule(C,T,Ti)

47 Start of 4/23

48 Bucket Algorithm: Populating Buckets Step 1: for each subgoal in the query, place relevant views in the subgoal’s bucket.
Inputs:
  Q(x) :- r1(x,y) & r2(y,x)
  V1(a) :- r1(a,b)
  V2(d) :- r2(c,d)
  V3(f) :- r1(f,g) & r2(g,f)
Buckets:
  r1(x,y): V1(x), V3(x)
  r2(y,x): V2(x), V3(x)

49 Combining Buckets Step 2: for every combination in the Cartesian product of the buckets, check containment in the query. The Bucket Algorithm checks all possible combinations.
Buckets for Q(x) :- r1(x,y) & r2(y,x):
  r1(x,y): V1(x), V3(x)
  r2(y,x): V2(x), V3(x)
Candidate rewritings:
  Q’1(x) :- V1(x) & V2(x)   (fails: expands to r1(x,b) & r2(c,x), which does not enforce the join on y)
  Q’2(x) :- V1(x) & V3(x)   (contained)
  Q’3(x) :- V3(x) & V2(x)   (contained)
  Q’4(x) :- V3(x) & V3(x)   (contained)
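The two bucket-algorithm steps above can be sketched as follows. The representation is deliberately coarse: a view is "relevant" to a subgoal if it mentions that relation, and the filter keeping only combinations containing V3 stands in for the real containment check (in this example, only V3 supplies the join between r1 and r2).

```python
# Bucket algorithm sketch for Q(x) :- r1(x,y) & r2(y,x). The final filter
# is a hard-coded stand-in for the containment check, valid for this
# example only; a real implementation would expand each candidate and
# test containment as on the previous slides.
from itertools import product

VIEWS = {            # view name -> relations appearing in its body
    "V1": {"r1"},
    "V2": {"r2"},
    "V3": {"r1", "r2"},
}
QUERY_SUBGOALS = ["r1", "r2"]

# Step 1: populate one bucket per subgoal with relevant views.
buckets = {g: [v for v, rels in VIEWS.items() if g in rels]
           for g in QUERY_SUBGOALS}

# Step 2: Cartesian product of buckets, then prune non-contained candidates.
candidates = list(product(*(buckets[g] for g in QUERY_SUBGOALS)))
valid = [c for c in candidates if "V3" in c]   # only V3 enforces the join on y
```

The 2 × 2 product yields four candidates, of which three survive, matching the four rewritings on the slide with Q’1 rejected.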

50 Creating Inverse Rules Inputs:
  V1(a) :- r1(a,b)
  V2(d) :- r2(c,d)
  V3(f) :- r1(f,g) & r2(g,f)
For each V(X) :- r1(X1) & … & rn(Xn), for each j = 1, …, n, form an inverse rule rj(Xj) :- V(X), replacing existential variables with Skolem terms.
Inverse rules:
  IR1: r1(a, sfV1(a)) :- V1(a)
  IR2: r2(sfV2(d), d) :- V2(d)
  IR3: r1(f, sfV3(f)) :- V3(f)
  IR4: r2(sfV3(f), f) :- V3(f)    (same Skolem function in IR3 and IR4)

51 Combining Inverse Rules At query time, run the query over the inverse rules plus the source tuples.
Query: Q(x) :- r1(x,y) & r2(y,x)
Tuples: V1(g), V2(h), V3(j), V3(m)
Expansion:
  r1(g, sfV1(g)), r2(sfV2(h), h),
  r1(j, sfV3(j)), r2(sfV3(j), j),
  r1(m, sfV3(m)), r2(sfV3(m), m)
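Generating the inverse rules themselves is mechanical, as this sketch shows. The tuple encoding of rules and Skolem terms is my own; the key property is that both subgoals of V3 reuse the *same* Skolem term for the shared existential variable g.

```python
# Generate inverse rules with Skolem functions. Each view
# V(X) :- r1(X1) & ... & rn(Xn) yields one inverse rule per subgoal;
# an existential variable v becomes the Skolem term ("sf_<view>_<v>", head).

def inverse_rules(view):
    """view = (name, head_vars, [(pred, args), ...]) -> list of rules,
    each rule encoded as (pred, skolemized_args, view_name, head_vars)."""
    name, head, body = view
    rules = []
    for pred, args in body:
        new_args = tuple(
            a if a in head else ("sf_%s_%s" % (name, a), head)
            for a in args)
        rules.append((pred, new_args, name, head))
    return rules

V1 = ("V1", ("a",), [("r1", ("a", "b"))])
V3 = ("V3", ("f",), [("r1", ("f", "g")), ("r2", ("g", "f"))])

rules = inverse_rules(V1) + inverse_rules(V3)
# e.g. r1(a, sf_V1_b(a)) :- V1(a)   and   r2(sf_V3_g(f), f) :- V3(f)
```

Because the Skolem term is built from the variable's name and the view's head, IR3 and IR4 get the identical term for g, which is what lets the expansion on the slide join r1(j, sfV3(j)) with r2(sfV3(j), j).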

52 Complexity of Finding Maximally Contained Plans Complexity doesn’t depend on the query:
– Can handle recursive queries just as easily.
Complexity does change if the sources are not “conjunctive queries”:
– Sources as unions of conjunctive queries (disjunctive descriptions): NP-hard.
– Sources as recursive queries, or with comparison predicates: undecidable.
Complexity also changes based on the open- vs. closed-world assumption about how the true source contents relate to the advertised description.

53 Local Completeness Information If sources are incomplete, we need to look at each one of them. Often, sources are locally complete: e.g., Movie(title, director, year) complete for years after 1960, or for American directors. Question: given a set of local-completeness statements, is a query plan Q’ a complete answer to Q? (Local-completeness statements are guarantees relating the true source contents to the advertised description.)

54 Example
LCW: S1(T,D,Y) :- M(T,D,Y), Y > 1960           (advertised: S1(T,D,Y) :- M(T,D,Y))
     S2(T,Th,C,H) :- Sh(T,Th,C,H)
Query: Q(t,d)  :- M(T,D,Y) & Sh(T,Th,C,H) & C=“Seattle”
Plan:  Q’(t,d) :- S1(T,D,Y) & S2(T,Th,C,H) & C=“Seattle”
The part of the plan that is complete:
       Q’(t,d) :- M(T,D,Y) & Sh(T,Th,C,H) & C=“Seattle” & Y > 1960
Does the complete part contain the query?

55 Example #2
LCW: S1(T,D,Y) :- M(T,D,Y), Y > 1960           (advertised: S1(T,D,Y) :- M(T,D,Y))
     S3(T,Y) :- Os(T,Y)
Query: Q(D)  :- M(T,D,Y) & Os(T,Y) & Y > 1965
Plan:  Q’(D) :- S1(T,D,Y) & S3(T,Y) & Y > 1965
The part of the plan that is complete:
       Q’(D) :- M(T,D,Y) & Os(T,Y) & Y > 1965 & Y > 1960
Does the complete part contain the query? (Yes: Y > 1965 implies Y > 1960.)

56 Other Ways of Modeling Inter-source Relations Inter-source comparison predicates:
– S1 has more of the movies after 1960 than S2.
Overlap statistics:
– S1 has 80% of the movies made after 1960, while S2 has 60% of the movies.
– S1 has 98% of the movies stored in S2.
– Computing cardinalities of unions given intersections.
Who gives these statistics? A third party, or probing.

57 Start of 4/25

58 Query Optimization Very related to query reformulation! Goal of the optimizer: find a physical plan with minimal cost. Key components in optimization:
– Search space of plans: have to deal with binding restrictions; bushy plans.
– Search strategy: need anytime strategies and strategies that trade coverage for cost.
– Cost model: acquiring statistics.

59 Optimization Challenges
Traditional:
– Each relation is exported in toto by a single database.
– All sources are assumed to be fully relational.
– Completeness is expected; optimize cost.
– Cost dominated by transfer time.
– Deterministic execution; availability of statistics.
Information Integration:
– Multiple sources export partial and overlapping portions of a relation: need to minimize plans to remove redundancy.
– Sources may allow only limited types of queries: need to model query capabilities.
– Optimize cost/coverage together (first few tuples?).
– Access costs need to be considered.
– Unreliable sources and networks: adaptive execution; non-blocking plans (symmetric hash joins etc.).

60 Traditional Query Optimization Process (simplified a bit) Parse the SQL query into a logical tree:
– Identify distinct blocks (corresponding to nested sub-queries or views).
Query rewrite phase:
– Apply algebraic transformations to yield a cheaper plan.
– Merge blocks and move predicates between blocks.
Optimize each block: join ordering. Complete the optimization: select scheduling (pipelining strategy).

61 Optimization in Distributed DBMS A distributed database (2-minute tutorial): –Data is distributed over multiple nodes, but is uniform. –Query execution can be distributed to sites. –Communication costs are significant. Consequences for optimization: –Optimizer needs to decide locality –Need to exploit independent parallelism. –Need operators that reduce communication costs (semi-joins).

62 DDBMS vs. Data Integration In a DDBMS, data is distributed over a set of uniform sites with precise rules. In a data integration context: –Data sources may provide only limited access patterns to the data. –Data sources may have additional query capabilities. –Cost of answering queries at sources unknown. –Statistics about data unknown. –Transfer rates unpredictable.

63 Modeling Source Capabilities Negative capabilities: –A web site may require certain inputs (in an HTML form). –Need to consider only valid query execution plans. Positive capabilities: –A source may be an ODBC compliant system. –Need to decide placement of operations according to capabilities. Problem: how to describe and exploit source capabilities.

64 Example #1: Continued
Create Source S1 as select * from Cites given paper1
Create Source S2 as select paper1 from Cites
Select p1 From S1, S2 Where S2.paper1 = S1.paper1 AND S1.paper2 = “Hal00”

65 Example #1: Access Patterns Sources:
  S1^bf(p1,p2) :- Cites(p1,p2)      (must bind p1)
  S2(p)        :- Asp(p)            (select paper from ASU-Papers)
  S3^b(p)      :- Awp(p)            (must bind p)
Query: select * from AwardPapers, i.e. Q(p) :- Awp(p). The rewriting is a recursive plan!
  Q(p)    :- Awp(p)
  Awp(p)  :- Dom(p), S3^b(p)
  Asp(p)  :- S2(p)
  Cites(p1,p2) :- Dom(p1), S1^bf(p1,p2)
  Dom(p)  :- S2(p)
  Dom(p)  :- Dom(p1), S1(p1,p)
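The recursive Dom(p) relation above can be evaluated by a naive fixpoint iteration, as this sketch shows. The citation data is hypothetical; S1 is modeled as a lookup callable only with its first argument bound, matching the "bf" binding pattern.

```python
# Fixpoint evaluation of the recursive Dom(p) plan. Dom seeds from the
# free-access source S2, then grows by following S1's bound-first-argument
# citation lookups; S3 is a membership test on bound papers. Toy data.

CITES = {"p1": ["p2"], "p2": ["p3"]}   # S1^bf: given a paper, its citations
ASU_PAPERS = {"p1"}                    # S2: can be scanned freely
AWARD = {"p3"}                         # S3^b: answers only bound queries

def compute_dom():
    dom = set(ASU_PAPERS)              # Dom(p) :- S2(p)
    changed = True
    while changed:                     # Dom(p) :- Dom(p1), S1(p1,p)
        changed = False
        for p1 in list(dom):
            for p in CITES.get(p1, []):
                if p not in dom:
                    dom.add(p)
                    changed = True
    return dom

dom = compute_dom()
award_answers = {p for p in dom if p in AWARD}   # Q(p) :- Dom(p), S3^b(p)
```

The award paper "p3" is reachable only because the recursion first discovers "p2" and then "p3", which is why no non-recursive plan suffices under these access patterns.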

66 Binding Restrictions and Search Space In principle, we need to consider all possible join orderings, over both left-linear and bushy trees. Binding-pattern restrictions necessitate consideration of bushy tree plans, and there are many more bushy tree plans than left-linear plans. But binding-pattern restrictions also reduce the number of viable plans.
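The "many more bushy plans" claim can be made concrete with the standard counts: n! left-linear orderings versus n! times the (n-1)-th Catalan number of tree shapes for bushy plans. The formula is standard combinatorics, not from the slides.

```python
# Search-space sizes for joining n relations: left-linear plans fix the
# tree shape (n! orderings); bushy plans multiply by the number of binary
# tree shapes with n leaves, the Catalan number C(n-1).
from math import comb, factorial

def catalan(k):
    return comb(2 * k, k) // (k + 1)

n = 4
left_linear = factorial(n)                 # orderings of 4 relations
bushy = factorial(n) * catalan(n - 1)      # orderings x tree shapes
```

For the four relations A, B, C, D on the slide this gives 24 left-linear plans against 120 bushy ones, and the gap widens rapidly with n.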

67 Handling Positive Capabilities Characterizing positive capabilities: –Schema independent (e.g., can always perform joins, selections). –Schema dependent: can join R and S, but not T. –Given a query, tells you whether it can be handled. Key issue: how do you search for plans? Garlic approach (IBM): Given a query, STAR rules determine which subqueries are executable by the sources. Then proceed bottom-up as in System-R.

68 Coverage vs. Cost Tradeoffs You can get maximal coverage by calling all the potentially relevant sources, but the cost will be prohibitively high. Combine the two in a utility function, e.g.:
  U(p) = w * log(coverage(p)) - (1-w) * cost(p)
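The utility function can be applied directly to rank candidate plans. The plans, coverages, and costs below are made-up numbers for illustration; the point is only that the weight w shifts which plan wins.

```python
# Ranking plans by U(p) = w*log(coverage(p)) - (1-w)*cost(p).
# Coverage is a fraction in (0, 1]; cost is in arbitrary units; all
# numbers are hypothetical.
from math import log

def utility(plan, w=0.5):
    return w * log(plan["coverage"]) - (1 - w) * plan["cost"]

plans = [
    {"name": "all-sources", "coverage": 0.99, "cost": 9.0},
    {"name": "top-2",       "coverage": 0.80, "cost": 2.0},
    {"name": "cheapest",    "coverage": 0.30, "cost": 0.5},
]
best = max(plans, key=utility)   # higher w rewards coverage, lower w cost
```

With these numbers the balanced weighting picks the cheap plan, while a coverage-heavy weighting (say w = 0.9) shifts the winner toward a broader plan, which is exactly the tradeoff the slide describes.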

69 (Figure: a lattice of candidate plans combining one source per subgoal from S11-S14, S21-S24, and S31-S34, e.g. S11-S21-S31, S21-S22-S32, S14-S21-S31.)

70 Cost and Coverage Comparison w*log(coverage(p)) - (1-w)*cost(p)

71 w*log(coverage(p)) - (1-w)*cost(p)

72 Optimization and Execution Problem:
– Few and unreliable statistics about the data.
– Unexpected (possibly bursty) network transfer rates.
– Generally, unpredictable environment.
General solution (a research area): adaptive query processing; interleave optimization and execution. As you get to know more about your data, you can improve your plan.

73 A Scenario A user poses a high-level query through a GUI: want a 1998 red BMW, no accidents, 20% below the average model price. An Internet query engine (www.google+.com) translates it into an XML query language and, via an XML query engine (e.g., Niagara), runs it over queryable data sources and XML sources on the Internet.

74 Inside the Internet Query Engine
(Figure: a query plan. A union of “Red Used BMW Cars” sources yields (carId, model, price, otherinfo); a Group By computes (model, avgprice); the two are joined on model = model with the predicate price <= 0.8 * avgprice; a Not Exists against a union of Accident Reports sources (carId) filters out cars with accidents.)

75 Robust Plans to Deal with Execution Uncertainties Need non-blocking plans, which give out results as soon as they are available. Plan operators should have the properties:
– Flexible input
– Maximal output
– Non-monotonic processing: deal with both additions and updates.

76 Matching Objects Across Sources How do I know that A. Halevy in source 1 is the same as Alon Halevy in source 2? If there are uniform keys across sources, no problem. If not:
– Domain-specific solutions (e.g., maybe look at the address, SSN).
– Use information retrieval techniques [Cohen, 98]: judge similarity as you would between documents.
– Use concordance tables. These are time-consuming to build, but you can then sell them for lots of money.
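A minimal flavor of the IR-style approach: treat each name as a bag of tokens and compare the bags. This toy Jaccard measure and the 0.3 threshold are my own stand-ins for illustration, not Cohen's actual method (which uses TF-IDF-weighted similarity).

```python
# Token-based similarity for the "A. Halevy" vs "Alon Halevy" example.
# Jaccard over lowercased whitespace tokens; threshold is an assumption.

def jaccard(name1, name2):
    t1 = set(name1.lower().split())
    t2 = set(name2.lower().split())
    return len(t1 & t2) / len(t1 | t2)

sim = jaccard("A. Halevy", "Alon Halevy")   # shares "halevy": 1 of 3 tokens
same_entity = sim >= 0.3                    # hypothetical matching threshold
```

Even this crude measure links the two spellings through the shared surname token, while unrelated names score 0; real systems add weighting, edit distance, and blocking to scale.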

77 Will XML make integration a piece of cake?

