1
Kambhampati & Knoblock, Information Integration on the Web (MA-1)
Information & Data Integration: combining information from multiple autonomous sources
2
The end-game: 3 options
– Have an in-class final exam (5/8, 2:30pm is the designated time)
– Have a take-home exam
– Make the final homework into a take-home, and have a mandatory discussion class on 5/8 at 2:30pm
Also, note the change in demo schedule.
3
Today's Agenda
– Discuss SemTag/Seeker
– Lecture: start on Information Integration
4
Information Integration
Combining information from multiple autonomous information sources, and answering queries using the combined information. Many variations, depending on:
– The type of information sources (text? data? a combination?): data vs. information integration; horizontal vs. vertical integration
– The level of eagerness of the integration: ad hoc vs. pre-mediated integration; pre-mediation itself can be warehouse vs. online approaches
– Generality of the solution: mashup vs. mediated
5
The "IR" View: information integration as making the database repository of the web
– Discovering information sources (e.g. deep web modeling, schema learning, …)
– Gathering data (e.g. wrapper learning & information extraction, federated search, …)
– Cleaning data (e.g. de-duping and linking records) to form a single [virtual] database
– Querying integrated information sources (e.g. queries to views, execution of web-based queries, …)
– Data mining & analyzing integrated information (e.g. collaborative filtering/classification learning using extracted data, …)
6
The "DB" View: integration as mediation over autonomous databases
[Architecture diagram: sources (services, webpages, structured data, sensors with streaming data) feed a mediator made up of a source fusion / query planning component, a query executor, and a monitor; source trust informs planning; queries come in and answers go out.]
7
Who is dying to have it? (Applications) [Skeptic's corner]
WWW:
– Comparison shopping
– Portals integrating data from multiple sources
– B2B, electronic marketplaces
Science and culture:
– Medical genetics: integrating genomic data
– Astrophysics: monitoring events in the sky
– Culture: uniform access to all cultural databases produced by countries in Europe or provinces in Canada
Enterprise data integration:
– An average company has 49 different databases and spends 35% of its IT dollars on integration efforts
8
Isn't the web mostly text? [Skeptic's corner]
– The invisible web is mostly structured: most web servers have back-end database servers that dynamically convert (wrap) the structured data into readable English, e.g. "The capital of India is New Delhi."
  So, if we can "unwrap" the text, we have structured data: (un)wrappers, learning wrappers, etc.
  Note also that such dynamic pages cannot be crawled.
– The (coming) semi-structured web: most pages are at least "semi"-structured; the XML standard is expected to ease the presentation and on-the-wire transfer of such pages (BUT…)
– The services: travel services, mapping services
– The sensors: stock quotes, current temperatures, ticket prices, …
9
Is it like Expedia/Travelocity/Orbitz…? [Skeptic's corner]
Surprisingly, NO! The online travel sites don't quite need to do data integration; they just use SABRE.
– SABRE was started in the 60's as a joint project between American Airlines and IBM
– It is the de facto database for most of the airline industry (who voluntarily enter their data into it). Very few airlines buck the SABRE trend; Southwest Airlines is one (which is why many online sites don't bother with Southwest).
So online travel sites really are talking to a single database, not multiple data sources.
– To be sure, online travel sites do have to solve a hard problem: finding an optimal fare (even at a given time) is basically computationally intractable, not to mention the issues brought in by fluctuating fares. So don't be so hard on yourself; check out http://www.maa.org/devlin/devlin_09_02.html
10
Why isn't this just search engines? [Skeptic's corner]
Search engines do text-based retrieval of URLs.
– Works reasonably well for single-document texts, or for finding sites based on single-document text
– Cannot integrate information from multiple documents
– Cannot do effective "query relaxation" or generalization
– Cannot link documents and databases
The aim of information integration is to support query processing over structured and semi-structured sources as well as services.
11
Are we talking "comparison shopping" agents?
Certainly closer to the aims of these, but with a wider focus:
– Consider a larger range of databases
– Consider services
This implies more challenges:
– "Warehousing" may not work
– Manual source characterization/integration won't scale up
13
4/26: Information Integration, part 2. Focus on data integration.
14
[Taxonomy diagram:] Information integration splits into data integration and text integration. Data integration covers data aggregation (vertical integration) and data linking (horizontal integration); text integration covers collection selection and soft joins.
15
Different "Integration" Scenarios
"Data Aggregation" (vertical):
– All sources export (parts of) a single relation; no need for joins etc.; could be warehouse or virtual
– E.g. BibFinder, Junglee, Employeds, etc.
– Challenges: schema mapping; data overlap
"Data Linking" (horizontal):
– Joins over multiple relations stored in multiple DBs, e.g. soft joins in WHIRL (the Ted Kennedy episode)
– Challenges: record linkage over text fields (object mapping); query reformulation
"Collection Selection":
– All sources export text documents, e.g. metacrawlers
– Challenges: similarity definition; relevance handling
All together (vertical & horizontal):
– Many interesting research issues, but few actual fielded systems
16
Collection Selection
17
Collection Selection / Metasearch Introduction
Metasearch engine: a system that provides unified access to multiple existing search engines. Components:
– Database selector: identifies potentially useful databases for each user query
– Document selector: identifies potentially useful documents returned from the selected databases
– Query dispatcher and result merger: ranks the selected documents
18
Collection Selection
[Pipeline diagram: collection selection, then query execution, then results merging, over collections such as NYT, FT, CNN, WP, WSJ.]
19
Evaluating Collection Selection
Let c1..cj be the collections chosen to be accessed for the query Q, and let d1…dk be the top documents returned from these collections. We compare these results to the results that would have been returned from a central union database.
– Ground truth: the ranking of documents that the retrieval technique (say vector-space or Jaccard similarity) would have retrieved from a central union database that is the union of all the collections
– Compare the precision of the documents returned by accessing only the selected collections against this ground truth
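The evaluation described above can be sketched in a few lines. This is an illustrative sketch, not code from the course; the document names and rankings below are made up.

```python
# Sketch: precision@k of a collection-selection result against the
# "central union database" ground truth.

def precision_at_k(returned_docs, ground_truth_ranking, k):
    """Fraction of the top-k ground-truth documents that the selected
    collections actually returned."""
    top_k = set(ground_truth_ranking[:k])
    return len(top_k & set(returned_docs)) / k

# Ground truth: ranking the retrieval model would produce over the union DB.
union_ranking = ["d1", "d2", "d3", "d4", "d5"]
# Documents actually returned by querying only the selected collections.
selected_results = ["d2", "d1", "d7", "d4"]

p = precision_at_k(selected_results, union_ranking, k=4)
```

Here 3 of the top 4 union-database documents (d1, d2, d4) were recovered from the selected collections, so precision@4 is 0.75.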
20
Existing Approaches to Database Selection
Rough representative approaches:
– Use a few keywords or sentences to describe the contents of the database; often manually generated
Statistical representative approaches:
– Keep statistical information for each term in the databases (such as the document frequency and average weight of the term)
– Scalability issues due to the large amount of statistics per database
Learning-based approaches:
– Use past retrieval experience to determine usefulness for future queries (training queries / real user queries)
– The experience obtained against a database is saved as the representative of that database
21
General Scheme & Challenges
Get a representative of each database:
– The representative is a sample of files from the database
– Challenge: get an unbiased sample when you can only access the database through queries
Compare the query to the representatives to judge the relevance of a database:
– Coarse approach: convert the representative files into a single file (super-document); take the (vector) similarity between the query and a database's super-document to judge that database's relevance
– Finer approach: keep each representative as a mini-database; union the mini-databases to get a central mini-database; apply the query to the central mini-database and find its top-k answers; decide the relevance of each database based on which answers came from which database's representative
– You can use an estimate of the size of the database too
– What about overlap between collections? (See the ROSCO paper)
22
Uniform Probing for Content Summary Construction
Automatic extraction of document frequency statistics from uncooperative databases [Callan and Connell, TOIS 2001; Callan et al., SIGMOD 1999]. Main ideas:
– Pick a word and send it as a query to database D
  RandomSampling-OtherResource (RS-Ord): pick from a dictionary
  RandomSampling-LearnedResource (RS-Lrd): pick from retrieved documents
– Retrieve the top-k documents returned
– If the number of retrieved documents exceeds a threshold T, stop; otherwise restart at the beginning
– k = 4, T = 300
– Compute the sample document frequency for each word that appeared in a retrieved document
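The probing loop above can be sketched as follows. This is a hedged sketch in the spirit of the RS-Lrd variant, not the authors' code; the search interface, corpus, and the extra `max_probes` guard are illustrative assumptions.

```python
import random

# Sketch of query-based sampling: send one-word probe queries, keep the top-k
# results per probe, grow the probe vocabulary from retrieved documents
# (RS-Lrd), and stop after T distinct documents (or max_probes, an added
# guard not in the original description).

def sample_database(search, seed_words, k=4, T=300, max_probes=1000):
    rng = random.Random(0)          # fixed seed for repeatability
    sampled = {}                    # doc id -> text
    vocabulary = list(seed_words)
    probes = 0
    while vocabulary and len(sampled) < T and probes < max_probes:
        probes += 1
        word = rng.choice(vocabulary)
        for doc_id, text in search(word)[:k]:
            if doc_id not in sampled:
                sampled[doc_id] = text
                vocabulary.extend(text.split())   # RS-Lrd vocabulary growth
    # Sample document frequency: number of sampled docs containing each word.
    df = {}
    for text in sampled.values():
        for w in set(text.split()):
            df[w] = df.get(w, 0) + 1
    return sampled, df

# Toy "uncooperative database" exposing only a keyword-search interface.
corpus = {1: "apple pie recipe", 2: "apple tart", 3: "car engine"}
search = lambda w: [(i, t) for i, t in corpus.items() if w in t.split()]
sampled, df = sample_database(search, ["apple"], k=4, T=300)
```

Note how document 3 is never sampled: it shares no words with the seed or with any retrieved document, which is exactly the bias the slide's "unbiased sample" challenge refers to.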
23
CORI Net Approach (representative as a super-document)
Representative statistics:
– The document frequency of each term in each database
– The database frequency of each term
Main ideas:
– View the representative of a database as a super-document, and the set of all representatives as a database of super-documents
– Document frequency becomes term frequency in the super-document, and database frequency becomes document frequency in the super-database
– Ranking scores can then be computed using a similarity function such as cosine
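The super-document idea can be sketched directly. This is an illustrative simplification (raw term counts and plain cosine, rather than CORI's actual tf.idf-style belief formula); the database names and documents are made up.

```python
import math

# Sketch: each database representative becomes one big bag of words (the
# "super-document"); databases are ranked by cosine similarity to the query.

def bag(text):
    b = {}
    for w in text.split():
        b[w] = b.get(w, 0) + 1
    return b

def cosine(u, v):
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_databases(query, representatives):
    """representatives: db name -> list of sampled documents (strings)."""
    q = bag(query)
    supers = {db: bag(" ".join(docs)) for db, docs in representatives.items()}
    return sorted(supers, key=lambda db: cosine(q, supers[db]), reverse=True)

reps = {
    "movies": ["comedy film director", "film festival awards"],
    "cars":   ["engine torque sedan", "sedan dealer price"],
}
order = rank_databases("comedy film", reps)
```

For the query "comedy film" the movies database's super-document scores highest, so it would be selected first.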
24
ReDDE Approach (representative as a mini-collection)
– Use the representatives as mini-collections; construct a union representative that is the union of the mini-collections (such that each document keeps information on which collection it was sampled from)
– Send the query to the union collection first and get the top-k ranked results
– See which of the top-k results came from which mini-collection; rank the collections by how much their mini-collections contributed to the top-k answers
– Scale the number of returned results by the expected size of the actual collection
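The scoring step above can be sketched as follows. This is a hedged sketch of the idea, not the ReDDE paper's code; the collection names and sizes are illustrative assumptions.

```python
# Sketch of ReDDE scoring: after pooling the sampled mini-collections and
# ranking the query against the pool, each collection is credited for every
# top-k pooled result drawn from its sample, scaled by
# estimated collection size / sample size.

def redde_scores(top_k_sources, sample_sizes, estimated_sizes):
    """top_k_sources: for each of the top-k pooled results, the name of the
    collection it was sampled from."""
    scores = {c: 0.0 for c in sample_sizes}
    for c in top_k_sources:
        scores[c] += estimated_sizes[c] / sample_sizes[c]
    return scores

# A contributed 3 of the top-4 pooled hits, B only 1, but B is estimated to
# be a much larger collection, so the size scaling ranks B first.
scores = redde_scores(
    ["A", "A", "B", "A"],
    sample_sizes={"A": 100, "B": 100},
    estimated_sizes={"A": 10_000, "B": 50_000},
)
```

This shows why the size scaling matters: a hit in a small sample of a huge collection is evidence of many matching documents in the real collection.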
25
Data Integration
26
Models for Integration
Modified from Alon Halevy's slides
27
Solutions for Small-scale Integration
– Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money
– Data warehousing: load all the data periodically into a warehouse
  6-18 months lead time
  Separates the operational DBMS from the decision-support DBMS (not only a solution to data integration)
  Performance is good; data may not be fresh
  Need to clean and scrub your data
Junglee did this for employment classifieds.
28
Data Warehouse Architecture
[Diagram: data extraction programs pull from multiple data sources; data cleaning/scrubbing loads a relational database (the warehouse); user queries and OLAP / decision support / data cubes / data mining run against the warehouse. Junglee did this for employment classifieds.]
29
The Virtual Integration Architecture
Leave the data in the sources. When a query comes in:
– Determine the sources relevant to the query
– Break down the query into sub-queries for the sources
– Get the answers from the sources, and combine them appropriately
Data is fresh, and the approach is scalable. Issues:
– Relating sources & mediator
– Reformulating the query
– Efficient planning & execution
Systems: Garlic [IBM], Hermes [UMD], Tsimmis and InfoMaster [Stanford], DISCO [INRIA], Information Manifold [AT&T], SIMS/Ariadne [USC], Emerac/Havasu [ASU]
30
Virtual Integration Architecture
Sources can be: relational, hierarchical (IMS), structured files, web sites.
31
Research Projects
– Garlic (IBM), Information Manifold (AT&T)
– Tsimmis, InfoMaster (Stanford)
– The Internet Softbot / Razor / Tukwila (UW)
– Hermes (Maryland)
– DISCO (INRIA, France)
– SIMS/Ariadne (USC/ISI)
– Emerac/Havasu, BibFinder (ASU)
32
Desiderata for Relating Source-Mediator Schemas
– Expressive power: distinguish between sources with closely related data; hence, be able to prune access to irrelevant sources
– Easy addition: make it easy to add new data sources
– Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively
– Non-lossy: be able to handle all queries that can be answered by directly accessing the sources
33
11/9
– Google Mashups: what type of integration are they?
– Mturk.com: collaborative computing the capitalist way
34
Why isn't this just databases / distributed databases? [Skeptic's corner]
– No common schema: sources with heterogeneous schemas (and ontologies); semi-structured sources
– Legacy sources: not relational-complete; variety of access/process limitations
– Autonomous sources: no central administration; uncontrolled source content overlap
– Unpredictable run-time behavior: makes query execution hard
– Predominantly "read-only": could be a blessing (less worry about transaction management), although the push now is to also support transactions on the web
35
Reformulation Problem
Given:
– A query Q posed over the mediated schema
– Descriptions of the data sources
Find:
– A query Q' over the data source relations, such that Q' provides only correct answers to Q, and Q' provides all possible answers to Q given the sources
36
Approaches for Relating Source & Mediator Schemas
– Global-as-view (GAV): express the mediated schema relations as a set of views over the data source relations
– Local-as-view (LAV): express the source relations as views over the mediated schema
– The two can be combined
(Virtual vs. materialized "view" refresher.) Let's compare them in a movie-database integration scenario; the differences are minor for data aggregation.
37
Example Scenario
A mediator for movie databases: we want to provide information about movies and where they are playing.
39
Global-as-View
Express mediator schema relations as views over source relations; mediator schema relations are virtual views on the source relations.
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Create View Movie AS
  select * from S1                              [S1(title, dir, year, genre)]
  union
  select * from S2                              [S2(title, dir, year, genre)]
  union
  select S3.title, S3.dir, S4.year, S4.genre
  from S3, S4 where S3.title = S4.title         [S3(title, dir), S4(title, year, genre)]
40
Global-as-View: Example 2
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Create View Movie AS
  select title, dir, year, NULL from S1         [S1(title, dir, year)]
  union
  select title, dir, NULL, genre from S2        [S2(title, dir, genre)]
Null values pad the attributes that a source lacks.
41
Global-as-View: Example 2 (continued)
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre).
Create View Movie AS
  select NULL, NULL, NULL, genre from S4
Create View Schedule AS
  select cinema, NULL, NULL from S4
But what if we want to find which cinemas are playing comedies? This is "lossy mediation": the connection between cinema and genre in S4 is lost.
42
Local-as-View: Example 1
Express source schema relations as views over mediator relations; sources are "materialized views" of the mediator schema.
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
S1(title, dir, year, genre):     Create Source S1 AS select * from Movie
S3(title, dir):                  Create Source S3 AS select title, dir from Movie
S5(title, dir, year), year > 1960:
  Create Source S5 AS select title, dir, year from Movie
  where year > 1960 AND genre = "Comedy"
43
Local-as-View: Example 1 (continued)
Adding source S4(cinema, genre) to the same mediated schema:
Create Source S4 AS
  select cinema, genre from Movie m, Schedule s
  where m.title = s.title
44
Local-as-View: Example 2
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre).
Create Source S4 AS
  select cinema, genre from Movie m, Schedule s
  where m.title = s.title
Now if we want to find which cinemas are playing comedies, there is hope!
45
GAV vs. LAV
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S4: S4(cinema, genre). In GAV the mediation over S4 is lossy; in LAV it is not.
46
GAV vs. LAV
GAV:
– Not modular: addition of new sources changes the mediated schema
– Can be awkward to write the mediated schema without loss of information
– Query reformulation is easy: reduces to view unfolding (polynomial); can build hierarchies of mediated schemas
– Best when there are few, stable data sources, well known to the mediator (e.g. corporate integration)
– Systems: Garlic, TSIMMIS, HERMES
LAV:
– Modular: adding new sources is easy
– Very flexible: the power of the entire query language is available to describe sources
– Reformulation is hard: involves answering queries using only views (can be intractable)
– Best when there are many, relatively unknown data sources, with the possibility of addition/deletion of sources
– Systems: Information Manifold, InfoMaster, Emerac, Havasu
47
Reformulation in LAV
Given a set of views V1,…,Vn and a query Q, can we answer Q using only the answers to V1,…,Vn? Note the use of materialized (pre-computed) views: ONLY materialized views may be used.
Approaches:
– Bucket algorithm [Levy, 1996]
– Inverse rules algorithm [Duschka, 1999]
– Hybrid versions: SV-Bucket [2001], MiniCon [2001]
48
Reformulation in LAV: The Issues
Query: find all the years in which Zhang Yimou released movies.
  select year from Movie M where M.dir = yimou
  Q(y) :- movie(T,D,Y,G), D=yimou        (not executable)
Candidate plans:
  (1) Q(y) :- S1(T,D,Y,G), D=yimou
  (2) Q(y) :- S1(T,D,Y,G), D=yimou
      Q(y) :- S5(T,D,Y), D=yimou
Which is the better plan? What are we looking for: equivalence? containment? maximal containment? the smallest plan?
49
Maximal Containment
A query plan should be sound and complete:
– Sound: the plan is contained in the query, i.e. every tuple the plan produces satisfies the query
– Complete: the query is contained in the plan
Traditional DBs aim for equivalence (the query contains the plan and the plan contains the query); with autonomous sources that is impossible. Instead, we want all query tuples that can be extracted given the sources we have:
– Maximal containment: no other sound query plan "contains" this plan
P contains Q if Q |= P (exponential to check even for "conjunctive" (select/project/join) query plans).
50
The World of Containment
Consider Q1(.) :- B1(.) and Q2(.) :- B2(.).
Q1 ⊆ Q2 ("contained in") if the answer to Q1 is a subset of the answer to Q2; this basically holds if B1(x) |= B2(x).
Given a query Q and a query plan Q1:
– Q1 is a sound query plan if Q1 is contained in Q
– Q1 is a complete query plan if Q is contained in Q1
– Q1 is a maximally contained query plan if there is no sound query plan Q2 for Q such that Q1 is strictly contained in Q2
51
Computing Containment Checks
Consider Q1(.) :- B1(.) and Q2(.) :- B2(.). Q1 ⊆ Q2 ("contained in") if the answer to Q1 is a subset of the answer to Q2; this basically holds if B1(x) |= B2(x), but entailment is undecidable in general.
Containment can be checked through containment mappings if the queries are "conjunctive" (select/project/join queries, without constraints). [Only exponential: aren't you relieved?]
m is a containment mapping from Vars(Q2) to Vars(Q1) if:
1. m maps every subgoal in the body of Q2 to a subgoal in the body of Q1
2. m maps the head of Q2 to the head of Q1
E.g. Q1(x,y) :- R(x), S(y), T(x,y) and Q2(u,v) :- R(u), S(v); containment mapping: [u/x; v/y], so Q1 ⊆ Q2.
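The containment-mapping test can be sketched as a brute-force search, which makes the "only exponential" remark concrete. This is an illustrative sketch with a simple tuple-based query encoding of my own choosing, not the course's code.

```python
from itertools import product

# Sketch: search for a containment mapping from Q2's variables to Q1's terms.
# If one exists, Q1 is contained in Q2 (conjunctive queries, no constraints).
# Brute force over all mappings, hence exponential, as the slide warns.

def contained_in(q1, q2):
    """q = (head_args, body); body = list of (predicate, arg-tuple) pairs.
    Returns True iff q1 is contained in q2."""
    head1, body1 = q1
    head2, body2 = q2
    vars2 = sorted({a for _, args in body2 for a in args} | set(head2))
    terms1 = sorted({a for _, args in body1 for a in args} | set(head1))
    atoms1 = {(p, t) for p, t in body1}
    for values in product(terms1, repeat=len(vars2)):
        m = dict(zip(vars2, values))
        head_ok = tuple(m[a] for a in head2) == tuple(head1)
        body_ok = all((p, tuple(m[a] for a in args)) in atoms1
                      for p, args in body2)
        if head_ok and body_ok:
            return True
    return False

# The slide's example:
#   Q1(x,y) :- R(x), S(y), T(x,y)      Q2(u,v) :- R(u), S(v)
q1 = (("x", "y"), [("R", ("x",)), ("S", ("y",)), ("T", ("x", "y"))])
q2 = (("u", "v"), [("R", ("u",)), ("S", ("v",))])
```

The mapping [u/x; v/y] is found, so Q1 ⊆ Q2; the reverse fails because no atom of Q2 can absorb T(x,y).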
52
Reformulation Algorithms
– Bucket algorithm [Levy]: build a bucket of relevant views per query subgoal, take the Cartesian product of the buckets, and follow with a "containment" check (P1 contains P2 if P2 |= P1)
– Inverse rules [Duschka]: build plan fragments for the mediator relations. E.g. for Q(.) :- V1() & V2(), with sources S11, S12, S00 for V1 and S21, S22, S00 for V2:
  V1() :- S11()    V1() :- S12()    V1() :- S00()
  V2() :- S21()    V2() :- S22()    V2() :- S00()
  Q(.) :- V1() & V2()
53
Inverse Rules Example
The LAV source descriptions, inverted into rules for the mediator relations (unbound head variables are Skolemized):
  Movie(T,D,Y,G)     :- S1(T,D,Y,G)
  Movie(T,D,?y,?g)   :- S3(T,D)
  Movie(T,D,Y,?g)    :- S5(T,D,Y), Y > 1960      (?g Skolemized, e.g. f_s5-1(T,D,Y))
  Movie(?t,?d,?y,G)  :- S4(C,G)                  (Skolemized, e.g. f_s4-1(G,C))
  Schedule(C,?t,?ti) :- S4(C,G)
Query: Q(C,G) :- Movie(T,D,Y,G), Schedule(C,T,Ti)
54
Take-home vs. In-class (the "tastes great / less filling" debate of CSE 494)
Take-home: more time-consuming to set, to take, and to grade.
– Caveat: people tend to over-estimate the time taken to do the take-home, since they don't factor out the "preparation" time that is interleaved with the exam time; but it may be more realistic
– Probably will be given on Th the 6th and will be due by the 11th evening (+/-)
In-class: less time-consuming to take (sort of like removing a bandage), and definitely to grade.
– Scheduled to be on Tuesday, May 11th, 2:40-4:30
55
Interactive Review on 5/4
A large part (probably most) of the class on 5/4 will be devoted to an interactive semester review. Everyone should come prepared with at least 3 topics/ideas that they got out of the class.
– You will be called in random order, and you should have enough things to not repeat what others before you said
What you say will then be immortalized on the class homepage; see the *old* acquired-wisdom page on the class page.
56
Bucket Algorithm: Populating Buckets
Step 1: for each subgoal in the query, place the relevant views in the subgoal's bucket.
Inputs:
  Q(x)  :- r1(x,y) & r2(y,x)
  V1(a) :- r1(a,b)
  V2(d) :- r2(c,d)
  V3(f) :- r1(f,g) & r2(g,f)
Buckets:
  r1(x,y): V1(x), V3(x)
  r2(y,x): V2(x), V3(x)
57
Combining Buckets
Step 2: for every combination in the Cartesian product of the buckets, check containment in the query Q(x) :- r1(x,y) & r2(y,x). The bucket algorithm checks all possible combinations:
  Q'1(x) :- V1(x) & V2(x)    fails
  Q'2(x) :- V1(x) & V3(x)    fails
  Q'3(x) :- V3(x) & V2(x)    fails
  Q'4(x) :- V3(x) & V3(x)    contained (simplifies to Q'4(x) :- V3(x))
Buckets:
  r1(x,y): V1(x), V3(x)
  r2(y,x): V2(x), V3(x)
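The combination step above can be sketched as follows. This is an illustrative sketch of the enumeration only; the containment test is pluggable, and here it is a toy predicate hard-coded to the slide's example (where V1 and V2 project away the join variable y, so only the plan built from V3 alone is contained in Q).

```python
from itertools import product

# Sketch of the bucket algorithm's combination step: one bucket per query
# subgoal; every element of the Cartesian product of buckets is a candidate
# rewriting, kept only if it passes a containment check.

def candidate_rewritings(buckets, is_contained):
    seen, kept = set(), []
    for combo in product(*buckets):
        views = tuple(sorted(set(combo)))     # duplicate views collapse
        if views in seen:
            continue
        seen.add(views)
        if is_contained(views):
            kept.append(views)
    return kept

# Slide example: Q(x) :- r1(x,y) & r2(y,x)
# Buckets: r1 -> {V1, V3}, r2 -> {V2, V3}
buckets = [["V1", "V3"], ["V2", "V3"]]
# Toy containment test for this example: only V3 covers the join on its own.
rewritings = candidate_rewritings(buckets, is_contained=lambda vs: vs == ("V3",))
```

Of the four combinations, only (V3, V3), which collapses to just V3, survives the containment check, matching the slide.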
58
Complexity of Finding Maximally Contained Plans in LAV
The complexity changes if the sources are not "conjunctive queries":
– Sources as unions of conjunctive queries (disjunctive descriptions): NP-hard
– Sources as recursive queries (comparison predicates): undecidable [Abiteboul & Duschka, 98]
The complexity is less dependent on the query:
– Recursion is okay, but inequality constraints lead to NP-hardness
The complexity also changes based on the open- vs. closed-world assumption.
You can "reduce" the complexity by taking a conjunctive query that is an upper bound, but this just pushes the complexity to the minimization phase.
(Example: a source "Big Two" whose true contents are "Toyota" U "Olds", but whose advertised description is "all cars".)
59
Is XML standardization a magical solution for integration?
If all web sources standardize on the XML format:
– Source access (wrapper generation issues) becomes easier to manage
– BUT all other problems remain:
  Still need to relate source (XML) schemas to the mediator (XML) schema
  Still need to reason about source overlap, source access limitations, etc.
  Still need to manage execution in the presence of source/network uncertainties
60
Query Optimization Challenges
– Deciding what to optimize
– Getting the statistics on sources
– Doing the optimization
We will first talk about reformulation-level challenges.
61
Practical Issues Complicating Reformulation
– Sources may have access limitations: access restrictions can lead to recursive rewritings even when the queries are non-recursive!
– Sources may have overlap: non-minimal rewritings may result when overlap information is ignored
62
Source Limitations
Sources are not really fully-relational databases:
– Legacy systems
– Limited access patterns (can't ask a white-pages source for the list of all numbers)
– Limited local processing power: typically only selections (on certain attributes) are supported
Access limitations are modeled in terms of the allowed ("feasible") binding patterns with which the source can be accessed, e.g. S(X,Y,Z) with feasible patterns f,f,b or b,b,f (b = bound, f = free).
63
Access Restrictions & Recursive Reformulations [Kwok & Weld, 96; Duschka & Levy, 97]
Sources:
  Create Source S1 as select * from Cites given paper1          (S1^bf(p1,p2) :- cites(p1,p2))
  Create Source S2 as select paper from ASU-Papers              (S2(p) :- Asp(p))
  Create Source S3 as select paper from AwardPapers given paper (S3^b(p) :- Awp(p))
Query: select * from AwardPapers, i.e. Q(p) :- Awp(p)
Recursive plan! The domain relation Dom feeds bound arguments:
  Dom(p) :- S2(p)
  Dom(p) :- Dom(p1), S1^bf(p1,p)
  Awp(p) :- Dom(p), S3^b(p)
  Q(p)   :- Awp(p)
64
Managing Source Overlap
Often, sources on the Internet have overlapping contents, and the overlap is not centrally managed (unlike DDBMS data replication). Reasoning about overlap is important for plan optimality: we cannot possibly call all potentially relevant sources!
Question: how do we characterize, obtain, and exploit source overlap?
– Qualitative approaches (LCW statements)
– Quantitative approaches (coverage/overlap statistics)
65
Local Completeness Information
If sources are incomplete, we need to look at each one of them. But often, sources are locally complete, e.g. Movie(title, director, year) complete for years after 1960, or for American directors. Question: given a set of local completeness statements, is a query Q' a complete answer to Q?
(True source contents vs. advertised description, plus guarantees: LCW statements, inter-source comparisons.)
Problems:
1. Sources may not be interested in giving these, and they are hard to learn
2. Even if sources are willing to give them, there may not be any "big enough" LCWs; saying "I definitely have the car with vehicle ID XXX" is useless
66
Quantitative Ways of Modeling Inter-source Overlap
Coverage & overlap statistics [Koller et al., 97]:
– S1 has 80% of the movies made after 1960, while S2 has 60% of the movies
– S1 has 98% of the movies stored in S2
From these we can compute cardinalities of unions given intersections.
Who gives these statistics? A third party, or probing.
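Computing union cardinalities from coverage and overlap is just inclusion-exclusion. A small sketch; the 50% overlap figure below is an illustrative assumption, not from the slides.

```python
# Sketch: with coverage fractions for S1 and S2 and their overlap known,
# inclusion-exclusion gives the coverage of calling both sources, and the
# residual gain of calling S2 after S1 is its coverage minus the overlap.

def union_coverage(c1, c2, overlap):
    """Fraction of answers covered by S1 union S2."""
    return c1 + c2 - overlap

# Slide numbers: S1 covers 80% of the movies, S2 covers 60%; assume
# (illustratively) their overlap is 50% of the movies.
both = union_coverage(0.80, 0.60, 0.50)   # coverage of calling both
gain_of_s2_after_s1 = 0.60 - 0.50         # new tuples S2 adds after S1
```

This is why overlap statistics matter for plan optimality: here S2 adds only 10% new answers after S1, which may not justify the extra source call.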
67
Extremes of Laziness in Data Integration
Fully query-time II (blue sky for now):
– Get a query from the user on the mediator schema
– Go "discover" relevant data sources
– Figure out their "schemas"
– Map the schemas onto the mediator schema
– Reformulate the user query into data source queries
– Optimize and execute the queries
– Return the answers
Fully pre-fixed II:
– Decide on the only query you want to support
– Write a (Java)script that supports the query by accessing specific (pre-determined) sources, piping results (through known APIs) to specific other sources; examples include Google Map mashups
The most interesting action is in between: e.g. we may start with known sources and their known schemas, do hand-mapping, and support automated reformulation and optimization.
69
User queries refer to the mediated schema; data is stored in the sources in a local schema. Content descriptions provide the semantic mappings between the different schemas, and the mediator uses the descriptions to translate user queries into queries on the sources.
[Architecture diagram: the query and a preference/utility model go to a source fusion / query planning component (which needs to handle multiple objectives, service composition, and source quality & overlap), informed by source trust, ontologies, and source/service descriptions. The executor (which needs to handle source/network interruptions, runtime uncertainty, and replanning) issues source calls over services, webpages, structured data, and sensors (streaming data), returns answers, and sends replanning requests back to the planner; a monitor updates statistics via probing queries.]
Dimensions to Consider
How many sources are we accessing?
How autonomous are they?
Can we get meta-data about the sources?
Is the data structured?
–Discussion of soft joins (see the next slide)
Supporting just queries, or also updates?
Requirements: accuracy, completeness, performance, handling inconsistencies
Closed-world assumption vs. open world?
–See the local completeness slide below
Soft Joins: WHIRL [Cohen]
We can extend the notion of joins to “similarity joins,” where similarity is measured in terms of vector similarity over the text attributes. The join tuples are then output in ranked form, with rank proportional to the similarity.
Neat idea, but it does have implementation difficulties:
–Most tuples in the cross-product will have non-zero similarities, so we need query processing that produces just the highly ranked tuples
Other similarity/distance metrics may also be used
–E.g., edit distance
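A toy similarity join in the spirit of WHIRL can be sketched as below. It ranks joined tuples by cosine similarity over a text attribute instead of requiring exact key equality; the relations and the similarity threshold are made up for illustration, and real WHIRL uses TF-IDF weighting and top-k query processing rather than scoring the full cross-product.

```python
# A "soft" join: join two relations on text attributes, ranking the output
# tuples by cosine similarity rather than demanding exact equality.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two strings viewed as bags of words."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ta[w] * tb[w] for w in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

def soft_join(r1, r2, threshold=0.0):
    """Return (t1, t2, sim) triples, ranked by descending similarity."""
    scored = [(t1, t2, cosine(t1[0], t2[0])) for t1 in r1 for t2 in r2]
    return sorted((x for x in scored if x[2] > threshold), key=lambda x: -x[2])

movies  = [("The Godfather Part II",), ("Annie Hall",)]
reviews = [("godfather part ii review",), ("review of annie hall",)]
for t1, t2, sim in soft_join(movies, reviews, threshold=0.3):
    print(t1[0], "~", t2[0], round(sim, 2))
```

Note the implementation difficulty the slide mentions shows up immediately: this sketch scores every pair in the cross-product, which is exactly what serious soft-join processing must avoid.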
Local Completeness Information
If sources are incomplete, we need to look at each one of them. Often, sources are locally complete.
–E.g., Movie(title, director, year) complete for years after 1960, or for American directors.
Question: given a set of local completeness statements, is a query Q’ a complete answer to Q?
(True source contents vs. advertised description: the guarantees come from LCW statements and inter-source comparisons.)
Problems:
1. Sources may not be interested in giving these statements, so they need to be learned, and they are hard to learn!
2. Even if sources are willing to give them, there may not be any “big enough” LCWs: saying “I definitely have the car with vehicle ID XXX” is useless.
Overview
User queries refer to the mediated schema. Data is stored in the sources in a local schema. Content descriptions provide the semantic mappings between the different schemas. The mediator uses the descriptions to translate user queries into queries on the sources.
(Schema: template for the stored data)
Source Descriptions
Contain all the meta-information about the sources:
–Logical source contents (books, new cars)
–Source capabilities (can answer SQL queries)
–Source completeness (has all books)
–Physical properties of the source and network
–Statistics about the data (as in an RDBMS)
–Source reliability
–Mirror sources
–Update frequency
Source Access
How do we get the “tuples”?
–Many sources give “unstructured” output: some are inherently unstructured, while others “englishify” their database-style output
–Need to unwrap the output from the sources to get tuples
–“Wrapper building”/information extraction
–Can be done manually or semi-automatically
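A hand-written wrapper for one such "englishified" source might look like the sketch below. The page format, field names, and sample listings are invented for illustration; in practice wrappers are built per source or learned, as later wrapper-induction work does.

```python
# A toy wrapper: turn a source's englishified HTML-ish listing back into
# relational (title, year, director) tuples using a hand-crafted pattern.
import re

PATTERN = re.compile(
    r"<li>(?P<title>.+?) \((?P<year>\d{4})\), directed by (?P<director>.+?)</li>"
)

def wrap(page):
    """Extract (title, year, director) tuples from one result page."""
    return [(m["title"], int(m["year"]), m["director"])
            for m in PATTERN.finditer(page)]

page = """
<ul>
<li>Annie Hall (1977), directed by Woody Allen</li>
<li>Chinatown (1974), directed by Roman Polanski</li>
</ul>
"""
print(wrap(page))
```

The fragility is the point: any change in the source's page layout silently breaks the pattern, which is why wrapper maintenance and learning are research problems in their own right.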
Source Fusion/Query Planning
Accepts the user query and generates a plan for accessing sources to answer it
–Needs to handle tradeoffs between cost and coverage
–Needs to handle source access limitations
–Needs to reason about source quality/reputation
Monitoring/Execution
Takes the query plan and executes it on the sources
–Needs to handle source latency
–Needs to handle transient/short-term network outages
–Needs to handle source access limitations
–May need to re-schedule or re-plan
Integrator vs. DBMS (reprise)
No common schema
–Sources with heterogeneous schemas
–Semi-structured sources
Legacy sources
–Not relationally complete
–Variety of access/processing limitations
Autonomous sources
–No central administration
–Uncontrolled source content overlap
–Lack of source statistics
Tradeoffs between query plan cost, coverage, quality, etc.
–Multi-objective cost models
Unpredictable run-time behavior
–Makes query execution hard
Presence of “services”
–Need to “compose” services
Deductive Databases
Relations are viewed as predicates. Interrelations between relations are expressed as “datalog” rules
–(Horn clauses, without function symbols)
–Queries correspond to datalog programs
“Conjunctive queries” are datalog programs with a single non-recursive rule (they correspond to SPJ queries in SQL)
  Emprelated(Name, Dname) :- Empdep(Name, Dname)
  Emprelated(Name, Dname) :- Empdep(Name, D1), Emprelated(D1, Dname)
Here Empdep is an EDB predicate (stored facts) and Emprelated is an IDB predicate (derived by the rules).
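The recursive program above computes the transitive closure of Empdep, and a naive bottom-up (fixpoint) evaluation of it can be sketched in a few lines. The Empdep facts below are made-up sample data.

```python
# Naive bottom-up evaluation of the slide's recursive datalog program:
# Emprelated is the transitive closure of the EDB relation Empdep.

def emprelated(empdep):
    """Apply both rules repeatedly until a fixpoint is reached."""
    closure = set(empdep)                        # rule 1: Empdep facts carry over
    while True:
        new = {(n, d) for (n, d1) in empdep      # rule 2: join Empdep with the
                      for (d1b, d) in closure    # IDB facts derived so far
                      if d1 == d1b} - closure
        if not new:                              # fixpoint: nothing new derivable
            return closure
        closure |= new

empdep = {("ann", "sales"), ("sales", "retail"), ("retail", "corp")}
print(sorted(emprelated(empdep)))
```

With these facts the fixpoint adds ("ann", "retail"), ("sales", "corp"), and then ("ann", "corp"); smarter evaluators use semi-naive evaluation so each round only joins against the newly derived facts.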
BibFinder Case Study
See the BibFinder slides.
What to Optimize
Traditional DB optimizers compare candidate plans purely in terms of the time they take to produce all answers to a query. In integration scenarios, the optimization is “multi-objective”:
–Total time of execution
–Cost to first few tuples
  Often, users are happier with plans that give the first tuples faster
–Coverage of the plan
  Full coverage is no longer an iron-clad requirement: there are too many relevant sources, with uncontrolled overlap between them, so we can’t call them all
–(Also robustness, access premiums, …)
Source Statistics Needed
The size of the source relation and attributes
–The length and cardinality of the attributes
–The cardinality of the source relation
The feasible access patterns for the source
The network bandwidth and latency between the source and the integration system
Coverage of source S for a relation R, denoted P(S|R)
–Overlap between sources: P(S1..Sk | R), taken over the extension of R
Getting the Statistics
Since the sources are autonomous, the mediator needs to actively gather the relevant statistics
Learning bandwidth and latency statistics
–[Gruser et al. 2000] use neural networks to learn the response-time patterns of web sources
–Can learn the variation of response times across the days of the week and the hours of the day
Learning coverages and overlaps
–[Nie et al. 2002] use itemset mining techniques to learn compact statistics about the spread of the mediator schema relations across the accessible sources
–Can trade quality of the statistics for reduced space consumption
Using LCW Rules to Minimize Plans (Emerac) [Lambrecht & Kambhampati, 99; Duschka, AAAI-97]
Basic idea:
–If reformulation of Q leads to a union of conjunctive plans P1 ∪ P2 ∪ … ∪ Pk,
–then, if P1 is “complete” for Q (under the given LCW information), we can minimize the reformulation by pruning P2 … Pk:
–P1, rewritten using its LCW statements, contains P1 ∪ P2 ∪ … ∪ Pk
For recursive plans (obtained when the sources have access restrictions):
–We are allowed to remove a rule r from a plan P if the “complete” version of r is already contained in P - r
Example [Levy, 96; Duschka, 97; Lambrecht & Kambhampati, 99]
Query:  Q(T,D) :- M(T,D,Y) & Sh(T,Th,C,H) & C = “Seattle”
Plan 1: Q’(T,D) :- S1(T,D,Y) & S2(T,Th,C,H) & C = “Seattle”
Plan 2: Q’’(T,D) :- M(T,D,Y) & S2(T,Th,C,H) & C = “Seattle”
Advertised descriptions:
  S1(T,D,Y) :- M(T,D,Y)
  S2(T,Th,C,H) :- Sh(T,Th,C,H)
LCW guarantee (true source contents):
  S2(T,Th,C,H) :- Sh(T,Th,C,H) & C = “Seattle”
Minimization is harder in the presence of source limitations.
Example [Levy, 96; Duschka, 97; Lambrecht & Kambhampati, 99]
Query: Q(T,D) :- M(T,D,Y) & Sh(T,Th,C,H) & C = “Seattle”
Plan:  Q’(T,D) :- S1(T,D,Y) & S2(T,Th,C,H) & C = “Seattle”
The part of the plan that is complete:
  Q’(T,D) :- M(T,D,Y) & Sh(T,Th,C,H) & C = “Seattle” & Y > 1960
Advertised descriptions:
  S1(T,D,Y) :- M(T,D,Y)
  S2(T,Th,C,H) :- Sh(T,Th,C,H)
LCW guarantee (true source contents):
  S1(T,D,Y) :- M(T,D,Y) & Y > 1960
Does the complete part contain the plan?
Minimization is harder in the presence of source limitations.
Approaches for Handling Multiple Objectives
Do staged optimization [Information Manifold]
–Optimize for coverage, and then for cost
Do joint optimization
–Generate all the non-dominated solutions (the Pareto set)
–Combine the objectives into a single metric, e.g. [Havasu/Multi-R]:
  Cost increases additively; coverage decreases multiplicatively
  utility(p) = w*log(coverage(p)) - (1-w)*cost(p)
  The logarithm makes the coverage term additive [Candan 01]
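The combined utility metric above is easy to exercise directly. In the sketch below, the plan names, coverages, and costs are invented; the point is only that sweeping the weight w moves the winner between a cheap low-coverage plan and an expensive high-coverage one.

```python
# The slide's single-metric combination of cost and coverage:
# utility(p) = w*log(coverage(p)) - (1-w)*cost(p). Coverage composes
# multiplicatively across sub-plans, so the log makes it additive like cost.
import math

def utility(coverage, cost, w=0.5):
    """Higher is better; w trades off coverage against execution cost."""
    return w * math.log(coverage) - (1 - w) * cost

# Illustrative plans as (coverage, cost) pairs; numbers are made up:
plans = {"cheap": (0.30, 1.0), "thorough": (0.95, 3.0)}
for w in (0.2, 0.8):  # low w favors cheap plans, high w favors coverage
    best = max(plans, key=lambda name: utility(*plans[name], w=w))
    print(f"w={w}: best plan is {best}")  # cheap at w=0.2, thorough at w=0.8
```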
Learning Coverage/Overlap Statistics: StatMiner (Havasu) [Nie et al. 2002]
Challenge: it is impractical to learn and store all the statistics for every query.
StatMiner: a threshold-based hierarchical association rule mining approach
–Learns statistics with respect to “query classes” rather than specific queries
  Defines query classes in terms of attribute-value hierarchies
  Discovers frequent query classes and limits statistics to them
–Maps a user’s query onto its closest ancestor class, and uses the statistics of the mapped class to estimate the statistics of the query
–Handles the efficiency/accuracy tradeoff by adjusting the thresholds
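The "map to the closest ancestor class" step can be sketched as a walk up the attribute-value hierarchy. The hierarchy, class names, and coverage numbers below are invented; StatMiner itself mines which classes are frequent and what their statistics are.

```python
# StatMiner-style lookup sketch: statistics are stored only for frequent
# query classes; a query maps to its closest ancestor class that has them.

# Hypothetical attribute-value hierarchy (child -> parent):
parent = {"SIGMOD": "Databases", "VLDB": "Databases",
          "Databases": "CS", "AI": "CS", "CS": "All"}

# Coverage statistics kept only for frequent classes (made-up numbers):
stats = {"Databases": {"DBLP": 0.9, "CSB": 0.7}, "All": {"DBLP": 0.3}}

def lookup(query_class):
    """Walk up the hierarchy to the closest ancestor with stored statistics."""
    c = query_class
    while c not in stats:
        c = parent[c]   # assumes the root class always has statistics
    return c, stats[c]

print(lookup("SIGMOD"))  # answered from the "Databases" class statistics
print(lookup("AI"))      # falls all the way back to the root class
```

Raising the frequency thresholds shrinks the `stats` table (cheaper storage) but forces more queries to fall back to coarse ancestor classes (less accurate estimates), which is the tradeoff the slide describes.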
Alternatives to Statistics-Based Approaches
Avoid cost-based optimization and depend on “qualitative,” easier-to-get statistics
–Instead of coverage/overlap statistics, use:
  LCW characterizations
  Mirror source characterizations
  (E.g. RAZOR, INFORMATION MANIFOLD, EMERAC)
–Instead of tuple transfer statistics, use:
  The “bound is easier” assumption
  High-traffic binding pattern information
  (E.g. INFORMATION MANIFOLD, EMERAC, [Yerneni et al.])
Staged Optimization of Cost & Coverage: Information Manifold [Levy et al.; VLDB 96]
For a query Q(.) :- R1(), R2(), build one bucket of relevant sources per subgoal (e.g. {S11, S12, S13} for R1 and {S21, S22, S23} for R2). The Cartesian product of the buckets, followed by a “containment” check, yields the candidate plans (S11&S21, S11&S22, …, S13&S23).
–With n subgoals and m sources per bucket, there can be m^n distinct plans
Then rank and choose the top N plans [Doan et al.; ICDE 02], and find a physical plan for each selected plan.
Problem: cost and coverage are interrelated.
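The bucket enumeration step is just a Cartesian product; the sketch below reproduces the slide's 3x3 example (source names taken from the slide, the containment check omitted).

```python
# Bucket-style candidate-plan enumeration: one bucket of relevant sources per
# query subgoal; the Cartesian product of the buckets gives m^n candidate
# plans (here 3^2 = 9), each then subjected to a containment check.
from itertools import product

buckets = {"R1": ["S11", "S12", "S13"],   # sources relevant to subgoal R1
           "R2": ["S21", "S22", "S23"]}   # sources relevant to subgoal R2

candidate_plans = list(product(*buckets.values()))
print(len(candidate_plans))   # 9 distinct plans
print(candidate_plans[0])     # ('S11', 'S21')
```

The exponential m^n blow-up is visible directly: adding one more subgoal with its own 3-source bucket triples the number of candidate plans.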
Joint Optimization of Cost & Coverage: Havasu/Multi-R [Nie et al. 2001]
–Search in the space of “parallel” plans
  Each subplan in the parallel plan contains (a subset of) the sources relevant for a subgoal
–Dynamic programming is used to search among the subgoal orders
–A greedy approach is used to create a subplan for a particular subgoal
  Keep adding sources until the utility (defined in terms of cost and coverage) starts to worsen
–Capable of generating plans for a variety of cost/coverage tradeoffs (by increasing the relative weight of coverage)
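The greedy subplan construction can be sketched as below. The per-source coverages, the cost model (a fixed cost per source call), the independence assumption for combining coverages, and the weight are all illustrative choices, not Havasu's actual parameters.

```python
# Greedy subplan construction sketch: for one subgoal, repeatedly add the
# source that most improves utility; stop as soon as no addition helps.
import math

def subplan_utility(sources, cost_per_call=0.1, w=0.6):
    """utility = w*log(coverage) - (1-w)*cost, assuming independent sources."""
    coverage = 1.0 - math.prod(1.0 - c for c in sources.values())
    return w * math.log(coverage) - (1 - w) * cost_per_call * len(sources)

def greedy_subplan(candidates):
    chosen = {}
    while len(chosen) < len(candidates):
        # try adding each remaining source; keep the best-scoring extension
        trials = [dict(chosen, **{n: c})
                  for n, c in candidates.items() if n not in chosen]
        best = max(trials, key=subplan_utility)
        if chosen and subplan_utility(best) <= subplan_utility(chosen):
            break                     # utility started to worsen: stop
        chosen = best
    return chosen

# Coverage of each candidate source for the subgoal (made-up numbers):
print(greedy_subplan({"S1": 0.7, "S2": 0.5, "S3": 0.1}))
```

With these numbers the plan picks S1, then S2 (the coverage gain outweighs the extra call), and stops before the nearly redundant S3, which is the "add sources until utility worsens" behavior the slide describes.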
Techniques for Optimizing Response Time for First Tuples
Staged approach: generate plans based on other objectives, and post-process them to improve their response time for the first k tuples
–The typical idea is to replace asymmetric operators with symmetric ones, e.g. replace nested-loop join with symmetric hash join
–E.g. Telegraph, Tukwila, Niagara
–Problem: access limitations between sources may disallow symmetric operations
–Solution: use a joint optimization approach (e.g. Havasu) and consider the cost of first tuples as a component of plan utility
[Viglas & Naughton, 2002] describe approaches for characterizing the “rate” of answer delivery offered by various query plans.
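A minimal symmetric hash join looks like the sketch below: both inputs are hashed incrementally, so results stream out as soon as matching tuples have arrived from either side, instead of waiting for one whole input as a nested-loop or classic hash join would. The relations and the round-robin interleaving are made up for illustration; real operators are driven by whichever source actually delivers tuples next.

```python
# Symmetric hash join sketch: build a hash table on BOTH inputs and probe
# the opposite table on every arrival, emitting join results eagerly.
from itertools import zip_longest

def interleave(left, right):
    """Alternate tuples from the two streams, tagging each with its side."""
    pairs = zip_longest((("L", t) for t in left), (("R", t) for t in right))
    for pair in pairs:
        for item in pair:
            if item is not None:
                yield item

def symmetric_hash_join(left, right):
    """Join two (key, value) streams on key, yielding results incrementally."""
    tables = {"L": {}, "R": {}}
    other = {"L": "R", "R": "L"}
    for side, (key, val) in interleave(left, right):
        tables[side].setdefault(key, []).append(val)    # build on this side
        for match in tables[other[side]].get(key, []):  # probe the other side
            yield (key, val, match) if side == "L" else (key, match, val)

movies  = [("m1", "Annie Hall"), ("m2", "Chinatown")]
reviews = [("m2", "4 stars"), ("m1", "5 stars")]
print(list(symmetric_hash_join(movies, reviews)))
```

Note the first joined tuple is produced after only three input arrivals, before either input is exhausted; that early delivery is exactly the first-tuples property the slide is after.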
Integrating Services
Sources can be “services” rather than “data repositories”
–E.g. Amazon as a composite service for book buying
–The separating line is somewhat thin
Handling services:
–Description (API; I/O spec): WSDL
–Composition: planning, in general
–Execution: data-flow architectures
–See the next part
Impact of “X”-Standards on Integration (XML/XQuery/RDF)
Impact of XML on Integration
If and when all sources accept XQueries and exchange data in XML format, then the mediator can:
–Accept user queries in XQuery
–Access sources using XQuery
–Get data back in XML format
–Merge the results and send them to the user in XML format
How about now?
–Sources can use XML adapters (middleware)
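The "merge results and send them to the user in XML" step can be sketched as below. The source responses are stubbed strings standing in for real XQuery answers, and the element and attribute names are invented.

```python
# Sketch of the all-XML mediation path: collect XML fragments from each
# source and union their result elements under one root for the user.
import xml.etree.ElementTree as ET

# Stubbed XML responses, as if each source had answered an XQuery:
source_responses = [
    "<results><book title='Learning XML' price='30'/></results>",
    "<results><book title='Learning XML' price='28'/></results>",
]

def merge(responses):
    """Union the child elements of every source's result under one root."""
    merged = ET.Element("results")
    for text in responses:
        merged.extend(ET.fromstring(text))
    return ET.tostring(merged, encoding="unicode")

print(merge(source_responses))
```

Even in this tiny example the harder integration problems remain: the two `<book>` elements describe the same book at different prices, so schema mapping and record linkage do not go away just because the transport format is XML.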
Services
Sources can be “services” rather than “data repositories” [SWORD, WWW-2002]
–The separating line is thin…
Handling services:
–Description (API; I/O spec): WSDL
–Composition: planning, in general
–Execution: data-flow architectures