CSE 636 Data Integration
Data Integration Approaches
2 Virtual Integration Architecture
Leave the data in the sources.
When a query comes in:
– Determine the sources relevant to the query
– Break down the query into sub-queries for the sources
– Get the answers from the sources, filter them if needed, and combine them appropriately
Data is fresh.
Otherwise known as On-Demand Integration.
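The three run-time steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `SOURCES`, the `SOURCE_GENRES` metadata, and the `answer` function are hypothetical names, and a real mediator would talk to wrappers rather than to Python lists.

```python
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str, int, str]  # (title, dir, year, genre)

# Hypothetical sources; the data stays "in the sources" and is read on demand.
SOURCES: Dict[str, List[Row]] = {
    "S1": [("Film A", "Dir A", 1977, "Comedy")],
    "S2": [("Film B", "Dir B", 1979, "Horror"),
           ("Film A", "Dir A", 1977, "Comedy")],
}

# Hypothetical source descriptions: the genres each source may contain.
SOURCE_GENRES: Dict[str, Set[str]] = {
    "S1": {"Comedy"},
    "S2": {"Comedy", "Horror"},
}

def answer(genre: str) -> List[Row]:
    """Answer 'all movies of a given genre' over the virtual global schema."""
    # 1. Determine the sources relevant to the query.
    relevant = [s for s, genres in SOURCE_GENRES.items() if genre in genres]
    # 2. Send a sub-query (here: a simple filter) to each relevant source.
    partial = [[row for row in SOURCES[s] if row[3] == genre] for s in relevant]
    # 3. Combine the answers: union with duplicate elimination.
    return sorted({row for rows in partial for row in rows})
```

Only `S2` is contacted for a horror query, which is the pruning that good source descriptions make possible.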
3 Virtual Integration Architecture
[Figure: mediator architecture. Labels: End User, Query, Result, Mediator, Global Schema, Wrapper, Local Schema, Data Source, XML, Web Services. Design-Time: Mediation Language, Mapping Tool. Run-Time: Query Reformulation, Optimization & Execution.]
9 Virtual Integration Approaches
Dimensions to consider:
– How many sources are we accessing?
– How autonomous are they?
– Meta-data about sources?
– Is the data structured?
– Queries or also updates?
– Requirements: accuracy, completeness, performance, handling inconsistencies.
– Closed world assumption vs. open world?
10 Mediation Languages
Source relations:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  Authors(ISBN, FirstName, LastName)
  BookCategories(ISBN, Category)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  Artists(ASIN, ArtistName, GroupName)
  CDCategories(ASIN, Category)
Global Schema:
  CD(ASIN, Title, Genre, …)
  Artist(ASIN, Name, …)
[Figure label between the two: Logic]
11 Desiderata from Source Descriptions
– Expressive power: distinguish between sources with closely related data, and hence be able to prune access to irrelevant sources.
– Easy addition: make it easy to add new data sources.
– Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.
12 Reformulation Problem
Given:
– A query Q posed over the global schema
– Descriptions of the data sources
Find: a query Q’ over the data source relations, such that:
– Q’ provides only correct answers to Q, and
– Q’ provides all possible answers to Q given the sources.
13 Languages for Schema Mapping
[Figure: a query Q over the mediator's global (mediated) schema is rewritten into a query Q’ over the local schemas of the sources. Mapping languages shown: GAV, LAV, GLAV.]
14 Global-as-View (GAV)
Global Schema:
  Movie(title, dir, year, genre)
  Schedule(cinema, title, time)
Sources: S1(title, dir, year, genre), S2(title, dir, year, genre), S3(title, dir), S4(title, year, genre)
Integrating View:
  CREATE VIEW Movie AS
    SELECT * FROM S1
    UNION
    SELECT * FROM S2
    UNION
    SELECT S3.title, S3.dir, S4.year, S4.genre
    FROM S3, S4
    WHERE S3.title = S4.title
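The integrating view above can be tried out directly. The sketch below loads the view into an in-memory SQLite database through Python's `sqlite3` module; the sample rows are made up for illustration, and SQLite here merely stands in for a mediator that would unfold the view into queries against remote sources.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE S1(title, dir, year, genre);
CREATE TABLE S2(title, dir, year, genre);
CREATE TABLE S3(title, dir);
CREATE TABLE S4(title, year, genre);

-- Hypothetical sample rows.
INSERT INTO S1 VALUES ('Film A', 'Dir A', 1977, 'Comedy');
INSERT INTO S2 VALUES ('Film B', 'Dir B', 1979, 'Horror');
INSERT INTO S3 VALUES ('Film C', 'Dir A');
INSERT INTO S4 VALUES ('Film C', 1979, 'Comedy');

-- The global relation Movie is defined as a view over the sources (GAV).
CREATE VIEW Movie AS
  SELECT * FROM S1
  UNION
  SELECT * FROM S2
  UNION
  SELECT S3.title, S3.dir, S4.year, S4.genre
  FROM S3, S4
  WHERE S3.title = S4.title;
""")

# A user query over the global schema; the engine unfolds the view for us.
comedies = con.execute(
    "SELECT title FROM Movie WHERE genre = 'Comedy' ORDER BY title"
).fetchall()
```

`Film C` appears in the answer only because the view joins S3 and S4 on title, which is exactly the knowledge the GAV mapping encodes.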
15 Global-as-View: Example 2
Global Schema:
  Movie(title, dir, year, genre)
  Schedule(cinema, title, time)
Sources: S1(title, dir, year), S2(title, dir, genre)
Integrating View:
  CREATE VIEW Movie AS
    SELECT title, dir, year, NULL FROM S1
    UNION
    SELECT title, dir, NULL, genre FROM S2
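A small experiment shows what the NULL padding does to queries. This is a sketch with made-up rows, run in SQLite through Python's `sqlite3`; the column aliases (`AS genre`, `AS year`) are added only so the view has well-defined column names.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE S1(title, dir, year);
CREATE TABLE S2(title, dir, genre);

-- Hypothetical rows: both sources describe the same movie.
INSERT INTO S1 VALUES ('Film A', 'Dir A', 1977);
INSERT INTO S2 VALUES ('Film A', 'Dir A', 'Comedy');

CREATE VIEW Movie AS
  SELECT title, dir, year, NULL AS genre FROM S1
  UNION
  SELECT title, dir, NULL AS year, genre FROM S2;
""")

# The view holds two half-filled tuples rather than one complete tuple.
row_count = con.execute("SELECT COUNT(*) FROM Movie").fetchone()[0]

# A query that needs year and genre together finds nothing.
hits = con.execute(
    "SELECT title FROM Movie WHERE year = 1977 AND genre = 'Comedy'"
).fetchall()
```

No single tuple of the view carries both the year and the genre, so the combined condition can never be satisfied, even though the sources jointly know the answer.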
16 Global-as-View: Example 3
Global Schema:
  Movie(title, dir, year, genre)
  Schedule(cinema, title, time)
Source: S4(cinema, genre)
Integrating Views:
  CREATE VIEW Movie AS
    SELECT NULL, NULL, NULL, genre FROM S4
  CREATE VIEW Schedule AS
    SELECT cinema, NULL, NULL FROM S4
But what if we want to find which cinemas are playing comedies?
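The question can be checked concretely. The sketch below (SQLite via Python's `sqlite3`, with one made-up row and column aliases added for naming) poses the comedy query over the global schema and compares it with what the source itself knows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE S4(cinema, genre);
INSERT INTO S4 VALUES ('Rex', 'Comedy');  -- hypothetical row

CREATE VIEW Movie AS
  SELECT NULL AS title, NULL AS dir, NULL AS year, genre FROM S4;
CREATE VIEW Schedule AS
  SELECT cinema, NULL AS title, NULL AS time FROM S4;
""")

# Query over the global schema: which cinemas are playing comedies?
via_global = con.execute("""
    SELECT S.cinema
    FROM Movie M, Schedule S
    WHERE M.title = S.title AND M.genre = 'Comedy'
""").fetchall()

# The source itself holds exactly this information.
via_source = con.execute(
    "SELECT cinema FROM S4 WHERE genre = 'Comedy'"
).fetchall()
```

The join on `title` compares NULL with NULL, which is never true in SQL, so the association between cinema and genre that S4 records is lost once the source is split across the two global relations.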
17 Global-as-View Summary
– Query reformulation boils down to view unfolding.
– Very easy conceptually.
– Can build hierarchies of global schemas.
– You sometimes lose information. Not always natural.
– Adding sources is hard: you need to consider all other sources that are available.
18 Local-as-View (LAV)
[Figure: mediator with a global (mediated) schema over Sources 1 to 5, each with a local schema.]
Global Schema:
  Book(ISBN, Title, Genre, Year)
  Author(ISBN, Name)
Local Schemas:
  R1(ISBN, Title, Name): books before 1970
  R5(ISBN, Title): humor books
Source Views:
  CREATE VIEW R1 AS
    SELECT B.ISBN, B.Title, A.Name
    FROM Book B, Author A
    WHERE A.ISBN = B.ISBN AND B.Year < 1970
  CREATE VIEW R5 AS
    SELECT B.ISBN, B.Title
    FROM Book B
    WHERE B.Genre = 'Humor'
19 Query Reformulation
Global Schema: Book(ISBN, Title, Genre, Year), Author(ISBN, Name)
Sources: R1(ISBN, Title, Name), books before 1970; R5(ISBN, Title), humor books
Query: Find authors of humor books
Plan: R1 Join R5
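The plan can be simulated to see what it returns. In the sketch below (SQLite via Python's `sqlite3`), the `Book` and `Author` tables play the role of the real world, which a LAV mediator never sees; `R1` and `R5` are materialized from the source views, and all rows and names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical "real world" over the global schema.
CREATE TABLE Book(ISBN, Title, Genre, Year);
CREATE TABLE Author(ISBN, Name);
INSERT INTO Book VALUES ('1', 'Old Jokes', 'Humor', 1950);
INSERT INTO Book VALUES ('2', 'New Jokes', 'Humor', 1990);
INSERT INTO Book VALUES ('3', 'Old Star Maps', 'Science', 1965);
INSERT INTO Author VALUES ('1', 'Ann'), ('2', 'Bob'), ('3', 'Cy');

-- Source contents, as described by the LAV views.
CREATE TABLE R1 AS
  SELECT B.ISBN, B.Title, A.Name FROM Book B, Author A
  WHERE A.ISBN = B.ISBN AND B.Year < 1970;
CREATE TABLE R5 AS
  SELECT B.ISBN, B.Title FROM Book B WHERE B.Genre = 'Humor';
""")

# The plan from the slide: R1 Join R5 (on ISBN), keeping the author name.
plan = con.execute(
    "SELECT R1.Name FROM R1 JOIN R5 ON R1.ISBN = R5.ISBN ORDER BY 1"
).fetchall()

# What the query would return if the global relations were available.
truth = con.execute("""
    SELECT A.Name FROM Book B JOIN Author A ON A.ISBN = B.ISBN
    WHERE B.Genre = 'Humor' ORDER BY 1
""").fetchall()
```

Every answer of the plan is correct, but the author of the 1990 humor book is missing because no source exposes author names for books from 1970 onward. The plan returns all answers obtainable from these sources, not all answers in the world.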
20 Query Reformulation
Global Schema: Book(ISBN, Title, Genre, Year), Author(ISBN, Name)
Sources: R1(ISBN, Title, Name), books before 1970; R5(ISBN, Title), humor books
Query: Find authors of humor books before 1960
Plan: Can’t do it!
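One way to see why no plan exists: neither R1 nor R5 exposes the Year attribute, so two different worlds can produce identical source contents and still disagree on the answer. The sketch below (SQLite via Python's `sqlite3`; the helper `world` and its single invented book are hypothetical) builds two such worlds.

```python
import sqlite3

def world(year):
    """Build a hypothetical world with one humor book published in `year`.

    Returns the contents of R1 and R5 plus the true answer to the query
    'authors of humor books before 1960'.
    """
    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE Book(ISBN, Title, Genre, Year);
    CREATE TABLE Author(ISBN, Name);
    INSERT INTO Author VALUES ('1', 'Ann');
    """)
    con.execute("INSERT INTO Book VALUES ('1', 'Old Jokes', 'Humor', ?)", (year,))
    r1 = con.execute("""
        SELECT B.ISBN, B.Title, A.Name FROM Book B, Author A
        WHERE A.ISBN = B.ISBN AND B.Year < 1970
    """).fetchall()
    r5 = con.execute(
        "SELECT ISBN, Title FROM Book WHERE Genre = 'Humor'"
    ).fetchall()
    answer = con.execute("""
        SELECT A.Name FROM Book B, Author A
        WHERE A.ISBN = B.ISBN AND B.Genre = 'Humor' AND B.Year < 1960
    """).fetchall()
    return r1, r5, answer

r1_a, r5_a, answer_a = world(1955)
r1_b, r5_b, answer_b = world(1965)
```

The mediator sees the same R1 and R5 in both worlds, so any plan returns the same result in both. Returning Ann would be wrong in the 1965 world, hence no plan can guarantee a correct, non-empty answer.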
21 Local-as-View: Example 1
Global Schema:
  Movie(title, dir, year, genre)
  Schedule(cinema, title, time)
Sources: S1(title, dir, year, genre), S3(title, dir), S5(title, dir, year)
Source Views:
  CREATE SOURCE S1 AS
    SELECT * FROM Movie
  CREATE SOURCE S3 AS
    SELECT title, dir FROM Movie
  CREATE SOURCE S5 AS
    SELECT title, dir, year
    FROM Movie
    WHERE year > 1960 AND genre = 'Comedy'
22 Local-as-View: Example 2
Global Schema:
  Movie(title, dir, year, genre)
  Schedule(cinema, title, time)
Source: S4(cinema, genre)
Source Views:
  CREATE SOURCE S4 AS
    SELECT cinema, genre
    FROM Movie M, Schedule S
    WHERE M.title = S.title
Now if we want to find which cinemas are playing comedies, there is hope!
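The "hope" can be made concrete. In this sketch (SQLite via Python's `sqlite3`), the `Movie` and `Schedule` tables stand for the real world that the mediator cannot query directly, `S4` is materialized from its LAV description, and all rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical "real world" over the global schema.
CREATE TABLE Movie(title, dir, year, genre);
CREATE TABLE Schedule(cinema, title, time);
INSERT INTO Movie VALUES ('M1', 'D1', 1999, 'Comedy'), ('M2', 'D2', 2001, 'Drama');
INSERT INTO Schedule VALUES ('Rex', 'M1', '20:00'), ('Odeon', 'M2', '21:00');

-- Source S4, described as a view over the global schema (LAV).
CREATE TABLE S4 AS
  SELECT S.cinema, M.genre
  FROM Movie M, Schedule S
  WHERE M.title = S.title;
""")

# Rewriting of the user query that uses only the source.
from_source = con.execute(
    "SELECT cinema FROM S4 WHERE genre = 'Comedy'"
).fetchall()

# The same question posed over the global relations, for comparison.
over_global = con.execute("""
    SELECT S.cinema FROM Movie M, Schedule S
    WHERE M.title = S.title AND M.genre = 'Comedy'
""").fetchall()
```

Because the source description keeps the join between Movie and Schedule, a selection on S4 is enough to answer the query, which the GAV views with NULL padding could not do.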
23 Local-as-View Summary
– Very flexible. You have the power of the entire query language to define the contents of the source.
– Hence, can easily distinguish between contents of closely related sources.
– Adding sources is easy: they’re independent of each other.
– Query reformulation: answering queries using views!
24 The General Problem
Given a set of views V1,…,Vn and a query Q, can we answer Q using only the answers to V1,…,Vn?
Many, many papers on this problem.
The best performing algorithm: the MiniCon algorithm (Pottinger & Halevy, VLDB 2000).
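A basic building block for this problem is testing whether one conjunctive query is contained in another, which can be done by searching for a homomorphism between them. The sketch below is not MiniCon; it is a minimal containment test for conjunctive queries without comparisons, and the query encoding (variables as strings starting with `?`, queries as `(head, body)` pairs) is an assumption made for this example.

```python
def is_var(term):
    """Variables are strings starting with '?'; anything else is a constant."""
    return isinstance(term, str) and term.startswith("?")

def unify(args2, args1, mapping):
    """Extend `mapping` so that args2 maps onto args1, or return None."""
    mapping = dict(mapping)
    for a2, a1 in zip(args2, args1):
        if is_var(a2):
            if mapping.setdefault(a2, a1) != a1:
                return None
        elif a2 != a1:
            return None
    return mapping

def contained_in(q1, q2):
    """True if q1 is contained in q2, i.e. every answer of q1 is an answer of q2.

    Searches for a homomorphism from q2 to q1 that maps head to head and
    every body atom of q2 to some body atom of q1.
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False

    def search(i, mapping):
        if i == len(body2):
            return True
        pred, args = body2[i]
        for pred1, args1 in body1:
            if pred1 == pred and len(args1) == len(args):
                extended = unify(args, args1, mapping)
                if extended is not None and search(i + 1, extended):
                    return True
        return False

    start = unify(head2, head1, {})
    return start is not None and search(0, start)

# Query: titles of comedies. Candidate plan: read titles from a source
# S3(title, dir) defined as SELECT title, dir FROM Movie; below is the plan
# with the source view unfolded (its "expansion").
comedy_query = (("?t",), [("Movie", ("?t", "?d", "?y", "Comedy"))])
s3_expansion = (("?t",), [("Movie", ("?t", "?d1", "?y1", "?g1"))])

plan_is_sound = contained_in(s3_expansion, comedy_query)
plan_covers_query = contained_in(comedy_query, s3_expansion)
```

Here the plan over S3 covers every comedy but is not contained in the query, so it may return non-comedies and cannot be used as a correct rewriting on its own.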
25 Local Completeness Information
If sources are incomplete, we need to look at each one of them.
Often, sources are locally complete:
– Movie(title, director, year) is complete for years after 1960, or for American directors.
Question: given a set of local completeness statements, is a query Q’ a complete answer to Q?
26 Example
Movie(title, director, year): complete after 1960
Show(title, theater, city, hour)
Query: find movies (and directors) playing in Seattle:
  SELECT M.title, M.director
  FROM Movie M, Show S
  WHERE M.title = S.title AND city = 'Seattle'
Complete or not?
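One way to explore the question is to build a small counterexample. The sketch below (SQLite via Python's `sqlite3`) uses a hypothetical `WorldMovie` table for the real world, invented rows, and the added assumption that `Show` is complete; the source `Movie` satisfies the completeness statement yet lacks an older film.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical "real world" movies, never seen directly by the mediator.
CREATE TABLE WorldMovie(title, director, year);
INSERT INTO WorldMovie VALUES ('Old Film', 'D1', 1950), ('New Film', 'D2', 1980);

-- The source: complete after 1960, but it happens to miss the 1950 film.
CREATE TABLE Movie AS SELECT * FROM WorldMovie WHERE year > 1960;

-- Assumption for this sketch: Show is complete.
CREATE TABLE Show(title, theater, city, hour);
INSERT INTO Show VALUES ('Old Film', 'Rex', 'Seattle', '19:00'),
                        ('New Film', 'Rex', 'Seattle', '21:00');
""")

QUERY = """
    SELECT M.title, M.director
    FROM {movie} M, Show S
    WHERE M.title = S.title AND city = 'Seattle'
    ORDER BY 1
"""
from_source = con.execute(QUERY.format(movie="Movie")).fetchall()
from_world = con.execute(QUERY.format(movie="WorldMovie")).fetchall()
```

In this instance the answer from the source misses the 1950 film playing in Seattle. Since the query places no condition on the year, the completeness statement alone does not guarantee a complete answer.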
27 Example #2
Movie(title, director, year), Oscar(title, year)
Query: find directors whose movies won Oscars after 1965:
  SELECT M.director
  FROM Movie M, Oscar O
  WHERE M.title = O.title AND M.year = O.year AND O.year > 1965
Complete or not?
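The same kind of experiment can be run for this query. The sketch below (SQLite via Python's `sqlite3`) assumes that Movie is complete after 1960, as on the previous slide, and that Oscar is complete; neither assumption is stated on this slide, and `WorldMovie` and the rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical "real world" movies.
CREATE TABLE WorldMovie(title, director, year);
INSERT INTO WorldMovie VALUES ('Old Film', 'D1', 1950), ('New Film', 'D2', 1970);

-- Assumed source content: complete after 1960, missing the 1950 film.
CREATE TABLE Movie AS SELECT * FROM WorldMovie WHERE year > 1960;

-- Assumption for this sketch: Oscar is complete.
CREATE TABLE Oscar(title, year);
INSERT INTO Oscar VALUES ('Old Film', 1950), ('New Film', 1970);
""")

QUERY = """
    SELECT M.director
    FROM {movie} M, Oscar O
    WHERE M.title = O.title AND M.year = O.year AND O.year > 1965
    ORDER BY 1
"""
from_source = con.execute(QUERY.format(movie="Movie")).fetchall()
from_world = con.execute(QUERY.format(movie="WorldMovie")).fetchall()
```

The two answers agree here, and the reason generalizes under the stated assumptions: the conditions M.year = O.year and O.year > 1965 force every contributing Movie tuple to have a year after 1965, which lies inside the range where the source is complete. A single instance is only an illustration, not a proof.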