
1 CSE 636 Data Integration Overview Fall 2006

2 What is Data Integration?
The problem of providing
uniform (sources transparent to the user)
access to (queries, and eventually updates too)
multiple (even 2 is a problem!)
autonomous (integration must not affect the behavior of the sources)
heterogeneous (different data models, schemas)
structured (or at least semistructured)
data sources (not only databases)

3 Motivation
Enterprise data integration; web-site construction.
World-wide web:
– comparison shopping (Netbot, Junglee)
– portals integrating data from multiple sources
– XML integration
Science & culture:
– Medical genetics: integrating genomic data
– Astrophysics: monitoring events in the sky
– Environment: Puget Sound Regional Synthesis Model
– Culture: uniform access to all the cultural databases produced by different countries

4 Principal Dimensions of Data Integration
Virtual vs. materialized architecture
Access: query only, or query and update?
– updating is a problem similar to updating through views
– updates need distributed transactional services
Mediated schema: yes or no?
– A mediated schema requires schema integration and then query reformulation.
– Without a mediated schema, we lose some of the advantages of data integration.

5 Data Warehouse Architecture
Diagram labels: Data Source (three); ETL Tools (Extract-Transform-Load); Data Cleaning; Relational Database (Warehouse); OLAP / Decision Support; Data Cubes / Data Mining; Users; Applications


7 Virtual Integration Architecture
Leave the data in the sources.
When a query comes in:
– Determine the sources relevant to the query
– Break the query down into sub-queries for the sources
– Get the answers from the sources, filter them if needed, and combine them appropriately
Data is fresh.
Otherwise known as On Demand Integration.
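The three steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the sources are in-memory lists, the `SOURCES` catalog and its contents are hypothetical, and a "query" is just a set of wanted attributes plus one equality selection.

```python
# Hypothetical in-memory sources; each advertises the attributes it can provide.
SOURCES = {
    "A": {"attrs": {"Title", "Price"},
          "rows": [{"Title": "On the Road", "Price": 10.86}]},
    "B": {"attrs": {"Title", "Edition"},
          "rows": [{"Title": "On the Road", "Edition": "2nd"}]},
    "C": {"attrs": {"Album", "Studio"},
          "rows": [{"Album": "Kind of Blue", "Studio": "Columbia"}]},
}

def relevant_sources(wanted, key):
    """Step 1: keep sources that know the selection attribute and at least one wanted attribute."""
    return [name for name, src in SOURCES.items()
            if key in src["attrs"] and src["attrs"] & wanted]

def run_subquery(name, wanted, key, value):
    """Step 2: ask one source only for the attributes it has, filtering on the selection."""
    src = SOURCES[name]
    cols = (wanted & src["attrs"]) | {key}
    return [{c: row[c] for c in cols} for row in src["rows"]
            if row[key].lower() == value.lower()]

def answer(wanted, key, value):
    """Step 3: combine the partial answers that agree on the selection attribute."""
    combined = {}
    for name in relevant_sources(wanted, key):
        for row in run_subquery(name, wanted, key, value):
            combined.setdefault(row[key], {}).update(row)
    return list(combined.values())

result = answer({"Price", "Edition"}, "Title", "on the road")
```

Source C is never contacted because it cannot contribute to the query; a real mediator would make that decision from schema mappings rather than from attribute names alone.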

8 Virtual Integration Architecture
Diagram labels: End Users; Applications; Global Schema; Schema Mappings (three, marked Design-Time); Local Schema (three); Data Source (three)
Sources can be:
– Relational DBs
– Excel files
– Web sites
– Web services

9 Differences in:
– Names in schema
– Attribute grouping
– Coverage of databases
– Granularity and format of attributes
Inventory Database A
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  Authors(ISBN, FirstName, LastName)
  BookCategories(ISBN, Category)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  Artists(ASIN, ArtistName, GroupName)
  CDCategories(ASIN, Category)
Diagram label: Schema Mappings

10 Issues for Schema Mappings (Design-Time)
What formalisms can express them?
How do we create them?
Can we discover them somehow?
How do we use them?
Diagram labels: End Users; Applications; Global Schema; Schema Mappings (three); Local Schema (three); Data Source (three)

11 Virtual Integration Architecture (Run-Time)
Diagram labels: Query; Result; Mediator (Reformulation, Optimization, Execution); Global Schema; Wrapper; Local Schema (three); Data Source (three)

12 Issues for Query Processing
Reformulation
User queries refer to the global schema.
Data is stored in the sources in a local schema.
Rewriting algorithms
Diagram labels: Query; Mediator (Reformulation); Global Schema; Local Schema (three); Data Source (three)

13 Issues for Query Processing
Reformulation
Global Schema: Books(Title, ISBN, Price, DiscountPrice, Edition)
Local Schema A: BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Query on the global schema:
  SELECT ISBN, Price FROM Books WHERE Title = 'on the road'
Reformulated query on Local Schema A:
  SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = 'on the road' AND ItemType = 'Books'
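A minimal sketch of how such a rewrite could be driven by a mapping table. The `MAPPING` dictionary and the `reformulate` function are hypothetical and cover only single-relation, equality-only queries; real rewriting algorithms work from declarative schema mappings and handle joins.

```python
# Hypothetical mapping from the global relation Books to Local Schema A.
MAPPING = {
    "Books": {
        "table": "BooksAndMusic",
        "columns": {"ISBN": "ItemID", "Price": "SuggestedPrice", "Title": "Title"},
        # The local table mixes books and music, so books must be selected explicitly.
        "extra_predicate": "ItemType = 'Books'",
    }
}

def reformulate(select, relation, where):
    """Rewrite a selection query on the global schema into SQL text for the local schema.

    `where` maps global attribute names to string literals (equality only).
    Values are pasted into the SQL text for readability; production code
    should use bound parameters instead.
    """
    m = MAPPING[relation]
    cols = ", ".join(m["columns"][c] for c in select)
    preds = [f"{m['columns'][attr]} = '{value}'" for attr, value in where.items()]
    preds.append(m["extra_predicate"])
    return f"SELECT {cols} FROM {m['table']} WHERE " + " AND ".join(preds)

local_sql = reformulate(["ISBN", "Price"], "Books", {"Title": "on the road"})
```

With the inputs from the slide, `local_sql` reproduces the reformulated query shown above.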

14 Issues for Query Processing
Query Translation
Different query languages
Diagram labels: Query; Mediator (Reformulation, Optimization, Execution); Global Schema; Wrapper; Local Schema (three); Data Source (three)

15 Issues for Query Processing
Query Translation
Global Schema: Books(Title, ISBN, Price, DiscountPrice, Edition)
Query on the global schema:
  SELECT ISBN, Price FROM Books WHERE Title = 'on the road'
Translated request for Local Source A:
  http://www.amazon.com/homepage.html?ItemType=Books&Title=on+the+road
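A small sketch of this translation step for a source that only accepts HTTP requests. The base URL and the parameter names `ItemType` and `Title` are taken from the slide; the `to_url` helper is hypothetical, and the standard-library `urlencode` does the escaping.

```python
from urllib.parse import urlencode

# Base URL as shown on the slide.
BASE_URL = "http://www.amazon.com/homepage.html"

def to_url(item_type, title):
    """Translate the selection conditions of the query into a request URL."""
    params = {"ItemType": item_type, "Title": title}
    # urlencode escapes values for use in a query string (spaces become '+').
    return BASE_URL + "?" + urlencode(params)

url = to_url("Books", "on the road")
```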

16 Issues for Query Processing
Data Translation
Different data models
Diagram labels: Query; Mediator (Reformulation, Optimization, Execution); Global Schema; Wrapper; Local Schema (three); Data Source (three)

17 Issues for Query Processing
Data Translation
Local Result A:
  On the Road -- by Jack Kerouac; Paperback
  Buy new : $10.86
Global Schema: Books(Title, ISBN, Price, DiscountPrice, Edition)
Translated result:
  Title        ISBN  Price  …  …
  On the Road  123   10.86  …  …
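A wrapper for such a source has to turn display text into tuples of the global schema. The sketch below assumes the listing always has the layout shown on the slide; the regular expression and the `to_global_tuple` helper are hypothetical. The slide does not show where the ISBN comes from, so it is passed in as a parameter here.

```python
import re

LISTING = "On the Road -- by Jack Kerouac; Paperback Buy new : $10.86"

# Assumed layout: "<title> -- by <author>; <format> Buy new : $<price>"
PATTERN = re.compile(r"^(?P<title>.+?) -- by .*?Buy new\s*:\s*\$(?P<price>\d+\.\d{2})")

def to_global_tuple(listing, isbn):
    """Extract Title and Price from the listing text and attach the given ISBN."""
    m = PATTERN.search(listing)
    if m is None:
        raise ValueError("listing does not match the assumed layout")
    return {"Title": m.group("title"), "ISBN": isbn, "Price": float(m.group("price"))}

row = to_global_tuple(LISTING, isbn="123")
```

Hand-written patterns like this break whenever the page layout changes, which is one reason wrapper construction is a problem in its own right.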

18 Issues for Query Processing
Query Execution
Access as many data sources as needed
Duplicate/redundant and irrelevant data
Limited query capabilities
Diagram labels: Query; Mediator (Reformulation, Optimization, Execution); Global Schema; Wrapper; Local Schema (three); Data Source (three)

19 Issues for Query Processing
Limited Query Capabilities
Global Schema: Books(Title, ISBN, Price, DiscountPrice, Edition)
Local Schema A: BooksAndMusic(Title, Author, ItemID, ItemType, SuggestedPrice)
  Supported query: SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ?
Local Schema B: DiscountBooks(Title, Edition, ISBN, GreatPrice)
  Supported query: SELECT GreatPrice FROM DiscountBooks WHERE ISBN = ?
Query on the global schema:
  SELECT ISBN, Price, DiscountPrice FROM Books WHERE Title = 'on the road'
Execution (the slide marks the steps A to E):
  Query to source A:
    SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = 'on the road'
  Result from source A:
    ItemID  SuggestedPrice
    123     10.86
  Query to source B:
    SELECT GreatPrice FROM DiscountBooks WHERE ISBN = 123
  Result from source B:
    GreatPrice
    8.86
  Final answer:
    ISBN  Price  DiscountPrice
    123   10.86  8.86
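The execution order above is forced by the sources: B can only be queried once an ISBN is known, and that ISBN comes from A. A sketch of this dependent join, with hypothetical in-memory stand-ins for the two sources (the functions `source_a` and `source_b` play the role of the two parameterized queries):

```python
# Hypothetical stand-ins for the sources. Each function is the only access path
# its source offers: A must be given a Title, B must be given an ISBN.
_BOOKS_AND_MUSIC = [{"Title": "On the Road", "ItemID": "123", "SuggestedPrice": 10.86}]
_DISCOUNT_BOOKS = [{"ISBN": "123", "GreatPrice": 8.86}]

def source_a(title):
    """SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ?"""
    return [(r["ItemID"], r["SuggestedPrice"]) for r in _BOOKS_AND_MUSIC
            if r["Title"].lower() == title.lower()]

def source_b(isbn):
    """SELECT GreatPrice FROM DiscountBooks WHERE ISBN = ?"""
    return [r["GreatPrice"] for r in _DISCOUNT_BOOKS if r["ISBN"] == isbn]

def books_by_title(title):
    """Answer the global query by feeding each ItemID from A into B as the ISBN."""
    answers = []
    for item_id, price in source_a(title):
        # Books without an entry in B are dropped here (inner-join behaviour).
        for great_price in source_b(item_id):
            answers.append({"ISBN": item_id, "Price": price, "DiscountPrice": great_price})
    return answers

answer = books_by_title("on the road")
```

The reverse order is impossible with these access patterns: the query supplies no ISBN, so B cannot be called first.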

20 Issues for Query Processing
Query Answering
Combine the results and further process them if needed
Mainly union and merge
Inconsistencies
Diagram labels: Query; Result; Mediator (Reformulation, Optimization, Execution); Global Schema; Wrapper; Local Schema (three); Data Source (three)

21 Issues for Query Processing
Query Answering (Union)
First input:
  ItemID  SuggestedPrice
  123     10.86
Second input:
  ISBN  GreatPrice
  456   8.86
Union:
  ISBN  Price
  123   10.86
  456   8.86
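A sketch of the union step on the data from the slide: both inputs are first renamed to the global attributes (ISBN, Price) and then concatenated. The `union` helper is hypothetical; dropping exact duplicates is an assumption about the desired semantics.

```python
def union(result_a, result_b):
    """Rename both inputs to (ISBN, Price), concatenate, and drop exact duplicates."""
    renamed = [{"ISBN": r["ItemID"], "Price": r["SuggestedPrice"]} for r in result_a]
    renamed += [{"ISBN": r["ISBN"], "Price": r["GreatPrice"]} for r in result_b]
    seen, out = set(), []
    for row in renamed:
        key = (row["ISBN"], row["Price"])
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = union([{"ItemID": "123", "SuggestedPrice": 10.86}],
             [{"ISBN": "456", "GreatPrice": 8.86}])
```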

22 Issues for Query Processing
Query Answering (Merge)
First input (primary key: ItemID):
  ItemID  Title
  123     On the Road
Second input (primary key: ISBN):
  ISBN  Edition  Price
  123   2nd      8.86
Merged result (primary key: ISBN):
  ISBN  Title        Edition  Price
  123   On the Road  2nd      8.86
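The merge can be read as a join on the primary keys, assuming ItemID in the first input and ISBN in the second identify the same book. A minimal sketch with a hypothetical `merge` helper:

```python
def merge(result_a, result_b):
    """Join the two inputs on their keys (ItemID = ISBN) into one global tuple per book."""
    by_isbn = {r["ISBN"]: r for r in result_b}  # index the second input by its key
    merged = []
    for a in result_a:
        b = by_isbn.get(a["ItemID"])
        if b is not None:
            merged.append({"ISBN": a["ItemID"], "Title": a["Title"],
                           "Edition": b["Edition"], "Price": b["Price"]})
    return merged

rows = merge([{"ItemID": "123", "Title": "On the Road"}],
             [{"ISBN": "123", "Edition": "2nd", "Price": 8.86}])
```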

23 Issues for Query Processing
Query Answering (Inconsistencies)
First input (primary key: ItemID):
  ItemID  Title        Edition
  123     On the Road  1st
Second input (primary key: ISBN):
  ISBN  Edition  Price
  123   2nd      8.86
Merged result (primary key: ISBN):
  ISBN  Title        Edition  Price
  123   On the Road  ???      8.86
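One possible way to handle this case is to merge as before but record every attribute on which the inputs disagree, leaving its value open instead of silently picking one. The `merge_with_conflicts` helper below is a hypothetical sketch of that policy, not a method prescribed by the slides.

```python
def merge_with_conflicts(row_a, row_b):
    """Merge two rows for the same book; conflicting attributes become None and are reported."""
    if row_a["ItemID"] != row_b["ISBN"]:
        raise ValueError("rows describe different entities")
    merged = {"ISBN": row_b["ISBN"]}
    conflicts = {}
    for attr, value in row_a.items():
        if attr != "ItemID":
            merged[attr] = value
    for attr, value in row_b.items():
        if attr == "ISBN":
            continue
        if attr in merged and merged[attr] != value:
            conflicts[attr] = (merged[attr], value)  # keep both candidate values
            merged[attr] = None                      # the "???" on the slide
        else:
            merged[attr] = value
    return merged, conflicts

merged, conflicts = merge_with_conflicts(
    {"ItemID": "123", "Title": "On the Road", "Edition": "1st"},
    {"ISBN": "123", "Edition": "2nd", "Price": 8.86},
)
```

What to do with the reported conflicts (trust one source, ask the user, return both values) is a policy decision the sketch leaves open.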

24 Peer-Based Integration
Diagram labels: Peer 1; Peer 2; Peer 3; Peer 4; Peer 5; Query

25 Peer-Based Integration
No need for a central mediated schema
Peers serve as mediators for other peers
A peer can be both a server and a client
Semantic relationships are specified locally (between small sets of peers)
Queries are posed using the peer's schema
Answers come from anywhere in the system
This is not P2P file sharing.
– Data has rich semantics
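A toy sketch of these ideas, assuming each peer stores a few rows and knows only attribute renamings to its direct neighbours. The `PEERS` table, its contents, and the `ask` function are all hypothetical; real peer mappings are far richer than one-to-one renamings.

```python
# Hypothetical peers: each has local rows and renaming mappings to a few neighbours.
PEERS = {
    "P1": {"rows": [{"Title": "On the Road"}],
           "neighbours": {"P2": {"Title": "Name"}}},
    "P2": {"rows": [{"Name": "Big Sur"}],
           "neighbours": {"P3": {"Name": "BookTitle"}}},
    "P3": {"rows": [{"BookTitle": "Desolation Angels"}],
           "neighbours": {}},
}

def ask(peer, attr, visited=None):
    """Collect all values of `attr` (named in `peer`'s own schema) reachable from `peer`."""
    visited = set() if visited is None else visited
    if peer in visited:  # do not loop forever on cyclic mappings
        return []
    visited.add(peer)
    info = PEERS[peer]
    answers = [row[attr] for row in info["rows"] if attr in row]
    for neighbour, mapping in info["neighbours"].items():
        if attr in mapping:
            # Translate the attribute name with the local mapping and forward the query.
            answers += ask(neighbour, mapping[attr], visited)
    return answers

titles = ask("P1", "Title")
```

The query is posed in P1's vocabulary, yet P3 contributes an answer although P1 has no mapping to P3: the translation is composed from the two local mappings along the path.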


