1
Data Integration
Rachel Pottinger and Liang Sun
CSE 590ES
January 24, 2000
2
What is Data Integration?
Providing uniform (sources transparent to the user) access to (query, and eventually update) multiple autonomous (we can't affect the behavior of the sources) heterogeneous (different models and schemas) data sources.
Sounds like the devices in Portolano!
3
Outline
Architecture of a data integration system
Source descriptions & query reformulation
Query optimization
4
Motivation
Enterprise data integration; web-site construction.
World-wide web:
–comparison shopping (Netbot, Junglee)
–portals integrating data from multiple sources
–XML integration
Science & culture:
–Medical genetics: integrating genomic data
–Astrophysics: monitoring events in the sky
–Environment: Puget Sound Regional Synthesis Model
–Culture: uniform access to all the cultural databases produced by countries in Europe
5
(some) Research Prototypes
DISCO (INRIA)
Garlic (IBM)
HERMES (U. of Maryland)
InfoMaster (Stanford)
Information Manifold (AT&T)
IRO-DB (Versailles)
SIMS, ARIADNE (USC/ISI)
The Internet Softbot / Occam / Razor / Tukwila (us)
TSIMMIS (Stanford)
XMAS (UCSD)
WHIRL (AT&T)
6
Principal Dimensions of Data Integration
Virtual vs. materialized architecture
Mediated schema?
Pros:
–Users can pose queries that span sources with different schemas
Cons:
–Requires query reformulation
7
Mediated schema example
Real database schemas:
imdb(title, actor, director, genre, country)
seattlesidewalk(title, theatre, time, price)
showtimes(city, title, theatre, time)
mrshowbiz(title, year, review)
siskelebert(title, review)
Mediated schema:
movieInfo(ID, title, genre, country, year, director)
movieShowtime(ID, city, theatre, time)
movieActor(ID, actor)
movieReview(ID, review)
Query:
query(M, theatre, time) :- movieActor(M, “tom hanks”), movieShowtime(M, “seattle”, theatre, time).
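A minimal sketch of how this query reads against the mediated schema, assuming hypothetical in-memory tables (the tuples are invented for illustration):

```python
# Hypothetical mediated-schema relations as in-memory tables (invented tuples).
movie_actor = [
    (1, "tom hanks"),
    (2, "meg ryan"),
]
movie_showtime = [
    (1, "seattle", "Cinerama", "19:30"),
    (2, "seattle", "Neptune", "21:00"),
]

# query(M, theatre, time) :- movieActor(M, "tom hanks"),
#                            movieShowtime(M, "seattle", theatre, time).
answers = [
    (m, theatre, time)
    for (m, actor) in movie_actor if actor == "tom hanks"
    for (m2, city, theatre, time) in movie_showtime
    if m2 == m and city == "seattle"
]
print(answers)  # [(1, 'Cinerama', '19:30')]
```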
8
Materialization Architecture
[Diagram: multiple data sources feed, via wrappers and a data-extraction step, into a data warehouse, which the application queries.]
9
Tukwila Architecture
[Diagram: query reformulation, query optimization (consulting a catalog), and the query execution engine operate on the global data model; wrappers translate between each data source's local data model and the global one.]
10
Translating between data models
Where is the wrapper?
How intelligent is the wrapper?
[Diagram: the wrapper sits between the global and local data models; it presents an exported schema, accepts queries in the exported schema, translates them into queries in the native schema, and converts the resulting data from the local data model into the global data model.]
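A minimal sketch of the wrapper's job, assuming a hypothetical source whose native interface is a key-value store with packed strings; the wrapper exports a relational schema and translates data from the local model into the global one:

```python
# Hypothetical native source: showtimes keyed by theatre, in a packed local format.
NATIVE = {"Cinerama": "titanic|19:30", "Neptune": "gattaca|21:00"}

def wrapper_showtimes():
    """Export (title, theatre, time) tuples in the global data model,
    translating from the source's native representation."""
    for theatre, packed in NATIVE.items():
        title, time = packed.split("|")   # parse the local data model
        yield (title, theatre, time)      # emit global-model tuples

print(list(wrapper_showtimes()))
```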
11
Describing Information Sources
User queries refer to the mediated schema.
Sources store data in their local schemas.
Content descriptions provide the mappings between the mediated and local schemas.
[Diagram: content descriptions connect the mediated-schema relations to the information-source relations.]
12
Data Source Catalogs
Catalogs contain descriptions of:
–Logical source contents
–Source capabilities
–Source completeness
–Mirror sources
–Physical properties of the source and network
–Source reliability
13
Desiderata for source descriptions
Distinguish between sources with closely related data, so we can prune access to irrelevant sources.
Enable easy addition of new information sources, because sources are dynamically added and removed.
Find the sources relevant to a query: reformulate queries so that we obtain guarantees on which sources we access.
14
Query Reformulation Problem
Problem: reformulate a user query posed over the mediated schema into a query over the local schemas.
Given a query Q in terms of the mediated schema and the descriptions of the data sources,
find a query Q' that uses only the data-source relations, such that Q' ⊆ Q (i.e., every answer of Q' is a correct answer to Q) and Q' provides all possible answers to Q obtainable from the sources.
15
Approaches to Specifying Source Descriptions
Mediated-schema relations defined as views over the source relations (global-as-view)
Source relations defined as views over the mediated-schema relations (local-as-view)
Sources described as concepts in a description logic
16
The Global-As-View Approach
Mediated-schema relations are described in terms of the source relations.
Movies and their years can be obtained from either DB1 or DB2:
MovieYear(title, year) :- DB1(title, director, year)
MovieYear(title, year) :- DB2(title, director, year)
Movie reviews can be obtained by joining DB1 and DB3:
MovieRev(title, director, review) :- DB1(title, director, year) & DB3(title, review)
17
Query Reformulation in GAV
Query reformulation is done by rule unfolding.
Query: find reviews for 1997 movies:
q(title, review) :- MovieYear(title, 1997) & MovieRev(title, director, review)
Reformulated queries over the sources:
q(title, review) :- DB1(title, director, 1997) & DB3(title, review)
q(title, review) :- DB2(title, director, 1997) & DB1(title, director, year) & DB3(title, review)
A containment check shows that the second rule is redundant.
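Since GAV defines each mediated relation directly as a view over the sources, answering a query amounts to evaluating those definitions. A minimal sketch, assuming hypothetical source tables DB1, DB2, DB3 with invented tuples:

```python
# Hypothetical source tables (invented tuples).
db1 = [("titanic", "cameron", 1997)]
db2 = [("gattaca", "niccol", 1997)]
db3 = [("titanic", "three stars"), ("gattaca", "two stars")]

# GAV: mediated relations defined as views over the sources.
def movie_year():
    # MovieYear(title, year) :- DB1(t, d, y)  or  DB2(t, d, y)
    return [(t, y) for (t, d, y) in db1] + [(t, y) for (t, d, y) in db2]

def movie_rev():
    # MovieRev(title, director, review) :- DB1(t, d, y) & DB3(t, r)
    return [(t, d, r) for (t, d, y) in db1 for (t2, r) in db3 if t2 == t]

# q(title, review) :- MovieYear(title, 1997) & MovieRev(title, director, review)
answers = {(t, r)
           for (t, y) in movie_year() if y == 1997
           for (t2, d, r) in movie_rev() if t2 == t}
print(answers)  # {('titanic', 'three stars')}
```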
18
The Local-As-View Approach
Every data source is described as a query expression (a view!) over the mediated-schema relations:
S1: V1(title, year, director) :- Movie(title, year, director, genre) & year ≥ 1960 & genre = 'Comedy'
S2: V2(title, review) :- Review(title, review)
19
Query Reformulation in LAV
Find reviews for comedies produced after 1950:
q(title, review) :- Movie(title, year, director, 'Comedy') & year ≥ 1950 & Review(title, review)
The source descriptions:
V1(title, year, director) :- Movie(title, year, director, genre) & year ≥ 1960 & genre = 'Comedy'
V2(title, review) :- Review(title, review)
The reformulated query over the sources:
q'(title, review) :- V1(title, year, director) & V2(title, review)
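A minimal sketch of the LAV setting, with invented tuples for the conceptual mediated relations Movie and Review; the integration system never sees them directly, only the views V1 and V2 that the sources expose:

```python
# Conceptual mediated relations (invented tuples); only the views are visible.
movie = [("airplane!", 1980, "abrahams", "Comedy"),
         ("some like it hot", 1959, "wilder", "Comedy"),
         ("alien", 1979, "scott", "Horror")]
review = [("airplane!", "hilarious"), ("alien", "terrifying")]

# LAV source descriptions:
# V1(title, year, director) :- Movie(t, y, d, g) & y >= 1960 & g = 'Comedy'
v1 = [(t, y, d) for (t, y, d, g) in movie if y >= 1960 and g == "Comedy"]
# V2(title, review) :- Review(t, r)
v2 = list(review)

# Reformulated query: q'(title, review) :- V1(t, y, d) & V2(t, r)
answers = {(t, r) for (t, y, d) in v1 for (t2, r) in v2 if t2 == t}
print(answers)  # {('airplane!', 'hilarious')}
# q' is contained in q: V1 holds only comedies from 1960 on (all after 1950),
# so 'some like it hot' (1959) cannot be retrieved through these sources.
```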
20
Comparison of the approaches
Local-as-view:
–Easier to add sources: just specify the view expression
–Easier to specify constraints on source contents
Global-as-view:
–Query reformulation is straightforward (rule unfolding)
21
The Query Optimization Problem
(currently divorced from the reformulation problem)
The goal of a query optimizer: translate a declarative query into an equivalent imperative program of minimal cost.
The imperative program is a query execution plan: an operator tree in some algebra.
Basic notions in optimization: search space, search strategy, cost model.
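A minimal sketch of those notions, using made-up cardinalities and join selectivities: the search space here is just two join orders, the search strategy is exhaustive comparison, and the cost model is a toy formula (inputs read plus output produced):

```python
# Made-up cardinality estimates and join selectivities (illustrative only).
card = {"R": 1000, "S": 10, "T": 100}
sel = {("R", "S"): 0.01, ("S", "T"): 0.05}

def join_cost(left, right, selectivity):
    # Toy cost model: read both inputs, then produce the estimated output.
    out = left * right * selectivity
    return left + right + out, out

# Plan A: (R join S) join T
c1, out_rs = join_cost(card["R"], card["S"], sel[("R", "S")])
cost_a = c1 + join_cost(out_rs, card["T"], sel[("S", "T")])[0]

# Plan B: R join (S join T)
c2, out_st = join_cost(card["S"], card["T"], sel[("S", "T")])
cost_b = c2 + join_cost(card["R"], out_st, sel[("R", "S")])[0]

print(f"plan A: {cost_a}, plan B: {cost_b}")  # the optimizer keeps the cheaper plan
```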
22
Similarities of Data Integration with Optimization in DDBMS
In a distributed database:
–Query execution is distributed over multiple sites
–Communication costs are significant
Consequences for query optimization:
–The optimizer needs to decide operation locality
–Plans should exploit independent parallelism
–Plans should reduce communication overhead
–Caching can become a significant factor
23
Differences from DDBMS
Capabilities of the data sources:
–May provide only limited access patterns to the data
–May have additional query-processing capabilities
Information about the sources and the network is missing:
–The cost of answering queries is unknown
–Statistics are harder to estimate
–Transfer rates are unpredictable
In a DDBMS, data is distributed by precise rules.
24
Modeling Source Capabilities
Negative capabilities:
–A web site may require certain inputs
–Need to consider only valid query execution plans
Positive capabilities:
–A source may be an ODBC-compliant database
–Need to decide the placement of operations according to capabilities
Problem: how to describe and use source capabilities.
25
Negative Capabilities
We model access limitations by binding patterns (b = argument must be bound, f = argument may be free):
Sources:
CitationDB^bf(X, Y) :- Cites(X, Y)
CitingPapers^f(X) :- Cites(X, Y)
Query: Q(X) :- Cites(X, a)
Need to consider only valid plans:
q(X) :- CitingPapers(X) & CitationDB(X, a)
Finding all solutions may require recursive rewritings.
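A minimal sketch of executing under these binding patterns, with hypothetical stub sources: citing_papers() takes no input (pattern f), while citation_db(x) demands its first argument bound (pattern bf), so the only valid plan binds X via CitingPapers before probing CitationDB:

```python
# Hypothetical Cites facts hidden behind the two source interfaces.
_CITES = [("p1", "a"), ("p2", "b"), ("p3", "a")]

def citing_papers():
    # CitingPapers^f(X): no bound argument needed; enumerates citing papers.
    return sorted({x for (x, _) in _CITES})

def citation_db(x):
    # CitationDB^bf(X, Y): X must be bound; returns the papers X cites.
    return [y for (x2, y) in _CITES if x2 == x]

# Q(X) :- Cites(X, a). Valid plan: q(X) :- CitingPapers(X) & CitationDB(X, a).
answers = [x for x in citing_papers() if "a" in citation_db(x)]
print(answers)  # ['p1', 'p3']
```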
26
Optimization with positive capabilities
Schema-dependent vs. schema-independent: a source may be able to perform joins and selections in general, or only a specific operation such as R ⋈ S.
Describing and using positive capabilities:
–A positive-capabilities testing module (PCTM): is a plan valid?
–Level of specification: declarative query vs. logical query execution plan
–Interaction between the optimizer and the PCTM
27
Dealing with unexpected data transfer delays
Problem: even the best plan can perform badly under data transfer delays.
Query scrambling [Urhan et al., SIGMOD-98]: a set of runtime techniques that adapt to initial delays by:
–Rescheduling the query execution plan: leave the plan unchanged, but evaluate different operators first
–Operator synthesis: modify the tree by removing or rearranging operators
28
Adaptive Query Processing (teaser for next week)
Tukwila: a more general framework for query processing in data integration.
Because statistics are lacking and network delays are common, Tukwila interleaves query optimization and execution:
–Execute plan fragments, then re-optimize
–Can decide to re-optimize even if the next fragment is already planned
–Adaptive operators for data integration
–A rule-based mechanism for coordinating execution and optimization
29
Conclusions
Data integration addresses many of the problems that embedded-systems applications face:
–Many data sources
–Easy addition and deletion of sources
–Different source capabilities
–Dealing with network delays
–Ease of use for the user