Data Integration Rachel Pottinger and Liang Sun CSE 590ES January 24, 2000
What is Data Integration?
Providing
– uniform (sources transparent to user)
– access to (query, and eventually updates to)
– multiple
– autonomous (can’t affect behavior of sources)
– heterogeneous (different models and schemas)
data sources.
Sounds like the devices in Portolano!
Outline Architecture of data integration system Source description & query reformulation Query optimization
Motivation
Enterprise data integration; web-site construction.
World-wide web:
– comparison shopping (Netbot, Junglee)
– portals integrating data from multiple sources
– XML integration
Science & culture:
– Medical genetics: integrating genomic data
– Astrophysics: monitoring events in the sky
– Environment: Puget Sound Regional Synthesis Model
– Culture: uniform access to all the cultural databases produced by countries in Europe
(some) Research Prototypes DISCO (INRIA) Garlic (IBM) HERMES (U. of Maryland) InfoMaster (Stanford) Information Manifold (AT&T) IRO-DB (Versailles) SIMS, ARIADNE (USC/ISI) The Internet Softbot/Occam/Razor/Tukwila (us) TSIMMIS (Stanford) XMAS (UCSD) WHIRL (AT&T)
Principal Dimensions of Data Integration
Virtual vs. materialized architecture
Mediated schema?
Pros:
– Can ask questions over different schemas
Cons:
– Requires query reformulation
Mediated schema example
Real database schemas:
  imdb(title, actor, director, genre, country)
  seattlesidewalk(title, theatre, time, price)
  showtimes(city, title, theatre, time)
  mrshowbiz(title, year, review)
  siskelebert(title, review)
Mediated schemas:
  movieInfo(ID, title, genre, country, year, director)
  movieShowtime(ID, city, theatre, time)
  movieActor(ID, actor)
  movieReview(ID, review)
Query:
  query(M, theatre, time) :- movieActor(M, “tom hanks”), movieShowtime(M, “seattle”, theatre, time).
Materialization Architecture
[Diagram: Data Sources, Wrapper, Data Extraction, Data Warehouse, Application]
Tukwila Architecture
[Diagram: Data Source, Wrapper, Query Execution Engine, Query Optimization, Query Reformulation, Global Data Model, Local Data Model, catalog]
Translating between data models
Where is the wrapper? How intelligent is the wrapper?
[Diagram labels: Global DM: exported schema, query in exported schema, data in global data model. Local DM: native schema, query in native schema, data in local data model.]
Describing Information Sources
– User queries refer to the mediated schema
– Sources store data in the local schemas
– Content descriptions provide the mappings between the mediated and local schemas
[Diagram: Mediated Schema Relations, Content Descriptions, Information Source Relations]
Data Source Catalogs Catalogs contain descriptions of: Logical source contents Source capabilities Source completeness Mirror sources Physical properties of the source and network Source reliability
Desiderata from source descriptions Distinguish between sources with closely related data: so we can prune access to irrelevant sources Enable easy addition of new information sources: because sources are dynamically being added and removed Be able to find sources relevant to a query: reformulate queries such that we obtain guarantees on which sources we access
Query Reformulation Problem
Problem: reformulate a user query that refers to the mediated schema into a query over the local schemas.
Given a query Q in terms of the mediated schema, and descriptions of the data sources,
find a query Q’ that uses only the data source relations such that
– Q’ ⊆ Q (i.e., answers are correct), and
– Q’ provides all possible answers to Q using the sources.
Approaches to Specification of Source Descriptions Mediated schema relations defined as views over the source relations Source relations defined as views over mediated-schema relations Sources described as concepts in a description logic
The Global As View Approach
Mediated-schema relations are described in terms of source relations.
Movies and their years can be obtained from either DB1 or DB2:
  MovieYear(title, year) :- DB1(title, director, year)
  MovieYear(title, year) :- DB2(title, director, year)
Movie reviews can be obtained by joining DB1 and DB3:
  MovieRev(title, director, review) :- DB1(title, director, year) & DB3(title, review)
Query Reformulation in GAV
Query reformulation is done by rule unfolding.
Query: find reviews for 1997 movies:
  q(title, review) :- MovieYear(title, 1997) & MovieRev(title, director, review)
Reformulated query on the sources:
  q(title, review) :- DB1(title, director, year) & DB3(title, review)
  q(title, review) :- DB1(title, director, year) & DB2(title, director, year) & DB3(title, review)
A containment check shows that the second rule is redundant.
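Rule unfolding can be sketched in a few lines of Python. This is only an illustration under assumptions: the rule encoding (`GAV_RULES`), the `unfold` function, and the convention that strings are variables and other values are constants are all invented here. The sketch keeps the constant 1997 in the unfolded atoms and does not perform the containment check or any minimization.

```python
from itertools import count, product

# Hypothetical encoding of the GAV rules from the slides:
# mediated relation -> list of (head variables, body atoms over sources).
GAV_RULES = {
    "MovieYear": [
        (("title", "year"), [("DB1", ("title", "director", "year"))]),
        (("title", "year"), [("DB2", ("title", "director", "year"))]),
    ],
    "MovieRev": [
        (("title", "director", "review"),
         [("DB1", ("title", "director", "year")),
          ("DB3", ("title", "review"))]),
    ],
}

def unfold(query_body, rules=GAV_RULES):
    """Replace each mediated-schema atom by the body of one of its rules.

    Returns one conjunctive query over the sources for every combination
    of rule choices. Variables are strings; constants are non-strings.
    """
    fresh = count()  # used to rename variables local to a rule body
    choices = [rules[relation] for relation, _ in query_body]
    rewritings = []
    for combo in product(*choices):
        body = []
        for (_, args), (head, rule_body) in zip(query_body, combo):
            subst = dict(zip(head, args))   # head variable -> query term
            suffix = f"_{next(fresh)}"      # keeps unmapped variables distinct
            for source, source_args in rule_body:
                body.append(
                    (source, tuple(subst.get(a, a + suffix) for a in source_args))
                )
        rewritings.append(body)
    return rewritings

query = [("MovieYear", ("title", 1997)),
         ("MovieRev", ("title", "director", "review"))]
for rewriting in unfold(query):
    print(" & ".join(f"{s}{a}" for s, a in rewriting))
```

Each printed line is one candidate rewriting; a real system would follow this step with containment checks to drop redundant rules.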
The Local As View Approach
Every data source is described as a query expression (view!) over mediated-schema relations.
S1: V1(title, year, director) ⊆ year ≥ 1960 & genre = ‘Comedy’ & Movie(title, year, director, genre)
S2: V2(title, review) ⊆ Review(title, review)
Query Reformulation
Find reviews for comedies produced after 1950:
  q(title, review) :- Movie(title, year, director, ‘Comedy’) & year ≥ 1950 & Review(title, review)
Source descriptions:
  V1(title, year, director) ⊆ year ≥ 1960 & genre = ‘Comedy’ & Movie(title, year, director, genre)
  V2(title, review) ⊆ Review(title, review)
The reformulated query on the sources:
  q’(title, review) :- V1(title, year, director) & V2(title, review)
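The relationship between q and q’ can be checked on a toy instance. The sketch below is an illustration only: the tuples are invented, the comparisons are assumed to be "year ≥ 1960" and "year ≥ 1950", and source V1 is deliberately made incomplete to mimic a source that holds only part of its view. It shows that every answer of q’ is an answer of q, while q’ may miss answers the sources do not cover.

```python
# Invented mediated-schema instance (not visible to the integration system).
movie = {
    ("m1", 1955, "d1", "Comedy"),
    ("m2", 1965, "d2", "Comedy"),
    ("m3", 1970, "d3", "Drama"),
    ("m4", 1980, "d1", "Comedy"),
}
review = {("m1", "r1"), ("m2", "r2"), ("m3", "r3"), ("m4", "r4")}

# Source contents. V1 is contained in its view definition (comedies from
# 1960 on) but is assumed incomplete: it lacks m4.
v1_view = {(t, y, d) for (t, y, d, g) in movie if y >= 1960 and g == "Comedy"}
v1 = v1_view - {("m4", 1980, "d1")}
v2 = set(review)

def q(movie, review):
    """The user query over the mediated schema."""
    return {(t, r)
            for (t, y, d, g) in movie if g == "Comedy" and y >= 1950
            for (t2, r) in review if t2 == t}

def q_prime(v1, v2):
    """The reformulated query, which touches only the sources."""
    return {(t, r) for (t, y, d) in v1 for (t2, r) in v2 if t2 == t}

answers = q(movie, review)
source_answers = q_prime(v1, v2)
print(sorted(source_answers))
print(source_answers <= answers)  # containment holds on this instance
```

On this instance q’ returns only the review of m2: m1 is a comedy from before 1960, so V1 cannot supply it, and m4 is missing from the incomplete source.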
Comparison of the approaches Local as view approach: Easier to add sources: specify the query expression Easier to specify constraints on source contents Global as view: Query reformulation is straightforward
The Query Optimization Problem (currently divorced from reformulation problem) The goal of a query optimizer: Translate a declarative query into an equivalent imperative program of minimal cost The imperative program is a query execution plan: an operator tree in some algebra Basic notions in optimization: search space, search strategy, cost model
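The three basic notions can be made concrete with a toy optimizer. Everything in this sketch is an assumption made for illustration: the relation names and cardinalities, the single uniform join selectivity, and the cost formula (sum of intermediate result sizes). The search space is the set of left-deep join orders, the search strategy is exhaustive enumeration, and the cost model is the `cost` function.

```python
from itertools import permutations

# Hypothetical statistics; a data integration system often lacks these.
CARDINALITY = {"R": 1000, "S": 100, "T": 10}
SELECTIVITY = 0.01  # assumed identical for every join

def cost(order):
    """Cost model: total size of the intermediate results of a left-deep plan."""
    size = CARDINALITY[order[0]]
    total = 0.0
    for relation in order[1:]:
        size = size * CARDINALITY[relation] * SELECTIVITY
        total += size
    return total

def best_plan(relations):
    """Search strategy: enumerate the whole search space and keep the cheapest."""
    return min(permutations(relations), key=cost)

plan = best_plan(["R", "S", "T"])
print(plan, cost(plan))
```

Exhaustive enumeration is only feasible for a handful of relations; practical optimizers prune the space, for example with dynamic programming.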
Similarities of Data Integration with Optimization in DDBMS A distributed database: Query execution distributed over multiple sites Communication costs significant Consequences for Query Optimization Optimizer needs to decide operation locality Plans should exploit independent parallelism Plans should reduce communication overhead Caching can become a significant factor
Differences from DDBMS
Capabilities of data sources:
– May provide only limited access patterns to data
– May have additional query processing capabilities
Information about sources and the network is missing:
– cost of answering queries unknown
– statistics harder to estimate
– transfer rates unpredictable
In a DDBMS, data is distributed by precise rules
Modeling Source Capabilities Negative capabilities: A web-site may require certain inputs Need to consider only valid query execution plans Positive capabilities: A source may be an ODBC compliant database Need to decide the placement of operations according to capabilities Problem: how to describe and use source capabilities
Negative Capabilities
We model access limitations by binding patterns.
Sources:
  CitationDB^bf(X, Y) ⊆ Cites(X, Y)
  CitingPapers^f(X) ⊆ Cites(X, Y)
Query:
  Q(X) :- Cites(X, a)
Need to consider only valid plans:
  q(X) :- CitingPapers(X) & CitationDB(X, a)
Requires recursive rewritings to find all solutions
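A binding pattern can be mimicked by a function that refuses to run without its bound argument. The sketch below is an illustration with invented data and function names (`citing_papers`, `citation_db`, `answer`); it executes a valid plan for the query by first obtaining candidate papers from the free source and then feeding each one into the source that requires its first argument to be bound. It does not implement the recursive rewritings mentioned on the slide.

```python
# Invented contents of the Cites relation, hidden behind the two sources.
CITES = {("p1", "a"), ("p2", "a"), ("p2", "p1"), ("p3", "b")}

def citing_papers():
    """Source with pattern f: returns every paper that cites something."""
    return {x for (x, _) in CITES}

def citation_db(x):
    """Source with pattern bf: the citing paper x must be supplied."""
    if x is None:
        raise ValueError("CitationDB requires its first argument to be bound")
    return {y for (x2, y) in CITES if x2 == x}

def answer(target):
    """Valid plan: bind X from CitingPapers, then probe CitationDB with it."""
    return {x for x in citing_papers() if target in citation_db(x)}

print(sorted(answer("a")))
```

Asking CitationDB directly for all papers that cite a given paper is not a valid plan here, because that call would leave the bound argument empty.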
Optimization with positive capabilities
Schema dependent vs. schema independent:
– Source able to perform joins, selections, or specifically R ⋈ S
Describing and using positive capabilities:
– Positive-capabilities testing module (PCTM): is a plan valid?
– Level of specification: declarative query vs. logical query execution plan
– Interaction between optimizer and PCTM
Dealing with unexpected data transfer delays Problem: even the best plan can be bad with data transfer delays Query scrambling [Urhan et al, SIGMOD-98]: a set of runtime techniques to adapt to initial delays by: Rescheduling the query execution plan: leave plan unchanged, but evaluate different operators Operator synthesis: modify tree by removing or rearranging operators
Adaptive Query Processing (teaser for next week) Tukwila: a more general framework for query processing in data integration Due to lack of stats and network delays, interleaves query optimization and execution: Execute plan fragments; re-optimize Can decide to re-optimize even if next fragment is planned Adaptive operators for data integration Rule based mechanism for coordinating execution and optimization
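The idea of interleaving execution with re-optimization can be sketched as a simple loop. This is not Tukwila's mechanism; it is a toy under assumptions: the fragment names, the estimated and observed result sizes, the threshold of 2.0, and the stand-in `reoptimize` policy are all invented. After each fragment runs, the observed size is compared with the estimate, and the remaining fragments are re-planned when the estimate was badly off.

```python
# Hypothetical estimated and observed result sizes per plan fragment.
ESTIMATED = {"f1": 100, "f2": 50, "f3": 10}
OBSERVED = {"f1": 1000, "f2": 40, "f3": 10}

def reoptimize(remaining):
    """Stand-in optimizer: run the fragment with the smallest estimate first."""
    return sorted(remaining, key=ESTIMATED.get)

def run_adaptively(plan, threshold=2.0):
    """Execute fragments in order, re-planning the rest after a bad estimate."""
    executed, remaining, replans = [], list(plan), 0
    while remaining:
        fragment = remaining.pop(0)
        executed.append(fragment)  # "execution" just records the fragment
        if remaining and OBSERVED[fragment] > threshold * ESTIMATED[fragment]:
            remaining = reoptimize(remaining)
            replans += 1
    return executed, replans

print(run_adaptively(["f1", "f2", "f3"]))
```

Here the first fragment produces ten times the estimated result size, so the rest of the plan is reordered before it runs, even though a next fragment had already been planned.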
Conclusions Data integration handles many problems needed for embedded systems applications Many data sources Easy addition and deletion of sources Different source capabilities Dealing with network delays Easy for user