2005 Integration-intro, slide 1: Data Integration Systems overview. The architecture of a data integration system: components and their interaction, tasks, concepts.

Presentation transcript:


Slide 2: Main components of a DI system (I)
Mediator (Hebrew: מתווך)
- Supports in its user interface:
  - the global data model
  - the integrated / global / mediated schema / world view
  - a query language
- Manages the interaction with sources:
  - posing queries
  - receiving answers, transforming them, and showing them
- Is responsible for query execution strategies:
  - planning
  - carrying out

Slide 3: Main components of a DI system (II)
Wrapper (Hebrew: עוטף)
- Serves as the interface to a source:
  - receives queries from a mediator
  - plans and executes the retrieval of data from its source
  - transforms the data to the global data model
  - sends the data to the mediator
- For an SQL source, these steps are fairly easy
- For a restricted-capability source, they may require:
  - a series of queries on the source, or
  - a program to be executed (on a non-database source)
  - filtering the results obtained from the source
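The wrapper steps for a restricted-capability source can be sketched as follows. This is a minimal illustration and not part of the slides: the `LookupSource` class, its `by_course` method, the dictionary record layout, and the sample rows are all hypothetical. The source is assumed to answer only lookups by course, so the wrapper issues a series of source queries, filters the results, and transforms them into tuples of a global relation Teaches(C, T).

```python
class LookupSource:
    """Hypothetical restricted source: it can only be queried by course code."""

    def __init__(self, rows):
        self._rows = rows  # list of dicts: {"course": ..., "teacher": ...}

    def by_course(self, course):
        return [r for r in self._rows if r["course"] == course]


class TeachesWrapper:
    """Exposes the source as the relation Teaches(C, T) in the global model."""

    def __init__(self, source):
        self._source = source

    def query(self, courses, teacher=None):
        answers = []
        for course in courses:  # a series of queries on the source
            for row in self._source.by_course(course):
                # Filter at the wrapper: the source cannot select by teacher.
                if teacher is None or row["teacher"] == teacher:
                    # Transform the source record to a global-model tuple.
                    answers.append((row["course"], row["teacher"]))
        return answers


# Made-up sample data, for illustration only.
source = LookupSource([
    {"course": "db", "teacher": "Beeri"},
    {"course": "os", "teacher": "Levi"},
    {"course": "ai", "teacher": "Beeri"},
])
wrapper = TeachesWrapper(source)
print(wrapper.query(["db", "os", "ai"], teacher="Beeri"))
```

The mediator sees only `query`; how many source calls it takes, and where the filtering happens, stays hidden inside the wrapper.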

Slide 4: A simple architecture
Arrows represent query and data flow.
[Figure: diagram with boxes labeled source, wrapper, mediator]

Slide 5: A more complex architecture
Mediators can serve as wrapped sources for other mediators.
[Figure: diagram with boxes labeled source, wrapper, mediator, source, wrapper]

Slide 6: Important
- The global database is virtual: it contains no data
- The data resides in the sources
- Users pose queries as if the data resided in the global database
- Users may or may not be aware that the data actually comes from the sources

Slide 7: Main tasks & activities
At the mediator:
- Query reformulation & decomposition
  - express queries in terms of the sources' schemas
  - decompose them into queries on the sources
- Planning query execution, including optimization
  - a declarative query may be executed in various ways (even in a single centralized database)
  - different sources may provide the same data at different costs (money, communication time, response time, delays, …)
  - if data is associated with user priorities, we may want to retrieve some answers before others
- When answers arrive, fuse them
  - a full answer is not a simple union of partial answers; data on an entity must be combined (fused) into a single record
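The fusion step can be illustrated with a short sketch. It is not taken from the slides: the function name `fuse`, the record layout (dictionaries with a shared key attribute), and the conflict policy (the first non-missing value wins) are assumptions chosen to keep the example small.

```python
def fuse(partial_answers, key):
    """Merge records that describe the same entity into a single record.

    Missing values (None or absent) are filled in from other sources.
    When sources disagree, the first non-missing value wins; this is an
    assumed policy, and a real mediator may need something richer.
    """
    fused = {}
    for record in partial_answers:
        merged = fused.setdefault(record[key], {})
        for attr, value in record.items():
            if value is not None and merged.get(attr) is None:
                merged[attr] = value
    return list(fused.values())


# Made-up answers from two sources about cars for sale.
from_source_a = [{"vin": "1", "price": 5000, "owner": None},
                 {"vin": "2", "price": 7000}]
from_source_b = [{"vin": "1", "owner": "Dana", "price": 5200}]

for car in fuse(from_source_a + from_source_b, key="vin"):
    print(car)
```

A plain union would return three records here; fusion returns two, one per car, with the owner of car 1 filled in from the second source.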

Slide 8: Requirements (from mediator, wrapper, system)
- Ability to handle:
  - incomplete information (data may be missing from the available sources)
  - heterogeneity in data model, schema, and contents
  - both data and meta-data
- Ability to describe sources:
  - Capabilities
    - what queries can a source answer?
    - what mechanisms does it offer for data retrieval?
  - Coverage, which lets us know
    - where data relevant to a query can be found
    - whether there is overlap between sources
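One possible way to record such source descriptions is sketched below. Everything here is an assumption made for illustration: the `SourceDescription` fields, the idea of expressing a capability as a set of attributes that must be bound in a query, and the sample catalog are not from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDescription:
    name: str
    relations: frozenset       # coverage: global relations the source has data for
    bound_required: frozenset  # capability: attributes a query must bind


def relevant_sources(catalog, relation, bound_attrs):
    """Names of the sources that cover `relation` and can answer a query
    that binds exactly the attributes in `bound_attrs`."""
    bound = set(bound_attrs)
    return [s.name for s in catalog
            if relation in s.relations and s.bound_required <= bound]


# Hypothetical catalog: source B answers Enroll queries only for a given student.
catalog = [
    SourceDescription("A", frozenset({"Dept", "Teaches"}), frozenset()),
    SourceDescription("B", frozenset({"Enroll"}), frozenset({"S"})),
]
print(relevant_sources(catalog, "Enroll", {"S"}))
print(relevant_sources(catalog, "Enroll", set()))
```

Overlap between sources would show up in such a catalog as two descriptions listing the same relation.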

Slide 9: The relationship between source and global data
- The global data is virtual, so the mediated schema describes data that
  - resides in the sources
  - is described by the source schemas
- The relationship between the mediated data and the source data determines how queries are answered
- There are two main approaches (a combination of the two comes later)

Slide 10: Global as View (GAV)
- The global database is defined as a view on the sources
- In the relational model: each global relation is defined as a view, by a query on the sources
- Obvious advantage: simplicity of query answering
  - given Q on global relations, expand it: replace each atom R(x) by an expression on the sources, using the definition of R
  - then send the appropriate sub-queries to the sources

Slide 11: Simple example: a university database
- Source A:
  - Dept(D, C): departments and their courses
  - Teaches(C, T): teachers of courses
- Source B:
  - Enroll(S, C): student enrollment in courses
- Integrated schema & its definition:
  Stud(S, D, T) :- Dept(D, C), Teaches(C, T), Enroll(S, C)
- Query Q: Stud(S, 'CS', 'Beeri')
- Expand the body to Dept('CS', C), Teaches(C, 'Beeri'), Enroll(S, C)
- Then use one of (at least) two execution strategies on sources A and B
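The expansion step in this example can be sketched in a few lines. This is an illustrative sketch, not the author's implementation: the `Var` class, the `unfold` and `evaluate` helpers, and the sample source contents are assumptions, and the teacher relation is written Teaches throughout. For simplicity the whole join is evaluated at the mediator, whereas a real system would ship sub-queries to sources A and B.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str  # any term that is not a Var is treated as a constant


_fresh = itertools.count(1)


def unfold(atom, views):
    """Replace a global atom by the body of its GAV view definition."""
    pred, args = atom
    if pred not in views:
        return [atom]
    head_vars, body = views[pred]
    theta = dict(zip(head_vars, args))  # head variable -> term in the query
    suffix = next(_fresh)

    def sub(term):
        if not isinstance(term, Var):
            return term
        if term not in theta:  # body-only variable: rename it apart
            theta[term] = Var(f"{term.name}_{suffix}")
        return theta[term]

    return [(p, tuple(sub(t) for t in a)) for p, a in body]


def match(term, value, binding):
    if isinstance(term, Var):
        return binding.setdefault(term, value) == value
    return term == value


def evaluate(atoms, db, binding=None):
    """Naive nested-loop join: yield every binding satisfying all atoms."""
    binding = binding or {}
    if not atoms:
        yield binding
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for row in db[pred]:
        new = dict(binding)
        if all(match(t, v, new) for t, v in zip(args, row)):
            yield from evaluate(rest, db, new)


S, D, C, T = Var("S"), Var("D"), Var("C"), Var("T")
views = {"Stud": ((S, D, T),
                  [("Dept", (D, C)), ("Teaches", (C, T)), ("Enroll", (S, C))])}
db = {  # made-up contents of sources A and B
    "Dept": {("CS", "db"), ("CS", "os"), ("Math", "algebra")},
    "Teaches": {("db", "Beeri"), ("os", "Levi")},
    "Enroll": {("ann", "db"), ("bob", "os"), ("carl", "db")},
}
expanded = unfold(("Stud", (S, "CS", "Beeri")), views)
students = sorted({b[S] for b in evaluate(expanded, db)})
print(students)  # ['ann', 'carl']
```

Expansion is purely syntactic: it needs only the view definitions, which is why query answering in GAV is simple.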

Slide 12: Local as View (LAV)
- The global database is viewed as the "real world"
- Each source is defined as a view on it
- Example (revisited):
  - Global schema: Univ(D, C, T, S)
  - Source A:
    Dept(D, C) :- Univ(D, C, T, S)
    Teaches(C, T) :- Univ(D, C, T, S)
  - Source B:
    Enroll(S, C) :- Univ(D, C, T, S)

Slide 13: Possible assumptions on sources
- A source contains all the data in its defining view (full view)
- A source contains some of the data in its view, usually not all (contained view)
- The second assumption is more realistic
- Example: the global database describes cars for sale. A source may contain
  - only some of the attributes of cars present in the global schema (e.g., it may not contain history or owner contact)
  - only some of the cars for sale
- Obviously, the more sources we have, the more cars
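The difference between a full view and a contained view can be made concrete on the university example, where a source relation Dept(D, C) is defined as a projection of Univ(D, C, T, S). The sketch below is illustrative only: the global database is virtual in practice, so the `univ` instance, the helper names, and the sample tuples are assumptions used to show what the two assumptions mean.

```python
# Hypothetical instance of the global relation Univ(D, C, T, S).
univ = {
    ("CS", "db", "Beeri", "ann"),
    ("CS", "db", "Beeri", "carl"),
    ("CS", "os", "Levi", "bob"),
}


def dept_view(univ_instance):
    """Extension of the view Dept(D, C) :- Univ(D, C, T, S)."""
    return {(d, c) for d, c, t, s in univ_instance}


def classify(source_rows, view_rows):
    """Compare what a source actually holds with its defining view."""
    if source_rows == view_rows:
        return "full"
    if source_rows <= view_rows:
        return "contained"
    return "inconsistent"  # the source holds tuples the view does not allow


view = dept_view(univ)
print(classify({("CS", "db"), ("CS", "os")}, view))  # full
print(classify({("CS", "db")}, view))                # contained
```

Under the contained-view assumption, an answer computed from the sources is only guaranteed to be a subset of the answer over the (virtual) global database.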

Slide 14: Query answering in LAV
- Expansion is not possible
- An approach: answering queries using views
- Practically: rewriting queries using views (the differences are explored later)
- Only the views have data, so rewrite the query to an expression over the views
- The expression must be (explained in more detail later):
  - for full views: equivalent to the query
  - for contained views: contained in the query
- A solution may or may not exist (in contrast to expansion)
- Finding it is more difficult
- This problem was explored in many contexts, e.g., query optimization using views / previous answers

Slide 15: Why prefer LAV to GAV?
- Ease of expanding a system:
  - in GAV, adding a source may require re-definition of the global schema, which makes it difficult to add sources
  - in LAV, just define the new source as a view; given an algorithm for using views to answer queries, the system automatically uses the new source
- As for expanding queries vs. using views: even in GAV, when sources have restricted capabilities, query answering requires using views

Slide 16:
- Typically, a global schema reflects a real "world" as we know it; each source materializes only a fragment
  - Horizontal: not all entity types or attributes are present
  - Vertical: not all entities of a type are present
- Thus, it is natural to define the sources as (contained) views
- Examples:
  - Cars for sale:
    - the global database reflects our understanding and requirements
    - a source provides only some information, and only on the cars it has
  - Looking for personal information using UNIX facilities:
    - we know about: name, office, phone, …
    - each facility may offer only some of the above

Slide 17:
- LAV is a natural approach in the presence of
  - the WWW, with its diversity and the dynamic nature of its sources
  - legacy systems
- Most research efforts & systems are LAV

Slide 18: On rewriting queries using views
- It is not clear (for now) how to obtain a rewriting, given Q
- But, given v1(..), v2(..), …, vn(..) as a candidate, we may
  - expand each vi using its definition in terms of the global schema
  - check whether the resulting expression is equivalent to, or contained in, Q (both Q and the expansion are in terms of global schema relations)
- Equivalence and containment of queries are fundamental problems for data integration

Slide 19: Example (our LAV example)
- Q: ans(S, 'CS', 'Beeri') :- Univ('CS', C, 'Beeri', S)
- Guess an answer in terms of views:
  ans'(S, 'CS', 'Beeri') :- Dept('CS', C), Teaches(C, 'Beeri'), Enroll(S, C)
- Expansion of the view atoms (note: distinct variables must be used in the different expansions for all non-join variables):
  Univ('CS', C, T1, S1), Univ(D2, C, 'Beeri', S2), Univ(D3, C, T3, S)
- Is the query equivalent to this expansion?
- Is the expansion contained in the query?
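The two questions on this slide can be checked mechanically for conjunctive queries with the classical homomorphism (containment mapping) test: Q1 is contained in Q2 exactly when Q2's variables can be mapped to terms of Q1 so that Q2's head becomes Q1's head and every body atom of Q2 becomes a body atom of Q1. The sketch below is an illustration under assumptions, not the slides' algorithm: the `Var` class and the helper names are hypothetical, and the test covers only conjunctive queries without comparisons or negation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str  # any term that is not a Var is treated as a constant


def extend(term, target, h):
    """Try to extend the mapping h so that `term` maps to `target`."""
    if isinstance(term, Var):
        return h.setdefault(term, target) == target
    return term == target  # constants map only to themselves


def homomorphism(src_atoms, dst_atoms, h):
    """Backtracking search for a mapping of all src atoms into dst atoms."""
    if not src_atoms:
        return True
    (pred, args), rest = src_atoms[0], src_atoms[1:]
    for dpred, dargs in dst_atoms:
        if dpred != pred or len(dargs) != len(args):
            continue
        trial = dict(h)
        if (all(extend(t, d, trial) for t, d in zip(args, dargs))
                and homomorphism(rest, dst_atoms, trial)):
            return True
    return False


def contained_in(q1, q2):
    """True iff q1 is contained in q2 (q2 maps homomorphically into q1)."""
    (head1, body1), (head2, body2) = q1, q2
    h = {}
    if len(head1) != len(head2):
        return False
    if not all(extend(t, d, h) for t, d in zip(head2, head1)):
        return False
    return homomorphism(body2, body1, h)


S, C = Var("S"), Var("C")
T1, S1, D2, S2, D3, T3 = (Var(n) for n in ("T1", "S1", "D2", "S2", "D3", "T3"))

query = ((S, "CS", "Beeri"), [("Univ", ("CS", C, "Beeri", S))])
expansion = ((S, "CS", "Beeri"), [("Univ", ("CS", C, T1, S1)),
                                  ("Univ", (D2, C, "Beeri", S2)),
                                  ("Univ", (D3, C, T3, S))])

print(contained_in(query, expansion))  # True
print(contained_in(expansion, query))  # False
```

Under this test the query is contained in the expansion but not the other way around, so the two are not equivalent and the guessed rewriting is not contained in Q: projecting Univ into three separate views loses the fact that the department, teacher, and student belong to the same tuple.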