TU/e eindhoven university of technology / faculty of mathematics and informatics
Technologie van Informatiesystemen (Technology of Information Systems, TIS), lecture 3

Contents
– Introduction, 30/11
– Web engineering & Web information systems, 7/12
– Data transformation & Data integration, 14/12
– ERP, Smulders (Deloitte), 21/ /1
– FLOWer, Berens (Pallas Athena), 25/1 + 1/2
– BizTalk, van den Boom (Microsoft), 15+22/2
Philippe Thiran

Data Transformation & Data Integration
Philippe Thiran
Computer Science Department, Technische Universiteit Eindhoven, The Netherlands

Data Transformation & Integration: Agenda
– Problem Statement
    Existing database systems
    Heterogeneity, distribution, autonomy
– Data Transformation
    Schema conversion
    Query conversion: Wrapper
– Data Integration
    Schema integration
    Query processing: Multidatabase and Federation

Problem Statement
    Existing database systems
    Heterogeneity, distribution, autonomy

Problem Statement: Existing Database Systems
– Data are recorded in existing database systems
– Existing database systems are:
    Mission critical (essential to the organization's business)
    Required to be operational at all times
    Inflexible
– Typically, existing database systems are:
    Very large (millions of lines of code)
    Old (often more than 10 years old)
    Written in old programming languages such as COBOL, PL/1, SQL!
    Built around an old DBMS

Problem Statement: Existing Database Systems
– Data are recorded in existing database systems
– These systems answer old requirements; new demands arise:
    New functions and services
    New user requirements
    New technology (Web)
    Communication among them?

Problem Statement: Existing Database Systems
Existing Systems: New Services
– How to deal with existing database systems?
    Abandon the existing systems: migration to a new system
    Keep and modify the existing systems
    Keep the existing systems and wrap them: autonomy
Existing Systems: Communication
– How to integrate existing database systems?

Problem Statement: Data Integration
Data Integration Problems
– Integrating database systems is very hard and costly
– Three main dimensions of the problem:
    Distribution
    Autonomy
    Heterogeneity
[Figure: three axes (Distribution, Autonomy, Heterogeneity) on which centralized DBMS and distributed databases are positioned]

Problem Statement: Data Integration
Autonomy
– Autonomy refers to the distribution of control
– Four dimensions of autonomy:
    Design: each system has its own data models and its own transaction management technique
    Communication: a system has no knowledge of the existence of other systems, nor of how to communicate with them
    Execution: each system executes independently of the other systems
    Association: each system decides how much of its data and processing capabilities it will share with the other systems

Problem Statement: Data Integration
Heterogeneity
– Heterogeneity may exist at three basic levels:
    DBMS level. Data is managed by a variety of DBMSs based on different data models and data languages
      – Data models: relational model, hierarchical model and file model
      – Data languages: SQL, DL/1, COBOL programs
    Platform level. Different hardware, different network protocols
    Semantic level. Different designer viewpoints in modelling the same objects of the application domain; incompatible design specifications lead to different naming, types or integrity constraints

Data Integration: Generic Integration Architecture
Schema Hierarchy
[Figure: schema hierarchy, from the local models up to the common model]
– Local models: Database Schema 1 (DB1, relational DBMS), Database Schema 2 (DB2, OO DBMS), Data Schema 3 (file system)
– Export Schemas 1 to 3: unify the data models
– Import Schemas 1 to 3: views on the export schemas, available for non-local access
– Integrated Schema: homogenizes and unions the import schemas

Data Integration: Generic Integration Architecture
Schema Hierarchy (ctd)
[Figure: the same schema hierarchy (database schemas in the local models; export, import and integrated schemas in the common model), annotated with two tasks]
– Data and Schema Transformation
– Data and Schema Integration

Data Transformation
    Schema Conversion
    Query Conversion: Wrapper

Data Transformation: Schema Conversion
Introduction
– Schema conversion
– Query/Data conversion
[Figure: for each data source i (i = 1, 2), Database Schema i in the local data model corresponds to Export Schema i in the common data model; the labels Query i / Query i' and Data i / Data i' mark the query and data conversions between the two models]

Data Transformation: Schema Conversion
Schema Conversion
– Schema transformation
    Transformation of a schema expressed in a data model (Ms) into an equivalent schema expressed in another data model (Mt)
    Examples
      – ER model → Relational model (lecture ISO)
      – Relational model → XML Schema (see later)
    Schema transformation operators
    Schema conversion consists in applying the relevant transformations to the relevant constructs of the schema expressed in Ms, in such a way that the final result complies with Mt

Data Transformation: Schema Conversion
Schema Conversion
– Schema transformation
    A (schema) transformation is basically an operator by which a source data structure C is replaced with a target structure C'.
    Example of a semantics-preserving transformation: transforming a relationship type into an attribute
[Figure: RT-FK, transforming a binary relationship type into a foreign key. One side shows entity types A (A1) and B (B1, B2; id: B1) linked by relationship type R (cardinality N); the other side shows A (A1, B1; ref: B) and B (B1, B2; id: B1)]

Data Transformation: Schema Conversion
Schema Conversion
– 2 main schema transformations for ER model → Relational model
    RT-ET: transforming a relationship type into an entity type. Inverse: ET-RT
    RT-FK: transforming a binary relationship type into a foreign key. Inverse: FK-RT
[Figure: RT-FK example. Entity types A (A1) and B (B1, B2; id: B1) linked by relationship type R (cardinality N) correspond to A (A1, B1; ref: B) and B (B1, B2; id: B1)]

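The RT-FK transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classes `EntityType` and `RelationshipType` and the function `rt_fk` are hypothetical names, not part of the lecture material, and the relationship type is assumed to be binary and many-to-one from A to B, as the slide's figure suggests.

```python
from dataclasses import dataclass, field


@dataclass
class EntityType:
    name: str
    attributes: list[str]
    identifier: list[str] = field(default_factory=list)
    # Maps a foreign-key attribute to the entity type it references.
    foreign_keys: dict[str, str] = field(default_factory=dict)


@dataclass
class RelationshipType:
    name: str
    many_side: str  # each instance of this entity type links to at most one partner
    one_side: str   # the entity type that gets referenced


def rt_fk(entities: dict[str, EntityType], rel: RelationshipType) -> dict[str, EntityType]:
    """Replace a binary many-to-one relationship type by a foreign key."""
    source = entities[rel.many_side]
    target = entities[rel.one_side]
    if not target.identifier:
        raise ValueError(f"{target.name} needs an identifier to be referenced")
    new_keys = dict(source.foreign_keys)
    new_keys.update({attr: target.name for attr in target.identifier})
    converted = dict(entities)  # the input schema is left untouched
    converted[source.name] = EntityType(
        name=source.name,
        attributes=source.attributes + target.identifier,
        identifier=list(source.identifier),
        foreign_keys=new_keys,
    )
    return converted


schema = {
    "A": EntityType("A", ["A1"]),
    "B": EntityType("B", ["B1", "B2"], identifier=["B1"]),
}
converted = rt_fk(schema, RelationshipType("R", many_side="A", one_side="B"))
print(converted["A"])
```

After the call, A carries the extra attribute B1 with a reference to B, and the relationship type R no longer needs a construct of its own, which is what a relational target model requires.
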
Data Transformation: Schema Conversion
Schema Conversion
– Exercise: from ER model → Relational model

Data Transformation: Wrappers
Definition
– A wrapper controls a (legacy) data source
– Basically, a wrapper is a software component that offers a homogeneous query interface based on a common data model (XML for the Web)
– It converts data and queries between the common data model and a local data model
    It offers an adequate way of solving the DBMS heterogeneity that appears when one wants to integrate existing and heterogeneous data systems
[Figure: a wrapper on top of a data source; the database schema (local data model) is exposed as an export schema (common data model, common query language)]

Data Transformation: Wrappers
Definition (ctd)
– A data wrapper is basically defined as a converter of data and queries
– That is, a wrapper:
    Offers an export schema in the common data model
    Accepts queries against the export schema
    Translates them into queries understandable by the data system
    Transforms the results of the local queries into a format understood by the application
[Figure: queries and data pass through the wrapper, between the export schema (common data model, common query language) and the database schema of the data source (local data model, local query language)]

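The four duties listed above can be illustrated with a small two-way wrapper. The sketch below is an assumption-laden example, not a reference design: the class `RelationalWrapper` and its methods are hypothetical, SQLite stands in for the legacy data source, XML elements stand in for the common data model, and the "query language" is reduced to equality filters on exported columns. The sample rows reuse the Order table from the later slides.

```python
import sqlite3
import xml.etree.ElementTree as ET


class RelationalWrapper:
    """Hypothetical two-way wrapper: requests in the common model in, SQL out, XML back."""

    def __init__(self, connection, table, columns):
        self.connection = connection
        self.table = table
        self.columns = list(columns)

    def export_schema(self):
        # Export schema: one element type per table, one child element per column.
        return {self.table: list(self.columns)}

    def translate(self, filters):
        # Translate a request against the export schema into a local SQL query.
        unknown = set(filters) - set(self.columns)
        if unknown:
            raise ValueError(f"not in export schema: {sorted(unknown)}")
        column_list = ", ".join(self.columns)
        sql = f'SELECT {column_list} FROM "{self.table}"'
        if filters:
            sql += " WHERE " + " AND ".join(f"{name} = ?" for name in filters)
        return sql, list(filters.values())

    def query(self, **filters):
        # Run the local query and convert its rows into the common data model.
        sql, params = self.translate(filters)
        root = ET.Element(f"{self.table}s")
        for row in self.connection.execute(sql, params):
            element = ET.SubElement(root, self.table)
            for column, value in zip(self.columns, row):
                ET.SubElement(element, column).text = str(value)
        return root


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Order" (Id INTEGER, Custname TEXT, Custnum INTEGER)')
conn.executemany(
    'INSERT INTO "Order" VALUES (?, ?, ?)',
    [(10, "Philips", 7734), (9, "Unilever", 7725)],
)
wrapper = RelationalWrapper(conn, "Order", ["Id", "Custname", "Custnum"])
result = wrapper.query(Custname="Philips")
print(ET.tostring(result, encoding="unicode"))
```

The application only sees the export schema and XML results; the SQL text and the quoting needed for the table name stay inside the wrapper.
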
Data Transformation: Wrappers
Categories of Wrappers
– There exists no standard approach to build wrappers
– Functionality
    One-way: only transformation of data (e.g., for data warehouses)
    Two-way: transformation of requests and data
– Development
    Hard-wired wrappers, for specific data sources
    Semi-automated generation: wrapper development tools
    Automatically generated wrappers
– Availability
    Standalone programs (data conversion, data migration)
    Components of a federation (see later)
    Database interface for foreign data

Data Transformation: Wrappers
Wrappers and the Web
– Wrapper interface
    Data format: XML
    Common data model: XML DTD and Schema
    Common query language: XPath, XQuery, none
– Wrapper mapping
    Generally between relational data and XML
    Two translation types
      – Automated
      – Defined by the user
    XML- or SQL-oriented query language

Data Transformation: Wrappers
XML Views of Relational Databases
– Automated translation
    Order
      Id  Custname  Custnum
      10  Philips   7734
      9   Unilever  7725
    Item
      Oid  Desc            Cost
      10   Ship Generator  8000
    Payment
      Oid  Due  Amt
      (rows garbled in the source: 101/10/ /10/)
[Figure: generated XML document; the tags were lost in extraction, the remaining text reads: 10 Philips Unilever Ship Generator 8000 similar to and]

Data Transformation: Wrappers
XML Views of Relational Databases
– User-defined translation
    Same Order, Item and Payment tables as for the automated translation
[Figure: user-defined XML document; the tags were lost in extraction, the remaining text reads: Philips …]

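The difference between the two translation types can be made concrete with a short Python sketch. Everything here is illustrative: the element names (`DB`, `Row`, `Customers`, `Customer`) and both functions are assumptions, since the XML on the slides did not survive; only the Order and Item rows come from the slides.

```python
import xml.etree.ElementTree as ET

ORDER = [
    {"Id": 10, "Custname": "Philips", "Custnum": 7734},
    {"Id": 9, "Custname": "Unilever", "Custnum": 7725},
]
ITEM = [{"Oid": 10, "Desc": "Ship Generator", "Cost": 8000}]


def automated_view(tables):
    """Canonical mapping: table -> element, row -> Row element, column -> child element."""
    root = ET.Element("DB")
    for name, rows in tables.items():
        table_element = ET.SubElement(root, name)
        for row in rows:
            row_element = ET.SubElement(table_element, "Row")
            for column, value in row.items():
                ET.SubElement(row_element, column).text = str(value)
    return root


def user_defined_view(orders, items):
    """Hand-written mapping: nest the items of each order under its customer."""
    root = ET.Element("Customers")
    for order in orders:
        customer = ET.SubElement(root, "Customer", name=order["Custname"])
        for item in items:
            if item["Oid"] == order["Id"]:  # join Item.Oid with Order.Id
                ET.SubElement(customer, "Item", cost=str(item["Cost"])).text = item["Desc"]
    return root


automated = automated_view({"Order": ORDER, "Item": ITEM})
nested = user_defined_view(ORDER, ITEM)
print(ET.tostring(automated, encoding="unicode"))
print(ET.tostring(nested, encoding="unicode"))
```

The automated view needs no knowledge of the application but mirrors the flat relational structure; the user-defined view encodes a design decision (here, grouping by customer) that no generic algorithm could guess.
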
Data Transformation: Wrappers
XML Views of Relational Databases
– Exercises
    What is the XML document for this relational database?
    <!ATTLIST Order OrderID ID #REQUIRED>
    <!ATTLIST Detail Product IDREF #REQUIRED>
    <!ATTLIST Product Reference ID #REQUIRED Label CDATA #IMPLIED UnitPrice CDATA #REQUIRED>

Data Transformation: Wrappers
XML Views of Existing Relational Databases
– Mapping definition
    SQL-oriented query language
    For $b in SQL("select * from Order where Custname='" + $x + "'") return {$b/Id} {$x}
    Source table Order
      Id  Custname  Custnum
      10  Philips   7734
      9   Unilever  7725
    Resulting view: Order (Id, Custname)

Data Transformation: Wrappers
XML Views of Existing Relational Databases
– XML view definition
    Bottom-up (from the relational schema)
    Top-down (from a given XML schema)
– Mappings between XML views and relational schemas
    Automated (algorithm)
    Manual (defined by the user)

Data Transformation: Wrappers
XML Views of Existing Relational Databases
– Examples (cells listed in source order; empty cells were lost in extraction, so shorter rows cannot be aligned to columns with certainty)
    Columns: Product Name | SQL-written Mapping | XML-written Mapping | XML Schema | Query over views
    Xperanto (IBM) | no | yes (XQuery) | XML Schema | yes (XQuery) | update
    Microsoft's SQL Server | yes (FOR XML clause) | no | XDR Schema | yes (XPath)
    DB2 (IBM) | no | yes (subset of XQuery) | yes (XQuery) | no
    Oracle9i | yes | no
    SilkRoute (AT&T) | no | yes (XQuery) | XML Schema | yes (XQuery) | update

Data Integration
    Generic Integration Architecture
    Schema Integration
    Query Processing: multidatabase and federation

Data Integration
    Generic Integration Architecture
    Schema Integration

Data Integration: Generic Integration Architecture
Component Architecture
[Figure: Applications 1 to 3 access a mediator, which accesses one wrapper per source (DBMS 1 to 3 with DB1 to DB3)]
– Mediator (integrated schema over the import schemas, common DDL/DML)
    Offers an abstract integrated view of sources
    Reconciles independent data structures to yield a unique, coherent view of the data
– Wrapper (export schema over the database schema; common DDL/DML on one side, local DDL/DML on the other)
    Controls a local data source
    Offers a homogeneous query interface based on a common data model

Data Integration: Generic Integration Architecture
Aspects to Consider for Integration
– General issues
    Bottom-up vs. top-down engineering
      – From existing schemas to the integrated schema, or vice versa
      – Schema integration vs. schema matching
    Virtual vs. materialized integration
    Read-only vs. read-write access
    Transparency
      – Language, schema, location
– Data model related issues
    Types of sources
      – Structured, semi-structured, unstructured
    Common data model of the integrated system
    Tight vs. loose integration
      – Use of a global schema
    Query model

Data Integration: Schema Integration
Methodology
– Bottom-up process
– Four main steps
    Preparing the local schemas
    Detecting what is common between the components of the local schemas
      – Correspondence (what is common)
    Solving the conflicts
      – Conflict (what is incompatible)
    Integrating the different schemas according to the correspondences and conflicts detected in the previous steps

Data Integration: Schema Integration
Concept of Correspondence
– Two complementary views of correspondence:
    Structural correspondence (schema level: concepts)
    Instance correspondence (instance level: data)
– Structural correspondence
    Five types of structural correspondence:
      – Identity
      – Independence
      – Complementarity
      – Subtyping
      – Common supertype

Data Integration: Schema Integration
Concept of Correspondence
– Instance correspondence
    Four types of instance correspondence:
      – Disjoint: the instance sets of the classes are disjoint
      – Inclusion: the instance set of one class is included in that of another class
      – Equivalence: the classes contain the same instances
      – Overlapping: the classes share some instances but not all

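The four instance correspondences map directly onto set comparisons. A minimal sketch, assuming that the instances of each class can be identified by comparable keys (the function name `instance_correspondence` is hypothetical):

```python
def instance_correspondence(a: set, b: set) -> str:
    """Classify how the instance sets of two classes relate."""
    if a == b:
        return "equivalence"   # same instances
    if a <= b or b <= a:
        return "inclusion"     # one set is contained in the other
    if a & b:
        return "overlapping"   # some shared instances, but not all
    return "disjoint"          # nothing in common


print(instance_correspondence({1, 2}, {2, 3}))
```

The order of the tests matters: equivalence is a special case of inclusion, and inclusion a special case of overlap, so the most specific relation is checked first.
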
Data Integration: Schema Integration
Concept of Conflict
– Conflicts occur in four possible ways: syntactic (naming conflicts), structural, semantic or instance
– Syntactic conflicts (resolution: use of an ontology)
    Synonyms. Two identical objects (entities, attributes, relationships) that have different names are synonyms
    Homonyms. Two different objects that have identical names are homonyms
– Structural conflicts (resolution: mapping function or transformation)
    Domain. Two identical objects have different domains (differences in dimension, units and scales)
    Structure. The same concept is represented by different data structures (e.g., different attributes)

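Both resolution strategies named above can be sketched together: a lookup table playing the role of a (very small) ontology for synonyms, and a mapping function for a domain conflict. The attribute names, the site labels and the assumption that one site stores amounts in cents are all hypothetical and chosen only for illustration.

```python
# Hypothetical mini-ontology: local attribute name -> concept of the integrated schema.
ONTOLOGY = {
    "Price": "price",
    "Cost": "price",            # synonym of Price
    "Custname": "customer_name",
    "Name": "customer_name",    # synonym of Custname
}

# Mapping functions for domain conflicts; site2 is assumed to store cents, not euros.
CONVERTERS = {
    ("site2", "price"): lambda cents: cents / 100,
}


def to_integrated(site, record):
    """Rename attributes via the ontology and convert values to the integrated domain."""
    unified = {}
    for local_name, value in record.items():
        concept = ONTOLOGY.get(local_name, local_name)
        convert = CONVERTERS.get((site, concept), lambda v: v)
        unified[concept] = convert(value)
    return unified


print(to_integrated("site1", {"Price": 80.0, "Custname": "Philips"}))
print(to_integrated("site2", {"Cost": 8000, "Name": "Philips"}))
```

Homonyms would need the opposite treatment: the lookup key would have to include the site, so that the same local name can map to different concepts.
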
Data Integration: Schema Integration
Concept of Conflict
– Structural conflict
    In the left-hand schema (Site 1), Address is a compound attribute, whereas in the right-hand one (Site 2), Address is represented by an entity type
    Resolution: transformation
[Figure: schemas of Site 1 and Site 2]

Data Integration: Schema Integration
Concept of Conflict
– Semantic conflicts
    A semantic conflict arises when there is a contradiction between two representations A and B of the same application domain concept, or between two integrity constraints (resolution?)
    Example
      – In the left-hand schema (Site 1), Customer is identified by CustId, whereas in the right-hand one (Site 2), it is identified by Name
[Figure: schemas of Site 1 and Site 2]

Data Integration: Schema Integration
Concept of Conflict
– Instance conflicts
    Instance conflicts are specific to existing data
    Modelling constructs A and B that are recognized as corresponding can cover sets with different scopes
    Examples
      – ZIP codes of addresses can be written like “NL-5600 MB” or “56oo MB” or “5600”
      – Different ZIP codes can be recorded for the same address (encoding errors)
      – Resolution: Data transforming… cleaning?

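The ZIP code variants quoted above can be brought to one canonical form with a small cleaning function. This is a sketch under assumptions: the function name `normalize_zip` and the chosen canonical form (four digits, optionally followed by a space and two letters) are illustrative, and only the three spellings from the slide are handled deliberately.

```python
import re


def normalize_zip(raw):
    """Return a canonical Dutch-style ZIP code, or None if the value is unusable."""
    text = raw.strip().upper()
    text = re.sub(r"^NL-?\s*", "", text)              # drop the country prefix
    match = re.fullmatch(r"([0-9O]{4})\s*([A-Z]{2})?", text)
    if match is None:
        return None
    digits = match.group(1).replace("O", "0")          # letter O typed instead of zero
    letters = match.group(2)
    return f"{digits} {letters}" if letters else digits


for value in ["NL-5600 MB", "56oo MB", "5600"]:
    print(value, "->", normalize_zip(value))
```

Cleaning repairs the spelling but not the scope: "5600" still lacks the letter part, so deciding whether it denotes the same address as "5600 MB" remains an integration decision.
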
Data Integration
    Query Processing: multidatabase and federation

Data Integration: Integration Architecture
Three Classical Architectures
– Multidatabases
    No integrated schema
    Integrated access to different relational DBMSs
– Federated Databases
    Integrated schema
    Integrated access to different DBMSs
    Integrated access to different data sources (on the Web)
– Data Warehouses
    Materialized integrated data sources
    Not covered here

Data Integration: Query Processing
Classical Architecture: Multidatabase
– Enables transparent access to multiple (relational) databases
    Hides distribution and different SQL variants
    Processes queries and updates against multiple databases (2-phase commit)
    Does not provide any type of global schema (does not hide the different database schemas)
    Example: IBM DataJoiner
[Figure: DataJoiner reaches a Sybase server through Sybase Open Client and an Oracle server through Oracle SQL*Net, over a TCP/IP network]

Data Integration: Query Processing
Classical Architecture: Multidatabase
– Multidatabase schema
[Figure: the multidatabase schema placed over Source 1 (Sybase) and Source 2 (Oracle)]

Data Integration: Query Processing
Classical Architecture: Multidatabase
– Query processing
    Query against the multidatabase schema:
      SELECT p2.title FROM Sybase.PUBLICATIONS p1, Oracle.PAPERS p2 WHERE p1.title = p2.title
    Subquery for Source 1 (Sybase): SELECT title FROM PUBLICATIONS
    Subquery for Source 2 (Oracle): SELECT title FROM PAPERS
    The Sybase data and the Oracle data are returned to the multidatabase layer

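The flavour of such a cross-database query can be tried out locally. The sketch below is not DataJoiner: it swaps in SQLite's `ATTACH DATABASE`, which lets one connection address several databases by name, and two in-memory databases named `sybase` and `oracle` stand in for the real servers. The paper titles are invented sample data.

```python
import sqlite3

# One connection, two separately attached databases playing the two sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sybase")
conn.execute("ATTACH DATABASE ':memory:' AS oracle")

conn.execute("CREATE TABLE sybase.PUBLICATIONS (title TEXT)")
conn.execute("CREATE TABLE oracle.PAPERS (title TEXT)")
conn.executemany(
    "INSERT INTO sybase.PUBLICATIONS VALUES (?)",
    [("Schema Integration",), ("Wrappers",)],
)
conn.executemany(
    "INSERT INTO oracle.PAPERS VALUES (?)",
    [("Wrappers",), ("Mediators",)],
)

# The user names each source and each local table explicitly: no global schema.
rows = conn.execute(
    "SELECT p2.title "
    "FROM sybase.PUBLICATIONS p1, oracle.PAPERS p2 "
    "WHERE p1.title = p2.title"
).fetchall()
print(rows)
```

As on the slide, the query text itself reveals which source holds which table; the user, not the system, knows that PUBLICATIONS and PAPERS describe the same kind of object.
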
Data Integration: Query Processing
Classical Architecture: Multidatabase
Main properties
– Transparency
    Low level of transparency provided to the user (the user is responsible for finding the relevant information, understanding each database schema, detecting and resolving the semantic conflicts, and finally building the required view of the data in the sources)
– Autonomy
    Not intrusive against the autonomy of the data sources
    Suitable when component systems are strongly autonomous
– Methodology
    Simplicity, since there is no schema integration
– Maintenance and evolution
    No integrated schema maintenance

Data Integration: Query Processing
Classical Architecture: Federation
– Integrated schema(s) and unique interface
    Hides the semantic and location heterogeneity
    Wrapper/Mediator hierarchy
      – Wrapper
          » Controls a local data source
          » Offers a homogeneous query interface based on a common data model
      – Mediator
          » Offers an abstract integrated view of several sources
          » Reconciles independent data structures to yield a unique, coherent view of the data
– Research projects
    Tsimmis (Stanford)
    Garlic (IBM)
    Oasis (Dublin University)

Data Integration: Query Processing
Classical Architecture: Federation
– Typical example
[Figure: views are defined on the integrated schema of a mediator; the mediator builds on the import schemas, which come from wrappers (each provides an export schema) on top of an Oracle SQL DBMS and an XML DBMS]
    Entity types shown in the figure:
      Authors (ANR, Title, FirstName, Surname, Affiliation; id: ANR)
      Publication (PNR, Title, Authors, Journal, Pages; id: PNR)

Data Integration: Query Processing
Classical Architecture: Federation
– Typical example
[Figure: views, import schema of DB1, import schema of DB2, and the integrated schema]

Data Integration: Query Processing: Federation
– Submit query Q
    Q = FOR $b IN //Book RETURN $b/author
– Queries against the two sources
    Q1 = FOR $b IN //Book RETURN $b/authors
    Q2 = FOR $b IN //book RETURN $b/author
– Local queries
    Q1' = SELECT a.name FROM AUTHORS a (Oracle SQL DBMS)
    Q2' = //book/author (XML DBMS)
– Local results
    A1 = { … }, converted into A1' = { … }
    A2 = { … }
– Return result A
    A = A1' ∪ A2

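The query flow above (submit Q, send one query per source, run the local queries, convert and union the answers) can be mimicked end to end. This is a toy sketch, not any product's API: the classes `SqlWrapper`, `XmlWrapper` and `Mediator` are hypothetical, SQLite replaces the Oracle DBMS, an in-memory XML document replaces the XML DBMS, and the author names are invented.

```python
import sqlite3
import xml.etree.ElementTree as ET


class SqlWrapper:
    """Wrapper over a relational source (SQLite stands in for Oracle)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE AUTHORS (name TEXT)")
        self.conn.executemany(
            "INSERT INTO AUTHORS VALUES (?)", [("Alice",), ("Bob",)]
        )

    def authors(self):
        # Q1': the local SQL query; the rows (A1) are converted into a set (A1').
        rows = self.conn.execute("SELECT a.name FROM AUTHORS a")
        return {name for (name,) in rows}


class XmlWrapper:
    """Wrapper over an XML source; its results are already in the common model."""

    def __init__(self):
        self.doc = ET.fromstring(
            "<library>"
            "<book><author>Bob</author></book>"
            "<book><author>Carol</author></book>"
            "</library>"
        )

    def authors(self):
        # Q2': the local path query //book/author.
        return {element.text for element in self.doc.findall(".//book/author")}


class Mediator:
    """Sends the request to every wrapper and unions the answers: A = A1' ∪ A2."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def authors(self):
        result = set()
        for wrapper in self.wrappers:
            result |= wrapper.authors()
        return result


answer = Mediator([SqlWrapper(), XmlWrapper()]).authors()
print(sorted(answer))
```

The application calls only `Mediator.authors()`; which sources exist, which query language each speaks and that one author appears in both are all hidden behind the integrated view.
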
Data Integration: Query Processing
Classical Architecture: Federation
Main properties
– Transparency
    High level of transparency provided to the user. The user is not aware of the distribution and the heterogeneity of the integrated data sources
– Autonomy
    Each local data source has control over its sharable information
– Methodology
    Problems of defining an integrated schema
Web as Loosely Coupled Federation
– Many different, widely distributed information systems
– Heterogeneity
    Structurally homogeneous: XML
    Semantically heterogeneous: no explicit schemas (ontology?)
– Autonomy
    Runtime autonomy: pages change on average every 4 weeks, dangling links
– Distribution
    Replication (proxies) and caching frequently used