1 Information Architectures course (Corso di Architetture della Info), A.A., Carlo Batini. Data Integration systems: architectural elements
2 Data Integration (or mediator) systems
3 Data Integration definition
Data integration is a major research and business area whose main purpose is to provide a user with uniform access to multiple, autonomous, heterogeneous data sources through the presentation of a unified view of these data. Building this unified view is complex, because one has to find the differences and similarities among the source schemas in order to reconcile them.
The advantage of data integration architectures over federated architectures
They manage:
– schema-level heterogeneities more complex than those handled in federated databases
– (to some extent) instance-level heterogeneities due to quality errors in the data (problems of accuracy, currency, completeness, consistency, etc.)
5 Data integration – several approaches
Data integration stands for several approaches for combining data from different data sources [Hull, 1997]:
– Integrated read-only views (mediation): support an integrated, read-only view of data that resides in multiple databases (the approach of the majority of academic and commercial systems)
– Integrated read-write views (mediation with update): an extension of the mediation architecture that supports updates against an integrated view
Initially, we will deal only with the first approach.
Schema level heterogeneities
NB: “heterogeneity” and “conflict” are used as synonyms in the following.
Schema-level heterogeneities are of two types:
– Name heterogeneities
– Type heterogeneities
Name heterogeneities
– Synonyms: different names for the same concept (employee, clerk; exam, course; code, num)
– Homonyms: the same name for different concepts (e.g. Employee meaning employee in one schema and vendor in another schema)
Examples of name heterogeneities (name conflicts)
– Homonyms: the attribute price of Product means production price in one schema and sale price in the other
– Synonyms: Department in one schema, Division in the other
Type conflicts
The same concept is represented with different conceptual structures in two schemas:
– Different definition domains for the same attribute in two schemas
– Attribute in one schema and derived value in another schema
– Attribute in one schema and entity in another schema
– Attribute in one schema and generalization hierarchy in another schema
– Entity in one schema and relationship in another schema
– Different abstraction levels for the same concept in two schemas, e.g. two entities with homonym names related by an IS-A hierarchy in two schemas
– Different granularities in the definition domains
– Different cardinalities in the same relationships
– Key conflicts
Examples follow in the next slides.
Examples of type conflicts - 1
Type conflicts in a single attribute (e.g. NUMERIC, ALPHANUMERIC, ...)
– The attribute “gender” may be encoded as Male/Female, as M/F or as 0/1; in Italy it is also implicit in the “codice fiscale” (the counterpart of an SSN)
– Year has a four-digit domain in one schema and a two-digit domain in another schema
Examples of type conflicts - 2
– different currencies (euros, US dollars, etc.)
– different measurement systems (kilos vs. pounds, Celsius vs. Fahrenheit)
– different granularities (grams, kilos, etc.)
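The value-level conflicts listed above are usually resolved by normalizing each source to one agreed representation. The following is a minimal sketch, not a prescribed method: the function names, the target encoding `M`/`F`, the pivot year and the per-source numeric code table are all assumptions made for illustration.

```python
# Hypothetical normalizers for single-attribute type conflicts.
# Target representation (an assumption): gender as "M"/"F", four-digit years,
# kilograms and degrees Celsius.

GENDER_CODES = {"male": "M", "female": "F", "m": "M", "f": "F"}

def normalize_gender(value, numeric_codes=None):
    """Map a source-specific gender encoding to 'M' or 'F'.

    numeric_codes is a per-source table such as {0: "M", 1: "F"}; which
    number means what cannot be guessed, so the caller must supply it.
    """
    if numeric_codes and value in numeric_codes:
        return numeric_codes[value]
    key = str(value).strip().lower()
    if key not in GENDER_CODES:
        raise ValueError(f"unknown gender encoding: {value!r}")
    return GENDER_CODES[key]

def normalize_year(value, pivot=50):
    """Expand a two-digit year to four digits using an assumed pivot."""
    year = int(value)
    if year >= 100:          # already four digits
        return year
    return 1900 + year if year >= pivot else 2000 + year

POUNDS_PER_KILO = 2.20462

def pounds_to_kilos(pounds):
    """Convert a weight in pounds to kilograms."""
    return pounds / POUNDS_PER_KILO

def fahrenheit_to_celsius(deg_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (deg_f - 32) * 5 / 9
```

The numeric gender table is passed in explicitly because the meaning of 0 and 1 is a property of each source, which is exactly why such conflicts cannot be resolved without source metadata.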
Examples of type conflicts - 3: structure conflicts
[Figure: pairs of ER schemas. Labels: Person with WOMAN and MAN vs. Person with GENDER; PUBLISHER and BOOK; EMPLOYEE, DEPARTMENT, PROJECT vs. EMPLOYEE, PROJECT]
Examples of type conflicts - 4: dependency (or cardinality) conflicts
[Figure: two ER schemas. Labels: EMPLOYEE, DEPARTMENT, PROJECT vs. EMPLOYEE, PROJECT, with cardinalities 1:1 and 1:n]
Examples of type conflicts - 5: key conflicts
[Figure: two versions of PRODUCT. Labels: CODE, LINE vs. CODE, DESCRIPTION]
16 Data integration
The research community has been investigating data integration for about 20 years. Different research communities (database, artificial intelligence, semantic web) have been addressing issues related to data integration:
– Definitions, architectures and classifications of the problems to be addressed have been proposed
– Data integration problems have been analyzed from different perspectives, and different approaches have been proposed
– Benchmarks have been developed that allow the approaches to be evaluated and compared (e.g. the THALIA benchmark)
– Several commercial software suites have been released and are being tested in real environments
17 Integration of Heterogeneous & Distributed Data Sources
“Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data” [Lenzerini, 2002]. The unified view is the Global Virtual Schema (GS).
[Figure: a query is posed on the Global Schema (GS); mappings connect the GS to the local schemas of the sources (DB, file, XML)]
18 Main elements of DI architecture
Three main elements can be distinguished in the architecture of a schema integration system:
– a global schema
– one or more source (local) schemas
– mappings between the global schema and the source (local) schemas
19 Typical architecture of a data integration system
[Figure: a user query is posed to the mediator, which holds the global schema and the mappings; below it, wrappers expose local schemas 1..n over Source 1, Source 2, ..., Source n]
20 Definitions of global schema and mappings
The global schema is the schema that represents the whole universe of discourse. The mappings, or connections, describe how each element in the local schemas relates to the global schema. (Remark: mappings can be expressed in either of the two directions…)
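As a rough illustration of these definitions, a global schema and its mappings can be written down as plain data. The sketch below is an assumption-laden toy: the source names `source1` and `source2`, the attributes `name` and `title`, and the helper `to_global` are hypothetical, while Department/Division and code/num reuse the synonym examples given earlier.

```python
# Hypothetical global schema: one entity with two attributes.
GLOBAL_SCHEMA = {"Department": ["code", "name"]}

# One mapping per source:
# local entity -> (global entity, {local attribute: global attribute}).
MAPPINGS = {
    "source1": {"Department": ("Department", {"code": "code", "name": "name"})},
    "source2": {"Division": ("Department", {"num": "code", "title": "name"})},
}

def to_global(source, local_entity, record):
    """Rename a local record's attributes according to the source's mapping."""
    global_entity, attr_map = MAPPINGS[source][local_entity]
    translated = {attr_map[k]: v for k, v in record.items() if k in attr_map}
    missing = set(GLOBAL_SCHEMA[global_entity]) - set(translated)
    if missing:
        raise ValueError(f"unmapped global attributes: {sorted(missing)}")
    return global_entity, translated
```

For instance, `to_global("source2", "Division", {"num": 7, "title": "Sales"})` yields a `Department` record with attributes `code` and `name`. Real mappings are usually queries or views rather than attribute renamings; this sketch covers only the simplest case.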
21 Typical architecture of a data integration system: direction of the mappings
[Figure: the architecture diagram (user query, mediator with global schema and mappings, wrappers, local schemas 1..n, Source 1 ... Source n) shown twice, once with mappings going from local schemas to the global schema and once from the global schema to local schemas]
22 Definitions of global schema and mappings (summary)
In summary, the essence of integration is to combine information in a logical way, so that it can be queried as a whole through a common interface. To enable querying, the schema of each information source must be connected to the global schema of the common interface through a mapping, and mappings can be expressed in either direction.
23 Mediators (1) (Wise 2009 – Poznan (PL), Università di Modena e Reggio Emilia & Milano Bicocca)
[Figure: query interface, global schema, view mapping, local schemata, local sources, result set]
SOURCE 1: Professor (first_name, last_name, , area)
SOURCE 2: Faculty_member (name, mail, research_topic)
GLOBAL SCHEMA: Full_professor (name, mail, area)
User query: search the mail of professors whose research activities are in the “Database” area
Sub-query on Source 1: Select From Professor Where area = “Database”
Sub-query on Source 2: Select mail From Faculty_member Where research_topic = “Database”
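One way to make this example concrete is to define the global relation as a view over the two sources and let the database engine unfold the query. This is only a sketch under strong simplifying assumptions: both sources live in the same in-memory SQLite database, the sample rows are invented, and the attribute of Professor that is missing in the slide is assumed to be `mail`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source 1 (the third column name is an assumption)
    CREATE TABLE Professor (first_name TEXT, last_name TEXT, mail TEXT, area TEXT);
    -- Source 2
    CREATE TABLE Faculty_member (name TEXT, mail TEXT, research_topic TEXT);

    -- Invented sample rows
    INSERT INTO Professor VALUES ('Ada', 'Rossi', 'ada@example.org', 'Database');
    INSERT INTO Professor VALUES ('Bo', 'Verdi', 'bo@example.org', 'Networks');
    INSERT INTO Faculty_member VALUES ('Cy Bianchi', 'cy@example.org', 'Database');

    -- View mapping: the global relation defined over the local ones
    CREATE VIEW Full_professor AS
        SELECT first_name || ' ' || last_name AS name, mail, area FROM Professor
        UNION
        SELECT name, mail, research_topic AS area FROM Faculty_member;
""")

# The user query is written against the global schema only.
rows = conn.execute(
    "SELECT mail FROM Full_professor WHERE area = ? ORDER BY mail", ("Database",)
).fetchall()
mails = [mail for (mail,) in rows]
print(mails)
```

In a real mediator the sources are autonomous systems, so the unfolding is done by the mediator itself and each branch of the view is shipped to a different wrapper.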
24 Mediators (2)
– The mediator builds a unified schema of several (heterogeneous) information sources and allows a user to formulate a query on it
– The user query is transformed into a set of sub-queries, one for each data source involved in the query
– The results are collected by the mediator, merged and shown to the user
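The three steps above can be sketched as a small mediator that sends one sub-query per source and merges the answers. Everything here is hypothetical: the class `Mediator`, the wrapper functions, the sample rows, and the assumption that Source 1 exposes a `mail` attribute. Merging is reduced to duplicate elimination on `mail`.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Wrapper = Callable[[str], List[Row]]  # area -> rows in global-schema terms

# Invented sample data standing in for the two sources.
SOURCE1 = [{"first_name": "Ada", "last_name": "Rossi",
            "mail": "ada@example.org", "area": "Database"}]
SOURCE2 = [{"name": "Ada Rossi", "mail": "ada@example.org", "research_topic": "Database"},
           {"name": "Cy Bianchi", "mail": "cy@example.org", "research_topic": "Database"}]

def source1_wrapper(area: str) -> List[Row]:
    """Sub-query on Source 1, answered in global-schema terms."""
    return [{"name": f"{r['first_name']} {r['last_name']}", "mail": r["mail"], "area": r["area"]}
            for r in SOURCE1 if r["area"] == area]

def source2_wrapper(area: str) -> List[Row]:
    """Sub-query on Source 2, answered in global-schema terms."""
    return [{"name": r["name"], "mail": r["mail"], "area": r["research_topic"]}
            for r in SOURCE2 if r["research_topic"] == area]

class Mediator:
    def __init__(self, wrappers: List[Wrapper]) -> None:
        self.wrappers = wrappers

    def mails_by_area(self, area: str) -> List[str]:
        """Send one sub-query per source, then merge without duplicates."""
        merged: List[str] = []
        seen = set()
        for wrapper in self.wrappers:
            for row in wrapper(area):
                if row["mail"] not in seen:
                    seen.add(row["mail"])
                    merged.append(row["mail"])
        return merged

mediator = Mediator([source1_wrapper, source2_wrapper])
result = mediator.mails_by_area("Database")
```

The same professor appears in both sample sources, so the merge step returns her mail only once.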
25 Functional architecture of a Data Integration system
[Figure: client, mediator, wrappers, DBMS, DB; labelled MultiDBMS]
Mediator
– Provides users with a single virtual representation of the sources, given by the global schema
– Translates queries into fragments, which are sent to the wrappers
– Recomposes the results returned by the wrappers
– Performs data fusion and resolves heterogeneities at the value level
Instance level heterogeneities
Mediators: object fusion and reconciliation
A mediator’s main functionality is object fusion:
– group together information about the same real-world entity
– remove redundancy among the various data sources
– resolve inconsistencies among the various data sources
– achieve accuracy, completeness, currency (and other DQ dimensions…) among data from different data sources
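A minimal sketch of object fusion follows. It assumes that records about the same real-world entity share a key value, and it uses one possible conflict-resolution policy among many: the most recent non-null value wins. The function name `fuse`, the `updated` timestamp attribute and the policy itself are assumptions for illustration.

```python
from collections import defaultdict

def fuse(records, key, timestamp="updated"):
    """Group records by key and merge each group into one record.

    Policy (an assumption): newer non-null values override older ones,
    and null values never overwrite known ones.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)      # same key -> same real-world entity

    fused = []
    for group in groups.values():
        group.sort(key=lambda r: r[timestamp])   # oldest first
        merged = {}
        for rec in group:
            for attr, value in rec.items():
                if value is not None:
                    merged[attr] = value         # later records win
        fused.append(merged)
    return fused
```

Grouping removes redundancy, the override rule resolves inconsistencies in favour of currency, and skipping nulls improves completeness. Finding which records refer to the same entity when no shared key exists is a separate, harder problem that this sketch does not address.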
28 Functional architecture of a Data Integration system
[Figure: client, mediator, wrappers, DBMS, DB; labelled DI System]
Wrapper
– Translates the request coming from the mediator into the logical/physical representation of the underlying local schema
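For a relational source, the wrapper's translation step might look like the sketch below. The class `SqlWrapper` and its `translate` method are hypothetical; the request format (a list of global attributes to select plus equality conditions) is an assumption, and the attribute map is taken to be trusted configuration, since identifiers are interpolated into the SQL text while values are passed as parameters.

```python
class SqlWrapper:
    """Translates a request in global-schema terms into SQL on a local table."""

    def __init__(self, table, attr_map):
        self.table = table          # local table name
        self.attr_map = attr_map    # global attribute -> local column

    def translate(self, select, where):
        """Return (sql, params) for the local source.

        select: list of global attributes to return
        where:  dict of global attribute -> required value
        """
        cols = ", ".join(f"{self.attr_map[a]} AS {a}" for a in select)
        conds = " AND ".join(f"{self.attr_map[a]} = ?" for a in where)
        sql = f"SELECT {cols} FROM {self.table}"
        if conds:
            sql += f" WHERE {conds}"
        return sql, tuple(where.values())

wrapper = SqlWrapper(
    "Faculty_member",
    {"name": "name", "mail": "mail", "area": "research_topic"},
)
sql, params = wrapper.translate(["mail"], {"area": "Database"})
```

Aliasing each local column back to its global name means the mediator receives rows already expressed in global-schema terms.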
29 Mediators (3)
We may divide the interactions with a mediator into two phases:
1. The creation of the unified representation (publishing phase, at design time)
2. The formulation and the execution of a query on the unified representation (querying phase)
30 Functional architecture of an MDBS in our example
[Figure: client, mediator, wrappers, DBMS, DB; labelled MultiDBMS. Global schema: Student, Course, Professor]
31 Functional architecture of a mediator system: example
[Figure: client, mediator, wrappers, DBMS, DB; labelled MultiDBMS. Local schema: Student, Course, Professor, Module]
32 Virtual integration architecture including optimization functionality
[Figure: user queries are posed to the mediator. Mediator components: mediated schema, data source catalog, reformulator, optimizer, execution engine. Each data source is accessed through a wrapper]
Sources can be relational, hierarchical (IMS), structured files, web sites.
33 DI systems: design-time vs. run-time issues
Publishing phase (design time)
– The global schema and the mappings must be defined from the source schemas
Run time
– Queries are executed
– The global schema, the local schemas and the mappings are maintained
34 Mediators – relevant challenges
[Figure: challenges arranged by layer (user interface, mediator, data sources) and by phase]
Publishing phase
– Visualizing the unified schema
– Model and language for representing the unified schema
– Matching and mapping the unified schema and the local sources
– Building the unified schema
– Managing updates
– Schema extraction
Querying phase
– Model and language for formulating queries
– Model and language for querying the schema
– Query unfolding / rewriting
– Data fusion and cleaning
– Query transformation and execution
36 Design time vs. run time
[Figure: labels: wrapper, mediated schema, semantic mappings, query reformulation, optimization & execution, split between design time and run time]
37 Basic properties of a DI system
A system providing:
– Uniform (same query interface to all sources)
– Access to (queries; possibly updates too)
– Multiple (we want many, but 2 is hard too)
– Autonomous (DBA doesn’t report to you)
– Heterogeneous (data models are different)
– Structured (or at least semi-structured)
– Data sources (not only databases)