Automatic Integration of Relational Database Systems. Ramon Lawrence, University of Manitoba.



Page 2 Outline
- Introduction, Motivation, and Background
- Our integration approach
- The integration architecture
  - Standard dictionary, X-Specs, query processor
- Example integration
  - Northwind, Southstorm databases
- Querying the integrated databases
  - Generating SQL queries from semantic queries
- Unity implementation
- Contributions, Conclusions, and Future Work

Page 3 Database Terminology
- Database system: a database and a system to manage the data.
- Schema: a description of the data organization and format in a database.
- Schema integration: the process of combining local schemas into a global, integrated view by resolving conflicts present between the schemas.
- Data integration: the process of combining data at the entity level. It requires resolving representational conflicts and determining equivalent keys.
- Multidatabase system (MDBS): a collection of autonomous, local databases participating in a global database system to share data.

Page 4 What is Integration?
- Two levels of integration:
  - Schema integration: the description of the data
  - Data integration: the individual data instances
- Integration problems include:
  - Different data models and conflicts within a model
  - Incompatible concept representations
  - Different user or view perspectives
  - Naming conflicts (homonym, synonym)
- Integration handles the different mechanisms for storing data (structural conflicts), for referencing data (naming conflicts), and for attributing meaning to the data (semantic conflicts).

Page 5 Why is Integration Required?
- There are many integration environments:
  - Operational systems within an organization
  - System integration during company merger
  - Data warehouses, Intranets, and the WWW
- Users require information from many data sources, which often do not work together.
- Companies require a global view of their entire operations, which may be spread across numerous operational databases for different departments and distributed geographically.
- E-commerce demands integration of web databases with production systems.

Page 6 What is the Current Solution?
- Manual Integration Algorithms:
  - Allow designer to detect and resolve conflicts
  - Manipulate information using semantic models
- Knowledge bases/Artificial Intelligence:
  - Cyc knowledge base and Carnot project
- Global Dictionaries and Lexical Semantics:
  - WordNet, Clio, Summary schemas model
  - Concept hierarchies (Castano)

Page 7 What is the Current Solution? (2)
- SQL and multidatabase query languages:
  - SQL, MSQL, IDL, DIRECT, SchemaSQL
  - Require the user to understand DB structure & semantics
- Wrapper and mediator systems:
  - Information Manifold, TSIMMIS, Infomaster
  - Use query languages or description logics
  - Focus on query rewriting and reformulation
- Industrial standards:
  - XML, BizTalk, E-commerce portals
  - Apply to limited domains/industries
  - Require standard structures and database changes

Page 8 Previous Work Summary
- Current techniques for database integration have some of these problems:
  - Require integrator to understand all databases
  - Integration process is manual
  - Do not hide system complexity from the user
  - Force changes on the existing database systems
  - Construct global view manually
  - Suffer from query imprecision (query containment)

Page 9 Our Approach
- Our approach combines standardization and query mapping algorithms.
- The major idea is that schema conflicts can be resolved if we:
  - Eliminate all naming conflicts
  - Define a language capable of determining schema equivalence and performing transformations
- Naming conflicts are eliminated by accepting a standard term dictionary.
  - Not a knowledge base or set of mediated views
  - Leverages semantic information in English words

Integration Architecture
Architecture Components:
1) Integrated Context View
   - user's view of integration
2) X-Spec Editor
   - stores schema & metadata
   - uses XML
3) Standard Dictionary
   - terms to express semantics
4) Integration Algorithm
   - combines X-Specs into integrated context view
5) Query Processor
   - accepts query on view
   - determines data source mappings and joins
   - executes queries and formats results
[Figure: architecture diagram. Labels: Client, Multidatabase Layer, X-Spec Editor, Standard Dictionary, Integration Algorithm, Integrated Context View, Query Processor and ODBC Manager, Subtransactions, Local Transactions, Database, X-Spec.]

Page 11 Architecture Components
- The architecture consists of four components:
  - A standard dictionary (SD) to capture data semantics
    - SD terms are used to build semantic names describing semantics of schema elements.
  - X-Specs for storing data semantics
    - Database metadata and semantic names stored using XML
  - Integration Algorithm
    - Matches concepts in different databases by semantic names.
    - Produces an integrated view of all database concepts.
  - Query Processor
    - Allows the user to formulate queries on the view.
    - Translates from semantic names in integrated view to SQL queries and integrates and formats results.
      - Involves determining correct field and table mappings and discovery of join conditions and join paths

Page 12 Integration Processes
- The integration architecture consists of three separate processes:
  - Capture process: independently extracts database schema information and metadata into an XML document called an X-Spec.
  - Integration process: combines X-Specs into a structurally-neutral hierarchy of database concepts called an integrated context view.
  - Query process: allows the user to formulate queries on the integrated view. The query processor maps these to structural queries (SQL), then integrates and formats the results.

Page 13 Integration Architecture: The Capture Process
[Figure: capture process diagram. Labels: Relational Schema, Standard Dictionary, X-Spec, Specification Editor, Automatic Extraction, DBA, Lookup of terms.]
- Capture process involves:
  - Automatically extracting the schema information and metadata using a specification editor
  - Assigning semantic names to each schema element (tables and fields) to capture their semantics

Page 14 Architecture Components: The Standard Dictionary
- A standard dictionary (SD) provides standardized terms to capture data semantics.
  - Hierarchy of terms related by IS-A or Has-A links
  - Contains base set of common database concepts, but new concepts can be added
- An SD term is a single, unambiguous semantic definition.
  - Several SD entries for a single English word are required if the word has multiple definitions.
- The top-level dictionary terms are those proposed by Sowa.
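A minimal sketch of how such a term hierarchy could be held in memory, assuming each term stores a single parent and the type of link to it. The class name `StandardDictionary`, its methods, and the sample terms are hypothetical and are not taken from Unity.

```python
class StandardDictionary:
    """Toy term tree: every term has at most one parent and one link type."""

    LINKS = ("IS-A", "HAS-A")

    def __init__(self):
        self._parent = {}  # term -> (parent term or None, link type)

    def add(self, term, parent=None, link="IS-A"):
        """Add a term under an existing parent (or as a top-level term)."""
        if term in self._parent:
            raise ValueError(f"duplicate term: {term}")
        if parent is not None and parent not in self._parent:
            raise KeyError(f"unknown parent: {parent}")
        if link not in self.LINKS:
            raise ValueError(f"unknown link type: {link}")
        self._parent[term] = (parent, link)

    def is_a(self, term, ancestor):
        """True if `ancestor` is reachable from `term` over IS-A links only."""
        while term is not None:
            if term == ancestor:
                return True
            parent, link = self._parent[term]
            if link != "IS-A":
                return False
            term = parent
        return False


# Hypothetical sample terms, for illustration only.
sd = StandardDictionary()
sd.add("Name")
sd.add("Company Name", parent="Name")            # Company Name IS-A Name
sd.add("Address")
sd.add("City", parent="Address", link="HAS-A")   # Address HAS-A City
```

Keeping one parent per term makes the structure a tree, so checking whether two terms are hierarchically related is a walk toward the root.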

Page 15 Architecture Components: Dictionary vs. Knowledge Base
- The standard dictionary differs from a knowledge base such as Cyc because:
  - Not intended to be a general English dictionary or contain knowledge facts about the world
    - Dictionary is evolved as new terms are required
    - Not all English words are used
  - Dictionary provides the system with no "knowledge"
    - Since no facts are stored, system cannot deduce new facts
    - Dictionary terms are just semantic placeholders; integrators, not the system, determine the semantics of the database
  - Simplified organization
    - Dictionary is organized as a tree for efficiency and simplicity in determining related concepts
  - Re-use of terms
    - Terms are re-used in semantic names

Page 16 Architecture Components: Using the Standard Dictionary
- SD terms are used to build semantic names describing semantics of schema elements.
- Semantic names have the form:
  - semantic name := [CT_Type] | [CT_Type] CN
  - CT_Type := CT | CT {; CT} | CT {,CT}
  - CT := context term, CN := concept name
  - each CT and CN is a single term from the SD
- Semantic names are included in specifications describing a database.
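The grammar above can be illustrated with a small parser. This is a sketch, not Unity's parser: the function name `parse_semantic_name` is hypothetical, and since the slides do not say how `;` and `,` differ in meaning, the sketch assumes `;` separates context levels and `,` separates terms within one level.

```python
import re

# "[" context terms "]" followed by an optional concept name.
_SEMANTIC_NAME = re.compile(r"^\[([^\[\]]+)\]\s*(.*)$")


def parse_semantic_name(text):
    """Split a semantic name into (context levels, concept name or None).

    Assumption: ';' separates context levels, ',' separates terms inside
    a level. The slides give the syntax but not this interpretation.
    """
    match = _SEMANTIC_NAME.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    context_part, concept = match.groups()
    levels = [[term.strip() for term in level.split(",")]
              for level in context_part.split(";")]
    if any(not term for level in levels for term in level):
        raise ValueError(f"empty context term in: {text!r}")
    return levels, (concept.strip() or None)
```

For the Southstorm field shown later in the deck, `parse_semantic_name("[Order;Customer;Address] Address Line 1")` yields three single-term context levels and the concept name `Address Line 1`.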

Page 17 Northwind & Southstorm Integration Example

Page 18 Northwind & Southstorm Integration Example (2)

Page 19 Integration Example (3)

Page 20 Northwind & Southstorm Integration Example (4)

Page 21 What is a semantic name?
- A semantic name is a universal, semantic identifier in a domain.
  - Similar to a field name in the Universal Relation.
  - Semantics are guaranteed unique by construction.
  - System has mechanism for comparing semantics across domains even though it does not understand them. (Exploiting semantics in English words.)
- Important definitions:
  - context: a semantic name is a context if it maps to a table
  - concept: a semantic name is a concept if it maps to a field
  - context closure: the context closure of a semantic name $S_i$, denoted $S_i^*$, is the set of semantic names produced by taking ordered subsets of the terms of $S_i = \{T_1, T_2, \ldots, T_N\}$ starting with $T_1$.
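One reading of the context closure definition is that it consists of every prefix of the term sequence. The sketch below computes the closure under that assumption; the function name `context_closure` and the choice to treat the concept name as the final term are illustrative, not confirmed by the slides.

```python
def context_closure(context_terms, concept=None):
    """Return the closure of a semantic name as a list of semantic names.

    Assumption: "ordered subsets starting with T_1" means the prefixes
    T_1, T_1;T_2, ..., with the concept name (if any) as the last term.
    """
    if not context_terms:
        raise ValueError("a semantic name needs at least one context term")
    names = []
    for length in range(1, len(context_terms) + 1):
        names.append("[" + ";".join(context_terms[:length]) + "]")
    if concept is not None:
        # The full name itself is the last member of the closure.
        names.append(names[-1] + " " + concept)
    return names
```

Under this reading, the closure of `[Order;Customer;Address] Address Line 1` has four members: `[Order]`, `[Order;Customer]`, `[Order;Customer;Address]`, and the full name.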

Page 22 Architecture Components: X-Specs
- Database metadata and semantic names are combined into specifications called X-Specs:
  - Stored and transmitted using XML
  - Contains information on a relational schema
  - Organized into database, table, and field levels
  - Stores semantic names to describe and integrate schema elements

Page 23 Southstorm X-Spec (excerpt)
    <Schema name="Southstorm_xspec.xml"
            xmlns="urn:schemas-microsoft-com:xml-data"
            xmlns:dt="urn:schemas-microsoft-com:datatypes">
      <element type="[Order;Customer;Address] Address Line 1"
               sys_name="Cust_address" sys_type="Field"/>
    </Schema>
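As a rough illustration of how the semantic-name mappings could be read back out of an X-Spec, the sketch below parses the excerpt with Python's standard `xml.etree.ElementTree`. The closing `</Schema>` tag is added so the fragment is well-formed, and the function name `read_mappings` is hypothetical. A real X-Spec holds more than this one element.

```python
import xml.etree.ElementTree as ET

# The excerpt from the slide, closed so that it parses on its own.
XSPEC = """<Schema name="Southstorm_xspec.xml"
        xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <element type="[Order;Customer;Address] Address Line 1"
           sys_name="Cust_address" sys_type="Field"/>
</Schema>"""

# The default namespace applies to element names, not to plain attributes.
NS = {"xd": "urn:schemas-microsoft-com:xml-data"}


def read_mappings(xml_text):
    """Return (semantic name, system name, system type) for each element."""
    root = ET.fromstring(xml_text)
    return [(el.get("type"), el.get("sys_name"), el.get("sys_type"))
            for el in root.iterfind(".//xd:element", NS)]
```

Each tuple ties a semantic name to the physical schema element it describes, which is the information the integration step needs.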

Page 24 Architecture Components: Integrating X-Specs
- Each database to be integrated is described using an X-Spec.
- Identical concepts in different databases are identified by similar semantic names.
- Concepts with identical (or hierarchically related) semantic names are combined regardless of their physical representation in the individual databases.

Page 25 Integration Architecture: The Integration Process
- Integration process involves:
  - Automatically identifying identical concepts by matching semantic names
  - Constructing a global view of database concepts consisting of a hierarchy of concept terms
  - Resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)

Page 26 Integration Product: The Integrated Context View
- The product of the integration is a structurally-neutral hierarchy of concepts called an integrated context view.
- Define a context view (CV) as follows:
  - If a semantic name $S_i$ is in CV, then for any $S_j$ in $S_i^*$, $S_j$ is also in CV.
  - For each semantic name $S_i$ in CV, there exists a set of zero or more mappings $M_i$ that associate a schema element $E_j$ with $S_i$.
  - A semantic name $S_i$ can only occur once in the CV.
- A context view (CV) is a valid Universal Relation.
  - Each field is assigned a semantic name which uniquely identifies its semantic connotation.
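The three conditions can be satisfied by construction with a dictionary keyed on semantic names. The sketch below assumes a semantic name is represented as a tuple of terms and that its closure is the set of its prefixes. The function name `build_context_view` is hypothetical, and the sample mappings are inferred from the query examples later in the deck rather than copied from the actual X-Specs.

```python
def build_context_view(mappings):
    """Merge (database, semantic name, schema element) triples into a view.

    A semantic name is a tuple of terms, e.g. ("Order", "Id").
    The result maps each semantic name to its list of mappings.
    """
    view = {}
    for database, name, element in mappings:
        # Closure condition: every prefix of the name is also in the view,
        # possibly with zero mappings.
        for length in range(1, len(name) + 1):
            view.setdefault(name[:length], [])
        # Dictionary keys are unique, so a name occurs only once.
        view[name].append((database, element))
    return view


# Illustrative mappings, assumed from the Northwind/Southstorm queries.
SAMPLE = [
    ("Southstorm", ("Order", "Id"), "Orders_tb.Order_num"),
    ("Northwind", ("Order", "Id"), "Orders.OrderID"),
    ("Southstorm", ("Customer", "Name"), "Orders_tb.Cust_name"),
    ("Northwind", ("Customer", "Name"), "Customers.CompanyName"),
]
VIEW = build_context_view(SAMPLE)
```

Matching concepts across databases then amounts to two X-Specs producing the same key.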

Page 27 Northwind & Southstorm Integration Example

Page 28 Architecture Components: The Query Processor
- The query processor:
  - Allows the user to formulate queries on the view.
  - Translates from semantic names in the context view to structural queries (SQL) on databases.
    - Involves determining correct field and table mappings and discovery of join conditions and join paths
  - Retrieves query results and formats them for display to the user.
- Client-side query processing:
  - Perform joins between databases using common keys.
  - Data value formatting and transformation

Page 29 The Query Processor: Determining field/table mappings
    For each database (D) in the context view
      For each semantic name (S) in query
        If S has only one semantic name mapping in D Then
          Add field mapping to query and its parent table
        Else If S has multiple mappings but all in one table Then
          Add each field mapping to query and the parent table
        Else (S has multiple mappings in more than one table)
          If any field mapping has a table already in query Then take that one
          Else If one field mapping has the best semantic name match Then take that one
          Else take first mapping found
        End If
      Next
    Next
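A sketch of the selection rules for a single database, under stated assumptions: each candidate mapping is a `(table, field, score)` tuple, where `score` is a hypothetical number standing in for "best semantic name match" (the slides do not define how that match is measured). Ties on score fall back to the first mapping found. The function name and the sample Northwind mappings are illustrative.

```python
def choose_mappings(query_names, db_mappings):
    """Pick (table, field) mappings in one database for the queried names.

    db_mappings: semantic name -> list of (table, field, score) candidates.
    Returns (tables in the query, chosen (table, field) pairs).
    """
    tables = []  # tables already in the query, in insertion order
    fields = []  # chosen (table, field) pairs
    for name in query_names:
        candidates = db_mappings.get(name, [])
        if not candidates:
            continue  # this database does not store the concept
        candidate_tables = {table for table, _, _ in candidates}
        if len(candidate_tables) == 1:
            # One mapping, or several mappings that share a table: take all.
            chosen = candidates
        else:
            already = [c for c in candidates if c[0] in tables]
            if already:
                chosen = [already[0]]  # prefer a table already in the query
            else:
                # max() returns the first candidate among equal scores.
                chosen = [max(candidates, key=lambda c: c[2])]
        for table, field, _ in chosen:
            if table not in tables:
                tables.append(table)
            fields.append((table, field))
    return tables, fields


# Assumed Northwind mappings; scores are placeholders.
NORTHWIND = {
    "[Order] Id": [("Orders", "OrderID", 1.0), ("OrderDetails", "OrderID", 1.0)],
    "[Order;Product] Id": [("OrderDetails", "ProductID", 1.0)],
}
RESULT = choose_mappings(["[Order;Product] Id", "[Order] Id"], NORTHWIND)
```

With these inputs the order id is taken from `OrderDetails` because that table is already in the query, which avoids an unnecessary join.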

Page 30 The Query Processor: Constructing Join Graphs
- Given a set of fields (F) and tables (T) to access, joins are applied to connect the tables.
- A join graph is an undirected graph where:
  - Each node $N_i$ is a table in the database.
  - There is a link from node $N_i$ to node $N_j$ if there is a join between the two tables.
- A join path is a sequence of joins connecting two nodes in the graph.
- A join tree is a set of joins connecting two or more nodes.
- A join matrix $M$ stores the shortest join paths between any two nodes (tables).

Page 31 The Query Processor: Join Graph for Northwind

Page 32 The Query Processor: Join Discovery Results
- Join discovery in a database with a connected, acyclic join graph and a join matrix $M$:
  - There exists only one join tree for any set of tables.
  - The joins required to connect a table set $T$ are found by taking any $T_i$ in $T$ and unioning the join paths in $M[N_i, N_1], M[N_i, N_2], \ldots, M[N_i, N_n]$, where $N_1, N_2, \ldots, N_n$ are the nodes corresponding to the set of tables $T$.
- For a cyclic join graph:
  - There may exist more than one join tree for a set of tables, and each tree may have different semantics.
  - The system can let the user select a unique join tree by graphically displaying join conditions as they browse the context view.
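The join matrix and the union-of-paths rule can be sketched with a breadth-first search per table. This assumes a connected join graph. The function names are hypothetical, and the sample edges are an assumed fragment of Northwind: only the Orders to Customers join appears in the slides, and the `Products` table and the other two joins are assumptions.

```python
from collections import deque


def join_matrix(edges):
    """Return M where M[a][b] is the list of joins on a shortest a-b path.

    edges: iterable of (table_a, table_b) pairs; a join is undirected, so
    it is stored as a frozenset of its two tables.
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    matrix = {}
    for start in graph:
        paths = {start: []}
        queue = deque([start])
        while queue:  # breadth-first search finds shortest paths
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in paths:
                    paths[neighbour] = paths[node] + [frozenset((node, neighbour))]
                    queue.append(neighbour)
        matrix[start] = paths
    return matrix


def joins_for(tables, matrix):
    """Union the join paths from the first table to every other table."""
    tables = list(tables)
    needed = set()
    for other in tables[1:]:
        needed.update(matrix[tables[0]][other])
    return needed


# Assumed fragment of the Northwind join graph.
EDGES = [("Orders", "Customers"), ("Orders", "OrderDetails"),
         ("OrderDetails", "Products")]
M = join_matrix(EDGES)
```

In an acyclic graph the result does not depend on which table is taken first, which matches the claim that only one join tree exists for any set of tables.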

Page 33 Advanced Query Processing
- Advanced query processor features include:
  - global keys and joins: a mechanism for specifying when a field stores a global key such as a social security number.
  - result normalization: a procedure for normalizing query results returned from each individual database (e.g. Southstorm).
  - data integration: transforming data representational conflicts at the global level.
    - For example, "M" and "F" may represent "Male" and "Female" in one database, and another may represent these concepts using "0" and "1".
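The data integration step can be pictured as a per-database lookup that rewrites local codes into one global representation. The sketch below is illustrative only: the database labels `db_a` and `db_b` are placeholders, and the assignment of "0" to "Male" and "1" to "Female" is an assumption, since the slide does not say which code means which.

```python
# Hypothetical per-database value maps for one concept.
VALUE_MAPS = {
    "db_a": {"M": "Male", "F": "Female"},
    "db_b": {"0": "Male", "1": "Female"},  # assumed code order
}


def to_global(database, local_value):
    """Translate a local code to its global value; unknown codes raise KeyError."""
    return VALUE_MAPS[database][local_value]
```

Failing loudly on an unmapped code is a deliberate choice here, because silently passing it through would mix representations in the integrated result.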

Page 34 Northwind & Southstorm Query Examples
- Example 1: Retrieve all order ids ([Order] Id) and customers ([Customer] Name):
  - SS: SELECT Order_num, Cust_name FROM Orders_tb
  - NW: SELECT OrderID, CompanyName FROM Orders, Customers WHERE Orders.CustomerID = Customers.CustomerID
- Example 2: Retrieve all ordered products ([Order;Product] Id) and their order ids.
  - SS: SELECT Order_num, Item1_id, Item2_id FROM Orders_tb
  - NW: SELECT OrderID, ProductID FROM OrderDetails
  - Note: In NW, selects from two different order id mappings. In SS, result normalization is required.
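The result normalization needed for Southstorm in Example 2 can be sketched as an unpivot: each row with repeated item columns becomes one row per item, matching the shape of the Northwind result. The function name, the tuple layout, and the treatment of `None` as an empty item slot are assumptions.

```python
def normalize_rows(rows, key_index, repeated_indexes):
    """Turn (key, item1, item2, ...) rows into one (key, item) row per item."""
    normalized = []
    for row in rows:
        for index in repeated_indexes:
            if row[index] is not None:  # skip empty item slots
                normalized.append((row[key_index], row[index]))
    return normalized


# Rows shaped like: SELECT Order_num, Item1_id, Item2_id FROM Orders_tb
# The values are made up for illustration.
SS_ROWS = [(1, 10, 20), (2, 30, None)]
NORMALIZED = normalize_rows(SS_ROWS, key_index=0, repeated_indexes=[1, 2])
```

After this step both databases return (order id, product id) pairs, so the results can be combined directly.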

Page 35 Integration Example: Discussion
- Important points:
  - System table and field names are not presented to the user, who queries based on semantic names.
  - Database structure is not shown to the user.
  - Field and table mappings are automatically determined based on X-Spec information.
  - Join conditions are inserted as needed when available to join tables.
  - Different physical representations for the same concept are combined.
  - Hierarchically related concepts are combined based on their IS-A relationship in the standard dictionary.

Page 36 Unity Overview
- Unity is a software package that implements the integration architecture with a GUI.
- Developed using Microsoft Visual C++ 6 and Microsoft Foundation Classes (MFC).
- Unity allows the user to:
  - Construct and modify standard dictionaries
  - Build X-Specs to describe data sources
  - Integrate X-Specs into an integrated view
  - Transparently query integrated systems using ODBC and automatically generate SQL transactions
- Unity is available for demonstration and distribution.

Page 41 Architecture Discussion
- The architecture automatically integrates relational schemas into a multidatabase.
- Desirable properties:
  - Individual mappings: information sources integrated one-at-a-time and independently
  - Integrated view constructed for query transparency: user queries system by semantics instead of structure
  - Handles schema conflicts: including semantic, structural, and naming conflicts
  - Automated integration: integrated view constructed efficiently and automatically
  - No wrapper or mediator software is required
  - Transparent querying: users issue semantic queries which are translated to SQL by the query processor

Page 42 Contributions
- Architecture contributions:
  - Has a unique application of a standard dictionary which is not a knowledge base
  - Separates the capture and integration processes
  - Allows transparent querying without structure
  - Provides algorithms for dynamically extracting database data (creating relevant views)
  - Provides algorithms for mediation of global level conflicts (global keys, normalization, etc.)
  - Arguably simpler method for capturing data semantics than using description logic
  - An implementation, Unity, which demonstrates the practical benefits of the architecture

Page 43 Conclusions
- Automatic database integration is possible by using a standard term dictionary and defining semantic names for schema elements.
- Integration of data sources has applications to the WWW and construction of data warehouses.
- Users are able to transparently query integrated systems by concept instead of structure.

Page 44 Future Work
- The integration architecture is evolving with standards on XML and captures metadata information in XML documents.
- We are constantly refining Unity.
  - Develop an integration component for a web browser
- The query processor is being extended to resolve more complex queries and conflicts.
- Test the system in large industrial projects.
- Allow distributed updates and global updates on all databases.

Page 45 References
- Publications:
  - Unity - A Database Integration Tool, R. Lawrence and K. Barker, TRLabs Emerging Technology Bulletin, January
  - Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pages , Oct
  - Integrating Relational Database Schemas using a Standardized Dictionary, To appear in SAC’ ACM Symposium on Applied Computing, March,
- Sponsors:
  - NSERC, TRLabs
- Further Information: