CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research Prototype to Industrial Tool Name: DH(Dong Hwi) kwak Date:

Presentation transcript:

CIS607, Fall 2005: Semantic Information Integration
Article Name: Clio Grows Up: From Research Prototype to Industrial Tool
Name: DH (Dong Hwi) Kwak
Date: 10/26/05, 4:00 pm

Introduction
- Mappings between different representations of data are important.
- They are needed to integrate and exchange data that resides at multiple sites, in different formats (or schemas), and even under different data models (such as relational or XML).

Data Exchange vs. Data Integration
- Data exchange (or data translation): the task of restructuring data from a source format (or schema) into a target format (or schema).
- Data integration (or federation): the ability to query a set of heterogeneous data sources via a virtual unified target schema.

Relationship or Mapping
- In both cases (data exchange and data integration), a relationship or mapping must first be established between the source schema and the target schema.
- Mappings are established at two complementary levels:
  - Syntactic level: schema matching
  - Operational level: relating instances over the source schema to instances over the target schema, in order to
    - move actual data from a source to a target
    - answer queries

Clio System
- Two important components:
  - Mapping generation component
    - Takes as input correspondences between the source and target schemas
    - Generates a schema mapping consisting of logical mappings that provide an interpretation of the given correspondences
  - Query generation component
    - Converts a set of logical mappings into an executable transformation script
    - Output languages: SQL, SQL/XML, XQuery, and XSLT
- The user can interact with the system through a GUI during design
  - The user can view, add, and remove correspondences

Clio System
- Two main components:
  - Mapping generation
  - Query generation

Mapping & Query Generation
- Figure 2 shows an actual Clio screenshot with portions of two gene expression schemas and the correspondences between them
  - Left: source schema (relational)
  - Right: target schema (XML)

Mapping & Query Generation
- Mapping generation
  - Principle: for any two related elements in the source schema that have correspondences into two related elements in the target schema, there will be a mapping between their related instances.
  - Source side: FACTOR_NAME and BIOLOGY_DESC are related because a foreign key links the EXPERIMENTFACTOR table to the EXPERIMENTSET table
  - Target side: biology_desc and factor_name are related because the latter is a child of the former in the XML schema
  - Therefore there will be a mapping that maps related instances of FACTOR_NAME and BIOLOGY_DESC into related instances of biology_desc and factor_name
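
To make "related instances" concrete, here is a minimal sketch of what such a mapping does to data. It is an illustration only, not Clio output: the key columns `ES_PK` and `ES_FK`, the sample rows, and the exact nesting of `exp_factor` under each set are assumptions.

```python
from collections import defaultdict

# Hypothetical sample rows; ES_PK / ES_FK are assumed key column names.
EXPERIMENTSET = [
    {"ES_PK": 1, "BIOLOGY_DESC": "heat shock response"},
    {"ES_PK": 2, "BIOLOGY_DESC": "cell cycle"},
]
EXPERIMENTFACTOR = [
    {"ES_FK": 1, "FACTOR_NAME": "temperature"},
    {"ES_FK": 1, "FACTOR_NAME": "duration"},
    {"ES_FK": 2, "FACTOR_NAME": "growth phase"},
]


def map_related_instances(sets, factors):
    """Nest each factor under the experiment set it references."""
    factors_by_set = defaultdict(list)
    for factor in factors:
        # The foreign key is what makes the two source elements "related".
        factors_by_set[factor["ES_FK"]].append(factor["FACTOR_NAME"])
    return [
        {
            "biology_desc": s["BIOLOGY_DESC"],
            # Parent/child nesting is what makes the target elements "related".
            "exp_factor": [{"factor_name": n} for n in factors_by_set[s["ES_PK"]]],
        }
        for s in sets
    ]


result = map_related_instances(EXPERIMENTSET, EXPERIMENTFACTOR)
```

Each factor name ends up nested under the description of the set it belongs to, so the foreign-key relationship in the source is preserved as a parent/child relationship in the target.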

Mapping & Query Generation
- Generation of tableaux
  - The first step of the algorithm generates all the basic ways in which elements relate to each other within one schema
  - Figure 3 shows several of the source and target tableaux generated for the example
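
A rough sketch of this first step, for the relational case only. It assumes a tableau can be approximated by a table together with every table reachable from it through foreign keys. The function name `generate_tableaux` and the `foreign_keys` dictionary are illustrative and not part of Clio, and nesting in XML schemas is ignored.

```python
def generate_tableaux(tables, foreign_keys):
    """Return, for each table, the tables reachable from it via foreign keys.

    tables: list of table names.
    foreign_keys: dict mapping a table to the tables it references.
    """
    tableaux = {}
    for table in tables:
        closure = [table]
        frontier = [table]
        while frontier:
            current = frontier.pop()
            for referenced in foreign_keys.get(current, []):
                if referenced not in closure:
                    closure.append(referenced)
                    frontier.append(referenced)
        tableaux[table] = closure
    return tableaux


tableaux = generate_tableaux(
    ["EXPERIMENTSET", "EXPERIMENTFACTOR"],
    {"EXPERIMENTFACTOR": ["EXPERIMENTSET"]},
)
```

Here `EXPERIMENTSET` yields a tableau containing only itself, while `EXPERIMENTFACTOR` yields one that also pulls in `EXPERIMENTSET` through the foreign key.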

Mapping & Query Generation
- Generation of logical mappings
  - The second step of the algorithm generates the logical mappings
  - The basic algorithm pairs every tableau in the source with every tableau in the target and finds the correspondences covered by each pair
  - m1 is obtained from the pair (S2, T2)
  - m3 is obtained from the pair (S4, T4)
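
The pairing step can be sketched as follows, under the simplifying assumption that a tableau is just a named set of schema elements and that a pair covers a correspondence when it contains both of its endpoints. The tableau names below are made up for the illustration and do not refer to the tableaux of Figure 3.

```python
from itertools import product


def generate_logical_mappings(source_tableaux, target_tableaux, correspondences):
    """Pair every source tableau with every target tableau.

    Keeps a pair only if it covers at least one correspondence.
    """
    mappings = []
    for (s_name, s_elems), (t_name, t_elems) in product(
        source_tableaux.items(), target_tableaux.items()
    ):
        covered = [
            (s, t) for (s, t) in correspondences if s in s_elems and t in t_elems
        ]
        if covered:
            mappings.append((s_name, t_name, covered))
    return mappings


# Hypothetical tableaux and correspondences for the running example.
source_tableaux = {
    "S_set": {"BIOLOGY_DESC"},
    "S_factor": {"BIOLOGY_DESC", "FACTOR_NAME"},
}
target_tableaux = {
    "T_set": {"biology_desc"},
    "T_factor": {"biology_desc", "factor_name"},
}
correspondences = [
    ("BIOLOGY_DESC", "biology_desc"),
    ("FACTOR_NAME", "factor_name"),
]

mappings = generate_logical_mappings(source_tableaux, target_tableaux, correspondences)
```

All four pairs cover at least one correspondence in this toy input, but only the pair of the two larger tableaux covers both, which is why some candidate mappings are redundant and worth pruning.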

Mapping & Query Generation
- Mapping language
  - Skolem functions: these functions explicitly represent target elements for which no source value is given
  - For example, the mapping m1 does not specify a value for an attribute under exp_factor
  - A Skolem function creates a unique value for this attribute
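
One simple way to picture a Skolem function is as a deterministic term built from its arguments: the same source values always yield the same generated value, and different source values yield different ones. The sketch below assumes that reading; the function name `SK_id` and the choice of arguments are hypothetical, not taken from Clio.

```python
def skolem(function_name, *args):
    """Build a deterministic placeholder value from the function name and its arguments."""
    return f"{function_name}({', '.join(map(str, args))})"


# Hypothetical use: generate an identifier for an exp_factor element
# from the source values that determine it.
id_a = skolem("SK_id", "heat shock response", "temperature")
id_b = skolem("SK_id", "heat shock response", "temperature")
id_c = skolem("SK_id", "heat shock response", "duration")
```

Because `id_a` and `id_b` are built from the same arguments they are equal, so repeated occurrences of the same source data refer to the same target element, while `id_c` is distinct.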

Mapping & Query Generation
- Query generation
  - Each logical mapping is compiled into a query graph
  - For each logical mapping, the query generator walks the relevant part of the target schema and creates the necessary join and grouping conditions
  - Figure 5 shows the XQuery fragment that produces the target

Mapping & Query Generation
- Lines 1-4 in Figure 5 implement the join of the two tables EXPERIMENTFACTOR and EXPERIMENTSET
- Lines 7-10 output the attributes within exp_set
- Lines produce an actual exp_factor element
- Line 30 creates a value for the id element

Practical Challenges (Mapping Generation)
- Redundancy check
  - If there is a one-to-one mapping of the variables of T1 into T2, then T1 is a sub-tableau of T2
  - This check can significantly reduce the number of irrelevant mappings
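
A brute-force sketch of the sub-tableau test, under a simplified representation that is an assumption of this example: a tableau is a dictionary with `vars` (variable name to the set it ranges over) and `conds` (equalities written as `(var, attr, var, attr)` tuples with a fixed orientation). Clio's actual data structures are not described in the slides.

```python
from itertools import permutations


def is_sub_tableau(t1, t2):
    """Check for a one-to-one mapping of t1's variables into t2's variables.

    The mapping must preserve the set each variable ranges over and must send
    every condition of t1 to a condition of t2.
    """
    vars1 = list(t1["vars"])
    vars2 = list(t2["vars"])
    # permutations() enumerates every injective assignment of vars1 into vars2.
    for image in permutations(vars2, len(vars1)):
        h = dict(zip(vars1, image))
        same_sets = all(t1["vars"][v] == t2["vars"][h[v]] for v in vars1)
        conds_ok = all(
            (h[a], x, h[b], y) in t2["conds"] for (a, x, b, y) in t1["conds"]
        )
        if same_sets and conds_ok:
            return True
    return False


# Hypothetical tableaux for the running example (key column names assumed).
t_small = {"vars": {"s": "EXPERIMENTSET"}, "conds": set()}
t_big = {
    "vars": {"f": "EXPERIMENTFACTOR", "s2": "EXPERIMENTSET"},
    "conds": {("f", "ES_FK", "s2", "ES_PK")},
}
```

With these inputs `is_sub_tableau(t_small, t_big)` holds and the reverse does not. Enumerating permutations is exponential in the number of variables, so this is only suitable for illustrating the definition.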

Practical Challenges (Mapping Generation)
- Hybrid algorithm
  - Mapping generation has two phases: precomputation and insertion of correspondences
  - Separating the phases speeds up the addition of correspondences in the GUI
  - The disadvantage of this separation appears when the schemas are large: a large amount of memory may be needed to hold all the data structures

Practical Challenges (Mapping Generation)
- Hybrid algorithm
  - The main idea behind the algorithm is to precompute only a bounded number of source tableaux and target tableaux
  - A correspondence between elements may then fail to be covered by the precomputed tableaux because the elements lie deeper in the schema trees
  - In that case, a source tableau is generated that includes all the set-type elements; the tableau is closed under the chase and thus includes all the other schema elements associated via foreign keys
  - The data structures holding the sub-tableaux and the sub-skeleton relationships are updated
  - The algorithm may lose its completeness (a small price)

Practical Challenges (Mapping Generation)
- Performance evaluation: mapping MAGE-ML
  - MAGE-ML is a complex XML schema
  - Two experiments were performed
    - First experiment (lower bound): limit the nesting level of the precomputed tableaux (maximum 6 nested levels of sets), with no limit on the total number of precomputed tableaux
    - Second experiment (actual improvement): limit the nesting level of the precomputed tableaux (maximum 6 nested levels of sets) and the total number of precomputed tableaux (maximum 110 per schema)

Practical Challenges (Mapping Generation)
- First experiment
  - Loading the MAGE-ML schema (source): less than 1 sec
  - Precomputing all (1030) tableaux: 2.6 sec
  - Computing the sub-tableaux relationship: 74 sec
  - Memory to hold the data structures: 335 MB
  - Loading the MAGE-ML schema: ran out of memory
- Second experiment
  - Precomputing the tableaux (116 now): 0.5 sec
  - Computing the sub-tableaux relationship: 0.7 sec
  - Memory to hold the data structures: 163 MB
  - Memory needed to hold everything: 251 MB
- Overall, the performance of the hybrid algorithm is quite acceptable.

Practical Challenges (Query Generation: Deep Union)
- The generated queries have two drawbacks
  - There is no duplicate removal within and among query fragments
  - There is no grouping of data among query fragments
- Example with (OrderID, ItemID)
  - Input data: {(o1, i1), (o1, i2)}
  - Output data: {(o1, (i1, i2)), (o1, (i1, i2))}
  - Second query: {(o1, (i3))}
- We would expect this second tuple to be merged with the previous result, producing only one tuple for o1: {(o1, (i1, i2, i3))}
- This special union operation is called deep union
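
The order example above can be reproduced with a small sketch of deep union over one level of nesting. This is an illustration of the idea only, assuming each query fragment returns a list of `(order_id, items)` pairs; it is not the operator as implemented in Clio.

```python
def deep_union(*fragments):
    """Merge fragment results by order id, removing duplicate items.

    Each fragment is a list of (order_id, items) pairs, where items is a tuple.
    """
    merged = {}
    for fragment in fragments:
        for order_id, items in fragment:
            bucket = merged.setdefault(order_id, [])
            for item in items:
                if item not in bucket:  # duplicate removal within and among fragments
                    bucket.append(item)
    # Grouping: exactly one output tuple per order id.
    return [(order_id, tuple(items)) for order_id, items in merged.items()]


first_query = [("o1", ("i1", "i2")), ("o1", ("i1", "i2"))]
second_query = [("o1", ("i3",))]

combined = deep_union(first_query, second_query)
```

For the inputs shown, `combined` holds a single tuple for `o1` carrying `i1`, `i2`, and `i3`, which is the merged result the slide describes.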

Remaining Challenges
- Complex mappings need a more expressive correspondence selection mechanism than the one supported by Clio
- Exploring the need for logical mappings that nest other logical mappings inside
- Mapping adaptation issues when the source and target schemas change