A Survey of Approaches to Automatic Schema Matching (VLDB Journal, 2001) November 7, 2008 IDB SNU Presented by Kangpyo Lee.

Slides:



Advertisements
Similar presentations
Three-Step Database Design
Advertisements

Chapter 10: Designing Databases
Entity-Relationship (ER) Modeling
Lukas Blunschi Claudio Jossen Donald Kossmann Magdalini Mori Kurt Stockinger.
1 A Survey of Approaches to Automatic Schema Matching Name: Samer Samarah Number: This.
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
The database approach to data management provides significant advantages over the traditional file-based approach Define general data management concepts.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 6 Advanced Data Modeling.
Database Systems: Design, Implementation, and Management Tenth Edition Chapter 5 Advanced Data Modeling.
Database Systems: Design, Implementation, and Management Tenth Edition
BIS Database Systems School of Management, Business Information Systems, Assumption University A.Thanop Somprasong Chapter # 6 Advanced Data Modeling.
Distributed DBMS© M. T. Özsu & P. Valduriez Ch.4/1 Outline Introduction Background Distributed Database Design Database Integration ➡ Schema Matching ➡
1 A Survey of Approaches to Automatic Schema Matching Erhard Rahm Philip A. Bernstein The VLDB Journal 10: (2001)
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 3 The Basic (Flat) Relational Model.
1 Lecture 13: Database Heterogeneity Debriefing Project Phase 2.
Generic Schema Matching with Cupid Jayant Madhavan Philip A. Bernstein Erhard Raham Proceedings of the 27 th VLDB Conference.
Page 1 Multidatabase Querying by Context Ramon Lawrence, Ken Barker Multidatabase Querying by Context.
Distributed Database Management Systems. Reading Textbook: Ch. 4 Textbook: Ch. 4 FarkasCSCE Spring
Automatic Data Ramon Lawrence University of Manitoba
1 Chapter 2 Reviewing Tables and Queries. 2 Chapter Objectives Identify the steps required to develop an Access application Specify the characteristics.
INTEGRATION INTEGRATION Ramon Lawrence University of Iowa
BIS310: Week 7 BIS310: Structured Analysis and Design Data Modeling and Database Design.
Databases & Data Warehouses Chapter 3 Database Processing.
Semantic Interoperability Jérôme Euzenat INRIA & LIG France Natasha Noy Stanford University USA.
Ontology Matching Basics Ontology Matching by Jerome Euzenat and Pavel Shvaiko Parts I and II 11/6/2012Ontology Matching Basics - PL, CS 6521.
Semantic Matching Pavel Shvaiko Stanford University, October 31, 2003 Paper with Fausto Giunchiglia Research group (alphabetically ordered): Fausto Giunchiglia,
A survey of approaches to automatic schema matching Erhard Rahm, Universität für Informatik, Leipzig Philip A. Bernstein, Microsoft Research VLDB 2001.
Lecture 2 The Relational Model. Objectives Terminology of relational model. How tables are used to represent data. Connection between mathematical relations.
Chapter 4 The Relational Model.
Chapter 3 The Relational Model Transparencies Last Updated: Pebruari 2011 By M. Arief
Organizing Information Digitally Norm Friesen. Overview General properties of digital information Relational: tabular & linked Object-Oriented: inheritance.
Ontology Alignment/Matching Prafulla Palwe. Agenda ► Introduction  Being serious about the semantic web  Living with heterogeneity  Heterogeneity problem.
A Z Approach in Validating ORA-SS Data Models Scott Uk-Jin Lee Jing Sun Gillian Dobbie Yuan Fang Li.
Concepts and Terminology Introduction to Database.
A SURVEY OF APPROACHES TO AUTOMATIC SCHEMA MATCHING Sushant Vemparala Gaurang Telang.
Semantic Matching Fausto Giunchiglia work in collaboration with Pavel Shvaiko The Italian-Israeli Forum on Computer Science, Haifa, June 17-18, 2003.
Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Univ of Georgia) Topic Presentation.
10/18/20151 Business Process Management and Semantic Technologies B. Ramamurthy.
Minor Thesis A scalable schema matching framework for relational databases Student: Ahmed Saimon Adam ID: Award: MSc (Computer & Information.
1 The Relational Database Model. 2 Learning Objectives Terminology of relational model. How tables are used to represent data. Connection between mathematical.
Chapter 8 Data Modeling Advanced Concepts Database Principles: Fundamentals of Design, Implementation, and Management Tenth Edition.
9/7/2012ISC329 Isabelle Bichindaritz1 The Relational Database Model.
1 Relational Databases and SQL. Learning Objectives Understand techniques to model complex accounting phenomena in an E-R diagram Develop E-R diagrams.
CSE 636 Data Integration Schema Matching Cupid Fall 2006.
XML Schema Integration Ray Dos Santos July 19, 2009.
DataBase Management System What is DBMS Purpose of DBMS Data Abstraction Data Definition Language Data Manipulation Language Data Models Data Keys Relationships.
A Classification of Schema-based Matching Approaches Pavel Shvaiko Meaning Coordination and Negotiation Workshop, ISWC 8 th November 2004, Hiroshima, Japan.
1 A Demo of Logical Database Design. 2 Aim of the demo To develop an understanding of the logical view of data and the importance of the relational model.
Management Information Systems, 4 th Edition 1 Chapter 8 Data and Knowledge Management.
Database Management Supplement 1. 2 I. The Hierarchy of Data Database File (Entity, Table) Record (info for a specific entity, Row) Field (Attribute,
CSE314 Database Systems Lecture 3 The Relational Data Model and Relational Database Constraints Doç. Dr. Mehmet Göktürk src: Elmasri & Navanthe 6E Pearson.
Issues in Ontology-based Information integration By Zhan Cui, Dean Jones and Paul O’Brien.
©2003 Prentice Hall Business Publishing, Accounting Information Systems, 9/e, Romney/Steinbart 4-1 Relational Databases.
Chapter 3: Modeling Data in the Organization. Business Rules Statements that define or constrain some aspect of the business Assert business structure.
©2003 Prentice Hall Business Publishing, Accounting Information Systems, 9/e, Romney/Steinbart 4-1 Relational Databases.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
1 10 Systems Analysis and Design in a Changing World, 2 nd Edition, Satzinger, Jackson, & Burd Chapter 10 Designing Databases.
LECTURE TWO Introduction to Databases: Data models Relational database concepts Introduction to DDL & DML.
Data Models. 2 The Importance of Data Models Data models –Relatively simple representations, usually graphical, of complex real-world data structures.
Of 24 lecture 11: ontology – mediation, merging & aligning.
IT 5433 LM3 Relational Data Model. Learning Objectives: List the 5 properties of relations List the properties of a candidate key, primary key and foreign.
COP Introduction to Database Structures
Logical Database Design and the Rational Model
Lecture 2 The Relational Model
MANAGING DATA RESOURCES
Accounting Information Systems 9th Edition
Relational Database Design
Business Process Management and Semantic Technologies
Presentation transcript:

A Survey of Approaches to Automatic Schema Matching (VLDB Journal, 2001) November 7, 2008 IDB SNU Presented by Kangpyo Lee

2 Contents  Introduction  Application Domains  The Match Operator  Architecture for Generic Match  Classification of Schema Matching Approaches  Schema-level Matchers  Instance-level Approaches  Combining Different Matchers  Sample Approaches from the Literature  Conclusion

3 Introduction  Operation Match  Takes two schemas as input  Produces a mapping between elements of the two schemas that correspond semantically to each other  Manual Matching  Tedious, time-consuming, error-prone, and expensive  The level of effort is at least linear in the # of matches to be performed  Automated Matching  Faster and less labor-intensive

4 Application Domains [1/2]  1. Schema integration  Investigated since the early 1980s  Given a set of independently developed schemas, construct a global view  A variation is to integrate an independently developed schema with a given conceptual schema ⇒ OASIS  2. Data warehouse  Became popular in the 1990s  A decision support database extracted from a set of data sources  The match operation is useful for designing transformations from the source format to the warehouse format

5 Application Domains [2/2]  3. E-commerce  Message translation  Translating between different message format  Translating between different message schemas  Semantic Web  Mapping messages between autonomous agents or matching declarative mediator definitions ⇒ OASIS  4. Semantic query processing  Run-time analysis of schemas  A user specifies the output of a query (e.g., SELECT), and the system figures out how to produce that output (e.g., by determining the FROM & WHERE)  The system must give a qualification (e.g., WHERE) that gives the semantics of the mappings

6 The Match Operator [1/5]  Schema  A set of elements connected by some structure  Mapping  A set of mapping elements  Certain elements of schema S1 are mapped to certain elements in S2  Mapping expression  Specifies how the S1 and S2 elements are related  Directional or non-directional  Use simple relations over scalars (e.g., =, <), functions (e.g., addition, concatenation), ER-style relationships (e.g., is-a, part-of), set-oriented relationships (e.g., overlaps, contains), etc.  Match result  The match operation outputs a mapping between two schemas

7 The Match Operator [2/5]  The criteria used to match the elements are based on heuristics  Not easily captured in a precise mathematical way  The practical, though mathematically unsatisfying, goal of producing a mapping  Similarity relation,  Over the power sets of S1 and S2  A mapping that does not include mapping expressions

8 The Match Operator [3/5]  Examples for a mapping between S1 and S2  Cust.C# = Customer.CustID  Cust.CName = Customer.Company  Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact  Cust.C# = Customer.CustID  Cust.CName = Customer.Company  {Cust.FirstName, Cust.LastName} = Customer.Contact

9 The Match Operator [4/5]  Match vs. Join  Both are binary operations that determine pairs of corresponding elements from their input operand  Match operates on metadata (schema elements) and Join on data (rows of table)  Match is more complex than Join  Join  Each element in the Join result combines only one element of the first with one matching element of the second input  Join semantics is specified by a single comparison expression (e.g., = for natural join) that must hold for all matching input elements  Match  An element in a match result can relate multiple elements from both inputs  Each element in a match result may have a different mapping expression  ⇒ The semantics of Match is less restricted than that of Join and is more difficult to capture in a consistent way

10 The Match Operator [5/5]  OuterMatch (vs. OuterJoin)  A right (or left) OuterMatch ensures that every element pf S2 (or S1) is referenced by the mapping  A full OuterMatch ensures every element of both S1 and S1 are referenced by the mapping  By ensuring that every element of a schema S is referenced in the mapping returned by Match, the mapping can be more easily composed with other mappings that refer to S

11 Architecture for Generic Match  A high-level architecture for a generic, customizable implementation of Match  A semantic-preserving importer  Translates input schemas from their native representation into the internal representation  An exporter  Translates mappings produced by the generic implementation of Match from the internal representation into the representation required by each tool  ※ The implementation of Match only determines match candidates  The user can accept, reject, or change it

12 Classification of Schema Matching Approaches [1/2]  Match Algorithms (or Matchers)  The realization of individual matchers  Instance vs. schema  Element vs. structure matching  Language vs. constraint  Matching cardinality  Auxiliary information  The combination of individual matchers  Hybrid matcher or composite matcher

13 Classification of Schema Matching Approaches [2/2]

14 Schema-level Matchers [1/14]  Only consider schema information, not instance data  The available information includes the usual properties of schema elements  Name, description, data type, relationship type, constraints, and schema structure  For each match candidate, it is customary to estimate the degree of similarity by a normalized numeric value in the range 0-1

15 Schema-level Matchers [2/14] 1. Match Granularity  Element-level matching  Determines the matching elements in the second input schema for each element of the first schema  In the simplest case, only elements at the finest level of granularity are considered ⇒ atomic level  in Tab. 2, Address.ZIP = CustomerAddress.PostalCode  Structure-level matching  Refers to matching combinations of elements that appear together in a structure  Full structural match vs. partial structural match  The need for partial matches arises because subschemas of different domains are being compared  In Tab. 2, AccountOwner may come from a finance database while Customer comes from a sales database

16  Considering known equivalence patterns in a library  E.g., relating two structures in an is-a hierarchy to a single structure  Element-level matching may also be applied to coarser grained, higher (non-atomic) level elements  E.g., file records, entities, classes, relational tables, & XML elements  Ignoring the substructure and components  In Tab. 2, Address = CustomerAddress  Element-level matching can be implemented by algorithms similar to relational join processing such as nested-loop join  Comparing each S1 element with each S2 element Schema-level Matchers [3/14] 1. Match Granularity (cont’d)

17 Schema-level Matchers [4/14] 2. Match Cardinality  An S1 (or S2) element can participate in zero, one, or many mapping elements of the match result  1:1, 1:n, n:1, and n:m

18 Schema-level Matchers [5/14] 2. Match Cardinality (cont’d)  Matching element with respect to  Different mapping elements: global cardinality  Individual mapping elements: local cardinality  There may be different match cardinalities at the instance level instead of the schema level  In row 4 of Tab. 3  Most existing approaches map each element of one schema to the element of the other schema with highest similarity  ⇒ Local 1:1 matches & global 1:1 or 1:n mappings  More work is needed on local and global n:1 and n:m mappings

19 Schema-level Matchers [6/14] 3. Linguistic Approaches  Use names & texts to find semantically similar schema elements  Name matching  Equality of names  Equality of canonical name representations after stemming & other preprocessing  E.g. CName → customer name, EmpNo → employee number  Equality of synonyms  E.g. Car → automobile, make → brand  Equality of hypernyms*  E.g. book is-a publication & article is-a publication imply book → publication, article → publication, & book → article  Similarity of names based on common substrings, edit distance, pronunciation, soundex  E.g. representedBy → representative, ShipTo → Ship2  User-provided name matches  E.g. reportsTo → manager, issue → bug

20 Schema-level Matchers [7/14] 3. Linguistic Approaches (cont’d)  Name matching (cont’d)  Exploiting synonyms & hypernyms  Requires the use of the thesauri & dictionaries  General natural language dictionaries  Domain- or enterprise-specific dictionaries & is-a taxonomies  Requires a substantial effort to be built up in a consistent way  Homonyms  A name matcher can reduce the # of wrong match candidates by exploiting mismatch information supplied by users or dictionaries  Context information  Name-based matching is possible for elements at different levels of granularity or cardinality

21 Schema-level Matchers [8/14] 3. Linguistic Approaches (cont’d)  Name matching (cont’d)  Dictionary D  Generation of a list of all match candidates  This assumes that D contains all relevant pairs of the transitive closure over similar names

22 Schema-level Matchers [9/14] 4. Description Matching  Comments  Schemas often contain comments in natural language to express the intended semantics of schema elements  Linguistic analysis of comments to determine the similarity  Extracting keywords from the description  Used for synonym comparison  Using natural language understanding technologies  To look for semantically equivalent expressions

23 Schema-level Matchers [10/14] 5. Constraint-based Approaches  Constraints  If two input schemas contain constraint information, it can be used by a matcher to determine the similarity  Equivalence of data types & domains, of key characteristics (e.g., unique, primary, foreign), of relationship cardinality (e.g., 1:1 relationships), or of is-a relationships  Equivalent data types & constraint names (e.g., string = varchar, primary key = unique) can be provided by a special synonym table

24 Schema-level Matchers [11/14] 5. Constraint-based Approaches (cont’d)  Imperfect but helpful  The use of constraint information alone often leads to imperfect n:m matches  Still, it helps to limit the # of match candidates and may be combined with other matchers (e.g., name matchers)  Structural information  Can be interpreted as constraints  Intra-schema references (e.g., foreign keys), adjacency-related information (e.g., part-of- relationships)  Tells us which elements belong to the same higher-level schema element, transitively through multi-level structures  Top-down algorithm vs. bottom-up algorithm

25 Schema-level Matchers [12/14] 5. Constraint-based Approaches (cont’d)  Structural Information (cont’d)  Observing that  Components of S2.Personnel match components of both S1.Employee & S1.Department  S1.Employee & S1.Department are interconnected by foreign key DeptNo in Employee referencing Department  The n:m SQL-like match mapping  Some inferencing was needed to know that the join should be added

26 Schema-level Matchers [13/14] 6. Reusing schema & mapping information  Schema & mapping Information  The reuse of common schema components & previously determined mappings  Reuse-oriented approaches are promising since schemas often are very similar to each other & to previously matched schemas  E.g., in E-commerce, substructures often repeat within different message formats (e.g., address fields & name fields)  The reuse of not only globally defined names but also entire schema fragments  Including data types, keys, & constraints  Defined & maintained in a schema library

27 Schema-level Matchers [14/14] 6. Reusing schema & mapping information  The reuse of existing mappings  Previously determined element-level matches  Entire structures, which is useful  When matching different but similar schemas to the same destination schema  When integrating new sources into a data warehouse or digital library

28 Instance-level Approaches [1/2]  Instance-level matching can be valuable  When useful schema information is limited  When uncovering incorrect interpretations of schema information  Especially applicable to  A linguistic characterization based on information retrieval techniques for text elements  E.g., extracting keywords and themes based on the relative frequencies of words  A constraint-based characterization for more structured data such as numerical & string elements  E.g., numerical value ranges & averages or character patterns

29 Instance-level Approaches [2/2]  Main benefit of evaluating instances  A precise characterization of the actual contents of schema elements  To enhance schema level matchers  To perform instance-level matching on its own  The per-instance match results need to be merged and abstracted to the schema level  To generate a ranked list of match candidates

30 Combining Different Matchers [1/2]  A matcher that combines several approaches is likely to achieve many good match candidates  Two Ways  Hybrid matchers  Integrates multiple matching criteria  Composite matchers  Combine the results of independently executed matchers  Hybrid Matchers  Directly combine several matching approaches to determine match candidates  Based on multiple criteria or information sources  Reduces the # of passes over the schema  Better match candidates + better performance  Combine structure- with element-level matching  Using one algorithm to generate a partial mapping & the other to complete the mapping

31 Combining Different Matchers [2/2]  Composite Matchers  Combine the results of several independently executed matchers  More flexible  Allows a selection from a repertoire of modular matchers based on application domain or schema language  Allows a flexible ordering of matchers  Selection of matchers & determining their execution order & the combination of independently determined match results  Automatically by the implementation of Match itself or its clients  Manually by a human user  But, user interaction is necessary in any case  The implementation of Match can only determine match candidates which a user can accept, reject or change

32 Sample Approaches from the Literature  Prototype Schema Matchers  SemInt (Northwestern Univ.)  LSD (Univ. ofWashington)  SKAT (Stanford Univ.)  TransScm (Tel Aviv Univ.)  DIKE (Univ. of Reggio Calabria, Univ. of Calabria)  ARTEMIS (Univ. of Milano, Univ. of Brescia) & MOMIS (Univ. of Modena and Reggio Emilia)  Cupid (Microsoft Research)  Related Prototypes  Clio (IBM Almaden and Univ. of Toronto)  Similarity flooding (Stanford Univ. and Univ. of Leipzig)  Delta (MITRE)  Tess (Univ. of Massachusetts, Amherst)  Tree matching (NYU)

33

34 Conclusion  Past Work on Schema Matching  Has mostly been done in the context of a particular application domain  The problem is so fundamental  A need to treat it as an independent problem  Future Work  On the relative performance & accuracy of different approaches